Regular Expression To Split On Comma Except If Quoted - c#

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:
max,emily,john = ["max", "emily", "john"]
BUT
max,"emily,kate",john = ["max", "emily,kate", "john"]
Looking to use in C#: Regex.Split(string, "PATTERN-HERE");
Thanks.

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.
You might try something like this instead:
public static IEnumerable<string> SplitCSV(string csvString)
{
var sb = new StringBuilder();
bool quoted = false;
foreach (char c in csvString) {
if (quoted) {
if (c == '"')
quoted = false;
else
sb.Append(c);
} else {
if (c == '"') {
quoted = true;
} else if (c == ',') {
yield return sb.ToString();
sb.Length = 0;
} else {
sb.Append(c);
}
}
}
if (quoted)
throw new ArgumentException("csvString", "Unterminated quotation mark.");
yield return sb.ToString();
}
It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.

This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():
You could use the regex (please don't!)
(?<=^(?:[^"]*"[^"]*")*[^"]*) # assert that there is an even number of quotes before...
\s*,\s* # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$) # as well as after the comma.
if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves becoming part of the match.
This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.

A little late maybe but I hope I can help someone else
String[] cols = Regex.Split("max, emily, john", #"\s*,\s*");
foreach ( String s in cols ) {
Console.WriteLine(s);
}

Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.
Here's our simple regex:
"[^"]*"|(,)
The left side of the alternation matches complete "quoted strings" tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.
This program shows how to use the regex (see the results at the bottom of the online demo):
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
string s1 = #"max,""emily,kate"",john";
var myRegex = new Regex(#"""[^""]*""|(,)");
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return "SplitHere";
});
string[] splits = Regex.Split(replaced,"SplitHere");
foreach (string split in splits) Console.WriteLine(split);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

Related

Find pattern to solve regex in one step

I have a problem to find the pattern that solves the problem in onestep.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What i want to get is: Take up to 4x Text. If there are more than "4xText" take only the last sign.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this i will do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
For me it is not clear why i cant easily put the second pattern after the first pattern into one pattern. Is there something like an anchor that tells the engine to start the pattern from here like it would do if is would be the only pattern ?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} Match 0-3 times any char except $ followed by an optional $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
.NET regex demo | C# demo
Example
string pattern = #"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
result += (i <= 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also linq alternatives :
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
#"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3} - match Text literally, then match \d{1,3}, which is 1 up to three digits, \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate match, so takee first capturing group and concatenate with second, removing unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(#"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into three different capture groups in a regex. Then it extracts the captured text, composing the result based on the results.

Replace one character but not two in a string

I want to replace single occurrences of a character but not two in a string using C#.
For example, I want to replace & by an empty string but not when the ocurrence is &&. Another example, a&b&&c would become ab&&c after the replacement.
If I use a regex like &[^&], it will also match the character after the & and I don't want to replace it.
Another solution I found is to iterate over the string characters.
Do you know a cleaner solution to do that?
To only match one & (not preceded or followed by &), use look-arounds (?<!&) and (?!&):
(?<!&)&(?!&)
See regex demo
You tried to use a negated character class that still matches a character, and you need to use a look-ahead/look-behind to just check for some character absence/presence, without consuming it.
See regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u).
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt.
You can match both & and && (or any number of repetition) and only replace the single one with an empty string:
str = Regex.Replace(str, "&+", m => m.Value.Length == 1 ? "" : m.Value);
You can use this regex: #"(?<!&)&(?!&)"
var str = Regex.Replace("a&b&&c", #"(?<!&)&(?!&)", "");
Console.WriteLine(str); // ab&&c
You can go with this:
public static string replacement(string oldString, char charToRemove)
{
string newString = "";
bool found = false;
foreach (char c in oldString)
{
if (c == charToRemove && !found)
{
found = true;
continue;
}
newString += c;
}
return newString;
}
Which is as generic as possible
I would use something like this, which IMO should be better than using Regex:
public static class StringExtensions
{
public static string ReplaceFirst(this string source, char oldChar, char newChar)
{
if (string.IsNullOrEmpty(source)) return source;
int index = source.IndexOf(oldChar);
if (index < 0) return source;
var chars = source.ToCharArray();
chars[index] = newChar;
return new string(chars);
}
}
I'll contribute to this statement from the comments:
in this case, only the substring with odd number of '&' will be replaced by all the "&" except the last "&" . "&&&" would be "&&" and "&&&&" would be "&&&&"
This is a pretty neat solution using balancing groups (though I wouldn't call it particularly clean nor easy to read).
Code:
string str = "11&222&&333&&&44444&&&&55&&&&&";
str = Regex.Replace(str, "&((?:(?<2>&)(?<-2>&)?)*)", "$1$2");
Output:
11222&&333&&44444&&&&55&&&&
ideone demo
It always matches the first & (not captured).
If it's followed by an even number of &, they're matched and stored in $1. The second group is captured by the first of the pair, but then it's substracted by the second.
However, if there's there's an odd number of of &, the optional group (?<-2>&)? does not match, and the group is not substracted. Then, $2 will capture an extra &
For example, matching the subject "&&&&", the first char is consumed and it isn't captured (1). The second and third chars are matched, but $2 is substracted (2). For the last char, $2 is captured (3). The last 3 chars were stored in $1, and there's an extra & in $2.
Then, the substitution "$1$2" == "&&&&".

Detect Two Consecutive Single Quotes Inside Single Quotes

I'm struggling to get this regex pattern exactly right, and am open to other options outside of regex if someone has a better alternative.
The situation:
I'm basically looking to parse a T-SQL "in" clause against a text column in C#. So, I need to take a string value like this:
"'don''t', 'do', 'anything', 'stupid'"
And interpret that as a list of values (I'll take care of the double single quotes later):
"don''t"
"do"
"anything"
"stupid"
I have a regex that works for most cases, but I'm struggling to generalize it to the point where it will accept any character OR a doubled-up single quote inside my group: (?:')([a-z0-9\s(?:'(?='))]+)(?:')[,\w]*
I'm fairly experienced with regexes, but have rarely, if ever, found a need for look-arounds (so downgrade my assessment of my regex experience accordingly).
So, to put this another way, I'm wanting to take a string of comma-delimited values, each enclosed in single quotes but can contain doubled single quotes, and output each such value.
EDIT
Here's a non-working example with my current regex (my problem is I need to handle all characters in my grouping and stop when I encounter a single quote not followed by a second single quote):
"'don''t', 'do?', 'anything!', '#stupid$'"
If you still think about a regex-based solution, you can use the following regex:
'(?:''|[^'])*'
Or an "un-rolled" version suggested by #sln:
'[^']*(?:''[^']*)*'
See demo
It is fairly simple, it captures double single quotation marks OR anything that is not a single quotation mark. No need using any look-behinds or look-aheads. It does not take care of any escaped entities, but I do not see this requirement in your question.
Moreover, this regex will return matches that are easy to access and deal with:
var text = "'don''t', 'do', 'anything', 'stupid'";
var re = new Regex(#"'[^']*(?:''[^']*)*'"); // Updated thanks to #sln, previous (#"'(?:''|[^'])*'");
var match_values = re.Matches(text).Cast<Match>().Select(p => p.Value).ToList();
Output:
If you want to use the Capture Collection feature, you can grab them all in a
single pass.
# #"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+"""
"
\s*
(?:
'
( # (1 start)
[^']*
(?:
'' [^']*
)*
) # (1 end)
'
\s*
(?:
, \s*
| (?= " )
)
)+
"
C# code:
string strSrc = "\"'don''t', 'do', 'anything', 'stupid'\"";
Regex rx = new Regex(#"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+""");
Match srcMatch = rx.Match(strSrc);
if (srcMatch.Success)
{
CaptureCollection cc = srcMatch.Groups[1].Captures;
for (int i = 0; i < cc.Count; i++)
Console.WriteLine("{0} = '{1}'", i, cc[i].Value);
}
Output:
0 = 'don''t'
1 = 'do'
2 = 'anything'
3 = 'stupid'
Press any key to continue . . .
Why don't you split on ', ':
Regex regex = new Regex(#"'\s*,\s*'");
string[] substrings = regex.Split(str);
And then take care of the extra single quotes by Trimming
Looks to me like you're over-thinking the problem. A quoted string with an escaped quote looks just like two strings without escaped quotes, one right after the other (not even spaces between them).
(?:'[^']*')+
Of course, you'll have to remove the enclosing quotes, but you probably had to do some post-processing anyway, to unescape the escaped quotes.
Also note that I'm not trying to validate the input or work around possible errors; for example, I don't bother matching the commas between the strings. If the input is well formed, this regex should be all you need.
In the interest of maintainability, I decided against a regex and followed the advice of using a state machine. Here's the crux of my implementation:
string currentTerm = string.Empty;
State currentState = State.BetweenTerms;
foreach (char c in valueToParse)
{
switch (currentState)
{
// if between terms, only need to do something if we encounter a single quote, signalling to start a new term
// encloser is client-specified char to look for (e.g. ')
case State.BetweenTerms:
if (c == encloser)
{
currentState = State.InTerm;
}
break;
case State.InTerm:
if (c == encloser)
{
if (valueToParse.Length > index + 1 && valueToParse[index + 1] == encloser && valueToParse.Length > index + 2)
{
// if next character is also encloser then add it and move on
currentTerm += c;
}
else if (currentTerm.Length > 0 && currentTerm[currentTerm.Length - 1] != encloser)
{
// on an encloser and didn't just add encloser, so we are done
// converterFunc is a client-specified Func<string,T> to return terms in the specified type (to allow for converting to int, for example)
yield return converterFunc(currentTerm);
currentTerm = string.Empty;
currentState = State.BetweenTerms;
}
}
else
{
currentTerm += c;
}
break;
}
index++;
}
if (currentTerm.Length > 0)
{
yield return converterFunc(currentTerm);
}

C# Replace with regex

I'm new to VB, C#, and am struggling with regex. I think I've got the following code format to replace the regex match with blank space in my file.
EDIT: Per comments this code block has been changed.
var fileContents = System.IO.File.ReadAllText(#"C:\path\to\file.csv");
fileContents = fileContents.Replace(fileContents, #"regex", "");
regex = new Regex(pattern);
regex.Replace(filecontents, "");
System.IO.File.WriteAllText(#"C:\path\to\file.csv", fileContents);
My files are formatted like this:
"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,
So far, I have regex finding any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two. I'm testing regex in Expresso
(?<=,")([^"]+,[^"]+)(?=",)
I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?
SOLVED:
Combined [^"]+ with look behind/ahead:
(?<=,"[^"]+)(,)(?=[^"]+",)
FINAL EDIT:
Here's my final complete solution:
//read file contents
var fileContents = System.IO.File.ReadAllText(#"C:\path\to\file.csv");
//find all commas between double quotes
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+(?=\",)");
//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
//write result back to file
System.IO.File.WriteAllText(#"C:\path\to\file.csv", fileContents);
Figured it out by combining the [^"]+ with the look ahead ?= and look behind ?<= so that it finds strings beginning with ,"[anything that's not double quotes, one or more times] then has a comma, then ends with [anything that's not double quotes, one or more times]",
(?<=,"[^"]+)(,)(?=[^"]+",)
Try to parse out all your columns with this:
Regex regex = new Regex("(?<=\").*?(?=\")");
Then you can just do:
foreach(Match match in regex.Matches(filecontents))
{
fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",",string.Empty))
}
Might not be as fast but should work.
I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text.
This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.
I find keeping your regexes simple will pay benefits when you're trying to maintain them later.
Note: this is similar to the answer by #Florian, but this replace restricts itself to replacement only in the matched text.
string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedtext = regex.Replace(filecontents, m => m.ToString().Replace(",",""))
What you have there is an irregular language. This is because a comma can mean different things depending upon where it is in the text stream. Strangely Regular Expressions are designed to parse regular languages where a comma would mean the same thing regardless of where it is in the text stream. What you need for an irregular language is a parser. In fact Regular expressions are mostly used for tokenizing strings before they are entered into a parser.
While what you are trying to do can be done using regular expressions it is likely to be very slow. For example you can use the following (which will work even if the comma is the first or last character in the field). However every time it finds a comma it will have to scan backwards and forwards to check if it is between two quotation characters.
(?<=,"[^"]*),(?=[^"]*",)
Note also that their may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue but often in CSV files you can have quotation characters in the middle of fields where there may also be a comma. In these cases applications like MS Excel will typically double the quote up to show that it is not the end of the field. Like this:
"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,
In this case you are going to be out of luck with a regular expression.
Thankfully the code to deal with CSV files is very simple:
public static IList<string> ParseCSVLine(string csvLine)
{
List<string> result = new List<string>();
StringBuilder buffer = new StringBuilder();
bool inQuotes = false;
char lastChar = '\0';
foreach (char c in csvLine)
{
switch (c)
{
case '"':
if (inQuotes)
{
inQuotes = false;
}
else
{
if (lastChar == '"')
{
buffer.Append('"');
}
inQuotes = true;
}
break;
case ',':
if (inQuotes)
{
buffer.Append(',');
}
else
{
result.Add(buffer.ToString());
buffer.Clear();
}
break;
default:
buffer.Append(c);
break;
}
lastChar = c;
}
result.Add(buffer.ToString());
buffer.Clear();
return result;
}
PS. There are another couple of issues often run into with CSV files which the code I have given doesn't solve. First is what happens if a field has an end of line character in the middle of it? Second is how do you know what character encoding a CSV file is in? The former of these two issues is easy to solve by modifying my code slightly. The second however is near impossible to do without coming to some agreement with the person supplying the file to you.

Detecting "\" (backslash) using Regex

I have a C# Regex like
[\"\'\\/]+
that I want to use to evaluate and return error if certain special characters are found in a string.
My test string is:
\test
I have a call to this method to validate the string:
public static bool validateComments(string input, out string errorString)
{
errorString = null;
bool result;
result = !Regex.IsMatch(input, "[\"\'\\/]+"); // result is true if no match
// return an error if match
if (result == false)
errorString = "Comments cannot contain quotes (double or single) or slashes.";
return result;
}
However, I am unable to match the backslash. I have tried several tools such as regexpal and a VS2012 extension that both seem to match this regex just fine, but the C# code itself won't. I do realize that C# is escaping the string as it is coming in from a Javascript Ajax call, so is there another way to match this string?
It does match /test or 'test or "test, just not \test
The \ is used even by Regex(es). Try "[\"\'\\\\/]+" (so double escape the \)
Note that you could have #"[""'\\/]+" and perhaps it would be more readable :-) (by using the # the only character you have to escape is the ", by the use of a second "")
You don't really need the +, because in the end [...] means "one of", and it's enough for you.
Don't eat what you can't chew... Instead of regexes use
// result is true if no match
result = input.IndexOfAny(new[] { '"', '\'', '\\', '/' }) == -1;
I don't think anyone ever lost the work because he preferred IndexOf instead of a regex :-)
You can solve this by making the string verbatim like this #:
result = !Regex.IsMatch(input, #"[\""\'\\/]+");
Since backslashes are used as escapes inside regex themselves, I find it best to use verbatim strings when working with the regex library:
string input = #"\test";
bool result = !Regex.IsMatch(input, #"[""'\\]+");
// ^^
// You need to double the double-quotes when working with verbatim strings;
// All other characters, including backslashes, remain unchanged.
if (!result) {
Console.WriteLine("Comments cannot contain quotes (double or single) or slashes.");
}
The only issue with that is that you must double your double-quotes (which is ironically what you need to do in your case).
Demo on ideone.
For the trivial case, I am able to use regexhero.net for your test expression using the simple:
\\
to validate
\test
The code generated by RegExHero:
string strRegex = #"\\";
RegexOptions myRegexOptions = RegexOptions.IgnoreCase;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = #"\test";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}

Categories

Resources