C# split comma separated values - c#

How can I split comma separated strings with quoted strings that can also contain commas?
Example input:
John, Doe, "Sid, Nency", Smith
Expected output:
John
Doe
Sid, Nency
Smith
Splitting by commas worked, but now I have a requirement that strings like "Sid, Nency" are allowed. I tried using a regex to split such values. The regex ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" comes from a Java question, and it does not work well in my .NET code: it duplicates some strings, returns extra results, and so on.
So what is the best way to split such strings?

It's because of the capture group. Just turn it into a non-capture group:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
      ^^
The capture group is including the captured part in your results.
var regexObj = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
regexObj.Split(input).Select(s => s.Trim('"', ' ')).ToList().ForEach(Console.WriteLine);
And just trim the results.
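To see the effect of the capture group, here is a minimal sketch written as a C# 9+ top-level program. The helper name SplitAndTrim and the two constants are made up for this illustration. It relies on the documented behavior of Regex.Split, which adds text captured by capturing parentheses to the returned array.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: split on the given pattern, then trim spaces and quotes.
static string[] SplitAndTrim(string pattern, string line)
{
    return new Regex(pattern).Split(line)
                             .Select(field => field.Trim(' ', '"'))
                             .ToArray();
}

// Same lookahead twice; only the group type differs.
const string Capturing = @",(?=([^""]*""[^""]*"")*[^""]*$)";
const string NonCapturing = @",(?=(?:[^""]*""[^""]*"")*[^""]*$)";

string input = "John, Doe, \"Sid, Nency\", Smith";

// With the capturing group, captured text is mixed into the result,
// so the array has more entries than there are fields.
Console.WriteLine(SplitAndTrim(Capturing, input).Length);

// With the non-capturing group, only the fields come back.
foreach (string field in SplitAndTrim(NonCapturing, input))
    Console.WriteLine(field);
```

The lookahead only accepts a comma when an even number of double quotes follows it, so this approach assumes that quotes in the input are balanced.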

Just walk through the string, keeping track of whether you are
inside a quoted "block". If you are, don't treat a comma as a
separator; otherwise do. It's a simple algorithm, and I would write
it myself. The first " you encounter opens a block, the next "
closes it, and so on, so a single pass through the string is enough.
Here is a version in Java:
import java.util.ArrayList;
public class Test003 {
public static void main(String[] args) {
String s = " John, , , , \" Barry, John \" , , , , , Doe, \"Sid , Nency\", Smith ";
StringBuilder term = new StringBuilder();
boolean inQuote = false;
boolean inTerm = false;
ArrayList<String> terms = new ArrayList<String>();
for (int i=0; i<s.length(); i++){
char ch = s.charAt(i);
if (ch == ' '){
if (inQuote){
if (!inTerm) {
inTerm = true;
}
term.append(ch);
}
else {
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else if (ch== '"'){
term.append(ch); // comment this out if you don't need it
if (!inTerm){
inTerm = true;
}
inQuote = !inQuote;
}else if (ch == ','){
if (inQuote){
if (!inTerm){
inTerm = true;
}
term.append(ch);
}else{
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else{
if (!inTerm){
inTerm = true;
}
term.append(ch);
}
}
if (inTerm){
terms.add(term.toString());
}
for (String t : terms){
System.out.println("|" + t + "|");
}
}
}

I use the following code within my Csv Parser class to achieve this:
private string[] ParseLine(string line)
{
List<string> results = new List<string>();
bool inQuotes = false;
int index = 0;
StringBuilder currentValue = new StringBuilder(line.Length);
while (index < line.Length)
{
char c = line[index];
switch (c)
{
case '\"':
{
inQuotes = !inQuotes;
break;
}
default:
{
if (c == ',' && !inQuotes)
{
results.Add(currentValue.ToString());
currentValue.Clear();
}
else
currentValue.Append(c);
break;
}
}
++index;
}
results.Add(currentValue.ToString());
return results.ToArray();
} // eo ParseLine

If you find the regular expression too complex you can do it like this:
string initialString = "John, Doe, \"Sid, Nency\", Smith";
IEnumerable<string> splitted = initialString.Split('"');
splitted = splitted.SelectMany((str, index) => index % 2 == 0 ? str.Split(',') : new[] { str });
splitted = splitted.Where(str => !string.IsNullOrWhiteSpace(str)).Select(str => str.Trim());

Related

c# get the first ';' after parentheses

I feel dumb for asking what is most likely a silly question.
I am helping someone get the results he wants from his custom compiler, which reads all lines of an XML file into one string. Since he wants it to support referencing variables inside an array, the worst-case input would look like this:
"Var1 = [5,4,3,2]; Var2 = [2,8,6,Var1;4];"
What I need is to find the first ";" that comes after the "[" and "]" and split there, so that I end up with this:
"Var1 = [5,4,3,2];
It also has to support multiple "[" and "]", for example:
"Var2 = [5,Var1,[4],2];"
EDIT: There may also be data between the last "]" and the ";".
For example:
"Var2 = [5,[4],2]Var1;
What can I do here? I'm kind of stuck.
You can try regular expressions, e.g.
string source = "Var1 = [5,4,3,2]; Var2 = [2,8,6,Var1;4];";
// 1. final (or the only) chunk doesn't necessarily contain '];':
// "abc" -> "abc"
// 2. chunk has at least one symbol except '];'
string pattern = ".+?(][a-zA-Z0-9]*;|$)";
var items = Regex
.Matches(source, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Console.Write(string.Join(Environment.NewLine, items));
Outcome:
Var1 = [5,4,3,2]abc123;
Var2 = [2,8,6,Var1;4];
^([^;]+);
This regex should work for all.
You can use it like here:
string[] lines =
{
"Var1 = [5,4,3,2]; Var2 = [2,8,6,Var1;4];",
"Var2 = [5,[4],2]Var1; Var2 = [2,8,6,Var1;4];"
};
Regex pattern = new Regex(@"^([^;]+);");
foreach (string s in lines){
Match match = pattern.Match(s);
if (match.Success)
{
Console.WriteLine(match.Value);
}
}
The explanation is:
^ anchors the match at the start of the string
[^;]+ matches anything but a semicolon, repeated one or more times
; matches the semicolon that follows
This will find Var1 = [5,4,3,2]; as well as Var2 = [5,[4],2]Var1;
public static string Extract(string str, char splitOn)
{
var split = false;
var count = 0;
var bracketCount = 0;
foreach (char c in str)
{
count++;
if (split && c == splitOn)
return str.Substring(0, count);
if (c == '[')
{
bracketCount++;
split = false;
}
else if (c == ']')
{
bracketCount--;
if (bracketCount == 0)
{
split = true;
}
else if (bracketCount < 0)
throw new FormatException(); //?
}
}
return str;
}

Split a string, but ignoring delimit in brackets or braces

I have a string like
a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h
After splitting by , I expect to get 5 items; any , inside braces or brackets should be ignored.
a
[1,2,3,{4,5},6]
b
{c,d,[e,f],g}
h
There is no whitespace in the string. Is there a regular expression that can make this happen?
You could use this:
var input = "a,[1,2,3,{4,5}],b,{c,d,[e,f]},g";
var result =
(from Match m in Regex.Matches(input, @"\[[^]]*]|\{[^}]*}|[^,]+")
select m.Value)
.ToArray();
This will find any matches like:
[ followed by any characters other than ], then terminated by ]
{ followed by any characters other than }, then terminated by }
One or more characters other than ,
This will work for your sample input, but it cannot handle nested groups like [1,[2,3],4] or {1,{2,3},4}. For that, I'd recommend something a bit more powerful than regular expressions. Since you've mentioned in your comments that you're trying to parse JSON, I'd recommend you check out the excellent Json.NET library.
Regular expressions (*) cannot be used to parse nested structures (**).
(*) True regular expressions without non-regular extensions
(**) Nested structures of arbitrary depth and interleaving
But parsing by hand is not that difficult. First you need to find the , that are not in brackets or braces.
string input = "a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h";
var delimiterPositions = new List<int>();
int bracesDepth = 0;
int bracketsDepth = 0;
for (int i = 0; i < input.Length; i++)
{
switch (input[i])
{
case '{':
bracesDepth++;
break;
case '}':
bracesDepth--;
break;
case '[':
bracketsDepth++;
break;
case ']':
bracketsDepth--;
break;
default:
if (bracesDepth == 0 && bracketsDepth == 0 && input[i] == ',')
{
delimiterPositions.Add(i);
}
break;
}
}
And then split the string at these positions.
public List<string> SplitAtPositions(string input, List<int> delimiterPositions)
{
var output = new List<string>();
for (int i = 0; i < delimiterPositions.Count; i++)
{
int index = i == 0 ? 0 : delimiterPositions[i - 1] + 1;
int length = delimiterPositions[i] - index;
string s = input.Substring(index, length);
output.Add(s);
}
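// Start after the last delimiter, or at 0 when the input has no delimiter at all.
int lastStart = delimiterPositions.Count == 0 ? 0 : delimiterPositions[delimiterPositions.Count - 1] + 1;
string lastString = input.Substring(lastStart);
output.Add(lastString);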
return output;
}
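The two steps above can also be folded into one pass. The following is only a sketch under a few assumptions: the name SplitTopLevel is invented here, a single counter stands in for the separate brace and bracket depths, and the input is assumed to be well balanced (nothing is validated). It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split on commas that sit at nesting depth zero.
static List<string> SplitTopLevel(string input)
{
    var parts = new List<string>();
    int depth = 0;  // combined depth of [] and {}
    int start = 0;  // index where the current item begins

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c == '[' || c == '{')
        {
            depth++;
        }
        else if (c == ']' || c == '}')
        {
            depth--;
        }
        else if (c == ',' && depth == 0)
        {
            parts.Add(input.Substring(start, i - start));
            start = i + 1;
        }
    }

    // The last item, or the whole string when there was no top-level comma.
    parts.Add(input.Substring(start));
    return parts;
}

foreach (string part in SplitTopLevel("a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h"))
    Console.WriteLine(part);
```

Using one counter keeps the code short, but it means a mismatched pair such as [1,2} is not detected. Keep the two separate depths from the answer if you need that check.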
Even if it looks ugly and there is no regex involved (not sure if it's a requirement or a nice-to-have in the original question), this alternative should work:
class Program
{
static void Main(string[] args)
{
var input = "a,[1,2,3,{4,5}],b,{c,d,[e,f]},g";
var output = "<root><n>" +
input.Replace(",", "</n><n>")
.Replace("[", "<n1><n>")
.Replace("]", "</n></n1>")
.Replace("{", "<n2><n>")
.Replace("}", "</n></n2>") +
"</n></root>";
var elements = XDocument
.Parse(output, LoadOptions.None)
.Root.Elements()
.Select(e =>
{
if (!e.HasElements)
return e.Value;
else
{
return e.ToString()
.Replace(" ", "")
.Replace("\r\n", "")
.Replace("</n><n>", ",")
.Replace("<n1>", "[")
.Replace("</n1>", "]")
.Replace("<n2>", "{")
.Replace("</n2>", "}")
.Replace("<n>", "")
.Replace("</n>", "")
.Replace("\r\n", "")
;
}
}).ToList();
}
}

How to split a space-delimited list of paths where paths can include spaces in .NET 2?

For instance:
c:\dir1 c:\dir2 "c:\my files" c:\code "old photos" "new photos"
Should be read as a list:
c:\dir1
c:\dir2
c:\my files
c:\code
old photos
new photos
I can write a function which parses the string linearly but wondered if the .NET 2.0 toolbox has any cool tricks one could use?
Since you have to hit every character I think a brute force is going to give you the best performance.
That way you hit every character exactly once.
And it limits the number of comparisons performed.
static void Main(string[] args)
{
string input = @"c:\dir1 c:\dir2 ""c:\my files"" c:\code ""old photos"" ""new photos""";
List<string> splitInput = MySplit(input);
foreach (string s in splitInput)
{
System.Diagnostics.Debug.WriteLine(s);
}
System.Diagnostics.Debug.WriteLine(input);
}
public static List<string> MySplit(string input)
{
List<string> split = new List<string>();
StringBuilder sb = new StringBuilder();
bool splitOnQuote = false;
char quote = '"';
char space = ' ';
foreach (char c in input.ToCharArray())
{
if (splitOnQuote)
{
if (c == quote)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Clear();
}
splitOnQuote = false;
}
else { sb.Append(c); }
}
else
{
if (c == space)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Clear();
}
}
else if (c == quote)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Clear();
}
splitOnQuote = true;
}
else { sb.Append(c); }
}
}
if (sb.Length > 0) split.Add(sb.ToString());
return split;
}
Usually for this type of problem one could develop a regular expression to parse out the fields. ( "(.*?)" ) would give you all the string values in quotes. You could strip all those values from your string, and then do a simple split on space after all the quoted items are out.
static void Main(string[] args)
{
string myString = "\"test\" test1 \"test2 test3\" test4 test6 \"test5\"";
string myRegularExpression = @"""(.*?)""";
List<string> listOfMatches = new List<string>();
myString = Regex.Replace(myString, myRegularExpression, delegate(Match match)
{
string v = match.ToString();
listOfMatches.Add(v);
return "";
});
var array = myString.Split(' ');
foreach (string s in array)
{
if(s.Trim().Length > 0)
listOfMatches.Add(s);
}
foreach (string match in listOfMatches)
{
Console.WriteLine(match);
}
Console.Read();
}
Unfortunately, I don't think there is any sort of C# kungfu that makes it much simpler. I should add that obviously, this algorithm gives you the items out of order... so if that matters... this isn't a good solution.
Here's a regex-only solution which captures both space-delimited and quoted paths. Quoted paths are stripped of the quotes, multiple spaces don't cause empty list entries. Edge case of mixing a quoted path with a non-quoted path without intervening space is interpreted as multiple entries.
It can be optimized by disabling captures for unused groups but I opted for more readability instead.
static Regex re = new Regex(@"^([ ]*((?<r>[^ ""]+)|[""](?<r>[^""]*)[""]))*[ ]*$");
public static IEnumerable<string> RegexSplit(string input)
{
var m = re.Match(input ?? "");
if(!m.Success)
throw new ArgumentException("Malformed input.");
return from Capture capture in m.Groups["r"].Captures select capture.Value;
}
Assuming that a space acts as a delimiter except when enclosed in quotes (to allow paths to contain spaces), I'd recommend the following algorithm:
ignore_space = false;
i = 0;
list_of_breaks=[];
while(i < input_length)
{
if(charat(i) is a space and ignore_space is false)
{
add i to list_of_breaks;
}
else if(charat(i) is a quote)
{
ignore_space = ! ignore_space
}
i = i + 1;
}
split the input at the indices listed in list_of_breaks
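The pseudocode above could be turned into C# roughly as follows. This is a sketch, not a tested library routine: the name SplitPaths is invented, quotes are stripped from each piece after splitting, and empty pieces caused by repeated spaces are dropped, which the pseudocode leaves unspecified. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper following the pseudocode: record break positions, then cut.
static List<string> SplitPaths(string input)
{
    List<int> breaks = new List<int>();
    bool ignoreSpace = false;

    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == ' ' && !ignoreSpace)
        {
            breaks.Add(i);               // a space outside quotes ends a piece
        }
        else if (input[i] == '"')
        {
            ignoreSpace = !ignoreSpace;  // entering or leaving a quoted path
        }
    }
    breaks.Add(input.Length);            // sentinel so the last piece is emitted

    List<string> result = new List<string>();
    int start = 0;
    foreach (int end in breaks)
    {
        string piece = input.Substring(start, end - start).Trim('"');
        if (piece.Length > 0)
        {
            result.Add(piece);           // skip empty pieces from repeated spaces
        }
        start = end + 1;
    }
    return result;
}

string sample = "c:\\dir1 c:\\dir2 \"c:\\my files\" c:\\code \"old photos\" \"new photos\"";
foreach (string path in SplitPaths(sample))
    Console.WriteLine(path);
```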

regex split with exceptions

This is an extension to this SO question. This question considers two different enclosing characters, in contrast to the original question.
I would like to split by (white)spaces of any number but ignore everything between <> AND "". So this string:
string Line = "1 2 <1 2> \"hello world\" 3";
Should result in this:
1, 2, <1 2>, \"hello world\", 3
Instead of Split, I'll use Matches
string Line = "1 2 <1 2> \"hello world\" 3";
var parts = Regex.Matches(Line, @"[<\""]{1}[\w \d]+?[>\""]{1}|[\w\d]+")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
PS: This would also match "abc def>. But I ignored it to make the regex shorter
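If the mixed-delimiter match mentioned in the PS matters, one possible variation is to give each delimiter pair its own alternative. The pattern below is a sketch of that idea, not the answerer's regex. The helper name SplitOutsideDelimiters is invented, and the pattern assumes that a < or " opening a group always starts a token. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: match <...> groups, "..." groups, or runs of non-whitespace.
static string[] SplitOutsideDelimiters(string line)
{
    // The alternatives are tried left to right, so a complete <...> or "..."
    // group wins over the generic \S+ fallback.
    return Regex.Matches(line, @"<[^>]*>|""[^""]*""|\S+")
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
}

foreach (string part in SplitOutsideDelimiters("1 2 <1 2> \"hello world\" 3"))
    Console.WriteLine(part);
```

Because the fallback is \S+ rather than [\w\d]+, tokens containing punctuation are kept whole, and an unclosed < or " simply falls through to the fallback and is split at whitespace.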
This is what I came up with so far:
public static string[] GetSplitStrings(string input)
{
IList<string> splitStrings = new List<string>();
var counter = 0;
var sb = new StringBuilder();
var inLessGreater = false; // sometimes <> can contain "
foreach (var character in input)
{
if (character.Equals('<'))
{
inLessGreater = true;
counter++;
}
if (character.Equals('>'))
{
inLessGreater = false;
counter++;
}
if (character.Equals('"') && !inLessGreater)
{
counter++;
}
if ((character.Equals(' ') && counter == 0) || (counter == 2))
{
if (sb.ToString().Equals("") == false)
{
if (character.Equals('"') || character.Equals('>'))
{
sb.Append(character);
}
splitStrings.Add(sb.ToString());
}
sb.Clear();
counter = 0;
}
else
{
sb.Append(character);
}
}
return splitStrings.ToArray();
}
Would prefer a neat regex solution.

C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?
using System ;
using System.Text.RegularExpressions ;
class Example
{
public static void Main ( )
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
Regex regex = new Regex ( "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)" ) ;
Match match = regex.Match ( myString ) ;
int j = 0 ;
while ( match.Success )
{
Console.WriteLine ( j++ + " \t" + match ) ;
match = match.NextMatch() ;
}
}
}
Output (in part) appears as follows:
0 cat
1 dog
2 "0 = OFF
3 1 = ON"
4 lion
5 tiger
6 'R = red
7 G = green
8 B = blue'
9 bear
However, desired output is:
0 cat
1 dog
2 0 = OFF, 1 = ON
3 lion
4 tiger
5 R = red, G = green, B = blue
6 bear
Try with this Regex:
"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*
Regex regexObj = new Regex(#"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
Match matchResults = regexObj.Match(input);
while (matchResults.Success)
{
Console.WriteLine(matchResults.Value);
matchResults = matchResults.NextMatch();
}
Outputs:
cat
dog
"0 = OFF, 1 = ON"
lion
tiger
'R = red, G = green, B = blue'
bear
Note: This regex solution will work for your case, however I recommend you to use a specialized library like FileHelpers.
Why not heed the advice from the experts: don't roll your own CSV parser.
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.
It's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for CSV files. Yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey, it works.
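For reference, a minimal sketch of how TextFieldParser might be used on an in-memory string. The helper name ParseWithTextFieldParser is made up, and the project is assumed to reference the Microsoft.VisualBasic assembly. As far as I know the parser only treats double quotes as field delimiters, so the single-quoted value from the question would still be split. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical helper: read every row of comma-delimited text.
static List<string[]> ParseWithTextFieldParser(string text)
{
    var rows = new List<string[]>();
    using (var parser = new TextFieldParser(new StringReader(text)))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true; // commas inside "..." stay in the field

        while (!parser.EndOfData)
        {
            rows.Add(parser.ReadFields());
        }
    }
    return rows;
}

foreach (string field in ParseWithTextFieldParser("cat,dog,\"0 = OFF, 1 = ON\",lion")[0])
    Console.WriteLine(field);
```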
Ah, RegEx. Now you have two problems. ;)
I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.
This works, for example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
class Program
{
static void Main(string[] args)
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";
Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
CsvParser parser = new CsvParser(myString);
Int32 lineNumber = 0;
foreach (string s in parser)
{
Console.WriteLine(lineNumber++ + ": " + s);
}
Console.ReadKey();
}
}
internal enum TokenType
{
Comma,
Quote,
Value
}
internal class Token
{
public Token(TokenType type, string value)
{
Value = value;
Type = type;
}
public String Value { get; private set; }
public TokenType Type { get; private set; }
}
internal class StreamTokenizer : IEnumerable<Token>
{
private TextReader _reader;
public StreamTokenizer(TextReader reader)
{
_reader = reader;
}
public IEnumerator<Token> GetEnumerator()
{
String line;
StringBuilder value = new StringBuilder();
while ((line = _reader.ReadLine()) != null)
{
foreach (Char c in line)
{
switch (c)
{
case '\'':
case '"':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Quote, c.ToString());
break;
case ',':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Comma, c.ToString());
break;
default:
value.Append(c);
break;
}
}
// Thanks, dpan
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0; // reset so the next line does not repeat this value
}
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
internal class CsvParser : IEnumerable<String>
{
private StreamTokenizer _tokenizer;
public CsvParser(Stream data)
{
_tokenizer = new StreamTokenizer(new StreamReader(data));
}
public CsvParser(String data)
{
_tokenizer = new StreamTokenizer(new StringReader(data));
}
public IEnumerator<string> GetEnumerator()
{
Boolean inQuote = false;
StringBuilder result = new StringBuilder();
foreach (Token token in _tokenizer)
{
switch (token.Type)
{
case TokenType.Comma:
if (inQuote)
{
result.Append(token.Value);
}
else
{
yield return result.ToString();
result.Length = 0;
}
break;
case TokenType.Quote:
// Toggle quote state
inQuote = !inQuote;
break;
case TokenType.Value:
result.Append(token.Value);
break;
default:
throw new InvalidOperationException("Unknown token type: " + token.Type);
}
}
if (result.Length > 0)
{
yield return result.ToString();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.
CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.
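To make the state-machine point concrete, here is a small sketch that covers the four cases listed above: quoted quotes, commas within quotes, new lines within quotes, and empty fields. It is an illustration under assumptions, not a replacement for a tested library. The name ParseCsvText is invented, only double quotes are treated as quote characters, and a carriage return outside quotes is simply dropped. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: parse CSV text into records with a two-state machine.
static List<List<string>> ParseCsvText(string text)
{
    var records = new List<List<string>>();
    var record = new List<string>();
    var field = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (inQuotes)
        {
            if (c != '"')
            {
                field.Append(c);          // commas and newlines are plain data here
            }
            else if (i + 1 < text.Length && text[i + 1] == '"')
            {
                field.Append('"');        // "" inside quotes is an escaped quote
                i++;
            }
            else
            {
                inQuotes = false;         // closing quote
            }
        }
        else if (c == '"')
        {
            inQuotes = true;
        }
        else if (c == ',')
        {
            record.Add(field.ToString()); // may be an empty field
            field.Clear();
        }
        else if (c == '\n')
        {
            record.Add(field.ToString());
            field.Clear();
            records.Add(record);
            record = new List<string>();
        }
        else if (c != '\r')
        {
            field.Append(c);              // CR outside quotes is dropped (CRLF input)
        }
    }

    // Flush the last record unless the text ended with a newline.
    if (field.Length > 0 || record.Count > 0)
    {
        record.Add(field.ToString());
        records.Add(record);
    }
    return records;
}

foreach (string value in ParseCsvText("2,3.03,\"Hello, my name is \"\"Joshua\"\"\",A,B,C,,,D")[0])
    Console.WriteLine("[" + value + "]");
```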
As such you should probably just use someone else's CSV parser. I recommend CSVReader for .Net
Function:
private List<string> ParseDelimitedString (string arguments, char delim = ',')
{
bool inQuotes = false;
bool inNonQuotes = false; //used to trim leading WhiteSpace
List<string> strings = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in arguments)
{
if (c == '\'' || c == '"')
{
if (!inQuotes)
inQuotes = true;
else
inQuotes = false;
}else if (c == delim)
{
if (!inQuotes)
{
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else
{
// Skip leading whitespace of an unquoted field; keep every other character.
if (!inQuotes && !inNonQuotes)
{
if (char.IsWhiteSpace(c)) continue;
inNonQuotes = true;
}
sb.Append(c);
}
}
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
return strings;
}
Usage
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear, text";
List<string> strings = ParseDelimitedString(myString);
foreach( string s in strings )
Console.WriteLine( s );
Output:
cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text
I found a few bugs in that version, for example with a non-quoted string that has a single quote in its value.
And I agree: use the FileHelpers library when you can. However, that library requires you to know what your data will look like, and I need a generic parser.
So I've updated the code to the following and thought I'd share it:
static public List<string> ParseDelimitedString(string value, char delimiter)
{
bool inQuotes = false;
bool inNonQuotes = false;
bool secondQuote = false;
char curQuote = '\0';
List<string> results = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in value)
{
if (inNonQuotes)
{
// then quotes are just characters
if (c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (inQuotes)
{
// then quotes need to be double escaped
if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
{
if (secondQuote)
{
secondQuote = false;
sb.Append(c);
}
else
secondQuote = true;
}
else if (secondQuote && c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inQuotes = false;
}
else if (!secondQuote)
{
sb.Append(c);
}
else
{
// bad,as,"user entered something like"this,poorly escaped,value
// just ignore until second delimiter found
}
}
else
{
// not yet parsing a field
if (c == '\'' || c == '"')
{
curQuote = c;
inQuotes = true;
inNonQuotes = false;
secondQuote = false;
}
else if (c == delimiter)
{
// blank field
inQuotes = false;
inNonQuotes = false;
results.Add(string.Empty);
}
else
{
inQuotes = false;
inNonQuotes = true;
sb.Append(c);
}
}
}
if (inQuotes || inNonQuotes)
results.Add(sb.ToString());
return results;
}
Since the question "Regex to to parse csv with nested quotes" redirects here and this one is much more generic, and since a regex is not really the proper way to solve this problem (I have had many issues with catastrophic backtracking, see http://www.regular-expressions.info/catastrophic.html), here is a simple parser implementation in Python as well:
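def csv_to_array(string):
    stack = []    # currently open quote characters
    match = []    # characters of the field being built
    matches = []  # completed fields
    for c in string:
        # do we have a double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)
    # flush the last field when the input does not end with a newline
    if match:
        matches.append("".join(match))
    return matches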
