I am trying to convert camel case to snake case.
Like this:
"LiveKarma" -> "live_karma"
"youGO" -> "you_g_o"
I cannot get the second example to work like that. It always outputs 'you_go'. How can I get it to output 'you_g_o'?
My code:
(Regex.Replace(line, "(?<=[a-z0-9])[A-Z]", "_$0", RegexOptions.Compiled)).ToLowerInvariant()
Here is an extension method that converts the text to snake case:
using System;
using System.Text;
public static string ToSnakeCase(this string text)
{
if(text == null) {
throw new ArgumentNullException(nameof(text));
}
if(text.Length < 2) {
// a single upper-case character still has to be lower-cased
return text.ToLowerInvariant();
}
var sb = new StringBuilder();
sb.Append(char.ToLowerInvariant(text[0]));
for(int i = 1; i < text.Length; ++i) {
char c = text[i];
if(char.IsUpper(c)) {
sb.Append('_');
sb.Append(char.ToLowerInvariant(c));
} else {
sb.Append(c);
}
}
return sb.ToString();
}
Put it into a static class somewhere (named for example StringExtensions) and use it like this:
string text = "LiveKarma";
string snakeCaseText = text.ToSnakeCase();
// snakeCaseText => "live_karma"
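For comparison with the regex in the question: the lookbehind (?<=[a-z0-9]) only matches an upper-case letter that follows a lower-case letter or a digit, which is why the O in youGO never gets its own underscore. Below is a minimal Python sketch of the same rule as the loop above, assuming every upper-case ASCII letter except the first character should start a new word. The name to_snake_case is only illustrative, and unlike char.IsUpper this pattern ignores non-ASCII upper-case letters.

```python
import re

def to_snake_case(text):
    # Insert "_" at every position that is followed by an upper-case ASCII
    # letter, except at the very start of the string, then lower-case it all.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", text).lower()

print(to_snake_case("LiveKarma"))
print(to_snake_case("youGO"))
```

Because the pattern matches a position rather than a character, consecutive capitals are each split off, which is the behaviour the question asks for.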
Since the option that converts abbreviations as separate words is not suitable for many, I found a complete solution in the EF Core codebase.
Here are a couple of examples of how the code works:
TestSC -> test_sc
testSC -> test_sc
TestSnakeCase -> test_snake_case
testSnakeCase -> test_snake_case
TestSnakeCase123 -> test_snake_case123
_testSnakeCase123 -> _test_snake_case123
test_SC -> test_sc
I rewrote it a bit so you can copy it as a ready-to-use string extension:
using System;
using System.Globalization;
using System.Text;
namespace Extensions
{
public static class StringExtensions
{
public static string ToSnakeCase(this string text)
{
if (string.IsNullOrEmpty(text))
{
return text;
}
var builder = new StringBuilder(text.Length + Math.Min(2, text.Length / 5));
var previousCategory = default(UnicodeCategory?);
for (var currentIndex = 0; currentIndex < text.Length; currentIndex++)
{
var currentChar = text[currentIndex];
if (currentChar == '_')
{
builder.Append('_');
previousCategory = null;
continue;
}
var currentCategory = char.GetUnicodeCategory(currentChar);
switch (currentCategory)
{
case UnicodeCategory.UppercaseLetter:
case UnicodeCategory.TitlecaseLetter:
if (previousCategory == UnicodeCategory.SpaceSeparator ||
previousCategory == UnicodeCategory.LowercaseLetter ||
previousCategory != UnicodeCategory.DecimalDigitNumber &&
previousCategory != null &&
currentIndex > 0 &&
currentIndex + 1 < text.Length &&
char.IsLower(text[currentIndex + 1]))
{
builder.Append('_');
}
currentChar = char.ToLower(currentChar, CultureInfo.InvariantCulture);
break;
case UnicodeCategory.LowercaseLetter:
case UnicodeCategory.DecimalDigitNumber:
if (previousCategory == UnicodeCategory.SpaceSeparator)
{
builder.Append('_');
}
break;
default:
if (previousCategory != null)
{
previousCategory = UnicodeCategory.SpaceSeparator;
}
continue;
}
builder.Append(currentChar);
previousCategory = currentCategory;
}
return builder.ToString();
}
}
}
You can find the original code here:
https://github.com/efcore/EFCore.NamingConventions/blob/main/EFCore.NamingConventions/Internal/SnakeCaseNameRewriter.cs
UPD 27.04.2022:
You can also use the Newtonsoft.Json library if you're looking for a ready-to-use third-party solution. Its output is the same as that of the code above.
// using Newtonsoft.Json.Serialization;
var snakeCaseStrategy = new SnakeCaseNamingStrategy();
var snakeCaseResult = snakeCaseStrategy.GetPropertyName(text, false);
Simple LINQ-based solution; no idea whether it's faster or not. It basically ignores consecutive upper-case letters (requires using System.Linq):
public static string ToUnderscoreCase(this string str)
=> string.Concat((str ?? string.Empty).Select((x, i) => i > 0 && i < str.Length - 1 && char.IsUpper(x) && !char.IsUpper(str[i-1]) ? $"_{x}" : x.ToString())).ToLower();
Using the Newtonsoft.Json package:
public static string? ToCamelCase(this string? str) => str is null
? null
: new DefaultContractResolver() { NamingStrategy = new CamelCaseNamingStrategy() }.GetResolvedPropertyName(str);
public static string? ToSnakeCase(this string? str) => str is null
? null
: new DefaultContractResolver() { NamingStrategy = new SnakeCaseNamingStrategy() }.GetResolvedPropertyName(str);
RegEx Solution
A quick internet search turned up this site which has an answer using RegEx, which I had to modify to grab the Value portion in order for it to work on my machine (but it has the RegEx you're looking for). I also modified it to handle null input, rather than throwing an exception:
public static string ToSnakeCase2(string str)
{
var pattern =
new Regex(@"[A-Z]{2,}(?=[A-Z][a-z]+[0-9]*|\b)|[A-Z]?[a-z]+[0-9]*|[A-Z]|[0-9]+");
return str == null
? null
: string
.Join("_", pattern.Matches(str).Cast<Match>().Select(m => m.Value))
.ToLower();
}
Non-RegEx Solution
For a non-regex solution, we can do the following:
1. Collapse each run of whitespace into a single '_' by
   - using string.Split with an empty array as the first parameter, which splits on all whitespace
   - joining those parts back together with the '_' character
2. Prefix all upper-case characters with '_' and lower-case them
3. Split and re-join the resulting string on the _ character to remove any instances of multiple consecutive underscores ("__") and to remove any leading or trailing instances of the character.
For example:
public static string ToSnakeCase(string str)
{
return str == null
? null
: string.Join("_", string.Concat(string.Join("_", str.Split(new char[] {},
StringSplitOptions.RemoveEmptyEntries))
.Select(c => char.IsUpper(c)
? $"_{c}".ToLower()
: $"{c}"))
.Split(new[] {'_'}, StringSplitOptions.RemoveEmptyEntries));
}
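The steps listed above can also be followed one at a time. This is a rough Python sketch of the same idea, with each step on its own line; the function and variable names are illustrative and not taken from the answer.

```python
def to_snake_case(text):
    if text is None:
        return None
    # Step 1: split on any whitespace (empty parts are dropped) and re-join with "_".
    joined = "_".join(text.split())
    # Step 2: prefix every upper-case character with "_" and lower-case it.
    prefixed = "".join("_" + c.lower() if c.isupper() else c for c in joined)
    # Step 3: split on "_" and re-join, which removes doubled, leading
    # and trailing underscores.
    return "_".join(part for part in prefixed.split("_") if part)

print(to_snake_case("LiveKarma"))
print(to_snake_case("  Hello   World "))
```

Note that step 3 also strips a leading underscore that was part of the input, so a name like _privateField loses its prefix with this approach.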
Rough sketch below. In essence, check whether each char is upper case; if it is, add a _, then add the char in lower case.
// assumes s is a non-empty string
var newString = s.Substring(0, 1).ToLower();
foreach (char c in s.Substring(1))
{
if (char.IsUpper(c))
{
newString = newString + "_";
}
newString = newString + char.ToLower(c);
}
If you're into micro-optimizations and want to prevent unnecessary conversions wherever possible, this one might also work:
public static string ToSnakeCase(this string text)
{
static IEnumerable<char> Convert(CharEnumerator e)
{
if (!e.MoveNext()) yield break;
yield return char.ToLower(e.Current);
while (e.MoveNext())
{
if (char.IsUpper(e.Current))
{
yield return '_';
yield return char.ToLower(e.Current);
}
else
{
yield return e.Current;
}
}
}
return new string(Convert(text.GetEnumerator()).ToArray());
}
There is a well maintained EF Core community project that implements a number of naming convention rewriters called EFCore.NamingConventions. The rewriters don't have any internal dependencies, so if you don't want to bring in an EF Core related package you can just copy the rewriter code out.
Here is the snake case rewriter: https://github.com/efcore/EFCore.NamingConventions/blob/main/EFCore.NamingConventions/Internal/SnakeCaseNameRewriter.cs
Might as well toss this one out there. Very simple, and it worked for me.
public static string ToSnakeCase(this string text)
{
text = Regex.Replace(text, "(.)([A-Z][a-z]+)", "$1_$2");
text = Regex.Replace(text, "([a-z0-9])([A-Z])", "$1_$2");
return text.ToLower();
}
Testing it with some samples (borrowed from @GeekInside's answer):
var samples = new List<string>() { "TestSC", "testSC", "TestSnakeCase", "testSnakeCase", "TestSnakeCase123", "_testSnakeCase123", "test_SC" };
var results = new List<string>() { "test_sc", "test_sc", "test_snake_case", "test_snake_case", "test_snake_case123", "_test_snake_case123", "test_sc" };
for (int i = 0; i < samples.Count; i++)
{
Console.WriteLine("Test success: " + (samples[i].ToSnakeCase() == results[i] ? "true" : "false"));
}
Produced the following output:
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true
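To make the two passes easier to follow, here is a minimal Python sketch of the same two substitutions, checked against the same samples. The first pass puts an underscore in front of a capitalized word (an upper-case letter followed by lower-case letters) whenever some character precedes it; the second pass splits any remaining boundary between a lower-case letter or digit and an upper-case letter. The function name is illustrative.

```python
import re

def to_snake_case(text):
    # Pass 1: "xWord" -> "x_Word", so an acronym is split from the word after it.
    text = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", text)
    # Pass 2: "aB" or "1B" -> "a_B" / "1_B" for boundaries pass 1 did not reach.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)
    return text.lower()

samples = ["TestSC", "testSC", "TestSnakeCase", "testSnakeCase",
           "TestSnakeCase123", "_testSnakeCase123", "test_SC"]
for sample in samples:
    print(sample, "->", to_snake_case(sample))
```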
I have a string like
a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h
After splitting by ',' I expect to get 5 items; the commas inside braces or brackets should be ignored.
a
[1,2,3,{4,5},6]
b
{c,d,[e,f],g}
h
There is no whitespace in the string. Is there a regular expression that can make this happen?
You could use this:
var input = "a,[1,2,3,{4,5}],b,{c,d,[e,f]},g";
var result =
(from Match m in Regex.Matches(input, @"\[[^]]*]|\{[^}]*}|[^,]+")
select m.Value)
.ToArray();
This will find any matches like:
[ followed by any characters other than ], then terminated by ]
{ followed by any characters other than }, then terminated by }
One or more characters other than ,
This will work for your sample input, but it cannot handle nested groups like [1,[2,3],4] or {1,{2,3},4}. For that, I'd recommend something a bit more powerful than regular expressions. Since you've mentioned in your comments that you're trying to parse JSON, I'd recommend you check out the excellent Json.NET library.
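The limitation is easy to see by running the same pattern on both kinds of input. The snippet below is a small Python illustration (the helper name split_flat is made up for this example): the first call splits the sample as expected, while the second cuts the nested list at the first closing bracket.

```python
import re

# Same three alternatives as above: a [...] group, a {...} group,
# or a run of characters that contains no comma.
PATTERN = re.compile(r"\[[^]]*]|\{[^}]*}|[^,]+")

def split_flat(text):
    return PATTERN.findall(text)

print(split_flat("a,[1,2,3,{4,5}],b,{c,d,[e,f]},g"))
print(split_flat("[1,[2,3],4]"))
```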
Regular expressions (*) cannot be used to parse nested structures (**).
(*) True regular expressions without non-regular extensions
(**) Nested structures of arbitrary depth and interleaving
But parsing by hand is not that difficult. First you need to find the , that are not in brackets or braces.
string input = "a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h";
var delimiterPositions = new List<int>();
int bracesDepth = 0;
int bracketsDepth = 0;
for (int i = 0; i < input.Length; i++)
{
switch (input[i])
{
case '{':
bracesDepth++;
break;
case '}':
bracesDepth--;
break;
case '[':
bracketsDepth++;
break;
case ']':
bracketsDepth--;
break;
default:
if (bracesDepth == 0 && bracketsDepth == 0 && input[i] == ',')
{
delimiterPositions.Add(i);
}
break;
}
}
And then split the string at these positions.
public List<string> SplitAtPositions(string input, List<int> delimiterPositions)
{
var output = new List<string>();
for (int i = 0; i < delimiterPositions.Count; i++)
{
int index = i == 0 ? 0 : delimiterPositions[i - 1] + 1;
int length = delimiterPositions[i] - index;
string s = input.Substring(index, length);
output.Add(s);
}
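// With no delimiters at all, the whole input is the single item.
int lastStart = delimiterPositions.Count == 0 ? 0 : delimiterPositions[delimiterPositions.Count - 1] + 1;
string lastString = input.Substring(lastStart);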
output.Add(lastString);
return output;
}
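The two stages above (find the top-level commas, then cut the string) can also be folded into a single pass. This is a minimal Python sketch under one simplifying assumption that is mine, not the answer's: it keeps one shared depth counter for brackets and braces and does not check that they are balanced or correctly paired. The name split_top_level is illustrative.

```python
def split_top_level(text, delimiter=","):
    openers = "[{"
    closers = "]}"
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in openers:
            depth += 1
        elif ch in closers:
            depth -= 1
        if ch == delimiter and depth == 0:
            # A delimiter outside every group ends the current item.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

print(split_top_level("a,[1,2,3,{4,5},6],b,{c,d,[e,f],g},h"))
```

Because the depth is counted rather than matched by a pattern, arbitrarily deep nesting such as [1,[2,3],4] stays in one piece.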
Even if it looks ugly and there is no regex involved (not sure if it's a requirement or a nice-to-have in the original question), this alternative should work:
class Program
{
static void Main(string[] args)
{
var input = "a,[1,2,3,{4,5}],b,{c,d,[e,f]},g";
var output = "<root><n>" +
input.Replace(",", "</n><n>")
.Replace("[", "<n1><n>")
.Replace("]", "</n></n1>")
.Replace("{", "<n2><n>")
.Replace("}", "</n></n2>") +
"</n></root>";
var elements = XDocument
.Parse(output, LoadOptions.None)
.Root.Elements()
.Select(e =>
{
if (!e.HasElements)
return e.Value;
else
{
return e.ToString()
.Replace(" ", "")
.Replace("\r\n", "")
.Replace("</n><n>", ",")
.Replace("<n1>", "[")
.Replace("</n1>", "]")
.Replace("<n2>", "{")
.Replace("</n2>", "}")
.Replace("<n>", "")
.Replace("</n>", "")
.Replace("\r\n", "")
;
}
}).ToList();
}
}
I am writing an RPN (reverse Polish notation) converter, following a structogram.
Newest problem: it still doesn't work correctly.
If the input string is "5 + ((1 + 2) * 4) - 3"
my output is:
5 1 2 + 4 * 3 - +
The result I should get is:
5 1 2 + 4 * + 3 -
Edited the source.
(What follows was the original problem. The answers helped, and these original mistakes are now fixed.)
While debugging, when the loop reaches
int i = 12, the value of c is 0 '\0' or something like that,
and this value is added to the output string (named formula) as a '(' bracket. I don't know why.
Also, the last '-' operator is not added to (or does not show up at) the end of the output string (formula).
I suspect this problem is caused by the '('.
I tried the program with other input strings, but it always puts a '(' into my string, and I don't know why. I saw that it does not depend on the number of brackets: exactly one '(' is always added to my string.
Yes, in English LengyelFormula = RPN (it is Hungarian).
static void Main(string[] args)
{
String str = "5 + ( ( 1 + 2 ) * 4 ) −3";
String result=LengyelFormaKonvertalas(str);
Console.WriteLine(result.ToString());
Console.ReadLine();
}
static String LengyelFormaKonvertalas(String input) // this is the rpn method
{
Stack stack = new Stack();
String str = input.Replace(" ",string.Empty);
StringBuilder formula = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
char x=str[i];
if (x == '(')
stack.Push(x);
else if (IsOperandus(x)) // is it operand
{
formula.Append(x);
}
else if (IsOperator(x)) // is it operation
{
if (stack.Count>0 && (char)stack.Peek()!='(' && Prior(x)<=Prior((char)stack.Peek()) )
{
char y = (char)stack.Pop();
formula.Append(y);
}
if (stack.Count > 0 && (char)stack.Peek() != '(' && Prior(x) < Prior((char)stack.Peek()))
{
char y = (char)stack.Pop();
formula.Append(y);
}
stack.Push(x);
}
else
{
char y=(char)stack.Pop();
if (y!='(')
{
formula.Append(y);
}
}
}
while (stack.Count>0)
{
char c = (char)stack.Pop();
formula.Append(c);
}
return formula.ToString();
}
static bool IsOperator(char c)
{
return (c=='-'|| c=='+' || c=='*' || c=='/');
}
static bool IsOperandus(char c)
{
return (c>='0' && c<='9' || c=='.');
}
static int Prior(char c)
{
switch (c)
{
case '=':
return 1;
case '+':
return 2;
case '-':
return 2;
case '*':
return 3;
case '/':
return 3;
case '^':
return 4;
default:
throw new ArgumentException("Rossz paraméter");
}
}
}
using System;
using System.Collections.Generic;
using System.Text;
class Sample {
static void Main(string[] args){
String str = "5 + ( ( 1 + 2 ) * 4 ) -3";
String result=LengyelFormaKonvertalas(str);
Console.WriteLine(result);
Console.ReadLine();
}
static String LengyelFormaKonvertalas(String input){
Stack<char> stack = new Stack<char>();
String str = input.Replace(" ", string.Empty);
StringBuilder formula = new StringBuilder();
for (int i = 0; i < str.Length; i++){
char x=str[i];
if (x == '(')
stack.Push(x);
else if (x == ')'){
while(stack.Count>0 && stack.Peek() != '(')
formula.Append(stack.Pop());
stack.Pop();
} else if (IsOperandus(x)){
formula.Append(x);
} else if (IsOperator(x)) {
while(stack.Count>0 && stack.Peek() != '(' && Prior(x)<=Prior(stack.Peek()) )
formula.Append(stack.Pop());
stack.Push(x);
}
else {
char y= stack.Pop();
if (y!='(')
formula.Append(y);
}
}
while (stack.Count>0) {
formula.Append(stack.Pop());
}
return formula.ToString();
}
static bool IsOperator(char c){
return (c=='-'|| c=='+' || c=='*' || c=='/');
}
static bool IsOperandus(char c){
return (c>='0' && c<='9' || c=='.');
}
static int Prior(char c){
switch (c){
case '=':
return 1;
case '+':
return 2;
case '-':
return 2;
case '*':
return 3;
case '/':
return 3;
case '^':
return 4;
default:
throw new ArgumentException("Rossz parameter");
}
}
}
In IsOperator, you check c == '-'.
But in the string, you write −3.
− isn't the same character as -
I don't know about Polish stuff so maybe I'm missing something, but that's why no '-' operator is printed, it fails the IsOperator check and goes into the else clause, which doesn't add it to formula.
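The difference is invisible in most fonts but easy to confirm by looking at the code points. A small Python sketch; normalize_minus is a hypothetical helper, shown only as one possible way to guard against input copied from formatted text.

```python
import unicodedata

# The ASCII hyphen-minus and the typographic minus sign are different characters.
for ch in ("-", "\u2212"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

def normalize_minus(expression):
    # Replace the typographic minus sign with the ASCII hyphen-minus
    # before the expression is handed to the operator check.
    return expression.replace("\u2212", "-")

print(normalize_minus("5 + ( ( 1 + 2 ) * 4 ) \u22123"))
```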
When you get a ), you should pop all operators and add them to your formula until you reach a (, and pop that '(' as well.
When you get an operator x, you should pop operators from the stack and add them to the formula only while the operator on top of the stack has a priority greater than or equal to that of x. Your second check is redundant because it is already covered by the first.
As a general rule: try your program with some simple inputs like 1+2+3, 1+2-3, 1*2+3 and 1+2*3 and see if you get the right result. Testing systematically like that should help you find errors faster.
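The rules above, together with the suggested simple inputs, fit into a short sketch. This Python version is only an illustration under a few assumptions of mine: operands are single characters (digits or '.'), all four operators are left-associative, and the priorities mirror the Prior function from the question. The names to_rpn and PRIORITY are made up for the example.

```python
PRIORITY = {"+": 2, "-": 2, "*": 3, "/": 3}

def to_rpn(expression):
    output, stack = [], []
    for ch in expression.replace(" ", ""):
        if ch in "0123456789.":
            output.append(ch)
        elif ch == "(":
            stack.append(ch)
        elif ch == ")":
            # Pop operators until the matching "(" and discard the "(" itself.
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif ch in PRIORITY:
            # Pop while the operator on top binds at least as tightly as ch.
            while stack and stack[-1] != "(" and PRIORITY[stack[-1]] >= PRIORITY[ch]:
                output.append(stack.pop())
            stack.append(ch)
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return " ".join(output)

for test in ("1+2+3", "1+2-3", "1*2+3", "1+2*3", "5 + ((1 + 2) * 4) - 3"):
    print(test, "->", to_rpn(test))
```

Raising an error on an unknown character, instead of silently popping the stack, makes a stray character such as a typographic minus sign show up immediately.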
In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?
using System ;
using System.Text.RegularExpressions ;
class Example
{
public static void Main ( )
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
Regex regex = new Regex ( "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)" ) ;
Match match = regex.Match ( myString ) ;
int j = 0 ;
while ( match.Success )
{
Console.WriteLine ( j++ + " \t" + match ) ;
match = match.NextMatch() ;
}
}
}
Output (in part) appears as follows:
0 cat
1 dog
2 "0 = OFF
3 1 = ON"
4 lion
5 tiger
6 'R = red
7 G = green
8 B = blue'
9 bear
However, desired output is:
0 cat
1 dog
2 0 = OFF, 1 = ON
3 lion
4 tiger
5 R = red, G = green, B = blue
6 bear
Try with this Regex:
"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*
Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
Match matchResults = regexObj.Match(input);
while (matchResults.Success)
{
Console.WriteLine(matchResults.Value);
matchResults = matchResults.NextMatch();
}
Outputs:
cat
dog
"0 = OFF, 1 = ON"
lion
tiger
'R = red, G = green, B = blue'
bear
Note: This regex solution will work for your case; however, I recommend using a specialized library like FileHelpers.
Why not heed the advice from the experts: don't roll your own CSV parser.
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.
It's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for CSV files. Yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey, it works.
Ah, RegEx. Now you have two problems. ;)
I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.
This works, for example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
class Program
{
static void Main(string[] args)
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";
Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
CsvParser parser = new CsvParser(myString);
Int32 lineNumber = 0;
foreach (string s in parser)
{
Console.WriteLine(lineNumber++ + ": " + s);
}
Console.ReadKey();
}
}
internal enum TokenType
{
Comma,
Quote,
Value
}
internal class Token
{
public Token(TokenType type, string value)
{
Value = value;
Type = type;
}
public String Value { get; private set; }
public TokenType Type { get; private set; }
}
internal class StreamTokenizer : IEnumerable<Token>
{
private TextReader _reader;
public StreamTokenizer(TextReader reader)
{
_reader = reader;
}
public IEnumerator<Token> GetEnumerator()
{
String line;
StringBuilder value = new StringBuilder();
while ((line = _reader.ReadLine()) != null)
{
foreach (Char c in line)
{
switch (c)
{
case '\'':
case '"':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Quote, c.ToString());
break;
case ',':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Comma, c.ToString());
break;
default:
value.Append(c);
break;
}
}
// Thanks, dpan
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
}
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
internal class CsvParser : IEnumerable<String>
{
private StreamTokenizer _tokenizer;
public CsvParser(Stream data)
{
_tokenizer = new StreamTokenizer(new StreamReader(data));
}
public CsvParser(String data)
{
_tokenizer = new StreamTokenizer(new StringReader(data));
}
public IEnumerator<string> GetEnumerator()
{
Boolean inQuote = false;
StringBuilder result = new StringBuilder();
foreach (Token token in _tokenizer)
{
switch (token.Type)
{
case TokenType.Comma:
if (inQuote)
{
result.Append(token.Value);
}
else
{
yield return result.ToString();
result.Length = 0;
}
break;
case TokenType.Quote:
// Toggle quote state
inQuote = !inQuote;
break;
case TokenType.Value:
result.Append(token.Value);
break;
default:
throw new InvalidOperationException("Unknown token type: " + token.Type);
}
}
if (result.Length > 0)
{
yield return result.ToString();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.
CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.
As such you should probably just use someone else's CSV parser. I recommend CSVReader for .NET.
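To show what the state machine has to track for the cases listed above (quoted quotes, commas within quotes, new lines within quotes, empty fields), here is a minimal Python sketch. It is an illustration only, not a substitute for a tested library: it assumes double quotes as the only quote character, a doubled quote as the escape, and "\n" as the record separator, and it does not treat "\r" specially. The name parse_csv is made up for the example.

```python
def parse_csv(text, delimiter=",", quote='"'):
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                field.append(ch)         # commas and new lines are data here
        elif ch == quote:
            in_quotes = True
        elif ch == delimiter:
            row.append("".join(field))   # may be an empty field
            field = []
        elif ch == "\n":
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    if field or row:
        # Flush the last record when the input does not end with a new line.
        row.append("".join(field))
        rows.append(row)
    return rows

print(parse_csv('2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D'))
```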
Function:
private List<string> ParseDelimitedString (string arguments, char delim = ',')
{
bool inQuotes = false;
bool inNonQuotes = false; //used to trim leading WhiteSpace
List<string> strings = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in arguments)
{
if (c == '\'' || c == '"')
{
if (!inQuotes)
inQuotes = true;
else
inQuotes = false;
}else if (c == delim)
{
if (!inQuotes)
{
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (char.IsWhiteSpace(c) && !inQuotes && !inNonQuotes)
{
// skip leading whitespace in front of an unquoted value
}
else
{
inNonQuotes = true;
sb.Append(c);
}
}
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
return strings;
}
Usage
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear, text";
List<string> strings = ParseDelimitedString(myString);
foreach( string s in strings )
Console.WriteLine( s );
Output:
cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text
I found a few bugs in that version, for example, a non-quoted string that has a single quote in the value.
And I agree: use the FileHelpers library when you can. However, that library requires you to know what your data will look like... I need a generic parser.
So I've updated the code to the following and thought I'd share...
static public List<string> ParseDelimitedString(string value, char delimiter)
{
bool inQuotes = false;
bool inNonQuotes = false;
bool secondQuote = false;
char curQuote = '\0';
List<string> results = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in value)
{
if (inNonQuotes)
{
// then quotes are just characters
if (c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (inQuotes)
{
// then quotes need to be double escaped
if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
{
if (secondQuote)
{
secondQuote = false;
sb.Append(c);
}
else
secondQuote = true;
}
else if (secondQuote && c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inQuotes = false;
}
else if (!secondQuote)
{
sb.Append(c);
}
else
{
// bad,as,"user entered something like"this,poorly escaped,value
// just ignore until second delimiter found
}
}
else
{
// not yet parsing a field
if (c == '\'' || c == '"')
{
curQuote = c;
inQuotes = true;
inNonQuotes = false;
secondQuote = false;
}
else if (c == delimiter)
{
// blank field
inQuotes = false;
inNonQuotes = false;
results.Add(string.Empty);
}
else
{
inQuotes = false;
inNonQuotes = true;
sb.Append(c);
}
}
}
if (inQuotes || inNonQuotes)
results.Add(sb.ToString());
return results;
}
Since the question "Regex to to parse csv with nested quotes" points here, and this one is much more generic, and since a regex is not really the proper way to solve this problem (I have had many issues with catastrophic backtracking, see http://www.regular-expressions.info/catastrophic.html),
here is a simple parser implementation in Python as well:
def csv_to_array(string):
    stack = []
    match = []
    matches = []
    for c in string:
        # do we have a double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)
    # flush the last field when the input does not end with a newline
    if match:
        matches.append("".join(match))
    return matches
I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here's the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C#) would be, basically:
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
    # states
    parsing = 0
    escaped = 1
    state = parsing
    found = []
    parsed = ""
    for c in input:
        if state == parsing:
            if c == delim:
                found.append(parsed)
                parsed = ""
            elif c == escape:
                state = escaped
            else:
                parsed += c
        else: # state == escaped
            parsed += c
            state = parsing
    if parsed:
        found.append(parsed)
    return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of an FSM is fairly straightforward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.append(TAB)).
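The table above translates almost line by line into code. This is a rough Python sketch with a few assumptions of mine: ',' and '/' are the default delimiter and escape character, an empty token is never emitted at the end of input, and a dangling escape at the end of the input is dropped silently. The special mapping is a hypothetical hook for the ESC(t) kind of extension; none of these names come from the answer.

```python
START, NORM, ESC = "START", "NORM", "ESC"

def tokenize(text, delimiter=",", escape="/", special=None):
    special = special or {}          # e.g. {"t": "\t"} for an ESC(t) rule
    tokens, token = [], []
    state = START
    for ch in text:
        if ch in "\n\0":             # *(\n\0): stop, the newline cannot be escaped
            break
        if state == ESC:             # ESC(*): take the character literally
            token.append(special.get(ch, ch))
            state = NORM
        elif ch == escape:           # START/NORM(ESCAPE): not added to the token
            state = ESC
        elif ch == delimiter:
            if state == NORM:        # NORM(DELIMITER): emit the token
                tokens.append("".join(token))
                token = []
            state = START            # START(DELIMITER): ignored
        else:                        # START/NORM(*): ordinary character
            token.append(ch)
            state = NORM
    if token:
        tokens.append("".join(token))
    return tokens

print(tokenize(",,a,b//c,,d/,e"))
print(tokenize("////////,"))
```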
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.