Regular expression for pipe delimited and double quoted string

Regular expression for pipe delimited and double quoted string - c#

I have a string something like this:
"2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112"
I would like to split by pipe apart from anything wrapped in double quotes so I have something like (similar to how csv is done):
[0] => 2014-01-23 09:13:45
[1] => 10002112|TR0859657|25-DEC-2013>0000000000000001
[2] => 10002112
I would like to know if there is a regular expression that can do this?

I think you may need to write your own parser.
Yo will need:
custom collection to keep results
boolean flag to decide whether pipe is inside quotation or outside quotation marks
string (or StringBuilder) to keep current word
The idea is that you read string char by char. Each char is appended to the word. If there is a pipe outside quotation marks you add the word to your result collection. If there is a quote you switch a flag so you don't treat the pipe as a divider anymore but you append it as a part of the word. Then if there is another quotation you switch the flag back again. So next pipe will result in adding the whole word (with pipes within quotation marks) to the collection. I tested the code below on your example and it worked.
private static List<string> ParseLine(string yourString)
{
bool ignorePipe = false;
string word = string.Empty;
List<string> divided = new List<string>();
foreach (char c in yourString)
{
if (c == '|' &&
!ignorePipe)
{
divided.Add(word);
word = string.Empty;
}
else if (c == '"')
{
ignorePipe = !ignorePipe;
}
else
{
word += c;
}
}
divided.Add(word);
return divided;
}

How about this Regular Expression:
/((["|]).*\2)/g
Online Demo
It looks like it could be used as valid split expression.

I'm going to blatantly ignore the fact that you want a RegEx, because I think that making your own IEnumerable will be easier. Plus, you get instant access to Linq.
var line = "2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112";
var data = GetPartsFromLine(line).ToList();
private static IEnumerable<string> GetPartsFromLine(string line)
{
int position = -1;
while (position < line.Length)
{
position++;
if (line[position] == '"')
{
//go find the next "
int endQuote = line.IndexOf('"', position + 1);
yield return line.Substring(position + 1, endQuote - position - 1);
position = endQuote;
if (position < line.Length && line[position + 1] == '|')
{
position++;
}
}
else
{
//go find the next |
int pipe = line.IndexOf('|', position + 1);
if (pipe == -1)
{
//hit the end of the line
yield return line.Substring(position);
position = line.Length;
}
else
{
yield return line.Substring(position, pipe - position);
position = pipe;
}
}
}
}
This hasn't been fully tested, but it works with your example.

Related

Interpreting and formatting user input

I'm designing a command line interpreter for my software and need to be able to format user input. Currently I have a system which basically splits everything by spaces, the problem is that I need to not split anything inside double quotes.
As you can probably tell, my current implementation won't handle quoted paths very well.
This is my current interpreting and formatting logic (contained in a non static method which gets called when the user presses enter, in case anyone was wondering):
var command = ConsoleInput.Text;
ConsoleInput.Text = String.Empty;
string command_main = command.Split(new char[] { ' ' }).First();
string[] synatx = command.Split(new char[] { ' ' }).Skip(1).ToArray();
if (lCommands.ContainsKey(command_main))
{
Action<string[]> commandfunction;
lCommands.TryGetValue(command_main, out commandfunction);
commandfunction(synatx);
}
else
ConsoleOut($"Invalid Command - {command_main} {string.Join(" ", synatx)}");
I need quoted paths to be taken in as a single argument, instead of being split by spacing.
for example, (disclaimer: this is just an example and not actual code)
this is what I don't want: with an input of: "this is a test" and some more text it turns out to be something like this: syntax[0] = "this syntax[1] = is, and so on.
The expected outcome would be (what I want to happen): syntax[0] = "this is a test" syntax[1] = and syntax[2] = some, and so on.
I'm stuck here, anyone have a solution? Thank you.

Here's a solution. It's a hacked together state machine that handles quoted strings that may contain spaces. It throws away extraneous whitespace between arguments, and considers a doubled up double-quote as if it were a single double-quote (but without any special meaning; as if it were any other character).
public IEnumerable<string> ParseLine(string toParse)
{
var result = new List<string>();
bool inQuotedString = false;
bool parsingDoubleQuote = false;
bool inWhiteSpace = false;
int length = toParse.Length;
var argBuffer = new StringBuilder();
for (var index = 0; index < length; ++index)
{
//if looking ahead for a double quote succeeded, just add the quote to the current arguemnt
if (parsingDoubleQuote)
{
parsingDoubleQuote = false;
argBuffer.Append('"');
//and we are done with this character, so...
continue; //done with this character, time to just loop again
}
if (toParse[index] == '"')
{
inWhiteSpace = false;
//look ahead one character to see if there's a double quote
if (index < length - 1 && toParse[index + 1] == '"')
{
parsingDoubleQuote = true;
continue; //done with this character, time to just loop again
}
if (!inQuotedString)
{
inQuotedString = true;
continue; //done with this character, time to just loop again
}
else
{
//it's not a double quote, and we are in quotes string, so
inQuotedString = false;
//we don't add the buffer to the output args until a space or the end, so
continue; //done with this character, time to just loop again
}
}
//if we are here, there's no quote, so...
if (toParse[index] == ' ' || toParse[index] == '\t')
{
if (inQuotedString)
{
argBuffer.Append(toParse[index]);
continue; //done with this character, time to just loop again
}
if (inWhiteSpace)
{
//nothing to do
continue; //out of the for loop
}
else
{
inWhiteSpace = true;
if (argBuffer.Length > 0)
{
result.Add(argBuffer.ToString());
argBuffer.Clear();
continue; //done with this character, time to just loop again
}
}
}
else
{
inWhiteSpace = false;
//no quote, no space, so...
argBuffer.Append(toParse[index]);
continue; //done with this character, time to just loop again
}
} //end of for loop
if (argBuffer.Length > 0)
{
result.Add(argBuffer.ToString());
}
return result;
}
I've given it cursory testing - you'll want to test it harder

Don't split by escaped string - C# [duplicate]

This question already has answers here:
How can I Split(',') a string while ignore commas in between quotes?
(3 answers)
C# Regex Split - commas outside quotes
(7 answers)
Closed 5 years ago.
I need to split a csv file by comma apart from where the columns is between quote marks. However, what I have here does not seem to be achieving what I need and comma's in columns are being split into separate array items.
public List<string> GetData(string dataFile, int row)
{
try
{
var lines = File.ReadAllLines(dataFile).Select(a => a.Split(';'));
var csv = from line in lines select (from piece in line select piece.Split(',')).ToList();
var foo = csv.ToList();
var result = foo[row][0].ToList();
return result;
}
catch
{
return null;
}
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
public static string Escape(string s)
{
if (s.Contains(QUOTE))
s = s.Replace(QUOTE, ESCAPED_QUOTE);
if (s.IndexOfAny(CHARACTERS_THAT_MUST_BE_QUOTED) > -1)
s = QUOTE + s + QUOTE;
return s;
}
I am not sure where I can use my escape function in this case.
Example:
Degree,Graduate,08-Dec-17,Level 1,"Advanced, Maths"
The string Advanced, Maths are being split into two different array items which I don't want

You could use regex, linq or just loop through each character and use Booleans to figure out what the current behaviour should be. This question actually got me thinking, as I'd previously just looped through and acted on each character. Here is Linq way of breaking an entire csv document up, assuming the end of line can be found with ';':
private static void Main(string[] args)
{
string example = "\"Hello World, My name is Gumpy!\",20,male;My sister's name is Amy,29,female";
var result1 = example.Split(';')
.Select(s => s.Split('"')) // This will leave anything in abbreviation marks at odd numbers
.Select(sl => sl.Select((ss, index) => index % 2 == 0 ? ss.Split(',') : new string[] { ss })) // if it's an even number split by a comma
.Select(sl => sl.SelectMany(sc => sc));
Console.WriteLine("Press any key to continue.");
Console.ReadKey();
}

Not sure how this performes - but you can solve that with Linq.Aggregate like this:
using System;
using System.Linq;
using System.Collections.Generic;
public class Program
{
public static IEnumerable<string> SplitIt(
char[] splitters,
string text,
StringSplitOptions opt = StringSplitOptions.None)
{
bool inside = false;
var result = text.Aggregate(new List<string>(), (acc, c) =>
{
// this will check each char of your given text
// and accumulate it in the (empty starting) string list
// your splitting chars will lead to a new item put into
// the list if they are not inside. inside starst as false
// and is flipped anytime it hits a "
// at end we either return all that was parsed or only those
// that are neither null nor "" depending on given opt's
if (!acc.Any()) // nothing in yet
{
if (c != '"' && (!splitters.Contains(c) || inside))
acc.Add("" + c);
else if (c == '"')
inside = !inside;
else if (!inside && splitters.Contains(c)) // ",bla"
acc.Add(null);
return acc;
}
if (c != '"' && (!splitters.Contains(c) || inside))
acc[acc.Count - 1] = (acc[acc.Count - 1] ?? "") + c;
else if (c == '"')
inside = !inside;
else if (!inside && splitters.Contains(c)) // ",bla"
acc.Add(null);
return acc;
}
);
if (opt == StringSplitOptions.RemoveEmptyEntries)
return result.Where(r => !string.IsNullOrEmpty(r));
return result;
}
public static void Main()
{
var s = ",,Degree,Graduate,08-Dec-17,Level 1,\"Advanced, Maths\",,";
var spl = SplitIt(new[]{','}, s);
var spl2 = SplitIt(new[]{','}, s, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(string.Join("|", spl));
Console.WriteLine(string.Join("|", spl2));
}
}
Output:
|Degree|Graduate|08-Dec-17|Level 1|Advanced, Maths||
Degree|Graduate|08-Dec-17|Level 1|Advanced, Maths

The function gets comma separated fields within a string, excluding commas embedded in a quoted field
The assumptions
It should return empty fields ,,
There are no quotes within a quote field (as per the example)
The method
I uses a for loop with i as a place holder of the current field
It scans for the next comma or quote and if it finds a quote it scans for the next comma to create the field
It needed to be efficient otherwise we would use regex or Linq
The OP didn't want to use a CSV library
Note : There is no error checking, and scanning each character would be faster this was just easy to understand
Code
public List<string> GetFields(string line)
{
var list = new List<string>();
for (var i = 0; i < line.Length; i++)
{
var firstQuote = line.IndexOf('"', i);
var firstComma = line.IndexOf(',', i);
if (firstComma >= 0)
{
// first comma is before the first quote, then its just a standard field
if (firstComma < firstQuote || firstQuote == -1)
{
list.Add(line.Substring(i, firstComma - i));
i = firstComma;
continue;
}
// We have found quote so look for the next comma afterwards
var nextQuote = line.IndexOf('"', firstQuote + 1);
var nextComma = line.IndexOf(',', nextQuote + 1);
// if we found a comma, then we have found the end of this field
if (nextComma >= 0)
{
list.Add(line.Substring(i, nextComma - i));
i = nextComma;
continue;
}
}
list.Add(line.Substring(i)); // if were are here there are no more fields
break;
}
return list;
}
Tests 1
Degree,Graduate,08-Dec-17,Level 1,"Advanced, Maths",another
Degree
Graduate
08-Dec-17
Level 1
"Advanced, Maths"
another
Tests 2
,Degree,Graduate,08-Dec-17,\"asdasd\",Level 1,\"Advanced, Maths\",another
<Empty Line>
Degree
Graduate
08-Dec-17
"asdasd"
Level 1
"Advanced, Maths"
another

Remove a defined part from a string

Lets say I have this string:
string text = "Hi my name is <crazy> Bob";
I want to take away everything within the brackets so it turns out like this:
"Hi my name is Bob".
So for I've tried with this and I know I've been think wrong with the while-loop but I just can't figure it out.
public static string Remove(string text)
{
char[] result = new char[text.Length];
for (int i = 0; i < text.Length; i++ )
{
if (text[i] == '<')
{
while (text[i] != '>')
{
result[i] += text[i];
}
}
else
{
result[i] += text[i];
}
}
return result.ToString();
}

Try this Regex:
public static string Remove(string text)
{
return Regex.Replace(text, "<.*?>","");
}

Look at this loop:
while (text[i] != '>')
{
result[i] += text[i];
}
That will continue executing until the condition isn't met. Given that you're not changing text[i], it's never going to stop...
Additionally, you're calling ToString on a char[] which isn't going to do what you want, and even if it did you'd have left-over characters.
If you wanted to loop like this, I'd use a StringBuilder, and just keep track of whether you're "in" an angle bracket or not:
public static string RemoveAngleBracketedContent(string text)
{
var builder = new StringBuilder();
int depth = 0;
foreach (var character in text)
{
if (character == '<')
{
depth++;
}
else if (character == '>' && depth > 0)
{
depth--;
}
else if (depth == 0)
{
builder.Append(character);
}
}
return builder.ToString();
}
Alternatively, use a regular expression. It would be relatively tricky to get it to cope with nested angle brackets, but if you don't need that, it's really simple:
// You can reuse this every time
private static Regex AngleBracketPattern = new Regex("<[^>]*>");
...
text = AngleBracketPattern.Replace(text, "");
One last problem though - after removing the angle-bracketed-text from "Hi my name is <crazy> Bob" you actually get "Hi my name is Bob" - note the double space.

use
string text = "Hi my name is <crazy> Bob";
text = System.Text.RegularExpressions.Regex.Replace(text, "<.*?>",string.Empty);

I recommend regex.
public static string DoIt(string content, string from, string to)
{
string regex = $"(\\{from})(.*)(\\{to})";
return Regex.Replace(content, regex, "");
}

Special characters Regex

Hello I'm try to remove special characters from user inputs.
public void fd()
{
string output = "";
string input = Console.ReadLine();
char[] charArray = input.ToCharArray();
foreach (var item in charArray)
{
if (!Char.IsLetterOrDigit(item))
{
\\\CODE HERE }
}
output = new string(trimmedChars);
Console.WriteLine(output);
}
At the end I'm turning it back to a string. My code only removes one special character in the string. Does anyone have any suggestions on a easier way instead

You have a nice implementation, just consider using next code, which is only a bit shorter, but has a little bit higher abstractions
var input = " th#ere's! ";
Func<char, bool> isSpecialChar = ch => !char.IsLetter(ch) && !char.IsDigit(ch);
for (int i = 1; i < input.Length - 1; i++)
{
//if current character is a special symbol
if(isSpecialChar(input[i]))
{
//if previous or next character are special symbols
if(isSpecialChar(input[i-1]) || isSpecialChar(input[i+1]))
{
//remove that character
input = input.Remove(i, 1);
//decrease counter, since we removed one char
i--;
}
}
}
Console.WriteLine(input); //prints " th#ere's "
A new string would be created each time you would call Remove. Use a StringBuilder for a more memory-performant solution.

The problem with your code is that you are taking the data from charArray and putting the result in trimmedChars for each change that you make, so each change will ignore all previous changes and work with the original. At the end you only have the last change.
Another problem with the code is that you are using IndexOf to get the index of a character, but that will get the index of the first occurance of that character, not the index where you got that character. For example when you are at the second ! in the string "foo!bar!" you will get the index of the first one.
You don't need to turn the string into an array to work with the characters in the string. You can just loop through the index of the characters in the string.
Note that you should also check the value of the index when you are looking at the characters before and after, so that you don't try to look at characters that are outside the string.
public void fd() {
string input = Console.ReadLine();
int index = 0;
while (index < input.Length) {
if (!Char.IsLetterOrDigit(input, index) && ((index == 0 || !Char.IsLetterOrDigit(input, index - 1)) || (index == input.Length - 1 || !Char.IsLetterOrDigit(input, index + 1)))) {
input = input.Remove(index, 1);
} else {
index++;
}
}
Console.WriteLine(input);
}

Been awhile since I've hit on C#, but a reg ex might be helpful
string input = string.Format("{0}! ", Console.ReadLine());
Regex rgx = new Regex("(?i:[^a-z]?)[.](?i:[^a-z]?)");
string output = rgx.Replace(input, "$1$2");
The regex looks for a character with a non-alpha character on left or right and replaces it with nothing.

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here's the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C# would be, basically)
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.

A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
# states
parsing = 0
escaped = 1
state = parsing
found = []
parsed = ""
for c in input:
if state == parsing:
if c == delim:
found.append(parsed)
parsed = ""
elif c == escape:
state = escaped
else:
parsed += c
else: # state == escaped
parsed += c
state = parsing
if parsed:
found.append(parsed)
return found

void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}

private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}

The implementation of this kind of tokenizer in terms of a FSM is fairly straight forward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive excapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.appeand(TAB)).

Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.

Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.

You'ew looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regular expression for pipe delimited and double quoted string - c#

How about this Regular Expression: /((["|]).*\2)/g Online Demo It looks like it could be used as valid split expression.

Related

Interpreting and formatting user input

Don't split by escaped string - C# [duplicate]

Remove a defined part from a string

Special characters Regex

What is the best algorithm for arbitrary delimiter/escape character processing?

Categories

Resources