C# - Splitting on a pipe with an escaped pipe in the data?


I've got a pipe delimited file that I would like to split (I'm using C#). For example:
This|is|a|test
However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:
This|is|a|pip\|ed|test (this is a pip|ed test)
I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.

Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.
I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.
EDIT
In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.
public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();
    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;
        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }
        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;
        // Extract this word
        words.Add(s.Substring(start, pos - start));
        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}

This oughta do it:
string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");
The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though, I just hijacked the regular expression from this post and simplified it.
EDIT
In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathon Wood's implementation using the longer test string provided by the OP.
With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test, I instead created a new Regex instance with the expression above outside of my test loop and then called its Split method.
UPDATE
Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms)
The results were the same using the extended regular expression from the post I linked to.
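A minimal sketch of the split-then-unescape idea, for comparison. It deliberately uses a simpler lookbehind than the pattern above, `(?<!\\)\|`, and so assumes a backslash is never itself escaped in the data. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeSplitDemo
{
    // Splits on pipes that are not immediately preceded by a backslash,
    // then turns each escaped pipe (\|) back into a plain pipe.
    // Assumption: the data never contains an escaped backslash (\\).
    public static string[] SplitAndUnescape(string line)
    {
        return Regex.Split(line, @"(?<!\\)\|")
                    .Select(field => field.Replace(@"\|", "|"))
                    .ToArray();
    }

    public static void Main()
    {
        // Prints: This, is, a, pip|ed, test (one per line)
        foreach (string field in SplitAndUnescape(@"This|is|a|pip\|ed|test"))
            Console.WriteLine(field);
    }
}
```

Doing the unescape in the same pass avoids the placeholder-character trick from the question.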

I came across a similar scenario. In my case the number of unescaped pipes (not counting "\|") was fixed. This is how I handled it:
string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); // replace \| with a placeholder character that never occurs in the data
string[] sSplitString = sTempString.Split('|');
// string sFirstString = sSplitString[0].Replace("¬", "\\|"); // With a fixed number of fields, restore the pipe while copying each field.
/* Or you could use a loop to restore everything at once.
   Strings are immutable, so the result of Replace must be assigned back.
for (int i = 0; i < sSplitString.Length; i++)
{
sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/

Here is another solution.
One of the most beautiful things about programming is that there are several ways to solve the same problem:
string text = @"This|is|a|pip\|ed|test"; //The original text
string parsed = ""; //Where you will store the parsed string
bool flag = false;
foreach (var x in text.Split('|')) {
bool endsWithArroba = x.EndsWith(@"\");
parsed += flag ? "|" + x + " " : endsWithArroba ? x.Substring(0, x.Length-1) : x + " ";
flag = endsWithArroba;
}

Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, do your split, and then replace that character with "\|" again.
Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.
Of course, all this ignores what happens if you need an escaped backslash before a pipe, like "\\|".
Overall, I lean towards regex though.
Frankly, I prefer to use FileHelpers because, even though this isn't comma-delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.
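The escaped-backslash case mentioned above can be handled without regex by walking the string once. This is only a sketch under an assumption the question does not state: a backslash escapes whatever character follows it (so `\|` is a literal pipe and `\\` is a literal backslash). The names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EscapedPipeParser
{
    // Single pass over the input: a backslash makes the next character literal,
    // an unescaped pipe ends the current field.
    // Assumption: backslash escapes ANY following character.
    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '\\' && i + 1 < line.Length)
            {
                current.Append(line[++i]);   // keep the escaped character, drop the backslash
            }
            else if (c == '|')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);           // ordinary character (or a trailing lone backslash)
            }
        }
        fields.Add(current.ToString());      // last field has no closing pipe
        return fields;
    }

    public static void Main()
    {
        // The input a\\|b holds an escaped backslash followed by a real delimiter.
        foreach (string field in Parse(@"a\\|b"))
            Console.WriteLine(field);        // prints a\ and then b
    }
}
```

Unlike the split-and-rejoin idea, this never has to guess whether a backslash before a pipe was itself escaped.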

You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:
Escaping a pipe: \|
Escaping a backslash that you want interpreted literally.
Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive, escaped backslashes will always be even numbers of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:
/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/
Confusing, perhaps, but it should work. Explanation:
^ #The start of a line
(?:...
[^|\\] #A character other than | or \ OR
(?:\\{2}) #An escaped backslash (two \ characters); repeats give an even number OR
\\\| #A literal \ followed by a literal |
...)+ #Repeat the preceding at least once
(?:$|\|) #Either a literal | or the end of a line
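In .NET, one way to use the anchored pattern above is to read every repetition of the first capture group through `Group.Captures`. The sketch below assumes that approach; the wrapper class, the method name and the choice to throw on unparseable input are illustrative only. Fields come back still escaped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureFieldsDemo
{
    // Same pattern as above, without the /.../ delimiters.
    private static readonly Regex Pattern =
        new Regex(@"^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*");

    // Returns every repetition of capture group 1, i.e. one entry per field.
    public static List<string> Fields(string line)
    {
        Match m = Pattern.Match(line);
        // The outer group is optional, so the match always succeeds; checking the
        // length detects input the pattern could not consume (e.g. an empty field).
        if (m.Length != line.Length)
            throw new FormatException("Line could not be parsed completely.");

        var fields = new List<string>();
        foreach (Capture c in m.Groups[1].Captures)
            fields.Add(c.Value);
        return fields;
    }

    public static void Main()
    {
        foreach (string field in Fields(@"This|is|a|pip\|ed|test"))
            Console.WriteLine(field);   // the fourth field prints as pip\|ed
    }
}
```

Because each field needs at least one character, input with empty fields such as `a||b` is rejected here rather than split.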

Related

Regex to extract string between parentheses which also contains other parentheses

I've been trying to figure this out, but I don't think I understand Regex well enough to get to where I need to.
I have string that resemble these:
filename.txt(1)attribute, 2)attribute(s), more!)
otherfile.txt(abc, def)
Basically, a string that always starts with a filename, then has some text between parentheses. And I'm trying to extract that part which is between the main parentheses, but the text that's there can contain absolutely anything, even some more parentheses (it often does.)
Originally, there was a 'hacky' expression made like this:
/\(([^#]+)\)/g
And it worked, until we ran into a case where the input string contained a # and we were stuck. Obviously...
I can't change the way the strings are generated, it's always a filename, then some parentheses and something of unknown length and content inside.
I'm hoping for a simple Regex expression, since I need this to work in both C# and in Perl -- is such a thing possible? Or does this require something more complex, like its own parsing method?

Replace the [^#]+ exclusion in your regex with a dot, which matches any character, followed by the * quantifier, which repeats it zero or more times. You can also simplify the regex by dropping the capture group:
\(.*\)
Here is the explanation for the regular expression:
\( matches the character ( literally.
.* matches any character (except line terminators); the * quantifier repeats it between zero and unlimited times, as many times as possible, giving back as needed (greedy).
\) matches the character ) literally.
You can use regex101 to compose and debug your regular expressions.

Regex seems overkill to me in this case. Can be more reliably achieved using string manipulation methods.
int first = str.IndexOf("(");
int last = str.LastIndexOf(")");
if (first != -1 && last != -1)
{
string subString = str.Substring(first + 1, last - first - 1);
}
I've never used Perl, but I'll venture a guess that it has equivalent methods.
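The snippet above can be wrapped in a small helper, sketched here with made-up names. It returns null when there is no opening parenthesis or no closing one after it, which is an assumption about how missing parentheses should be reported.

```csharp
using System;

public static class ParenExtractor
{
    // Returns the text between the first '(' and the last ')',
    // or null when no such pair exists in that order.
    public static string Inner(string input)
    {
        int first = input.IndexOf('(');
        int last = input.LastIndexOf(')');
        if (first == -1 || last < first)
            return null;
        return input.Substring(first + 1, last - first - 1);
    }

    public static void Main()
    {
        Console.WriteLine(Inner("filename.txt(1)attribute, 2)attribute(s), more!)"));
        // prints: 1)attribute, 2)attribute(s), more!
        Console.WriteLine(Inner("otherfile.txt(abc, def)"));
        // prints: abc, def
    }
}
```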

Failure To Get Specific Text From Regex Group

My example works fine with greedy matching when I capture the whole string plus a group (in Groups[1] ONLY) enclosed in a single pair of single quotes.
But when the string contains multiple pairs of single quotes, the group only captures the text enclosed by the last pair, not the text between the first and last single quotes.
string val1 = "Content:abc'23'asad";
string val2 = "Content:'Scale['#13212']'ta";
Match match1 = Regex.Match(val1, @".*'(.*)'.*");
Match match2 = Regex.Match(val2, @".*'(.*)'.*");
if (match1.Success)
{
string value1 = match1.Value;
string GroupValue1 = match1.Groups[1].Value;
Console.WriteLine(value1);
Console.WriteLine(GroupValue1);
string value2 = match2.Value;
string GroupValue2 = match2.Groups[1].Value;
Console.WriteLine(value2);
Console.WriteLine(GroupValue2);
Console.ReadLine();
// using greedy For val1 i am getting perfect value for-
// value1--->Content:abc'23'asad
// GroupValue1--->23
// BUT using greedy for val2 I am getting the string enclosed by the last pair of single quotes-
// value2--->Content:'Scale['#13212']'ta
// GroupValue2---> ]
// But i want GroupValue2--->Scale['#13212']
}

The problem with your existing regex is that you are using too many greedy modifiers. That first one is going to grab everything it can until it runs into the second to last apostrophe in the string. That's why your end result of the second example is just the stuff within the last pair of quotes.
There are a few ways to approach this. The simplest way is to use Slai's suggestion - just a pattern to grab anything and everything within the most "apart" apostrophes available:
'(.*)'
A more explicitly defined approach would be to slightly tweak the pattern you are currently using. Just change the first greedy modifier into a lazy one:
.*?'(.*)'.*
Alternatively, you could change the dot in that first and last section to instead match every character other than an apostrophe:
[^']*'(.*)'[^']*
Which one you end up using depends on what you're personally going after. One thing of note, though, is that according to Regex101, the first option involves the fewest steps, so it will be the most efficient method. However, it also dumps the rest of the string, but I don't know if that matters to you.
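To see the difference side by side, the original pattern and the three alternatives can be run against the second sample string. This is just a quick comparison harness; the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteGroupDemo
{
    // Returns the text captured by group 1, or null when the pattern does not match.
    public static string Capture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string val2 = "Content:'Scale['#13212']'ta";
        string[] patterns =
        {
            @".*'(.*)'.*",       // original: captures only ]
            @"'(.*)'",           // outermost quotes
            @".*?'(.*)'.*",      // lazy leading section
            @"[^']*'(.*)'[^']*"  // no quotes allowed outside the group
        };
        foreach (string p in patterns)
            Console.WriteLine(p + " -> " + Capture(val2, p));
    }
}
```

The last three patterns all capture `Scale['#13212']`; they differ only in how much of the surrounding text ends up in the overall match.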

First off use named match capture groups such as (?<Data> ... ) then you can access that group by its name in C# such as match1.Groups["Data"].Value.
Secondly try not to use * which means zero to many. Is there really going to be no data? For a majority of the cases, that answer is no, there is data.
Use the +, one to many instead.
IMHO * screws up more patterns because it has to find zero data, when it does that, it skips ungodly amounts of data. When you know there is data use +.
It is better to match on what is known, than unknown and we will create a pattern to what is known. Also in that light use the negation set [^ ] to capture text such as [^']+ which says capture everything that is not a ', one to many times.
Pattern
Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27
The results on your two sets of data are 23 and #13212 and placed into match capture group[1] and group["Data"].
Note \x27 is the hex escape of the single quote '. \x22 is for the double quote ", which I bet is what you are really running into.
I use the hex escapes when dealing with quotes so not to have to mess with the C# compiler thinking they are quotes while parsing.

Check String Only Contain Characters

I know this has been asked before, but my code isn't working.
The scenario is that I need to check whether a string contains ONLY letters, numbers and spaces. I need to fail if it contains anything else.
I've tried the RegEx method, but I don't understand regular expressions, so I need to use a LINQ method for my assessment.
Here is my code:
if (!CSVItemArray[count].All(Char.IsLetterOrDigit) && !CSVItemArray[count].Contains(" "))
{
return false;
}

Just combine the check for letter, digit, or whitespace in the All query:
if (!CSVItemArray[count].All(c => Char.IsLetterOrDigit(c) || Char.IsWhiteSpace(c)))
{
return false;
}

Your logic is a little confused. The following returns true if the string in CSVItemArray[count] only contains letters, digits and white spaces:
return CSVItemArray[count].All(c => Char.IsLetterOrDigit(c) || Char.IsWhiteSpace(c));
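The question's version only rejects a string when it has a disallowed character and also contains no space, so something like "a b!" slips through. A small self-contained sketch of the combined check follows; the helper name is hypothetical, and note that an empty string passes because `All` is true for an empty sequence.

```csharp
using System;
using System.Linq;

public static class CsvItemCheck
{
    // True when every character is a letter, a digit, or whitespace.
    // An empty string also returns true (All over nothing is true).
    public static bool IsLettersDigitsOrSpaces(string item)
    {
        return item.All(c => char.IsLetterOrDigit(c) || char.IsWhiteSpace(c));
    }

    public static void Main()
    {
        Console.WriteLine(IsLettersDigitsOrSpaces("abc 123")); // True
        Console.WriteLine(IsLettersDigitsOrSpaces("a b!"));    // False
    }
}
```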

Avoiding something (e.g. Regex) because you don't understand it is a bad thing, at least for developers, particularly when what you want to do can easily be achieved with a regex.
Having said this you may simply use this:
Regex r = new Regex("^[A-Za-z0-9\\s]*$");
var valid = r.IsMatch(myString);
This matches any number of upper- or lowercase letters, digits and whitespace characters. The set of allowed characters is enclosed in [], and the following * sets how many times a character from the set can occur in the string (in your case from none up to unlimited times). The ^ and $ mark the start and end of your string respectively. This prevents a string such as %asdfgh12345 // from matching, for instance.
EDIT: If you need Umlauts also (ä, ö, ü, ß, ...) you may have a look at this post which handles special characters also.

How can I split the following string into a string array

I want to split the following:
name[]address[I]dob[]nationality[]occupation[]
So my results would be:
name[]
address[I]
dob[]
nationality[]
occupation[]
I have tried using Regex.Split but can't get these results.

You can use Regex.Split with the following regex:
(?<=])(?=[a-z])
which will split between a closing square bracket on the left and a letter on the right. This is done using lookaround assertions. They don't consume any characters of the match so in this constellation they're pretty handy to match between letters.
Basically it means exactly what I wrote: (?<=]) will match a point in the string preceded by a closing bracket, while (?=[a-z]) matches a point in the string (both zero-width, i.e. between characters) where a letter follows. You can tweak that a little if your input data looks different from what you gave us in the question.
You could also simplify it a little, at the expense of readability, by using (?<=])\b. But I would advise against that since \b is tied to \w which is a really ugly thing, usually. It would work roughly the same, but not quite, as \b in this context amounts to (?=[\w]) and \w matches a lot more things, namely decimal digits and an underscore too.
Quick PowerShell test (it uses the same regex implementation since it's .NET underneath):
PS> 'name[]address[I]dob[]nationality[]occupation[]' -split '(?<=])(?=[a-z])'
name[]
address[I]
dob[]
nationality[]
occupation[]
Just for completeness, there is also another option. You can either split the string between the tokens you want to retain, or you could just collect all matches of tokens you want to keep. In the latter case you'll need a pattern that matches what you need, such as
[a-z]+\[[^\]]*]
or what Dennis gave as an answer (I just tend to avoid \w and \b except for quick and dirty hacks or golfing since I maintain that they have no useful application). You can use that with Regex.Matches.
Generally both approaches can work fine, it then depends on whether the split or the match pattern is easier to understand. And for Regex.Matches you'll get Match objects so you don't actually end up with a string[] if you need that, so that'd require .Select(m => m.Value) as well.
In this case I guess neither regex should be left alone without a comment explaining what it does. I can read them just fine, but many developers are a little uneasy around regexes and especially more advanced concepts like lookaround often warrant an explanation.
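Both approaches described above, sketched together so the results can be compared. The class and method names are placeholders; the two patterns are the ones given in this answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketTokens
{
    // Split between a closing bracket on the left and a letter on the right.
    public static string[] BySplit(string input)
    {
        return Regex.Split(input, @"(?<=])(?=[a-z])");
    }

    // Collect every token of the form letters + [ ... ].
    public static string[] ByMatches(string input)
    {
        return Regex.Matches(input, @"[a-z]+\[[^\]]*]")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "name[]address[I]dob[]nationality[]occupation[]";
        Console.WriteLine(string.Join(", ", BySplit(input)));
        Console.WriteLine(string.Join(", ", ByMatches(input)));
        // Both lines print: name[], address[I], dob[], nationality[], occupation[]
    }
}
```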

text.Split(new Char[] { ']' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s + "]").ToArray();

Use this regex pattern:
\w*\[\w*\]

Regular expression should be fine. You can also consider to catch the opening and the closing square brackets with string.IndexOf, for example:
IEnumerable<string> Results(string input)
{
int currentIndex = -1;
while (true)
{
currentIndex++;
int openingBracketIndex = input.IndexOf("[", currentIndex);
int closingBracketIndex = input.IndexOf("]", currentIndex);
if (openingBracketIndex == -1 || closingBracketIndex == -1)
yield break;
yield return input.Substring(currentIndex, closingBracketIndex - currentIndex + 1);
currentIndex = closingBracketIndex;
}
}

string inputString = "name[]address[I]dob[]nationality[]occupation[]";
var result = Regex.Matches(inputString, @".*?\[I?\]").Cast<Match>().Select(m => m.Groups[0].Value).ToArray();

Regex which ensures no character is repeated

I need to ensure that a input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
eg. ABCA is not valid because 'A' is being repeated.
For the upper case thing, [A-Z] should be fine.
But i am lost at how to ensure no repeating characters.
Can someone suggest some method using regular expressions ?

You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which ones you have seen.
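The character-by-character approach could look roughly like this. It is a sketch with a made-up method name, and it assumes an empty string should count as invalid, which the question does not specify.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueUppercase
{
    // True when the string is non-empty, contains only A-Z,
    // and no character appears more than once.
    // Assumption: an empty string is treated as invalid.
    public static bool IsValid(string s)
    {
        var seen = new HashSet<char>();
        foreach (char c in s)
        {
            if (c < 'A' || c > 'Z')
                return false;        // not an upper-case ASCII letter
            if (!seen.Add(c))
                return false;        // Add returns false when c was already present
        }
        return s.Length > 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("ABCD")); // True
        Console.WriteLine(IsValid("ABCA")); // False
    }
}
```

The loop stops at the first offending character, so it never scans more of the string than it has to.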

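This is not practical with regular expressions in the formal sense: they describe regular languages and have no way to remember which characters have already been seen, short of enumerating every allowed combination. Without extensions such as backreferences, the only way to achieve this is to write the function by hand.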
See formal grammar for background theory.

Why not instead check for a character that is repeated or not in uppercase? With something like ([A-Z])?.*?([^A-Z]|\1)

Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.

This isn't regex, and would be slow, but You could create an array of the contents of the string, and then iterate through the array comparing n to n++
=Waldo

It can be done using what is called a backreference.
I am a Java programmer, so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
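The regular expression is ([A-Z]).*\1 (written "([A-Z]).*\\1" as a Java string literal), which translates to: anything from 'A' to 'Z' as group 1 (([A-Z])), then anything else (.*), and then group 1 again.
In C# the backreference inside the pattern is also written \1; $1 is only used in replacement strings.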
Hope this helps.
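For reference, a rough C# counterpart of the Java snippet. In a .NET pattern the backreference is written `\1`, and a verbatim string avoids doubling the backslash. The second test string is an assumption added to show a non-match, and the class name is invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatCheck
{
    // True when some upper-case letter occurs again later in the string.
    public static bool HasRepeat(string s)
    {
        return Regex.IsMatch(s, @"([A-Z]).*\1");
    }

    public static void Main()
    {
        Console.WriteLine(HasRepeat("ABCDA")); // True: A occurs twice
        Console.WriteLine(HasRepeat("ABCDE")); // False: no letter repeats
    }
}
```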
