C# Regex Split Quotes and Comma Syntax Error [duplicate] - c#

This question already has answers here:
Can I escape a double quote in a verbatim string literal?
(6 answers)
How to split csv whose columns may contain comma
(9 answers)
Closed 4 years ago.
I have a text file as follows:
"0","Column","column2","Column3"
I have managed to get the data down to split to the following:
"0"
"Column"
"Column2"
"Column3"
with ,(?=(?:[^']*'[^']*')*[^']*$). Now I want to remove the quotes. I have tested the expression [^\s"']+|"([^"]*)"|\'([^\']*) in an online regex tester, which gives the output I'm looking for. However, I get a syntax error when using the expression in C#:
String[] columns = Regex.Split(dataLine, "[^\s"']+|"([^"]*)"|\'([^\']*)");
Syntax error ',' expected
I've tried escaping characters, but to no avail. Am I missing something?
Any help would be greatly appreciated!
Thanks.

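C# might be trying to process the backslash as an escape sequence, and the unescaped double quotes end the string literal early. Try a verbatim string literal with the double quotes doubled:
String[] columns = Regex.Split(dataLine, @"[^\s""']+|""([^""]*)""|\'([^\']*)");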

The problem is the double quotes inside the regex: the compiler treats the first one as the end of the string literal and chokes on what follows.
You must escape them (and, in a regular string literal, the backslash in \s as well), like this:
"[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)"
Edit:
You can actually do everything you want with one regex, without splitting first:
@"(?<=[""])[^,]*?(?=[""])"
Here I use an @-quoted (verbatim) string, where double quotes are doubled instead of backslash-escaped.
The regex uses a lookbehind to find a double quote, then matches any character except a comma ',' zero or more times, then looks ahead for a double quote.
How to use:
string test = @"""0"",""Column"",""column2"",""Column3""";
Regex regex = new Regex(@"(?<=[""])[^,]*?(?=[""])");
foreach (Match match in regex.Matches(test))
{
    Console.WriteLine(match.Value);
}

You need to escape the double quotes inside of your regular expression, as they're closing the string literal. Also, to handle 'unrecognized escape sequences', you'll need to escape the \ in \s.
Two ways to do this:
Escape all the characters of concern using backslashes: "[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)"
Use the @ syntax to denote a "verbatim" string literal. Double quotes still need to be escaped, but by writing "" for every " instead: @"[^\s""']+|""([^""]*)""|'([^']*)"
Regardless, when I test out your new regular expression it appears to be capturing some empty groups as well, see here: https://dotnetfiddle.net/1WQE4R
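As a minimal sketch of the two escaping styles described above, the following program checks that a regular literal and a verbatim literal produce the same pattern text, then uses a capture group with Regex.Matches (rather than Regex.Split) to pull out the unquoted values. The class name QuotedFieldDemo, the method ExtractFields, and the simplified pattern "([^"]*)" are assumptions made for illustration; the sketch does not handle doubled quotes inside a field.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedFieldDemo
{
    // Both literals yield the same pattern text: "([^"]*)"
    public const string RegularLiteral = "\"([^\"]*)\"";
    public const string VerbatimLiteral = @"""([^""]*)""";

    // Returns the text found between each pair of double quotes.
    public static List<string> ExtractFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(line, VerbatimLiteral))
        {
            // Group 1 holds the field content without the surrounding quotes.
            fields.Add(m.Groups[1].Value);
        }
        return fields;
    }

    public static void Main()
    {
        Console.WriteLine(RegularLiteral == VerbatimLiteral); // True

        string dataLine = "\"0\",\"Column\",\"column2\",\"Column3\"";
        Console.WriteLine(string.Join("|", ExtractFields(dataLine))); // 0|Column|column2|Column3
    }
}
```

Reading group 1 of each match avoids the empty entries that Regex.Split tends to produce when the pattern itself consumes the fields.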

Related

What does regex expression match pattern "\\[.*\\]" mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I am new to regex. What does regex expression match pattern "\\[.*\\]" mean?
If I have a text like "Hello [Here]", then success is returned in the match. And match contain [Here].
I read that:
. indicates Any except \n (newline),
* indicates 0 or more times
I don't understand the "\\". I believe it is just the escape sequence for "\".
So, is the expression "\\[.*\\]" trying to match a pattern like \[Any text\]?
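Almost. In C# source, "\\[.*\\]" is a regular string literal in which each \\ is the escape sequence for a single backslash, so the regex engine sees \[.*\]. There, \[ and \] match literal square brackets, so the pattern matches any text enclosed in [], such as [Here]; it does not require backslashes in the input. The .* means zero or more characters between the brackets.
A regex tester such as regexr is also very helpful: you can enter the regex pattern and check for matches easily.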
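A small sketch of the behaviour discussed here: the regular literal "\\[.*\\]" and the verbatim literal @"\[.*\]" are the same pattern, and it matches bracketed text without any backslash in the input. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Returns the first match of \[.*\] in the input, or "" when nothing matches.
    public static string FirstMatch(string input)
    {
        string pattern = "\\[.*\\]"; // the regex engine receives \[.*\]
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine("\\[.*\\]" == @"\[.*\]");     // True
        Console.WriteLine(FirstMatch("Hello [Here]"));  // [Here]
        // .* is greedy, so it runs to the last closing bracket on the line.
        Console.WriteLine(FirstMatch("a [b] c [d]"));   // [b] c [d]
    }
}
```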

Unable to remove certain characters between values in c# [duplicate]

This question already has answers here:
Remove text in-between delimiters in a string (using a regex?)
(5 answers)
Closed 6 years ago.
I am trying to remove characters starting from (and including) rgm up to (and including) ;1..
Example input string:
Sum ({rgmdaerudsb;1.Total_Value}, {rgmdaerub;1.Major_Value})
Code:
string strEx = "Sum ({rgmdaerudsb;1.Total_Value}, {rgmdaerub;1.Major_Value})";
strEx = strEx.Substring(0, strEx.LastIndexOf("rgm")) +
strEx.Substring(strEx.LastIndexOf(";1.") + 3);
Result:
Sum ({rgmdaerub;1.Total_Value}, {.Major_Value})
Expected result:
Sum ({Total_Value}, {Major_Value})
Note: only rgm and ;1. will remain static and characters between them will vary.
I would recommend using Regex for this purpose. Try this:
string input = "Sum ({rgmdaerudsb;1.Total_Value}, {rgmdaerub;1.Major_Value})";
string result = Regex.Replace(input, @"rgm.*?;1\.", "");
Explanation:
The second parameter of Regex.Replace is the pattern, which consists of the following:
rgm (your starting string)
. (dot, meaning any character)
*? (the preceding symbol can occur zero or more times, but matching stops at the first possible match (shortest))
;1\. (your ending string; the dot needs to be escaped, otherwise it would mean any character)
You need to use Regex, with an expression like @"rgm.*?;1\.". That's just off the top of my head, so you will have to verify the exact regular expression that matches your pattern. Then use Regex.Replace() with it.
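To illustrate why the lazy quantifier matters for this input, here is a short sketch comparing the lazy pattern with a greedy variant. The class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripDemo
{
    // Lazy: each match stops at the nearest ";1." after "rgm".
    public static string StripLazy(string input) =>
        Regex.Replace(input, @"rgm.*?;1\.", "");

    // Greedy: a single match runs from the first "rgm" to the last ";1.".
    public static string StripGreedy(string input) =>
        Regex.Replace(input, @"rgm.*;1\.", "");

    public static void Main()
    {
        string input = "Sum ({rgmdaerudsb;1.Total_Value}, {rgmdaerub;1.Major_Value})";
        Console.WriteLine(StripLazy(input));   // Sum ({Total_Value}, {Major_Value})
        Console.WriteLine(StripGreedy(input)); // Sum ({Major_Value})
    }
}
```

Unlike the Substring/LastIndexOf approach, Regex.Replace removes every occurrence, not only the last one.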

What is the difference between @ and \ operators in string? [duplicate]

This question already has answers here:
What is the difference between a regular string and a verbatim string?
(6 answers)
Closed 7 years ago.
There is the @ operator that you place in front of a string to allow special characters in it, and there is \. I am aware that with @ you can also use reserved names for variables, but I am curious only about the difference between these two when used with strings.
Searching the web suggested that the two are the same, but I still believe there has to be some difference between @ and \.
Code to test:
string _string0 = @"Just a ""qoute""";
string _string1 = "Just a \"qoute\"";
Console.WriteLine("{0} | {1}",_string0, _string1);
Question: what is the difference between @"Just a ""qoute"""; and "Just a \"qoute\""; only regarding strings?
Edit: Question is already answered here.
Using @ (which denotes a verbatim string literal) you can put any character into the string, even line breaks. The only character you need to escape is the double quote, which is written as "". The usual backslash escape sequences and Unicode escape sequences are not processed in such string literals.
Without @ (in a regular string literal), you need to escape every special character, such as line breaks (\n), double quotes (\") and the backslash itself (\\).
You can read more about it at the C# Programming Guide:
https://msdn.microsoft.com/en-us/library/ms228362.aspx#Anchor_3
@ marks a verbatim string: it turns off escape processing for the whole string, so you do not have to escape each special character individually, whereas \ escapes only the single character that follows it.
More info about strings: https://msdn.microsoft.com/en-us/library/aa691090%28v=vs.71%29.aspx
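The difference can be checked directly. The sketch below compares a few regular and verbatim literals; the class name and the sample path are illustrative assumptions, not taken from the question.

```csharp
using System;

public static class VerbatimDemo
{
    // The same runtime strings written with the two literal styles.
    public const string RegularPath = "C:\\temp\\new.txt";
    public const string VerbatimPath = @"C:\temp\new.txt";
    public const string RegularQuote = "Just a \"qoute\"";
    public const string VerbatimQuote = @"Just a ""qoute""";

    public static void Main()
    {
        Console.WriteLine(RegularPath == VerbatimPath);   // True
        Console.WriteLine(RegularQuote == VerbatimQuote); // True

        // In a regular literal \t is one tab character;
        // in a verbatim literal it is a backslash followed by 't'.
        Console.WriteLine("\t".Length);  // 1
        Console.WriteLine(@"\t".Length); // 2
    }
}
```

Both styles produce ordinary string values at run time; the @ prefix only changes how the compiler reads the source text.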

Filter out alphabetic with regex using C# [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Regex - Only letters?
I try to filter out alphabetics ([a-z],[A-Z]) from text.
I tried "^\w$" but it filters alphanumeric (alpha and numbers).
What is the pattern to filter out alphabetic?
Thanks.
To remove all letters try this:
void Main()
{
    var str = "some junk456456%^&%*333";
    Console.WriteLine(Regex.Replace(str, "[a-zA-Z]", ""));
}
For filtering out only English alphabets use:
[^a-zA-Z]+
For filtering out alphabets regardless of the language use:
[^\p{L}]+
If you want to reverse the effect remove the hat ^ right after the opening brackets.
If you want to find whole lines that match the pattern, enclose the above patterns within ^ and $ signs; otherwise you don't need them. Note that to make them take effect for every line, you'll need to create the Regex object with the multi-line option (RegexOptions.Multiline) enabled.
Try this simple way:
var result = Regex.Replace(inputString, @"[^a-zA-Z\s]", "");
Explanation:
+
Matches the previous element one or more times.
[^character_group]
Negation: Matches any single character that is not in character_group.
\s
Matches any white-space character.
To filter multiple alpha characters use
^[a-zA-Z]+$
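The patterns suggested in these answers do different things, so a side-by-side sketch may help. The class and method names are hypothetical, and the accented sample text is an assumption added to show the \p{L} case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterFilterDemo
{
    // Removes ASCII letters and keeps everything else.
    public static string RemoveLetters(string s) => Regex.Replace(s, "[a-zA-Z]", "");

    // Keeps only ASCII letters.
    public static string KeepAsciiLetters(string s) => Regex.Replace(s, "[^a-zA-Z]+", "");

    // Keeps letters of any language (Unicode category L).
    public static string KeepAnyLetters(string s) => Regex.Replace(s, @"[^\p{L}]+", "");

    // True when the string consists solely of ASCII letters.
    public static bool IsOnlyLetters(string s) => Regex.IsMatch(s, "^[a-zA-Z]+$");

    public static void Main()
    {
        string junk = "some junk456456%^&%*333";
        Console.WriteLine(RemoveLetters(junk));    // " 456456%^&%*333" (the space remains)
        Console.WriteLine(KeepAsciiLetters(junk)); // somejunk

        string accented = "h\u00e9llo 123 w\u00f6rld"; // héllo 123 wörld
        Console.WriteLine(KeepAsciiLetters(accented)); // hllowrld
        Console.WriteLine(KeepAnyLetters(accented));   // héllowörld

        Console.WriteLine(IsOnlyLetters("abc"));  // True
        Console.WriteLine(IsOnlyLetters("abc1")); // False
    }
}
```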

How do I write a regex to match a string that doesn't contain a word? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Regular expression to match string not containing a word?
To not match a set of characters I would use e.g. [^\"\\r\\n]*
Now I want to not match a fixed character sequence, e.g. "|="
In other words, I want to match: ( not ", not \r, not \n, and not |= ).
EDIT: I am trying to modify the regex for parsing data separated with delimiters. The single-delimiter solution I got from a CSV parser, but now I want to expand it to include multi-character delimiters. I do not think lookaheads will work, because I want to consume, not just assert and discard, the matching characters.
I figured it out, it should be: ((?![\"\\r\\n]|[|][=]).)*
The full regex, modified from the CSV parser link in the original post, will be: ((?<field>((?![\"\\r\\n]|[|][=]).)*)|\"(?<field>([^\"]|\"\")*)\")([|][=]|(?<rowbreak>\\r\\n|\\n|$))
This will match any amount of characters of ( not ", not \r, not \n, and not |= ), or a quoted string, followed by ( "|=" or end of line )
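A reduced sketch of this idea, assuming only unquoted fields: the ((?!...).)* construct consumes one character at a time as long as the next position does not start a forbidden token, so a lone | or = stays inside the field while the two-character delimiter |= ends it. The class name, method name, and the sep group are illustrative additions; the quoted-field branch of the full regex is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiCharDelimiterDemo
{
    // field: characters that are not ", \r, \n and do not begin "|=".
    // sep:   the "|=" delimiter; when absent, the match ended at end of input.
    private const string Pattern =
        @"(?<field>(?:(?![""\r\n]|[|][=]).)*)(?:(?<sep>[|][=])|$)";

    public static List<string> SplitFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(line, Pattern))
        {
            fields.Add(m.Groups["field"].Value);
            // Stop after the field that ends the input, so the trailing
            // zero-length match is not reported as an extra field.
            if (!m.Groups["sep"].Success) break;
        }
        return fields;
    }

    public static void Main()
    {
        // A lone '|' or '=' is kept; only the sequence "|=" separates fields.
        Console.WriteLine(string.Join(",", SplitFields("a|b|=c=d|=e"))); // a|b,c=d,e
    }
}
```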
