.NET's Regex class and newline - c#

Why doesn't .NET regex treat \n as end of line character?
Sample code:
string[] words = new string[] { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
Regex regex = new Regex("^[a-z0-9]+$");
foreach (var word in words)
{
Console.WriteLine("{0} - {1}", word, regex.IsMatch(word));
}
And this is the response I get:
ab1 - True
ab2
- True
ab3
- False
- False
ab5
- False
ab6
- False
Why does the regex match ab2\n?
Update:
I don't think Multiline is a good solution: I want to validate a login so that it contains only the specified characters, and it must be a single line. If I change the constructor to use the Multiline option, ab1, ab2, ab3 and ab6 match the expression; ab4 and ab5 don't.

If the string ends with a line break, RegexOptions.Multiline will not help. The $ simply ignores the last line break, since there is nothing after it.
If you want to match only at the very end of the string, without skipping a trailing line break, use \z:
Regex regex = new Regex(@"^[a-z0-9]+\z", RegexOptions.Multiline);
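\z behaves the same with or without Multiline and Singleline; neither option affects it. Note that ^ is affected by Multiline, so to validate a whole string, leave that option off or use \A instead of ^.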

The .NET regex engine does treat \n as end-of-line. And that's a problem if your string has Windows-style \r\n line breaks. With RegexOptions.Multiline turned on $ matches between \r and \n rather than before \r.
$ also matches at the very end of the string just like \z. The difference is that \z can match only at the very end of the string, while $ also matches before a trailing \n. When using RegexOptions.Multiline, $ also matches before any \n.
If you're having trouble with line breaks, a trick is to first do a search-and-replace that replaces every \r with nothing, so that all your lines end with \n only.
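As a minimal sketch of the difference described above, the following program checks the question's six sample strings with \A and \z instead of ^ and $. The names LoginCheck and IsValid are made up for illustration; only Regex and IsMatch come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoginCheck
{
    // \A and \z anchor to the absolute start and end of the input,
    // so a trailing "\n" is not tolerated (unlike with $).
    private static readonly Regex Strict = new Regex(@"\A[a-z0-9]+\z");

    // Hypothetical helper: true only if the whole string is lowercase letters and digits.
    public static bool IsValid(string s)
    {
        return Strict.IsMatch(s);
    }

    public static void Main()
    {
        string[] words = { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
        foreach (var word in words)
        {
            // Show control characters as escapes so each result stays on one line.
            string shown = word.Replace("\r", "\\r").Replace("\n", "\\n");
            Console.WriteLine("{0} - {1}", shown, IsValid(word));
        }
    }
}
```

With this pattern only "ab1" is accepted; every string that carries a \r or \n is rejected.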

From RegexOptions:
Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
So basically, if you pass RegexOptions.Multiline to the Regex constructor, you are instructing that instance to let $ match before any newline character, not only at the end of the string itself.

Use regex options, System.Text.RegularExpressions.RegexOptions:
string[] words = new string[] { "ab1", "ab2\n", "ab3\n\n", "ab4\r", "ab5\r\n", "ab6\n\r" };
foreach (var word in words)
{
// IsMatch(string, string, RegexOptions) is static, so it is called on the Regex type, not on an instance.
Console.WriteLine("{0} - {1}", word,
Regex.IsMatch(word, "^[a-z0-9]+$",
System.Text.RegularExpressions.RegexOptions.Singleline |
System.Text.RegularExpressions.RegexOptions.IgnoreCase |
System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace));
}

Could be the usual Windows/Linux line ending differences. But it's still strange that \n\n gets a false this way... Did you try with the RegexOptions.Multiline flag set?

Just to add more detail to Smazy's answer, this is an extract from:
Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. Copyright 2009 Jan Goyvaerts and Steven Levithan, 978-0-596-2068-7
The difference between ‹\Z› and ‹\z› comes into play when the last character in your subject text is a line break. In that case, ‹\Z› can match at the very end of the subject text, after the final line break, as well as immediately before that line break. The benefit is that you can search for ‹omega\Z› without having to worry about stripping off a trailing line break at the end of your subject text. When reading a file line by line, some tools include the line break at the end of the line, whereas others don’t; ‹\Z› masks this difference. ‹\z› matches only at the very end of the subject text, so it will not match text if a trailing line break follows.
The anchor ‹$› is equivalent to ‹\Z›, as long as you do not turn on the “^ and $ match at line breaks” option. This option is off by default for all regex flavors except Ruby. Ruby does not offer a way to turn this option off. Just like ‹\Z›, ‹$› matches at the very end of the subject text, as well as before the final line break, if any.
Of course, I wouldn't have found it without Smazy's answer.

Related

Regex Conditional Values

If I have a string like the following that can have two possible values (although the value JB37 can be variable)
String One\r\nString Two\r\n
String One\r\nJB37\r\n
And I only want to capture the string if the value following String One\r\n does NOT equal String Two\r\n, how would I code that in Regex?
So normally without any condition, this is what I want:
String One\r\n(.+?)\r\n
With regex, you may resort to a negative lookahead:
String One\r\n(?!String Two(?:\r\n|$))(.*?)(?:\r\n|$)
See the regex demo
You may also use [^\r\n] instead of .:
String One\r\n(?!String Two(?:\r\n|$))([^\r\n]*)
If you use RegexOptions.Multiline, you will also be able to use
(?m)String One\r\n(?!String Two\r?$)(.*?)\r?$
See yet another demo.
Details
(?m) - a RegexOptions.Multiline option that makes ^ match start of a line and $ end of line positions
String One\r\n - String One text followed with a CRLF line ending
(?!String Two\r?$) - a negative lookahead that fails the match if immediately to the right of the current location, there is String Two at the end of the line
(.*?) - Capturing group 1: any zero or more chars other than line break chars, as few as possible, up to the leftmost occurrence of
\r?$ - an optional CR and end of the line (note that in a .NET regex, $ matches only in front of LF, not CR, in the multiline mode, thus, \r? is necessary).
C# demo:
var m = Regex.Match(s, @"(?m)String One\r\n(?!String Two\r?$)(.*?)\r?$");
if (m.Success)
{
Console.WriteLine(m.Groups[1].Value);
}
If CR can be missing, add ? after each \r in the pattern.

Regex option "Multiline"

I have a regex to match a comma-separated list of dates, each in the format
yyyy/mm/dd or yyyy/mm
For example:
2016/09/02,2016/08,2016/09/30
My code:
string data="21535300/11/11\n";
Regex reg = new Regex(@"^(20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|30|31))?,?)*$",
RegexOptions.Multiline);
if (!reg.IsMatch(data))
"Error".Dump();
else
"True".Dump();
I use the Multiline option.
If the string data contains "\n", any input seems to match this regex.
For example:
string data="test\n"
string data="2100/1/1"
I find option definition in MSDN. It says:
It changes the interpretation of the ^ and $ language elements so that they match the beginning and end of a line, instead of the beginning and end of the input string.
I don't understand why this happens.
Can anyone explain it?
Thanks.
Your regex can match an empty line that you get once you add a newline at the end of the string. "test\n" contains 2 lines, and the second one gets matched.
See your regex pattern in a free-spacing mode:
^ # Matches the start of a line
( # Start of Group 1
20\d{2}/
(0[1-9]|1[012])
(/
(0[1-9]|[12]\d|30|31)
)?,?
)* # End of group 1 - * quantifier makes it match 0+ times
$ # End of line
If you do not want it to match an empty line, replace the last )* with )+.
An alternative is to use a more unrolled pattern like
^20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|3[01]))?(,20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|3[01]))?)*$
See the regex demo. Inside the code, it is advisable to use a block and build the pattern dynamically:
string date = @"20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|3[01]))?";
Regex reg = new Regex(string.Format("^{0}(,{0})*$", date), RegexOptions.Multiline);
As you can see, the first block (after the start of the line ^ anchor) is obligatory here, and thus an empty line will never get matched.
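To make the empty-line effect concrete, here is a small sketch that runs the question's original pattern and the unrolled pattern against the same inputs. The class and member names (DateListDemo, Loose, Strict) are illustrative only and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateListDemo
{
    // One date: yyyy/mm with an optional /dd part.
    private const string Date = @"20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|3[01]))?";

    // Original pattern: the * quantifier lets the group match zero times,
    // so an empty line (such as the one after a trailing "\n") matches.
    public static readonly Regex Loose = new Regex(
        @"^(20\d{2}/(0[1-9]|1[012])(/(0[1-9]|[12]\d|30|31))?,?)*$",
        RegexOptions.Multiline);

    // Unrolled pattern: the first date is mandatory, so an empty line cannot match.
    public static readonly Regex Strict = new Regex(
        string.Format("^{0}(,{0})*$", Date),
        RegexOptions.Multiline);

    public static void Main()
    {
        string[] inputs = { "test\n", "21535300/11/11\n", "2016/09/02,2016/08,2016/09/30" };
        foreach (var input in inputs)
        {
            string shown = input.Replace("\n", "\\n");
            Console.WriteLine("{0}: loose={1}, strict={2}",
                shown, Loose.IsMatch(input), Strict.IsMatch(input));
        }
    }
}
```

The first two inputs match only the loose pattern, because of the empty line after the trailing \n; the comma-separated list matches both.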

C# - Removing single word in string after certain character

I have a string from which I would like to remove any word following a "\", whether in the middle or at the end, such as:
testing a\determiner checking test one\pronoun
desired result:
testing a checking test one
I have tried a simple regex that removes anything between the backslash and whitespace, but it gives the following result:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*\s");
string output = regex.Replace(input, " ");
Result:
testing a one\pronoun
It looks like this regex matches from the backslash until the last whitespace in the string. I cannot seem to figure out how to match from the backslash to the next whitespace. Also, I am not guaranteed a whitespace at the end, so I would need to handle that. I could continue processing the string and remove any text after the backslash, but I was hoping I could handle both cases in one step.
Any advice would be appreciated.
Change .*, which matches any characters, to \w*, which matches only word characters.
Regex regex = new Regex(@"\\\w*");
string output = regex.Replace(input, "");
".*" matches zero or more characters of any kind. Consider using "\w+" instead, which matches one or more "word" characters (not including whitespace).
Using "+" instead of "*" would allow a backslash followed by a non-"word" character to remain unmatched. For example, no matches would be found in the sentence "Sometimes I experience \ an uncontrollable compulsion \ to intersperse backslash \ characters throughout my sentences!"
With your current pattern, .* tells the parser to be "greedy," that is, to take as much of the string as possible until it hits a space. Adding a ? right after that * tells it instead to make the capture as small as possible--to stop as soon as it hits the first space.
Next, you want to end not just at a space, but at either a space or the end of the string. The $ symbol matches the end of the string, and | means or. Group those together using parentheses and your group collectively tells the parser to stop at either a space or the end of the string. Your code will look like this:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*?(\s|$)");
string output = regex.Replace(input, " ");
Try this regex: (\\[^\s]*)
1st capturing group (\\[^\s]*)
\\ matches the character \ literally
[^\s]* matches any character not present in the list below
Quantifier *: between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s matches any whitespace character [\r\n\t\f ]
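A short sketch of that last pattern applied to the question's input follows. The class BackslashWords and its Strip method are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackslashWords
{
    // Removes a backslash and every non-whitespace character that follows it.
    // [^\s]* stops at the next whitespace or at the end of the string,
    // so both the middle and the trailing case are handled in one step.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"\\[^\s]*", "");
    }

    public static void Main()
    {
        // A verbatim string keeps the backslashes literal.
        string input = @"testing a\determiner checking test one\pronoun";
        Console.WriteLine(Strip(input)); // testing a checking test one
    }
}
```

Because the character class excludes whitespace, the spaces between words are left untouched and no extra trailing space is added.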

Why doesn't $ in .NET multiline regular expressions match CRLF?

I have noticed the following:
var b1 = Regex.IsMatch("Line1\nLine2", "Line1$", RegexOptions.Multiline); // true
var b2 = Regex.IsMatch("Line1\r\nLine2", "Line1$", RegexOptions.Multiline); // false
I'm confused. The documentation of RegexOptions says:
Multiline:
Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
Since C# and VB.NET are mainly used in the Windows world, I would guess that most files processed by .NET applications use CRLF linebreaks (\r\n) rather than LF linebreaks (\n). Still, it seems that the .NET regular expression parser does not recognize a CRLF linebreak as an end of line.
I know that I could work around this, for example, by matching Line1\r?$, but it still strikes me as strange. Is this really the intended behaviour of the .NET regexp parser, or did I miss some hidden UseWindowsLinebreaks option?
From MSDN:
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline
So I can't say why (compatibility with regular expressions from other languages?), but at the very least it's intended.
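The documented workaround can be checked with a small sketch like the one below. CrlfDemo and MatchesMultiline are illustrative names, not part of any .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrlfDemo
{
    // True if the pattern matches anywhere in the text with RegexOptions.Multiline set.
    public static bool MatchesMultiline(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern, RegexOptions.Multiline);
    }

    public static void Main()
    {
        // Plain $ matches only directly before "\n", so the "\r" of a CRLF gets in the way.
        Console.WriteLine(MatchesMultiline("Line1\nLine2", @"Line1$"));        // True
        Console.WriteLine(MatchesMultiline("Line1\r\nLine2", @"Line1$"));      // False

        // \r?$ consumes the optional carriage return first, so both line endings work.
        Console.WriteLine(MatchesMultiline("Line1\nLine2", @"Line1\r?$"));     // True
        Console.WriteLine(MatchesMultiline("Line1\r\nLine2", @"Line1\r?$"));   // True
    }
}
```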

Regex issue with SingleLine Option and \n before end of string

I am trying to match a certain end of a file, where the "certain end of the file" could go over multiple lines.
My regex looks like follows:
"\s\w$"
What I want to do: Find all files that end with a whitespace character, followed by a "human readable character" at the very end of the file.
Regex.IsMatch("arbitrarytext a\n",@"\s\w$")
Problem is it matches the following string also:
"arbitrarytext a\n"
I also tried RegexOptions.Singleline, although this should only change the matching behavior of the dot (.).
How can I rewrite my regex so that it still fulfills my needs but does not match the example given above?
Secondly I'm also interested in an explanation why it matches the example at all.
Using: .Net 3.5 SP1 if that is of interest.
The problem is that $ matches at the end of the string before the final newline character (if there is one). Unless you use RegexOptions.Multiline, $ means the same as \Z.
Use \z instead:
Regex.IsMatch("arbitrarytext a\n",@"\s\w\z")
will fail.
See also this tutorial about anchors, specifically the section "Strings Ending with a Line Break".
A short overview:
Symbol  Means...                                       If multiline mode is...
------  ---------------------------------------------  -----------------------
^       Start of string                                off (default*)
^       Start of current line                          on
\A      Start of string                                irrelevant
$       End of string, before final newline, if any    off
$       End of current line, before newline, if any    on
\Z      End of string, before final newline, if any    irrelevant
\z      End of string                                  irrelevant
*: In Ruby, multiline mode is always on. Use \A or \Z to get ^ or $ behavior.
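The end-of-string rows of the overview can be tried out with a sketch like this one, which uses the question's sample string. The class name AnchorDemo and the helper EndsWithWord are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Tests "whitespace, then a word character" followed by the given end anchor.
    // No RegexOptions are passed, so multiline mode is off.
    public static bool EndsWithWord(string text, string endAnchor)
    {
        return Regex.IsMatch(text, @"\s\w" + endAnchor);
    }

    public static void Main()
    {
        string text = "arbitrarytext a\n";

        Console.WriteLine(EndsWithWord(text, "$"));   // True: $ also matches before a final \n
        Console.WriteLine(EndsWithWord(text, @"\Z")); // True: \Z behaves like $ here
        Console.WriteLine(EndsWithWord(text, @"\z")); // False: \z matches only at the very end
    }
}
```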
