I am trying to match a certain ending of a file, where that ending could span multiple lines.
My regex looks as follows:
"\s\w$"
What I want to do: Find all files that end with a whitespace character, followed by a "human readable character" at the very end of the file.
Regex.IsMatch("arbitrarytext a\n",@"\s\w$")
The problem is that it also matches the following string:
"arbitrarytext a\n"
I also tried RegexOptions.Singleline, although this should only change the matching behavior of the dot ".".
How can I rewrite my regex so that it still fulfills my needs but does not match the example given above?
Secondly I'm also interested in an explanation why it matches the example at all.
Using: .Net 3.5 SP1 if that is of interest.
The problem is that $ matches at the end of the string before the final newline character (if there is one). Unless you use RegexOptions.Multiline, $ means the same as \Z.
Use \z instead:
Regex.IsMatch("arbitrarytext a\n",@"\s\w\z")
will fail.
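To make the difference concrete, here is a minimal sketch that runs the question's input against $, \Z and \z. The class name EndAnchorDemo and the helper EndsWithSpaceAndWordChar are hypothetical names chosen for this illustration; only Regex.IsMatch comes from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchorDemo
{
    // Hypothetical helper: true only if the text ends with a whitespace
    // character followed by a word character at the very end (\z).
    public static bool EndsWithSpaceAndWordChar(string text)
    {
        return Regex.IsMatch(text, @"\s\w\z");
    }

    public static void Main()
    {
        string withNewline = "arbitrarytext a\n";
        string withoutNewline = "arbitrarytext a";

        // $ and \Z both accept a position just before a final \n.
        Console.WriteLine(Regex.IsMatch(withNewline, @"\s\w$"));      // True
        Console.WriteLine(Regex.IsMatch(withNewline, @"\s\w\Z"));     // True

        // \z accepts only the very end of the string.
        Console.WriteLine(EndsWithSpaceAndWordChar(withNewline));     // False
        Console.WriteLine(EndsWithSpaceAndWordChar(withoutNewline));  // True
    }
}
```

If file contents are read with something like File.ReadAllText, a trailing newline is part of the string, so \z is the anchor that distinguishes the two cases.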
See also this tutorial about anchors, specifically the section "Strings Ending with a Line Break".
A short overview:
Symbol  Means...                                      If multiline mode is...
------------------------------------------------------------------------------
^       Start of string                               off (default*)
^       Start of current line                         on
\A      Start of string                               irrelevant
$       End of string, before final newline, if any   off
$       End of current line, before newline, if any   on
\Z      End of string, before final newline, if any   irrelevant
\z      End of string                                 irrelevant
*: In Ruby, multiline mode is always on. Use \A or \Z to get ^ or $ behavior.
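The rows of the overview can be checked with a short .NET sketch. The sample text "one\ntwo\n" and the names AnchorTableDemo and Matches are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTableDemo
{
    // Made-up sample: two lines, the second one ends with a final newline.
    public const string Sample = "one\ntwo\n";

    // Hypothetical helper: runs one pattern against the sample text.
    public static bool Matches(string pattern, RegexOptions options)
    {
        return Regex.IsMatch(Sample, pattern, options);
    }

    public static void Main()
    {
        // ^ is start of string by default, start of any line with Multiline.
        Console.WriteLine(Matches("^two", RegexOptions.None));       // False
        Console.WriteLine(Matches("^two", RegexOptions.Multiline));  // True

        // $ only looks at the end of the string by default.
        Console.WriteLine(Matches("one$", RegexOptions.None));       // False
        Console.WriteLine(Matches("one$", RegexOptions.Multiline));  // True
        Console.WriteLine(Matches("two$", RegexOptions.None));       // True

        // \A, \Z and \z ignore the Multiline option.
        Console.WriteLine(Matches(@"\Atwo", RegexOptions.Multiline)); // False
        Console.WriteLine(Matches(@"two\Z", RegexOptions.Multiline)); // True
        Console.WriteLine(Matches(@"two\z", RegexOptions.Multiline)); // False
    }
}
```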
Related
In .NET's System.Text.RegularExpressions.Regex, if ^ and $ are added to a pattern to require an exact match, IsMatch still returns true when a terminating \n is appended to the string being verified.
For example, the following code:
Regex regexExact = new Regex(@"^abc$");
Console.WriteLine(regexExact.IsMatch("abc"));
Console.WriteLine(regexExact.IsMatch("abcdefg"));
Console.WriteLine(regexExact.IsMatch("abc\n"));
Console.WriteLine(regexExact.IsMatch("abc\n\n"));
returns:
True
False
True
False
What is the Regex that will return false for all of the above except the first?
Solution for the current .NET regex
You should use the very end of string anchor that is \z in .NET regex:
Regex regexExact = new Regex(@"^abc\z");
See Anchors in Regular Expressions:
$ The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.
\Z The match must occur at the end of the string, or before \n at the end of the string. For more information, see End of String or Before Ending Newline.
\z The match must occur at the end of the string only. For more information, see End of String Only.
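As a quick check of the three anchor descriptions, the following sketch runs the four inputs from the question through both ^abc$ and ^abc\z. The names ExactMatchDemo and IsExactlyAbc are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExactMatchDemo
{
    // Hypothetical helper: true only if the input is exactly "abc".
    public static bool IsExactlyAbc(string input)
    {
        return Regex.IsMatch(input, @"^abc\z");
    }

    public static void Main()
    {
        string[] inputs = { "abc", "abcdefg", "abc\n", "abc\n\n" };
        foreach (string input in inputs)
        {
            bool withDollar = Regex.IsMatch(input, @"^abc$");
            bool withSmallZ = IsExactlyAbc(input);
            // Escape the newlines so each input prints on a single line.
            Console.WriteLine("{0,-10} $: {1,-5} \\z: {2}",
                input.Replace("\n", "\\n"), withDollar, withSmallZ);
        }
    }
}
```

With \z, only the first input is accepted. With $, the third input ("abc\n") is accepted as well.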
The same anchor can be used in .NET, Java, PCRE, Delphi, Ruby and PHP. In Python, use \Z. In JavaScript RegExp (ECMAScript) compatible patterns, the $ anchor matches the very end of the string (if no /m modifier is defined).
Background
See Strings Ending with a Line Break at regular-expressions.info:
Because Perl returns a string with a newline at the end when reading a line from a file, Perl's regex engine matches $ at the position before the line break at the end of the string even when multi-line mode is turned off. Perl also matches $ at the very end of the string, regardless of whether that character is a line break. So ^\d+$ matches 123 whether the subject string is 123 or 123\n.
Most modern regex flavors have copied this behavior. That includes .NET, Java, PCRE, Delphi, PHP, and Python. This behavior is independent of any settings such as "multi-line mode".
In all these flavors except Python, \Z also matches before the final line break. If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A\d+\z does not match 123\n. \z matches after the line break, which is not matched by the shorthand character class.
In Python, \Z matches only at the very end of the string. Python does not support \z.
What is the difference between "\\w+@\\w+[.]\\w+" and "^\\w+@\\w+[.]\\w+$"? I have tried to google for it but no luck.
^ means "Match the start of the string" (more exactly, the position before the first character in the string, so it does not match an actual character).
$ means "Match the end of the string" (the position after the last character in the string).
Both are called anchors and ensure that the entire string is matched instead of just a substring.
So in your example, the first regex will report a match on email@address.com.uk, but the matched text will be email@address.com, probably not what you expected. The second regex will simply fail.
Be careful, as some regex implementations implicitly anchor the regex at the start/end of the string (for example Java's .matches(), if you're using that).
If the multiline option is set (using the (?m) flag, for example, or by doing Pattern.compile("^\\w+@\\w+[.]\\w+$", Pattern.MULTILINE)), then ^ and $ also match at the start and end of a line.
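The question above is about Java, but the effect of the anchors can be sketched with the .NET Regex class used elsewhere in this collection; for these particular patterns the behavior should be the same. The class and method names (EmailAnchorDemo, FirstMatch, MatchesWhole) and the three-line sample text are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailAnchorDemo
{
    private const string Unanchored = @"\w+@\w+[.]\w+";
    private const string Anchored = @"^\w+@\w+[.]\w+$";

    // Hypothetical helper: text matched by the unanchored pattern ("" if none).
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, Unanchored).Value;
    }

    // Hypothetical helper: does the anchored pattern match under these options?
    public static bool MatchesWhole(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, Anchored, options);
    }

    public static void Main()
    {
        // Unanchored: a substring is enough, so only part of the input is matched.
        Console.WriteLine(FirstMatch("email@address.com.uk"));                        // email@address.com

        // Anchored: the whole string has to fit the pattern.
        Console.WriteLine(MatchesWhole("email@address.com.uk", RegexOptions.None));   // False
        Console.WriteLine(MatchesWhole("email@address.com", RegexOptions.None));      // True

        // With Multiline, ^ and $ also match at the start and end of each line.
        string text = "first line\nemail@address.com\nlast line";
        Console.WriteLine(MatchesWhole(text, RegexOptions.None));                     // False
        Console.WriteLine(MatchesWhole(text, RegexOptions.Multiline));                // True
    }
}
```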
Try the Javadoc:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
^ and $ match the beginnings/endings of a line (without consuming them)
I want to find out whether a word ends with 's, 'm or 're using a regular expression in C#.
if (Regex.IsMatch(word, "/$'s|$'re|$'m/"))
textbox1.text=word;
The /$'s|$'re|$'m/ .NET regex matches 3 alternatives:
/$'s - / at the end of a string after which 's should follow (this will never match as there can be no text after the end of a string)
$'re - end of string and then 're must follow (again, will never match)
$'m/ - end of string with 'm/ to follow (again, will never match).
In a .NET regex, regex delimiters are not used, thus the first and last / are treated as literal chars that the engine tries to match.
The $ anchor signals the end of a string, and putting anything after it makes the pattern match no string (well, unless you have a trailing \n after it, but that is an edge case that rarely causes any trouble). Just FYI: to match the very end of string in a .NET regex, use \z.
What you attempted to write was
Regex.IsMatch(word, "'(?:s|re|m)$")
Or, if you put single character alternatives into a single character class:
Regex.IsMatch(word, "'(?:re|[sm])$")
See the regex demo.
Details
' - a single quote
(?: - start of a non-capturing group:
re - the re substring
| - or
[sm] - a character class matching s or m
) - end of the non-capturing group
$ - end of string.
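A small sketch of the corrected check, next to the original pattern for comparison. ContractionDemo and EndsWithContraction are hypothetical names, and the sample words are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContractionDemo
{
    // Hypothetical helper: true if the word ends with 's, 'm or 're.
    public static bool EndsWithContraction(string word)
    {
        return Regex.IsMatch(word, "'(?:re|[sm])$");
    }

    public static void Main()
    {
        foreach (string word in new[] { "it's", "I'm", "they're", "its", "'send" })
        {
            Console.WriteLine("{0}: {1}", word, EndsWithContraction(word));
        }

        // The original pattern treats the slashes as literal characters and
        // asks for text after the end of the string, so it does not match.
        Console.WriteLine(Regex.IsMatch("it's", "/$'s|$'re|$'m/")); // False
    }
}
```

The first three words are reported as matches and the last two are not. Replacing $ with \z would additionally reject a word that is followed by a trailing newline.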
Here is my test regex with options IgnoreCase and Singleline :
^\s*((?<test1>[-]?\d{0,10}.\d{3})(?<test2>\d)?(?<test3>\d)?){1,}$
and input data:
24426990.568 128364695.70706 -1288.460
If I omit ^ (match start of line) and $ (match end of line)
\s*((?<test1>[-]?\d{0,10}.\d{3})(?<test2>\d)?(?<test3>\d)?){1,}
then everything works perfectly.
Why doesn't it work with the string start/end markers (^/$)?
Thanks in advance.
When Multiline is off, ^ and $ are literally the start and end of the input string (Singleline only changes what . matches).
They only mean the start of the line and the end of the line in multiline mode.
Please note that this means the pattern has to match the entire input string.
So if you use:
24426990.568 128364695.70706 -1288.460
as your input string, then the beginning is the first white space and the end of the string will be the 0
With ^ and $ the pattern has to account for every character of the input string, but the repeated group has no way to consume the spaces between the three values (\s* appears only once, before the group), so the match fails. Without the anchors the regex is free to match just part of the input.
You have two options:
Remove the ^ and $
Change the pattern so that the repeated group also matches the whitespace between the values
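One possible reading of the second option is to let the repeated group also consume the whitespace that separates the values. The sketch below is an assumption about what the asker wants, not the asker's own code: it moves a \s* inside the group and escapes the dot so that it only matches a literal decimal point. The names AnchoredNumbersDemo and ParseValues are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredNumbersDemo
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Pattern from the question, anchors included.
    public const string Original =
        @"^\s*((?<test1>[-]?\d{0,10}.\d{3})(?<test2>\d)?(?<test3>\d)?){1,}$";

    // Assumed fix: escaped dot, and \s* inside the repeated group.
    public const string Adjusted =
        @"^\s*(?:(?<test1>-?\d{0,10}\.\d{3})(?<test2>\d)?(?<test3>\d)?\s*)+$";

    // Hypothetical helper: every test1 capture, or an empty array if no match.
    public static string[] ParseValues(string input)
    {
        Match match = Regex.Match(input, Adjusted, Options);
        if (!match.Success)
        {
            return new string[0];
        }
        CaptureCollection captures = match.Groups["test1"].Captures;
        string[] values = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            values[i] = captures[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string input = "24426990.568 128364695.70706 -1288.460";
        Console.WriteLine(Regex.IsMatch(input, Original, Options)); // False
        foreach (string value in ParseValues(input))
        {
            Console.WriteLine(value); // one test1 capture per value in the input
        }
    }
}
```

Because the group repeats, each named group keeps only its last capture in Value; the Captures collection is what exposes all three test1 values.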
I have noticed the following:
var b1 = Regex.IsMatch("Line1\nLine2", "Line1$", RegexOptions.Multiline); // true
var b2 = Regex.IsMatch("Line1\r\nLine2", "Line1$", RegexOptions.Multiline); // false
I'm confused. The documentation of RegexOptions says:
Multiline:
Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
Since C# and VB.NET are mainly used in the Windows world, I would guess that most files processed by .NET applications use CRLF linebreaks (\r\n) rather than LF linebreaks (\n). Still, it seems that the .NET regular expression parser does not recognize a CRLF linebreak as an end of line.
I know that I could work around this, for example, by matching Line1\r?$, but it still strikes me as strange. Is this really the intended behaviour of the .NET regexp parser or did I miss some hidden UseWindowsLinebreaks option?
From MSDN:
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline
So I can't say why (compatibility with regular expressions from other languages?), but at the very least it's intended.
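The \r?$ subexpression from the documentation can be wrapped in a small helper. This is only a sketch: CrlfDemo and HasLine are hypothetical names, and Regex.Escape is used so that the line text is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrlfDemo
{
    // Hypothetical helper: true if the text contains a line equal to `line`,
    // whether lines end with \n or with \r\n.
    public static bool HasLine(string text, string line)
    {
        string pattern = "^" + Regex.Escape(line) + @"\r?$";
        return Regex.IsMatch(text, pattern, RegexOptions.Multiline);
    }

    public static void Main()
    {
        string crlf = "Line1\r\nLine2";
        string lf = "Line1\nLine2";

        // Plain $ stops before \n only, so the \r gets in the way.
        Console.WriteLine(Regex.IsMatch(crlf, "Line1$", RegexOptions.Multiline)); // False

        // \r?$ tolerates an optional carriage return before the line end.
        Console.WriteLine(HasLine(crlf, "Line1")); // True
        Console.WriteLine(HasLine(lf, "Line1"));   // True
        Console.WriteLine(HasLine(crlf, "Line"));  // False

        // The dot matches \r, so ^.*$ keeps the carriage return in the match.
        string firstLine = Regex.Match(crlf, "^.*$", RegexOptions.Multiline).Value;
        Console.WriteLine(firstLine.Length); // 6 ("Line1" plus \r)
    }
}
```

The last call shows a related pitfall: when capturing whole lines from CRLF text, the captured value carries a trailing \r unless it is trimmed or excluded by the pattern.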