Star vs. plus quantifier in the variable-width negative lookbehind - c#

Silly question here... I'm trying to match white-space inside the line while ignoring the leading spaces/tabs and came up with these regex strings, but I can't figure out why only one is working (C# regex engine):
(?<!^[ \t]*)[ \t]+ // regex 1. (with *)
(?<!^[ \t]+)[ \t]+ // regex 2. (with +)
Note the star vs. plus repetitions in the negative lookbehind. When matching these against "  word1 word2" (2 leading spaces):
⎵⎵word1⎵word2
       ^       // 1 match for regex 1. (*)
⎵⎵word1⎵word2
^^     ^       // 2 matches for regex 2. (+)
 ^     ^       // why not match like this?
Why does only version 1. (star) work here and version 2. (plus) not match the second leading space?
I presume that it's because the greedy + in [ \t]+ takes priority over the lookbehind's quantifier, but how can I reason about this so that I would expect it?

In short:
The negative lookbehind only checks that the current position is not preceded by the lookbehind pattern. The result of the check is either true (go on matching) or false (abandon this position and try the next one). The check does not move the regex index: the engine stays at the same location after performing it.
In both expressions the lookbehind is checked first (the pattern is processed from left to right), and only if the check returns true is the [ \t]+ pattern tried. At the start of the string, the negative lookbehind in the first expression returns false, because the lookbehind pattern finds a match (the start of string followed by zero spaces/tabs). In the second expression it returns true, because the start of the string is not preceded by a start of string followed by 1 or more spaces/tabs.
Here is the logic behind the 2 expressions:
The lookbehind check is performed first. In the first expression, (?<!^[ \t]*) is tested at the beginning of the string. That position is preceded by the start of string (^) followed by zero spaces or tabs, so the lookbehind pattern matches and the negative lookbehind returns false (note we are still at the beginning of the string). It is important to note that .NET checks a lookbehind pattern in the opposite direction, right to left from the current position: it first looks for zero or more spaces/tabs and then for the start of the string. The lookbehind in the second expression, (?<!^[ \t]+), returns true at index 0, because there is no space or tab before that position, and thus the consuming [ \t]+ pattern grabs all of the leading horizontal whitespace in one match. The lookbehind is checked only once, at the position where a match attempt starts, so it is never re-tested at the second leading space. The match moves the regex index past the leading whitespace, and another match is found later in the string.
After failing at the beginning of the string, the first expression is tried at the position after the first space. There, (?<!^[ \t]*) returns false again, because that position is preceded by the start of string followed by 1 space. The same story repeats at the position after the second space. The only whitespace matched by the first expression, (?<!^[ \t]*)[ \t]+, is therefore whitespace that is not at the beginning of the string.
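The walk-through above can be checked with a minimal C# sketch (top-level statements, so C# 9 or later is assumed). The Spans helper is hypothetical, added here only to print each match as index:length; the input is the question's string with two leading spaces.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the question): lists every match as "index:length".
string Spans(string pattern, string text) =>
    string.Join(" ", Regex.Matches(text, pattern)
                          .Cast<Match>()
                          .Select(m => $"{m.Index}:{m.Length}"));

const string input = "  word1 word2"; // two leading spaces

// Regex 1 (*): the lookbehind rejects every position inside the leading whitespace.
Console.WriteLine(Spans(@"(?<!^[ \t]*)[ \t]+", input)); // 7:1

// Regex 2 (+): the lookbehind passes at index 0, then [ \t]+ consumes both spaces.
Console.WriteLine(Spans(@"(?<!^[ \t]+)[ \t]+", input)); // 0:2 7:1
```

Regex 2 reports the leading whitespace as a single match of length 2, which is why the second leading space never gets a match of its own.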
Lookahead analogy
Check the analogous lookahead patterns: a [ \t]+(?![ \t]+$) pattern will find both whitespace chunks in "bb bb ", while [ \t]+(?![ \t]*$) will not match the one at the end of the string. The same logic applies. The * version allows matching an empty string, so after the trailing whitespace is consumed, the lookahead pattern finds the end of string, the negative lookahead returns false, and the match fails. When the + version encounters and consumes the trailing whitespace, the regex engine, staying at the end of the string, cannot find 1 or more spaces/tabs followed by the end of string; thus, the negative lookahead returns true and the trailing whitespace is matched.
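A minimal C# sketch of the lookahead analogy (top-level statements, C# 9 or later assumed). The Spans helper is hypothetical and only formats matches as index:length; the input has one space in the middle and one trailing space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: lists every match as "index:length".
string Spans(string pattern, string text) =>
    string.Join(" ", Regex.Matches(text, pattern)
                          .Cast<Match>()
                          .Select(m => $"{m.Index}:{m.Length}"));

const string input = "bb bb "; // one inner space, one trailing space

// + version: at the end of the string there is no whitespace left to look at,
// so the negative lookahead succeeds and the trailing space is matched too.
Console.WriteLine(Spans(@"[ \t]+(?![ \t]+$)", input)); // 2:1 5:1

// * version: the lookahead pattern matches the empty string before the end,
// so the negative lookahead fails for the trailing space.
Console.WriteLine(Spans(@"[ \t]+(?![ \t]*$)", input)); // 2:1
```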

Related

Regex match a string that is not part of a larger word [duplicate]

I'm stumped on how to even go about this.
I am trying to match the string "ashi", but not if the word containing it is in a small list of known false positives like "flashing", "lashing", "smashing". The false positive words may appear in the string as well: as long as "ashi" also occurs in the string outside of a false positive word, it should return true.
I'm using C# and I was trying to go about it without using regular expressions, but I am having no luck.
These strings should return true
...somethingashisomething...
...something2!ashi*&something...
... something ashi something flashing...
These strings should return false
...somethingflashingsomething...
...smashingthesomething...
...the lashings are too tight...
Another option might be to use a negative lookbehind with a nested lookahead: it asserts that ashi is not preceded by a word-initial fl that is itself followed by ashing, so ashi is matched, but not inside flashing.
(?<!\bfl(?=ashing\b))ashi
Explanation
(?<! Negative lookbehind, assert what is directly on the left is not
\bfl Word boundary, match fl
(?= Positive lookahead, assert what is directly on the right is
ashing\b Match ashing and word boundary
) Close positive lookahead
) Close negative lookbehind
ashi Match literally
Update
If you also want to exclude the other words from the updated question, you could use an alternation (?:sm|f?l) in the negative lookbehind, which matches sm, or l optionally preceded by f:
(?<!(?:sm|f?l)(?=ashing))ashi
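As a quick check, here is a minimal C# sketch (top-level statements, C# 9 or later assumed) that runs the updated pattern against the six sample strings from the question. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Updated pattern from the answer: "ashi" not preceded by sm, l or fl
// when that prefix is directly followed by "ashing".
var ashi = new Regex(@"(?<!(?:sm|f?l)(?=ashing))ashi");

string[] samples =
{
    "...somethingashisomething...",               // expected: True
    "...something2!ashi*&something...",           // expected: True
    "... something ashi something flashing...",   // expected: True
    "...somethingflashingsomething...",           // expected: False
    "...smashingthesomething...",                 // expected: False
    "...the lashings are too tight...",           // expected: False
};

foreach (var s in samples)
    Console.WriteLine($"{ashi.IsMatch(s),-5} {s}");
```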
You can make use of a capturing group:
(flashing)|ashi
If the first group is not empty, you matched flashing literally
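One way to turn the capturing-group idea into a true/false answer is sketched below in C# (top-level statements, C# 9 or later assumed). HasAshi is a hypothetical helper name; it reports true when at least one match did not go through group 1, i.e. when ashi was found outside the literal flashing. Only flashing is handled here, exactly as in the pattern above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: true if some match is a bare "ashi" (group 1 did not participate).
bool HasAshi(string text) =>
    Regex.Matches(text, "(flashing)|ashi")
         .Cast<Match>()
         .Any(m => !m.Groups[1].Success);

Console.WriteLine(HasAshi("... something ashi something flashing...")); // True
Console.WriteLine(HasAshi("...somethingflashingsomething..."));         // False
```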
The following will match ashi but not within flashing. I interpreted "word" loosely, so flashing is not required to be isolated as a separate word with space/punctuation delimiters.
(?<=(?<prefix>fl)|)ashi(?(prefix)(?!ng))
It is sufficient to return true/false over the entire pattern and won't require checking specific capture groups. In other words, it is usable with Regex.IsMatch().
Pattern details:
(?<= # Zero-width positive lookbehind: match but don't consume characters
(?<prefix>fl) # Named capture group to match "fl" at start of "flashing"
| # Alternate blank capture - will succeed if "fl" is not present
) # End lookbehind
ashi # match literal "ashi"
(?(prefix) # Conditional: Only match if named group prefix has successful capture (i.e. "fl" was matched)
(?!ng) # Zero-width negative lookahead: Fail match if "ng" follows
) # Close conditional (there is no false part, so match succeeds if "fl" was not present)
If flashing is only excluded as an isolated word, just add word boundary operators. This will match something like flashingwithnospace, whereas the first pattern would fail on that string:
(?<=(?<prefix>\bfl)|)ashi(?(prefix)(?!ng\b))
(FYI, the pattern will work in isolation, but if it is combined within another pattern, especially inside a repeating construction, it may not work due to the conditional on the named capture group. Once the named capture group has succeeded, the conditional will remain true while matching the larger pattern, even if it were to encounter another occurrence of ashi.)
The question gives the examples
...somethingashisomething...
...something2!ashi*&something...
... something ashi something...
The second and third examples can be found by including the word boundary \b in the search, i.e. search for \bashi\b. Finding the first example requires more knowledge of what the two enclosing somethings are. If they are alphanumeric then you need to specify the problem in much more detail.

Regular expression to match following criterias [duplicate]

I am using the following regular expression without restricting any character length:
var test = /^(a-z|A-Z|0-9)*[^$%^&*;:,<>?()\""\']*$/ // Works fine
In the above when I am trying to restrict the characters length to 15 as below, it throws an error.
var test = /^(a-z|A-Z|0-9)*[^$%^&*;:,<>?()\""\']*${1,15}/ // Uncaught SyntaxError: Invalid regular expression
How can I make the above regular expression work with the characters limit to 15?
You cannot apply quantifiers to anchors. Instead, to restrict the length of the input string, use a lookahead anchored at the beginning:
// ECMAScript (JavaScript, C++)
^(?=.{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$
 ^^^^^^^^^^^^
// Or, in flavors other than ECMAScript and Python
\A(?=.{1,15}\z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\z
  ^^^^^^^^^^^^^
// Or, in Python
\A(?=.{1,15}\Z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\Z
  ^^^^^^^^^^^^^
Also, I assume you wanted to match 0 or more letters or digits with (a-z|A-Z|0-9)*. It should look like [a-zA-Z0-9]* (i.e. use a character class here).
Why not use a limiting quantifier, like {1,15}, at the end?
Quantifiers apply only to the subpattern immediately to their left, be it a group, a character class, or a literal symbol. Thus, ^[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']{1,15}$ only restricts the second character class, [^$%^&*;:,<>?()\"'], to 1 to 15 characters. ^(?:[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*){1,15}$ "restricts" a sequence of 2 subpatterns of unlimited length (since * and + can match an unlimited number of characters) to 1 to 15 repetitions, so the length of the whole input string is still not restricted.
How does the lookahead restriction work?
The (?=.{1,15}$) / (?=.{1,15}\z) / (?=.{1,15}\Z) positive lookahead appears right after the start-of-string anchor, ^ or \A (note that in Ruby, \A is the only anchor that matches only at the start of the whole string). It is a zero-width assertion that only returns true or false after checking whether its subpattern matches the subsequent characters. So, this lookahead tries to match 1 to 15 characters other than a newline (due to the limiting quantifier {1,15}) followed immediately by the end of the string (due to the $ / \z / \Z anchor). If we remove the anchor from the lookahead, the lookahead only requires the string to start with 1 to 15 such characters, and the total string length can be anything.
If the input string can contain a newline sequence, you should use the portable [\s\S] any-character construct (it works in JS and other common regex flavors):
// ECMAScript (JavaScript, C++)
^(?=[\s\S]{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$
 ^^^^^^^^^^^^^^^^^
// Or, in flavors other than ECMAScript and Python
\A(?=[\s\S]{1,15}\z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\z
  ^^^^^^^^^^^^^^^^^^
// Or, in Python
\A(?=[\s\S]{1,15}\Z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\Z
  ^^^^^^^^^^^^^^^^^^
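The difference between a trailing quantifier and the anchored lookahead can be seen in a minimal sketch. The question uses JavaScript; this sketch uses C# (top-level statements, C# 9 or later assumed) with the \A ... \z form listed above, and the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Inside a C# verbatim string, "" stands for one double quote.
// Lookahead version: the whole string must be 1 to 15 characters long.
var limited = new Regex(@"\A(?=.{1,15}\z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\""']*\z");

// Trailing quantifier: only the second character class is limited.
var tailOnly = new Regex(@"^[a-zA-Z0-9]*[^$%^&*;:,<>?()\""']{1,15}$");

string twenty = new string('a', 20); // 20 letters, longer than the limit

Console.WriteLine(limited.IsMatch("hello world")); // True  (11 characters)
Console.WriteLine(limited.IsMatch(twenty));        // False (too long)
Console.WriteLine(tailOnly.IsMatch(twenty));       // True  (length not restricted)
Console.WriteLine(limited.IsMatch(""));            // False (at least 1 character required)
```

In the tailOnly case the first class gives back a single character so that {1,15} can match it, and the overall length stays unchecked.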

Regex -Check if string start with zero + space or zero alone

I have this regular expression to match:
String start with a zero + white space + anything else
String is a zero
"0 fkvjdm" // Must Match
"0" // Must match
"0.56" // NOT match
Here is the regular expression I'm using:
^([0]$|([0]\s+.))
Is there a way to improve it? Or does it have a bug?
Thanks a lot for your help.
Environment
VS 2010 .net 4
First of all, there is no need to put 0 in a character class.
Secondly, your regex does not consume more than a single character after the whitespace, because the dot (.) in the 2nd part of your regex has no quantifier. To match more characters after the whitespace, use .* (0 or more) or .+ (1 or more).
To improve clarity, you can make the second part optional:
^0(\s+.*)?$
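A minimal C# sketch comparing the two patterns on the question's samples (top-level statements, C# 9 or later assumed; the variable names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

var original = new Regex(@"^([0]$|([0]\s+.))"); // pattern from the question
var improved = new Regex(@"^0(\s+.*)?$");       // pattern suggested above

// The original consumes only one character after the whitespace.
Console.WriteLine(original.Match("0 fkvjdm").Value); // 0 f
Console.WriteLine(improved.Match("0 fkvjdm").Value); // 0 fkvjdm

Console.WriteLine(improved.IsMatch("0"));    // True
Console.WriteLine(improved.IsMatch("0.56")); // False
```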
It seems that the second character is what decides the match: if the second character is a period, don't match; otherwise match. The negative lookahead (?!...) fails the match if its subpattern matches at the current position. Hence, if the second character is a period, the whole match fails.
^0(?!\.).*

Regex for special case

I need to create a regex expression for the following scenario.
It can have only numbers and only one dot or comma.
First part can have one to three digits.
The second part can be a dot or a comma.
The third part can have one to two digits.
The valid scenarios are
123,12
123.12
123,1
123
12,12
12.12
1,12
1.12
1,1
1.1
1
I came up so far with this expression
\d{1,3}(?:[.,]\d{1,2})?
but it doesn't work well. For example, the input 11:11 is marked as valid.
You need to put anchors around your expression:
^\d{1,3}(?:[.,]\d{1,2})?$
^ will match the start of the string
$ will match the end of the string
If those anchors are missing, the regex can match part of your string. Since the last part is optional, on "11:11" it matches the digits before the colon, and a second match is found on the digits after the colon.
Try to use ^ and $:
^\d{1,3}(?:[.,]\d{1,2})?$
^ The match must start at the beginning of the string or line.
$ The match must occur at the end of the string or before \n at the end of the line or string.
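The effect of the anchors can be sketched in C# as follows (top-level statements, C# 9 or later assumed; loose and strict are illustrative names):

```csharp
using System;
using System.Text.RegularExpressions;

var loose  = new Regex(@"\d{1,3}(?:[.,]\d{1,2})?");   // no anchors
var strict = new Regex(@"^\d{1,3}(?:[.,]\d{1,2})?$"); // anchored at both ends

// Without anchors, "11:11" yields two partial matches: "11" and "11".
Console.WriteLine(loose.IsMatch("11:11"));       // True
Console.WriteLine(loose.Matches("11:11").Count); // 2

// With anchors, the whole input has to fit the pattern.
Console.WriteLine(strict.IsMatch("11:11"));  // False
Console.WriteLine(strict.IsMatch("123,12")); // True
Console.WriteLine(strict.IsMatch("1"));      // True
Console.WriteLine(strict.IsMatch("1234"));   // False
```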

Regex.Replace without line start and end terminators has some very strange effects.... What is going on here?

While answering this question C# Regex Replace and * the point was raised as to why the problem exists. When playing I produced the following code:
string s = Regex.Replace(".A.", @"\w*", "B");
Console.Write(s);
This has the output: B.BB.B
I get that the zero-length string is matched before and after the . character, but why is A replaced by 2 Bs?
I could understand B.BBB.B as replacing zero-length strings either side of A or B.B.B
But the actual result confuses me - any help appreciated.
Or as AakashM has put it:
Why is Regex.Matches("A", @"\w*").Count equal to 2, not 1 or 3?
There is a star after \w
It means "zero or many" so that means:
First symbol is a dot, it is NOT \w so there is zero \w here, replace by B
Next we have a dot itself, which is not replaceable
A gets replaced by B
zero \w before the next dot, replace by B
dot, not replaceable
Line end, zero \w so replace by B again.
Expression \w{0,} will have the same effect.
If you want to avoid it, use 'plus' which means 'at least one': \w+
That's the same behaviour as
Regex.Replace("", @"\w*", "B") results in B
Regex.Replace("A", @"\w*", "B") results in BB
For the string ".A." \w* matches before the first dot the empty string, then on the "A", after the "A" the empty string and after the last dot the empty string.
Explanation
You can think of the pattern eating the characters, \w* has eaten the "A", the next char is a dot, so this match is complete and replaced. But the start position for the pattern to continue matching is still between the A and the dot. The dot can not be matched, so it matches the empty string before the dot, but then this position is done and the next start position is after the dot.
Because \w* is greedy, it tries to match the longest possible sequence. So it matches "nothing" before the first dot, then "A" between the two dots, then "nothing" before the second dot, and finally "nothing" after the second dot.
By default, quantifiers are greedy, so \w* matches as many characters as it can. That is why you get that result.
If you do it the reluctant (lazy) way, like this
string s = Regex.Replace(".A.", "\\w*?", "B");
You will get this result, because it finds the shortest possible matches:
B.BAB.B
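To make the individual matches visible, here is a minimal C# sketch (top-level statements, C# 9 or later assumed). The Spans helper is hypothetical and prints each match as index:length, so a zero-length match shows up with length 0.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: lists every match as "index:length".
string Spans(string pattern, string text) =>
    string.Join(" ", Regex.Matches(text, pattern)
                          .Cast<Match>()
                          .Select(m => $"{m.Index}:{m.Length}"));

// Greedy: empty match before the first dot, "A", empty match right after "A",
// and an empty match at the end of the string.
Console.WriteLine(Spans(@"\w*", ".A."));                // 0:0 1:1 2:0 3:0
Console.WriteLine(Regex.Replace(".A.", @"\w*", "B"));   // B.BB.B

// Lazy: only empty matches, so "A" itself is never replaced.
Console.WriteLine(Spans(@"\w*?", ".A."));               // 0:0 1:0 2:0 3:0
Console.WriteLine(Regex.Replace(".A.", @"\w*?", "B"));  // B.BAB.B

Console.WriteLine(Regex.Matches("A", @"\w*").Count);    // 2
```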
