Regex performance issue - can anyone explain way this regex is slow - c#

I'm trying to implement the String.Contains() method with regex. I noticed that this pattern #".\*foo.\*" takes much longer then this #"\A.\*foo.\*\Z".
Can anyone explain why?

Because without the anchor the regex engine has to make more tries to match before it can conclude that a match is impossible. Consider an example with the anchor:
Regex: \A.*foo.*\Z
Input: 123456789abcdef
The regular expression sees the start of input anchor and takes that into account. It now tries to match the first .* pattern, and since it's greedy it attempts to match all the input. Then it tries to match foo and fails, so it releases f from the .* match and attempts again. It fails again, releases e from the .* match, attempts again, fails, etc.
The end result is that the number of attempts taken until the whole expression fails to match is linear to the length of the input.
Now consider the non-anchored case:
Regex: .*foo.*
Input: 123456789abcdef
This time the regex engine attempts to match from the start of the string, as above (making a linear to the length of the string amount of attempts). But when that fails, it begins the process again starting from the second character of the input.
That is, it attempts to match the first .* successively with:
123456789abcdef
123456789abcde
123456789abcd
...
1
(empty string due to the * quantifier)
Up till now this is the same as with the anchored regex. But while the anchor would cause matching to fail at this point, the non-anchored regex continues to try with
23456789abcdef
23456789abcde
23456789abcd
...
2
(empty string due to the * quantifier)
As you see, this time the number of attempts taken until the whole expression fails to match is quadratic to the length of the input.

\A and \Z means beginning and the end of the string. Therefore the regex is more limited and has less searching to do. For example if your text has newlines in it, the 2nd regex is way faster since it only searches the first new line where the 1st regex keeps searching

Related

How to match regular expression starting exactly at a given index?

With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
regular expression ab
input string: ababab
Now, I can search for matches for the regular expression (named expr in the following) in the input string, for instance, starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this.
As my string in which testing occurs might be ... "long" (think possibly 4 digit numbers of characters), I would, however, prefer to avoid the overhead that would presumably occur in all three cases one way or another:
#
Work-Around
Drawback
1
I could check the resulting match to see whether its Index property matches the supplied index.
Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached).
2
I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index.
As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.)
3
I could prepend "^.{#}" (with # being replaced with the character index to test) for each expression and match from the beginning, then fish out the actually interesting match with a capturing group.
I need to test the same regex on multiple possible start positions throughout my input string. As each time, the number of skipped characters changes, that would mean compiling a new regex every time, rather than re-using the one that I have, which again feels somewhat unclean.
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful, as in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea about the expected length of a match in advance.
It appears you can use the \G operator, \Gab pattern will allow you to match at the second index and will fail at the first one, see this C# demo:
Regex expr = new Regex(#"\Gab");
Console.WriteLine(expr.Match("ababab", 1)?.Success); // => False
Regex expr2 = new Regex(#"\Gab");
Console.WriteLine(expr2.Match("ababab", 2)?.Success); // => True
As per the documentation, \G operator matches like this:
The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started."

How do I select all including sensitive case (regex) in c#?

I have a problem with a regex command,
I have a file with a tons of lines and with a lot of sensitive characters,
this is an Example with all sensitive case 0123456789/*-+.&é"'(-è_çà)=~#{[|`\^#]}²$*ù^%µ£¨¤,;:!?./§<>AZERTYUIOPMLKJHGFDSQWXCVBNazertyuiopmlkjhgfdsqwxcvbn
I tried many regex commands but never get the expected result,
I have to select everything from Example to the end
I tried this command on https://www.regextester.com/ :
\sExample(.*?)+
Image of the result here
And when I tried it in C# the only result I get was : Example
I don't understand why --'
Here's a quick chat about greedy and pessimistic:
Here is test data:
Example word followed by another word and then more
Here are two regex:
Example.*word
Example.*?word
The first is greedy. Regex will match Example then it will take .* which consumes everything all the way to the END of the string and the works backwards spitting a character at a time back out, trying to make the match succeed. It will succeed when Example word followed by another word is matched, the .* having matched word followed by another (and the spaces at either end)
The second is pessimistic; it nibbled forwards along the string one character at a time, trying to match. Regex will match Example then it'll take one more character into the .*? wildcard, then check if it found word - which it did. So pessimistic matching will only find a single space and the full match in pessimistic mode is Example word
Because you say you want the whole string after Example I recommend use of a greedy quantifier so it just immediately takes the whole string that remains and declares a match, rather than nibbling forwards one at a time (slow)
This, then, will match (and capture) everything after Example:
\sExample(.*)
The brackets make a capture group. In c# we can name the group using ?<namehere> at the start of the brackets and then everything that .* matches can be retrieved with:
Regex r = new Regex("\sExample(?<x>.*)");
Match m = r.Match("Exampleblahblah");
Console.WriteLine(m.Groups["x"].Value); //prints: blahblah
Note that if your data contains newlines you should note that . doesn't match a newline, unless you enable RegexOptions.SingleLine when you create the regex

How to tell a RegEx to be greedy on an 'Or' Expression

Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Demo
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achive:
In our company, we're using a webservice to report our workingtime. I want to develop a desktop application to easily keep an eye on the worked time. I successfully downloaded the server's response (with all the data necessary) but unfortunately this date is in a quiet bad state to process it.
Therefor I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The above posted regular expression splits the days dataset after the first time set (so after 10:33). Therefor I want the regex to handle the Or-part "greedy" (if expression 1 (the larger one) is true, skip the second expression. If expression 1 is false, use the second one).
I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double Square brackets:
\[A\].*(?:\[\[\]\]|\[\])
You may use
\[A][\s\S]*?(?=\[A]|$)
See the regex demo.
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, #"(?!^)(?=\[A])")
See this .NET regex demo. The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If instead of A there can be any letter, replaces A with [A-Z] or [A-Z]+.

Regex only returns value when anchor is provided

I'm using the following pattern to match numbers in a string. I know there is only one number in a given string I'm trying to match.
var str = "Store # 100";
var regex = new Regex(#"[0-9]*");
When I call regex.Match.Value, this returns an empty string. However, if I change it to:
var regex = new Regex(#"[0-9]*$");
It does return the value I wanted. What gives?
Ok I figured it out.
The problem with [0-9]* or let's make it simpler: \d* is that * makes it optional so it will also result in zero-length match for every character before the '100'.
To rectify this you could use \d\d*, this will cause at least one mandatory digit before the rest and clear out zero-length matches.
Edit: The dollar version, e.g. \d*$ will only work if your number is at the end of the input string.
More information here!
Aaaaand One more link for yet even more info (what a time to be alive).
According to MSDN,
The quantifiers *, +, and {n,m} and their lazy counterparts never
repeat after an empty match when the minimum number of captures has
been found. This rule prevents quantifiers from entering infinite
loops on empty subexpression matches when the maximum number of
possible group captures is infinite or near infinite.
So, as the minimum number of captures is zero, the [0-9]* pattern returns so many NULLs. And [0-9]+ will capture 100 without any problems.

Matching a number preceeded by a know string, followed by an unknown number of characters

[SOME_WORDS:200:1000]
Trying to match just the last 1000 part. Both numbers are variable and can contain an unknown number of characters (although they are expected to contain digits, I cannot rule out that they may also contain other characters). The SOME_WORDS part is known and does not change.
So I begin by doing a positive lookbehind for [SOME_WORDS: followed by a positive lookahead for the trailing ]
That gives us the pattern (?<=\[SOME_WORDS:).*(?=])
And captures the part 200:1000
Now because I don't know how many characters are after SOME_WORDS:, but I know that it ends with another : I use .*: to indicate any character any amount of time followed by :
That gives us the pattern (?<=\[SOME_WORDS:.*:).*(?=])
However at this point the pattern no longer matches anything and this is where I become confused. What am I doing wrong here?
If I assume that the first number will always be 3 characters long I can replace .* with ... to get the pattern (?<=\[SOME_WORDS:...:).*(?=]) and this correctly captures just the 1000 part. However I don't understand why replacing ... with .* makes the pattern not capture anything.
EDIT:
It seems like the online tool I was using to test the regex pattern wasn't working correctly. The pattern (?<=\[SOME_WORDS:.*:).*(?=]) matches the 1000 with no issues when actually done in .net
You usually cannot use a + or a * in a lookbehind, only in a lookahead.
If c# does allow these than you could use a .*? instead of a .* as the .* will eat the second :
Try this:
(?<=\[SOME_WORDS:)(?=\d+:(\d+)])
The match wil be in the first capture group
Quote from http://www.regular-expressions.info/lookaround.html
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.
As Robert Smit mentions this is due to the * being a greedy operator. Greedy operators consume as many characters as they possibly can when they are matched first. They only give up characters if the match fails. If you make the greedy operator lazy(*?), then matching consumes as little number of characters as possible for the match to succeed, so the : is not consumed by *. You can also use [^:]* which is match any character other than :.

Categories

Resources