.NET RegExp engine search performance optimization

I have a List collection with around 35,000 strings.
A typical string looks like this:
"<i>füüs</i>ampri tähis;lüh ld-st<i>anno</i>, aastal;<i>maj</i> lüh pr-st<i>argent</i>, raha (kursisedelitel)"
Basically, each string contains a bunch of words in Estonian :)
I need to let the user perform a RegExp search on the 35,000 strings.
If I search using the /ab.*/ expression, the search takes less than a second.
If I search using the /.*ab/ expression, the search takes around 10 seconds.
My question is: how can I make the second search faster (less than 1.5 seconds)?
Thank you very much.

It’s how regular expressions are processed that makes them perform so differently. To explain that based on your examples:
/.*ab/   This expression consists of two sub-expressions, the .* and the literal ab. It is processed as follows: in the normal greedy mode, where every quantifier (and thus the match) is expanded to its maximum, the .* first matches the whole string. Then the engine tries to match the following ab. As that is not possible (we’re already at the end of the string), backtracking is used to find the last point where both sub-expressions match. So the match of .* is reduced by one character and ab is tested again. As long as the whole expression cannot be matched, the match of .* is reduced by one more character, until the whole expression matches. In the worst case there is no ab in the string, and the algorithm does n+1 backtracks and additional tests for ab before it can determine that a match is impossible at that starting position. Since the pattern is not anchored, a backtracking engine may then repeat all of this from each later starting position.
/ab.*/   This expression consists of two sub-expressions too. But here the order is changed: first the literal ab and then the .*. It is processed as follows: the algorithm first tries to find the literal ab by comparing one character after another. If there is a match, it tries to find a match for .*, which is obviously easy.
The main difference between those two regular expressions is that the second has the static part at the beginning and the variable part at the end. This makes backtracking unnecessary.
Try some regular expression analysis tool such as RegexBuddy to see the difference visually.

Use compiled regular expressions for better performance
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes—Compiling_Regular_Expressions
Copy and paste the full URL; the link does not seem to render correctly.

There are two possible modifications I can suggest for the slow .*ab expression.
I performed my tests with this test string "1234567890 ab 1234567890" using the benchmarking feature in Regex Hero.
A. 5X faster than original
^.*ab
RegexOptions.None
or
B. 8X faster than original
.*ab
RegexOptions.RightToLeft
Sometimes experimentation pays off. The RightToLeft option was my "ah ha!" moment. That essentially returns the same performance as your other ab.* expression by preventing the massive backtracking from ever occurring.
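
As a rough sketch of how the two variants above could be applied to a list of strings: the `entries` list and its contents are made up for illustration and stand in for the 35,000 strings from the question. No timings are claimed here, since they depend on the runtime version and the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the 35,000-entry list from the question.
var entries = new List<string>
{
    "1234567890 ab 1234567890",
    "no match in this one",
    "a cab"
};

// Variant A: anchor the pattern at the start of the string.
var anchored = new Regex("^.*ab", RegexOptions.None);

// Variant B: same pattern, but scanned from the end of the string.
var rightToLeft = new Regex(".*ab", RegexOptions.RightToLeft);

List<string> hitsA = entries.Where(s => anchored.IsMatch(s)).ToList();
List<string> hitsB = entries.Where(s => rightToLeft.IsMatch(s)).ToList();

// Both variants select the same entries; only the search strategy differs.
Console.WriteLine($"anchored: {hitsA.Count}, right-to-left: {hitsB.Count}");
Console.WriteLine(rightToLeft.Match(entries[0]).Value);
```

Building each Regex once and reusing it for every string avoids re-parsing the pattern per entry, which matters more as the list grows.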

I got this crazy idea: you could also store the strings in reverse order and search those with /ba.*/ when the user enters /.*ab/.

Your second expression will match 'ab' and all characters before it (except the newline). You can try searching for just /ab/, getting the index of the match, and building the result by concatenating the part of the string preceding the match with the match itself.
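
A minimal sketch of that idea, assuming single-line input and a pattern whose tail is known (here the literal ab). The helper name `TryPrefixThroughLastMatch` is hypothetical. It takes the last match of the tail, because a greedy .*ab ends at the last ab on the line; taking the first match instead would correspond to the lazy .*?ab.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: emulate ".*<tail>" by matching only the tail and
// rebuilding the prefix from the match index, so ".*" never has to backtrack.
bool TryPrefixThroughLastMatch(string input, Regex tail, out string result)
{
    MatchCollection matches = tail.Matches(input);
    if (matches.Count == 0)
    {
        result = "";
        return false;
    }

    // Greedy ".*" would stop at the last occurrence of the tail.
    Match last = matches[matches.Count - 1];
    result = input.Substring(0, last.Index) + last.Value;
    return true;
}

string sample = "1234567890 ab 12 ab 34";
bool found = TryPrefixThroughLastMatch(sample, new Regex("ab"), out string viaIndex);
string viaRegex = Regex.Match(sample, ".*ab").Value;

// Both approaches yield the same text for this single-line sample.
Console.WriteLine(found && viaIndex == viaRegex);
```

If only a yes/no answer is needed for filtering the list, checking whether the tail matches at all is already enough.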

Related

How to match regular expression starting exactly at a given index?

With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
regular expression ab
input string: ababab
Now, I can search for matches for the regular expression (named expr in the following) in the input string, for instance, starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this.
As my string in which testing occurs might be ... "long" (think possibly 4 digit numbers of characters), I would, however, prefer to avoid the overhead that would presumably occur in all three cases one way or another:
1. Work-around: I could check the resulting match to see whether its Index property matches the supplied index.
   Drawback: Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached).
2. Work-around: I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index.
   Drawback: As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.)
3. Work-around: I could prepend "^.{#}" (with # being replaced with the character index to test) for each expression and match from the beginning, then fish out the actually interesting match with a capturing group.
   Drawback: I need to test the same regex on multiple possible start positions throughout my input string. As the number of skipped characters changes each time, that would mean compiling a new regex every time, rather than re-using the one that I have, which again feels somewhat unclean.
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful, as in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea about the expected length of a match in advance.

It appears you can use the \G anchor: the \Gab pattern will match when started at index 2 and fail when started at index 1. See this C# demo:
Regex expr = new Regex(@"\Gab");
Console.WriteLine(expr.Match("ababab", 1).Success); // => False
Regex expr2 = new Regex(@"\Gab");
Console.WriteLine(expr2.Match("ababab", 2).Success); // => True
As per the documentation, the \G anchor matches like this:
"The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started."
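
Since the question asks about testing one regex at many start positions, here is a small sketch of how the \G approach might be wrapped for reuse. The `AnchorAtStart` helper is hypothetical, and wrapping the pattern in a non-capturing group is an assumption about how an arbitrary user-supplied pattern would be combined with \G.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wrap a pattern so that it can only match exactly at
// the position passed to Regex.Match(input, startat).
Regex AnchorAtStart(string pattern) => new Regex(@"\G(?:" + pattern + ")");

// Build the regex once and reuse it for every start position.
Regex expr = AnchorAtStart("ab");
string input = "ababab";

for (int index = 0; index < input.Length; index++)
{
    // No substring is copied and no new Regex is created per position.
    bool matchesHere = expr.Match(input, index).Success;
    Console.WriteLine($"{index}: {matchesHere}");
}
```

The non-capturing group keeps the original group numbering intact, so capture groups inside the wrapped pattern still have the numbers the pattern author expects.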

How to tell a RegEx to be greedy on an 'Or' Expression

Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Demo
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achieve:
In our company, we're using a web service to report our working time. I want to develop a desktop application to easily keep an eye on the time worked. I successfully downloaded the server's response (with all the necessary data), but unfortunately this data is in quite a bad state for processing.
Therefore I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The regular expression posted above splits the day's dataset after the first time set (so after 10:33). Therefore I want the regex to handle the Or-part "greedy" (if expression 1 (the larger one) is true, skip the second expression. If expression 1 is false, use the second one).

I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double square brackets:
\[A\].*(?:\[\[\]\]|\[\])

You may use
\[A][\s\S]*?(?=\[A]|$)
See the regex demo.
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, @"(?!^)(?=\[A])")
See this .NET regex demo. The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If instead of A there can be any letter, replace A with [A-Z] or [A-Z]+.
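
To illustrate the split approach, here is a short sketch run against the two sample lines from the question. Joining them with a newline is an assumption about how the real input is laid out.

```csharp
using System;
using System.Text.RegularExpressions;

// The sample text from the question, joined with a newline (an assumption).
string s = "[A]I'm an example text [] But I want to be included [[]]\n" +
           "[A]I'm another text without a second part []";

// Split before every "[A]" except the one at the very start of the string.
string[] parts = Regex.Split(s, @"(?!^)(?=\[A])");

foreach (string part in parts)
{
    // TrimEnd removes the newline that stays attached to the first part.
    Console.WriteLine(part.TrimEnd());
}
```

Because the split points are zero-width, no text is lost: each part keeps its leading [A] and everything up to the next one, including the [[]] section.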

Fastest regex for first occurence of a word

I would like my regex to capture the following kind of strings as two Urls with "%3f" inside them.
https://*****%3f****%3D,https://*****%3f****%3D …
Where each string URL of this type should be captured by itself. Note - The * is here for simplification and the URLS can be in any part of the big string with anything in between.
My regex now is:
(https://\S+?%3f)(?<toDelete>\S+?%3D)
But I've been asked to see if there's a non-lazy approach for this (or just a faster version), as lazy matching is much slower than greedy matching, and this regex will be called over huge strings and data flows.
Note that the reason I can't simply use \S* is that doing so will capture everything from the first http to the last %3D in one match.

You could probably split the string on commas and then take the substring up to the %3f value.
If you want to make the \S*? pattern work "faster" you must take into account what kind of context this part of a pattern should be aware of.
You are matching any char that is not a whitespace char, any amount of times, up to the first occurrence of %3f. That is, you want to match any chars other than % and whitespace or % chars that are not followed with 3f. That makes (?:[^\s%]|%(?!3f))*. However, alternation ruins the whole idea of optimization. You need to use the "unroll-the-loop" approach: [^%\s]*(?:%(?!3f)[^%\s]*)*.
So, the whole pattern will look like
https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f
Or with the Delete part:
(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)
For short strings, this last pattern might work a tiny bit slower than the \S+? based pattern, but it becomes much more efficient when the matched string becomes longer.
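
A brief sketch of the unrolled pattern in use. The input string and its host names are invented for illustration; the first URL carries an extra %20 escape to show that a % not followed by 3f is skipped rather than ending the match.

```csharp
using System;
using System.Text.RegularExpressions;

// The unrolled pattern from the answer, built once for reuse.
var urlPattern = new Regex(
    @"(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)");

// Hypothetical input: two URLs separated by a comma.
string input = "https://a.example/p%20q%3fkey%3D,https://b.example/z%3fother%3D";

MatchCollection matches = urlPattern.Matches(input);
foreach (Match m in matches)
{
    // Groups[1] is the part up to and including %3f; "toDelete" is the rest.
    string kept = m.Groups[1].Value;
    string toDelete = m.Groups["toDelete"].Value;
    Console.WriteLine($"{kept} | {toDelete}");
}
```

Each URL is captured on its own because the character classes cannot run past the %3f or %3D that ends their section.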

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code return a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.

You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option: stating explicitly that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference, but it's an important thing to keep in mind depending on what data/patterns you might be dealing with.
On a side note, I recommend using the C# verbatim string specifier @ for regular expression patterns, so that you do not need to double-escape things in regex patterns. So I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes it much easier to read regular expressions that are more complex.

Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.

.+ is a greedy match; it will match as much as possible.
In your second example, B.+ matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
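
The three patterns discussed in these answers can be compared side by side. This is a small sketch using the [BName] [BAddress] markup from the question; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string markup = "[BName] [BAddress]";

// Greedy: ".+" runs to the last "]" on the line, so there is a single match.
int greedy = Regex.Matches(markup, @"\[B.+\]").Count;

// Lazy: ".+?" stops at the first "]" that lets the pattern succeed.
int lazy = Regex.Matches(markup, @"\[B.+?\]").Count;

// Negated class: "[^\]]+" cannot cross a "]" at all.
MatchCollection tags = Regex.Matches(markup, @"\[B[^\]]+\]");

Console.WriteLine($"greedy={greedy}, lazy={lazy}, negated={tags.Count}");
foreach (Match tag in tags)
{
    Console.WriteLine(tag.Value);
}
```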

Improving/Fixing a Regex for C style block comments

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.
On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.
The Regex I'm using is this:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
Any suggestions on why this might get locked up?
Alternatively, what's another Regex I could use instead?
More information:
Working in C# 3.0 targeting .NET 3.5;
I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
I've left the program running for over an hour, but the match isn't completed;
Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
The regex works correctly for 452 of my 453 test files.

Some problems I see with your regex:
There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.
Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters as you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.
Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them for keeps.
For maximum performance, I would use this regex:
/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/
This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.
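
A quick sketch of that pattern in C#, assuming no RegexOptions are passed (none are needed, since [^*] already matches line breaks). The script fragment is invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The atomic-group pattern from the answer above, built once for reuse.
var blockComment = new Regex(@"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/");

// Hypothetical script fragment with two well-formed comments, one of which
// ends in extra asterisks and one of which spans two lines.
string script = "int a; /* one **/ int b; /* two\nlines */";

MatchCollection comments = blockComment.Matches(script);
foreach (Match m in comments)
{
    string oneLine = m.Value.Replace("\n", "\\n");
    Console.WriteLine($"{m.Index}: {oneLine}");
}

// An unterminated comment fails to match; the atomic groups prevent the
// engine from retrying shorter versions of the comment body.
bool unterminated = blockComment.IsMatch("int c; /* never closed");
Console.WriteLine(unterminated);
```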
Courtesy of David, here's a version that matches nested comments with any level of nesting:
(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/
It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:
/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/

No no no! Hasn't anyone else read Mastering Regular Expressions (3rd Edition)!? In this, Jeffrey Friedl examines this exact problem and uses it as an example (pages 272-276) to illustrate his "unrolling-the-loop" technique. His solution for most regex engines is like so:
/\*[^*]*\*+(?:[^*/][^*]*\*+)*/
However, if the regex engine is optimized to handle lazy quantifiers (like Perl's is), then the most efficient expression is much simpler (as other answers here suggest):
/\*.*?\*/
(With the equivalent 's' "dot matches all" modifier applied of course.)
Note that I don't use .NET so I can't say which version is faster for that engine.

You may want to try the option Singleline rather than Multiline, then you don't need to worry about \r\n. With that enabled the following worked for me with a simple test which included comments that spanned more than one line:
/\*.*?\*/

I think your expression is way too complicated. Applied to a large string, the many alternatives imply a lot of backtracking. I guess this is the source of the performance hit you see.
If the basic assumption is to match everything from the "/*" until the first "*/" is encountered, then one way to do it would be this (as usual, regex is not suited for nested structures, so nesting block comments does not work):
/\*(.(?!\*/))*.?\*/ // run this in single line (dotall) mode
Essentially this says: "/*", followed by anything that itself is not followed by "*/", followed by "*/".
Alternatively, you can use the simpler:
/\*.*?\*/ // run this in single line (dotall) mode
Non-greedy matching like this has the potential to go wrong in an edge case - currently I can't think of one where this expression might fail, but I'm not entirely sure.

I'm using this at the moment
\/\*[\s\S]*?\*\/
