I have a large and complex C# regex that runs OK when interpreted, but is a bit slow. I'm trying to speed it up by setting RegexOptions.Compiled, and that seems to take about 30 seconds the first time and run instantly after that. I'm trying to avoid this delay by compiling the regex to an assembly first, so my app can be as fast as possible.
My problem is when the compilation delay takes place; it happens whether the regex is compiled in the app:
Regex myComplexRegex = new Regex(regexText, RegexOptions.Compiled);
MatchCollection matches = myComplexRegex.Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
or using Regex.CompileToAssembly in advance:
MatchCollection matches = new CompiledAssembly.ComplexRegex().Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
This makes compiling to an assembly basically useless, as I still get the delay on the first foreach call. What I want is for all of the compilation delay to happen at compile time (at the Regex.CompileToAssembly call), not at runtime. Where am I going wrong?
(The code I'm using to compile to an assembly is similar to http://www.dijksterhuis.org/regular-expressions-advanced/, if that's relevant.)
Edit:
Should I be using new when calling the compiled assembly in new CompiledAssembly.ComplexRegex().Matches(searchText);? It gives an "object reference required" error without it, though.
Update 2
Thanks for the answers/comments. The regex that I'm using is pretty long but basically straightforward: a list of thousands of words, each separated by |. I can't see how it'd be a backtracking problem, really. The subject string can be just one letter long and it can still cause the compilation delay. With RegexOptions.Compiled, the regex takes over 10 seconds to execute when it contains 5,000 words. For comparison, the non-compiled version can take 30,000+ words and still execute just about instantly.
After doing a lot of testing on this, what I think I've found out is:
Don't use RegexOptions.Compiled when your regex has many alternatives - it can be extremely slow to compile.
.NET will use lazy evaluation for regexes when possible, and as far as I can see this extends (at least to some extent) to regex compilation too. A regex is only fully compiled when it has to be, and there seems to be no way of forcing compilation ahead of time.
Regex.CompileToAssembly would be much more useful if the regexes could be forced to be fully compiled; as it is, it seems to be verging on pointless.
Please correct me if I'm wrong or missing something!
When using RegexOptions.Compiled, you should make sure to re-use the Regex object. It doesn't seem like you are doing this.
RegexOptions.Compiled is a trade-off. The initial construction of the Regex will be slower, because code is compiled on-the-fly, but each match should be faster. If your regular expression changes at run-time, there will probably be no benefit from using RegexOptions.Compiled, although it might depend on the actual expression involved.
Update, per the comments
If your actual code looks like the code you have posted, you are not taking any advantage of CompileToAssembly, because you are creating new, on-the-fly compiled instances of Regex each time that piece of code runs. To take advantage of CompileToAssembly, you need to compile the Regex first, then take the generated assembly and reference it in your project, and then instantiate the generated, strongly-typed Regex types.
In the example you link to, he has a regular expression named FindTCPIP, which gets compiled into a type named FindTCPIP. When this needs to be used, you create a new instance of this specific type, such as:
TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();
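For reference, a minimal sketch of the generation step itself (the pattern, type name and namespace are placeholders, not taken from the original code); note that CompileToAssembly is a .NET Framework feature, and the produced DLL is then referenced by the main project:

using System.Reflection;
using System.Text.RegularExpressions;

class CompileRegexesTool
{
    static void Main()
    {
        // Describe the regex to generate: pattern, options, type name, namespace, ispublic.
        var info = new RegexCompilationInfo(
            @"\b(?:word1|word2|word3)\b",   // placeholder for the real, huge pattern
            RegexOptions.IgnoreCase,
            "ComplexRegex",
            "CompiledAssembly",
            true);

        // Writes CompiledAssembly.dll to disk; reference that DLL from the app and
        // instantiate new CompiledAssembly.ComplexRegex() as shown in the question.
        Regex.CompileToAssembly(new[] { info }, new AssemblyName("CompiledAssembly"));
    }
}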
Try using Regex.CompileToAssembly, then link to the assembly so that you can construct the Regex objects. RegexOptions.Compiled is a runtime option; the regex would still get re-compiled every time you run the application.
A very probable cause to investigate when a regex is slow is that it backtracks too much. This is solved by rewriting the regex so that backtracking is eliminated or minimal.
Can you post the regex and a sample input where it is slow?
Personally I didn't have the need to compile a regex, although it's interesting to see some actual numbers about performance if you have taken this path.
To force initialization you can call Match against an empty string. On top of that, you can use ngen to create a native image of the expression to speed up the process even further. But probably most importantly: it's essentially just as fast to throw 30,000 string.IndexOf, string.Contains or Regex.Match calls at a given text as it is to compile one ginormous expression to match against a single text, since the simple calls require a lot less compilation, JITting etc., as each state machine is a lot simpler.
Another thing you could consider is to tokenize the text and intersect it with the list of words you're after.
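For what it's worth, a rough sketch of that tokenize-and-intersect idea (the word list and sample text here are made up):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TokenizeSketch
{
    static void Main()
    {
        // Stand-in for the real list of thousands of words.
        var words = new HashSet<string>(new[] { "alpha", "beta", "gamma" },
                                        StringComparer.OrdinalIgnoreCase);
        string text = "Beta testing starts before the gamma release.";

        // Split the subject text into word tokens, then intersect with the word set;
        // no giant alternation regex has to be compiled at all.
        var hits = Regex.Matches(text, @"\w+")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .Where(words.Contains)
                        .Distinct(StringComparer.OrdinalIgnoreCase)
                        .ToList();

        Console.WriteLine(string.Join(", ", hits));   // Beta, gamma
    }
}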
After extensive testing of my own, I can confirm the suspicions of mikel are essentially correct. Even when using Regex.CompileToAssembly() and statically linking the resultant DLL into the application, there is a substantial initial delay on the first practical matching call (at least for patterns involving many ORed alternatives). Moreover, the initial delay on the first matching call depends on what text you match against. For example, matching against an empty string or some other arbitrary text will cause less of an initial delay, but you will still get additional delays later on when actual positive matches are first encountered in new text. The only way to fully guarantee future matches will all be lightning fast is to initially force a positive match at runtime with text that does indeed match. Of course this gives the maximum initial delay possible (in exchange for all future matches being lightning fast).
I dug deeper in order to understand this better. For each regex compiled into the assembly, a triplet of classes is written, following the naming template {RegexName, RegexNameFactoryN, RegexNameRunnerN}. A reference to the RegexNameFactoryN class is instantiated at the time of the RegexName ctor, but the RegexNameRunnerN class is not. See the private factory and runnerref fields in the base Regex class; runnerref is a cached weak reference to a RegexNameRunnerN object. After various experiments with reflection, I can confirm that the ctors of all three of these compiled classes are fast, and that the RegexNameFactoryN.CreateInstance() function (which returns the initial RegexNameRunnerN reference) is also fast. The initial delay occurs somewhere within RegexRunner.Scan(), or its call tree, and is thus likely outside the reach of the compiled MSIL generated by Regex.CompileToAssembly(), since this call tree involves numerous non-abstract functions. This is very unfortunate and means the performance benefits of the C# regex compilation process only extend so far: at runtime there will always be some substantial delay the first time a positive match is encountered (at least for this class of many-ORed patterns).
I theorize that this has to do with how the Nondeterministic Finite Automaton (NFA) engine performs some of its own internal caching/instantiations at runtime as the pattern is processed.
jessehouwing's suggestion of ngen is interesting and could possibly improve performance. I have not tested it.
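To make the warm-up idea described above concrete, here is a minimal sketch (the pattern and the warm-up text are placeholders; the important part is that the warm-up text actually produces a positive match):

using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class RegexWarmup
{
    // Compiled, many-alternative pattern (placeholder for the real word list).
    public static readonly Regex WordList =
        new Regex("cat|dog|fish", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static Task WarmUpAsync()
    {
        // Force a positive match on a background thread at startup so the one-time
        // cost is paid before the user triggers the first real search.
        return Task.Run(() => WordList.IsMatch("cat"));
    }
}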
Hi fellow programmers and nerds!
When creating regular expressions in Visual Studio, the IDE will highlight the string if it's preceded by a verbatim identifier (for example, @"Some string"), colorizing the different parts of the pattern. Most of you will have seen this by now, I'm sure.
My problem: I am using a package acquired from NuGet which deals with regular expressions, and it has a function which takes in a regular expression string; however, calls to that function don't get the syntax highlighting.
This just makes reading the regex string a pain. I mean, it's not all-too-important, but it would make a difference if we could have that visually helpful highlighting to reduce the time and effort one's brain spends trying to decipher the expression, especially in a case like mine where there will be quite a quantity of these expressions.
The question
So what I'm wanting to know is: is there a way to make a function highlight the string this way*, or is it just something that's hardwired into the IDE for the specific case of the Regex constructor? Is there some sort of annotation which can be tacked onto the function to achieve this with minimal effort, or would it be necessary to use some sort of extension?
*I have wrapped the call to AddStyle() into one of my own functions anyway, and the string will be passed as a parameter, so if any modifications need to be made to achieve the syntax-highlight, they can be made to my function. Therefore the fact that the AddStyle() function is from an external library should be irrelevant.
If it's a lot of work then it's not worth my time, somebody else is welcome to develop an extension to solve this, but if there is a way...
Important distinction
Please bear in mind I am talking about Visual Studio, NOT Visual Studio Code.
Also, if there is a way to pull the original expression string from the Regex, I might do it that way, since performance isn't a huge concern here as this is a once-on-startup thing, however I would prefer not to do it that way. I don't actually need the Regex object.
According to https://devblogs.microsoft.com/dotnet/visual-studio-2019-net-productivity/#regex-language-support and https://www.meziantou.net/visual-studio-tips-and-tricks-regex-editing.htm you can mark the string with a special comment to get syntax highlighting:
// language=regex
var str = @"[A-Z]\d+";
or
MyMethod(/* language=regex */ @"[A-Z]\d+");
(the comment may contain more than just this language=regex part)
The first linked blog talks about a preview, but this feature is also present in the final product.
.NET 7 introduces the new [StringSyntax(...)] attribute, which is applied in .NET 7 itself to more than 350 string, string[], and ReadOnlySpan<char> parameters, properties, and fields to tell an interested tool what kind of syntax is expected to be passed or set.
https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/?WT_mc_id=dotnet-35129-website&hmsr=joyk.com&utm_source=joyk.com&utm_medium=referral
So for a method argument you should just use:
void MyMethod([StringSyntax(StringSyntaxAttribute.Regex)] string regex);
Here is a video demonstrating the feature: https://youtu.be/Y2YOaqSAJAQ
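As a concrete sketch of how this could look in the wrapper function mentioned in the question (the names here are hypothetical; StringSyntaxAttribute lives in System.Diagnostics.CodeAnalysis and needs .NET 7 or later):

using System.Diagnostics.CodeAnalysis;

static class StyleHelpers
{
    // The attribute tells the IDE that callers pass a regex here, so the string
    // literal at the call site gets regex syntax highlighting.
    public static void AddRegexStyle([StringSyntax(StringSyntaxAttribute.Regex)] string pattern)
    {
        // ... forward to the third-party AddStyle(pattern) call here ...
    }
}

// Call site: the literal is colorized as a regular expression.
// StyleHelpers.AddRegexStyle(@"[A-Z]\d+");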
I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a rather unusual problem and I wasn't able to find anything on search engines resembling it to even start basing a solution on.
Any help would be appreciated.
You can collect text along the string (.+ style), followed by a lookahead check for what's been captured up to that point, i.e. for a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';

my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture)?
        say $1;
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick up someone's good data. It is telling that I had to require at least two words (with only one word required, even a doubled word like "that that" gets flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, based on what we know about the data.
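Since FNR uses .NET regex, here is a rough C# equivalent of the same idea, with the same caveats as above (sample line only; the pattern is a sketch, not a vetted solution):

using System;
using System.Text.RegularExpressions;

class DedupeSketch
{
    static void Main()
    {
        // A line with an accidental doubled paste, as in the question.
        string line = @"{FolderLoc = ""C:\testC:\test""}";

        // Capture at least two "words" worth of text that is immediately repeated,
        // and collapse the pair down to a single copy.
        string deduped = Regex.Replace(line, @"(\w+\W+\w+.+?)\1", "$1");

        Console.WriteLine(deduped);   // {FolderLoc = "C:\test"}
    }
}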
I've got an app with a textbox in it. The user enters text into this box.
I have the following function triggered in an OnKeyUp event in that textbox
private void bxItemText_KeyUp(object sender, System.Windows.Input.KeyEventArgs e)
{
    // rules is an array of regex strings
    foreach (string rule in rules)
    {
        Regex regex = new Regex(rule);
        if (regex.IsMatch(text.Trim().ToLower()))   // IsMatch returns a bool (Match returns a Match object)
        {
            // matchedRule is a variable
            matchedRule = rule;
            break;
        }
    }
}
I've got about 12 strings in rules, although this can vary.
Once the length of the text in my textbox gets bigger than 80 characters, performance starts to degrade. Typing a letter after 100 characters takes a second to show up.
How do I optimise this? Should I match on every 3rd KeyUp? Should I abandon KeyUp altogether and just auto-match every couple of seconds?
I would go with the second option, that is abandon KeyUp and trigger the validation every couple of seconds, or better yet trigger the validation when the TextBox loses focus.
Besides that, I would suggest caching and compiling the regular expressions beforehand, because it seems like you are using them over and over again. In other words, instead of storing the rules as strings in that array, store them as compiled regular expression objects when they are added or loaded.
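A rough sketch of what that could look like, reusing the names from the question (LoadRules and compiledRules are hypothetical additions):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Inside the window/control class from the question:
private List<Regex> compiledRules;

private void LoadRules(IEnumerable<string> rules)
{
    // Compile each rule once, up front; IgnoreCase makes the per-keystroke ToLower() unnecessary.
    compiledRules = rules
        .Select(r => new Regex(r, RegexOptions.Compiled | RegexOptions.IgnoreCase))
        .ToList();
}

private void bxItemText_KeyUp(object sender, System.Windows.Input.KeyEventArgs e)
{
    string input = bxItemText.Text.Trim();
    Regex hit = compiledRules.FirstOrDefault(r => r.IsMatch(input));
    if (hit != null)
        matchedRule = hit.ToString();   // Regex.ToString() returns the original pattern string
}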
Use static method calls instead of creating a new object each time; static calls use a caching feature: Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects.
That will be a major improvement in performance; then you can post your regexes (rules) to see if some optimization can be done in the regexes themselves.
Other resources :
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Optimizing Regex Performance, Part 3
Combining the strings into one regex will work faster than the foreach in code.
Combining two Regex to one
If you need pattern determination for each new symbol, and you care about performance, then a Finite State Machine seems to be the best option...
That is a much harder way: you have to specify, for each symbol, the list of next symbols that are allowed.
Then on each KeyUp you just walk to the next state, if possible, and you always know the set of patterns that the input text currently matches.
Some useful references that I could find:
FSM example
Guy explaining how to convert Regex to FSM
Regex - FSM converting discussion
You don't need to create a new Regex object each time. Also, using the static call will cache the pattern if it has been used before (since .NET 2.0). Here is how I would rewrite it:
matchedRule = rules.FirstOrDefault(rule => Regex.IsMatch(text.Trim().ToLower(), rule));
Given that you seem to be matching keywords, can you just perform the match on the current portion of text that's been edited (i.e. in the vicinity of the cursor)? Might be tricky to engineer, especially for operations like paste or undo, but scope for a big performance gain.
Pre-compile your regexes (using RegexOptions.Compiled). Also, can you make the Trim and ToLower redundant by expanding your regex? You're running Trim and ToLower once for each rule, which is inefficient even if you can't eliminate them altogether.
You can try and make your rules mutually exclusive - this should speed things up. I did a short test: matching against the following
"cat|car|cab|box|balloon|button"
can be sped up by writing it like this
"ca(t|r|b)|b(ox|alloon|utton)"
I'm writing a word game. I have access to the dictionary object to validate the words. I need to find all possible words that contain a word plus a set of additional characters.
for example:
let's say the word is "MEN" and the set of additional characters is "WALOHTD". I need a way to find words like....
1. MEND
2. WOMEN
3. MENTAL
4. etc.
Basically, we are looking at all possible words that contain "MEN" and any of the specified additional characters.
I can certainly write code that loops through the entire dictionary, first finding words that contain the subword and then checking for the specific characters' existence, but that is not optimal. It's taking more than a second. Any help towards an optimal solution is greatly appreciated.
_rey
The problem is a mixture of that of regular language and that of searching a data structure.
Considering the first aspect alone, we'd be inclined to use a regular expression. You don't say if we can repeat the "additional characters". If we can, it's easy enough [WALOTHD]*MEN[WALOTHD]* for your case, and that's easily adapted.
If we can't repeat, then we can start with [WALOTHD]{0,7}MEN[WALOTHD]{0,7} and filter out any that break the rule ("ALLOTMENT" matches that expression, but repeats L and T).
Or we can try to build a much more complicated regular expression, though I'm not sure the gains from the better expression would outweigh the cost of working out what it is.
Coming from the other side of searching a dictionary, a DAWG is very space-efficient and makes finding matches that contain substrings relatively efficient. It's not a complete match to this puzzle, as we have quite a few permutations of prefixes and suffixes to worry about. Without testing, I'd guess it'd be reasonably good if we can't repeat from the "additional characters", and horrible if we can. But that is just a guess. A GADDAG might well be worth looking at; it'd be bigger than a DAWG, but likely faster for this sort of search (GADDAGs are used in Scrabble solving, which is pretty much the same problem that you have here).
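For the first, regex-based idea, a minimal sketch (the dictionary here is a stand-in list; ALLOTMENT is included to show how letter repeats slip through the simple pattern):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class WordFilterSketch
{
    static void Main()
    {
        // Stand-in for the real dictionary object.
        var dictionary = new List<string> { "MEND", "WOMEN", "MENTAL", "ALLOTMENT", "MENU", "HOUSE" };

        // Letters may repeat: any number of the extra letters on either side of MEN.
        var pattern = new Regex("^[WALOHTD]*MEN[WALOHTD]*$", RegexOptions.IgnoreCase);

        var hits = dictionary.Where(w => pattern.IsMatch(w)).ToList();

        // Prints: MEND, WOMEN, MENTAL, ALLOTMENT (ALLOTMENT repeats L and T, so it would
        // need a further filter if repeats are not allowed).
        Console.WriteLine(string.Join(", ", hits));
    }
}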
I am working on a project to parse out a text file. The file is output from networking equipment. The incoming string is anywhere from a few thousand to tens of thousands of lines long. There will be a variable number of entries with keywords like these:
fcN/N is up
Hardware is Fibre Channel, SFP is short wave laser w/o OFC (SN)
Port WWN is 20:52:00:0d:ec:ef:b0:40
Admin port mode is F, trunk mode is on
snmp link state traps are enabled
Port vsan is 10
fcipN is up
.....
port-channel-N is trunking
......
The N is a number. There will always be the 'fcN/N' entries; there may or may not be the other two. The 'fcip' and 'port-channel' entries will have similar status information after each one as the fcN/N entries. All entries of the same type will be grouped - there won't be an fc followed by an fcip followed by another fc. Also, as a general rule, all the fc entries are listed, then all the port-channel, then all the fcip, but I don't want to assume that.

At the moment I have about 7 different regex patterns I am looking for. I do this by examining each line in turn; however, managing all those is cumbersome. I thought about splitting the string on newline and then using some kind of LINQ select to get all of each of the 3 types of entries, but that assumes they are always grouped in the same order. I also thought about 3 monster regexes to match everything from one entry to the next, but my experience is those are tough to get working and almost unreadable. Another thing I thought of was to first match the three keywords - fc, port-channel or fcip - and then have an if statement that matches the patterns unique to each. That is still matching each line against all 3 patterns, though.
To be clear, I have the regex patterns working. I am looking for a more efficient way to do this than testing each line for 6 or 8 matches.
Any other ideas?
I have two thoughts:
(1) Your last approach, using if statements to first find the right regex to apply, is likely to be quite efficient. I'd recommend it.
(2) You can compose regex's like this:
var pattern1 = @"abc";
var pattern2 = @"def";
var unionPattern = "((" + pattern1 + ")|(" + pattern2 + "))";
This makes it much more readable.
If you never want to find a match that spans lines you should split the file into lines first. That will improve efficiency because the regexes have smaller inputs and will backtrack less.
If your matches span multiple lines but they always start after a newline, you can split the string into chunks first, like this:
var chunks = Regex.Split(str, @"((fc\d)|(fcip\d)|(port-channel-\d))");
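A hedged sketch combining that split with idea (1), dispatching each chunk by its header prefix (the sample text and the per-type handling are placeholders; a single capture group is used so exactly one delimiter is inserted per chunk):

using System;
using System.Text.RegularExpressions;

class ChunkDispatchSketch
{
    static void Main()
    {
        string str = "fc1/1 is up\n  Port vsan is 10\nfcip3 is up\nport-channel-2 is trunking\n";

        // Keeping the alternation in one capture group makes Regex.Split return
        // delimiter, body, delimiter, body, ... after the leading (possibly empty) prefix.
        string[] parts = Regex.Split(str, @"(fc\d|fcip\d|port-channel-\d)");

        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            string header = parts[i];
            string body = parts[i + 1];

            // Cheap prefix test picks which detailed regex to run next (not shown here).
            if (header.StartsWith("fcip"))
                Console.WriteLine("fcip entry {0}: {1}", header, body.Trim());
            else if (header.StartsWith("port-channel"))
                Console.WriteLine("port-channel entry {0}: {1}", header, body.Trim());
            else
                Console.WriteLine("fc entry {0}: {1}", header, body.Trim());
        }
    }
}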
You might get clearer and more concise code by using a parser combinator library, such as Sprache.
Not being a C# programmer, I'm not intimately familiar with this library (and there may well be others for C# as well), but I've used Scala parser combinators to good effect, and they build on and use regular expression parsing.
Whether it makes your code more efficient likely depends on how inefficient your code is now.
Are you looking for raw speed, or efficiency? If the former, you can split the file into parts and have a thread parsing each part simultaneously. The trick will be finding a boundary to split on (so that each part contains only whole entries) quickly. You will also only want to go multithreaded if the total number of lines is large, or the overhead will outweigh the parallelization gains.