How to match regular expression starting exactly at a given index? - c#

With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
regular expression ab
input string: ababab
Now, I can search for matches for the regular expression (named expr in the following) in the input string, for instance, starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this.
As my string in which testing occurs might be ... "long" (think possibly 4 digit numbers of characters), I would, however, prefer to avoid the overhead that would presumably occur in all three cases one way or another:
1. Work-around: I could check the resulting match to see whether its Index property matches the supplied index.
   Drawback: Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached).
2. Work-around: I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index.
   Drawback: As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.)
3. Work-around: I could prepend "^.{#}" (with # being replaced with the character index to test) for each expression and match from the beginning, then fish out the actually interesting match with a capturing group.
   Drawback: I need to test the same regex on multiple possible start positions throughout my input string. As the number of skipped characters changes each time, that would mean compiling a new regex every time, rather than re-using the one that I have, which again feels somewhat unclean.
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful, as in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea about the expected length of a match in advance.

It appears you can use the \G anchor: the pattern \Gab will match when matching starts at index 2 and fail when it starts at index 1. See this C# demo:
Regex expr = new Regex(@"\Gab");
Console.WriteLine(expr.Match("ababab", 1).Success); // => False
Regex expr2 = new Regex(@"\Gab");
Console.WriteLine(expr2.Match("ababab", 2).Success); // => True
As per the documentation, \G operator matches like this:
"The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started."
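Since the question is about testing the same regex at several start positions, here is a minimal sketch of how the \G approach could be reused without recompiling. MatchesAt is a hypothetical helper name, not part of the Regex API, and the sketch assumes the pattern can be written with a leading \G.

```csharp
using System;
using System.Text.RegularExpressions;

// One Regex instance, built once and reused for every start position.
var anchored = new Regex(@"\Gab");

// Hypothetical helper: true only if a match begins exactly at startIndex.
// \G ties the match to the position passed as the second argument of Match.
bool MatchesAt(Regex regex, string text, int startIndex) =>
    regex.Match(text, startIndex).Success;

foreach (var index in new[] { 0, 1, 2, 3, 4 })
{
    Console.WriteLine($"{index}: {MatchesAt(anchored, "ababab", index)}");
}
// Prints True for 0, 2 and 4, and False for 1 and 3.
```

No substring is copied and no new regex is built per position, which addresses the drawbacks of work-arounds 2 and 3.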

Related

Regex match one digit or two

If this
(°[0-5])
matches °4
and this
((°[0-5][0-9]))
matches °44
Why does this
((°[0-5])|(°[0-5][0-9]))
match °4 but not °44?
Because with alternation the regex engine accepts the first alternative that matches. Here the first alternative, °[0-5], matches °4 in °44, so the engine returns °4 and never tries the second alternative, °[0-5][0-9]:
((°[0-5])|(°[0-5][0-9]))
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
You are putting the shorter alternative first in the alternation. Better use this regex to match both strings:
°[0-5][0-9]?
Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression
(a|ab|abc)
when fed this input:
abcdefghi
will only ever match a. However, if the regular expression is changed to
(a|ab|abc)d
It will first match a. Then, since the next character is not d, it backtracks and tries the next alternative, matching ab. Since the next character is still not d, it backtracks again and matches abc, and since the next character is d, the match succeeds.
Why would you not reduce your regular expression from
((°[0-5])|(°[0-5][0-9]))
to this?
°[0-5][0-9]?
It's simpler and easier to understand.
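The alternation behaviour described in the answers above can be checked with a short sketch. First is a hypothetical helper that returns the text of the first match; the patterns and inputs are the ones from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: text of the first match of pattern in text.
string First(string pattern, string text) => Regex.Match(text, pattern).Value;

// Alternatives are tried left to right; the first one that succeeds wins.
Console.WriteLine(First(@"(a|ab|abc)", "abcdefghi"));   // a
// The trailing d forces backtracking into the later alternatives.
Console.WriteLine(First(@"(a|ab|abc)d", "abcdefghi"));  // abcd

// Shorter alternative first: the second digit is never consumed.
Console.WriteLine(First(@"((°[0-5])|(°[0-5][0-9]))", "°44"));  // °4
// An optional second digit handles both inputs.
Console.WriteLine(First(@"°[0-5][0-9]?", "°44"));  // °44
Console.WriteLine(First(@"°[0-5][0-9]?", "°4"));   // °4
```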

Regex only returns value when anchor is provided

I'm using the following pattern to match numbers in a string. I know there is only one number in a given string I'm trying to match.
var str = "Store # 100";
var regex = new Regex(@"[0-9]*");
When I call regex.Match(str).Value, this returns an empty string. However, if I change it to:
var regex = new Regex(@"[0-9]*$");
It does return the value I wanted. What gives?
Ok I figured it out.
The problem with [0-9]* (or, to make it simpler, \d*) is that * makes the digits optional, so the pattern also produces a zero-length match at every position before the '100'. Match returns the first of these, which is empty.
To rectify this you could use \d\d* (equivalently \d+), which requires at least one digit and so rules out the zero-length matches.
Edit: The dollar version, e.g. \d*$ will only work if your number is at the end of the input string.
According to MSDN,
The quantifiers *, +, and {n,m} and their lazy counterparts never
repeat after an empty match when the minimum number of captures has
been found. This rule prevents quantifiers from entering infinite
loops on empty subexpression matches when the maximum number of
possible group captures is infinite or near infinite.
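So, as the minimum number of captures is zero, the [0-9]* pattern can match the empty string, and the first match it returns is the empty one at the start of the input. [0-9]+ requires at least one digit and will capture 100 without any problems.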
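To make the difference concrete, here is a small sketch that runs the three patterns from this thread against the question's input string. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

var str = "Store # 100";

// * allows zero digits, so the first match is empty and sits at index 0.
var star = Regex.Match(str, @"[0-9]*");
Console.WriteLine($"'{star.Value}' at index {star.Index}");      // '' at index 0

// + requires at least one digit, so the first match is the number itself.
var plus = Regex.Match(str, @"[0-9]+");
Console.WriteLine($"'{plus.Value}' at index {plus.Index}");      // '100' at index 8

// The $ version only works because the number is at the end of the input.
var dollar = Regex.Match(str, @"[0-9]*$");
Console.WriteLine($"'{dollar.Value}' at index {dollar.Index}");  // '100' at index 8
```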

Regex performance issue - can anyone explain why this regex is slow

I'm trying to implement the String.Contains() method with regex. I noticed that the pattern @".*foo.*" takes much longer than @"\A.*foo.*\Z".
Can anyone explain why?
Because without the anchor the regex engine has to make more tries to match before it can conclude that a match is impossible. Consider an example with the anchor:
Regex: \A.*foo.*\Z
Input: 123456789abcdef
The regular expression sees the start of input anchor and takes that into account. It now tries to match the first .* pattern, and since it's greedy it attempts to match all the input. Then it tries to match foo and fails, so it releases f from the .* match and attempts again. It fails again, releases e from the .* match, attempts again, fails, etc.
The end result is that the number of attempts taken until the whole expression fails to match is linear in the length of the input.
Now consider the non-anchored case:
Regex: .*foo.*
Input: 123456789abcdef
This time the regex engine again attempts to match from the start of the string, as above (making a number of attempts that is linear in the length of the string). But when that fails, it begins the process again starting from the second character of the input.
That is, it attempts to match the first .* successively with:
123456789abcdef
123456789abcde
123456789abcd
...
1
(empty string due to the * quantifier)
Up till now this is the same as with the anchored regex. But while the anchor would cause matching to fail at this point, the non-anchored regex continues to try with
23456789abcdef
23456789abcde
23456789abcd
...
2
(empty string due to the * quantifier)
As you see, this time the number of attempts taken until the whole expression fails to match is quadratic in the length of the input.
\A and \Z anchor the match to the beginning and the end of the string, so the regex is more constrained and the engine has less searching to do. For example, if your text contains newlines, the anchored regex is much faster: because . does not match a newline by default, it only has to examine the first line, whereas the unanchored regex keeps retrying from every later position.
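If the goal is only to emulate String.Contains(), the surrounding .* parts are not needed at all, since an unanchored regex already searches the whole input. The sketch below assumes the search term is a literal string; ContainsLiteral is a hypothetical helper name, and Regex.Escape is used so that characters such as . or * in the term are taken literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: regex-based equivalent of text.Contains(needle).
bool ContainsLiteral(string text, string needle) =>
    Regex.IsMatch(text, Regex.Escape(needle));

Console.WriteLine(ContainsLiteral("123456789abcdef", "foo"));      // False
Console.WriteLine(ContainsLiteral("xxfooyy", "foo"));              // True

// The anchored pattern from the question gives the same answers here.
Console.WriteLine(Regex.IsMatch("123456789abcdef", @"\A.*foo.*\Z")); // False
Console.WriteLine(Regex.IsMatch("xxfooyy", @"\A.*foo.*\Z"));         // True
```

Note that the anchored pattern would additionally need RegexOptions.Singleline to find foo on a later line of multi-line input, whereas the literal search is unaffected by newlines.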

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using C# verbatim string literals (the @ prefix) for regular expression patterns, so that you do not need to double-escape backslashes. I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex.
Try the regex string \\[B.+?\\] instead. .+ on its own (same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, the .+ matches Name] [BAddress, so the whole match is [BName] [BAddress].
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
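The three patterns discussed in these answers can be compared side by side with a short sketch. Tags is a hypothetical helper that collects the matched values; the input is the second example from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: all matched values of pattern in text.
string[] Tags(string text, string pattern)
{
    var result = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        result.Add(m.Value);
    }
    return result.ToArray();
}

var markup = "[BName] [BAddress]";

// Greedy: one match spanning from the first [ to the last ].
Console.WriteLine(string.Join(" | ", Tags(markup, @"\[B.+\]")));      // [BName] [BAddress]
// Lazy: stops at the first ] each time, giving two matches.
Console.WriteLine(string.Join(" | ", Tags(markup, @"\[B.+?\]")));     // [BName] | [BAddress]
// Negated class: stays greedy but cannot run past a ].
Console.WriteLine(string.Join(" | ", Tags(markup, @"\[B[^\]]+\]")));  // [BName] | [BAddress]
```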

.NET RegExp engine search performance optimization

I have a List collection with around 35,000 strings.
A typical string looks like this:
"<i>füüs</i>ampri tähis;lüh ld-st<i>anno</i>, aastal;<i>maj</i> lüh pr-st<i>argent</i>, raha (kursisedelitel)"
Basically this string contains bunch of words in Estonian :)
I need to allow the user to perform a RegExp search on these 35,000 strings.
If I perform the search using the /ab.*/ expression, it takes less than a second.
If I perform the search using the /.*ab/ expression, it takes around 10 seconds.
My question is: How can I make the second search faster (less than 1.5 seconds)?
Thank You very much
It’s how regular expressions are processed that makes them perform so different. To explain that based on your examples:
/.*ab/   This expression consists of two sub-expressions, the .* and the literal ab. It is processed as follows: in the normal greedy mode, where every quantifier (and thus the match) is expanded to its maximum, the .* will first match the whole string. Then the engine will try to match the following ab. But as that is not possible (we’re already at the end of the string), backtracking is used to find the last point where both sub-expressions match. So the match of .* is reduced by one character and the ab is tested again. As long as the whole expression cannot be matched, the match of .* is reduced by one more character, until the whole expression matches. In the worst case there is no ab in the string and the algorithm will do n+1 backtracks and additional tests for ab until it can determine that a match is impossible.
/ab.*/   This expression consists of two sub-expressions too. But here the order is changed: first the literal ab and then the .*. This is processed as follows: the algorithm first tries to find the literal ab by comparing one character after another. If there is a match, it tries to find a match for .*, which is trivially easy.
The main difference between those two regular expressions is that the second has the static part at the beginning and the variable part at the end, so no backtracking is necessary.
Try some regular expression analysis tool such as RegexBuddy to see the difference visually.
Use compiled regular expressions for better performance
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes—Compiling_Regular_Expressions
copy paste the full url, looks like there is rendering problem with this link.
There are two possible modifications I can suggest for the slow .*ab expression.
I performed my tests with this test string "1234567890 ab 1234567890" using the benchmarking feature in Regex Hero.
A. 5X faster than original
^.*ab
RegexOptions.None
or
B. 8X faster than original
.*ab
RegexOptions.RightToLeft
Sometimes experimentation pays off. The RightToLeft option was my "ah ha!" moment. That essentially returns the same performance as your other ab.* expression by preventing the massive backtracking from ever occurring.
I got this crazy idea that you could also store the strings in reverse order and search those strings with /ba.*/ if the user enters /.*ab/.
Your second expression will match 'ab' and all characters before it (except newlines). You can try searching only for /ab/, get the index of the match, and build the result by concatenating the part of the string preceding the match with the match itself.
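The RightToLeft suggestion and the "search only for ab" suggestion can be sketched together. This assumes single-line input (no newlines) and a literal search term; PrefixThroughLast is a hypothetical helper name. Because greedy .*ab extends to the last ab, the sketch looks for the last occurrence of the literal rather than the first.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "1234567890 ab 1234567890";

// Original pattern, scanned left to right and right to left.
var leftToRight = Regex.Match(input, @".*ab").Value;
var rightToLeft = Regex.Match(input, @".*ab", RegexOptions.RightToLeft).Value;

// Hypothetical helper: find the last occurrence of the literal, then take
// everything from the start of the text through the end of that occurrence.
string PrefixThroughLast(string text, string literal)
{
    var last = Regex.Match(text, Regex.Escape(literal), RegexOptions.RightToLeft);
    return last.Success ? text.Substring(0, last.Index + last.Length) : "";
}

Console.WriteLine(leftToRight);                     // 1234567890 ab
Console.WriteLine(rightToLeft);                     // 1234567890 ab
Console.WriteLine(PrefixThroughLast(input, "ab"));  // 1234567890 ab
```

All three produce the same text for this input; only the amount of work the engine does differs, so any speed-up should be measured on the real data.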
