Fastest regex for first occurence of a word

Fastest regex for first occurence of a word - c#

I would like my regex to capture the following kind of strings as two Urls with "%3f" inside them.
https://*****%3f****%3D,https://*****%3f****%3D …
Where each string URL of this type should be captured by itself. Note - The * is here for simplification and the URLS can be in any part of the big string with anything in between.
My regex now is:
(https://\S+?%3f)(?<toDelete>\S+?%3D)
But I've been asked to see if there's a non lazy approach for this (or just a faster version), as it is much slower then greediness, and this regex will be called over huge strings and dataflow.
Note that the reason I cant simply put \S* is that doing so will capture in one match from the first http to the last %3D.

You might probably split the string with a comma and then get a substring up to the %3f value.
If you want to make the \S*? pattern work "faster" you must take into account what kind of context this part of a pattern should be aware of.
You are matching any char that is not a whitespace char, any amount of times, up to the first occurrence of %3f. That is, you want to match any chars other than % and whitespace or % chars that are not followed with 3f. That makes (?:[^\s%]|%(?!3f))*. However, alternation ruins the whole idea of optimization. You need to use the "unroll-the-loop" approach: [^%\s]*(?:%(?!3f)[^%\s]*)*.
So, the whole pattern will look like
https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f
Or with the Delete part:
(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)
For short strings, this last pattern might work a tiny bit slower than the \S+? based pattern, but it becomes much more efficient when the matched string becomes longer.

Related

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check if the next string is match to this pattern in this code:
string str = "CRSSA.T,";
var pattern = #"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
the site: http://www.regexr.com/ says it's not match(everything match, except the last comma), but that code prints True. is it possible?
thanks ahead! :)

First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z_], e.g. in .NET \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.

Performance and readability of RegEx using positive look ahead

I am validating following strings with regular expressions in C#:
[/1/2/]
[/1/2/];[/3/4/5/]
[/1/22/333/];[/1/];[/9999/]
Basically it's one or more group of square brackets separated by semi-colon (but not at the end). Each group consists out of one or more numbers seperated by slashes. There are no other characters allowed.
These are two alternatives:
^(\[\/(\d+\/)+\](;(?=\[)|$))+$
^(\[\/(\d+\/)+\];)*(\[\/(\d+\/)+\])$
The first version uses a positive look ahead and the second version duplicates part of the pattern.
Both RegEx-es seem to be ok, do what they should and aren't very nice to read. ;)
Does anybody have an idea for a better, faster and more easy to read solution? When I was playing around in regex101 I realized that the second version uses more steps, why?
At the same time I realized that it would be nice to count the steps used in a C#-RegEx. Is there any way to achieve this?

You can use 1 regex to validate all these strings:
^\[/(\d+/)+\](?:;\[/(\d+/)+\])*$
See regex demo
To make it easier to read, use a VERBOSE flag (inline (?x) or RegexOptions.IgnorePatternWhitespace):
var rx = #"(?x)^ # Start of string
\[/ # Literal `[/`
(\d+/)+ # 1 or more sequences of 1 or more digits followed by `/`
\] # Closing `]`
(?: # A non-capturing group start
; # a semi-colon delimiter
\[/(\d+/)+\] # Same as the first part of the regex
)* # 0 or more occurrences
$ # End of string
";
To test a .NET regex performance (not the number of steps), you can use a regexhero.net service. With the 3 sample strings above, my regex shows 217K iterations per second speed, which is more than either of your regexps.

There is nothing particularly wrong with the two options you suggest. They are not that complicated as regexes go, and they should be understandable enough, as long as you put an appropriate comment in your code.
In general, I think it is preferable to avoid look-arounds, unless they are necessary or greatly simplify the regex--they make it harder to figure out what is going on, since they add a non-linear element to the logic.
The relative performance of regexes this simple is not something to worry about, unless you are performing a huge number of operations or discover a performance problem with your code. Still, understanding the relative performance of different patterns may be instructive.

Why does Regex hang in C#

I have had a fairly good search around, and although there are quite a few similar questions, I don't believe the answers are applicable.
I have a reasonably inefficient regex searching a reasonably large string. I have tested it in http://regexpal.com with the exact regex and string, and it comes back with the correct answer almost instantaneously.
The C# Regex module with the same inputs hangs - or at least I've left it 10 minutes to do what regexpal can do in fractions of a second.
Is the C# implementation of Regex hopelessly less efficient than http://regexpal.com, or is it genuinely hanging? The regex is to search for two keywords which are separated by an unknown number of lines:
"KEYWORD1(.|\r|\n)+KEYWORD2\t +.+"
And the string is 830 lines long, each line being approximately 30 characters.

According to the documentation on Regular Expression, . matches any single character except \n. This means that . (which doesn't match \r in Java (default mode), JavaScript, etc.) matches \r in .NET.
Your regex effectively allows 2 branches for the same character \r. The more \r in the input, the longer it takes to run the regex. On a failing input, it will cause exponential complexity based on the number of \r in the input.
Note that regexpal is a JavaScript regular expression tester, and as mentioned above, . in JavaScript excludes \r, \n (and a few other line separator). Since there is no overlap in what they match, each character has at most 1 branch to follow.
One solution is to replace (.|\r|\n)+ with (?s:.+). The s flag will effectively makes . match any character without exception. There is only one branch for any character, so no exponential backtracking.
 +.+ can't cause much inefficiency in this case, since it is already at the end of the pattern. It may cause problem (quadratic complexity) if there is something else following it, though. For example, if there is $ at the end, then in the failing case, when the pattern  +.+$ is matches against a suffix with lots of spaces, followed by a newline at the end, then an unoptimized engine will try all ways to divide the consecutive spaces into 2 parts.

As was stated in the comments, a common issue with Regex is having a pattern along the lines of:
Word1.+Word2
Because if your text was very large and had something like:
Word1 ... Word1 ... Word2 ... Word2 ... Word1 .... Word2 ... Word2
You would have ALL combinations matched that began with Word1 and ended with Word2 - Even when Word1 or Word2 were in between them.
Generally that's not what you're looking for and want the shortest set of characters between your start and end points (or not have Word1 show up again). For that your Regex would best be changed to:
Word1.+?Word2
Hope that makes sense.

Basically, your regular expression is recursive. Removing the final .+ solves the issue. Here is a good article on avoiding it in the future. Why does .NET hang? Probably because it will never abort the search. Perhaps the JavaScript parser you used aborts if it detects too many steps or goes too deep into recursive searches.

Regex not capturing date

I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the Main one string in the string array parameter from a console app. No one on my team in "fluent" in regex, nor do we have time to become fluent. We work with what we can and this addition is something I thought of today that may have with bigger problems what we have with our project and we are running low on time. So any help would be appreciated.

First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
you have a few occurences (?:(?:regex)*)*. I am not sure what the point is of doubling the outer * besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
there is no point doing (?:/\s*) as you are not doing anything with the brackets, so just do /\s*
same with things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them. /client: will do.
(?:regex)* means "match 0 to infinity occurences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ? which means "match 0 or 1 occurences" (if you know that that token will appear either once or not at all), or replace it with (say) {0, 100} if you know that that token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to < and > in regexr so the link won't work unless you copy/paste the regex I've written directly)

If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
Also as already pointed out by mathematical.coffee, the first regex can be improved.

parsing a method Signature using regular expressions

I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
so using the regular expressions, I am doing the following
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty array at the beginning if the string started with a method in the given format and the same happens if it comes at the end. Also if two methods appeared in the string, it catches only the first one! why is that ?
I think what is causing the parser not to catch two methods is the existance of ".+" in my patern, what I wanted to do is that I want to tell it that there will be a number of a date in that location, so I tell it that there will be a sequence of any chars, is that wrong ?
it woooorked with ,e =D ... I replaced ".+" by ".+?" which meant as few as possible of any number of chars ;)

Your goal is quite unclear to me. What do you want as result? If you split on that method pattern, you will get the part before your pattern and the part after your pattern in an array, but not the method itself.
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes, when you write e.g. [\\[ ]. this construct will match either a [ or a space. If you want to allow optional space after the [ (would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading #) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
You can simplify your regex, by avoiding repeating parts
tokensizedStrs = Regex.Split(target, #"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)s*\]\s*)");
This is an non capturing group (?:\s*,\s*[A-Za-z0-9 ]+){2} repeated two times.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Fastest regex for first occurence of a word - c#

Related

Regex.IsMatch gives true but http://www.regexr.com/ gives false

Performance and readability of RegEx using positive look ahead

Why does Regex hang in C#

Regex not capturing date

parsing a method Signature using regular expressions

Categories

Resources