Improving/Fixing a Regex for C style block comments

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.
On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.
The Regex I'm using is this:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
Any suggestions on why this might get locked up?
Alternatively, what's another Regex I could use instead?
More information:
Working in C# 3.0 targeting .NET 3.5;
I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
I've left the program running for over an hour, but the match isn't completed;
Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
The regex works correctly for 452 of my 453 test files.

Some problems I see with your regex:
There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.
Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters as you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.
Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them for keeps.
For maximum performance, I would use this regex:
/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/
This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.
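As a rough illustration of the atomic-group pattern above, here is a minimal C# sketch. The class name BlockCommentDemo, the helper FindComment, and the sample strings are assumptions made for this example, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockCommentDemo
{
    // Pattern quoted from the answer above. The atomic groups (?>...) keep the
    // engine from re-trying text it has already accepted.
    private static readonly Regex Comment =
        new Regex(@"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/");

    // Hypothetical helper: returns the first comment found at or after
    // startIndex, or null when there is none.
    public static string FindComment(string source, int startIndex)
    {
        Match m = Comment.Match(source, startIndex);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string script = "x = 1; /* multi\nline ** comment **/ y = 2;";
        Console.WriteLine(FindComment(script, 0));

        // An unterminated comment: the atomic groups leave nothing to retry,
        // so the attempt is abandoned instead of backtracking.
        string broken = "/* never closed " + new string('*', 30);
        Console.WriteLine(FindComment(broken, 0) ?? "(no match)");
    }
}
```

Regex.Match(string, int) is the same overload the question uses to start matching at a given index.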
Courtesy of David, here's a version that matches nested comments with any level of nesting:
(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/
It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:
/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/
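The balancing-group version above can be tried out with a small C# sketch like the following. The class NestedCommentDemo, the helper FindNested, and the sample input are invented for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedCommentDemo
{
    // Balancing-group pattern quoted from the answer above (.NET only).
    // Each inner "/*" pushes onto LEVEL, each inner "*/" pops it, and the
    // conditional (?(LEVEL)(?!)) rejects a match that leaves LEVEL non-empty.
    private static readonly Regex Nested = new Regex(
        @"(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/");

    // Hypothetical helper: returns the first balanced comment, or null.
    public static string FindNested(string source)
    {
        Match m = Nested.Match(source);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FindNested("x /* a /* b */ c */ y"));
    }
}
```

Because of the + quantifier, this pattern needs at least one character between the delimiters, so an empty /**/ is not matched.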

No no no! Hasn't anyone else read Mastering Regular Expressions (3rd Edition)!? In this, Jeffrey Friedl examines this exact problem and uses it as an example (pages 272-276) to illustrate his "unrolling-the-loop" technique. His solution for most regex engines is like so:
/\*[^*]*\*+(?:[^*/][^*]*\*+)*/
However, if the regex engine is optimized to handle lazy quantifiers (like Perl's is), then the most efficient expression is much simpler (as suggested above):
/\*.*?\*/
(With the equivalent 's' "dot matches all" modifier applied of course.)
Note that I don't use .NET so I can't say which version is faster for that engine.
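For readers who do use .NET, a minimal sketch along these lines can check that the two patterns agree on a given input before timing them. The class UnrolledLoopDemo, the helper First, and the sample script are assumptions for illustration; no claim is made here about which pattern is faster.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnrolledLoopDemo
{
    // Friedl's unrolled-loop pattern, as quoted above.
    public static readonly Regex Unrolled =
        new Regex(@"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/");

    // Lazy-dot pattern; Singleline makes the dot match line breaks too.
    public static readonly Regex Lazy =
        new Regex(@"/\*.*?\*/", RegexOptions.Singleline);

    // Hypothetical helper: first match of the given regex, or null.
    public static string First(Regex regex, string source)
    {
        Match m = regex.Match(source);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string script = "a = 1; /* x * y\n **/ b = 2; /* z */";
        Console.WriteLine(First(Unrolled, script));
        Console.WriteLine(First(Unrolled, script) == First(Lazy, script));
    }
}
```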

You may want to try the Singleline option rather than Multiline; then you don't need to worry about \r\n, because the dot also matches line breaks. With that enabled, the following worked for me in a simple test that included comments spanning more than one line:
/\*.*?\*/

I think your expression is way too complicated. Applied to a large string, the many alternatives imply a lot of backtracking. I guess this is the source of the performance hit you see.
If the basic assumption is to match everything from the "/*" until the first "*/" is encountered, then one way to do it would be this (as usual, regex is not suited for nested structures, so nesting block comments does not work):
/\*(.(?!\*/))*.?\*/ // run this in single line (dotall) mode
Essentially this says: "/*", followed by anything that itself is not followed by "*/", followed by "*/".
Alternatively, you can use the simpler:
/\*.*?\*/ // run this in single line (dotall) mode
Non-greedy matching like this has the potential to go wrong in an edge case - currently I can't think of one where this expression might fail, but I'm not entirely sure.

I'm using this at the moment
\/\*[\s\S]*?\*\/
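To make the difference between the dot-based and [\s\S]-based patterns concrete, here is a small C# sketch. The class DotModeDemo, the helper Matches, and the sample comment are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotModeDemo
{
    // Hypothetical helper: true when the pattern matches anywhere in source.
    public static bool Matches(string pattern, RegexOptions options, string source)
    {
        return Regex.IsMatch(source, pattern, options);
    }

    public static void Main()
    {
        string comment = "/* first line\nsecond line */";

        // Without Singleline the dot stops at the line break, so this fails.
        Console.WriteLine(Matches(@"/\*.*?\*/", RegexOptions.None, comment));

        // With Singleline the dot matches the line break as well.
        Console.WriteLine(Matches(@"/\*.*?\*/", RegexOptions.Singleline, comment));

        // [\s\S] matches any character regardless of the option.
        Console.WriteLine(Matches(@"\/\*[\s\S]*?\*\/", RegexOptions.None, comment));
    }
}
```

Multiline only changes what ^ and $ match, so it has no effect on any of these three patterns.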

Related

Removing string(s) within delimiter chars from another string C# [duplicate]

I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
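A short C# sketch shows the effect of the added ?. It assumes the log records are joined by single spaces on one line (with real line breaks the default dot would stop at the end of each line anyway); the class ProjectNameDemo and the helper Extract are hypothetical names for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProjectNameDemo
{
    // Assumption: the four records from the question, joined by single spaces.
    public const string Log =
        @"J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM " +
        @"J0000010: Project name: E:\foo.pf " +
        @"J0000011: Job name: MBiek Direct Mail Test " +
        @"J0000020: Document 1 - Completed successfully";

    // Hypothetical helper: value of capture group 1, or null on no match.
    public static string Extract(string pattern)
    {
        Match m = Regex.Match(Log, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Greedy: runs on to the last J####### marker.
        Console.WriteLine(Extract(@"Project name:\s+(.*)\s+J[0-9]{7}:"));
        // Lazy: stops at the first J####### marker.
        Console.WriteLine(Extract(@"Project name:\s+(.*?)\s+J[0-9]{7}:"));
    }
}
```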

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "everything except whitespace", and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to match another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

regex to highlight XML values

DISCLAIMER: I know that using regex on xml is risky and generally a bad idea, but I can only feed regex into my syntax highlighting engine, and I can't spend the resources required to create a new system just for xml-based languages.
So I'm trying to use regex to get the values inside XML tags, as such:
<LoremIpsum>I NEED THIS PART</LoremIpsum>
I thought this would be nice and easy, and I could just use (>.*<\/). It works perfectly on any online regex tester; however, as soon as I try using it in .NET, it completely messes up, and I end up getting a completely unpredictable output. What would be the correct way to do this, in one regular expression, considering I'm using .NET's System.Text.RegularExpressions?

This is probably because the .* quantifier is greedy. My suggestion would be to use the non-greedy .*? or [^<] instead of .:
(>.*?<\/)
(>[^<]*<\/)
That way the match can't run past a <.
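The difference shows up as soon as two elements share a line. Below is a minimal C# sketch under that assumption; the class XmlValueDemo, the helper Values, and the second sample string are invented for illustration, and a capture group is added around the value so it can be read back without the delimiters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XmlValueDemo
{
    // Hypothetical helper: collects capture group 1 of every match.
    public static List<string> Values(string pattern, string xml)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(xml, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string xml = "<a>one</a><b>two</b>";

        // Greedy dot: a single match that swallows the markup in between.
        Console.WriteLine(string.Join(" | ", Values(@">(.*)</", xml).ToArray()));

        // Negated class: one match per element value.
        Console.WriteLine(string.Join(" | ", Values(@">([^<]*)</", xml).ToArray()));
    }
}
```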

You never define what "completely messes up" means, but try doing this:
(>.*?<\/)
The ? in .*? makes it a non-greedy match. By default, regular expression quantifiers are greedy, meaning they match as much as possible. The non-greedy form matches as little as possible. To see the difference, match both forms against the text is <a>test</a> of wrapped in an outer element: with (>.*<\/) the match runs through is <a>test</a> of, while with (>.*?<\/) it stops after is <a>test.
If you want to avoid any XML tags in the match, then you should use @ThomasWeller's solution.

Performance and readability of RegEx using positive look ahead

I am validating following strings with regular expressions in C#:
[/1/2/]
[/1/2/];[/3/4/5/]
[/1/22/333/];[/1/];[/9999/]
Basically it's one or more groups of square brackets separated by semicolons (but not at the end). Each group consists of one or more numbers separated by slashes. No other characters are allowed.
These are two alternatives:
^(\[\/(\d+\/)+\](;(?=\[)|$))+$
^(\[\/(\d+\/)+\];)*(\[\/(\d+\/)+\])$
The first version uses a positive look ahead and the second version duplicates part of the pattern.
Both RegEx-es seem to be ok, do what they should and aren't very nice to read. ;)
Does anybody have an idea for a better, faster and easier-to-read solution? When I was playing around in regex101 I realized that the second version uses more steps. Why?
At the same time I realized that it would be nice to count the steps used in a C#-RegEx. Is there any way to achieve this?

You can use 1 regex to validate all these strings:
^\[/(\d+/)+\](?:;\[/(\d+/)+\])*$
See regex demo
To make it easier to read, use a VERBOSE flag (inline (?x) or RegexOptions.IgnorePatternWhitespace):
var rx = @"(?x)^ # Start of string
\[/ # Literal `[/`
(\d+/)+ # 1 or more sequences of 1 or more digits followed by `/`
\] # Closing `]`
(?: # A non-capturing group start
; # a semi-colon delimiter
\[/(\d+/)+\] # Same as the first part of the regex
)* # 0 or more occurrences
$ # End of string
";
To test a .NET regex performance (not the number of steps), you can use a regexhero.net service. With the 3 sample strings above, my regex shows 217K iterations per second speed, which is more than either of your regexps.
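As a sketch of how the verbose pattern might be wired into a validator, the following C# example passes RegexOptions.IgnorePatternWhitespace instead of the inline (?x). The class BracketListValidator, the method IsValid, and the invalid sample inputs are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketListValidator
{
    // Same logic as the pattern above, laid out with free-spacing comments.
    private static readonly Regex Pattern = new Regex(@"
        ^                        # start of string
        \[/ (\d+/)+ \]           # first group, e.g. [/1/2/]
        (?: ; \[/ (\d+/)+ \] )*  # further groups, each preceded by a semicolon
        $                        # end of string
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper name.
    public static bool IsValid(string input)
    {
        return Pattern.IsMatch(input);
    }

    public static void Main()
    {
        string[] samples =
        {
            "[/1/2/]", "[/1/2/];[/3/4/5/]", "[/1/22/333/];[/1/];[/9999/]",
            "[/1/2/];", "[/1/2]", "[//]"
        };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + IsValid(s));
        }
    }
}
```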

There is nothing particularly wrong with the two options you suggest. They are not that complicated as regexes go, and they should be understandable enough, as long as you put an appropriate comment in your code.
In general, I think it is preferable to avoid look-arounds, unless they are necessary or greatly simplify the regex--they make it harder to figure out what is going on, since they add a non-linear element to the logic.
The relative performance of regexes this simple is not something to worry about, unless you are performing a huge number of operations or discover a performance problem with your code. Still, understanding the relative performance of different patterns may be instructive.

Why does checking this string with Regex.IsMatch cause CPU to reach 100%?

When using Regex.IsMatch (C#, .Net 4.5) on a specific string, the CPU reaches 100%.
String:
https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/?type=1&permPage=1
Pattern:
^http(s)?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$
Full code:
Regex.IsMatch("https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/?type=1&permPage=1",
@"^http(s)?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$");
I found that redacting the URL prevents this problem. Redacted URL:
https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792
But still very interested in understanding what causes this.

As nu11p01n73R pointed out, you have a lot of backtracking with your regular expression. That’s because several parts of your expression can match the same text, which gives the engine many choices it has to try before finding a result.
You can avoid this by changing the regular expression to make individual sections more specific. In your case, the cause is that you wanted to match a real dot but used the match-all character . instead. You should escape it as \. instead.
This should already reduce the need for backtracking a lot and make it fast:
^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=])?$
And if you want to actually match the original string, you need to add a quantifier (the + after the closing ]) to the character class at the end:
^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]+)?$
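The corrected pattern can be checked against the URL from the question with a sketch like this one. The one-second match timeout is an addition made here as a safety net against runaway backtracking, not something the answer proposes; it uses the Regex constructor overload that takes a TimeSpan, available from .NET 4.5. The class UrlCheckDemo and the method IsUrl are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlCheckDemo
{
    // Corrected pattern from the answer above: escaped dot, and a + after the
    // final character class. The timeout is an extra guard added for this sketch.
    private static readonly Regex Url = new Regex(
        @"^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]+)?$",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    // Hypothetical helper: treats a timed-out match as "not a URL".
    public static bool IsUrl(string candidate)
    {
        try
        {
            return Url.IsMatch(candidate);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsUrl(
            "https://www.facebook.com/CashKingPirates/photos/" +
            "a.197028616990372.62904.196982426994991/1186500984709792/" +
            "?type=1&permPage=1"));
    }
}
```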

I suggest you check the http://regexr.com/ website to test your regular expression.
The corrected version of your regular expression is this:
^(https?://(?:[\w]+\.?[\w]+)+[\w]/?)([\w\./]+)(\?[\w-=&%]+)?$
It also has 3 groups:
group1=Main url (for example: facebook.com)
group2=Sub urls (for example: /CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/)
group3=Variables (for example: ?type=1&permPage=1)
Also remember that to match a literal dot (.) in your regular expression you must use \. and not .

Your regex suffers from catastrophic backtracking. You can simply use
^http(s)?://([\w.-])+(/[\w ./?%&=-]+)*$
See demo.
https://regex101.com/r/cK4iV0/15

Difficulty with Simple Regex (match prefix/suffix)

I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil

Your look-behinds are the problem. Here's how the string is being processed:
We see [ in the string, and it matches the regex.
Look-behind in regex asks us to see if the previous character was a '('. This fails, because it was a '['.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
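A quick way to see both behaviours side by side is a sketch like the following, which runs the failing pattern and the suggested fix against the same input. The class LookaroundDemo and the helper Inner are hypothetical names for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: the matched text, or null when nothing matches.
    public static string Inner(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "[(foo)]";

        // Fails: right after consuming "[", the look-behind demands that the
        // previous character is "(", which it cannot be.
        Console.WriteLine(Inner(@"\[(?<=\()\w+(?=\))\]", input) ?? "(no match)");

        // Works: both delimiters are moved inside the assertions.
        Console.WriteLine(Inner(@"(?<=\[\()\w+(?=\)\])", input) ?? "(no match)");
    }
}
```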

Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: these are known as "zero-width assertions". They check that what follows or precedes conforms to what is expected, but they don't consume any of the string, so whatever comes after a lookahead (or before a lookbehind) in the pattern must match that same text too. E.g. if you put something after the positive lookahead which doesn't match what is asserted, you can be sure the RE will fail.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.
