Difficulty with Simple Regex (match prefix/suffix) - c#

I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close square brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil

Your look-behinds are the problem. Here's how the string is being processed:
We see [ in the string, and it matches the regex.
Look-behind in regex asks us to see if the previous character was a '('. This fails, because it was a '['.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
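As a rough check of the explanation above, here is a minimal sketch that runs both patterns against "[(foo)]". The class and method names (LookbehindDemo, Extract) are made up for illustration; only Regex.Match comes from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Pattern from the question: once "[" has been consumed, the look-behind
    // demands that the previous character is "(", which can never be true.
    public const string Broken = @"\[(?<=\()\w+(?=\))\]";

    // Pattern from the answer: both delimiters live inside the look-arounds.
    public const string Fixed = @"(?<=\[\()\w+(?=\)\])";

    // Returns the matched text, or an empty string when nothing matches.
    public static string Extract(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine("broken: '" + Extract(Broken, "[(foo)]") + "'"); // broken: ''
        Console.WriteLine("fixed:  '" + Extract(Fixed, "[(foo)]") + "'");  // fixed:  'foo'
    }
}
```

If the brackets do not need to be excluded from the overall match, a plain capture group such as \[\((\w+)\)\] read through m.Groups[1] would probably do the same job without look-arounds.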

Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: these are known as "zero-width assertions". They check that what follows or precedes conforms to what is expected, but they don't consume any of the string, so whatever the pattern matches right after a look-ahead (or right before a look-behind) must itself satisfy the assertion. E.g. if you put something after a positive look-ahead that doesn't match what is asserted, you can be sure the RE will fail.

Related

regex to highlight XML values

DISCLAIMER: I know that using regex on XML is risky and generally a bad idea, but I can only feed regex into my syntax highlighting engine, and I can't spend the resources required to create a new system just for XML-based languages.
So I'm trying to use regex to get the values inside XML tags, as such:
<LoremIpsum>I NEED THIS PART</LoremIpsum>
I thought this would be nice and easy, and I could just use (>.*<\/). It works perfectly on any online regex tester; however, as soon as I try using it in .NET, it completely messes up, and I end up getting completely unpredictable output. What would be the correct way to do this, in one regex, considering I'm using .NET's System.Text.RegularExpressions?
This is probably because quantifiers in .NET regexes are greedy by default. My suggestion would be to use the non-greedy .*?, or [^<] instead of .:
(>.*?<\/)
(>[^<]*<\/)
That way the match can't run past a <.
You never define what "completely messes up" means, but try this:
(>.*?<\/)
The ? in .*? makes it a non-greedy match. By default, regular expression quantifiers are greedy, meaning they match as much as possible; the non-greedy form matches as little as possible. To see the difference, take an element whose content is "is <a>test</a> of" and match it against both forms: with (>.*<\/) you will match is <a>test</a> of, while with (>.*?<\/) you will match only is <a>test.
If you want to avoid any XML tags in the match, then you should use @ThomasWeller's solution.
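To make the difference concrete, here is a small sketch that runs the three patterns from the answers above against one sample string. The sample (including its outer <b> element) and the names GreedyVsLazyDemo and FirstMatch are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical input; the outer <b> element is made up for this example.
    public const string Sample = "<b>is <a>test</a> of</b>";

    // Returns the text of the first match of the pattern in Sample.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Sample, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch(@"(>.*<\/)"));    // >is <a>test</a> of</
        Console.WriteLine(FirstMatch(@"(>.*?<\/)"));   // >is <a>test</
        Console.WriteLine(FirstMatch(@"(>[^<]*<\/)")); // >test</
    }
}
```

Only the negated character class refuses to cross a tag boundary, which is why it is the safer choice when element content may contain nested tags.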

Why does checking this string with Regex.IsMatch cause CPU to reach 100%?

When using Regex.IsMatch (C#, .NET 4.5) on a specific string, the CPU reaches 100%.
String:
https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/?type=1&permPage=1
Pattern:
^http(s)?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$
Full code:
Regex.IsMatch("https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/?type=1&permPage=1",
    @"^http(s)?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$");
I found that redacting the URL prevents this problem. Redacted URL:
https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792
But still very interested in understanding what causes this.
As nu11p01n73R pointed out, you have a lot of backtracking with your regular expression. That's because several parts of your expression can match the same text, which gives the engine many combinations it has to try before it can report a result.
You can avoid this by changing the regular expression to make individual sections more specific. In your case, the cause is that you wanted to match a real dot but used the match-all character . instead. You should escape it as \. so that it matches only a literal dot.
This should already reduce the backtracking need a lot and make it fast:
^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=])?$
And if you want to actually match the original string, you need to add a quantifier to the character class at the end:
^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]+)?$
                                           ↑ (the added + quantifier)
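As a defensive measure alongside fixing the pattern, .NET 4.5 and later let you pass a match timeout, so a runaway pattern throws instead of pinning the CPU. The sketch below checks the corrected pattern against the URL from the question; the class and method names are hypothetical, and the behaviour of the original pattern (expected to time out) is an assumption that may vary by runtime version.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    public const string Url =
        "https://www.facebook.com/CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/?type=1&permPage=1";

    // Pattern from the question: unescaped dot, no quantifier on the last class.
    public const string Original = @"^http(s)?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$";

    // Corrected pattern from the answer above.
    public const string Fixed = @"^http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]+)?$";

    // Returns true or false for a completed match attempt,
    // or null when the engine gave up after the timeout.
    public static bool? IsMatchWithTimeout(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    public static void Main()
    {
        TimeSpan limit = TimeSpan.FromSeconds(1);

        bool? fixedResult = IsMatchWithTimeout(Url, Fixed, limit);
        Console.WriteLine("fixed:    " + (fixedResult.HasValue ? fixedResult.Value.ToString() : "timed out"));

        // Assumption: the original pattern backtracks long enough to hit the limit.
        bool? originalResult = IsMatchWithTimeout(Url, Original, limit);
        Console.WriteLine("original: " + (originalResult.HasValue ? originalResult.Value.ToString() : "timed out"));
    }
}
```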
I suggest using the http://regexr.com/ website to test your regular expression.
The corrected version of your regular expression is this:
^(https?://(?:[\w]+\.?[\w]+)+[\w]/?)([\w\./]+)(\?[\w-=&%]+)?$
It also has 3 groups:
group1=Main url (for example: facebook.com)
group2=Sub urls (for example: /CashKingPirates/photos/a.197028616990372.62904.196982426994991/1186500984709792/)
group3=Variables (for example: ?type=1&permPage=1)
Also remember that to match a literal dot (.) in your regular expression you must use \. rather than .
Your regex suffers from catastrophic backtracking. You can simply use
^http(s)?://([\w.-])+(/[\w ./?%&=-]+)*$
See demo.
https://regex101.com/r/cK4iV0/15

Matching operators

I have a text input containing lots of operators, variable and English words. From this input I have to separate all the operators alone.
As of now I'm using regular expression matching, so the operators matched depend on the regular expression. The problem I get is that = is matched inside <=, and & is matched inside &&. I need to match = and <= separately.
Is there any better way for matching the operators other than regex?
As far as regex goes, you could have the pattern match the special (compound) cases first and the catch-all last, using simple alternation. For your simple input case: /<=|&&|=|&/. This isn't necessarily terrible; you can still put whatever your catch-all is after that: /special1|...specialN|special-chars-catch-all/
This technique can be useful in cases where a greedy expression would just grab the whole thing. For example, in if($x==-1) you would want ==, not ==-
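A minimal sketch of the ordered-alternation idea, using only the four operators named in the question. The OperatorTokenizer class and FindOperators method are hypothetical names; a real tokenizer would list every operator of the language, longest first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OperatorTokenizer
{
    // Compound operators are listed first so "<=" wins over "=" and "&&" over "&".
    private static readonly Regex Operators = new Regex(@"<=|&&|=|&");

    // Collects every operator found in the input, in order of appearance.
    public static List<string> FindOperators(string input)
    {
        var found = new List<string>();
        foreach (Match m in Operators.Matches(input))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Prints: <= && = &
        Console.WriteLine(string.Join(" ", FindOperators("a <= b && c = d & e")));
    }
}
```

The order matters because the .NET engine tries alternatives left to right and takes the first one that succeeds at the current position.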
Look at the extended variants in your RE language.
In most RE languages /[<](?![=])/ will match "<" but not "<=" and not "=", for example. The (?! ... ) means "except when followed by ...". The term for this is Negative Look-ahead Assertion. These are sometimes spelled differently, as they are less standard than most other constructs, but they are usually available. They never consume characters, but they can make matching slower.
The "except when preceded" or Negative Look-behind Assertion is sometimes also available, but you may wish to avoid it. It is seldom clear to a reader and can create slower matches.
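For what it's worth, the look-ahead pattern from this answer behaves as described in .NET. The sketch below counts bare "<" characters; LookaheadDemo and CountBareLessThan are invented names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // "<" only when the next character is not "=" (negative look-ahead).
    private static readonly Regex BareLessThan = new Regex(@"[<](?![=])");

    // Counts "<" characters that are not the start of "<=".
    public static int CountBareLessThan(string input)
    {
        return BareLessThan.Matches(input).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountBareLessThan("a < b <= c")); // 1
        Console.WriteLine(CountBareLessThan("x<=y"));       // 0
    }
}
```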
There probably is. But as an alternative, you could have your regex as (e.g.):
[><=&|]+
(Modify to your specifications - not sure if you want addition, subtraction, ++ for incrementing etc too).
The + means "one or more" and so the regex matches as many characters as possible, meaning that if <= is in the text, it will match <= rather than < and then =.
Then, only once you've extracted all the matches, loop through them all and classify them.
I think you might still be able to get regex to do what you want.
If you want to completely abandon it, please forgive me and ignore my suggestion :)
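If you want to use regex to detect just = then you could use [^<>=]=[^<>=], which means 'match the equals sign only when it is not preceded or followed by <, > or another ='. Note that this also consumes the neighbouring characters and needs a character on each side.
A quantifier such as &{1} still matches each half of &&, so to detect one (and only one) ampersand you could use look-arounds instead, for example (?<!&)&(?!&).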
(NB you might need to escape a couple of those symbols with \)
I hope that might help. Good luck.
K.
If you do multiple passes, you can also find the compound operators and then replace them with other characters before a pass that finds the simple ones.
This is often a useful approach anyway: to slowly overwrite your interpreted string as it is processed, so that what is left when you are done is just tokens. RE processors often return index ranges. So you can easily go back and overwrite that range with something no one else will match later (like a control-character token, a NUL, or a tilde).
An advantage is that you can then have debug code that does a verification pass to check that you have not left anything around uninterpreted.

How to Check if a String is a "string" or a RegEx?

How can I check whether a string in a textbox is a plain string or a regex?
I'm searching through a text file line by line.
Either by .Contains(Textbox.Text); or by Regex(Textbox.Text) Match(currentLine)
(I know, syntax isn't working like this, it's just for presentation)
Now my Program is supposed to autodetect if Textbox.Text is in form of a RegEx or if it is a normal String.
Any suggestions? Write my own little RegEx to detect if Textbox contains a RegEx?
Edit:
I failed to add that my strings can be very simple, like Foo or 0005. I'm trying the suggested solutions right away!
You can't detect regular expressions with a regular expression, as regular expressions themselves are not a regular language.
However, the easiest thing you could probably do is try to compile a regex from your textbox contents; when it succeeds you know that it's a valid regex. If it fails, you know it's not.
But this would classify ordinary strings like "foo" as regular expressions too. Depending on what you need to do, this may or may not be a problem. If it's a search string, then the results are identical for this case. In the case of "foo.bar" they would differ, though, since it's a valid regex but matches different things than the string itself.
My advice, also stated in another comment, would be that you simply always enable regex search, since splitting the code paths here gains you nothing apart from a dubious performance benefit (which is unlikely to make any difference, if there is much of a benefit at all).
Many strings could be a regex, every regex could actually be a string.
Consider the string "thin.": it could be either a plain string ('.' is a dot) or a regex ('.' is any character).
I would just add a checkbox where the user indicates whether they are entering a regex, as is usual in many applications.
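One possible solution, depending on your definition of string and regex, would be to check whether the string contains any characters that are special in a regex.
You could do something like this:
string s = "NotARegex";
if (s == Regex.Escape(s))
{
    // nothing needed escaping, so s contains no regex metacharacters
}
Note that Regex.Escape also escapes white space and #, so any text containing a space fails this test.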
Try and use it in a regex and see if an exception is thrown.
This approach only checks if it is a valid regex, not whether it was intended to be one.
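The "try it and catch the exception" idea could look roughly like this. RegexValidation and IsValidPattern are hypothetical names; the sketch relies on the Regex constructor throwing ArgumentException (or a subclass of it) for a pattern it cannot parse.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexValidation
{
    // True when the text compiles as a .NET regex. This says nothing about
    // whether the user meant it as one: "Foo" and "0005" are valid patterns too.
    public static bool IsValidPattern(string text)
    {
        try
        {
            new Regex(text);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidPattern("Foo"));       // True
        Console.WriteLine(IsValidPattern("foo.bar"));   // True
        Console.WriteLine(IsValidPattern("(unclosed")); // False
    }
}
```

Since nearly every plain search string also passes, this is probably most useful as a guard before running a regex search, not as a way to auto-detect the user's intent.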
Another approach could be to check whether it is surrounded by slashes (i.e. '/foo/'). Surrounding regexes with slashes is common practice (although you must remove the slashes before feeding the pattern into the regex library).

Improving/Fixing a Regex for C style block comments

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.
On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.
The Regex I'm using is this:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
Any suggestions on why this might get locked up?
Alternatively, what's another Regex I could use instead?
More information:
Working in C# 3.0 targeting .NET 3.5;
I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
I've left the program running for over an hour, but the match isn't completed;
Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
The regex works correctly for 452 of my 453 test files.
Some problems I see with your regex:
There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.
Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters as you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.
Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them for keeps.
For maximum performance, I would use this regex:
/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/
This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.
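Here is a small sketch exercising that pattern on a multi-line comment, an empty comment and an unterminated one. BlockCommentDemo and FirstComment are made-up names, and the sample inputs are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockCommentDemo
{
    // Atomic groups (?>...) stop the engine from re-trying text it has
    // already accepted, so a failed match is abandoned quickly.
    private static readonly Regex Comment =
        new Regex(@"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/");

    // Returns the first block comment in the source, or an empty string.
    public static string FirstComment(string source)
    {
        Match m = Comment.Match(source);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(FirstComment("int x; /* first\n * second **/ int y;"));
        Console.WriteLine(FirstComment("/**/ done"));                    // /**/
        Console.WriteLine(FirstComment("/* never closed").Length);       // 0
    }
}
```

No RegexOptions are needed here because the negated class [^*] already matches line breaks.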
Courtesy of David, here's a version that matches nested comments with any level of nesting:
(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/
It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:
/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/
No no no! Hasn't anyone else read Mastering Regular Expressions (3rd Edition)!? In this, Jeffrey Friedl examines this exact problem and uses it as an example (pages 272-276) to illustrate his "unrolling-the-loop" technique. His solution for most regex engines is like so:
/\*[^*]*\*+(?:[^*/][^*]*\*+)*/
However, if the regex engine is optimized to handle lazy quantifiers (like Perl's is), then the most efficient expression is much simpler (as suggested above):
/\*.*?\*/
(With the equivalent 's' "dot matches all" modifier applied of course.)
Note that I don't use .NET so I can't say which version is faster for that engine.
You may want to try RegexOptions.Singleline rather than Multiline (Multiline only changes how ^ and $ behave); with Singleline the dot also matches line breaks, so you don't need to worry about \r\n. With that enabled the following worked for me with a simple test which included comments that spanned more than one line:
/\*.*?\*/
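A quick sketch of how the option changes the result for this pattern; SinglelineDemo, Matches and the sample text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    public const string Pattern = @"/\*.*?\*/";

    // Reports whether the lazy pattern finds a comment under the given options.
    public static bool Matches(string source, RegexOptions options)
    {
        return Regex.IsMatch(source, Pattern, options);
    }

    public static void Main()
    {
        string source = "/* line one\n   line two */ int x;";
        Console.WriteLine(Matches(source, RegexOptions.None));       // False
        Console.WriteLine(Matches(source, RegexOptions.Multiline));  // False
        Console.WriteLine(Matches(source, RegexOptions.Singleline)); // True
    }
}
```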
I think your expression is way too complicated. Applied to a large string, the many alternatives imply a lot of backtracking. I guess this is the source of the performance hit you see.
If the basic assumption is to match everything from the "/*" until the first "*/" is encountered, then one way to do it would be this (as usual, regex is not suited for nested structures, so nesting block comments does not work):
/\*(.(?!\*/))*.?\*/ // run this in single line (dotall) mode
Essentially this says: "/*", followed by anything that itself is not followed by "*/", followed by "*/".
Alternatively, you can use the simpler:
/\*.*?\*/ // run this in single line (dotall) mode
Non-greedy matching like this has the potential to go wrong in an edge case - currently I can't think of one where this expression might fail, but I'm not entirely sure.
I'm using this at the moment
\/\*[\s\S]*?\*\/
