Don't match string in specific context - c#

For my internship I've been asked to create a tool that generates a regular expression from a few examples. I have it working: it generates multiple regular expressions, sorted by how greedy they are, but I want more.
The regex generator works by replacing parts of a string with regular expression character classes. For example, GOM178 would turn into [A-Z]+178 (letters replaced) or GOM\d+ (numbers replaced). The hard part is combining multiple character classes in one expression. For example, at one point \p{P} is tried as well, and it replaces [],/\- and more. That messes up the other character classes: it would turn [A-Z] into \p{P}A\p{P}Z\p{P}. Replacing \p{P} before [A-Z] wouldn't work either, because that would replace the P in \p{P}, giving \p{[A-Z]}.
I've already tried negative lookaheads, but that didn't work out too well. The only reason it currently works is that I test each result before saving it. This is the regular expression I've used for that:
(?:(?!(?:\[a-z\]|\[A-Z\]|\[a-zA-Z\]|\\d\+\[\\\.,\]\?\\d\*|\\d|\\s|\\p\{P\}|\\w|\\n|\.)(?:\*|\?|\+|\+\?|\*\?)?)(<The character class to match goes here>))
Here is an example of the regex in action: Link to example.
As you can see, it also matches the - and ] in the character class. It should ignore them because they are part of [a-z], which is listed in the negative lookahead.
Long story short, a string should not be replaced when it's in a specific context. Does anyone have an idea on how to fix this, or perhaps a better idea on how to do this?
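One possible way around the problem, sketched here under assumptions rather than taken from the original tool, is to make all replacements in a single pass. Regex.Replace never rescans the text returned by its MatchEvaluator, so a character class that has just been emitted cannot be damaged by a later replacement. The class name Generalizer, the group names, and the fixed set of classes below are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

static class Generalizer
{
    // One alternation with a named group per character class (hypothetical names).
    // The final alternative catches any single character not covered above.
    static readonly Regex Token = new Regex(
        @"(?<upper>[A-Z]+)|(?<lower>[a-z]+)|(?<digit>\d+)|(?<punct>\p{P}+)|(?<other>[\s\S])");

    public static string Generalize(string example)
    {
        // The evaluator's output is not scanned again, so an emitted "[A-Z]+"
        // or "\p{P}+" is safe from every other replacement rule.
        return Token.Replace(example, m =>
        {
            if (m.Groups["upper"].Success) return "[A-Z]+";
            if (m.Groups["lower"].Success) return "[a-z]+";
            if (m.Groups["digit"].Success) return @"\d+";
            if (m.Groups["punct"].Success) return @"\p{P}+";
            return Regex.Escape(m.Value); // everything else stays a literal
        });
    }

    static void Main()
    {
        Console.WriteLine(Generalize("GOM178"));  // [A-Z]+\d+
        Console.WriteLine(Generalize("GOM-178")); // [A-Z]+\p{P}+\d+
    }
}
```

To produce several candidate expressions, as the tool does, the evaluator could return the original (escaped) text for the classes that a given candidate should leave alone.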

Related

RegEx to capture text between two delimiter characters including 'shared'

If I have the following text...
The quick :brown:fox: jumped over the lazy :dog:.
I would like a regular expression to capture all the words that are between 2 : characters. In the above example it should return :brown:, :fox:, :dog:.
So far, I have this (\:{1}.\w*\s*\:{1}) which returns :brown: and :dog:. I can't quite figure out how to share the : between the 2 matching groups so that it will also return ':fox:'.

Here is a simple pattern which can be made to work:
(?<=:)(\w+)(?=:)
This uses lookarounds to make sure that one or more word characters are surrounded before and after by colons. Check the demo below to see it working.
The match would be available as the first capture group. Actually, it should also be available as the entire match itself, because lookarounds do not consume anything.
Demo
I like the above lookaround approach because it is clean and simple (at least in my mind). If, for some reason, you don't want any lookarounds, then just use the following pattern:
:(\w+):
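But note that now you explicitly have to access the first capture group to obtain the matching word without colons on either side. Also note that this pattern consumes both colons, so when two words share a colon, as in :brown:fox:, the second word (fox) is not matched; only the lookaround version returns all three words.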
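A small C# sketch comparing the two patterns on the sentence from the question. The helper name Find is made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ColonWords
{
    // Returns the first capture group of every match of pattern in text.
    public static string[] Find(string pattern, string text) =>
        Regex.Matches(text, pattern).Cast<Match>()
             .Select(m => m.Groups[1].Value).ToArray();

    static void Main()
    {
        const string text = "The quick :brown:fox: jumped over the lazy :dog:.";

        // Lookarounds do not consume the colons, so a colon can be shared.
        Console.WriteLine(string.Join(", ", Find(@"(?<=:)(\w+)(?=:)", text))); // brown, fox, dog

        // Here each match consumes both colons, so fox is skipped.
        Console.WriteLine(string.Join(", ", Find(@":(\w+):", text)));          // brown, dog
    }
}
```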

regex not matching when using ? if first character not present

Here is my c# regex:
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
I am testing here with sample string:
{"RestrictedCompany": "","SQLServerIndex": 0,"SurveyAdmin": false}
This is what I think the regex does:
PART 1: Look for the pattern of " ANYTHING ":
and store ANYTHING (without the quotes).
PART 2: Then look for a : and store everything until you reach a stop character of either " or , or }
It extracts part 1 fine, but doesn't pick up part 2 at all when the " isn't present (i.e. when part 2 isn't a string). So I have two questions:
Why isn't my current code picking up part 2? (and how can I fix it)
Is there a way to make the ANYTHING match more flexible? (I tried using \S but it was too greedy)

First off, don't write your own JSON parser. Use one written by professionals. You're reinventing a rather complex wheel here.
That said, there are also lessons you could learn here about how to write, understand and debug regular expressions, so let's look at that.
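For comparison, the parser route might look roughly like the sketch below. It assumes System.Text.Json is available (it ships with .NET Core 3.0 and later; on older frameworks another JSON library would be needed), and the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class JsonRoute
{
    // Lists every top-level property as name=rawValue, separated by ';'.
    public static string Describe(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            var parts = new List<string>();
            foreach (JsonProperty p in doc.RootElement.EnumerateObject())
                parts.Add(p.Name + "=" + p.Value.GetRawText());
            return string.Join(";", parts);
        }
    }

    static void Main()
    {
        string json = "{\"RestrictedCompany\": \"\",\"SQLServerIndex\": 0,\"SurveyAdmin\": false}";
        Console.WriteLine(Describe(json));
        // RestrictedCompany="";SQLServerIndex=0;SurveyAdmin=false
    }
}
```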
Why isn't my current code picking up part 2? (and how can I fix it)
Learn to reason like the regular expression engine.
Let's take a simpler case. We'll take the expression
\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?
And we will search this string:
{"A": "B"}
for an instance of the regular expression.
OK.
The { doesn't match anything, so skip it.
The first " matches \", so maybe we have a match.
A matches ([a-zA-Z0-9]*), so again, maybe we have a match.
The second " matches the second \", so we're still good.
The : matches :...
We are now trying to match \"?, zero or one quotes. The next character is a space. We match zero quotes.
We are now trying to match ([a-zA-Z0-9]*), any number of alphanumerics. The next character is still that space. Therefore we have zero alphanumerics.
We are now trying to again match \"?, and again the next character is a space, so we match zero.
We are now trying to match ,?, we have zero of them.
We are now trying to match }?, again we have zero of them.
And we're done. We've successfully matched the pattern, and the match is "A":.
Now keep on going; can we match anything in the rest of the string? No. The pattern requires a :, and there is no : in the rest of the string, so I won't labour the point; plainly the match will fail.
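The walkthrough above can be checked with a few lines of C#. This is only a sketch; the class name TraceDemo is made up:

```csharp
using System;
using System.Text.RegularExpressions;

static class TraceDemo
{
    // The pattern from the question, written as a regular C# string literal.
    public const string Pattern = "\"([a-zA-Z0-9]*)\":\"?([a-zA-Z0-9]*)\"?,?}?";

    static void Main()
    {
        foreach (Match m in Regex.Matches("{\"A\": \"B\"}", Pattern))
            Console.WriteLine(
                $"match=[{m.Value}] name=[{m.Groups[1].Value}] value=[{m.Groups[2].Value}]");
        // prints a single line: match=["A":] name=[A] value=[]
    }
}
```

The second group is empty because every element after the colon is optional and the space stops all of them from matching anything.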
If that's not the pattern you wanted to match then write a different pattern. For example, if you want there to be arbitrary whitespace before and after the colon, you probably need a \s* before and after the colon. Also, if you require a value after the : then why did you make everything after the colon optional? "Required" and "optional" are opposites.
So what's the right thing to do here? Again, the right thing to do is to stop trying to solve this problem with regular expressions and use a json parser like a sensible person. But suppose we did want to parse this with regular expressions. How do we do it?
We do it by breaking the problem down into smaller parts.
What do we really want to match? Let's name each thing we want to match and then write a colon, and then say what the structure of that thing is:
DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
OK, break it down. What's a name?
NAME : QUOTE NAMECONTENTS QUOTE
Keep breaking it down.
NAMECONTENTS : any alphanumeric text of any length
Ask yourself: is that true? Is "" a NAME? Is "1234" a NAME? Is "$" a NAME? Refine the pattern until you get it right. We'll go with this for now.
Now here is a hard one:
VALUE : BOOLEAN_LITERAL
VALUE : NUMBER_LITERAL
VALUE : STRING_LITERAL
This can be any of three things. So again, keep breaking it down:
BOOLEAN_LITERAL : true
BOOLEAN_LITERAL : false
Keep going; you can see how to do it from here.
Now make a regular expression for each part and start putting it back together.
The regular expression for NAMECONTENTS is \w*.
The regular expression for QUOTE is \".
Therefore the regular expression for NAME is \"\w*\".
We want to capture the name text so put it in a group: \"(\w*)\"
Great. Similarly:
The regular expression for OPTIONAL_WHITESPACE is \s*.
The regular expression for COLON is :.
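So our regular expression begins \"(\w*)\"\s*:\s*
Now we need to handle VALUE. But we've broken it down. What is the regular expression for BOOLEAN_LITERAL? That's (?:true|false). (Note that square brackets would be wrong here: [true|false] is a character class that matches a single character.)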
Keep going; make a regular expression for the other literals and then build up your regular expression from the leaves to the root.
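Put together, the leaves-to-root construction might look like the sketch below. The boolean part follows the breakdown above; the number and string literal patterns are simplified assumptions (no exponents, no escaped quotes inside strings), and the class and group names are made up:

```csharp
using System;
using System.Text.RegularExpressions;

static class PairRegex
{
    // Leaves of the grammar.
    const string Quote = "\"";
    const string Ws = @"\s*";                              // OPTIONAL_WHITESPACE
    const string Colon = ":";
    const string BooleanLiteral = "true|false";
    const string NumberLiteral = @"-?\d+(?:\.\d+)?";       // assumption: simplified
    const string StringLiteral = Quote + "[^\"]*" + Quote; // assumption: no escapes

    // Inner nodes, built from the leaves.
    const string Name = Quote + @"(?<name>\w*)" + Quote;
    const string Value =
        "(?<value>" + BooleanLiteral + "|" + NumberLiteral + "|" + StringLiteral + ")";

    // DESIRED : NAME OPTIONAL_WHITESPACE COLON OPTIONAL_WHITESPACE VALUE
    public static readonly Regex Pair = new Regex(Name + Ws + Colon + Ws + Value);

    static void Main()
    {
        string json = "{\"RestrictedCompany\": \"\",\"SQLServerIndex\": 0,\"SurveyAdmin\": false}";
        foreach (Match m in Pair.Matches(json))
            Console.WriteLine(m.Groups["name"].Value + " -> " + m.Groups["value"].Value);
        // RestrictedCompany -> ""
        // SQLServerIndex -> 0
        // SurveyAdmin -> false
    }
}
```

Keeping each piece in its own named constant makes it easy to refine one rule (say, STRING_LITERAL) without touching the rest.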

parsing a method Signature using regular expressions

I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
so using the regular expressions, I am doing the following
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty string at the beginning of the array if the input starts with a method in the given format, and the same happens if a method comes at the end. Also, if two methods appear in the string, it catches only the first one! Why is that?
I think what keeps the parser from catching two methods is the ".+" in my pattern. What I wanted to say is that there will be a number or a date in that location, so I told it there will be a sequence of any characters. Is that wrong?
It worked for me =D ... I replaced ".+" with ".+?", which means as few as possible of any characters ;)

Your goal is quite unclear to me. What do you want as result? If you split on that method pattern, you will get the part before your pattern and the part after your pattern in an array, but not the method itself.
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes. When you write e.g. [\\[ ], this construct will match either a [ or a space. If you want to allow optional space after the [ (would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading @) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
You can simplify your regex by avoiding repeated parts:
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
Here (?:\s*,\s*[A-Za-z0-9 ]+){2} is a non-capturing group repeated two times.
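To see the greedy/lazy difference in isolation, here is a small sketch that counts matches instead of splitting. The two-method input string is invented for the example, and the leading class is narrowed to [A-Za-z]+ (no space) so that the word "and" between the calls is not swallowed into the second match; both are assumptions, not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

static class MethodCalls
{
    // Everything up to the last argument, and everything after it.
    const string Head =
        @"[A-Za-z]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*";
    const string Tail = @"\)\s*\]";

    // Counts method calls using either the lazy (.+?) or the greedy (.+) quantifier.
    public static int Count(string text, bool lazy) =>
        Regex.Matches(text, Head + (lazy ? ".+?" : ".+") + Tail).Count;

    static void Main()
    {
        string text = "mvAddSell[value, type1, reference(Moving, 60)] and "
                    + "mvAddSell[value, type2, reference(Moving, 30)]";

        Console.WriteLine(Count(text, lazy: false)); // 1: .+ runs to the last ")]"
        Console.WriteLine(Count(text, lazy: true));  // 2: .+? stops at the first ")]"
    }
}
```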

Improving/Fixing a Regex for C style block comments

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.
On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.
The Regex I'm using is this:
/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
Any suggestions on why this might get locked up?
Alternatively, what's another Regex I could use instead?
More information:
Working in C# 3.0 targeting .NET 3.5;
I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
I've left the program running for over an hour, but the match isn't completed;
Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
The regex works correctly for 452 of my 453 test files.

Some problems I see with your regex:
There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.
Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters as you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.
Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them for keeps.
For maximum performance, I would use this regex:
/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/
This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.
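A short sketch of the atomic-group regex in use. The sample source text and the names BlockComments and Find are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class BlockComments
{
    // The atomic-group pattern from above, as a verbatim string.
    static readonly Regex Comment = new Regex(@"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/");

    public static string[] Find(string source) =>
        Comment.Matches(source).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string source =
            "int a; /* first * comment **/ int b; /* second\nline */ /* never closed";

        foreach (string c in Find(source))
            Console.WriteLine(c.Replace("\n", "\\n"));
        // /* first * comment **/
        // /* second\nline */
        // The unterminated comment at the end is not reported.
    }
}
```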
Courtesy of David, here's a version that matches nested comments with any level of nesting:
(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/
It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:
/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/

No no no! Hasn't anyone else read Mastering Regular Expressions (3rd Edition)!? In this, Jeffrey Friedl examines this exact problem and uses it as an example (pages 272-276) to illustrate his "unrolling-the-loop" technique. His solution for most regex engines is like so:
/\*[^*]*\*+(?:[^*/][^*]*\*+)*/
However, if the regex engine is optimized to handle lazy quantifiers (like Perl's is), then the most efficient expression is much simpler (as suggested above):
/\*.*?\*/
(With the equivalent 's' "dot matches all" modifier applied of course.)
Note that I don't use .NET so I can't say which version is faster for that engine.
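For anyone who wants to try both on .NET, a minimal sketch that runs the unrolled pattern and the lazy pattern (with RegexOptions.Singleline as the "dot matches all" switch) over the same made-up input and prints what each finds. It only checks that they agree; it says nothing about which is faster:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CommentPatterns
{
    public static readonly Regex Unrolled =
        new Regex(@"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/");
    public static readonly Regex LazyDot =
        new Regex(@"/\*.*?\*/", RegexOptions.Singleline);

    public static string[] Find(Regex regex, string source) =>
        regex.Matches(source).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string source = "int a; /* first * comment **/ int b; /* second\nline */";

        // Both lines print the same two comments, joined by " | ".
        Console.WriteLine(string.Join(" | ", Find(Unrolled, source)).Replace("\n", "\\n"));
        Console.WriteLine(string.Join(" | ", Find(LazyDot, source)).Replace("\n", "\\n"));
    }
}
```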

You may want to try the option Singleline rather than Multiline, then you don't need to worry about \r\n. With that enabled the following worked for me with a simple test which included comments that spanned more than one line:
/\*.*?\*/

I think your expression is way too complicated. Applied to a large string, the many alternatives imply a lot of backtracking. I guess this is the source of the performance hit you see.
If the basic assumption is to match everything from the "/*" until the first "*/" is encountered, then one way to do it would be this (as usual, regex is not suited for nested structures, so nesting block comments does not work):
/\*(.(?!\*/))*.?\*/ // run this in single line (dotall) mode
Essentially this says: "/*", followed by anything that itself is not followed by "*/", followed by "*/".
Alternatively, you can use the simpler:
/\*.*?\*/ // run this in single line (dotall) mode
Non-greedy matching like this has the potential to go wrong in an edge case - currently I can't think of one where this expression might fail, but I'm not entirely sure.

I'm using this at the moment
\/\*[\s\S]*?\*\/

Difficulty with Simple Regex (match prefix/suffix)

I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil

Your look-behinds are the problem. Here's how the string is being processed:
We see [ in the string, and it matches the regex.
Look-behind in regex asks us to see if the previous character was a '('. This fails, because it was a '['.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
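A quick check of both patterns against the input from the question, as a sketch; the helper name FirstMatch is made up:

```csharp
using System;
using System.Text.RegularExpressions;

static class BracketDemo
{
    // Returns the text of the first match, or null when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        const string input = "[(foo)]";

        // After \[ is consumed, the lookbehind demands that the previous
        // character be '(', but it is '[', so the match fails.
        Console.WriteLine(FirstMatch(@"\[(?<=\()\w+(?=\))\]", input) ?? "no match"); // no match

        // With the brackets moved inside the lookarounds, only foo is matched.
        Console.WriteLine(FirstMatch(@"(?<=\[\()\w+(?=\)\])", input) ?? "no match"); // foo
    }
}
```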

Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: lookarounds are known as "zero-width assertions". They check that what follows or precedes the current position conforms to what is expected, but they don't consume any of the string, so whatever comes after a lookahead (or before a lookbehind) in the pattern has to match that same text as well. E.g. if you put something after a positive lookahead that doesn't match what is asserted, the RE is sure to fail. In your pattern the lookbehind (?<=\() comes right after \[, so the character before the current position would have to be both [ and (, which is impossible.
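The capture-group alternative mentioned above could look like this in C#. It is only a sketch; extending \((\w+)\) with the surrounding square brackets is an assumption about what the asker wants, and the names are made up:

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Matches the whole "[(word)]" and exposes the word as group 1.
    public static string Inner(string input)
    {
        Match m = Regex.Match(input, @"\[\((\w+)\)\]");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Inner("[(foo)]"));             // foo
        Console.WriteLine(Inner("(foo)") ?? "no match"); // no match: brackets required
    }
}
```

Unlike the lookaround version, the overall match here includes the delimiters, so the word has to be read from the capture group.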
