Regex alternation construct eats part of previous group - c#

I am trying to fashion a Regex to capture a function-style argument list, which should be straightforward enough, but I'm encountering a behaviour I don't understand.
In the snippet below the first example behaves as you would expect, capturing the function name into the first group and the argument list into the second group.
In the second example I want to replace the 'zero or more' quantifier that captures the argument list with the 'one or more' quantifier so that the second group will fail if there are no arguments. I'm expecting the regex to capture just the function name, but for some reason the regex is eating the '1' off the end of the function name, and I cannot for the life of me see why it would be doing that. Can anyone see what's going wrong with this please?
// {func1} {blah, blah, blah}
Match m13 = Regex.Match("func1(blah, blah, blah)", @"(\w+) (?([(]) [(]([^)]*) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
// {func}
Match m14 = Regex.Match("func1()", @"(\w+) (?([(]) [(]([^)]+) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

Your expression can be adjusted to:
(\w+) (?([(]) [(]([^)]*) )
That is, use * rather than + after [^)].
The reason the expression returns an unexpected result relates to backtracking. The regex engine effectively takes the following steps:
1. (\w+) matches func1.
2. func1 is immediately followed by a (, matching the zero-width expression in the conditional matching construct.
3. The conditional construct requires a ( literal followed by one or more characters that are not ). This condition fails for the input func1(), since there are zero characters between ( and ).
4. The engine backtracks to step (1) and removes a character, so that (\w+) now matches func instead of func1.
5. func is immediately followed by a 1, which does not satisfy the zero-width expression in the conditional matching construct.
6. Since the conditional matching construct does not match and there is no alternate expression, the regex completes successfully with func in the first captured group, and no match in the second captured group.
The issue emerges in Step 3, where the expression fails to allow () as a legal argument list. Adjusting the expression to allow zero characters between the opening and closing parentheses (as illustrated above) allows this sequence. An expression such as ^(\w+)(?:\((.*)\))?$ may also address the underlying issue without the need for a conditional construct.
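The backtracking described above can be reproduced with a short console sketch. The class and method names here are hypothetical, chosen only for illustration; the patterns are the ones from the question and from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalBacktrackDemo
{
    private const RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase;

    // Returns what group 1 (the function name) captured for the given pattern.
    public static string NameGroup(string pattern, string input) =>
        Regex.Match(input, pattern, Options).Groups[1].Value;

    // Matches with the alternative pattern that avoids the conditional construct.
    public static Match MatchCall(string input) =>
        Regex.Match(input, @"^(\w+)(?:\((.*)\))?$");

    public static void Main()
    {
        // '+' version: the engine gives back the '1' so the condition no longer applies.
        Console.WriteLine(NameGroup(@"(\w+) (?([(]) [(]([^)]+) )", "func1()")); // func
        // '*' version: an empty argument list is allowed, so nothing is given back.
        Console.WriteLine(NameGroup(@"(\w+) (?([(]) [(]([^)]*) )", "func1()")); // func1

        // Alternative pattern: group 2 only participates when parentheses are present.
        Console.WriteLine(MatchCall("func1").Groups[2].Success);                  // False
        Console.WriteLine(MatchCall("func1(blah, blah, blah)").Groups[2].Value);  // blah, blah, blah
    }
}
```

With the alternative pattern, func1() still succeeds with an empty second group, so checking Groups[2].Value.Length is one way to tell "no arguments" apart from "no parentheses at all".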

Regular Expression get text between braces including other braces

I have a "main"-string like:
((Gripper|Open==true OR RIT|Turning==false) AND Robot|PosX >=3 OR (Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)))
I want to get three sub strings in the best case:
1: (Gripper|Open==true OR RIT|Turning==false)
2: Robot|PosX >=3
3: (Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false))
But only two (the one in braces [1,3]) would be fine too, since they can be replaced in the main-string, getting the 3rd[2] as a result.
Ideally with the help of regex.
All the sub strings go into a class as children so I can apply the regex for each child and get their sub strings as well.
1: Test|Close==false
2: (Gripper|Open==false AND RIT|Turning==false)
For child number three (where the first result without the braces would be optional again).
I tried something similar to Regular expression to extract text between braces and putting positions of the matches onto a stack, but not with the expected results.
The best regex I found so far is
([^()]+(?:[^()]+)+) or
([^()]+(?:)+)
(seriously, regex is powerful, but I have no idea what the above statements really do) which gives me
1. Gripper|Open == true OR RIT|Turning==false
2. AND Robot|PosX >=3 OR
3. Test|Close==false OR
4. Gripper|Open==false AND RIT|Turning==false
But still, 3+4 should be in only one group as
Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)
Does anyone know how to achieve this?
It seems like you are looking for balanced parentheses where the matches start with 2 words divided by a pipe, followed by a comparison operator of one or two characters (such as == or >=).
In C# you might match either the balanced parentheses or a pattern that does not contain them, using an alternation.
(?:\(\w+\|\w+\s*[<>!=]{1,2}[^()]*(?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)|\w+\|\w+\s*[<>!=]{1,2}\S+)
(?: Non-capturing group
\(\w+\|\w+\s* Match ( then 2 words divided by a pipe and 0+ whitespace chars
[<>!=]{1,2}[^()]* Match 1 or 2 operator chars, then 0+ chars other than ( and )
(?> Atomic group
[^()]+ Match 1+ chars other than ( and )
| Or
(?<o>)\( Push a capture onto the stack of group o, then match (
| Or
(?<-o>)\) Pop a capture from the stack of group o (fails if the stack is empty), then match )
)* Close atomic group and repeat 0+ times
(?(o)(?!)|)\) Conditional: if group o still has a capture (unbalanced), fail; otherwise match the closing )
| Or
\w+\|\w+\s*[<>!=]{1,2}\S+ Match 2 words divided by a pipe, 0+ whitespace chars, 1 or 2 operator chars and 1+ non-whitespace chars
) Close non-capturing group
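As a rough check of the pattern explained above, the following sketch runs it against the main string from the question. The class and method names are hypothetical; only the pattern and the input come from this thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BalancedTermsDemo
{
    // Pattern copied from the answer: a balanced (...) term, or a bare "word|word op value" term.
    private const string Pattern =
        @"(?:\(\w+\|\w+\s*[<>!=]{1,2}[^()]*(?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)|\w+\|\w+\s*[<>!=]{1,2}\S+)";

    // Returns every match in order of appearance.
    public static string[] TopLevelTerms(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string input =
            "((Gripper|Open==true OR RIT|Turning==false) AND Robot|PosX >=3 OR " +
            "(Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)))";

        foreach (string term in TopLevelTerms(input))
        {
            Console.WriteLine(term);
        }
    }
}
```

For this input the sketch should print the three sub strings asked for in the question, one per line. The outermost pair of parentheses is skipped because the first alternative requires a word character directly after the opening parenthesis.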
You may try with that:
(?<=\))(?!\()[^()]+|\((?!\()[^)]+\)
Explanation:
(?<=\))(?!\()[^()]+ OR \((?!\()[^)]+\)
The first part before 'OR' basically matches AND Robot|PosX >=3 OR
(?<=\)) positive lookbehind: matches only if the previous character is )
(?!\() negative lookahead: matches only if the next character is not (
[^()]+ matches one or more characters that are neither ( nor ).
The last part, after OR, matches a ( that is not followed by another (, then everything up to and including the next ).

Regex.Matches throws exception for regex formula c# [duplicate]

I am trying to create a .NET RegEx expression that will properly balance out my parenthesis. I have the following RegEx expression:
func([a-zA-Z_][a-zA-Z0-9_]*)\(.*\)
The string I am trying to match is this:
"test -> funcPow((3),2) * (9+1)"
What should happen is Regex should match everything from funcPow until the second closing parenthesis. It should stop after the second closing parenthesis. Instead, it is matching all the way to the very last closing parenthesis. RegEx is returning this:
"funcPow((3),2) * (9+1)"
It should return this:
"funcPow((3),2)"
Any help on this would be appreciated.
Regular Expressions can definitely do balanced parentheses matching. It can be tricky, and requires a couple of the more advanced Regex features, but it's not too hard.
Example:
var r = new Regex(@"
    func([a-zA-Z_][a-zA-Z0-9_]*) # The func name
    \(                           # First '('
    (?:
        [^()]                    # Match all non-braces
        |
        (?<open> \( )            # Match '(', and capture into 'open'
        |
        (?<-open> \) )           # Match ')', and delete the 'open' capture
    )+
    (?(open)(?!))                # Fails if 'open' stack isn't empty!
    \)                           # Last ')'
", RegexOptions.IgnorePatternWhitespace);
Balanced matching groups have a couple of features, but for this example, we're only using the capture deleting feature. The line (?<-open> \) ) will match a ) and delete the previous "open" capture.
The trickiest line is (?(open)(?!)), so let me explain it. (?(open) is a conditional expression that only matches if there is an "open" capture. (?!) is a negative expression that always fails. Therefore, (?(open)(?!)) says "if there is an open capture, then fail".
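To see the balancing group and the (?(open)(?!)) check in action, here is a small harness around the same pattern. The class and method names are hypothetical, and the unbalanced input is an extra case not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedCallDemo
{
    private static readonly Regex Call = new Regex(@"
        func([a-zA-Z_][a-zA-Z0-9_]*)  # function name
        \(                            # first opening parenthesis
        (?:
            [^()]                     # any character that is not a parenthesis
          | (?<open> \( )             # opening parenthesis: push onto 'open'
          | (?<-open> \) )            # closing parenthesis: pop from 'open'
        )+
        (?(open)(?!))                 # fail while 'open' still holds a capture
        \)                            # last closing parenthesis
    ", RegexOptions.IgnorePatternWhitespace);

    // Returns the first balanced call, or an empty string when nothing matches.
    public static string FirstCall(string input) => Call.Match(input).Value;

    public static void Main()
    {
        Console.WriteLine(FirstCall("test -> funcPow((3),2) * (9+1)")); // funcPow((3),2)
        Console.WriteLine(FirstCall("test -> funcPow((3),2").Length);   // 0, no balanced call
    }
}
```

Because the inner group is quantified with +, this pattern needs at least one character between the outer parentheses.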
Microsoft's documentation was pretty helpful too.
Using balanced groups, it is:
Regex rx = new Regex(@"func([a-zA-Z_][a-zA-Z0-9_]*)\(((?<BR>\()|(?<-BR>\))|[^()]*)+\)");
var match = rx.Match("funcPow((3),2) * (9+1)");
var str = match.Value; // funcPow((3),2)
(?<BR>\()|(?<-BR>\)) are a Balancing Group (the BR I used for the name is for Brackets). It's more clear in this way (?<BR>\()|(?<-BR>\)) perhaps, so that the \( and \) are more "evident".
If you really hate yourself (and the world/your fellow co-programmers) enough to use these things, I suggest using the RegexOptions.IgnorePatternWhitespace and "sprinkling" white space everywhere :-)
Regular Expressions only work on Regular Languages. This means that a regular expression can find things of the sort "any combination of a's and b's".(ab or babbabaaa etc) But they can't find "n a's, one b, n a's".(a^n b a^n) Regular expressions can't guarantee that the first set of a's matches the second set of a's.
Because of this, they aren't able to match equal numbers of opening and closing parentheses. It would be easy enough to write a function that traverses the string one character at a time. Have two counters, one for opening parentheses and one for closing. Increment the counters as you traverse the string; at the end, if opening_paren_count != closing_paren_count, return false.
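A minimal sketch of that counting function might look like the following. The names are hypothetical, and the early return when a closing parenthesis appears before its opening one is an assumption added here; the description above only compares the two counters.

```csharp
using System;

public static class ParenCounter
{
    // Returns true when every '(' has a matching ')' later in the text.
    public static bool IsBalanced(string text)
    {
        int openingCount = 0;
        int closingCount = 0;

        foreach (char c in text)
        {
            if (c == '(') openingCount++;
            else if (c == ')') closingCount++;

            // Assumption beyond the description above: reject a ')' that has no '(' before it.
            if (closingCount > openingCount) return false;
        }

        return openingCount == closingCount;
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("funcPow((3),2)")); // True
        Console.WriteLine(IsBalanced("funcPow((3),2"));  // False
        Console.WriteLine(IsBalanced(")("));             // False
    }
}
```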
func[a-zA-Z0-9_]*\((([^()])|(\([^()]*\)))*\)
You can use that, but if you're working with .NET, there may be better alternatives.
This part you already know:
func[a-zA-Z0-9_]*\( --weird part-- \)
The --weird part-- just means: allow either any single character or (the |) any parenthesized section, repeated as many times as needed (the trailing *). The only issue is that you can't use . for "any character"; you have to use [^()] to exclude the parentheses. Note that this handles only one level of nested parentheses.
(([^()])|(\([^()]*\)))*

How do I find a match which has already been captured by another match?

How can I replace all occurrences of matches in a string if some parts have already been captured:
E.g. Given the pattern "AB|BC" and the target "ABC" we match "AB" but not "BC"
I've been trying to understand the various regex grouping options (Grouping Constructs in Regular Expressions) without success. I'm probably barking up the wrong tree. :-(
var test = Regex.Replace("(AB)(BC)(AC)(ABC)", @"AB|BC", string.Empty);
In the example, test evaluates to "()()(AC)(C)", but what I actually want is "()()(AC)()"
Without taking care of the parentheses, you could use an alternation with an optional character, using the question mark.
Match AB with an optional C or Match an optional A followed by BC. In the replacement use an empty string.
ABC?|A?BC
Including the parenthesis you might use a capturing group or lookarounds to assert what is on the left and on the right are opening and closing parenthesis.
(?<=\()(?:ABC?|A?BC)(?=\))
Explanation
(?<=\() Assert what is on the left is (
(?: Non capturing group
ABC? Match AB with optional C
| Or
A?BC Match optional A and BC
) Close non capturing group
(?=\)) Assert what is on the right is )
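The two patterns can be compared against the original AB|BC on the sample string from the question. This is only a sketch; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverlapReplaceDemo
{
    // Removes every match of the pattern from the input.
    public static string Strip(string input, string pattern) =>
        Regex.Replace(input, pattern, string.Empty);

    public static void Main()
    {
        const string input = "(AB)(BC)(AC)(ABC)";

        Console.WriteLine(Strip(input, @"AB|BC"));                       // ()()(AC)(C)
        Console.WriteLine(Strip(input, @"ABC?|A?BC"));                   // ()()(AC)()
        Console.WriteLine(Strip(input, @"(?<=\()(?:ABC?|A?BC)(?=\))"));  // ()()(AC)()
    }
}
```

The lookaround version only removes a match that fills the whole parenthesized group, so for example "(XAB)" would be left untouched by it.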
In order to consume the overlap's buddy, it has to be matched.
Therefore, one side of the alternation has to include its buddy's last
or first literal (it doesn't have to be both).
AB|BC ~ ABC?|BC = A?BC|AB

Regex match one digit or two

If this
(°[0-5])
matches °4
and this
((°[0-5][0-9]))
matches °44
Why does this
((°[0-5])|(°[0-5][0-9]))
match °4 but not °44?
Because with alternation (logical OR) the regex engine tries the alternatives from left to right and accepts the first one that matches. Here the first alternative, °[0-5], matches the °4 in °44, so the engine returns °4 and never tries the second alternative, °[0-5][0-9]:
((°[0-5])|(°[0-5][0-9]))
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
You are using shorter match first in regex alternation. Better use this regex to match both strings:
°[0-5][0-9]?
Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression
(a|ab|abc)
when fed this input:
abcdefghi
will only ever match a. However, if the regular expression is changed to
(a|ab|abc)d
It will first match a. Since the next character is not d, it backtracks and tries the next alternative, matching ab. Since the next character is still not d, it backtracks again and matches abc, and since the next character is then d, the match succeeds.
Why would you not reduce your regular expression from
((°[0-5])|(°[0-5][0-9]))
to this?
°[0-5][0-9]?
It's simpler and easier to understand.
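The ordering behaviour described in these answers can be checked with a few calls. This is a sketch with hypothetical names; the degree sign is written as \u00B0 only to keep the source file ASCII.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Returns the text of the first match, or an empty string when there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // The first alternative that succeeds wins, even if a later one is longer.
        Console.WriteLine(FirstMatch("abcdefghi", "(a|ab|abc)"));   // a
        // A trailing d forces backtracking through the alternatives until abc fits.
        Console.WriteLine(FirstMatch("abcdefghi", "(a|ab|abc)d"));  // abcd

        string input = "\u00B044";
        // Shorter alternative first: only the degree sign and one digit are matched.
        Console.WriteLine(FirstMatch(input, "((\u00B0[0-5])|(\u00B0[0-5][0-9]))").Length); // 2
        // Optional second digit: the whole input is matched.
        Console.WriteLine(FirstMatch(input, "\u00B0[0-5][0-9]?").Length);                  // 3
    }
}
```

Swapping the alternatives so that the longer one comes first would also match both digits, but the optional-digit form is shorter.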
