Regex.Matches throws exception for regex formula c# [duplicate] - c#

I am trying to create a .NET RegEx expression that will properly balance out my parenthesis. I have the following RegEx expression:
func([a-zA-Z_][a-zA-Z0-9_]*)\(.*\)
The string I am trying to match is this:
"test -> funcPow((3),2) * (9+1)"
What should happen is Regex should match everything from funcPow until the second closing parenthesis. It should stop after the second closing parenthesis. Instead, it is matching all the way to the very last closing parenthesis. RegEx is returning this:
"funcPow((3),2) * (9+1)"
It should return this:
"funcPow((3),2)"
Any help on this would be appreciated.

Regular Expressions can definitely do balanced parentheses matching. It can be tricky, and requires a couple of the more advanced Regex features, but it's not too hard.
Example:
var r = new Regex(@"
func([a-zA-Z_][a-zA-Z0-9_]*) # The func name
\( # First '('
(?:
[^()] # Match all non-braces
|
(?<open> \( ) # Match '(', and capture into 'open'
|
(?<-open> \) ) # Match ')', and delete the 'open' capture
)+
(?(open)(?!)) # Fails if 'open' stack isn't empty!
\) # Last ')'
", RegexOptions.IgnorePatternWhitespace);
Balancing groups have a couple of features, but for this example, we're only using the capture-deleting feature. The line (?<-open> \) ) will match a ) and delete the most recent "open" capture.
The trickiest line is (?(open)(?!)), so let me explain it. (?(open) is a conditional expression that only matches if there is an "open" capture. (?!) is an empty negative lookahead, which always fails. Therefore, (?(open)(?!)) says "if there is an open capture, then fail".
Microsoft's documentation was pretty helpful too.
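To see the pattern above in action, here is a minimal sketch that runs it against the string from the question. The class name BalancedCallDemo and the helper FirstCall are hypothetical names introduced only for illustration; the pattern itself is the one from this answer, with the pattern comments reworded.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedCallDemo
{
    // Same pattern as in the answer, written as a verbatim string (@"...").
    static readonly Regex CallPattern = new Regex(@"
        func([a-zA-Z_][a-zA-Z0-9_]*)  # The func name
        \(                            # Opening parenthesis of the call
        (?:
            [^()]                     # Any character that is not a parenthesis
          |
            (?<open> \( )             # Nested opener: push a capture onto 'open'
          |
            (?<-open> \) )            # Nested closer: pop one capture from 'open'
        )+
        (?(open)(?!))                 # Fail if 'open' still holds a capture
        \)                            # Closing parenthesis of the call
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the first balanced call,
    // or an empty string when nothing matches.
    public static string FirstCall(string input)
    {
        return CallPattern.Match(input).Value;
    }

    static void Main()
    {
        // Prints: funcPow((3),2)
        Console.WriteLine(FirstCall("test -> funcPow((3),2) * (9+1)"));
    }
}
```

Because the loop uses )+ rather than )*, a call with an empty argument list such as funcFoo() would not match; changing the quantifier to * would be one way to allow it.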

Using balanced groups, it is:
Regex rx = new Regex(@"func([a-zA-Z_][a-zA-Z0-9_]*)\(((?<BR>\()|(?<-BR>\))|[^()]*)+\)");
var match = rx.Match("funcPow((3),2) * (9+1)");
var str = match.Value; // funcPow((3),2)
(?<BR>\()|(?<-BR>\)) are a Balancing Group (the BR I used for the name is for Brackets). It's more clear in this way (?<BR>\()|(?<-BR>\)) perhaps, so that the \( and \) are more "evident".
If you really hate yourself (and the world/your fellow co-programmers) enough to use these things, I suggest using the RegexOptions.IgnorePatternWhitespace and "sprinkling" white space everywhere :-)

Regular expressions, in the formal sense, only work on regular languages. This means that a regular expression can find things of the sort "any combination of a's and b's" (ab or babbabaaa etc.), but it can't find "n a's, one b, n a's" (a^n b a^n). A regular expression can't guarantee that the first run of a's is the same length as the second.
Because of this, they aren't able to match equal numbers of opening and closing parentheses. It would be easy enough to write a function that traverses the string one character at a time. Keep two counters, one for opening parens and one for closing parens, and increment them as you traverse the string; if opening_paren_count != closing_paren_count at the end, return false.
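A character-by-character scan along these lines might look like the sketch below. It is not the exact two-counter version described above: it uses a single depth counter (incremented on an opening paren, decremented on a closing one), which is equivalent for this purpose and also tells you where the call ends. The names ParenScanner and FindMatchingParen are made up for this example, and the method assumes openIndex points at an opening paren.

```csharp
using System;

static class ParenScanner
{
    // Returns the index of the ')' that closes the '(' at openIndex,
    // or -1 if the parentheses never balance.
    // Assumption: text[openIndex] is '('.
    public static int FindMatchingParen(string text, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < text.Length; i++)
        {
            if (text[i] == '(')
            {
                depth++;
            }
            else if (text[i] == ')')
            {
                depth--;
                if (depth == 0) return i; // back at the outermost level
            }
        }
        return -1;
    }

    static void Main()
    {
        string input = "test -> funcPow((3),2) * (9+1)";
        int start = input.IndexOf("func", StringComparison.Ordinal);
        int open = input.IndexOf('(', start);
        int close = FindMatchingParen(input, open);
        // Prints: funcPow((3),2)
        Console.WriteLine(input.Substring(start, close - start + 1));
    }
}
```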

func[a-zA-Z0-9_]*\((([^()])|(\([^()]*\)))*\)
You can use that, but if you're working with .NET, there may be better alternatives.
This part you already know:
func[a-zA-Z0-9_]*\( --weird part-- \)
The --weird part-- just means: allow either any single character, or (|) any parenthesized section, to occur as many times as it wants (the trailing *). The only catch is that you can't use . for "any character"; you have to use [^()] to exclude the parentheses. Note that this handles only one level of nested parentheses.
(([^()])|(\([^()]*\)))*
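As a rough check of this non-balancing pattern, the sketch below runs it on the string from the question and on an input with deeper nesting. OneLevelDemo and FirstCall are hypothetical names used only here.

```csharp
using System;
using System.Text.RegularExpressions;

static class OneLevelDemo
{
    // Pattern from the answer: plain characters, or one parenthesized
    // section that itself contains no parentheses.
    static readonly Regex OneLevel =
        new Regex(@"func[a-zA-Z0-9_]*\((([^()])|(\([^()]*\)))*\)");

    // Returns the first match, or an empty string when nothing matches.
    public static string FirstCall(string input)
    {
        return OneLevel.Match(input).Value;
    }

    static void Main()
    {
        // One level of nesting works. Prints: funcPow((3),2)
        Console.WriteLine(FirstCall("test -> funcPow((3),2) * (9+1)"));

        // Deeper nesting is not matched at all. Prints: 0
        Console.WriteLine(FirstCall("funcA(((1)))").Length);
    }
}
```

If the input can nest arbitrarily deep, the balancing-group approach or a manual scan is the safer choice.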

Related

Regular Expression get text between braces including other braces

I have a "main"-string like:
((Gripper|Open==true OR RIT|Turning==false) AND Robot|PosX >=3 OR (Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)))
I want to get three sub strings in the best case:
1: (Gripper|Open==true OR RIT|Turning==false)
2: Robot|PosX >=3
3: (Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false))
But only two (the one in braces [1,3]) would be fine too, since they can be replaced in the main-string, getting the 3rd[2] as a result.
Ideally with the help of regex.
All the sub strings go into a class as children so I can apply the regex for each child and get their sub strings as well.
1: Test|Close==false
2: (Gripper|Open==false AND RIT|Turning==false)
These are for child number three (where the first result, without the braces, would again be optional).
I tried something similar to Regular expression to extract text between braces and putting positions of the matches onto a stack, but not with the expected results.
The best regex I found so far is
([^()]+(?:[^()]+)+) or
([^()]+(?:)+)
(seriously, regex is powerful, but I have no idea what the above statements really do) which gives me
1. Gripper|Open == true OR RIT|Turning==false
2. AND Robot|PosX >=3 OR
3. Test|Close==false OR
4. Gripper|Open==false AND RIT|Turning==false
But still, 3+4 should be in only one group as
Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)
Does anyone know how to achieve this?
It seems like you are looking for balanced parentheses where each match starts with two words divided by a pipe, followed by a comparison operator.
In C# you might match either the balanced parentheses or a pattern that does not contain them, using an alternation.
(?:\(\w+\|\w+\s*[<>!=]{1,2}[^()]*(?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)|\w+\|\w+\s*[<>!=]{1,2}\S+)
(?: Non capture group
\(\w+\|\w+\s* Match ( then 2 words divided by a pipe and 0+ whitespace chars
[<>!=]{1,2}[^()]* Match any of the operators and match any char except ()
(?> Atomic group
[^()]+ Match 1+ times any char except ()
| Or
(?<o>)\( Add to stack
| Or
(?<-o>)\) Remove from stack
)* Close atomic group and repeat 0+ times
(?(o)(?!)|)\) Conditional: if group o still holds a capture, fail with (?!), otherwise match nothing; then match the closing )
| Or
\w+\|\w+\s*[<>!=]{1,2}\S+ Match 2 words divided by a pipe and match operators
) Close non capture group
Regex demo
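A small sketch of how this pattern could be applied from C#, using the main string from the question. The names ConditionSplitter and Split are hypothetical; the pattern is copied unchanged from this answer. For the sample string this should yield the three top-level parts listed in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ConditionSplitter
{
    // Either a balanced parenthesized condition, or a bare "word|word op value" condition.
    static readonly Regex Part = new Regex(
        @"(?:\(\w+\|\w+\s*[<>!=]{1,2}[^()]*(?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)|\w+\|\w+\s*[<>!=]{1,2}\S+)");

    // Collects every match in order of appearance.
    public static List<string> Split(string expression)
    {
        var parts = new List<string>();
        foreach (Match m in Part.Matches(expression))
        {
            parts.Add(m.Value);
        }
        return parts;
    }

    static void Main()
    {
        string main = "((Gripper|Open==true OR RIT|Turning==false) AND Robot|PosX >=3 OR "
                    + "(Test|Close==false OR (Gripper|Open==false AND RIT|Turning==false)))";
        foreach (string part in Split(main))
        {
            Console.WriteLine(part);
        }
    }
}
```

To descend into a child, one option is to strip its outer parentheses and call Split again on the remainder.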
You may try with that:
(?<=\))(?!\()[^()]+|\((?!\()[^)]+\)
Regex101
Explanation:
(?<=\))(?!\()[^()]+ OR \((?!\()[^)]+\)
The first part before 'OR' basically matches AND Robot|PosX >=3 OR
(?<=\)) positive lookbehind: match at the current position only if the
previous character is )
(?!\() negative lookahead: match at the current position only if the next
character is not (
[^()]+ matches anything that is Neither ( nor ).
The last part after OR matches anything that starts with ( and ends with ) while ignoring any opening braces inside it.

Regex match a string that is not part of a larger word [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I'm stumped on how to even go about this.
I am trying to match the string "ashi" but not if the word containing it is in a small list of known false positives like "flashing", "lashing", "smashing". The false positive words can appear in the string as well as long as the string "ashi" (not as part of one of the false positive words) is in the string it should return true.
I'm using C# and I was trying to go about it without using regular expressions, but I am having no luck.
These strings should return true
...somethingashisomething...
...something2!ashi*&something...
... something ashi something flashing...
These strings should return false
...somethingflashingsomething...
...smashingthesomething...
...the lashings are too tight...
Another option might be to use a negative lookbehind with a nested lookahead. It rejects any position that is preceded by a word-initial fl and followed by ashing, so ashi is matched everywhere except inside the word flashing.
(?<!\bfl(?=ashing\b))ashi
Explanation
(?<! Negative lookbehind, assert what is directly on the left is not
\bfl Word boundary, match fl
(?= Positive lookahead, assert what is directly on the right is
ashing\b Match ashing and word boundary
) Close positive lookahead
) Close negative lookbehind
ashi Match literally
.NET Regex demo
Update
If you want to match and not match the updated values, you could use an alternation (?:sm|f?l) in the negative lookbehind to match sm or an optional f followed by l
(?<!(?:sm|f?l)(?=ashing))ashi
.NET regex demo | C# demo
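To check the updated pattern against the six sample strings from the question, a sketch like the following could be used. AshiDemo and ContainsAshi are hypothetical names; only the pattern comes from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class AshiDemo
{
    // Reject "ashi" when it is preceded by sm, l or fl and is the start of "ashing".
    static readonly Regex Ashi = new Regex(@"(?<!(?:sm|f?l)(?=ashing))ashi");

    public static bool ContainsAshi(string input)
    {
        return Ashi.IsMatch(input);
    }

    static void Main()
    {
        string[] samples =
        {
            "...somethingashisomething...",             // expected: True
            "...something2!ashi*&something...",         // expected: True
            "... something ashi something flashing...", // expected: True
            "...somethingflashingsomething...",         // expected: False
            "...smashingthesomething...",               // expected: False
            "...the lashings are too tight...",         // expected: False
        };
        foreach (string s in samples)
        {
            Console.WriteLine(ContainsAshi(s) + ": " + s);
        }
    }
}
```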
You can make use of a capturing group:
(flashing)|ashi
If the first group is not empty, you matched flashing literally
The following will match ashi but not within flashing. I interpreted "word" loosely, so flashing is not required to be isolated as a separate word with space/punctuation delimiters.
(?<=(?<prefix>fl)|)ashi(?(prefix)(?!ng))
It is sufficient to return true/false over the entire pattern and won't require checking specific capture groups. In other words, it is usable with Regex.IsMatch().
Pattern details:
(?<= # Zero-width positive lookbehind: match but don't consume characters
(?<prefix>fl) # Named capture group to match "fl" at start of "flashing"
| # Alternate blank capture - will succeed if "fl" is not present
) # End lookbehind
ashi # match literal "ashi"
(?(prefix) # Conditional: Only match if named group prefix has successful capture (i.e. "fl" was matched)
(?!ng) # Zero-width negative lookahead: Fail match if "ng" follows
) # Close conditional (there is no false part, so match succeeds if "fl" was not present)
If flashing is only excluded as an isolated word, just add word boundary operators. This will match something like flashingwithnospace, whereas the first pattern would fail on that string:
(?<=(?<prefix>\bfl)|)ashi(?(prefix)(?!ng\b))
(FYI, the pattern will work in isolation, but if it is combined within another pattern, especially inside a repeating construction, it may not work due to the conditional on the named capture group. Once the named capture group has succeeded, the conditional will remain true while matching the larger pattern, even if it were to encounter another occurrence of ashi.)
The question gives the examples
...somethingashisomething...
...something2!ashi*&something...
... something ashi something...
The second and third examples can be found by including the word boundary \b in the search, i.e. search for \bashi\b. Finding the first example requires more knowledge of what the two enclosing somethings are. If they are alphanumeric then you need to specify the problem in much more detail.

How do I find a match which has already been captured by another match?

How can I replace all occurrences of matches in a string if some parts have already been captured:
E.g. Given the pattern "AB|BC" and the target "ABC" we match "AB" but not "BC"
I've been trying to understand the various regex grouping options (Grouping Constructs in Regular Expressions) without success. I'm probably barking up the wrong tree. :-(
var test = Regex.Replace("(AB)(BC)(AC)(ABC)", @"AB|BC", string.Empty);
In the example, test evaluates to "()()(AC)(C)", but what I actually want is "()()(AC)()"
Without taking care of the parentheses, you could use an alternation with an optional character using the question mark.
Match AB with an optional C or Match an optional A followed by BC. In the replacement use an empty string.
ABC?|A?BC
Regex demo
Including the parentheses, you might use a capturing group or lookarounds to assert that what is on the left and on the right are opening and closing parentheses.
(?<=\()(?:ABC?|A?BC)(?=\))
Explanation
(?<=\() Assert what is on the left is (
(?: Non capturing group
ABC? Match AB with optional C
| Or
A?BC Match optional A and BC
) Close non capturing group
(?=\)) Assert what is on the right is )
Regex demo
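Both patterns can be tried on the string from the question with a sketch like this one. OverlapDemo, RemovePlain and RemoveInsideParens are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class OverlapDemo
{
    // Removes AB, BC and the overlapping ABC wherever they occur.
    public static string RemovePlain(string input)
    {
        return Regex.Replace(input, @"ABC?|A?BC", string.Empty);
    }

    // Same, but only when the match fills a pair of parentheses exactly.
    public static string RemoveInsideParens(string input)
    {
        return Regex.Replace(input, @"(?<=\()(?:ABC?|A?BC)(?=\))", string.Empty);
    }

    static void Main()
    {
        string target = "(AB)(BC)(AC)(ABC)";
        Console.WriteLine(RemovePlain(target));        // ()()(AC)()
        Console.WriteLine(RemoveInsideParens(target)); // ()()(AC)()
        Console.WriteLine(RemoveInsideParens("xABCx")); // xABCx (not inside parentheses)
    }
}
```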
In order to consume the overlap's buddy, it has to be matched.
Therefore, one side of the alternation has to include its buddy's last
or first literal (doesn't have to be both).
AB|BC ~ ABC?|BC = A?BC|AB

regex for javascript regular expressions

I need to parse some JavaScript code in C# and find the regular expressions in that.
When the regular expressions are created using RegExp, I am able to find. (Since the expression is enclosed in quotes.) When it comes to inline definition, something like:
var x = /as\/df/;
I am facing difficulty in matching the pattern. I need to start at a /, exclude all chars until a / is found but should ignore \/.
I cannot rely on the end of statement (;) because of Automatic Semicolon Insertion, or because the regex may be part of another statement, something like:
foo(/xxx/); //assume function takes regex param
If I am right, a line break is not allowed within an inline regex in JavaScript, which saves my day. However, the following is allowed:
var a=/regex1def/;var b=/regex2def/;
foo(/xxx/,/yyy/)
I need a regular expression, something like /.*/, that captures the right data.
You cannot reliably parse programming languages with regular expressions only. Especially Javascript, because its grammar is quite ambiguous. Consider:
a = a /b/ 1
foo = /*bar*/ + 1
a /= 5 //.*/hi
This code is valid Javascript, but none of /.../'s here are regular expressions.
In case you know what you're doing ;), an expression for matching escaped strings is "delimiter, (something escaped or not delimiter), delimiter":
delim ( \\. | [^delim] ) * delim
where delim is / in your case.
After several trials with RegexHero, this seems to work: /.*?[^\\]/. But I'm not sure whether I am missing any corner case.
How about this:
Regex regexObj = new Regex(@"/(?:\\/|[^/])*/");
Explanation:
/ # Match /
(?: # Non-capturing group:
\\ # Either match \
/ # followed by /
| # or
[^/] # match any character except /
)* # Repeat any number of times
/ # Match /
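Here is a minimal sketch of how that pattern could be run over the snippets from the question. RegexLiteralFinder and Find are hypothetical names. As the earlier answer points out, this will also pick up things that are not regex literals (division operators, comments), so treat it as a first approximation only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexLiteralFinder
{
    // A slash, then any run of escaped slashes or non-slash characters, then a slash.
    static readonly Regex Literal = new Regex(@"/(?:\\/|[^/])*/");

    public static List<string> Find(string javaScript)
    {
        var found = new List<string>();
        foreach (Match m in Literal.Matches(javaScript))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        // Prints: /as\/df/
        foreach (string s in Find(@"var x = /as\/df/;"))
            Console.WriteLine(s);

        // Prints /regex1def/ and /regex2def/ on separate lines
        foreach (string s in Find("var a=/regex1def/;var b=/regex2def/;"))
            Console.WriteLine(s);
    }
}
```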
I think that this may help you
var patt=/pattern/modifiers;
• pattern specifies the pattern of an expression
• modifiers specify if a search should be global, case-sensitive, etc.

