Regex to capture groups of parentheses including inner and outer parentheses - c#

I want to match all parentheses including the inner and outer parentheses.
Input: abc(test)def(rst(another test)uv)xy
Desired Output: (test)
(rst(another test)uv)
(another test)
My following c# code returns only (test) and (rst(another test)uv):
string input = "abc(test)def(rst(another test)uv)xy";
Regex regex = new Regex(@"\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)", RegexOptions.IgnorePatternWhitespace);
foreach (Match c in regex.Matches(input))
{
Console.WriteLine(c.Value);
}

You are looking for overlapping matches. Thus, just place your regex into a capturing group and put it inside a non-anchored positive lookahead:
Regex regex = new Regex(@"(?=(\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)))", RegexOptions.IgnorePatternWhitespace);
The value you need will be inside match.Groups[1].Value.
See the IDEONE demo:
using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Linq;
public class Test
{
public static void Main()
{
var input = "abc(test)def(rst(another test)uv)xy";
var regex = new Regex(@"(?=(\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)))", RegexOptions.IgnorePatternWhitespace);
var results = regex.Matches(input).Cast<Match>()
.Select(p => p.Groups[1].Value)
.ToList();
Console.WriteLine(String.Join(", ", results));
}
}
Results: (test), (rst(another test)uv), (another test).
Unanchored positive lookaheads can find overlapping matches when they contain a capturing group, because they do not consume text: the regex engine index stays at its current position while the subpatterns inside the lookahead are tried. After each match or failure the engine advances its index automatically, so the pattern is tested at every position in the input string.
Although lookahead subexpressions do not consume any text, they can still capture into groups.
Thus, when the lookahead reaches a (, it matches a zero-width string and places the value you need into Group 1. The engine then moves on, finds another ( inside the first (...), and captures a substring there as well.
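The difference between a consuming pattern and a capturing lookahead is easier to see on a smaller input. The following is a minimal sketch, not part of the original answer; the class name OverlapDemo, the helper Find and the sample string "ababa" are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Returns "index:value" for group 1 of every match of the pattern.
    public static string[] Find(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => $"{m.Groups[1].Index}:{m.Groups[1].Value}")
             .ToArray();

    public static void Main()
    {
        // Consuming pattern: the first match eats "aba", so the overlapping one is skipped.
        Console.WriteLine(string.Join(", ", Find("ababa", @"(aba)")));
        // Capturing lookahead: each match is empty, so the engine retries at the next index.
        Console.WriteLine(string.Join(", ", Find("ababa", @"(?=(aba))")));
    }
}
```

The first call reports a single match at index 0; the second reports matches at indices 0 and 2, because the zero-width lookahead never consumes the shared "a".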

You could use this one : \((?>[^()]+|\((?<P>)|(?<C-P>)\))*(?(P)(?!))\) but you'll have to dig through captures, groups and groups' captures to get what you want (see demo)
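One way to "dig through" the captures is sketched below. It assumes the pattern exactly as written above: the balancing group (?<C-P>) stores the text between a popped opening parenthesis and the current position, so each capture of C is the content of a nested pair without its parentheses. The class name BalancedCaptures and the method AllGroups are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BalancedCaptures
{
    // P acts as a stack of open parentheses; C receives the text between
    // a popped "(" and the position just before its matching ")".
    private static readonly Regex Balanced =
        new Regex(@"\((?>[^()]+|\((?<P>)|(?<C-P>)\))*(?(P)(?!))\)");

    public static List<string> AllGroups(string input)
    {
        var results = new List<string>();
        foreach (Match m in Balanced.Matches(input))
        {
            results.Add(m.Value); // outermost balanced group
            foreach (Capture c in m.Groups["C"].Captures)
            {
                results.Add("(" + c.Value + ")"); // nested group, parentheses restored
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in AllGroups("abc(test)def(rst(another test)uv)xy"))
        {
            Console.WriteLine(s);
        }
    }
}
```

The outermost groups come from the match values, while the nested ones have to be rebuilt from the captures of C, which is the extra work the answer refers to.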

Edit: This answer is flat out wrong for .Net regular expressions - see nam's comment below.
Original answer:
Regular expressions match regular languages. Nested parentheses are not a regular language, they require a context-free grammar to match. So the short answer is there is no way to do what you're attempting.
https://stackoverflow.com/a/133684/361631

Related

How to exclude regex match inside nested parentheses

I have a text like this:
UseProp1?(Prop1?Prop1:Test):(UseProp2?Prop2:(Test Text: '{TextProperty}' Test Reference:{Reference}))
I'm trying to use regex in c# to extract the nested if/else-segments.
To find '?' I've used:
Pattern 1: \?\s*(?![^()]*\))
and to find ':' I've used:
Pattern 2: \:\s*(?![^()]*\))
This works fine when there is one level of parentheses but not when nesting them.
I've used this online tool to simplify the testing: http://regexstorm.net/tester (insert pattern 1 and the input from above).
As you can see, it highlights two matches, but I only want the first. You'll also notice that the first pair of parentheses is skipped, but not the next one with the nested levels.
I expect the match list to be:
1) UseProp1
2) (Prop1?Prop1:Test):(UseProp2?Prop2:(Test Text: '{TextProperty}' Test Reference:{Reference}))
What I'm getting now is:
1) UseProp1
2) (Prop1?Prop1:Test):(UseProp2
3) Prop2:(Test Text: '{TextProperty}' Test Reference:{Reference}))
Expanding on @bobble bubble's comment, here's my regex:
It will capture the first layer of ternary functions. Capture groups: $1 is the conditional, $2 is the true clause, and $3 is the false clause. You will then have to match the regex on each of those to step further down the tree:
((?:\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\))+|\b[^)(?:]+)+\?((?:\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\))+|\b[^)(?:]+)+\:((?:\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\))+|\b[^)(?:]+)+
Code in Tester
That being said, if you are evaluating math in these expressions as well, it may be more valuable to use a runtime compiler to do all the heavy lifting for you. This answer will help you design in that direction if you so choose.
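If regex turns out to be the wrong tool here, a small hand-written scanner is one alternative. The sketch below is not from any of the answers: it simply tracks parenthesis depth and reports the positions of a character that sits outside all parentheses. The names TernarySplitter and TopLevelIndices are made up for illustration, and the input is assumed to have balanced parentheses.

```csharp
using System;
using System.Collections.Generic;

public static class TernarySplitter
{
    // Returns the index of every occurrence of `target` that is not inside parentheses.
    public static List<int> TopLevelIndices(string text, char target)
    {
        var indices = new List<int>();
        int depth = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (ch == '(') depth++;
            else if (ch == ')') depth--;
            else if (ch == target && depth == 0) indices.Add(i);
        }
        return indices;
    }

    public static void Main()
    {
        string text = "UseProp1?(Prop1?Prop1:Test):(UseProp2?Prop2:(Test Text: '{TextProperty}' Test Reference:{Reference}))";
        int question = TopLevelIndices(text, '?')[0];
        Console.WriteLine(text.Substring(0, question));  // the condition
        Console.WriteLine(text.Substring(question + 1)); // both branches, still joined by the top-level ':'
    }
}
```

Applying the same function with ':' to the second segment separates the true and false branches, and repeating the process on each branch (after stripping its outer parentheses) walks down the tree.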
If I understand correctly and we wish to capture only the two listed formats, we can start with a simple expression using alternation and then modify its parts as needed:
UseProp1|(\(?Prop1\?Prop1(:Test)\)):(\(UseProp2\?Prop2):\((Test\sText):\s+'\{(.+?)}'\s+Test\sReference:\{(.+?)}\)\)
Demo
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"UseProp1|(\(?Prop1\?Prop1(:Test)\)):(\(UseProp2\?Prop2):\((Test\sText):\s+'\{(.+?)}'\s+Test\sReference:\{(.+?)}\)\)";
string input = @"UseProp1
(Prop1?Prop1:Test):(UseProp2?Prop2:(Test Text: '{TextProperty}' Test Reference:{Reference}))
";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
RegEx
If this expression wasn't desired and you wish to modify it, please visit this link at regex101.com.

Regex including what is supposed to be non-capturing group in result

I have the following simple test where i'm trying to get the Regex pattern such that it yanks the executable name without the ".exe" suffix.
It appears my non-capturing group setting (?:\\.exe) isn't working or i'm misunderstanding how its intended to work.
Both regex101 and regexstorm.net show the same result and the former confirms that "(?:\.exe)" is a non-capturing match.
Any thoughts on what i'm doing wrong?
// test variable for what i would otherwise acquire from Environment.CommandLine
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = Regex.Match(testEcl, @"[^\\]+(?:\.exe)", RegexOptions.IgnoreCase).Value;
// expecting "MyApp" but I get "MyApp.exe"
I have been able to extract the value i wanted by using a matching pattern with group names defined, as shown in the following, but would like to understand why non-capturing group setting approach didn't work the way i expected it to.
// test variable for what i would otherwise acquire from Environment.CommandLine
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = Regex.Match(Environment.CommandLine, @"(?<fname>[^\\]+)(?<ext>\.exe)",
RegexOptions.IgnoreCase).Groups["fname"].Value;
// get the desired "MyApp" result
A (?:...) is a non-capturing group that matches and still consumes the text. It means the part of text this group matches is still added to the overall match value.
In general, if you want to match something but not consume, you need to use lookarounds. So, if you need to match something that is followed with a specific string, use a positive lookahead, (?=...) construct:
some_pattern(?=specific string) // if specific string comes immediately after pattern
some_pattern(?=.*specific string) // if specific string comes anywhere after pattern
If you need to match but "exclude from match" some specific text before, use a positive lookbehind:
(?<=specific string)some_pattern // if specific string comes immediately before pattern
(?<=specific string.*?)some_pattern // if specific string comes anywhere before pattern
Note that quantified patterns such as .*? or .* (that is, patterns with *, +, ?, {2,} or even {1,3} quantifiers) inside lookbehinds are not supported by every regex engine; fortunately, the C# .NET regex engine supports them. They are also supported by the PyPI regex module for Python, Vim, JGSoft software and, now, ECMAScript 2018 compliant JavaScript environments.
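As a quick illustration of both lookaround forms in .NET, here is a minimal sketch. The class name LookaroundDemo, the helper FirstMatch and the sample inputs are assumptions made for this example; the second pattern relies on the variable-width lookbehind support mentioned above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the value of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Lookahead: ".exe" must follow, but it is not part of the match value.
        Console.WriteLine(FirstMatch("MyApp.exe", @"\w+(?=\.exe)"));
        // Lookbehind with quantifiers: "key", "=" and any spacing must precede the match.
        Console.WriteLine(FirstMatch("key =   value", @"(?<=key\s*=\s*)\w+"));
    }
}
```

In both cases the context is checked but left out of Match.Value, so no capturing group is needed to isolate the interesting part.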
In this case, you may capture what you need to get and just match the context without capturing:
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = string.Empty;
var m = Regex.Match(testEcl, @"([^\\]+)\.exe", RegexOptions.IgnoreCase);
if (m.Success)
{
asmName = m.Groups[1].Value;
}
Console.WriteLine(asmName);
See the C# demo
Details
([^\\]+) - Capturing group 1: one or more chars other than \
\. - a literal dot
exe - a literal exe substring.
Since we are only interested in capturing group #1 contents, we grab m.Groups[1].Value, and not the whole m.Value (that contains .exe).
You're using a non-capturing group. The emphasis is on the word group here; the group does not capture the .exe, but the regex in general still does.
You probably want a positive lookahead, which just asserts that the string must meet a criterion for the match to be valid, though the text meeting that criterion is not captured.
In other words, you want (?=, not (?:, at the start of your group.
The difference between a capturing and a non-capturing group only matters if you enumerate the Groups property of the Match object; in your case, you're just using the Value property, so there's no distinction between a normal group (\.exe) and a non-capturing group (?:\.exe).
To see the distinction, consider this test program:
static void Main(string[] args)
{
var positiveInput = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
Test(positiveInput, @"[^\\]+(\.exe)");
Test(positiveInput, @"[^\\]+(?:\.exe)");
Test(positiveInput, @"[^\\]+(?=\.exe)");
var negativeInput = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.dll\" /?";
Test(negativeInput, @"[^\\]+(?=\.exe)");
}
static void Test(String input, String pattern)
{
Console.WriteLine($"Input: {input}");
Console.WriteLine($"Regex pattern: {pattern}");
var match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
if (match.Success)
{
Console.WriteLine("Matched: " + match.Value);
for (int i = 0; i < match.Groups.Count; i++)
{
Console.WriteLine($"Groups[{i}]: {match.Groups[i]}");
}
}
else
{
Console.WriteLine("No match.");
}
Console.WriteLine("---");
}
The output of this is:
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(\.exe)
Matched: MyApp.exe
Groups[0]: MyApp.exe
Groups[1]: .exe
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(?:\.exe)
Matched: MyApp.exe
Groups[0]: MyApp.exe
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(?=\.exe)
Matched: MyApp
Groups[0]: MyApp
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.dll" /?
Regex pattern: [^\\]+(?=\.exe)
No match.
---
The first regex (@"[^\\]+(\.exe)") has \.exe as just a normal group.
When we enumerate the Groups property, we see that .exe is indeed a group captured in our input.
(Note that the entire regex is itself a group, hence Groups[0] is equal to Value).
The second regex (@"[^\\]+(?:\.exe)") is the one provided in your question.
The only difference compared to the previous scenario is that the Groups property doesn't contain .exe as one of its entries.
The third regex (@"[^\\]+(?=\.exe)") is the one I'm suggesting you use.
Now, the .exe part of the input isn't captured by the regex at all, but the regex won't match unless the file name is followed by .exe, as the fourth scenario illustrates.
The non-capturing group still matches .exe; it just doesn't capture it. So if you want only the part before .exe, wrap that part in a capturing group and read the group instead of the whole match.
You can access groups like this:
var asmName = Regex.Match(testEcl, @"([^\\]+)(?:\.exe)", RegexOptions.IgnoreCase);
asmName.Groups[1].Value
The demo for the regex can be found here

Matching string separated by - using regex

Regex is not my favorite thing, but it certainly has its uses. Right now I'm trying to match a string consisting of this.
[video-{service}-{id}]
An example of such a string:
[video-123abC-zxv9.89]
In the example above I would like to get the "service" 123abC and the "id" zxv9.89.
So far this is what I've got. Probably overcomplicated.
var regexPattern = @"\[video-(?<id1>[^]]+)(-(?<id2>[^]]+))?\]";
var ids = Regex.Matches(text, regexPattern, RegexOptions.IgnoreCase)
.Cast<Match>()
.Select(m => new VideoReplaceItem()
{
Tag = m.Value,
Id = string.IsNullOrWhiteSpace(m.Groups["id1"].Value) == false ? m.Groups["id1"].Value : "",
Service = string.IsNullOrWhiteSpace(m.Groups["id2"].Value) == false ? m.Groups["id2"].Value : "",
}).ToList();
This does not work and puts all the characters after '[video-' into the Id variable.
Any suggestions?
The third part seems to be optional. The [^]]+ is actually matching the - symbol, and to fix the expression, you either need to make the first [^]]+ lazy ([^]]+?) or add a hyphen to the negated character class.
Use
\[video-(?<id1>[^]-]+)(-(?<id2>[^]-]+))?]
See the regex demo
Or with the lazy character class:
\[video-(?<id1>[^]]+?)(-(?<id2>[^]]+))?]
                     ^
See another demo.
Since you are using named groups, you may compile the regex object with RegexOptions.ExplicitCapture option to make the regex engine treat all numbered capturing groups as non-capturing ones (so as not to add ?: after the ( that defines the optional (-(?<id2>[^]-]+))? group).
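A minimal sketch of the ExplicitCapture suggestion follows. It uses the hyphen-excluding pattern from above with the groups renamed to service and id (the renaming, the class name VideoTagDemo and the method Parse are assumptions for this example, not part of the answer).

```csharp
using System;
using System.Text.RegularExpressions;

public static class VideoTagDemo
{
    // With ExplicitCapture the unnamed (-...)? group does not capture,
    // so only the whole match and the two named groups are recorded.
    private static readonly Regex Tag = new Regex(
        @"\[video-(?<service>[^]-]+)(-(?<id>[^]-]+))?]",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static (string Service, string Id) Parse(string text)
    {
        Match m = Tag.Match(text);
        return m.Success
            ? (m.Groups["service"].Value, m.Groups["id"].Value)
            : ("", "");
    }

    // Number of groups the compiled pattern defines, including group 0.
    public static int GroupCount => Tag.GetGroupNumbers().Length;

    public static void Main()
    {
        var (service, id) = Parse("[video-123abC-zxv9.89]");
        Console.WriteLine($"service={service}, id={id}");
        Console.WriteLine($"groups={GroupCount}");
    }
}
```

When the optional second part is missing, as in "[video-123abC]", the id group simply fails to participate and its Value is an empty string.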
Try this:
\[video-(?<service>[^]]+?)(-(?<id>[^]]+))?\]
The "?" in the service group makes the expression before it "lazy" (meaning it matches the fewest possible characters to satisfy the overall expression).
I would recommend Regexstorm.net for .NET regex testing: http://regexstorm.net/tester

How to match all regular expression groups with or without character between them

Here is my regular expression
(".+?")*([^{}\s]+)*({.+?})*
Generally matching with this expression work well, but only if there is any character between matched groups. For example this:
{1.0 0b1 2006-01-01_12:34:56.789} {1.2345 0b100000 2006-01-01_12:34:56.789}
produces two matches:
1. {1.0 0b1 2006-01-01_12:34:56.789}
2. {1.2345 0b100000 2006-01-01_12:34:56.789}
but this:
{1.0 0b1 2006-01-01_12:34:56.789}{1.2345 0b100000 2006-01-01_12:34:56.789}
only one containing last match:
{1.2345 0b100000 2006-01-01_12:34:56.789}
PS. I'm using switch g for global match
EDIT:
I did some research in the meantime and need to provide additional data. I pasted the whole regular expression, which also matches words and strings, so the asterisk after each group is necessary.
EDIT2: Here is example text:
COMMAND STATUS {OBJECT1}{OBJECT2} "TEXT" "TEXT"
As a result I want these groups:
COMMAND
STATUS
{OBJECT1}
{OBJECT2}
"TEXT1"
"TEXT2"
Here is my actual C# code:
var regex = new Regex("(\".+?\")*([^{}\\s]+)*({.+?})*");
var matches = regex.Matches(responseString);
return matches
.Cast<Match>()
.Where(match => match.Success && !string.IsNullOrWhiteSpace(match.Value))
.Select(match => CommandParameter.Parse(match.Value));
You can use the following regex to capture all the {...}s:
(".+?"|[^{}\s]+|{[^}]+?})
See demo here.
My approach to capturing anything delimited by single characters is to use a negated character class containing the closing character. Also, since you are matching non-empty text, you'd be better off using the + quantifier, which ensures at least one character is matched.
EDIT:
Instead of making each group optional, you should use alternative lists.
You have an extra quantifier * for ({.+?}) sub-pattern.
You can use this regex:
("[^"]*"|{[^}]*}|[^{}\s]+)
RegEx Demo
Note how it matches both inputs: the one with a space between the groups and the one without.
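Plugged into C#, the alternation-based pattern could be used roughly as follows. This is a sketch only: the braces are escaped here to keep them unambiguous, the class name Tokenizer and the method Tokenize are hypothetical, and the sample input uses "TEXT1" and "TEXT2" as in the desired output of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // One token per match: a quoted string, a {...} object, or a bare word.
    private static readonly Regex Token =
        new Regex(@"""[^""]*""|\{[^}]*\}|[^{}\s]+");

    public static List<string> Tokenize(string input) =>
        Token.Matches(input)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    public static void Main()
    {
        string input = @"COMMAND STATUS {OBJECT1}{OBJECT2} ""TEXT1"" ""TEXT2""";
        foreach (string token in Tokenize(input))
        {
            Console.WriteLine(token);
        }
    }
}
```

Because every match is exactly one alternative, adjacent objects such as {OBJECT1}{OBJECT2} come out as separate matches, and no filtering of empty matches is needed.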

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is, say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their statemachine implementation. Doesn't it work in the same way.
If a string starts with Value_name followed by one or more whitespaces. Go to Next State. In That State read a word until a "," comes. Then do it again! And each word will be grouped!
Am i wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
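In .NET specifically, a repeated group keeps every value it captured in Group.Captures, so the last pattern suggested above can return all fields from a single match. The following is a minimal sketch under that assumption; the class name CsvValues and the method Values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvValues
{
    // Group 1 is repeated, so each iteration adds one entry to its Captures collection.
    public static List<string> Values(string line)
    {
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        return m.Groups[1].Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Values("value_name v1,v2,v3,v4")));
    }
}
```

Group 1's Value only holds the last field; the earlier ones are available through Captures. If the input has spaces after the commas, the captured values keep them, so a Trim() on each value may be needed.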
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
