Regex with balancing groups - c#

I need to write a regex that captures the generic arguments (which can themselves be generic) of a type name written in a special notation like this:
System.Action[Int32,Dictionary[Int32,Int32],Int32]
Let's assume the type name is [\w.]+ and a parameter is [\w.,\[\]]+,
so I need to grab only Int32, Dictionary[Int32,Int32] and Int32.
Basically, I need to capture something only when the balancing group stack is empty, but I don't really understand how.
UPD
The answer below helped me solve the problem quickly (but without proper validation and with the nesting depth limited to 1). I have since managed to do it with balancing groups:
^[\w.]+ #Type name
\[(?<delim>) #Opening bracket and first delimiter
[\w.]+ #Minimal content
(
[\w.]+
((?(open)|(?<param-delim>)),(?(open)|(?<delim>)))* #Cutting param if balanced before comma and placing delimiter
((?<open>\[))* #Counting [
((?<-open>\]))* #Counting ]
)*
(?(open)|(?<param-delim>))\] #Cutting last param if balanced
(?(open)(?!) #Checking balance
)$
Demo
UPD2 (Last optimization)
^[\w.]+
\[(?<delim>)
[\w.]+
(?:
(?:(?(open)|(?<param-delim>)),(?(open)|(?<delim>))[\w.]+)?
(?:(?<open>\[)[\w.]+)?
(?:(?<-open>\]))*
)*
(?(open)|(?<param-delim>))\]
(?(open)(?!)
)$

I suggest capturing those values using
\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*
See the regex demo.
Details:
\w+(?:\.\w+)* - 1+ word chars, followed by 0 or more sequences of a . and 1+ word chars
\[ - a literal [
(?:,?(?<res>\w+(?:\[[^][]*])?))* - 0 or more sequences of:
,? - an optional comma
(?<res>\w+(?:\[[^][]*])?) - Group "res" capturing:
\w+ - one or more word chars (perhaps, you would like [\w.]+)
(?:\[[^][]*])? - 1 or 0 (change ? to * to match 0 or more) sequences of a [, 0+ chars other than [ and ], and a closing ].
A C# demo below:
var line = "System.Action[Int32,Dictionary[Int32,Int32],Int32]";
var pattern = @"\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups["res"].Captures.Cast<Capture>()
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);
UPDATE: To account for unknown depth [...] substrings, use
\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*
See the regex demo
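To illustrate how the unknown-depth pattern behaves, here is a minimal C# sketch. Only the pattern comes from the answer above; the `GenericArgsDemo` class, the `Extract` helper, and the doubly nested input string are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GenericArgsDemo
{
    // (?<o>\[) pushes onto the "o" stack for every nested [,
    // (?<-o>]) pops it for every nested ], and (?(o)(?!)) fails
    // if the stack is not empty when the argument is about to close.
    const string Pattern =
        @"\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*";

    // Hypothetical helper: returns the top-level generic arguments.
    public static List<string> Extract(string line)
    {
        var result = new List<string>();
        var match = Regex.Match(line, Pattern);
        // One group, many captures: each loop iteration adds one capture.
        foreach (Capture c in match.Groups["res"].Captures)
            result.Add(c.Value);
        return result;
    }

    public static void Main()
    {
        // Assumed input with two levels of nesting (not from the question).
        var args = Extract("System.Action[Int32,Dictionary[Int32,List[Int32]],Int32]");
        foreach (var a in args)
            Console.WriteLine(a);
        // Int32
        // Dictionary[Int32,List[Int32]]
        // Int32
    }
}
```

Note that the pattern extracts arguments but does not validate the whole string: the final closing ] of the type name is not part of the match.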

Related

Comma is breaking Grouping

I want a regular expression that matches anything passed as a parameter in a string like concat(1st,2nd) and extracts three groups, as below:
Group1: concat
Group2: 1st
Group3: 2nd.
I have tried this :^\s*(concat)\(\s*(.*?)\s*\,\s*(.*)\)\s*$, and it worked fine until I had a parameter with comma as below:
concat(regex(3,4),regex(3,4)). It seems the comma is breaking it. How can I ignore the parameter content and capture each parameter as a separate group?
You may use
^\s*(concat)\(\s*((?>\w*\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|\w+))\s*,\s*((?>\w*\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|\w+))\)\s*$
See the regex demo.
Details
^ - start of string
\s* - 0+ whitespaces
(concat) - Group 1: concat word
\( - a ( char
\s* - 0+ whitespaces
({arg}) - Group 2: arg pattern:
\w* - 0+ word chars
\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\) - (, then any amount of nested parentheses or chars other than ( and ) and then )
|\w+ - or just 1+ word chars
\s*,\s* - a comma enclosed with 0+ whitespaces
({arg}) - Group 3: arg pattern
\) - a ) char
\s* - 0+ whitespaces
$ - end of string.
See C# demo:
var arg = @"(?>\w*\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|\w+)";
var pattern = $@"^\s*(concat)\(\s*({arg})\s*,\s*({arg})\)\s*$";
var match = Regex.Match("concat(regex(3,4),regex(3,4))", pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value);
Console.WriteLine(match.Groups[2].Value);
Console.WriteLine(match.Groups[3].Value);
}
// => concat regex(3,4) regex(3,4)

RegEx split string into words by space and containing chars

How can one perform this split with the Regex.Split(input, pattern) method?
This is a [normal string ] made up of # different types # of characters
Array of strings output:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
It should also keep the leading spaces, so I want to preserve everything: if the string contains 20 chars, the array elements should total 20 chars.
What I have tried:
Regex.Split(text, @"(?<=[ ]|# #)")
Regex.Split(text, @"(?<=[ ])(?<=# #")
I suggest matching, i.e. extracting words, not splitting:
string source = @"This is a [normal string ] made up of # different types # of characters";
// Three possibilities:
// - plain word [A-Za-z]+
// - # ... # quotation
// - [ ... ] quotation
string pattern = @"[A-Za-z]+|(#.*?#)|(\[.*?\])";
var words = Regex
.Matches(source, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, words
.Select((w, i) => $"{i + 1}. {w}")));
Outcome:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
You may use
var res = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
See the regex demo
The (\[[^][]*]|#[^#]*#) part is a capturing group whose value is output to the resulting list along with the split items.
Pattern details
(\[[^][]*]|#[^#]*#) - Group 1: either of the two patterns:
\[[^][]*] - [, followed with 0+ chars other than [ and ] and then ]
#[^#]*# - a #, then 0+ chars other than # and then #
| - or
\s+ - 1+ whitespaces
C# demo:
var s = "This is a [normal string ] made up of # different types # of characters";
var results = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
Console.WriteLine(string.Join("\n", results));
Result:
This
is
a
[normal string ]
made
up
of
# different types #
of
characters
The matching approach would be easier; however, it can also be done with negative lookaheads:
[ ](?![^\]\[]*\])(?![^#]*\#([^#]*\#{2})*[^#]*$)
matches a space not followed by
any character sequence except [ or ] followed by ]
# followed by an even number of #

Why is my regex not finding any matches?

So I have a string that looks like this with the spaces and everything.
id: 123456789,
name: 'HappyDev',
member: false,
language: 0,
isLoggedIn: 0
And here is my pattern
static string pattern = @" id: (.*),
name: (.*),
member: (.*),
language: (.*),
isLoggedIn: (.*)";
Then to get my match I do it like so..
static Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(myString);
if (m.Success)
{
Console.WriteLine(m.Value);
}
For some reason it's returning false when I compile even though on every website where I can test my pattern it returns a match with the values.
Why is it returning false when I compile?
Alternate solution:
Pattern
(?:: ?)(.*)(?:,)|(?:: ?)(.*)
Explanation:
1st Alternative (?:: ?)(.*)(?:,)
Non-capturing group (?:: ?)
: matches the character : literally (case sensitive)
' ' matches the space character literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times
as possible, giving back as needed (greedy)
1st Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many
times as possible, giving back as needed (greedy)
Non-capturing group (?:,)
, matches the character , literally (case sensitive)
2nd Alternative (?:: ?)(.*)
Non-capturing group (?:: ?)
: matches the character : literally (case sensitive)
' ' matches the space character literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times
as possible, giving back as needed (greedy)
2nd Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many
times as possible, giving back as needed (greedy)
You lose the distinctness of specifying id etc., but you rely on unnamed capturing groups with implicit ordering anyway, so there is some room for refinement. If you think they might skip params or reorder them, I would keep the named identifiers as part of the pattern and add names to the capturing groups so they are decoupled from ordering.
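The refinement suggested above could look roughly like the following sketch. It keeps the field labels in the pattern, gives each capturing group a name, and borrows the \s* separators proposed elsewhere in this thread. The `NamedGroupsDemo` class, the `Parse` helper, and the group names are assumptions made for illustration, and the fields are still expected in a fixed order.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupsDemo
{
    // Field labels stay in the pattern; \s* absorbs indentation and line breaks.
    // No RegexOptions.Singleline here, so .* stops at the end of each line.
    const string Pattern =
        @"id:\s*(?<id>.*),\s*" +
        @"name:\s*(?<name>.*),\s*" +
        @"member:\s*(?<member>.*),\s*" +
        @"language:\s*(?<language>.*),\s*" +
        @"isLoggedIn:\s*(?<isLoggedIn>.*)";

    // Hypothetical helper wrapping Regex.Match.
    public static Match Parse(string input) => Regex.Match(input, Pattern);

    public static void Main()
    {
        // Line breaks are written explicitly so the sample does not depend
        // on the line endings of the source file.
        var input = "    id: 123456789,\n    name: 'HappyDev',\n" +
                    "    member: false,\n    language: 0,\n    isLoggedIn: 0";

        var m = Parse(input);
        if (m.Success)
        {
            // Values are read by name, not by group index.
            Console.WriteLine(m.Groups["id"].Value);         // 123456789
            Console.WriteLine(m.Groups["name"].Value);       // 'HappyDev'
            Console.WriteLine(m.Groups["isLoggedIn"].Value); // 0
        }
    }
}
```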
The problem is that you have different number of spaces. To ignore this problem in any case you can use a pattern to match multiple spaces: \s+. Also you should replace your new lines with a pattern for new line: [\n\r]+ (note that this will match any number of new lines)
So your pattern becomes:
static string pattern = @"\s+id: (.*),[\n\r]+\s+name: (.*),[\n\r]+\s+member: (.*),[\n\r]+\s+language: (.*),[\n\r]+\s+isLoggedIn: (.*)";
There are different ways of solving this. Here is mine:
string pattern = @"^id:\s*(.+),(?:\r\n|\n|\r)\s+name:(.+),(?:\r\n|\n|\r)\s+member:\s+(.+),(?:\r\n|\n|\r)\s+language:\s+(.+),(?:\r\n|\n|\r)\s+isLoggedIn:\s+(.+)$";
It will account for any space in-between as well as any combination of carriage return/line feed.
var str = @"
id: 123456789,
name: 'HappyDev',
member: false,
language: 0,
isLoggedIn: 0";
var matches = Regex.Matches(str, @"(?im)(?'attr'\w+):\s+(?'val'[^,]+)");
if (matches.Count == 0)
Console.WriteLine("No matches found");
else
matches.Cast<Match>().ToList().ForEach(m =>
Console.WriteLine($"Match: '{m.Value}' [Attribute = {m.Groups["attr"].Value}, Value = {m.Groups["val"].Value}]"));
I tried it on Regex101; your pattern has spacing issues (the number of spaces doesn't match).
You can use the following regex, which matches spaces and newline chars with \s*, so you no longer need to worry about how many spaces there are:
id: (.*),\s*name: (.*),\s*member: (.*),\s*language: (.*),\s*isLoggedIn: (.*)
Talking about the initial code, check if the amount of spaces is equal in the string and pattern. This code finds the match:
var myString =
@" id: 123456789,
name: 'HappyDev',
member: false,
language: 0,
isLoggedIn: 0";
string pattern =
@" id:.*,
name: (.*),
member: (.*),
language: (.*),
isLoggedIn: (.*)";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(myString);
if (m.Success)
{
Console.WriteLine(m.Value);
}
However, you shouldn't use it like that, but replace spaces with ( +) or make use of other solutions provided here.

Regex to get everything starting with # and removing everything after any non-included characters

I have the following:
Regex RgxUrl = new Regex("[^a-zA-Z0-9-_]");
foreach (var item in source.Split(' ').Where(s => s.StartsWith("#")))
{
var mention = item.Replace("#", "");
mention = RgxUrl.Replace(mention, "");
usernames.Add(mention);
}
CURRENT INPUT > OUTPUT
#fish and fries are #good > fish, good
#fish and fries and #Mary's beer are #good > fish, good, marys
DESIRED INPUT > OUTPUT
#fish and fries are #good > fish, good
#fish and fries and #Mary's beer are #good > fish, good, Mary
The key here is to remove anything that's after an offending character. How can this be achieved?
You split a string with a space, check if a chunk starts with #, then if yes, remove all the # symbols in the string, then use a regex to remove all non-alphanumeric, - and _ chars in the string and then add it to the list.
You can do that with a single regex:
var res = Regex.Matches(source, @"(?<!\S)#([a-zA-Z0-9-_]+)")
.Cast<Match>()
.Select(m=>m.Groups[1].Value)
.ToList();
Console.WriteLine(string.Join("; ", res)); // demo
usernames.AddRange(res); // in your code
See the C# demo
Pattern details:
(?<!\S) - there must not be a non-whitespace symbol immediately to the left of the current location (i.e. there must be a whitespace or start of string) (this lookbehind is here because the original code split the string with whitespace)
# - a # symbol (it is not part of the subsequent group because this symbol was removed in the original code)
([a-zA-Z0-9-_]+) - Capturing Group 1 (accessed with m.Groups[1].Value) matching one or more ASCII letters, digits, - and _ symbols.
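As a quick check against the second input from the question, the pattern can be wrapped in a small runnable sketch. The `MentionDemo` class and `ExtractNames` helper are hypothetical names, and the hyphen is moved to the end of the character class, which matches the same characters while avoiding any ambiguity about ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MentionDemo
{
    // Hypothetical helper: returns the name following each # that starts a word.
    public static List<string> ExtractNames(string source)
    {
        return Regex.Matches(source, @"(?<!\S)#([a-zA-Z0-9_-]+)")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value) // Group 1 excludes the leading #
            .ToList();
    }

    public static void Main()
    {
        var names = ExtractNames("#fish and fries and #Mary's beer are #good");
        // The match stops at the apostrophe, so "Mary's" yields "Mary".
        Console.WriteLine(string.Join(", ", names)); // fish, Mary, good
    }
}
```

Because the capture simply stops at the first character outside the class, no separate cleanup step is needed.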

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
Console.WriteLine("-------------------------");
for (int i = 0; i < ccOL1.Count; i++)
Console.WriteLine(" '{0}'", ccOL1[i].Value );
_mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));
