How to get groups from a repeated pattern with a quantifier in a regular expression - c#

I have the following string:
(a,b,c,d,e)
I want to extract all comma-separated values with a regular expression.
If I remove the parentheses
a,b,c,d,e
and use the following regular expression:
([^,]),?
I get one match and one group for each comma-separated value.
But if I keep the enclosing parentheses and use the regular expression:
\((([^,]),?)+\)
I still get only one match and one group. The group contains only the last comma-separated value.
I also tried other group constructs like:
(?:....)
(...?)
(...)?
but I cannot get the comma-separated values as regular expression groups.
How can I do this when the comma-separated values are enclosed in parentheses?

In general, that's how repeated groups work: you don't get separate groups, just the last one. If you want to separate values between commas, it's better to use the string functions available in your programming language to first strip the parentheses and then split the string on commas.
For example in Ruby:
'(a,b,c,d,e,f)'.gsub(/[()]/,'').split(',')
# => ["a", "b", "c", "d", "e", "f"]
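Since the question is about C#, the same strip-then-split idea could look roughly like this there. This is only a sketch that assumes the input always has exactly one pair of enclosing parentheses and no nested ones; the variable names are made up for illustration.

```csharp
using System;

// Strip the enclosing parentheses, then split on commas (no regex needed).
string text = "(a,b,c,d,e)";
string[] values = text.Trim('(', ')').Split(',');

Console.WriteLine(string.Join(" | ", values)); // a | b | c | d | e
```

Trim only removes characters at the ends of the string, so parentheses inside the values would be left alone.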

I found it out. In C# you can use the Captures property that every Group of a Match exposes.
Using Regex:
\((([^,]),?)+\)
Do:
string text = "(a,b,c,d,e)";
Regex rgx = new Regex("\\((([^,]),?)+\\)");
MatchCollection matches = rgx.Matches(text);
Then the MatchCollection contains 1 match with the following 3 groups:
[0]: \((([^,]),?)+\) => (a,b,c,d,e)
[1]: (([^,]),?)+ => value and optional comma, e.g. a, or b, or e
[2]: ([^,]) => value only, e.g. a or b or ...
The Captures list of a group stores every value that the quantifier captured. So use group [2] and its Captures to get all the values.
So the solution is:
string text = "(a,b,c,d,e)";
Regex rgx = new Regex("\\((([^,]),?)+\\)");
MatchCollection matches = rgx.Matches(text);
// now get the captured values
CaptureCollection captures = matches[0].Groups[2].Captures;
// and extract them to a list
List<string> values = new List<string>();
foreach (Capture capture in captures)
{
    values.Add(capture.Value);
}

Related

Using a regex with 'or' operator and getting matched groups?

I have some string in a file in the format
rid="deqn1-2"
rid="deqn3"
rid="deqn4-5a"
rid="deqn5b-7"
rid="deqn7-8"
rid="deqn9a-10v"
rid="deqn11a-12c"
I want a regex to match each deqnX-Y where X and Y are either both integers or both a combination of an integer and a letter, and if there is a match, store X and Y in some variables.
I tried using the regex (^(\d+)-(\d+)$|^(\d+[a-z])-(\d+[a-z]))$, but how do I get the values of the matched groups into variables?
For a match between two integers the groups would be (I think)
Groups[2].Value
Groups[3].Value
and for a match between two integer-and-letter combinations they would be
Groups[4].Value
Groups[5].Value
How do I determine which match actually occurred and then capture the matching groups accordingly?

As branch reset (?|...) is not supported in C#, we can use named capturing groups with the same name in both alternatives, like
deqn(?:(?<match1>\d+)-(?<match2>\d+)|(?<match1>\d+\w+)-(?<match2>\d+\w+))\b
C# code
String sample = "deqn1-2";
Regex regex = new Regex("deqn(?:(?<match1>\\d+)-(?<match2>\\d+)|(?<match1>\\d+\\w+)-(?<match2>\\d+\\w+))\\b");
Match match = regex.Match(sample);
if (match.Success) {
    Console.WriteLine(match.Groups["match1"].Value);
    Console.WriteLine(match.Groups["match2"].Value);
}

You could simply not care. One of the pairs will be empty anyway. So what if you just interpret the result as a combination of both? Just slap them together. First value of the first pair plus first value of the second pair, and second value of the first pair plus second value of the second pair. This always gives the right result.
Regex regex = new Regex("^deqn(?:(\\d+)-(\\d+)|(\\d+[a-z])-(\\d+[a-z]))$");
foreach (String str in listData)
{
    Match match = regex.Match(str);
    if (!match.Success)
        continue;
    // Only one pair took part in the match; the other pair is empty.
    String value1 = match.Groups[1].Value + match.Groups[3].Value;
    String value2 = match.Groups[2].Value + match.Groups[4].Value;
    // process your strings
    // ...
}
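If you do care which alternative matched, each Group has a Success property that is true only when that group took part in the match. A minimal sketch, reusing the regex from this answer; Describe is a hypothetical helper name, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"^deqn(?:(\d+)-(\d+)|(\d+[a-z])-(\d+[a-z]))$");

// Describe is a hypothetical helper: it reports which alternative matched.
string Describe(string input)
{
    Match m = regex.Match(input);
    if (!m.Success) return "no match";
    // Group.Success is true only for groups that took part in the match.
    return m.Groups[1].Success
        ? $"integers: {m.Groups[1].Value}, {m.Groups[2].Value}"
        : $"alphanumeric: {m.Groups[3].Value}, {m.Groups[4].Value}";
}

Console.WriteLine(Describe("deqn1-2"));     // integers: 1, 2
Console.WriteLine(Describe("deqn11a-12c")); // alphanumeric: 11a, 12c
Console.WriteLine(Describe("deqn4-5a"));    // no match
```

The last input is rejected because X and Y are not of the same kind, which is what the question asked for.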

Regex to match multiple number groups between two characters

I have a string that looks like the following:
<#399969178745962506> hello to <#!104729417217032192>
I have a dictionary that looks like the following:
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
My goal here is to replace <#399969178745962506> with the value stored for that number key, which in this case would be One
Regex.Replace(arg.Content, "(?<=<)(.*?)(?=>)", m => userDic.ContainsKey(m.Value) ? userDic[m.Value] : m.Value);
My current regex is (?<=<)(.*?)(?=>), which matches everything in between < and >, so in this case it leaves both #399969178745962506 and #!104729417217032192
I can't just skip the # sign, because the ! sign is not there every time. So it would be best to get only the numbers with something like \d+
I need to figure out how to only get the numbers between < and > but I can't for the life of me figure out how.
Very grateful for any help!

In C#, you may use 2 approaches: a lookaround-based one (since lookbehind patterns can be variable-width in .NET) and a capturing group approach.
Lookaround based approach
The pattern that will easily help you get the digits in the right context is
(?<=<#!?)\d+(?=>)
The (?<=<#!?) is a positive lookbehind that requires <# or <#! immediately to the left of the current location, and (?=>) is a positive lookahead that requires a > char immediately to the right of the current location.
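A minimal sketch of the lookaround pattern in use, assuming you only want to pull the ids out of the sample string from the question (the ids variable name is made up for illustration):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string s = "<#399969178745962506> hello to <#!104729417217032192>";

// With lookarounds the match value is the digits alone; no group is needed.
string[] ids = Regex.Matches(s, @"(?<=<#!?)\d+(?=>)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", ids));
// 399969178745962506, 104729417217032192
```

Note that if you pass this pattern to Regex.Replace, only the digits are replaced, so the surrounding <#, <#! and > stay in the result.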
Capturing approach
You may use the following pattern that will capture the digits inside the expected <...> substrings:
<#!?(\d+)>
Details
<# - a literal <# substring
!? - an optional exclamation sign
(\d+) - capturing group 1 that matches one or more digits
> - a literal > sign.
Note that the values you need can be accessed via match.Groups[1].Value, as shown in the snippet below.
Usage:
var userDic = new Dictionary<string, string> {
    {"399969178745962506", "One"},
    {"104729417217032192", "Two"}
};
var p = @"<#!?(\d+)>";
var s = "<#399969178745962506> hello to <#!104729417217032192>";
Console.WriteLine(
    Regex.Replace(s, p, m => userDic.ContainsKey(m.Groups[1].Value) ?
        userDic[m.Groups[1].Value] : m.Value
    )
); // => One hello to Two
// Or, if you need to keep <#, <#! and >
Console.WriteLine(
    Regex.Replace(s, @"(<#!?)(\d+)>", m => userDic.ContainsKey(m.Groups[2].Value) ?
        $"{m.Groups[1].Value}{userDic[m.Groups[2].Value]}>" : m.Value
    )
); // => <#One> hello to <#!Two>

To extract just the numbers from your given format, use this regex pattern:
(?<=<#|<#!)(\d+)(?=>)
See it work in action: https://regexr.com/3j6ia

You can use a non-capturing group to keep parts of the pattern out of the captured group:
(?<=<)(?:#?!?)(.*?)(?=>)
Alternatively, you could name the inner group and use the named group to get it:
(?<=<)(?:#?!?)(?<yourgroupname>.*?)(?=>)
Access it via m.Groups["yourgroupname"].Value. For more, see e.g. How do I access named capturing groups in a .NET Regex?

Regex: (?:<#!?(\d+)>)
Details:
(?:...) Non-capturing group
<# matches the characters <# literally
!? matches the character ! between zero and one times
(\d+) 1st capturing group; \d+ matches one or more digits (a digit is equal to [0-9])
> matches the character > literally
string text = "<#399969178745962506> hello to <#!104729417217032192>";
Dictionary<string, string> list = new Dictionary<string, string>() { { "399969178745962506", "One" }, { "104729417217032192", "Two" } };
text = Regex.Replace(text, @"(?:<#!?(\d+)>)", m => list.ContainsKey(m.Groups[1].Value) ? list[m.Groups[1].Value] : m.Value);
Console.WriteLine(text); // One hello to Two
Console.ReadLine();

Extract a string surrounded by two known values c# regex [duplicate]

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
From what I understand from this question, the Groups collection stores the entire match as well as the previous match. But doesn't the regexp above match only [open square bracket] [text] [close square bracket]? So why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?

match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group of captures from the (\[.*?\])+. Its Value is only the last capture, "[johnny]", not the whole match.
match.Groups[1].Captures[0] is [john]
match.Groups[1].Captures[1] is [johnny], the same as match.Groups[1].Value
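To see the difference between a group's Value and its Captures for yourself, something like the following sketch can be run against the second example (the names variable is made up for illustration):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Match match = Regex.Match("[john][johnny]", @"(\[.*?\])+");

Console.WriteLine(match.Value);                    // [john][johnny]
Console.WriteLine(match.Groups[1].Value);          // [johnny]  (last iteration only)
Console.WriteLine(match.Groups[1].Captures.Count); // 2

// Captures keeps one entry per iteration of the quantified group.
string[] names = match.Groups[1].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
Console.WriteLine(string.Join(" ", names));        // [john] [johnny]
```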

The ( ) acts as a capture group. So the matches array has all of the matches that C# finds in your string, and each match's Groups collection has the values of the capture groups inside that match. If you don't want that extra level of capture, just remove the ( ).

Groups[0] is your entire match (here "[jsmith]"), not the entire input string.
Groups[1] is your group captured by parentheses (.*?). You can configure Regex to capture explicit (named) groups only by passing RegexOptions.ExplicitCapture when you create the regex, or use (?:.*?) to create a non-capturing group.
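A small sketch of how the group count changes with each option, using the input from the question (the variable names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

string user = "Josh Smith [jsmith]";

// Default: the unnamed (...) becomes capturing group 1.
Match plain = Regex.Match(user, @"\[(.*?)\]");
Console.WriteLine(plain.Groups.Count);        // 2

// ExplicitCapture: unnamed parentheses no longer capture.
Match explicitOnly = Regex.Match(user, @"\[(.*?)\]", RegexOptions.ExplicitCapture);
Console.WriteLine(explicitOnly.Groups.Count); // 1

// Non-capturing group: same effect for this one pair of parentheses.
Match nonCapturing = Regex.Match(user, @"\[(?:.*?)\]");
Console.WriteLine(nonCapturing.Groups.Count); // 1
```

In all three cases Groups[0].Value is still "[jsmith]"; only the extra group disappears.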

The parentheses identify a group as well, so group 0 is the entire match, and group 1 is the contents of what was found between the square brackets.

How? The answer is here
(.*?)
That is a subgroup of @"\[(.*?)\]";

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?

Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
    .Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
    .Trim()
    .Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]

I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();

There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the last character that matches the one preceding it but does not match the one following it.
2. It creates capture groups for every match, which are then fed into the Split method as delimiters. The capture groups are required by the negative lookahead, specifically the \1 identifier, which basically means "the value of the first capture group in the statement", so it cannot be omitted.
3. Regex.Split, given a capture group or multiple capture groups to match on when identifying the splitting delimiters, will include the delimiters used for every individual Split operation.
Number 3 is why your string array is looking weird, Split will split on the last a in the string, which becomes split[0]. This is followed by the delimiter at split[1], etc...
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.

To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();

Could do this easily with LINQ, but I don't think its runtime will be as good as regex.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
    .GroupBy(chr => chr)
    .Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee']
However, this won't work if you have a string like "aabbaa": you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"]
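If repeated runs like "aabbaa" matter, a plain loop that starts a new run whenever the character changes avoids the grouping problem. This is only a sketch, not any of the posters' methods; SplitRuns is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

// SplitRuns is a hypothetical helper: it splits a string into runs of
// identical consecutive characters.
List<string> SplitRuns(string input)
{
    var runs = new List<string>();
    int start = 0;
    for (int i = 1; i <= input.Length; i++)
    {
        // A run ends at the end of the string or where the character changes.
        if (i == input.Length || input[i] != input[start])
        {
            runs.Add(input.Substring(start, i - start));
            start = i;
        }
    }
    return runs;
}

Console.WriteLine(string.Join(",", SplitRuns("aabbaa")));  // aa,bb,aa
Console.WriteLine(string.Join(",", SplitRuns("abbbbcc"))); // a,bbbb,cc
```

An empty input yields an empty list, since the loop body never runs.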

Regex repetition group

Capturing a repetition group always returns the last element, but that is not quite helpful. For example:
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
var value = match.Groups["somea"].Value; // returns "a"
I would like to have a collection of the matched elements instead of only the last matched item.
Is that possible?

CaptureCollection
You can use CaptureCollection which represents the set of captures made by a single capturing group.
If a quantifier is not applied to a capturing group, the CaptureCollection includes a single Capture object that represents the same captured substring as the Group object.
If a quantifier is applied to a capturing group, the CaptureCollection includes one Capture object for each captured substring, and the Group object provides information only about the last captured substring.
So you can do this
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
List<string> aCaptures = match.Groups["somea"]
    .Captures.Cast<Capture>()
    .Select(x => x.Value)
    .ToList<string>();
// aCaptures now contains three "a" strings

Take a look in the Captures collection:
match.Groups["somea"].Captures

You can also try something like this:
var regex = new Regex("^(?<somea>a)+$");
var matches = regex.Matches("aaa");
foreach (Match match in matches) {
    var value = match.Groups["somea"].Value; // returns "a"
}
This is just a sample, but it should give a good start.
I did not check the validity of your regular expression, though.

You must apply the quantifier + to the thing you want to match, not to the group. If you quantify the group, you still get only one group, and each repetition overwrites its value.
So (a)+ on aaa creates 1 group that ends up holding the last a, and (a+) creates 1 group holding aaa.
So you know what to do with your problem: just move the + inside the capturing group.
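The two placements of the + can be compared with a short sketch like this (unnamed groups are used here for brevity, which is a simplification of the question's named group):

```csharp
using System;
using System.Text.RegularExpressions;

Match outer = Regex.Match("aaa", @"^(a)+$"); // quantifier on the group
Match inner = Regex.Match("aaa", @"^(a+)$"); // quantifier inside the group

Console.WriteLine(outer.Groups[1].Value);          // a
Console.WriteLine(outer.Groups[1].Captures.Count); // 3
Console.WriteLine(inner.Groups[1].Value);          // aaa
Console.WriteLine(inner.Groups[1].Captures.Count); // 1
```

So moving the + inside gives you the whole run as one value, while keeping it outside still lets you reach each repetition through Captures.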
