Regex repetition group - c#

Capturing a repetition group always returns only the last element, which is not very helpful. For example:
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
var value = match.Groups["somea"].Value; // returns "a"
I would like to have a collection of the matched elements instead of only the last one.
Is that possible?

CaptureCollection
You can use CaptureCollection which represents the set of captures made by a single capturing group.
If a quantifier is not applied to a capturing group, the CaptureCollection includes a single Capture object that represents the same captured substring as the Group object.
If a quantifier is applied to a capturing group, the CaptureCollection includes one Capture object for each captured substring, and the Group object provides information only about the last captured substring.
So you can do this:
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
List<string> aCaptures = match.Groups["somea"]
    .Captures.Cast<Capture>()
    .Select(x => x.Value)
    .ToList();
// aCaptures now contains one "a" per repetition

Take a look in the Captures collection:
match.Groups["somea"].Captures

You can also try something like this:
var regex = new Regex("^(?<somea>a)+$");
var matches = regex.Matches("aaa");
foreach (Match _match in matches)
{
    var value = _match.Groups["somea"].Value; // returns "a"
}
This is just a sample, but it should give a good start.
I did not check the validity of your regular expression, though.

Apply the quantifier + to the thing you want to match, not to the group. Quantifying the group does not create one group per repetition: there is still a single group, and each repetition overwrites its value.
So (a)+ on aaa creates 1 group whose value ends up as the last a, while (a+) creates 1 group with the value aaa.
If you want the whole run in the group, just move the + inside the capturing group.
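The difference between quantifying the group and quantifying inside the group can be checked with a small sketch. The class and method names here are made up for illustration; only Regex.Match, Groups and Captures come from .NET.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuantifierPlacementDemo
{
    // Final value of group 1 for the given pattern and input.
    public static string GroupValue(string pattern, string input) =>
        Regex.Match(input, pattern).Groups[1].Value;

    // Every capture recorded by group 1, in order.
    public static string[] GroupCaptures(string pattern, string input) =>
        Regex.Match(input, pattern).Groups[1].Captures
             .Cast<Capture>()
             .Select(c => c.Value)
             .ToArray();

    public static void Main()
    {
        Console.WriteLine(GroupValue("^(a)+$", "aaa"));  // a   (last repetition only)
        Console.WriteLine(GroupValue("^(a+)$", "aaa"));  // aaa (whole run in one capture)
        Console.WriteLine(string.Join(",", GroupCaptures("^(a)+$", "aaa"))); // a,a,a
    }
}
```

Moving the + inside the group gives one capture holding the whole run; keeping it outside and reading Captures gives one entry per repetition.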

Related

How to get groups from a repeated pattern with a quantifier in a regular expression

I have the following string:
(a,b,c,d,e)
I want to extract all the comma-separated values with a regular expression.
If I remove the brackets
a,b,c,d,e
and use the following regular expression:
([^,]),?
I get one match, with one group, for each comma-separated value.
But if I try the same with the enclosing brackets, using the regular expression:
\((([^,]),?)+\)
I still get only one match and one group. The group contains only the last comma separated value.
I tried also with group captures like:
(?:....)
(...?)
(...)?
but I cannot get the comma-separated values out through regular expression groups.
How can I do this, when the comma separated values are enclosed in brackets?
In general that's how repeated groups work - you don't have separate groups, just the last one. If you want to separate values between commas, it's better to use string functions available in your programming language to first strip brackets and then split string on commas.
For example in Ruby:
[10] pry(main)> '(a,b,c,d,e,f)'.gsub(/[()]/,'').split(',')
# => ["a", "b", "c", "d", "e", "f"]
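Since the rest of this thread is about C#, here is a rough C# counterpart of the Ruby one-liner. It is only a sketch: the class and method names are invented, and unlike the gsub call it strips brackets only at the two ends of the string, not in the middle.

```csharp
using System;

public static class SplitDemo
{
    // Strip the leading "(" and trailing ")" and split the rest on commas.
    public static string[] SplitBracketedList(string text) =>
        text.Trim('(', ')').Split(',');

    public static void Main()
    {
        string[] values = SplitBracketedList("(a,b,c,d,e,f)");
        Console.WriteLine(string.Join(" ", values)); // a b c d e f
    }
}
```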
I figured it out. In C# you can use the Captures property of a Group in the match.
Using Regex:
\((([^,]),?)+\)
Do:
string text = "(a,b,c,d,e)";
Regex rgx = new Regex("\\((([^,]),?)+\\)");
MatchCollection matches = rgx.Matches(text);
Then you have 1 item in the MatchCollection, with the following 3 groups:
[0]: \((([^,]),?)+\) => (a,b,c,d,e)
[1]: (([^,]),?) => value and optional comma, e.g. a, or b, or e
[2]: ([^,]) => value only, e.g. a or b or ...
Each group's Captures list stores every value captured under the quantifier. So use group [2] and its captures to get all the values.
So the solution is:
string text = "(a,b,c,d,e)";
Regex rgx = new Regex("\\((([^,]),?)+\\)");
MatchCollection matches = rgx.Matches(text);
//now get the captured values
CaptureCollection captures = matches[0].Groups[2].Captures;
//and extract them to list
List<string> values = new List<string>();
foreach (Capture capture in captures)
{
values.Add(capture.Value);
}

Extract a string surrounded by two known values c# regex [duplicate]

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
Looking at this question, from what I understand the Groups collection stores the entire match as well as the previous match. But doesn't the regexp above match only [open square bracket] [text] [close square bracket]? So why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?
match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group of captures from the (\[.*?\])+. Its Value is the last capture, "[johnny]", not the whole match.
match.Groups[1].Captures[0] is [john]
match.Groups[1].Captures[1] is [johnny], the same as match.Groups[1].Value
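To see what Value, Groups and Captures hold for the "[john][johnny]" example, a short sketch like the following can be run. The helper names are hypothetical; the pattern and input are the ones from the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupsVersusCaptures
{
    // One or more bracketed names in a row, as in the example.
    public static Match MatchNames(string input) =>
        Regex.Match(input, @"(\[.*?\])+");

    // All captures of a group, in the order they were made.
    public static string[] CapturesOf(Group group) =>
        group.Captures.Cast<Capture>().Select(c => c.Value).ToArray();

    public static void Main()
    {
        Match match = MatchNames("[john][johnny]");
        Console.WriteLine(match.Value);            // [john][johnny]
        Console.WriteLine(match.Groups[1].Value);  // [johnny]
        foreach (string name in CapturesOf(match.Groups[1]))
            Console.WriteLine(name);               // [john], then [johnny]
    }
}
```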
The ( ) acts as a capture group. So the matches array has all of the matches that C# finds in your string, and the sub-array has the values of the capture groups inside each of those matches. If you don't want that extra level of capture, just remove the ( ).
Groups[0] is the entire match (not the entire input string).
Groups[1] is your group captured by parentheses (.*?). You can configure Regex to capture Explicit groups only (there is an option for that when you create a regex), or use (?:.*?) to create a non-capturing group.
The parentheses identify a group as well, so group 0 is the entire match, and group 1 is the contents of what was found between the square brackets.
How? The answer is here
(.*?)
That is a subgroup of @"\[(.*?)\]".
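The two ways of dropping the extra group mentioned above, RegexOptions.ExplicitCapture and the non-capturing (?:...) form, can be compared by counting groups. This is a minimal sketch with an invented helper name; the input is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCountDemo
{
    // Number of groups (including group 0) in a match of the pattern against the sample input.
    public static int GroupCount(string pattern, RegexOptions options) =>
        Regex.Match("Josh Smith [jsmith]", pattern, options).Groups.Count;

    public static void Main()
    {
        // Plain parentheses capture: group 0 plus group 1.
        Console.WriteLine(GroupCount(@"\[(.*?)\]", RegexOptions.None));            // 2
        // Non-capturing group: only group 0 remains.
        Console.WriteLine(GroupCount(@"\[(?:.*?)\]", RegexOptions.None));          // 1
        // ExplicitCapture: unnamed parentheses no longer capture.
        Console.WriteLine(GroupCount(@"\[(.*?)\]", RegexOptions.ExplicitCapture)); // 1
    }
}
```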

Confusion over Multiple Matches in a Regex

I've tested my regex in a regex tester and the pattern itself looks like it should work. However, instead of matching 4 items as it should, it only matches 1 (the entire string), and I'm not sure why it's even doing that...
rgx = new Regex(@"^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$");
matches = rgx.Matches("0.0.0.95");
at this point if I do:
foreach (Match m in matches)
{
Console.WriteLine(m.Value);
}
it will just show "0.0.0.95" when it should be matching 0, 0, 0, and 95 and not the entire string. What am I doing wrong here?
ANSWER - The single match of the entire string contained the group matches I was looking for, accessed in this manner:
r.r1 = Convert.ToInt32(m.Groups[1].Value);
r.r2 = Convert.ToInt32(m.Groups[2].Value);
r.r3 = Convert.ToInt32(m.Groups[3].Value);
r.r4 = Convert.ToInt32(m.Groups[4].Value);
In this case you don't get multiple matches - there is only one match in there, but it has four capturing groups:
   ^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$
//  ^^^^^^^^  ^^^^^^^^  ^^^^^^^^  ^^^^^^^^
//  Group 1   Group 2   Group 3   Group 4
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                 Group 0
There is a special group number zero that includes the entire match.
So you need to modify your program like this:
Console.WriteLine("One:'{0}' Two:'{1}' Three:'{2}' Four:'{3}'"
, m.Groups[1].Value
, m.Groups[2].Value
, m.Groups[3].Value
, m.Groups[4].Value
);
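Put together as a self-contained sketch, reading the four groups from the single match might look like this. The class and method names are made up, and returning an empty array when nothing matches is just one possible choice.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DottedNumbersDemo
{
    private static readonly Regex Pattern =
        new Regex(@"^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$");

    // Returns the four numbers from groups 1..4, or an empty array if the input does not match.
    public static int[] ParseDottedNumbers(string text)
    {
        Match m = Pattern.Match(text);
        if (!m.Success)
            return Array.Empty<int>();

        // Group 0 is the whole match, so the parts start at index 1.
        return Enumerable.Range(1, 4)
                         .Select(i => int.Parse(m.Groups[i].Value))
                         .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", ParseDottedNumbers("0.0.0.95"))); // 0 0 0 95
    }
}
```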

Regular expression match substring

I tried to create a regular expression which pulls everything that matches:
[aA-zZ]{2}[0-9]{5}
The problem is that I want to exclude from matching when I have eg. ABCD12345678
Can anyone help me resolve this?
EDIT1:
I am looking for two letters followed by five digits in the string, but I want to exclude strings like ABCD12345678 from matching, because the regular expression above returns CD12345 for them.
EDIT2:
I didn't check everything but I think I found answer:
WHEN field is null then field
WHEN fnRegExMatch(field, '[a-zA-Z]{2}[0-9]{5}') = 'N/A' THEN field
WHEN field like '%[^a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][^0-9]%' or field like '[a-z][a-z][0-9][0-9][0-9][0-9][0-9][^0-9]%' THEN fnRegExMatch(field, '[a-zA-Z]{2}[0-9]{5}')
ELSE field
First, [aA-zZ] does not make sense: the range A-z also covers the punctuation characters between Z and a. Use [a-zA-Z] instead. Second, use word boundaries:
\b[a-zA-Z]{2}[0-9]{5}\b
You could also use case insensitive modifier:
(?i)\b[a-z]{2}[0-9]{5}\b
According to your comment, it seems you may have an underscore after the five digits. In that case the word boundary doesn't work, because _ is a word character, and you have to use this instead:
(?i)(?<![a-z])([a-z]{2}[0-9]{5})(?![0-9])
(?<![a-z]) is a negative lookbehind that asserts there is no letter before the two mandatory letters
(?![0-9]) is a negative lookahead that asserts there is no digit after the five mandatory digits
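As a quick check of the two patterns in C#, the following sketch tries both on a few inputs. The inputs and helper names are invented for illustration; the patterns are the ones given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeFinderDemo
{
    private const string BoundaryPattern = @"(?i)\b[a-z]{2}[0-9]{5}\b";
    private const string LookaroundPattern = @"(?i)(?<![a-z])([a-z]{2}[0-9]{5})(?![0-9])";

    // Returns the first match of the pattern, or an empty string if there is none.
    public static string Find(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static string FindWithBoundaries(string input) => Find(BoundaryPattern, input);
    public static string FindWithLookarounds(string input) => Find(LookaroundPattern, input);

    public static void Main()
    {
        // Too many letters and digits around the code: both patterns reject it.
        Console.WriteLine(FindWithBoundaries("ABCD12345678") == "");   // True
        Console.WriteLine(FindWithLookarounds("ABCD12345678") == "");  // True
        // Underscore after the digits: \b fails, the lookarounds still match.
        Console.WriteLine(FindWithBoundaries("AB12345_x") == "");      // True
        Console.WriteLine(FindWithLookarounds("AB12345_x"));           // AB12345
    }
}
```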
This would be the code, along with usage samples.
public static Regex regex = new Regex(
"\\b[a-zA-Z]{2}\\d{5}\\b",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);
//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();

Regex to find multiple matched items at one time

So here is the string:
"DC:PPE Env:CH1 Slice:whatever to extract"
or "babaasdfsd DC:PPE asdfas Env:CH1 or Slice:whatever "
basically I am trying to find "DC:PPE" "Env:CH1" "Slice:whatever" and remove them.
I am using the following regex:(c#)
Regex r = new Regex(
@"(?:
(?<captured>(?:^|\s+)Slice|Env|Dc:.*?\s+)()
){1}
\1",
With (?:^|\s+) I am trying to require that Slice|Env|Dc either appears at the beginning or has leading spaces.
With .*?\s+ I am trying to do a non-greedy match up to the spaces after DC:PPE.
I want it to return all three matches together.
What is wrong with this?
string combinedTestSTring = "DC:PPE Env:CH1 Slice:whatever to extract";
Regex r = new Regex(
@"(?:
(?<captured>(?:^|\s+)Slice|Env|Dc:.*?\s+)()
){1}
\1",
RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase);
var a = r.Matches(combinedTestSTring);
Does this do what you want:
Regex r = new Regex(@"\b((?:Slice|Env|Dc):.+?)\b", RegexOptions.IgnoreCase);
\b matches a word boundary. Then it matches Slice|Env|Dc followed by : and then at least one character leading up to another word boundary.
You can't return all matches together. When returning an array of matches, each element corresponds to a different capturing group in the regexp. If you have a group with a repeat count or wildcard, the returned match is just the last one found, not all of them. So you have to write a loop that walks through the input string, returning each match.
However, if you just want to replace them all, r.Replace() will do that, since it replaces all the matches in the string.
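A minimal sketch of the Replace() approach follows. The pattern here is an assumption, not the one from the answers: it treats each value as a run of non-whitespace characters after the colon (\S+) and also swallows the whitespace that follows. The class and method names are invented.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenRemovalDemo
{
    // Assumed pattern: key, colon, a value without whitespace, then any trailing whitespace.
    private static readonly Regex Token =
        new Regex(@"\b(?:Slice|Env|Dc):\S+\s*", RegexOptions.IgnoreCase);

    // Removes every key:value token and trims what is left.
    public static string RemoveTokens(string input) =>
        Token.Replace(input, string.Empty).Trim();

    // Lists every key:value token found, one Match per token.
    public static string[] FindTokens(string input) =>
        Token.Matches(input).Cast<Match>().Select(m => m.Value.Trim()).ToArray();

    public static void Main()
    {
        string text = "DC:PPE Env:CH1 Slice:whatever to extract";
        Console.WriteLine(RemoveTokens(text));                 // to extract
        Console.WriteLine(string.Join("|", FindTokens(text))); // DC:PPE|Env:CH1|Slice:whatever
    }
}
```

Each token comes back as its own Match, so Matches() plus a loop (or LINQ, as here) lists them, while Replace() removes them all in one call.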
