Confusion over Multiple Matches in a Regex

I've tested my regex in a regex tester and the pattern itself looks like it should work. However, instead of matching 4 items as I expect, it matches only 1 (the entire string), and I'm not sure why it is doing even that.
rgx = new Regex(@"^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$");
matches = rgx.Matches("0.0.0.95");
At this point, if I do:
foreach (Match m in matches)
{
    Console.WriteLine(m.Value);
}
it just shows "0.0.0.95", when I expected it to match 0, 0, 0, and 95 separately rather than the entire string. What am I doing wrong here?
ANSWER - The single match of the entire string contained the group matches I was looking for, accessed in this manner:
r.r1 = Convert.ToInt32(m.Groups[1].Value);
r.r2 = Convert.ToInt32(m.Groups[2].Value);
r.r3 = Convert.ToInt32(m.Groups[3].Value);
r.r4 = Convert.ToInt32(m.Groups[4].Value);

In this case you don't get multiple matches - there is only one match in there, but it has four capturing groups:
   ^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$
//  ^^^^^^^^  ^^^^^^^^  ^^^^^^^^  ^^^^^^^^
//  Group 1   Group 2   Group 3   Group 4
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                 Group 0
There is a special group number zero that includes the entire match.
So you need to modify your program like this:
Console.WriteLine("One:'{0}' Two:'{1}' Three:'{2}' Four:'{3}'"
, m.Groups[1].Value
, m.Groups[2].Value
, m.Groups[3].Value
, m.Groups[4].Value
);
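
As a minimal sketch of the same idea, the groups of the single match can be read in a loop. The class and method names here (VersionParser, Parse) are hypothetical and not part of the original question; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionParser
{
    // Hypothetical helper: returns the four numeric parts, or null if the input does not match.
    public static int[] Parse(string input)
    {
        Match m = Regex.Match(input, @"^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$");
        if (!m.Success)
        {
            return null;
        }

        var parts = new int[4];
        for (int i = 0; i < 4; i++)
        {
            // Groups[0] is the whole match, so the captured numbers start at index 1.
            parts[i] = int.Parse(m.Groups[i + 1].Value);
        }
        return parts;
    }

    public static void Main()
    {
        int[] parts = Parse("0.0.0.95");
        Console.WriteLine(string.Join(", ", parts)); // 0, 0, 0, 95
    }
}
```

Checking m.Success first avoids reading empty group values when the input is not in the expected four-part form.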

Related

Match properties using regex

I have a string that represents a set of properties, for example:
AB=0, TX="123", TEST=LDAP, USR=" ", PROPS="DN=VB, XN=P"
I need to extract these properties as:
AB=0
TX=123
TEST=LDAP
USR=
PROPS=DN=VB, XN=P
To resolve this problem I tried to use a regex, but without success.
public IEnumerable<string> SplitStr(string input)
{
    Regex reg = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))", RegexOptions.Compiled);
    foreach (Match match in reg.Matches(input))
    {
        yield return match.Value.Trim(',');
    }
}
I can't find the right regex to produce the expected output. With the above regex the output is:
AB=0
123
TEST=LDAP
DN=VB, XN=P
Can anyone help me?

You may use
public static IEnumerable<string> SplitStr(string input)
{
    var matches = Regex.Matches(input, @"(\w+=)(?:""([^""]*)""|(\S+)\b)");
    foreach (Match match in matches)
    {
        yield return string.Concat(match.Groups.Cast<Group>().Skip(1).Select(x => x.Value)).Trim();
    }
}
The regex details:
(\w+=) - Group 1: one or more word chars and a = char
(?:""([^""]*)""|(\S+)\b) - a non-capturing group matching either of the two alternatives:
"([^"]*)" - a ", then 0 or more chars other than " and then a "
| - or
(\S+)\b - any 1+ chars other than whitespace, as many as possible, up to the word boundary position.
The string.Concat(match.Groups.Cast<Group>().Skip(1).Select(x => x.Value)).Trim() code omits the Group 0 (whole match) value from the groups, takes Group 1, 2 and 3 and concats them into a single string, and trims it afterwards.
C# test:
var s = "AB=0, TX=\"123\", TEST=LDAP, USR=\" \", PROPS=\"DN=VB, XN=P\"";
Console.WriteLine(string.Join("\n", SplitStr(s)));
Output:
AB=0
TX=123
TEST=LDAP
USR=
PROPS=DN=VB, XN=P

Another way could be to use 2 capturing groups where the first group captures the first part including the equals sign and the second group captures the value after the equals sign.
Then you can concatenate the groups and use Trim to remove the double quotes. If you also want to remove the whitespaces after that, you could use Trim again.
([^=\s,]+=)("[^"]+"|[^,\s]+)
That will match
( First capturing group
[^=\s,]+= Match 1+ times not an equals sign, comma or whitespace char, then match = (if the property name can contain a comma, you could instead use a character class that specifies what is allowed to match, for example [\w,]+)
) Close group
( Second capturing group
"[^"]+" Match from opening till closing double quote
| Or
[^,\s]+ Match 1+ times not a comma or whitespace char
)
Your code might look like:
public IEnumerable<string> SplitStr(string input)
{
    foreach (Match m in Regex.Matches(input, @"([^=\s,]+=)(""[^""]+""|[^,\s]+)"))
    {
        yield return string.Concat(m.Groups[1].Value, m.Groups[2].Value.Trim('"'));
    }
}
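
If the names and values are needed separately rather than as NAME=VALUE strings, one possible variation is to collect them into a dictionary. This is only a sketch: the class name PropertyParser, the method Parse, and the exact pattern are assumptions made for illustration, not code from either answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PropertyParser
{
    // Assumed pattern: a name, '=', then either a quoted value or a bare value.
    private static readonly Regex Pair =
        new Regex(@"(\w+)=(?:""([^""]*)""|([^,\s]+))");

    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(input))
        {
            // Only one of the two value groups participates in any given match.
            Group quoted = m.Groups[2];
            string value = quoted.Success ? quoted.Value : m.Groups[3].Value;
            result[m.Groups[1].Value] = value.Trim();
        }
        return result;
    }

    public static void Main()
    {
        var s = "AB=0, TX=\"123\", TEST=LDAP, USR=\" \", PROPS=\"DN=VB, XN=P\"";
        foreach (var pair in Parse(s))
        {
            Console.WriteLine(pair.Key + "=" + pair.Value);
        }
    }
}
```

Group.Success tells the two alternatives apart, which avoids trimming quote characters after the fact.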

Extract a string surrounded by two known values c# regex [duplicate]

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
Looking at this question, my understanding is that the Groups collection stores the entire match as well as the previous match. But doesn't the regex above match only [open square bracket] [text] [close square bracket]? So why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?

match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group for (\[.*?\]). Because the group is quantified, match.Groups[1].Value holds only its last capture, "[johnny]".
match.Groups[1].Captures[0] is "[john]".
match.Groups[1].Captures[1] is "[johnny]", the same as match.Groups[1].Value.
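
The relationship between a match, its groups and a group's captures can be checked with a short program. This is a sketch; CaptureDemo and BracketedNames are hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: lists every capture made by group 1 of the first match.
    public static List<string> BracketedNames(string input)
    {
        var names = new List<string>();
        Match match = Regex.Match(input, @"(\[.*?\])+");
        foreach (Capture c in match.Groups[1].Captures)
        {
            names.Add(c.Value);
        }
        return names;
    }

    public static void Main()
    {
        Match match = Regex.Match("[john][johnny]", @"(\[.*?\])+");
        Console.WriteLine(match.Value);           // [john][johnny]
        Console.WriteLine(match.Groups[1].Value); // [johnny] (last capture only)
        Console.WriteLine(string.Join(" ", BracketedNames("[john][johnny]"))); // [john] [johnny]
    }
}
```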

The ( ) acts as a capture group. So the matches collection holds all of the matches that C# finds in your string, and each match's Groups collection holds the values of the capture groups inside that match. If you don't want that extra level of capture, just remove the ( ).

Groups[0] is the entire match (here "[jsmith]"), not the entire input string.
Groups[1] is the group captured by the parentheses (.*?). You can configure Regex to capture explicit (named) groups only with RegexOptions.ExplicitCapture, or use (?:.*?) to create a non-capturing group.

The parentheses identify a group as well, so the first group (Groups[0]) is the entire match, and the second group (Groups[1]) is the content found between the square brackets.

How? The answer is here:
(.*?)
That is a subgroup of @"\[(.*?)\]".

Regex match not returning group when using wildcard

Why does this work (returns 25):
var match = Regex.Match("Age: 25 yrs.", @"(\d+)");
Console.WriteLine(match.Groups[1].Value);
But this doesn't (returns a blank group):
var match = Regex.Match("Age: 25 yrs.", @"(\d*)");
Console.WriteLine(match.Groups[1].Value);
There must be something fundamental about how .NET handles regular expressions that I'm missing.

The point is that \d* also matches the empty string, and Match finds only the first match. You can fit as many empty strings as you want in front of any string, so it returns the first empty one.
So if you do this, it finds many matches, most of them empty, with 25 being one of them.
var matches = Regex.Matches("Age: 25 yrs.", @"(\d*)");
foreach (var match in matches.Cast<Match>())
{
    Console.WriteLine(match.Index + ":" + match.Value);
}

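(\d*) matches zero or more digits, so it can also match the empty string, and the empty match at the very start of the input is the first one found.
You probably meant (\d+), which requires one or more digits. Note that (\d)+ is not equivalent: it repeats the group, so Groups[1] would hold only the last digit.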

Improve RegEx search

Using DirectoryServices.AccountManagement, I'm getting a user's DistinguishedName, which looks like this:
CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
I need to get the first OU value from this.
I found a similar solution: C# Extracting a name from a string
With some tweaks I created this code:
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
Match m = Regex.Match(input, @"OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
This code returns STORE as expected, but if I change Groups[1] to Groups[0] I get almost the same result as the input string:
OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
How can I change this regex so that it returns only the values of OU, so that in this example I get an array of 2 matches? If I had more OUs in my string, the array would be longer.
EDIT:
I've converted my code (using @dasblinkenlight's suggestions) into a function:
private static List<string> GetOUs()
{
    var input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
    var mm = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
    return (from Match m in mm select m.Groups[1].Value).ToList();
}
Is that correct?

Your regular expression is (almost) fine; you are just using the wrong API.
Remove the part of the regex that matches up to the ending anchor $, replace the call to Match with a call to Matches, and read the matches in a loop, like this:
var input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
var mm = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
foreach (Match m in mm)
{
    Console.WriteLine(m.Groups[1].Value);
}

Your existing regex:
@"OU=([a-zA-Z\\]+)\,.*$"
matches OU=, then some letters and backslashes ([a-zA-Z\\]+), then a comma, then any characters (.*) to the end of the line ($).
Thus a single match will always cover the entire line after the first OU section.
Modify your regex by removing the ,.*$ at the end, and it will match each OU group:
@"OU=([a-zA-Z\\]+)"
Also note that the parentheses are a capturing group. They are useful if you want to capture just the value part by itself, but if you are not using that, they are not necessary, and you can just have this:
@"OU=[a-zA-Z\\]+"

It's because you are mixing up matches and groups.
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection mc = Regex.Matches(input, @"OU=([a-zA-Z\\]+),");
foreach (Match m in mc)
{
    Console.WriteLine(m.Result("$1"));
}

Groups[0] returns the full match.
Groups[1] returns the first capturing group in the match (i.e. everything matched inside the first pair of parentheses).
So if you wanted to get exactly those 2 occurrences of OU, you could do this:
Match m = Regex.Match(input, @"OU=([a-zA-Z\\]+)\,OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Groups[0] returns the full match (which you don't want).
Groups[1] returns the first capturing group in the match (i.e. everything matched inside the first pair of parentheses).
Groups[2] returns the second capturing group in the match (i.e. everything matched inside the second pair of parentheses).
Giving:
STORE
COMPANY
But I'm assuming you don't want to spell out every pattern you are interested in so explicitly in your regex.
If you want to get multiple matches, you need to call Regex.Matches, which returns a MatchCollection.
MatchCollection ms = Regex.Matches(...);
This still won't work with your current regex though, because everything from STORE to the end of the line will be in the first match. If you only want to get the pattern "1-or-more-letters" after an "OU=", you only need:
@"OU=([a-zA-Z\\]+)"
So your code would be:
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection ms = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
foreach (Match m in ms)
{
    Console.WriteLine(m.Groups[1].Value); // get the string in the first "(" ")"
}

Regex repetition group

Capturing a repeated group always returns only the last element, which is not very helpful. For example:
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
var value = match.Groups["somea"].Value; // returns "a"
I would like to get a collection of all the matched elements instead of only the last one.
Is that possible?

CaptureCollection
You can use CaptureCollection which represents the set of captures made by a single capturing group.
If a quantifier is not applied to a capturing group, the CaptureCollection includes a single Capture object that represents the same captured substring as the Group object.
If a quantifier is applied to a capturing group, the CaptureCollection includes one Capture object for each captured substring, and the Group object provides information only about the last captured substring.
So you can do this
var regex = new Regex("^(?<somea>a)+$");
var match = regex.Match("aaa");
List<string> aCaptures = match.Groups["somea"]
    .Captures.Cast<Capture>()
    .Select(x => x.Value)
    .ToList();
// aCaptures now contains one "a" for each of the three captures

Take a look in the Captures collection:
match.Groups["somea"].Captures

You can also try something like this:
var regex = new Regex("^(?<somea>a)+$");
var matches = regex.Matches("aaa");
foreach (Match _match in matches)
{
    Console.WriteLine(_match.Groups["somea"].Value); // prints "a"
}
This is just a sample, but it should give you a good start.
I did not check the validity of your regular expression, though.

Apply the quantifier + to the thing you want to match, not to the group. Quantifying the group does not create one group per repetition.
So (a)+ on aaa creates a single group whose value is overwritten by each new repetition, leaving only the last a, whereas (a+) creates a single group containing aaa.
So if you want the whole run in one group, just move the + inside the capturing group.
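
The two placements of the quantifier can be compared directly. This is a sketch; RepetitionDemo and Describe are hypothetical names, and the output format "value/captureCount" is chosen only for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepetitionDemo
{
    // Hypothetical helper: describes group 1 as "value/captureCount".
    public static string Describe(string input, string pattern)
    {
        Group g = Regex.Match(input, pattern).Groups[1];
        return g.Value + "/" + g.Captures.Count;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("aaa", "^(a)+$")); // a/3
        Console.WriteLine(Describe("aaa", "^(a+)$")); // aaa/1
    }
}
```

With the quantifier outside the group, the individual repetitions are still available through Captures even though Value shows only the last one.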
