Why does Regex.Match include noncapturing groups in the result? - c#

In matching a regular expression, I want to exclude noncapturing groups from the result. I incorrectly assumed that they'd be excluded by default since, well, they're called noncapturing groups.
For some reason, though, Regex.Match behaves as though I hadn't even specified a noncapturing group. Try running this in the Immediate window:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?:\d)\w").Value
I expected the result to be
"a"
but it's actually
"3a"
This question suggested I look at the Groups, but there is only one Group in the result and it too is "3a". It contains one Capture, also "3a".
What's going on here? Is Regex bugged, or is there an option I need to set?

Matching is not the same thing as capturing. (?:\d) simply means match a subpattern containing \d, but don't bother putting it in a capture group. Your entire pattern (?:\d)\w looks for a (?:\d) followed by a \w; it's functionally equivalent to \d\w.
If you're trying to match a \w only when it is preceded by a \d, use a lookbehind assertion instead:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?<=\d)\w").Value
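To see the difference side by side, here is a minimal sketch that runs both patterns against the same input. The variable names are only illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string input = "b3a";

// A non-capturing group still consumes the digit, so the digit is part of the match.
string consumed = Regex.Match(input, @"(?:\d)\w").Value;

// A lookbehind only checks that a digit precedes the \w; the digit is not consumed.
string asserted = Regex.Match(input, @"(?<=\d)\w").Value;

Console.WriteLine(consumed); // 3a
Console.WriteLine(asserted); // a
```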

A non-capturing group means the group is not stored as a capture. The text it matches is still part of the overall match.
If you want to exclude that part, use something like a lookbehind assertion:
@"(?<=\d)\w"

You are misunderstanding the purpose of noncapturing groups.
In general, groups (defined by a pair of parentheses ()) mean two things:
The contained regular expression is grouped, so any quantifiers after the brackets apply to the whole expression rather than just the previous single character.
The substring matching the group is stored as a subcapture in the Groups property.
Sometimes, you do not want the second result for certain groups, which is why noncapturing groups were introduced: They allow you to group a sub-expression without having any matches of it stored in an item in the Groups property.
You have observed that your Groups property contains one item, though - which is true, as by default, the first group is always the capture of the complete expression. cf. in the docs:
If the regular expression engine can find a match, the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern.
You can still use groups to achieve what you want, by placing the string you want to capture into a group:
\d(\w)
(I have left out the noncapturing group again as it does not change anything in your above expression.)
With this modified expression, the Groups property in your match should have 2 items:
The complete match (of \d\w)
Only the part of the above string you seem to be interested in, matched by \w
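As a quick check of the two items above, this sketch inspects the Groups property for the question's input. It assumes the input "b3a" from the question; the variable names are illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

Match withGroup = Regex.Match("b3a", @"\d(\w)");
Match withoutGroup = Regex.Match("b3a", @"(?:\d)\w");

// Groups[0] is always the whole match; Groups[1] is the first capturing group.
Console.WriteLine(withGroup.Groups.Count);     // 2
Console.WriteLine(withGroup.Groups[0].Value);  // 3a
Console.WriteLine(withGroup.Groups[1].Value);  // a

// With only a non-capturing group, the whole match is the only entry.
Console.WriteLine(withoutGroup.Groups.Count);  // 1
```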

Related

Regex to extract string between quotes

I'm trying to extract a string between two quotes, and I thought I had my regex working, but it's giving me two strings in my GroupCollection, and I can't get it to ignore the first one, which includes ID= and the first quote.
The string that I want to parse is
Test ID="12345" hello
I want to return 12345 in a group, so that I can manipulate it in code later. I've tried the following regex: http://regexr.com/3bgtl, with this code:
string nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
The problem is that the GroupCollection contains two entries:
ID="12345
12345
I just want it to return the second one.
Use positive lookbehind operator:
GroupCollection ids = Regex.Match(nodeValue, "(?<=ID=\")[^\"]*").Groups;
You also used a capturing group (the parentheses), which is why you get two results.
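A small sketch of the lookbehind variant, using the sample string from the question. Because the pattern has no capturing parentheses, the collection holds only the whole match.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string nodeValue = "Test ID=\"12345\" hello";

// The lookbehind requires ID=" before the match but does not include it.
Match m = Regex.Match(nodeValue, "(?<=ID=\")[^\"]*");

Console.WriteLine(m.Groups.Count); // 1
Console.WriteLine(m.Value);        // 12345
```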
There are a few ways to accomplish this. I like named capture groups for readability.
Regex with named capture group:
"(?<capture>.*?)"
And your code would be:
match.Groups["capture"].Value
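Put together, the named-group approach might look like the sketch below. It uses the pattern shown above, written as a C# verbatim string (where "" stands for one quote character), and the sample input from the question.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string nodeValue = "Test ID=\"12345\" hello";

// The regex is: "(?<capture>.*?)"  (a quote, a lazily matched named group, a quote)
Match match = Regex.Match(nodeValue, @"""(?<capture>.*?)""");

// The group is looked up by name, so its position in the pattern does not matter.
string id = match.Groups["capture"].Value;
Console.WriteLine(id); // 12345
```

Note that this pattern matches the first quoted string anywhere in the input; prefixing it with ID= would tie it to that attribute.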
Your code is totally OK and is the most efficient of all the solutions suggested here. Capturing groups are the quickest and least resource-consuming way to match substrings inside larger texts.
All you need to do with your regex is just access the captured group 1 that is defined by the round brackets. Like this:
var nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
Console.WriteLine(ids[1].Value);
// or just on one line
// Console.WriteLine(Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups[1].Value);
Please have a look at Grouping Constructs in Regular Expressions:
Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string. You can use grouping constructs to do the following:
Match a subexpression that is repeated in the input string.
Apply a quantifier to a subexpression that has multiple regular expression language elements. For more information about quantifiers, see Quantifiers in Regular Expressions.
Include a subexpression in the string that is returned by the Regex.Replace and Match.Result methods.
Retrieve individual subexpressions from the Match.Groups property and process them separately from the matched text as a whole.
Note that if you do not need overlapping matches, capturing group mechanism is the best solution here.

Regex match one digit or two

If this
(°[0-5])
matches °4
and this
((°[0-5][0-9]))
matches °44
Why does this
((°[0-5])|(°[0-5][0-9]))
match °4 but not °44?
Because when you use a logical OR in a regex, the engine tries the alternatives from left to right and returns as soon as one of them matches. Here the first alternative, °[0-5], matches °4 in °44, so the engine returns °4 and never tries the second alternative, °[0-5][0-9]:
((°[0-5])|(°[0-5][0-9]))
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
You are putting the shorter alternative first in the alternation. It is better to use this regex, which matches both strings:
°[0-5][0-9]?
Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression
(a|ab|abc)
when fed this input:
abcdefghi
will only ever match a. However, if the regular expression is changed to
(a|ab|abc)d
It will match a. Then, since the next character is not d, it backtracks and tries the next alternative, matching ab. Since the next character is still not d, it backtracks again and matches abc. Now the next character is d, so the match succeeds.
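The backtracking described above can be observed directly. This is a minimal sketch in C# (the .NET engine also tries alternatives left to right); the variable names are illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string input = "abcdefghi";

// Nothing follows the group, so the first alternative that matches wins.
Match plain = Regex.Match(input, "(a|ab|abc)");

// The trailing d forces the engine to backtrack through the alternatives.
Match forced = Regex.Match(input, "(a|ab|abc)d");

Console.WriteLine(plain.Value);            // a
Console.WriteLine(forced.Value);           // abcd
Console.WriteLine(forced.Groups[1].Value); // abc
```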
Why would you not reduce your regular expression from
((°[0-5])|(°[0-5][0-9]))
to this?
°[0-5][0-9]?
It's simpler and easier to understand.
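For comparison, here is a small sketch that runs the original alternation, the same alternation with the longer branch first, and the simplified pattern against °44. It is written in C#, whose engine also tries alternatives left to right; the outer capturing parentheses are dropped because only the whole match is inspected.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string input = "°44";

string shortFirst = Regex.Match(input, "°[0-5]|°[0-5][0-9]").Value; // value: °4
string longFirst = Regex.Match(input, "°[0-5][0-9]|°[0-5]").Value;  // value: °44
string optionalDigit = Regex.Match(input, "°[0-5][0-9]?").Value;    // value: °44

// Lengths are printed to avoid depending on the console's encoding of °.
Console.WriteLine(shortFirst.Length);    // 2
Console.WriteLine(longFirst.Length);     // 3
Console.WriteLine(optionalDigit.Length); // 3
```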

Weird Regex behavior in C#

I am trying to extract some alphanumeric expressions out of a longer word in C# using regular expressions. For example, I have the word "FooNo12Bee". I use the following regular expression code, which returns two matches, "No12" and "No", as results:
string alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"(No|Num)\d{1,3}");
If I use the following expression, without parentheses and without any alternative for "No", it works the way I expect and returns only "No12":
string alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"No\d{1,3}");
What is the difference between these two expressions? Why does using parentheses produce a redundant result for "No"?
Parentheses in a regex create capture groups, meaning that whatever is matched between the parentheses is captured and stored as a group.
If you don't want a capture group but still need a group for the alternation, use a non-capturing group instead by putting ?: after the opening parenthesis:
Match m = Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");
Usually, if you don't want to change the regex for some reason, you can simply retrieve the group 0 from the match to get only the whole match (and thus ignore any capture groups); in your case, using m.Groups[0].Value.
Last, you can improve the efficiency of the regex by a notch using:
Match m = Regex.Match(alfaNumericWord, @"N(?:o|um)\d{1,3}");
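The following sketch compares what each variant stores. In both cases there is a single match; the extra "No" is a captured group inside that match, not a second match. The variable names capturing and nonCapturing are illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string alfaNumericWord = "FooNo12Bee";

Match capturing = Regex.Match(alfaNumericWord, @"(No|Num)\d{1,3}");
Match nonCapturing = Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");

// Capturing parentheses add a second entry to Groups.
Console.WriteLine(capturing.Groups.Count);     // 2
Console.WriteLine(capturing.Groups[0].Value);  // No12
Console.WriteLine(capturing.Groups[1].Value);  // No

// The non-capturing version keeps only the whole match.
Console.WriteLine(nonCapturing.Groups.Count);  // 1
Console.WriteLine(nonCapturing.Value);         // No12
```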
I don't know the exact term for it, but putting parentheses around it creates a new group. It is well explained here:
Besides grouping part of a regular expression together, parentheses also create a numbered capturing group. It stores the part of the string matched by the part of the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue. In the first case, the first (and only) capturing group remains empty. In the second case, the first capturing group matches Value.
It is because the parentheses are creating a group. You can make the group non-capturing with ?: like so:
Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");

Simple regex doesn't work

I want to match the strings "F1" to "F12". I only need the number. I'm out of practice; my first try:
var r = new Regex(@"^(?:[F])[\d]{1,2}$");
matches, but it returns "F1" where I expected to get "1".
What have I done wrong?
Maybe you want to use lookbehind:
var r = new Regex(@"(?<=^F)\d\d?$");
Even though you are using a non-capturing group for the "F", the overall match for your Regex will return the entire string it matched. Groups are used to outline sub-expressions within your regular expression whose values you want to be able to extract. Non-capturing groups are used if you want to specify a sub-expression without having it stored in a group. They allow you to apply quantifiers to your sub-expression, but do not allow you to extract the resulting value after running the regex against a string. They are typically used for performance gains, since capturing groups add extra overhead.
If you want to get just the number, you need to put the number portion in a capturing group and look at the Groups property of the resulting Match (assuming you are calling the r.Match function).
The updated Regex would be:
var r = new Regex(@"^(?:[F])([\d]{1,2})$");
Since our number is inside the first set of parentheses associated with a capturing group, it will be group 1. You could also name your group to avoid confusion or possible errors if the regex gets updated at a later date.
Alternately, you can just use look-behind as M42 has suggested.
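Both approaches can be sketched as follows. The patterns are slightly simplified forms of the ones above (F instead of [F], \d instead of [\d]), and the variable names are illustrative.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Approach 1: capture the digits and read group 1.
var byGroup = new Regex(@"^F(\d{1,2})$");

// Approach 2: assert the leading F with a lookbehind, so only the digits are matched.
// The ^ sits inside the lookbehind: a pattern starting with ^(?<=F) cannot match,
// because no character precedes the start of the string.
var byLookbehind = new Regex(@"(?<=^F)\d{1,2}$");

string fromGroup = byGroup.Match("F12").Groups[1].Value;
string fromLookbehind = byLookbehind.Match("F12").Value;

Console.WriteLine(fromGroup);                    // 12
Console.WriteLine(fromLookbehind);               // 12
Console.WriteLine(byGroup.IsMatch("F123"));      // False
Console.WriteLine(byLookbehind.IsMatch("XF1"));  // False
```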

Match optional pattern preceded by random content

I need to capture two groups in the following sentence, one is I, the other is optional
I want to match random optional field.
I tried the following approach, but it's not yielding expected result:
(I).*?(optional)?
Removing the round parentheses around optional makes it match correctly, but since I need the second group, I can't do that.
(I).*?optional?
So how can I match both groups correctly? thanks!
The trick with your regex is that you need to group (and discard) anything leading up to optional that doesn't match optional.
Use a negative lookahead inside a non-capturing group (the ?: prefix keeps the group from being captured):
(I)(?:(?!optional).)*(optional)?.*
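A short sketch in C# (the language of the other questions on this page, assumed here for illustration) shows why the first attempt leaves the second group empty and how the lookahead version behaves with and without the optional word.

```csharp
// Top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string sentence = "I want to match random optional field.";

// The lazy .*? is satisfied with zero characters, and (optional)? is allowed
// to match nothing, so the match stops right after "I".
Match lazy = Regex.Match(sentence, @"(I).*?(optional)?");

// Each character is consumed only if "optional" does not start there, so the
// scan stops exactly where the optional word begins.
var tempered = new Regex(@"(I)(?:(?!optional).)*(optional)?.*");
Match found = tempered.Match(sentence);
Match absent = tempered.Match("I want to match a field.");

Console.WriteLine(lazy.Groups[2].Success);   // False
Console.WriteLine(found.Groups[1].Value);    // I
Console.WriteLine(found.Groups[2].Value);    // optional
Console.WriteLine(absent.Groups[1].Value);   // I
Console.WriteLine(absent.Groups[2].Success); // False
```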
You can try the following regex:
(I).*(optional)
the pair of brackets will return you the capturing groups.
