Regular expression to match optional group - c#

I'm expecting a string that may or may not have '#' followed by some numbers. If there's any numbers after '#' I want to capture it as second group, otherwise just capture everything in first group. Following is example for C#
ABC#99999//match Group1: ABC and Group2: 99999
9ABC#8 //match Group1 9ABC and Group
9ABC //match Group1 9ABC
9ABC# //match Group1 9ABC#
The following Regex works, but for 3rd and 4th string, it captures into group 3 instead of group 1. is there a better way for the above scenario?
(?:(.+)#(\d+))|(.+)
Alternatively, I came up with following Regex, but the problem is since the first group doesn't have a fixed format (like length), it captures whole string from 1st and second string instead of capturing 2 groups
(.+)(?:#(\d+))?

To match a whole string and place the part before an optional # + digits into Group 1 and all those digits into Group 2, you may use
^(.+?)(?:#(\d+))?$
See the .NET regex demo. The \r? is added because it is a multiline input demo, you won't need it if you plan to test separate strings against the pattern.
Details
^ - start of string
(.+?) - Group 1: one or more chars as few as possible (due to +? lazy quantifier) (NOTE that if Group 1 value may be missing, use (.*?) instead)
(?:#(\d+))? - an optional non-capturing group matching 1 or 0 occurrences of
# - a # symbol
(\d+) - Group 2: one or more digits
$ - end of string.

Try
(\w+)(?:#(\d+))?
The # is in an optional non-capturing group, and the following digits in a captured group inside the non-capturing group.
https://regex101.com/r/obKPFw/1

An alternative solution, WITHOUT using regular expressions:
public class Program
{
static void Main(string[] args)
{
List<string> inputs = new List<string>
{
"ABC#99999",
"9ABC#8",
"9ABC",
"9ABC#"
};
var groups = new List<Group>();
foreach (string input in inputs)
{
string[] parts = input.Split("#", StringSplitOptions.RemoveEmptyEntries);
var group = new Group
{
Part1 = input
};
if (parts.Length == 2)
{
group.Part1 = parts[0];
group.Part2 = parts[1];
};
groups.Add(group);
Console.WriteLine($"Input: '{input}': {group}");
}
Console.ReadKey();
}
}
public class Group
{
public string Part1 { get; set; }
public string Part2 { get; set; }
/// <inheritdoc />
public override string ToString()
{
return $"Part1: {Part1 ?? "null"}, Part2: {Part2 ?? "[null]"}";
}
}
Output:
Input: 'ABC#99999': Part1: ABC, Part2: 99999
Input: '9ABC#8': Part1: 9ABC, Part2: 8
Input: '9ABC': Part1: 9ABC, Part2: [null]
Input: '9ABC#': Part1: 9ABC#, Part2: [null]

Related

How to match different scenarios with regex in c# and groups

I want to match these different scenarios with a regex pattern. Mainly delimiter is #:
1234-1111-234.011#333 => [id = 1234-1111-234.011 and code =333]
whatever text before 1234-1111-234.011#333 => [textb=whatever text before, id = 1234-1111-234.011 , code =333, texta="]
1234-1111-234.011#333 whatever text after => [ textb="" id = 1234-1111-234.011 ,code =333 , texta=whatever text after]
Text can be both at the beginning or the end
In every case code can contain also a postfix letter W like 1234-1111-234.011#333W => code=333E
textb = text with length maximum 15 characters. Only letters and
numbers.
id = 17 character long with this format XXXX-XXXX-XXX.XXX code - 3
or 4 character long based on W letter is presenting or not
texta = text with length maximum 15 characters. Only letters and
numbers.
I am trying to match these scenarios with this piece of code and groups
pattern ="
?(<textb>[\w\s]{15})#
?(<id>[\d\s]{17,17})#
(?<code>([A-Z]{0,1}\d{2,3}))
(?<wo>[W]{1})
?(<texta>[\w\s]{15})"
and
var textb = Regex.Match(mytext, pattern).Groups["textb"].Value;
var id= Regex.Match(mytext, pattern).Groups["id"].Value;
var code= Regex.Match(mytext, pattern).Groups["code"].Value;
var wo= Regex.Match(mytext, pattern).Groups["wo"].Value;
var texta= Regex.Match(mytext, pattern).Groups["texta"].Value;
A full example is "This is before text 234-1111-234.011#333E This is next text"
Not matching at all.
You could do it with one regular expression and then use Groups to get the parts you need.
void Main()
{
var input = "Before text 1234-1111-234.011#333E After text";
var pattern = #"(?<btext>[\w ]{0,15})(?<id>[\d\-\.]{17})#(?<code>[\d]{2,3})(?<wo>[A-Z]?)(?<atext>[\w ]{0,15})";
var matches = Regex.Match(input, pattern);
var btext = matches.Groups["btext"];
var wo = matches.Groups["wo"];
Console.WriteLine(btext.Value);
Console.WriteLine(wo.Value);
// etc.
}
(?<btext>[\w ]{0,15}) // Match letters, numbers and spaces, minimum 0 chars, maximum 15 chars
(?<id>[\d\-\.]{17}) // match numbers, '-' and '.'. Must be 17 chars
# // Match pound sign
(?<code>[\d]{2,3}) // Match numbers 2 or 3 chars long
(?<wo>[A-Z]?) // Match optional letter after code
(?<atext>[\w ]{0,15}) // Match letters, numbers and spaces, minimum 0 chars, maximum 15 chars

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+", #"$1");
dosage_value = Regex.Replace(dosage_value, #"(\d)%\s+", #"$1%");
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d+)?)", #"$1 ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+%", #"$1% ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+:", #"$1:");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+e", #"$1e");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+E", #"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use is in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match a as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = #"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
As in .NET \d can also match digits from other languages, \s can also match a newline and the start of the pattern might be a partial match, a bit more precise match can be:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", #"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 2 - ((E|e|%|:)+)
Any of special symbols like E e % :
Group 1 and Group 2 could be separated with any number of whitespaces.
If it's not working as you asking, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one ore more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one ore more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, #"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, #"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, #"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, #"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

Need Regex expression to allow only either numbers or letters separated by comma and it should not allow alpha numeric

Need Regex expression to allow only either numbers or letters separated by comma and it should not allow alpha numeric combinations (like "abc123").
Some examples:
Valid:
123,abc
abc,123
123,123
abc,abc
Invalid:
abc,abc123
abc133,abc
abc123,abc123
Since valid and invalid are changed, I've rewritten my answer from scratch.
The suggested pattern is
^(([0-9]+)|([a-zA-Z]+))(,(([0-9]+)|([a-zA-Z]+)))*$
Demo:
string[] tests = new string[] {
"123,abc",
"abc,123",
"123,123",
"abc,abc",
"abc,abc123",
"abc133,abc",
"abc123,abc123",
// More tests
"123abc", // invalid (digits first, then letters)
"123", // valid (one item)
"a,b,c,1,2,3", // valid (more than two items)
"1e4", // invalid (floating point number)
"1,,2", // invalid (empty part)
"-3", // invalid (minus sign)
"۱۲۳", // invalid (Persian digits)
"число" // invalid (Russian letters)
};
string pattern = #"^(([0-9]+)|([a-zA-Z]+))(,(([0-9]+)|([a-zA-Z]+)))*$";
var report = string.Join(Environment.NewLine, tests
.Select(item => $"{item,-20} : {(Regex.IsMatch(item, pattern) ? "valid" : "invalid")}"));
Console.WriteLine(report);
Outcome:
123,abc : valid
abc,123 : valid
123,123 : valid
abc,abc : valid
abc,abc123 : invalid
abc133,abc : invalid
abc123,abc123 : invalid
123abc : invalid
123 : valid
a,b,c,1,2,3 : valid
1e4 : invalid
1,,2 : invalid
-3 : invalid
۱۲۳ : invalid
число : invalid
Pattern's explanation:
^ - string beginning (anchor)
([0-9]+)|([a-zA-Z]+) - either group of digits (1+) or group of letters
(,(([0-9]+)|([a-zA-Z]+))) - fllowed by zero or more such groups
$ - string ending (anchor)
If you specify Dmitry Bychenkos regex with RegexOptions.IgnoreCase you can shrink it down to Regex.IsMatch (test, #"^[0-9a-z](,[0-9a-z])*$",RegexOptions.IgnoreCase)
Alternate way to check it w/o regex (performs worse):
using System;
using System.Linq;
public class Program1
{
public static void Main()
{
var mydata = new[] {"1,3,4,5,1,3,a,s,r,3", "2, 4 , a", " 2,3,as"};
// function that checks it- perfoms not as good as reges as internal stringarray
// is build and analyzed
Func<string,bool> isValid =
data => data.Split(new[]{','}, StringSplitOptions.RemoveEmptyEntries)
.Select(s => s.Trim())
.All(aChar => aChar.Length == 1 && char.IsLetterOrDigit(aChar[0]));
foreach (var d in mydata)
{
Console.WriteLine(string.Format("{0} => is {1}",d, isValid(d) ? "Valid" : "Invalid"));
}
}
}
Output:
1,3,4,5,1,3,a,s,r,3 => is Valid
2, 4 , a => is Valid
2,3,as => is Invalid
To match words separated by commas, where the words consist either of digits or of letters:
^(\d+|[a-zA-Z]+)(,(\d+|[a-zA-Z]+))*$
Explanation
\d+ matches a string of at least one digit.
[a-zA-Z] matches a string of at least one upper or lower case letter.
(\d+|[a-zA-Z]+) matches either a string of digits or a string of letters.
C#
Regex regex = new Regex(#"^(\d+|[a-zA-Z]+)(,(\d+|[a-zA-Z]+))*$");

Regex match 2 out of 4 groups

I want a single Regex expression to match 2 groups of lowercase, uppercase, numbers or special characters. Length needs to also be grater than 7.
I currently have this expression
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$
It, however, forces the string to have lowercase and uppercase and digit or special character.
I currently have this implemented using 4 different regex expressions that I interrogate with some C# code.
I plan to reuse the same expression in JavaScript.
This is sample console app that shows the difference between 2 approaches.
class Program
{
private static readonly Regex[] Regexs = new[] {
new Regex("[a-z]", RegexOptions.Compiled), //Lowercase Letter
new Regex("[A-Z]", RegexOptions.Compiled), // Uppercase Letter
new Regex(#"\d", RegexOptions.Compiled), // Numeric
new Regex(#"[^a-zA-Z\d\s:]", RegexOptions.Compiled) // Non AlphaNumeric
};
static void Main(string[] args)
{
Regex expression = new Regex(#"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$", RegexOptions.ECMAScript & RegexOptions.Compiled);
string[] testCases = new[] { "P#ssword", "Password", "P2ssword", "xpo123", "xpo123!", "xpo123!123##", "Myxpo123!123##", "Something_Really_Complex123!#43#2*333" };
Console.WriteLine("{0}\t{1}\t", "Single", "C# Hack");
Console.WriteLine("");
foreach (var testCase in testCases)
{
Console.WriteLine("{0}\t{2}\t : {1}", expression.IsMatch(testCase), testCase,
(testCase.Length >= 8 && Regexs.Count(x => x.IsMatch(testCase)) >= 2));
}
Console.ReadKey();
}
}
Result Proper Test String
------- ------- ------------
True True : P#ssword
False True : Password
True True : P2ssword
False False : xpo123
False False : xpo123!
False True : xpo123!123##
True True : Myxpo123!123##
True True : Something_Really_Complex123!#43#2*333
For javascript you can use this pattern that looks for boundaries between different character classes:
^(?=.*(?:.\b.|(?i)(?:[a-z]\d|\d[a-z])|[a-z][A-Z]|[A-Z][a-z]))[^:\s]{8,}$
if a boundary is found, you are sure to have two different classes.
pattern details:
\b # is a zero width assertion, it's a boundary between a member of
# the \w class and an other character that is not from this class.
.\b. # represents the two characters with the word boundary.
boundary between a letter and a number:
(?i) # make the subpattern case insensitive
(?:
[a-z]\d # a letter and a digit
| # OR
\d[a-z] # a digit and a letter
)
boundary between an uppercase and a lowercase letter:
[a-z][A-Z] | [A-Z][a-z]
since all alternations contains at least two characters from two different character classes, you are sure to obtain the result you hope.
You could use possessive quantifiers (emulated using atomic groups), something like this:
((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}
Since using possessive matching will prevent backtracking, you won't run into the two groups being two consecutive groups of lowercase letters, for instance. So the full regex would be something like:
^(?=.*((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}).{8,}$
Though, were it me, I'd cut the lookahead, just use the expression ((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}, and check the length separately.

What's the difference between "groups" and "captures" in .NET regular expressions?

I'm a little fuzzy on what the difference between a "group" and a "capture" are when it comes to .NET's regular expression language. Consider the following C# code:
MatchCollection matches = Regex.Matches("{Q}", #"^\{([A-Z])\}$");
I expect this to result in a single capture for the letter 'Q', but if I print the properties of the returned MatchCollection, I see:
matches.Count: 1
matches[0].Value: {Q}
matches[0].Captures.Count: 1
matches[0].Captures[0].Value: {Q}
matches[0].Groups.Count: 2
matches[0].Groups[0].Value: {Q}
matches[0].Groups[0].Captures.Count: 1
matches[0].Groups[0].Captures[0].Value: {Q}
matches[0].Groups[1].Value: Q
matches[0].Groups[1].Captures.Count: 1
matches[0].Groups[1].Captures[0].Value: Q
What exactly is going on here? I understand that there's also a capture for the entire match, but how do the groups come in? And why doesn't matches[0].Captures include the capture for the letter 'Q'?
You won't be the first who's fuzzy about it. Here's what the famous Jeffrey Friedl has to say about it (pages 437+):
Depending on your view, it either adds
an interesting new dimension to the
match results, or adds confusion and
bloat.
And further on:
The main difference between a Group
object and a Capture object is that
each Group object contains a
collection of Captures representing
all the intermediary matches by the
group during the match, as well as the
final text matched by the group.
And a few pages later, this is his conclusion:
After getting past the .NET
documentation and actually
understanding what these objects add,
I've got mixed feelings about them. On
one hand, it's an interesting
innovation [..] on the other hand, it
seems to add an efficiency burden [..]
of a functionality that won't be used
in the majority of cases
In other words: they are very similar, but occasionally and as it happens, you'll find a use for them. Before you grow another grey beard, you may even get fond of the Captures...
Since neither the above, nor what's said in the other post really seems to answer your question, consider the following. Think of Captures as a kind of history tracker. When the regex makes his match, it goes through the string from left to right (ignoring backtracking for a moment) and when it encounters a matching capturing parentheses, it will store that in $x (x being any digit), let's say $1.
Normal regex engines, when the capturing parentheses are to be repeated, will throw away the current $1 and will replace it with the new value. Not .NET, which will keep this history and places it in Captures[0].
If we change your regex to look as follows:
MatchCollection matches = Regex.Matches("{Q}{R}{S}", #"(\{[A-Z]\})+");
you will notice that the first Group will have one Captures (the first group always being the whole match, i.e., equal to $0) and the second group will hold {S}, i.e. only the last matching group. However, and here's the catch, if you want to find the other two catches, they're in Captures, which contains all intermediary captures for {Q} {R} and {S}.
If you ever wondered how you could get from the multiple-capture, which only shows last match to the individual captures that are clearly there in the string, you must use Captures.
A final word on your final question: the total match always has one total Capture, don't mix that with the individual Groups. Captures are only interesting inside groups.
This can be explained with a simple example (and pictures).
Matching 3:10pm with the regular expression ((\d)+):((\d)+)(am|pm), and using Mono interactive csharp:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Zip(Enumerable.Range(0, int.MaxValue), (g, n) => "[" + n + "] " + g);
{ "[0] 3:10pm", "[1] 3", "[2] 3", "[3] 10", "[4] 0", "[5] pm" }
So where's the 1?
Since there are multiple digits that match on the fourth group, we only "get at" the last match if we reference the group (with an implicit ToString(), that is). In order to expose the intermediate matches, we need to go deeper and reference the Captures property on the group in question:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Skip(4).First().Captures.Cast<Capture>().
> Zip(Enumerable.Range(0, int.MaxValue), (c, n) => "["+n+"] " + c);
{ "[0] 1", "[1] 0" }
Courtesy of this article.
A Group is what we have associated with groups in regular expressions
"(a[zx](b?))"
Applied to "axb" returns an array of 3 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
group 2: b, the second group matched.
except that these are only 'captured' groups. Non capturing groups (using the '(?: ' syntax are not represented here.
"(a[zx](?:b?))"
Applied to "axb" returns an array of 2 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
A Capture is also what we have associated with 'captured groups'. But when the group is applied with a quantifier multiple times, only the last match is kept as the group's match. The captures array stores all of these matches.
"(a[zx]\s+)+"
Applied to "ax az ax" returns an array of 2 captures of the second group.
group 1, capture 0 "ax "
group 1, capture 1 "az "
As for your last question -- I would have thought before looking into this that Captures would be an array of the captures ordered by the group they belong to. Rather it is just an alias to the groups[0].Captures. Pretty useless..
From the MSDN documentation:
The real utility of the Captures property occurs when a quantifier is applied to a capturing group so that the group captures multiple substrings in a single regular expression. In this case, the Group object contains information about the last captured substring, whereas the Captures property contains information about all the substrings captured by the group. In the following example, the regular expression \b(\w+\s*)+. matches an entire sentence that ends in a period. The group (\w+\s*)+ captures the individual words in the collection. Because the Group collection contains information only about the last captured substring, it captures the last word in the sentence, "sentence". However, each word captured by the group is available from the collection returned by the Captures property.
Imagine you have the following text input dogcatcatcat and a pattern like dog(cat(catcat))
In this case, you have 3 groups, the first one (major group) corresponds to the match.
Match == dogcatcatcat and Group0 == dogcatcatcat
Group1 == catcatcat
Group2 == catcat
So what it's all about?
Let's consider a little example written in C# (.NET) using Regex class.
int matchIndex = 0;
int groupIndex = 0;
int captureIndex = 0;
foreach (Match match in Regex.Matches(
"dogcatabcdefghidogcatkjlmnopqr", // input
#"(dog(cat(...)(...)(...)))") // pattern
)
{
Console.Out.WriteLine($"match{matchIndex++} = {match}");
foreach (Group #group in match.Groups)
{
Console.Out.WriteLine($"\tgroup{groupIndex++} = {#group}");
foreach (Capture capture in #group.Captures)
{
Console.Out.WriteLine($"\t\tcapture{captureIndex++} = {capture}");
}
captureIndex = 0;
}
groupIndex = 0;
Console.Out.WriteLine();
}
Output:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = abc
capture0 = abc
group4 = def
capture0 = def
group5 = ghi
capture0 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Let's analyze just the first match (match0).
As you can see there are three minor groups: group3, group4 and group5
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Those groups (3-5) were created because of the 'subpattern' (...)(...)(...) of the main pattern (dog(cat(...)(...)(...)))
Value of group3 corresponds to it's capture (capture0). (As in the case of group4 and group5). That's because there are no group repetition like (...){3}.
Ok, let's consider another example where there is a group repetition.
If we modify the regular expression pattern to be matched (for code shown above)
from (dog(cat(...)(...)(...))) to (dog(cat(...){3})),
you'll notice that there is the following group repetition: (...){3}.
Now the Output has changed:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = ghi
capture0 = abc
capture1 = def
capture2 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = pqr
capture0 = kjl
capture1 = mno
capture2 = pqr
Again, let's analyze just the first match (match0).
There are no more minor groups group4 and group5 because of (...){3} repetition ({n} wherein n>=2)
they've been merged into one single group group3.
In this case, the group3 value corresponds to it's capture2 (the last capture, in other words).
Thus if you need all the 3 inner captures (capture0, capture1, capture2) you'll have to cycle through the group's Captures collection.
Сonclusion is: pay attention to the way you design your pattern's groups.
You should think upfront what behavior causes group's specification, like (...)(...), (...){2} or (.{3}){2} etc.
Hopefully it will help shed some light on the differences between Captures, Groups and Matches as well.

Categories

Resources