Why does this loop through Regex groups print the output twice? - c#

I have written this very straightforward regex code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace RegexTest1
{
class Program
{
static void Main(string[] args)
{
string a = "\"foobar123==\"";
Regex r = new Regex("^\"(.*)\"$");
Match m = r.Match(a);
if (m.Success)
{
foreach (Group g in m.Groups)
{
Console.WriteLine(g.Index);
Console.WriteLine(g.Value);
}
}
}
}
}
However the output is
0
"foobar123=="
1
foobar123==
I don't understand why it prints twice. Why should there be a capture at index 0, when my regex starts with ^\" and I am not capturing that part?
Sorry if this is very basic, but I don't write regexes on a daily basis.
As I understand it, this code should print only once, with index 1 and value foobar123==

This happens because group zero is special: it returns the entire match.
From the Regex documentation (emphasis added):
A simple regular expression pattern illustrates how numbered (unnamed) and named groups can be referenced either programmatically or by using regular expression language syntax. The regular expression ((?<One>abc)\d+)?(?<Two>xyz)(.*) produces the following capturing groups by number and by name. The first capturing group (number 0) always refers to the entire pattern.
# Name Group
- ---------------- --------------------------------
0 0 (default name) ((?<One>abc)\d+)?(?<Two>xyz)(.*)
1 1 (default name) ((?<One>abc)\d+)
2 2 (default name) (.*)
3 One (?<One>abc)
4 Two (?<Two>xyz)
If you do not want to see it, start the output from the first group.
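As a minimal sketch of "start the output from the first group", the loop below begins at index 1 instead of 0. The class and method names (GroupDemo, DescribeGroups) are made up for illustration. It also separates two numbers that are easy to confuse in the question's output: the group number (the loop index) and Group.Index, which is the position of the captured text in the input string. In the question both happen to be 0 and 1.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns one line per explicit capturing group, skipping group 0 (the whole match).
    public static List<string> DescribeGroups(string pattern, string input)
    {
        var lines = new List<string>();
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return lines;

        // Start at 1: Groups[0] is always the entire match.
        for (int i = 1; i < m.Groups.Count; i++)
        {
            Group g = m.Groups[i];
            // i is the group number; g.Index is where the capture starts in the input.
            lines.Add($"group {i} at position {g.Index}: {g.Value}");
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeGroups("^\"(.*)\"$", "\"foobar123==\""))
            Console.WriteLine(line); // group 1 at position 1: foobar123==
    }
}
```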

A regex captures several groups at once. Group 0 is the entire matched region (including the quotes). Group 1 is the group defined by the parentheses.
Say your regex has the following form:
A(B(C)D)E.
where A, B, C, D and E are regex expressions.
Then the following groups will be matched:
0 A(B(C)D)E
1 B(C)D
2 C
The i-th group starts at the i-th opening parenthesis. You can think of the "zero-th" opening parenthesis as implicitly placed at the beginning of the regex (with its closing parenthesis at the end of the regex).
If you want to omit group 0, you can use LINQ's Skip method:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace RegexTest1 {
class Program {
static void Main(string[] args) {
string a = "\"foobar123==\"";
Regex r = new Regex("^\"(.*)\"$");
Match m = r.Match(a);
if (m.Success) {
// Cast<Group>() makes the GroupCollection usable with LINQ; Skip(1) drops group 0.
foreach (Group g in m.Groups.Cast<Group>().Skip(1)) {
Console.WriteLine(g.Index);
Console.WriteLine(g.Value);
}
}
}
}
}

0
"foobar123==" -- Matched string.
The entire match of the pattern is stored in group 0, and here it starts at position 0 of the input.
1
foobar123== -- Captured string.
Group 1 contains the characters captured by the first capturing group, which start at position 1.

Using @dasblinkenlight's regex as an example...
This is not the whole story with .NET capture group counting.
When named groups are added, the default is to count them, and to count them last.
Both behaviors can optionally be changed.
Of course group 0 always contains the entire match. Group counting really starts at 1
because you can't specify a back reference (in the regex) to group 0: it conflicts
with the \0 octal escape construct.
Here is the counting with named and normal groups in the .NET default state.
( # (1 start)
(?<One> abc ) #_(3)
\d+
)? # (1 end)
(?<Two> xyz ) #_(4)
( .* ) # (2)
Here it is with names last turned OFF.
( # (1 start)
(?<One> abc ) # (2)
\d+
)? # (1 end)
(?<Two> xyz ) # (3)
( .* ) # (4)
Here it is with named counting turned OFF.
( # (1 start)
(?<One> abc )
\d+
)? # (1 end)
(?<Two> xyz )
( .* ) # (2)
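The default numbering can be checked from code. This is a small sketch that only covers the default state (it does not try to reproduce the two alternative modes shown above); it uses Regex.GetGroupNames and Regex.GroupNumberFromName to list each group with its number. The class name GroupNumberingDemo is an assumption for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNumberingDemo
{
    // The pattern quoted from the documentation: two unnamed and two named groups.
    public const string Pattern = @"((?<One>abc)\d+)?(?<Two>xyz)(.*)";

    public static void Main()
    {
        var r = new Regex(Pattern);

        // Unnamed groups are numbered first (left to right), named groups after them.
        foreach (string name in r.GetGroupNames())
            Console.WriteLine($"{r.GroupNumberFromName(name)}: {name}");
    }
}
```

With the default options, the named groups One and Two should be reported as numbers 3 and 4, after the two unnamed groups.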

You can return only one by removing the group 1 using ?:
Regex r = new Regex("^\"(?:.*)\"$");
Every time you use () you create a group, which you can reference later with $1, $2, $3 in a replacement string or with \1, \2, \3 inside the pattern. In the case of your expression, a simpler version is:
Regex r = new Regex("^\".*\"$");
which does not use parentheses at all.
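To see the effect of the three variants side by side, this sketch counts the groups each pattern exposes for the question's input. The helper name GroupCount is hypothetical; Groups.Count always includes the implicit group 0.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Number of groups a match exposes, including the implicit group 0.
    public static int GroupCount(string pattern, string input) =>
        Regex.Match(input, pattern).Groups.Count;

    public static void Main()
    {
        string a = "\"foobar123==\"";
        Console.WriteLine(GroupCount("^\"(.*)\"$", a));   // 2: group 0 and group 1
        Console.WriteLine(GroupCount("^\"(?:.*)\"$", a)); // 1: only group 0
        Console.WriteLine(GroupCount("^\".*\"$", a));     // 1: only group 0
    }
}
```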

Related

Separating numbers, punctuation and letters throughout the whole case

What I'm trying to achieve:
Split the string into separate parts of numbers, punctuation(except the . and , these should be found in the number part), and letters.
Example:
Case 1:
C_TANTALB
Result:
Alpha[]: C, TANTALB
Beta[]:
Zeta[]: _
Case 2:
BGA-100_T0.8
Result:
Alpha[]: BGA, T
Beta[]: 100, 0.8
Zeta[]: -, _
Case 3: C0201
Result:
Alpha[]: C
Beta[]: 0201
Zeta[]:
I've found this post, but it doesn't do the entire job for me: it fails on example 1, not even returning the alpha part, and it doesn't find the punctuation.
Any help would be appreciated.
If iterating over the string and testing with IsDigit and IsLetter is a bit too complex,
you can use a regex for this: (?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)
1/. Named Capture Group Alfas (?<Alfas>[a-zA-Z]+)
Match a single character present in the list below [a-zA-Z]+
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2/. Named Capture Group Digits (?<Digits>[\d,.]+)
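\d matches a digit (equal to [0-9])
,. matches a single character in the list ,. (so decimal and thousands separators stay with the number)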
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
3/. Named Capture Group Others (?<Others>[^a-zA-Z\d]+)
Match a single character not present in the list below [^a-zA-Z\d]
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Then, to get the values of one group:
var matches = Regex.Matches(testInput, pattern).Cast<Match>();
var alfas = matches.Where(x => !string.IsNullOrEmpty(x.Groups["Alfas"].Value))
.Select(x=> x.Value)
.ToList();
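A possible way to collect all three lists with one pattern is sketched below, assuming the three named groups described above (with . and , kept in the Digits group). The class name Splitter and the helper Values are hypothetical. Testing Group.Success is used instead of checking for an empty Value; both work here because every alternative matches at least one character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Splitter
{
    // Letters, then digits with '.' and ',', then runs of anything else.
    const string Pattern =
        @"(?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)";

    // Values of every match in which the named group took part.
    public static List<string> Values(string input, string groupName) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Where(m => m.Groups[groupName].Success)
             .Select(m => m.Groups[groupName].Value)
             .ToList();

    public static void Main()
    {
        foreach (string input in new[] { "C_TANTALB", "BGA-100_T0.8", "C0201" })
        {
            Console.WriteLine(input);
            Console.WriteLine("  Alpha: " + string.Join(", ", Values(input, "Alfas")));
            Console.WriteLine("  Beta:  " + string.Join(", ", Values(input, "Digits")));
            Console.WriteLine("  Zeta:  " + string.Join(", ", Values(input, "Others")));
        }
    }
}
```

For BGA-100_T0.8 this gives BGA, T for the letters, 100, 0.8 for the numbers and -, _ for the rest. One caveat of this sketch: a lone . or , that is not part of a number also lands in the Digits list.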
Probably the simplest way to do this is with 3 separate regular expressions; one for each class of characters.
[A-Za-z]+ for letter sequences
[\d.,]+ for numbers
[-_]+ for punctuation (incomplete for now; please feel free to extend the list)
Example:
using System;
using System.Linq;
using System.Text.RegularExpressions;
class MainClass
{
private static readonly Regex _regexAlpha = new Regex(@"[A-Za-z]+");
private static readonly Regex _regexBeta = new Regex(@"[\d.,]+");
private static readonly Regex _regexZeta = new Regex(@"[-_]+");
public static void Main (string[] args)
{
Console.Write("Input: ");
string input = Console.ReadLine();
var resultAlpha = _regexAlpha.Matches(input).Select(m => m.Value);
var resultBeta = _regexBeta.Matches(input).Select(m => m.Value);
var resultZeta = _regexZeta.Matches(input).Select(m => m.Value);
Console.WriteLine($"Alpha: {string.Join(", ", resultAlpha)}");
Console.WriteLine($"Beta: {string.Join(", ", resultBeta)}");
Console.WriteLine($"Zeta: {string.Join(", ", resultZeta)}");
}
}
Sample output:
Input: ABC_3.14m--nop
Alpha: ABC, m, nop
Beta: 3.14
Zeta: _, --
Live demo: https://repl.it/repls/LopsidedUsefulBucket

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
Console.WriteLine("-------------------------");
for (int i = 0; i < ccOL1.Count; i++)
Console.WriteLine(" '{0}'", ccOL1[i].Value );
_mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
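The key mechanism here is that a quantified capturing group keeps every value it captured in Group.Captures, not only the last one. The sketch below isolates that idea with a deliberately simpler pattern than the one above: it assumes the tag list is already a single line (no leading # record marker), and the names TagCaptures and Parse are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagCaptures
{
    // Each pass of the outer (?: ... )+ adds one capture to group 1.
    // The \s* on both sides of the lazy group trims the field.
    const string Pattern = @"^(?:\s*([^,]+?)\s*(?:,|$))+$";

    public static List<string> Parse(string line)
    {
        var tags = new List<string>();
        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return tags;

        // Group 1 matched several times; Captures holds every one of them in order.
        foreach (Capture c in m.Groups[1].Captures)
            tags.Add(c.Value);
        return tags;
    }

    public static void Main()
    {
        foreach (string tag in Parse("tag1, tag with space,   padded   "))
            Console.WriteLine("'" + tag + "'");
    }
}
```

Because the inner group requires at least one character, this variant does not produce the empty trailing capture seen in the output above.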
Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

Reg ex start with fixed string and end with variable number

I need to match strings that start with the fixed string "These masters were created using" and end with a variable [space][name][space][char][number]:
These masters were created using Kevin O014
These masters were created using Jhon A039
These masters were created using Geeth P034
These masters were created using Jemes M077
These masters were created using Anne H058
These masters were created using JANE S345
Any idea?
I tried this: ^(These masters were created using).\s.[a-zA-Z].\s.[a-zA-Z].[0-9]{3}.$ but it looks Greek to me.
I would use this:
These masters were created using [a-zA-Z]+ [a-zA-Z]\d+
[a-zA-Z]+ to match a name (assuming simple names: no hyphens, no accented characters)
[a-zA-Z]\d+ to match a letter followed by one or more digits. You might change it to [a-zA-Z]\d{3} if you need exactly 3 digits
string input = @" </span>These masters were created using Smith J054<br>";
Regex regex = new Regex(@"These masters were created using [a-zA-Z]+ ([a-zA-Z]\d+)");
foreach (Match match in regex.Matches(input))
{
Console.Out.WriteLine("Found a match : " + match);
if(match.Groups.Count >= 2)
Console.Out.WriteLine("Extract " + match.Groups[1].Value);
}
Output:
Found a match : These masters were created using Smith J054
Extract J054
^These masters were created using \S* [A-Za-z0-9]*$
I made sure it matches in multiline mode on an online regex tester.
To not only validate, but also to get the names and codes at the end, you can use
\bThese masters were created using (?<Name>[A-Za-z]+)\s+(?<Code>[A-Za-z][0-9]{3})\b
where (?<Name>[A-Za-z]+) matches any 1+ ASCII letters and captures it into Group with name "Name", \s+ matches one or more whitespaces, (?<Code>[A-Za-z][0-9]{3}) matches and captures into Group "Code" a letter followed with 3 digits exactly (use + instead of {3} to match 1 or more digits).
Note that \b are word boundaries that help match the strings inside larger strings as whole words.
In C#:
var pattern = @"\bThese masters were created using (?<Name>\p{L}+)\s+(?<Code>\p{L}\d{3})\b";
Sample C# demo:
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var s = "aaaaaa These masters were created using Kevin O014\nThese masters were created using Jhon A039\nThese masters were created using Geeth P034\nThese masters were created using Jemes M077\nThese masters were created using Anne H058\nThese masters were created using JANE S345 aaaaa";
var matches = Regex.Matches(s, @"\bThese masters were created using (?<Name>[A-Za-z]+)\s+(?<Code>[A-Za-z][0-9]{3})\b")
.Cast<Match>()
.Select(m=> string.Format("{0}: {1}", m.Groups["Name"].Value, m.Groups["Code"].Value))
.ToList();
Console.WriteLine(string.Join("\n", matches));
}
}

Regex to not match a certain sequence

I have a text file as below:
1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..
So I want a regex that gets 4 matches in this case, one for each point. My regex doesn't work as I wish. Please advise:
private readonly Regex _reactionRegex = new Regex(@"(\d+)\.(\d+)\s*-\s*(.+)", RegexOptions.Compiled | RegexOptions.Singleline);
even this regex isn't very helpful:
(\d+)\.(\d+)\s*-\s*(.+)(?<!\d+\.\d+)
Alex, this regex will do it:
(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)
This assumes that you want to capture the text of each point without the numbers, for instance just Hello.
If you also want to capture the digits, for instance 1.1 - Hello, you can use the same regex and display the entire match, not just Group 1. The code below prints both.
How does it work?
The idea is to capture the text you want into Group 1 using (parentheses).
We match in multi-line mode m to allow the anchor ^ to match at the start of each line.
We match in dotall mode s to allow the dot to consume text across multiple lines.
We use a negative lookahead (?! to stop consuming characters when what follows is the beginning of a line with your digit marker.
Here is the full working code.
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program {
static void Main() {
string yourstring = @"1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..";
var resultList = new StringCollection();
try {
var yourRegex = new Regex(@"(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)");
Match matchResult = yourRegex.Match(yourstring);
while (matchResult.Success) {
resultList.Add(matchResult.Groups[1].Value);
Console.WriteLine("Whole Match: " + matchResult.Value);
Console.WriteLine("Group 1: " + matchResult.Groups[1].Value + "\n");
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
This may do for what you're looking for, though there is some ambiguity of the expected result.
(\d+)\.(\d+)\s*-\s*(.+?)(\n)(?>\d|$)
The ambiguity is for example what would you expect to match if data looked like:
1.1 - Hello
1.2 - world!
2.1 - Some
data here and it contains some
32 digits so i cannot use \D+
2.2 - Etc..
Not clear if 32 here starts a new record or not.

Regex match 2 out of 4 groups

I want a single regex that matches strings containing at least 2 of these 4 groups: lowercase letters, uppercase letters, numbers, or special characters. The length also needs to be greater than 7.
I currently have this expression
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$
It, however, forces the string to have lowercase and uppercase and digit or special character.
I currently have this implemented using 4 different regex expressions that I interrogate with some C# code.
I plan to reuse the same expression in JavaScript.
This is a sample console app that shows the difference between the 2 approaches.
using System;
using System.Linq;
using System.Text.RegularExpressions;
class Program
{
private static readonly Regex[] Regexs = new[] {
new Regex("[a-z]", RegexOptions.Compiled), //Lowercase Letter
new Regex("[A-Z]", RegexOptions.Compiled), // Uppercase Letter
new Regex(@"\d", RegexOptions.Compiled), // Numeric
new Regex(@"[^a-zA-Z\d\s:]", RegexOptions.Compiled) // Non AlphaNumeric
};
static void Main(string[] args)
{
// Options are combined with | (bitwise OR); combining them with & would yield RegexOptions.None.
Regex expression = new Regex(@"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$", RegexOptions.ECMAScript | RegexOptions.Compiled);
string[] testCases = new[] { "P#ssword", "Password", "P2ssword", "xpo123", "xpo123!", "xpo123!123##", "Myxpo123!123##", "Something_Really_Complex123!#43#2*333" };
Console.WriteLine("{0}\t{1}\t", "Single", "C# Hack");
Console.WriteLine("");
foreach (var testCase in testCases)
{
Console.WriteLine("{0}\t{2}\t : {1}", expression.IsMatch(testCase), testCase,
(testCase.Length >= 8 && Regexs.Count(x => x.IsMatch(testCase)) >= 2));
}
Console.ReadKey();
}
}
Result Proper Test String
------- ------- ------------
True True : P#ssword
False True : Password
True True : P2ssword
False False : xpo123
False False : xpo123!
False True : xpo123!123##
True True : Myxpo123!123##
True True : Something_Really_Complex123!#43#2*333
For JavaScript you can use this pattern, which looks for a boundary between two different character classes:
^(?=.*(?:.\b.|[a-zA-Z]\d|\d[a-zA-Z]|[a-z][A-Z]|[A-Z][a-z]))[^:\s]{8,}$
If a boundary is found, you are sure to have two different classes.
Pattern details:
\b # a zero-width assertion: the boundary between a member of
# the \w class and a character that is not in that class.
.\b. # the two characters on either side of the word boundary.
Boundary between a letter and a digit (both letter cases are listed explicitly, because a bare inline (?i) is not accepted in a JavaScript regex and would also leak into the following alternatives):
[a-zA-Z]\d # a letter and a digit
| # OR
\d[a-zA-Z] # a digit and a letter
Boundary between an uppercase and a lowercase letter:
[a-z][A-Z] | [A-Z][a-z]
Since every alternative contains two adjacent characters from two different character classes, a match guarantees the result you want.
You could use possessive quantifiers (emulated using atomic groups), something like this:
((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}
Since using possessive matching will prevent backtracking, you won't run into the two groups being two consecutive groups of lowercase letters, for instance. So the full regex would be something like:
^(?=.*((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}).{8,}$
Though, were it me, I'd cut the lookahead, just use the expression ((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}, and check the length separately.
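A sketch of that last suggestion, with the length checked in code and the atomic-group expression used on its own. The names PasswordCheck and IsAcceptable are hypothetical. Note that this expression distinguishes only three classes (lowercase, uppercase, everything else), so digits and special characters count as one class here, which is stricter than the four-class check in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // Two or more consecutive runs; each run is atomic, so one class cannot be
    // split into two shorter runs to satisfy the {2,} quantifier.
    static readonly Regex TwoRuns =
        new Regex(@"(?:(?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}");

    // Length is checked separately from the character-class rule.
    public static bool IsAcceptable(string s) =>
        s.Length >= 8 && TwoRuns.IsMatch(s);

    public static void Main()
    {
        foreach (string s in new[] { "Password", "password", "xpo123", "xpo123!123##" })
            Console.WriteLine($"{IsAcceptable(s)}\t{s}");
    }
}
```

Here "Password" and "xpo123!123##" are accepted, while "password" (one class) and "xpo123" (too short) are rejected.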
