RegEx: Add Quantifier to Capturing Group for all results - c#

I have the following strings that are valid...
" 1"
" 12"
" 123"
"1234"
" 123"
" 12A"
""
The following string are NOT valid...
" 1234"
" 1234"
"0 12"
"0012"
Currently I use the following regex match to check if the string is valid...
"(|[0-9A-Z\-]{4}| {1}[0-9A-Z\-]{3}| {2}[0-9A-Z\-]{2}| {3}[0-9A-Z\-]{1})"
Note: To be clear, the above regex will NOT meet my requirements, that's why I'm asking this question.
I was hoping there was a simpler match I could use, something like the following...
"(| {0,3}[0-9A-Z\-]{1,4})"
The only problem I have is that the above will also match things like " 1234", which is not acceptable. Is there a way for me to limit the capture group I have to only 4 characters?

If the match can not start with a zero, you could add a negative lookahead as Wiktor previously commented:
"(?="|.{4}")(?! *0)[0-9A-Z -]*"
Explanation
" Match literally
(?="|.{4}") If what is directly on the right is either " or 4 chars followed by "
(?! *0) If what is directly on the right is not 0+ spaces followed by a zero
[0-9A-Z -]* Match 0+ times what is listed in the character class
" Match literally
Regex demo
If the spaces can only occur at the beginning you could use:
"(?="|.{4}")(?! *0) *[0-9A-Z-]+"
Regex demo
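As a quick check, the first pattern above can be exercised from C#. This is a minimal sketch, not part of the original answer: the ^ and $ anchors and the sample inputs (written with their surrounding double quotes, with leading spaces padding each value to four characters) are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer, wrapped in ^...$ (an assumption) so the whole input must match.
// Inside a verbatim string, a literal double quote is written as "".
var quoted = new Regex(@"^""(?=""|.{4}"")(?! *0)[0-9A-Z -]*""$");

// Hypothetical sample inputs, including the surrounding double quotes.
string[] samples =
{
    "\"   1\"", "\"  12\"", "\" 12A\"", "\"1234\"", "\"\"", // expected to be valid
    "\" 1234\"", "\"0 12\"", "\"0012\""                     // expected to be invalid
};

foreach (var s in samples)
{
    var verdict = quoted.IsMatch(s) ? "valid" : "invalid";
    Console.WriteLine("{0,-8} -> {1}", s, verdict);
}
```

The lookahead (?=""|.{4}"") is what enforces the length: the content is either empty or exactly four characters, so no quantifier on the character class has to carry that rule.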

This would pass all your test cases:
"(|[1-9\s][0-9A-Z\s]{2}[0-9A-Z])"
Though I suspect there are cases you might not have mentioned.
Explanation: match either 0 or 4 characters between double quotes. First character may be a space or digit but not a zero. Next two characters are any digit or capital letter or space. Fourth character is a digit or capital but not a space.

To make it a bit more efficient:
"(?:[A-Z\d-]{4}|[ ](?:[A-Z\d-]{3}|[ ](?:[A-Z\d-]|[ ])[A-Z\d-]))"
https://regex101.com/r/1fr9tb/1
"
(?:
[A-Z\d-]{4}
| [ ]
(?:
[A-Z\d-]{3}
| [ ]
(?: [A-Z\d-] | [ ] )
[A-Z\d-]
)
)
"
Benchmarks
Regex1: "(?:[A-Z\d-]{4}|[ ](?:[A-Z\d-]{3}|[ ](?:[A-Z\d-]|[ ])[A-Z\d-]))"
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 7
Elapsed Time: 0.66 s, 663.84 ms, 663843 µs
Matches per sec: 527,233
Regex2: "(|[0-9A-Z\-]{4}|[ ]{1}[0-9A-Z\-]{3}|[ ]{2}[0-9A-Z\-]{2}|[ ]{3}[0-9A-Z\-]{1})"
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 7
Elapsed Time: 0.94 s, 938.44 ms, 938438 µs
Matches per sec: 372,960
Regex3: "(?="|.{4}")(?![ ]*0)[0-9A-Z -]*"
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 6
Elapsed Time: 0.73 s, 728.48 ms, 728484 µs
Matches per sec: 411,814
Regex4: "(|[1-9\s][0-9A-Z\s]{2}[0-9A-Z])"
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 6
Elapsed Time: 0.85 s, 851.48 ms, 851481 µs
Matches per sec: 352,327

Related

How to capture groups

In C# with the .NET regex engine, I have an input line like this, terminated by \n:
1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR
I have to obtain
First capture
1. num=1
2. surname=ROSS
3. name=SVETA
4. name=JAMIE
5. title=MRS
6. other=T02XT
Second capture
1. num=2
2. surname=WHITE
3. name=VIKA
4. title=MS
Third capture
1. num=3
2. surname=GREEN
3. name=ANDY
4. title=MR
The first group has two names, and there is no space between ANDY and MR in the third group. I am unable to solve this problem. I started using
(^\d|\s\d)
to detect the groups and it works, but after that I do not know how to capture up to the end of each group and split the data inside it into subgroups.
If the title values are set to MR, MRS or MS, you may use
\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)
See the regex demo
Details
\b - word boundary
(?<num>\d) - Group "num": a digit (replace with \d+ if there can be more than 1)
(?<surname>\p{L}+) - Group "surname": 1+ letters
(?:/(?<name>\p{L}+?))+ - one or more sequences of / followed with Group "name": 1+ letters, as few as possible
(?:\s*(?<title>M(?:RS?|S)))? - an optional sequence of
\s* - 0+ whitespaces
(?<title>M(?:RS?|S)) - Group "title": M followed with R and optional S or followed with S
\b - word boundary
\s* - 0+ whitespaces
(?<other>.*?) - Group "other": 0 or more chars, as few as possible
(?=\b\d\p{L}+/\p{L}|$) - up to the first occurrence of the initial pattern (word boundary, digit, 1+ letters, / and a letter) or end of string.
C# demo:
var text = "1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR";
var pattern = @"\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)";
var result = Regex.Matches(text, pattern);
foreach (Match m in result) {
Console.WriteLine("Num: {0}", m.Groups["num"].Value);
Console.WriteLine("Surname: {0}", m.Groups["surname"].Value);
Console.WriteLine("Names: {0}", string.Join(", ", m.Groups["name"].Captures.Cast<Capture>().Select(x => x.Value)));
Console.WriteLine("Title: {0}", m.Groups["title"].Value);
Console.WriteLine("Other: {0}", m.Groups["other"].Value);
Console.WriteLine("===== NEXT MATCH ======");
}
Output:
Num: 1
Surname: ROSS
Names: SVETA, JAMIE
Title: MRS
Other: T02XT
===== NEXT MATCH ======
Num: 2
Surname: WHITE
Names: VIKA
Title: MS
Other:
===== NEXT MATCH ======
Num: 3
Surname: GREEN
Names: ANDY
Title: MR
Other:
===== NEXT MATCH ======

RegEx to find numbers sequence in string separated by space with predefined maximum length

Sorry for the confusing title, I'll try to explain this with example. Currently we have this expression to find number sequence in a string
\b((\d[ ]{0,1}){13,19})\b
Now I'd like to modify it so it fulfills these rules:
- The length should be between 13 and 19 digits, excluding whitespace
- Each number cluster must have minimum 3 digits
The expression should mark these as matched:
1234567890123
1234 5678 9012 345
Not match:
123456789012 3
123 12 123 1 23134
The current expression that I have marks all of them as matches.
This is possible using look-around.
The regex can be changed to the following:
\b(?<!\d )(?=(?:\d ?){13,19}(?! ?\d))(?:\d{3,} ?)+\b(?! ?\d)
This works by looking ahead to make sure the number is between 13 and 19 digits long. It then matches groups of 3 or more digits. Finally, it uses a negative lookahead after it has found all groups of 3 or more to make sure there aren't any digits left. If there are, we've found a group smaller than 3. This works on the examples you've provided.
\b Makes sure that it's the start of a "word".
(?<!\d ) Makes sure there is no digit followed by a space directly behind.
(?=(?:\d ?){13,19}(?! ?\d)) Looks ahead to make sure the number is between 13 and 19 digits long
(?:\d ?){13,19} From the original. ?: added to make it non-capturing
(?! ?\d) Negative lookahead: if there are still digits left after getting 19 digits, the number is too big, so the current match is discarded
(?:\d{3,} ?)+ Matches one or more clusters of 3 or more digits (min 13, max 19 handled by the first lookahead)
\b(?! ?\d) Looks for the end of a cluster. If there are still digits left after the end of the cluster, there must be a cluster that is too small.
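To see the lookaround pattern in action, here is a minimal C# sketch that runs it against the four sample strings from the question. It only illustrates the behaviour described above and is not part of the original answer; the variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// The lookaround-based pattern from the answer, unchanged.
var clusters = new Regex(@"\b(?<!\d )(?=(?:\d ?){13,19}(?! ?\d))(?:\d{3,} ?)+\b(?! ?\d)");

// The four sample strings from the question.
string[] inputs =
{
    "1234567890123",
    "1234 5678 9012 345",
    "123456789012 3",
    "123 12 123 1 23134"
};

foreach (var input in inputs)
{
    var verdict = clusters.IsMatch(input) ? "match" : "no match";
    Console.WriteLine("{0,-20} -> {1}", input, verdict);
}
```

The first two inputs match; the last two fail because the trailing (?! ?\d) finds digits left over after the last cluster of 3 or more, and the (?<!\d ) lookbehind stops a match from starting in the middle of the sequence.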
Test here
I suggest the following solution also based on lookarounds:
\b\d(?!\d?\b)(?: ?\d(?!(?<= \d)\d?\b)){12,18}\b
See the regex demo
The main point is that we only match the next digit if it is not a part of a 1- or 2-digit group.
Pattern explanation
\b - starting word boundary
\d(?!\d?\b) - a digit that is not followed with 1 or 0 digits and then a trailing word boundary (that is, a leading group like 12 or 1 fails the match)
(?: ?\d(?!(?<= \d)\d?\b)){12,18} - 12 to 18 occurrences of:
? - 1 or 0 spaces
\d(?!(?<= \d)\d?\b) - any single digit that, when it is preceded by a space (the (?<= \d) lookbehind checks that), is not followed with 1 or 0 digits and then a word boundary (the \d?\b part)
\b - a trailing word boundary.
NOTE that in case you want to match these strings in a non-numeric context (that means, if you do not want to allow any digits on the left and on the right) you might also consider adding (?<!\d *) at the front and (?! *\d) at the end of the pattern.
Note that to match any whitespace, you may replace a literal space with \s in the pattern.
If you can use Linq, this will be way easier to maintain:
var myList = new List<string>
{
"1234567890123",
"1234 5678 9012 345",
"123456789012 3",
"123 12 123 1 23134"
};
foreach(var input in myList)
{
var splitted = Regex.Split(input, @"\s+"); // Split on whitespace
var length = splitted.Sum(x => x.Length); // Compute the total length
var smallestGroupSize = splitted.Min(x => x.Length); // Compute the length of the smallest chunk
Console.WriteLine($"Total length: {length}, smallest group size: {smallestGroupSize}");
if (length < 13 || length > 19 || smallestGroupSize < 3)
{
Console.WriteLine($"Input '{input}' is incorrect{Environment.NewLine}");
continue;
}
Console.WriteLine($"Input '{input}' is correct!{Environment.NewLine}");
}
which produces:
Total length: 13, smallest group size: 13
Input '1234567890123' is correct!
Total length: 15, smallest group size: 3
Input '1234 5678 9012 345' is correct!
Total length: 13, smallest group size: 1
Input '123456789012 3' is incorrect
Total length: 14, smallest group size: 1
Input '123 12 123 1 23134' is incorrect

How to get predecessor id from task predecessor string?

I'm working with the MPXJ library. I want to get the predecessor IDs from the string below. It's complex for me. Please help me get all the predecessor IDs. Thanks.
Task predecessor string:
Task Predecessors:[[Relation [Task id=12 uniqueID=145 name=Alibaba1] -> [Task id=10 uniqueID=143 name=Alibaba2]],
[Relation [Task id=12 uniqueID=145 name=Alibaba3] -> [Task id=11 uniqueID=144 name=Alibaba4]], [Relation [Task id=12 uniqueID=145 name=Alibaba5] -> [Task id=9 uniqueID=142 name=Alibaba6]]]
I need to get the predecessor IDs: 10, 11, 9
Pattern:
[Task id=12 uniqueID=145 name=Alibaba1] -> [Task id=10 uniqueID=143 name=Alibaba2]]
To grab those IDs, you need to look for the Task id after ->. You can try the following using the Matches method.
Regex rgx = new Regex(@"->\s*\[Task\s*id=(\d+)");
foreach (Match m in rgx.Matches(input))
Console.WriteLine(m.Groups[1].Value);
Working Demo
Explanation:
-> # '->'
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
\[ # '['
Task # 'Task'
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
id= # 'id='
( # group and capture to \1:
\d+ # digits (0-9) (1 or more times)
) # end of \1
No need for capture groups. See full C# online demo
My original answer used capture groups. But we don't need them.
You can use this regex:
(?<=-> \[Task id=)\d+
See the output at the very bottom of this C# online demo:
10
11
9
The (?<=-> \[Task id=) lookbehind ensures that we are preceded by the section from the arrow to the equal sign
\d+ matches the id
This C# code adds all the IDs to resultList:
// s1 is assumed to hold the task predecessor string
var resultList = new List<string>();
var myRegex = new Regex(@"(?<=-> \[Task id=)\d+");
Match matchResult = myRegex.Match(s1);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
Console.WriteLine(matchResult.Value);
matchResult = matchResult.NextMatch();
}
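For a self-contained variant of the lookbehind approach, the following sketch collects the IDs with Regex.Matches and LINQ instead of a NextMatch loop. It is only an illustration: the input is the sample string from the question, and the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample task predecessor string from the question, split across lines for readability.
var input =
    "Task Predecessors:[[Relation [Task id=12 uniqueID=145 name=Alibaba1] -> [Task id=10 uniqueID=143 name=Alibaba2]], " +
    "[Relation [Task id=12 uniqueID=145 name=Alibaba3] -> [Task id=11 uniqueID=144 name=Alibaba4]], " +
    "[Relation [Task id=12 uniqueID=145 name=Alibaba5] -> [Task id=9 uniqueID=142 name=Alibaba6]]]";

// The lookbehind keeps only the digits that directly follow "-> [Task id=".
var ids = Regex.Matches(input, @"(?<=-> \[Task id=)\d+")
               .Cast<Match>()
               .Select(m => int.Parse(m.Value))
               .ToList();

Console.WriteLine(string.Join(", ", ids)); // prints "10, 11, 9"
```

Note that this pattern expects exactly one space after the arrow; if the spacing can vary, the capture-group version with \s* is the more tolerant choice.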
Original Version with Capture Groups
To give you a second option, here is my original demo using a capture group.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Percentage Regex with comma

I have this RegEx for C# ASP.NET MVC3 Model validation:
[RegularExpression(@"[0-9]*\,?[0-9]?[0-9]")]
This works for almost all cases, except if the number is bigger than 100.
Any number greater than 100 should show an error.
I already tried using [Range], but it doesn't work with commas.
Valid: 0 / 0,0 / 0,00 - 100 / 100,0 / 100,00.
Invalid (Number > 100).
I'm not sure whether only zeros are allowed as the optional digits at the end, but this allows any digits after the comma for values below 100:
# (?:100(?:,0{1,2})?|[0-9]{1,2}(?:,[0-9]{1,2})?)
(?:
100
(?: , 0{1,2} )?
|
[0-9]{1,2}
(?: , [0-9]{1,2} )?
)
If only zeros are allowed at the end:
# (?:100|[0-9]{1,2})(?:,0{1,2})?
(?:
100
| [0-9]{1,2}
)
(?: , 0{1,2} )?
And the variants that allow no leading zeros, except for zero itself:
# (?:100(?:,0{1,2})?|(?:0|[1-9][0-9]?)(?:,[0-9]{1,2})?)
(?:
100
(?: , 0{1,2} )?
|
(?:
0
|
[1-9] [0-9]?
)
(?: , [0-9]{1,2} )?
)
# (?:100|0|[1-9][0-9]?)(?:,0{1,2})?
(?:
100
|
0
|
[1-9] [0-9]?
)
(?: , 0{1,2} )?
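Since the question is about C# validation, here is a minimal sketch that exercises the no-leading-zeros variant allowing any decimals below 100. The ^ and $ anchors are an assumption added so the whole input has to match (the patterns above are written without anchors), and the sample values are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// No leading zeros (except 0 itself), up to two decimals, and 100 only with zero decimals.
// The ^ and $ anchors are added here so that the entire input must match.
var percent = new Regex(@"^(?:100(?:,0{1,2})?|(?:0|[1-9][0-9]?)(?:,[0-9]{1,2})?)$");

// Hypothetical sample values.
string[] samples = { "0", "10", "99,99", "100", "100,00", "01", "0,000", "100,01", "101" };

foreach (var s in samples)
{
    var verdict = percent.IsMatch(s) ? "valid" : "invalid";
    Console.WriteLine("{0,-8} -> {1}", s, verdict);
}
```

The first five samples are accepted and the last four are rejected. Putting the 100 alternative first keeps the upper bound explicit instead of relying on digit counts alone.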
Here's a RegEx that matches your criteria:
^(?:(?:0|[1-9][0-9]?)(?:,[0-9]{1,2})?|(?:100)(?:,0{1,2})?)$
(Given your use case, I have assumed that your character sequence appears by itself and is not embedded within other content. Please let me know if that is not the case.)
And here's a Perl program that demonstrates that RegEx on a sample data set. (Also see live demo.)
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    # A1 => An integer between 1 and 99, without leading zeros.
    #       (Although zero can appear by itself.)
    #
    # A2 => An optional fractional component that may contain no more
    #       than two digits.
    #
    # -OR-
    #
    # B1 => The integer 100.
    #
    # B2 => An optional fractional component following that may
    #       consist of one or two zeros only.
    #
    if (/^(?:(?:0|[1-9][0-9]?)(?:,[0-9]{1,2})?|(?:100)(?:,0{1,2})?)$/) {
    #        ^^^^^^^A1^^^^^^^^^^^^^^^A2^^^^^^^ ^^B1^^^^^^^^B2^^^^^
        print "* [$_]\n";
    } else {
        print "  [$_]\n";
    }
}
__DATA__
0
01
11
99
100
101
0,0
0,00
01,00
0,000
99,00
99,99
100,0
100,00
100,000
100,01
100,99
101,00
Expected Output
* [0]
[01]
* [11]
* [99]
* [100]
[101]
* [0,0]
* [0,00]
[01,00]
[0,000]
* [99,00]
* [99,99]
* [100,0]
* [100,00]
[100,000]
[100,01]
[100,99]
[101,00]

Match regex pattern for array of dollar denominations

A user can input the following:
I $1 $5 $10 $20 $50 $100
Ordering is not important, and I'm not worried if they enter a denomination more than once (i.e. I $1 $5 $5). The beginning of the input starts with a capital "I" followed by a space.
What I have so far is this but I'm not too familiar with regex and cannot make it match my desired pattern:
^I\s(\$1|\$[5]|\$10|\$20|\$50|\$[100])$
I want to validate that the input is valid.
regex = "^I(?:\s\$(?:10?0?|20|50?))+$"
^I says begins with 'I'
(?:\s\$ says group, but do not capture whitespace followed by a '$' followed by the next expression
(?:10?0?|20|50?) says group, but do not capture 1 followed by up to two 0's or 20 or 5 followed by up to one 0
+ says at least one match
$ says ends with the preceding
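A short C# sketch of how this pattern could be used for validation. The sample inputs are taken from the question plus a few made-up invalid ones; the variable names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// "I", then one or more groups of whitespace + "$" + an allowed denomination.
// 10?0? covers 1, 10 and 100; 50? covers 5 and 50.
var denominations = new Regex(@"^I(?:\s\$(?:10?0?|20|50?))+$");

string[] samples =
{
    "I $1 $5 $10 $20 $50 $100", // valid
    "I $1 $5 $5",               // valid (repeats are allowed)
    "I wrong",                  // invalid
    "$1 $5 $5",                 // invalid (missing leading "I")
    "I",                        // invalid (no denomination)
    "I $2",                     // invalid (not an allowed value)
    "I $1000"                   // invalid (too many zeros)
};

foreach (var s in samples)
{
    var verdict = denominations.IsMatch(s) ? "valid" : "invalid";
    Console.WriteLine("{0,-25} -> {1}", s, verdict);
}
```

Because the group uses + rather than *, at least one denomination is required after the leading I.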
The idea is to expect after I either a space, or a $1, $5, etc...
string[] options =
{
"I $1 $5 $10 $20 $50 $100",
"I $1 $5 $5",
"I wrong",
"$1 $5 $5",
"I",
"I ",
};
var reg = new Regex(@"^I\s(\s|(\$1|\$5|\$10|\$20|\$50|\$100))*$");
foreach (var option in options)
{
var status = reg.Match(option).Success ? "valid" : "invalid";
Console.WriteLine("{0} -> is {1} (length: {2})", option.PadRight(25), status, option.Length);
}
prints:
I $1 $5 $10 $20 $50 $100 -> is valid (length: 24)
I $1 $5 $5 -> is valid (length: 10)
I wrong -> is invalid (length: 7)
$1 $5 $5 -> is invalid (length: 8)
I -> is invalid (length: 1)
I -> is valid (length: 2)
You could use a regex that only checks for whitespace, dollar signs, and digits:
I[\s\$\d]*$
However, I would suggest you use input.Split(' ') and go from there.
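A minimal sketch of the Split-based approach, assuming tokens are separated by single spaces and that at least one denomination is required. The helper name IsValidInput and the allowed set are hypothetical, introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Allowed denominations (assumed from the question's example input).
var allowed = new HashSet<string> { "$1", "$5", "$10", "$20", "$50", "$100" };

// Hypothetical helper: the first token must be "I", every other token must be allowed.
bool IsValidInput(string input)
{
    var tokens = input.Split(' ');
    return tokens.Length > 1
        && tokens[0] == "I"
        && tokens.Skip(1).All(allowed.Contains);
}

Console.WriteLine(IsValidInput("I $1 $5 $10 $20 $50 $100")); // True
Console.WriteLine(IsValidInput("I $1 $5 $5"));               // True
Console.WriteLine(IsValidInput("I wrong"));                  // False
Console.WriteLine(IsValidInput("I "));                       // False (empty token after the space)
```

Compared with the loose regex above, this rejects values such as $2 or $1000, and the list of denominations can be changed without touching a pattern.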
