How to capture groups - c#

In C# and NET regex engine, I have an input line like this and it is terminated by \n
1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR
I have to obtain
First capture
1. num=1
2. surname=ROSS
3. name=SVETA
4. name=JAMIE
5. title=MRS
6. other=T02XT
Second capture
1. num=2
2. surname=WHITE
3. name=VIKA
4. title=MS
Third capture
1. num=3
2. surname=GREEN
3. name=ANDY
4. title=MR
The first group has two names and there is no space within ANDY and MR in the third group. I am unable to solve this problem. I started using
(^\d|\s\d)
to detect the groups and it works, but after I do not know how to capture till the end of each group and split into subgroups the inside data.

If the title values are set to MR, MRS or MS, you may use
\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)
See the regex demo
Details
\b - word boundary
(?<num>\d) - Group "num": a digit (replace with \d+ if there can be more than 1)
(?<surname>\p{L}+) - Group "surname": 1+ letters
(?:/(?<name>\p{L}+?))+ - one or more sequences of / followed with Group "surname": 1+ letters, as few as possible
(?:\s*(?<title>M(?:RS?|S)))? - an optional sequence of
\s* - 0+ whitespaces
(?<title>M(?:RS?|S)) - Group "title": M followed with R and optional S or followed with S
\b - word boundary
\s* - 0+ whitespaces
(?<other>.*?) - Group "other": 0 or more chars, as few as possible
(?=\b\d\p{L}+/\p{L}|$) - up to the first occurrence of the initial pattern (word boundary, digit, 1+ letters, / and a letter) or end of string.
C# demo:
var text = "1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR";
var pattern = #"\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)";
var result = Regex.Matches(text, pattern);
foreach (Match m in result) {
Console.WriteLine("Num: {0}", m.Groups["num"].Value);
Console.WriteLine("Surname: {0}", m.Groups["surname"].Value);
Console.WriteLine("Names: {0}", string.Join(", ", m.Groups["name"].Captures.Cast<Capture>().Select(x => x.Value)));
Console.WriteLine("Title: {0}", m.Groups["title"].Value);
Console.WriteLine("Other: {0}", m.Groups["other"].Value);
Console.WriteLine("===== NEXT MATCH ======");
}
Output:
Num: 1
Surname: ROSS
Names: SVETA, JAMIE
Title: MRS
Other: T02XT
===== NEXT MATCH ======
Num: 2
Surname: WHITE
Names: VIKA
Title: MS
Other:
===== NEXT MATCH ======
Num: 3
Surname: GREEN
Names: ANDY
Title: MR
Other:
===== NEXT MATCH ======

Related

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+", #"$1");
dosage_value = Regex.Replace(dosage_value, #"(\d)%\s+", #"$1%");
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d+)?)", #"$1 ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+%", #"$1% ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+:", #"$1:");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+e", #"$1e");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+E", #"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use is in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match a as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = #"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
As in .NET \d can also match digits from other languages, \s can also match a newline and the start of the pattern might be a partial match, a bit more precise match can be:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", #"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 2 - ((E|e|%|:)+)
Any of special symbols like E e % :
Group 1 and Group 2 could be separated with any number of whitespaces.
If it's not working as you asking, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one ore more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one ore more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, #"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, #"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, #"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, #"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

Regex C# Matching string from two words in exact order and returning capture of non-matched words

C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = #"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
Console.WriteLine("First word: {0}", match.Groups[1].Value);
Console.WriteLine("Last words: {0}", match.Groups[2].Value);
Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
See the regex demo
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, #"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
Results:

RegEx split string into words by space and containing chars

How can one perform this split with the Regex.Split(input, pattern) method?
This is a [normal string ] made up of # different types # of characters
Array of strings output:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
Also it should keep the leading spaces, so I want to preserve everything. A string contains 20 chars, array of strings should total 20 chars across all elements.
What I have tried:
Regex.Split(text, #"(?<=[ ]|# #)")
Regex.Split(text, #"(?<=[ ])(?<=# #")
I suggest matching, i.e. extracting words, not splitting:
string source = #"This is a [normal string ] made up of # different types # of characters";
// Three possibilities:
// - plain word [A-Za-z]+
// - # ... # quotation
// - [ ... ] quotation
string pattern = #"[A-Za-z]+|(#.*?#)|(\[.*?\])";
var words = Regex
.Matches(source, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, words
.Select((w, i) => $"{i + 1}. {w}")));
Outcome:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
You may use
var res = Regex.Split(s, #"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
See the regex demo
The (\[[^][]*]|#[^#]*#) part is a capturing group whose value is output to the resulting list along with the split items.
Pattern details
(\[[^][]*]|#[^#]*#) - Group 1: either of the two patterns:
\[[^][]*] - [, followed with 0+ chars other than [ and ] and then ]
#[^#]*# - a #, then 0+ chars other than # and then #
| - or
\s+ - 1+ whitespaces
C# demo:
var s = "This is a [normal string ] made up of # different types # of characters";
var results = Regex.Split(s, #"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
Console.WriteLine(string.Join("\n", results));
Result:
This
is
a
[normal string ]
made
up
of
# different types #
of
characters
It would be easier using matching approach however it can be done using negative lookeaheads :
[ ](?![^\]\[]*\])(?![^#]*\#([^#]*\#{2})*[^#]*$)
matches a space not followed by
any character sequence except [ or ] followed by ]
# followed by an even number of #

Regex with balancing groups

I need to write regex that capture generic arguments (that also can be generic) of type name in special notation like this:
System.Action[Int32,Dictionary[Int32,Int32],Int32]
lets assume type name is [\w.]+ and parameter is [\w.,\[\]]+
so I need to grab only Int32, Dictionary[Int32,Int32] and Int32
Basically I need to take something if balancing group stack is empty, but I don't really understand how.
UPD
The answer below helped me solve the problem fast (but without proper validation and with depth limitation = 1), but I've managed to do it with group balancing:
^[\w.]+ #Type name
\[(?<delim>) #Opening bracet and first delimiter
[\w.]+ #Minimal content
(
[\w.]+
((?(open)|(?<param-delim>)),(?(open)|(?<delim>)))* #Cutting param if balanced before comma and placing delimiter
((?<open>\[))* #Counting [
((?<-open>\]))* #Counting ]
)*
(?(open)|(?<param-delim>))\] #Cutting last param if balanced
(?(open)(?!) #Checking balance
)$
Demo
UPD2 (Last optimization)
^[\w.]+
\[(?<delim>)
[\w.]+
(?:
(?:(?(open)|(?<param-delim>)),(?(open)|(?<delim>))[\w.]+)?
(?:(?<open>\[)[\w.]+)?
(?:(?<-open>\]))*
)*
(?(open)|(?<param-delim>))\]
(?(open)(?!)
)$
I suggest capturing those values using
\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*
See the regex demo.
Details:
\w+(?:\.\w+)* - match 1+ word chars followed with . + 1+ word chars 1 or more times
\[ - a literal [
(?:,?(?<res>\w+(?:\[[^][]*])?))* - 0 or more sequences of:
,? - an optional comma
(?<res>\w+(?:\[[^][]*])?) - Group "res" capturing:
\w+ - one or more word chars (perhaps, you would like [\w.]+)
(?:\[[^][]*])? - 1 or 0 (change ? to * to match 1 or more) sequences of a [, 0+ chars other than [ and ], and a closing ].
A C# demo below:
var line = "System.Action[Int32,Dictionary[Int32,Int32],Int32]";
var pattern = #"\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups["res"].Captures.Cast<Capture>()
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);
UPDATE: To account for unknown depth [...] substrings, use
\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*
See the regex demo

How to get predecessor id from task predecessor string?

I'm working on MPXJ library. I want to get predecessors id from below string. It's complex for me. Please help me get all predecessors id. Thanks.
Task predecessor string:
Task Predecessors:[[Relation [Task id=12 uniqueID=145 name=Alibaba1] -> [Task id=10 uniqueID=143 name=Alibaba2]],
[Relation [Task id=12 uniqueID=145 name=Alibaba3] -> [Task id=11 uniqueID=144 name=Alibaba4]], [Relation [Task id=12 uniqueID=145 name=Alibaba5] -> [Task id=9 uniqueID=142 name=Alibaba6]]]
I need get the predecessors id: 10, 11, 9
Pattern:
[Task id=12 uniqueID=145 name=Alibaba1] -> [Task id=10 uniqueID=143 name=Alibaba2]]
To grab those ID's you need to look for the Task id after -> You can try the following using Matches method.
Regex rgx = new Regex(#"->\s*\[Task\s*id=(\d+)");
foreach (Match m in rgx.Matches(input))
Console.WriteLine(m.Groups[1].Value);
Working Demo
Explanation:
-> # '->'
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
\[ # '['
Task # 'Task'
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
id= # 'id='
( # group and capture to \1:
\d+ # digits (0-9) (1 or more times)
) # end of \1
No need for capture groups. See full C# online demo
My original answer used capture groups. But we don't need them.
You can use this regex:
(?<=-> \[Task id=)\d+
See the output of at the very bottom of this C# online demo:
10
11
9
The (?<=-> \[Task id=) lookbehind ensures that we are preceded by the section from the arrow to the equal sign
\d+ matches the id
This C# code adds all the codes to resultList:
var myRegex = new Regex(#"(?<=-> \[Task id=)\d+");
Match matchResult = myRegex.Match(s1);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
Console.WriteLine(matchResult.Value);
matchResult = matchResult.NextMatch();
}
Original Version with Capture Groups
To give you a second option, here is my original demo using a capture group.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Categories

Resources