How to match values ending with an optional string through Regex? - c#

I am trying to extract a first name from a text snippet, which optionally has a last name in the same line as: <first_name>name<last_name>
E.g.:
Text: JohnnameSnow -> Result: John
Text: John -> Result: John
So I want to extract the <first_name> part from that line, but if there is no name<last_name> it should return the full line.
I have tried the following Regex:
([A-zÀ-ÿ-]{2,})(?=(?:name))
That works fine if there's actually a last name in the same line, but it does not return the full line when there is none. Unfortunately the solution doesn't seem to be as easy as adding |$.
Can I look for an optional end word and ignore it if it does not occur?

You can use
^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$
See the regex demo.
Details
^ - start of string
(?<first>\p{L}+?) - Group "first": one or more letters, but as few as possible
(?:name(?<last>\p{L}+))? - an optional non-capturing group:
name - a substring
(?<last>\p{L}+) - Group "last": one or more letters
$ - end of string.
See the C# demo:
var strings = new List<string> { "JohnnameSnow", "John" };
foreach (var s in strings)
{
    Console.WriteLine(s);
    var m = Regex.Match(s, @"^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$");
    if (m.Success)
    {
        Console.WriteLine("First name: {0}, Last name = {1}", m.Groups["first"].Value, m.Groups["last"].Value);
    }
    else
    {
        Console.WriteLine("No match!");
    }
}
Output:
JohnnameSnow
First name: John, Last name = Snow
John
First name: John, Last name =

Related

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups, where the second group is in an optional part.
In the callback of replace, check whether capture group 2 exists. If it does, use it in the replacement; otherwise add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
    "10ANYUNIT",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
    Regex.Replace(
        s, pattern, m =>
            m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    )
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
In .NET, \d can also match digits from other scripts, \s can also match a newline, and without a boundary the pattern can start matching in the middle of a word. A somewhat more precise pattern is:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
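A minimal sketch of how the stricter pattern could be plugged into the same callback approach. The Normalize helper name and the final TrimEnd() are assumptions added for illustration, not part of the answer. The trim is there because the digits of an exponent (the 5 in e-5) are themselves matched as a number and would otherwise get a trailing space.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the answer): applies the stricter pattern
// with the same "use group 2 if present, else add a space" callback.
string Normalize(string s) =>
    Regex.Replace(
        s,
        @"\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?",
        m => m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    ).TrimEnd(); // assumption: a trailing space is never wanted

foreach (var s in new[] { "10ANYUNIT", "10 : something", "10 %", "40 e-5" })
{
    Console.WriteLine(Normalize(s));
}
```

Because of the leading \b, a number glued to a preceding word character (for example the 10 in a10) is left untouched.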
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 3 - ((E|e|%|:)+)
Any of the special symbols like E e % :
Group 1 and Group 3 can be separated by any amount of whitespace.
If it's not working as you expect, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System;
using System.Text.RegularExpressions;

var testStrings = new string[]
{
    "10mg",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
foreach (var testString in testStrings)
{
    Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}

string RegexReplace(string input)
{
    // First look for exponential notation.
    // Pattern is: match zero or more whitespace characters \s*
    // Then match one or more digits and store them in the first capturing group (\d+)
    // Then match one or more whitespace characters \s+
    // Then match the exponent part ([eE][-+]?\d+) and store it in the second capturing group.
    // It matches a lower- or uppercase 'e' with an optional (due to the ? operator) dash/plus sign and one or more digits.
    // Then match zero or more whitespace characters.
    var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
    if (expForMatch.Success)
    {
        return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
    }
    var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
    if (matchWithColon.Success)
    {
        return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
    }
    var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
    if (matchWithPercent.Success)
    {
        return $"{matchWithPercent.Groups[1].Value}%";
    }
    var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
    if (matchWithUnit.Success)
    {
        return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
    }
    return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

Is it possible to have overlapping regex matches?

Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.
I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for (int i = 1; i < fields.Length - 1; ++i)
{
    // First() and Last() are extension methods from System.Linq.
    string result = string.Join("|", fields.First(), fields[i], fields.Last());
    Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
A single regular expression to return multiple matches on multiple calls? 
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want. 
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”. 
So use a while loop structure.
h saves the pattern space into the hold space. 
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result. 
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space. 
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest. 
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command. 
It says to go back to the beginning of the loop
if the most recent substitution succeeded. 
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it. 
And then change “A|C|D|E” to “A|D|E” and iterate. 
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it. 
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.) 
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.
You can make use of the .NET Group.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1 match ID: 1+ word chars and |
(\w+: \d+\|)+ Capture group 2 Repeat 1+ times matching 1+ word chars : 1+ digits |
(Expiry: \w+,\d+) Capture group 3 match Expiry: 1+ word chars , and 1+ digits
See a .NET regex demo | C# demo
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = @"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach (Capture c in m.Groups[2].Captures)
{
    Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
It should be possible with look behind and look ahead:
string foo = @"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First look at "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind "(?<= )" and search for the ID. Then look ahead "(?= )" and search for Expiry.
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (var match in Regex.Matches(foo, pattern))
{
    ....
}
Fortunately, C# is one of the few languages that can handle variable-length lookbehinds.
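As a rough sketch of how the three groups of that pattern could be combined, assuming the same space-separated input: group 1 is the instrument, group 2 the ID found by the lookbehind, and group 3 the expiry found by the lookahead. The results list and the "|" join are illustrative choices, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string foo = "ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";

// Illustrative container for the recombined strings.
var results = new List<string>();
foreach (Match match in Regex.Matches(foo, pattern))
{
    // Group 2 (ID) + group 1 (instrument) + group 3 (expiry).
    results.Add(string.Join("|",
        match.Groups[2].Value, match.Groups[1].Value, match.Groups[3].Value));
}
results.ForEach(Console.WriteLine);
```

Declaring the loop variable as Match rather than var gives typed access to Groups even on frameworks where MatchCollection only implements the non-generic IEnumerable.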

Why does my reg ex not capture 2nd and subsequent lines?

Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's reg ex a little bit solved all problems. See Poul Bak's answer below
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: '2018-10-16 12:00:03 PM - Error 3'
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp_RegEx
{
    class Program
    {
        static void Main(string[] args)
        {
            string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
            string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
            RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
            Regex rx = new Regex(pattern, regexOptions);
            // Find matches.
            MatchCollection matches = rx.Matches(text);
            Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
            // Report the number of matches found.
            Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
            int m = 1;
            // Report on each match.
            foreach (Match match in matches)
            {
                Console.WriteLine("Match #{0}: {1}", m++, match.Value);
                int g = 1;
                GroupCollection groups = match.Groups;
                foreach (Group group in groups)
                {
                    Console.WriteLine(" Group #{0} {1}", g++, group.Value);
                }
                Console.WriteLine();
            }
            Console.Read();
        }
    }
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand it, when you say 'AM' and 'PM' are not matched, they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
    Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex looks like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag, then you only get named groups.
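A minimal sketch that puts the final regex and the ExplicitCapture suggestion together. The inline text is a shortened stand-in for test_text.txt (an assumption, so the example runs without a file), and the Trim() call is only there to drop the line break left at the end of each Text group.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened stand-in for test_text.txt (assumption for this sketch).
string text =
    "2018-10-16 12:00:01 - Error 1\r\nError 1 text line 1\r\n" +
    "2018-10-16 12:00:02 AM - Error 2\r\nError 2 text line 1\r\n" +
    "2018-10-16 12:00:03 PM - Error 3";

string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)";

// ExplicitCapture: only the named groups (plus the whole match) are reported.
MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.ExplicitCapture);

foreach (Match match in matches)
{
    Console.WriteLine("Date: '{0}', Text: '{1}'",
        match.Groups["Date"].Value,
        match.Groups["Text"].Value.Trim());
}
```

One caveat with this lookahead: (?!\d{4}) stops the Text group before any run of four digits, not only before a date, so a message containing a number such as a four-digit error code would be cut short.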

Regular expression to match optional group

I'm expecting a string that may or may not have '#' followed by some digits. If there are any digits after '#', I want to capture them as the second group; otherwise, just capture everything in the first group. The following are examples for C#:
ABC#99999 //match Group1: ABC and Group2: 99999
9ABC#8 //match Group1: 9ABC and Group2: 8
9ABC //match Group1: 9ABC
9ABC# //match Group1: 9ABC#
The following regex works, but for the 3rd and 4th strings it captures into group 3 instead of group 1. Is there a better way to handle the above scenario?
(?:(.+)#(\d+))|(.+)
Alternatively, I came up with the following regex, but since the first group doesn't have a fixed format (like a length), it captures the whole string for the 1st and 2nd strings instead of splitting them into 2 groups:
(.+)(?:#(\d+))?
To match a whole string and place the part before an optional # + digits into Group 1 and all those digits into Group 2, you may use
^(.+?)(?:#(\d+))?$
See the .NET regex demo. The \r? is added because it is a multiline input demo, you won't need it if you plan to test separate strings against the pattern.
Details
^ - start of string
(.+?) - Group 1: one or more chars as few as possible (due to +? lazy quantifier) (NOTE that if Group 1 value may be missing, use (.*?) instead)
(?:#(\d+))? - an optional non-capturing group matching 1 or 0 occurrences of
# - a # symbol
(\d+) - Group 2: one or more digits
$ - end of string.
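A short sketch of the pattern applied to the four sample strings from the question, one string at a time (so no \r? is needed). The "(none)" placeholder is just an illustrative choice for an unmatched Group 2.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^(.+?)(?:#(\d+))?$";
string[] inputs = { "ABC#99999", "9ABC#8", "9ABC", "9ABC#" };

foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    // Group 2 only participates when '#' is followed by at least one digit.
    string second = m.Groups[2].Success ? m.Groups[2].Value : "(none)";
    Console.WriteLine("{0} -> Group 1: {1}, Group 2: {2}", input, m.Groups[1].Value, second);
}
```

For 9ABC# the optional group cannot match (no digits follow the #), so the lazy Group 1 expands to the end of the string and keeps the trailing #.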
Try
(\w+)(?:#(\d+))?
The # is in an optional non-capturing group, and the following digits in a captured group inside the non-capturing group.
https://regex101.com/r/obKPFw/1
An alternative solution, WITHOUT using regular expressions:
using System;
using System.Collections.Generic;

public class Program
{
    static void Main(string[] args)
    {
        List<string> inputs = new List<string>
        {
            "ABC#99999",
            "9ABC#8",
            "9ABC",
            "9ABC#"
        };
        var groups = new List<Group>();
        foreach (string input in inputs)
        {
            string[] parts = input.Split("#", StringSplitOptions.RemoveEmptyEntries);
            // Default: the whole input goes into Part1.
            var group = new Group
            {
                Part1 = input
            };
            // Only split when there is something on both sides of '#'.
            if (parts.Length == 2)
            {
                group.Part1 = parts[0];
                group.Part2 = parts[1];
            }
            groups.Add(group);
            Console.WriteLine($"Input: '{input}': {group}");
        }
        Console.ReadKey();
    }
}

public class Group
{
    public string Part1 { get; set; }
    public string Part2 { get; set; }

    /// <inheritdoc />
    public override string ToString()
    {
        return $"Part1: {Part1 ?? "null"}, Part2: {Part2 ?? "[null]"}";
    }
}
Output:
Input: 'ABC#99999': Part1: ABC, Part2: 99999
Input: '9ABC#8': Part1: 9ABC, Part2: 8
Input: '9ABC': Part1: 9ABC, Part2: [null]
Input: '9ABC#': Part1: 9ABC#, Part2: [null]

Regex C# Matching string from two words in exact order and returning capture of non-matched words

C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = @"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
    Console.WriteLine("First word: {0}", match.Groups[1].Value);
    Console.WriteLine("Last words: {0}", match.Groups[2].Value);
    Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
See the regex demo
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
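The group check described above could be wrapped up roughly like this. The Extract helper and its fallback of returning the input unchanged when nothing matches are assumptions made for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)";

// Hypothetical helper: returns Group 1 and Group 2 when Group 1 matched,
// otherwise the whole match (or the input itself if nothing matched).
string[] Extract(string s)
{
    Match match = Regex.Match(s, pattern);
    if (!match.Success) return new[] { s };
    return match.Groups[1].Success
        ? new[] { match.Groups[1].Value, match.Groups[2].Value }
        : new[] { match.Value };
}

foreach (string s in new[] { "New patient, limited", "Established patient, brief", "School Physical" })
{
    Console.WriteLine(string.Join(" | ", Extract(s)));
}
```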
