Why does my regex not capture the 2nd and subsequent lines? - C#

Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's regex a little solved all the problems. See Poul Bak's answer below.
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: '2018-10-16 12:00:03 PM - Error 3'
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp_RegEx
{
    class Program
    {
        static void Main(string[] args)
        {
            string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
            string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
            RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
            Regex rx = new Regex(pattern, regexOptions);
            MatchCollection ms = rx.Matches(text);
            // Find matches.
            MatchCollection matches = rx.Matches(text);
            Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
            // Report the number of matches found.
            Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
            int m = 1;
            // Report on each match.
            foreach (Match match in matches)
            {
                Console.WriteLine("Match #{0}: ", m++, match.Value);
                int g = 1;
                GroupCollection groups = match.Groups;
                foreach (Group group in groups)
                {
                    Console.WriteLine(" Group #{0} {1}", g++, group.Value);
                }
                Console.WriteLine();
            }
            Console.Read();
        }
    }
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"

You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand it, when you say 'AM' and 'PM' are not matched, they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
    Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex looks like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag, then you only get named groups.
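As a rough check of the final pattern, the sketch below runs it against the sample log from the question, embedded as a string literal rather than read from the file. The names LogRegexSketch and ParseEntries are made up for illustration. The Trim() call is an added assumption: the lookahead stops the Text group just before the line feed that precedes the next date, so a trailing carriage return would otherwise remain in the group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogRegexSketch
{
    // Pattern from the answer above; ExplicitCapture keeps only the named groups.
    private static readonly Regex EntryPattern = new Regex(
        @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)",
        RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

    // Hypothetical helper: returns one (date, text) pair per log entry.
    public static List<KeyValuePair<string, string>> ParseEntries(string log)
    {
        var entries = new List<KeyValuePair<string, string>>();
        foreach (Match match in EntryPattern.Matches(log))
        {
            // Trim() is an assumption: it drops the <CR> left over before the next date.
            entries.Add(new KeyValuePair<string, string>(
                match.Groups["Date"].Value,
                match.Groups["Text"].Value.Trim()));
        }
        return entries;
    }

    public static void Main()
    {
        string log =
            "2018-10-16 12:00:01 - Error 1\r\nError 1 text line 1\r\nError 1 text line 2\r\n" +
            "2018-10-16 12:00:02 AM - Error 2\r\nError 2 text line 1\r\nError 2 text line 2\r\n" +
            "Error 2 text line 3\r\nError 2 text line 4\r\n" +
            "2018-10-16 12:00:03 PM - Error 3";

        foreach (var entry in ParseEntries(log))
        {
            // Show each entry on one line, with " | " standing in for the line breaks.
            Console.WriteLine("[{0}] {1}", entry.Key, entry.Value.Replace("\r\n", " | "));
        }
    }
}
```

One caveat with this pattern: the (?!\d{4}) lookahead ends the Text group before any run of four digits, not only before a date, so a message that contains a four-digit number would be cut short there.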

Related

How to match values ending with an optional string through Regex?

I am trying to extract a first name from a text snippet, which optionally has a last name in the same line as: <first_name>name<last_name>
E.g.:
Text: JohnnameSnow -> Result: John
Text: John -> Result: John
So I want to extract the <first_name> part from that line, but if there is no name<last_name> it should return the full line.
I have tried the following Regex:
([A-zÀ-ÿ-]{2,})(?=(?:name))
That works fine if there's actually a last name in the same line, but does not return me the full line when there is not. Unfortunately the solution doesn't seem to be as easy as adding |$.
Can I look for an optional end word and ignore it if it does not occur?
You can use
^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$
See the regex demo.
Details
^ - start of string
(?<first>\p{L}+?) - Group "first": one or more letters, but as few as possible
(?:name(?<last>\p{L}+))? - an optional non-capturing group:
name - a substring
(?<last>\p{L}+) - Group "last": one or more letters
$ - end of string.
See the C# demo:
var strings = new List<string> { "JohnnameSnow", "John" };
foreach (var s in strings)
{
    Console.WriteLine(s);
    var m = Regex.Match(s, @"^(?<first>\p{L}+?)(?:name(?<last>\p{L}+))?$");
    if (m.Success)
    {
        Console.WriteLine("First name: {0}, Last name = {1}", m.Groups["first"].Value, m.Groups["last"].Value);
    }
    else
    {
        Console.WriteLine("No match!");
    }
}
Output:
JohnnameSnow
First name: John, Last name = Snow
John
First name: John, Last name =

Using regular expression to match words in the same column

I wonder if this is possible using a regular expression in C#:
I'd like to match the words "FOO" and "BAR" in a multi-line text, but only if those two words start in the same column on consecutive lines.
In other words, this should match, because both words start at the same column:
dha skj dh FOO dd fsdf sdf \n
xdsjk fh f BAR 98kf hkjdsf \n
This should also match, even though there's also a "BAR" at the wrong place:
dha sk jdh FOO dd fsd fs df \n
xd BAR fhf BAR 98 kfhk jdsf \n
This should not match, because the words start on different columns:
dhas kjdh FOO dd fsdfsd ddef \n
xdB2e ARfhf BAR 98kfh kj dsf \n
EDIT
I managed to get matches in case of equal prefixes for both words using a back reference like this:
var pattern = @"(?m)^(.*?)(FOO).*$\n^\1(BAR)" ;
var result = Regex.Match( "xxxFOOyyyy\nxxxBARzzz", pattern ) ;
But what I really want is to back-reference to the length of the first capturing group.
You may use
(?m)^(?<o>.)*?(FOO).*\n(?<-o>.)*?(BAR)(?(o)(?!))
See the regex demo
Details
(?m) - the inline version of the RegexOptions.Multiline modifier that makes ^ match the start of a line
^ - start of a line
(?<o>.)*? - any char but a newline (LF) that is pushed into Group o stack (incrementing it) upon each find
(FOO) - Group 1 that matches FOO
.* - the rest of the line
\n - a newline
(?<-o>.)*? - any char but a newline (LF); each one matched pops a value off the Group o stack (decrementing it)
(BAR) - Group 2: captures BAR substring
(?(o)(?!)) - a conditional construct that fails the match if Group o is not empty (that is, if the number of chars on the first line before FOO is different from the number of chars on the second line before BAR).
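To see the balancing group at work, here is a minimal sketch that applies the pattern to the three samples from the question, copied as they appear above. The class and method names are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColumnMatchSketch
{
    // Pattern from the answer above: Group "o" counts the characters before FOO,
    // and the same number of characters must then be consumed before BAR.
    private const string Pattern = @"(?m)^(?<o>.)*?(FOO).*\n(?<-o>.)*?(BAR)(?(o)(?!))";

    // Hypothetical helper name, used only for this illustration.
    public static bool StartInSameColumn(string text)
    {
        return Regex.IsMatch(text, Pattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "dha skj dh FOO dd fsdf sdf \nxdsjk fh f BAR 98kf hkjdsf \n",   // same column
            "dha sk jdh FOO dd fsd fs df \nxd BAR fhf BAR 98 kfhk jdsf \n", // second BAR lines up
            "dhas kjdh FOO dd fsdfsd ddef \nxdB2e ARfhf BAR 98kfh kj dsf \n" // different columns
        };
        foreach (string sample in samples)
        {
            Console.WriteLine(StartInSameColumn(sample));
        }
        // Prints True, True, False for the samples as written here.
    }
}
```

The pattern expects lines separated by a bare \n. Input with \r\n line endings would probably need the line break part adjusted, for example to \r?\n, since . also matches \r.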

C# Replace and Remove text

I am having a little problem working out how to replace and remove text from a label.
label1.Text = "Users online: 1 browsing: 1 pages"
I am using gethtmldocument to fill label1.Text with text like the above. My problem is that I want the label to show only Users online: (number).
At the moment I am using label1.Text.Remove(17), so I get Users online: 1, but when the number of users reaches 10, the text shows 1 again instead of 10.
I also tried label1.Text.Replace("browsing: 1 pages", ""), but while users are online the browsing: 1 pages part changes to browsing: 2 pages or some other value.
So my question is: how can I get only the Users online: ??? part of the text?
Thank you.
Try using regular expressions: match the groups and represent them in the desired way:
using System.Text.RegularExpressions;
...
string source = "Users online: 479 browsing: 153 pages";
// match.Groups["text"] - "Users online: "
// match.Groups["number"] - "479"
var match = Regex.Match(source, "^(?<text>.*?)(?<number>[0-9]+)");
// Users online: (479)
label1.Text = $"{match.Groups["text"].Value.Trim()} ({match.Groups["number"].Value})";
Edit: Regular expression's pattern ^(?<text>.*?)(?<number>[0-9]+) explanation:
^ - anchor: string's beginning
(?<text> ...) - group named "text" which contains
.*? - any characters, as few as possible
(?<number> ...) - group named "number" which contains
[0-9]+ - digits (char in [0..9] range); "+" - at least one
You could try to use Substring. Something like this:
var x = label1.Text; // get the text
// Keep everything before the first "b" (the start of "browsing").
var textToDisplay = x.Substring(0, x.IndexOf("b"));
label1.Text = textToDisplay;

Regex C# Matching string from two words in exact order and returning capture of non-matched words

C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = @"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
    Console.WriteLine("First word: {0}", match.Groups[1].Value);
    Console.WriteLine("Last words: {0}", match.Groups[2].Value);
    Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
See the regex demo
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
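The branching on Group 1 described above could be sketched as follows. VisitTypeSketch and Split are hypothetical names, and returning an empty array when nothing matches is an assumption made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisitTypeSketch
{
    // Pattern from the answer above.
    private static readonly Regex VisitPattern = new Regex(
        @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");

    // Hypothetical helper: returns the parts to keep for one input string.
    public static string[] Split(string s)
    {
        Match match = VisitPattern.Match(s);
        if (!match.Success)
        {
            // Assumption: signal "no match" with an empty array.
            return new string[0];
        }
        if (!match.Groups[1].Success)
        {
            // First alternative matched: no New/Established, so keep the whole string.
            return new[] { match.Value };
        }
        return new[] { match.Groups[1].Value, match.Groups[2].Value };
    }

    public static void Main()
    {
        string[] inputs =
        {
            "New patient, limited",       // New | limited
            "Established patient, brief", // Established | brief
            "School Physical",            // School Physical
            "Deposition, 1 hour"          // Deposition, 1 hour
        };
        foreach (string input in inputs)
        {
            Console.WriteLine(string.Join(" | ", Split(input)));
        }
    }
}
```

Note that for "New diet patient" this pattern puts "diet patient" in Group 2, because the optional patient part is only skipped when it directly follows the first word.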

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
    assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
    lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
    System.Console.Write("{0,24}", line);
    MatchCollection mc = sPattern.Matches(line);
    if ( sPattern.IsMatch(line))
    {
        System.Console.WriteLine(" (match for '{0}' found)", sPattern);
    }
    else
    {
        System.Console.WriteLine();
    }
    System.Console.WriteLine(mc[0].Groups[0].Captures);
    System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern ends with ".*", which matches any number of characters, i.e. the rest of the line, so the overall match is the whole line. Remove the trailing ".*" and the match is limited to the number (plus any single leading character picked up by ".?"); the number itself is in capture group 1. You may find an online RegEx tester such as this useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the leading one- or two-digit number of the line with this regex (it ignores 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
    Console.WriteLine("No match");
}
The following expression matches the first digit, the one you want to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline when you are making the match.
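A minimal sketch of the named-group approach with RegexOptions.Multiline, run over the whole text at once instead of line by line. It uses \d{1,2} rather than \d{1}, an assumed variation so that two-digit line numbers are captured whole. LineNumberSketch and LeadingNumbers are illustrative names, and the sample text is the three lines from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineNumberSketch
{
    // With Multiline, ^ matches at the start of every line, not only at the
    // start of the input. \d{1,2} is an assumed variation of \d{1}.
    private static readonly Regex LineNumberPattern =
        new Regex(@"^\s*(?<First>\d{1,2})", RegexOptions.Multiline);

    // Hypothetical helper: collects the leading number of each line.
    public static List<string> LeadingNumbers(string text)
    {
        var numbers = new List<string>();
        foreach (Match match in LineNumberPattern.Matches(text))
        {
            // Read the named group instead of printing the Captures collection.
            numbers.Add(match.Groups["First"].Value);
        }
        return numbers;
    }

    public static void Main()
    {
        string text =
            "1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21\n" +
            "2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09\n" +
            "3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48";

        Console.WriteLine(string.Join(", ", LeadingNumbers(text))); // 1, 2, 3
    }
}
```

When each line is matched on its own, as in the question's loop, the Multiline option makes no difference; it only matters when one string holds several lines.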
