Regex match multiple Group.Captures with multiline


I want to get all captures from a group that matches 5 lines, with an empty line between each capture.
I tried it the way shown below, but I receive only the first capture. When I delete the first capture from the test string, I receive the next capture, and so on, so my regex seems to match correctly.
What am I missing?
static void Main(string[] args)
{
    var strBackups = @"wbadmin 1.0 - Backup command-line tool
(C) Copyright 2013 Microsoft Corporation. All rights reserved.

Backup time: 01.09.2015 11:51
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/01/2015-06:51
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {060e3b44-7b80-49bf-97c4-3f3b9908dec6}

Backup time: 06.09.2015 10:36
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/06/2015-05:36
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {64af3693-362d-42dc-ae5f-566b3f2d40be}

Backup time: 06.09.2015 11:00
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/06/2015-06:00
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {d9d50a01-6907-40a1-9c57-1f45de76b9ec}
";
    var regBackups = new System.Text.RegularExpressions.Regex(".+\r\n.+\r\n\r\n(.+\r\n.+\r\n.+\r\n.+\r\n.+\r\n)+",
        System.Text.RegularExpressions.RegexOptions.Compiled | System.Text.RegularExpressions.RegexOptions.Multiline
    );
    var match = regBackups.Match(strBackups);
    if (match.Success)
    {
        for (var i = 1; i < match.Groups.Count; i++)
        {
            foreach (var c in match.Groups[i].Captures)
            {
                Console.WriteLine("=============================");
                Console.WriteLine(c);
                Console.WriteLine("=============================");
            }
        }
    }
    else
        Console.WriteLine("<not matched>");
}
Sorry for the formatting; the multiline string breaks the indentation.
Without the broken formatting, the code looks like this: (screenshot not preserved)

Split
If you don't need to validate there are 5 consecutive lines, you could simply split by empty lines:
var regBackups = new System.Text.RegularExpressions.Regex("(?:\r\n){2}");
var result = regBackups.Split(strBackups);
foreach (var c in result)
{
    Console.WriteLine("=============================");
    Console.WriteLine(c);
    Console.WriteLine("=============================");
}
This is by far the preferred option.
Match
If you must validate that the text blocks have 5 consecutive lines, you can use the following approach:
var regBackups = new Regex(@"\r\n((?>\r\n.+){5})(?!\r\n.)",
    RegexOptions.Compiled
);
foreach (Match m in regBackups.Matches(strBackups))
{
    Console.WriteLine("=============================");
    Console.WriteLine(m.Groups[1].Value);
    Console.WriteLine("=============================");
}
The expression \r\n((?>\r\n.+){5})(?!\r\n.) matches:
\r\n a CR+LF, followed by
\r\n.+ a CR+LF and a line with at least 1 character.
(?> ... ){5} repeats the previous construct 5 times. It is an atomic group (more efficient in this case) with the quantifier at the end.
(?!\r\n.) not followed by a CR+LF and a character (i.e., not followed by another line).
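For the Group.Captures approach the question started from, one possible sketch is to make the blank separator line part of the repetition, so that a single Match can collect every block. It assumes CRLF line endings, exactly two header lines, and one blank line before each 5-line block. The names BackupBlocks and Extract are made up for illustration. It uses [^\r\n]+ instead of .+ because in .NET the dot matches every character except \n, including \r.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BackupBlocks
{
    // Two header lines, then one or more repetitions of:
    // a blank line followed by a captured group of exactly 5 non-empty lines.
    private static readonly Regex Pattern = new Regex(
        @"\A[^\r\n]+\r\n[^\r\n]+\r\n(?:\r\n((?:[^\r\n]+(?:\r\n|\z)){5}))+");

    // Hypothetical helper: returns each 5-line block as one string.
    public static List<string> Extract(string text)
    {
        var blocks = new List<string>();
        Match match = Pattern.Match(text);
        if (match.Success)
        {
            // Group 1 is inside a quantified group, so Captures holds one
            // entry per repetition, not just the last one.
            foreach (Capture capture in match.Groups[1].Captures)
            {
                blocks.Add(capture.Value.TrimEnd('\r', '\n'));
            }
        }
        return blocks;
    }
}
```

Calling BackupBlocks.Extract(strBackups) should then yield one entry per backup, provided the string really contains CRLF line endings (a verbatim literal takes its line endings from the source file).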

Related

Is it possible to have overlapping regex matches?

Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.

I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
// First() and Last() require "using System.Linq;"
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for (int i = 1; i < fields.Length - 1; ++i)
{
    string result = string.Join("|", fields.First(), fields[i], fields.Last());
    Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021

A single regular expression to return multiple matches on multiple calls? 
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want. 
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”.
Think of it as the start of a while loop.
h saves the pattern space into the hold space. 
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result. 
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space. 
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest. 
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command. 
It says to go back to the beginning of the loop
if the most recent substitution succeeded. 
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it. 
And then change “A|C|D|E” to “A|D|E” and iterate. 
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it. 
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.) 
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.
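The same loop can be sketched in C#, assuming the input is a single line of |-separated fields. The class and method names (PipeFields, Expand) are hypothetical, and this is only one way to port the two sed substitutions, not the answerer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PipeFields
{
    // The same two substitutions as the sed script. Outside a character class
    // '|' must be escaped, because it is the alternation operator in .NET.
    private static readonly Regex KeepSecond = new Regex(@"^([^|]*\|[^|]*).*(\|.*)$");
    private static readonly Regex DropSecond = new Regex(@"^([^|]*)\|[^|]*(\|.*)$");

    public static List<string> Expand(string input)
    {
        var results = new List<string>();
        string current = input;
        while (true)
        {
            // First substitution (s/.../p): keep the first two segments and the last one.
            Match keep = KeepSecond.Match(current);
            if (keep.Success)
            {
                results.Add(keep.Groups[1].Value + keep.Groups[2].Value);
            }

            // Second substitution: drop the second segment.
            // When it no longer applies, the loop ends (the "t loop" condition).
            Match drop = DropSecond.Match(current);
            if (!drop.Success)
            {
                break;
            }
            current = drop.Groups[1].Value + drop.Groups[2].Value;
        }
        return results;
    }
}
```

Keeping the working copy in a local variable plays the role of sed's hold space: each iteration reads from current and only the second substitution updates it.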

You can make use of the .NET Group.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1: matches ID: followed by 1+ word chars and |
(\w+: \d+\|)+ Capture group 2: repeated 1+ times, each time matching 1+ word chars, a colon and space, 1+ digits and |
(Expiry: \w+,\d+) Capture group 3: matches Expiry: followed by 1+ word chars, a comma and 1+ digits
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = @"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach (Capture c in m.Groups[2].Captures)
{
    Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021

It should be possible with look behind and look ahead:
string foo = @"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First match "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind with "(?<= )" to find the ID, and look ahead with "(?= )" to find the Expiry.
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (Match match in Regex.Matches(foo, pattern))
{
    ....
}
Fortunately, C# is one of the few languages that can handle variable-length lookbehinds.
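As an illustration only, here is a minimal sketch of how the three groups of that pattern could be combined into result strings. The class and method names are hypothetical, and the input is assumed to use " | " as separator, as in the sample string above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    public static List<string> Expand(string input)
    {
        // Group 1: the instrument field matched directly.
        // Group 2: the ID captured inside the lookbehind.
        // Group 3: the expiry captured inside the lookahead.
        const string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(string.Join(" | ",
                match.Groups[2].Value, match.Groups[1].Value, match.Groups[3].Value));
        }
        return results;
    }
}
```

Because the ID and the expiry are captured inside lookarounds, they are not consumed, which is what lets every instrument field produce its own match.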

Why does my reg ex not capture 2nd and subsequent lines?

Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's reg ex a little bit solved all problems. See Poul Bak's answer below
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: '2018-10-16 12:00:03 PM - Error 3'
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp_RegEx
{
    class Program
    {
        static void Main(string[] args)
        {
            string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
            string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
            RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
            Regex rx = new Regex(pattern, regexOptions);
            // Find matches.
            MatchCollection matches = rx.Matches(text);
            Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
            // Report the number of matches found.
            Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
            int m = 1;
            // Report on each match.
            foreach (Match match in matches)
            {
                Console.WriteLine("Match #{0}: {1}", m++, match.Value);
                int g = 1;
                GroupCollection groups = match.Groups;
                foreach (Group group in groups)
                {
                    Console.WriteLine(" Group #{0} {1}", g++, group.Value);
                }
                Console.WriteLine();
            }
            Console.Read();
        }
    }
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"

You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand what you mean when you say 'AM' and 'PM' are not matched; they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
    Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex looks like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag, then you only get named groups.
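A minimal sketch of the final regex combined with RegexOptions.ExplicitCapture is shown below. The LogParser class and its Parse method are hypothetical names, and returning date/text pairs is just one way to hold the result.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogParser
{
    // The final pattern from the answer. With ExplicitCapture only the named
    // groups (Date, Delim, Text) are captured.
    private static readonly Regex EntryPattern = new Regex(
        @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)",
        RegexOptions.ExplicitCapture);

    // Returns one (date, text) pair per log entry.
    public static List<KeyValuePair<string, string>> Parse(string log)
    {
        var entries = new List<KeyValuePair<string, string>>();
        foreach (Match match in EntryPattern.Matches(log))
        {
            // The lookahead stops one character before the next date, so the
            // text can end with a stray '\r'; Trim removes it.
            entries.Add(new KeyValuePair<string, string>(
                match.Groups["Date"].Value,
                match.Groups["Text"].Value.Trim()));
        }
        return entries;
    }
}
```

One limitation of the (?!\d{4}) lookahead is that any run of four digits inside an error message also ends the Text group early, so this sketch only suits logs whose messages contain no such numbers.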

Regex to do not match certain sequence

I have a text file as below:
1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..
So I want a regex that gets 4 matches in this case, one for each point. My regex doesn't work as I wish. Please advise:
private readonly Regex _reactionRegex = new Regex(@"(\d+)\.(\d+)\s*-\s*(.+)", RegexOptions.Compiled | RegexOptions.Singleline);
even this regex isn't very helpful:
(\d+)\.(\d+)\s*-\s*(.+)(?<!\d+\.\d+)

Alex, this regex will do it:
(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)
This is assuming that you want to capture the point, without the numbers, for instance: just Hello
If you want to also capture the digits, for instance 1.1 - Hello, you can use the same regex and display the entire match, not just Group 1. The online demo below will show you both.
How does it work?
The idea is to capture the text you want to Group 1 using (parentheses).
We match in multi-line mode m to allow the anchor ^ to work on each line.
We match in dotall mode s to allow the dot to eat up strings on multiple lines
We use a negative lookahead (?! to stop eating characters when what follows is the beginning of the line with your digit marker
Here is full working code and an online demo.
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;

class Program {
    static void Main() {
        string yourstring = @"1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..";
        var resultList = new StringCollection();
        try {
            var yourRegex = new Regex(@"(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)");
            Match matchResult = yourRegex.Match(yourstring);
            while (matchResult.Success) {
                resultList.Add(matchResult.Groups[1].Value);
                Console.WriteLine("Whole Match: " + matchResult.Value);
                Console.WriteLine("Group 1: " + matchResult.Groups[1].Value + "\n");
                matchResult = matchResult.NextMatch();
            }
        } catch (ArgumentException ex) {
            // Syntax error in the regular expression
        }
        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program

This may do for what you're looking for, though there is some ambiguity of the expected result.
(\d+)\.(\d+)\s*-\s*(.+?)(\n)(?>\d|$)
The ambiguity is, for example, what you would expect to match if the data looked like this:
1.1 - Hello
1.2 - world!
2.1 - Some
data here and it contains some
32 digits so i cannot use \D+
2.2 - Etc..
Not clear if 32 here starts a new record or not.

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
    assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
    lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
    System.Console.Write("{0,24}", line);
    MatchCollection mc = sPattern.Matches(line);
    if (sPattern.IsMatch(line))
    {
        System.Console.WriteLine(" (match for '{0}' found)", sPattern);
    }
    else
    {
        System.Console.WriteLine();
    }
    System.Console.WriteLine(mc[0].Groups[0].Captures);
    System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).

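Your regex pattern ends with .*, which matches any number of characters, i.e. the rest of the line. Remove the "*" and the match will no longer cover the whole line. Note that the digits you want are in the capture group, so read them from mc[0].Groups[1].Value; Groups[0] is always the entire match. You may find an online regex tester useful.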

Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
    Console.WriteLine("No match");
}

The following expression matches the first digit, the one you wanted to capture, into the named group "First".
^\s*(?<First>\d{1})
I find a regular expression testing tool highly useful when dealing with regex. Give one a try.
Also set RegexOptions.Multiline when you make the match, so that ^ matches at the start of every line.
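To store the parsed values and access them later, named groups can be copied into a list, one row per line. The sketch below is only an illustration: the assumed layout (line number, an optional two-letter prefix with a colon as in "3UA:US6352", a two-letter airline code, then the flight number) is inferred from the three sample lines, and the FlightLines class is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightLines
{
    // Assumed layout: line number, optional "XX:" prefix, 2-letter airline code, flight number.
    private static readonly Regex LinePattern = new Regex(
        @"^\s*(?<Line>\d{1,2})\s*(?:[A-Z]{2}:)?(?<Airline>[A-Z]{2})(?<Flight>\d+)");

    // Returns one row per parsed line: { line number, airline, flight number }.
    public static List<string[]> Parse(IEnumerable<string> lines)
    {
        var rows = new List<string[]>();
        foreach (string line in lines)
        {
            Match match = LinePattern.Match(line);
            if (!match.Success)
            {
                continue; // skip lines that do not fit the assumed layout
            }
            rows.Add(new[]
            {
                match.Groups["Line"].Value,
                match.Groups["Airline"].Value,
                match.Groups["Flight"].Value
            });
        }
        return rows;
    }
}
```

Checking match.Success before reading the groups avoids the exception the question's code would hit on mc[0] when a line does not match.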

How to select line number in TextBox Multiline

I have a large text in a System.Windows.Forms.TextBox control on my form (WinForms, VS 2008).
I want to find a text and select the line number where I found it.
For example,
I have a big block of text, I search for "ERROR en línea", and I want to select that line number in the multiline TextBox.
string textoLogDeFuenteSQL = @"SQL*Plus: Release 10.1.0.4.2 - Production on Mar Jun 1 14:35:43 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
******** MORE TEXT ************
Conectado a:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Data Mining and Real Application Testing options
WHERE LAVECODIGO = 'CO_PREANUL'
ERROR en línea 2:
ORA-00904: ""LAVECODIGO"": identificador no v?lido
INSERT INTO COM_CODIGOS
ERROR en línea 1:
ORA-00001: restricción única (XACO.INX_COM_CODIGOS_PK) violada";
******** MORE TEXT ************
Any sample code about it ?

You might want to look at TextBoxBase.GetLineFromCharIndex method. This method retrieves the line number of character position within the textbox.
string str = textBox2.Text;
int index = textBox1.Text.IndexOf(str);
if (index != -1)
{
    int lineNo = textBox1.GetLineFromCharIndex(index);
}
"This method enables you to determine the line number based on the character index specified in the index parameter of the method. The first line of text in the control returns the value zero. The GetLineFromCharIndex method returns the physical line number where the indexed character is located within the control."

EDIT: This only finds the occurrences of the searched text. To compute the line numbers, use Fredrik's answer.
using System.Text.RegularExpressions;

public static void FindErrorInText(string input)
{
    // Verbatim string so that \d reaches the regex engine; "línea" carries the accent, as in the log text.
    Regex rgx = new Regex(@"ERROR en línea \d*", RegexOptions.IgnoreCase);
    MatchCollection matches = rgx.Matches(input);
    if (matches.Count > 0)
    {
        Console.WriteLine("{0} ({1} matches):", input, matches.Count);
        foreach (Match match in matches)
            Console.WriteLine(" " + match.Value);
    }
}
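The two answers can be combined without a TextBox by counting the line feeds before each match. This is a sketch under the assumption that only logical lines matter; the ErrorLines class is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ErrorLines
{
    // Returns the zero-based line number of every "ERROR en línea N" occurrence.
    public static List<int> Find(string input)
    {
        var regex = new Regex(@"ERROR en línea \d+", RegexOptions.IgnoreCase);
        var lineNumbers = new List<int>();
        foreach (Match match in regex.Matches(input))
        {
            int line = 0;
            // Count the line feeds that precede the start of the match.
            for (int i = 0; i < match.Index; i++)
            {
                if (input[i] == '\n')
                {
                    line++;
                }
            }
            lineNumbers.Add(line);
        }
        return lineNumbers;
    }
}
```

Unlike GetLineFromCharIndex, which reports the physical line in the control, this counts only line breaks in the string, so the two can differ when word wrapping is on. Inside a form, passing match.Index to GetLineFromCharIndex gives the control's own numbering.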
