I have a large block of text in a System.Windows.Forms.TextBox control on my form (WinForms, VS 2008).
I want to find a piece of text and select the line number where it was found.
For example,
I have a big block of text, I search for "ERROR en línea", and I want to select the matching line number in the multiline TextBox.
string textoLogDeFuenteSQL = @"SQL*Plus: Release 10.1.0.4.2 - Production on Mar Jun 1 14:35:43 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
******** MORE TEXT ************
Conectado a:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Data Mining and Real Application Testing options
WHERE LAVECODIGO = 'CO_PREANUL'
ERROR en línea 2:
ORA-00904: ""LAVECODIGO"": identificador no válido
INSERT INTO COM_CODIGOS
ERROR en línea 1:
ORA-00001: restricción única (XACO.INX_COM_CODIGOS_PK) violada";
******** MORE TEXT ************
Is there any sample code for this?
You might want to look at the TextBoxBase.GetLineFromCharIndex method. This method retrieves the line number of a character position within the textbox.
string str = textBox2.Text;
int index = textBox1.Text.IndexOf(str);
if (index !=-1)
{
int lineNo = textBox1.GetLineFromCharIndex(index);
}
"This method enables you to determine the line number based on the character index specified in the index parameter of the method. The first line of text in the control returns the value zero. The GetLineFromCharIndex method returns the physical line number where the indexed character is located within the control."
EDIT: This only finds the occurrences of the searched text. To compute the line numbers use Fredrik's answer.
using System.Text.RegularExpressions;
public static void FindErrorInText(string input)
{
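// Verbatim string so that \d is not treated as a C# escape; [ií] also accepts the accented "línea" from the log.
Regex rgx = new Regex(@"ERROR en l[ií]nea \d*", RegexOptions.IgnoreCase);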
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Value);
}
}
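To tie the two answers together, here is a minimal sketch that reports a line number for each regex match. It assumes plain string input and counts '\n' characters before the match, so it yields logical lines; inside a TextBox you would pass match.Index to GetLineFromCharIndex instead, which returns the physical line and therefore depends on word wrap. The names ErrorLineFinder and FindErrorLines are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ErrorLineFinder
{
    // Returns the zero-based logical line number of every match.
    public static List<int> FindErrorLines(string input)
    {
        var rgx = new Regex(@"ERROR en l[ií]nea \d+", RegexOptions.IgnoreCase);
        var lines = new List<int>();
        foreach (Match match in rgx.Matches(input))
        {
            // Count the line feeds that precede the match start.
            int line = 0;
            for (int i = 0; i < match.Index; i++)
            {
                if (input[i] == '\n') line++;
            }
            lines.Add(line);
        }
        return lines;
    }

    public static void Main()
    {
        string log = "WHERE LAVECODIGO = 'CO_PREANUL'\r\n" +
                     "ERROR en línea 2:\r\n" +
                     "ORA-00904\r\n" +
                     "INSERT INTO COM_CODIGOS\r\n" +
                     "ERROR en línea 1:\r\n";
        foreach (int line in FindErrorLines(log))
        {
            Console.WriteLine("Match on line {0}", line);
        }
    }
}
```

In a form, the equivalent call would be textBox1.GetLineFromCharIndex(match.Index), with the match run against textBox1.Text.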
Update
I tried adding RegexOptions.Singleline to my regex options. It worked in that it captured the lines that weren't previously captured, but it put the entire text file into the first match instead of creating one match per date as desired.
End of Update
Update #2
Added new output showing matches and groups when using Poul Bak's modification. See screen shot below titled Output from Poul Bak's modification
End of Update #2
Final Update
Updating the target framework from 4.6.1 to 4.7.1 and tweaking Poul Bak's reg ex a little bit solved all problems. See Poul Bak's answer below
End of Final Update
Original Question: Background
I have the following text file test_text.txt:
2018-10-16 12:00:01 - Error 1<CR><LF>
Error 1 text line 1<CR><LF>
Error 1 text line 2<CR><LF>
2018-10-16 12:00:02 AM - Error 2<CR><LF>
Error 2 text line 1<CR><LF>
Error 2 text line 2<CR><LF>
Error 2 text line 3<CR><LF>
Error 2 text line 4<CR><LF>
2018-10-16 12:00:03 PM - Error 3
Objective
My objective is to have each match be comprised of 3 named groups: Date, Delim, and Text as shown below.
Note: apostrophes used only to denote limits of matched text.
Matches I expect to see:
Match 1: '2018-10-16 12:00:01 - Error 1<CR><LF>'
Date group = '2018-10-16 12:00:01'
Delim group = ' - '
Text group = 'Error 1<CR><LF>Error 1 text line 1<CR><LF>Error 1 text line 2<CR><LF>'
Match 2: '2018-10-16 12:00:02 AM - Error 2<CR><LF>'
Date group = '2018-10-16 12:00:02 AM'
Delim group = ' - '
Text group = 'Error 2 text line 1<CR><LF>Error 2 text line 2<CR><LF>Error 2 text line 3<CR><LF>Error 2 text line 4<CR><LF>'
Match 3: '2018-10-16 12:00:03 PM - Error 3'
Date group = '2018-10-16 12:00:03 PM'
Delim group = ' - '
Text group = 'Error 3'
The problem
My regex is not working in that 2nd and subsequent lines of text (e.g., 'Error 1 text line 1', 'Error 2 text line 1') are not being captured. I expect them to be captured because I'm using the Multiline option.
How do I modify my regex to capture 2nd and subsequent lines of text?
Current code
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp_RegEx
{
class Program
{
static void Main(string[] args)
{
string text = System.IO.File.ReadAllText(@"C:\Users\bill\Desktop\test_text.txt");
string pattern = @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}.*)(?<Delim>\s-\s)(?<Text>.*\n|.*)";
RegexOptions regexOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
Regex rx = new Regex(pattern, regexOptions);
MatchCollection ms = rx.Matches(text);
// Find matches.
MatchCollection matches = rx.Matches(text);
Console.WriteLine("Input Text\n--------------------\n{0}\n--------------------\n", text);
// Report the number of matches found.
Console.WriteLine("Output ({0} matches found)\n--------------------\n", matches.Count);
int m = 1;
// Report on each match.
foreach (Match match in matches)
{
Console.WriteLine("Match #{0}: ", m++, match.Value);
int g = 1;
GroupCollection groups = match.Groups;
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", g++, group.Value);
}
Console.WriteLine();
}
Console.Read();
}
}
}
Current Output
Output from Poul Bak's modification (on the right track, but not quite there yet)
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
You can use the following regex, modified from yours:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>([\s\S](?!\d{4}))*)"
I have changed the 'Date' Group so it accepts 'AM' or 'PM' (otherwise it will only match the first).
Then I have changed the 'Text' Group, so it matches any number of any char (including Newlines) until it looks forward and finds a new date.
Edit:
I don't understand it, when you say 'AM' and 'PM' are not matched, they are part of the 'Date' Group. I assume you want them to be part of the 'Delim' Group, so I have moved the check to that Group.
I have also changed a Group to a non capturing Group.
The new regex:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2})(?<Delim>(?:\s\w\w)?\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
BTW: You should change your code for checking Groups, like this:
foreach (Group group in groups)
{
Console.WriteLine(" Group #{0} {1}", group.Name, group.Value);
}
Then you will see your named Groups by Name and Value. When you have named Groups, there's no need for accessing by index.
Edit 2:
About 'group.Name': I had mistakenly used 'Group' (capitalized), it should be: 'group.Name'.
This is what the regex looks like now:
@"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)(?<Delim>\s-\s)(?<Text>(?:[\s\S](?!\d{4}))*)"
I suggest you set the 'RegexOptions.ExplicitCapture' flag, then you only get named groups.
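As a rough illustration of the last regex combined with RegexOptions.ExplicitCapture, the sketch below runs it over a shortened version of the sample log and reads the groups by name. The class and method names (LogEntryDemo, FindEntries) and the shortened input are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogEntryDemo
{
    const string Pattern =
        @"(?<Date>\d{4}-\d{2}-\d{2}\s{1}\d{2}:\d{2}:\d{2}(?:\s\w\w)?)" +
        @"(?<Delim>\s-\s)" +
        @"(?<Text>(?:[\s\S](?!\d{4}))*)";

    // ExplicitCapture: only the named groups (plus group 0) are captured.
    public static MatchCollection FindEntries(string text)
    {
        return Regex.Matches(text, Pattern, RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        string text =
            "2018-10-16 12:00:01 - Error 1\r\nError 1 text line 1\r\n" +
            "2018-10-16 12:00:02 AM - Error 2\r\nError 2 text line 1\r\n" +
            "2018-10-16 12:00:03 PM - Error 3";

        foreach (Match m in FindEntries(text))
        {
            // Make line breaks visible instead of printing them raw.
            string shown = m.Groups["Text"].Value
                .Replace("\r", "<CR>").Replace("\n", "<LF>");
            Console.WriteLine("Date:  '{0}'", m.Groups["Date"].Value);
            Console.WriteLine("Delim: '{0}'", m.Groups["Delim"].Value);
            Console.WriteLine("Text:  '{0}'", shown);
        }
    }
}
```

Two things to watch for with this pattern: the Text group stops one character before any run of four digits, so the line feed just before the next date is left out, and a four-digit number inside the message text would also cut the group short.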
I want to get all captures from a group that matches blocks of 5 lines, with an empty line between blocks.
I tried it the way shown below, but I only receive the first capture. When I delete the first block from the test string, I receive the next capture, and so on, so my regex seems to match correctly.
What am I missing?
static void Main(string[] args)
{
var strBackups = @"wbadmin 1.0 - Backup command-line tool
(C) Copyright 2013 Microsoft Corporation. All rights reserved.
Backup time: 01.09.2015 11:51
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/01/2015-06:51
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {060e3b44-7b80-49bf-97c4-3f3b9908dec6}
Backup time: 06.09.2015 10:36
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/06/2015-05:36
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {64af3693-362d-42dc-ae5f-566b3f2d40be}
Backup time: 06.09.2015 11:00
Backup target: 1394/USB Disk labeled BIGGER2(F:)
Version identifier: 09/06/2015-06:00
Can recover: Volume(s), File(s), Application(s), Bare Metal Recovery, System State
Snapshot ID: {d9d50a01-6907-40a1-9c57-1f45de76b9ec}
";
var regBackups = new System.Text.RegularExpressions.Regex(".+\r\n.+\r\n\r\n(.+\r\n.+\r\n.+\r\n.+\r\n.+\r\n)+",
System.Text.RegularExpressions.RegexOptions.Compiled | System.Text.RegularExpressions.RegexOptions.Multiline
);
var match = regBackups.Match(strBackups);
if (match.Success)
{
for (var i = 1; i < match.Groups.Count; i++)
{
foreach (var c in match.Groups[i].Captures)
{
Console.WriteLine("=============================");
Console.WriteLine(c);
Console.WriteLine("=============================");
}
}
}
else
Console.WriteLine("<not matched>");
}
Sorry for the broken formatting caused by the multiline strings.
Split
If you don't need to validate there are 5 consecutive lines, you could simply split by empty lines:
var regBackups = new System.Text.RegularExpressions.Regex("(?:\r\n){2}");
var result = regBackups.Split(strBackups);
foreach (var c in result)
{
Console.WriteLine("=============================");
Console.WriteLine(c);
Console.WriteLine("=============================");
}
This is by far the preferred option.
Match
If you must validate that the text blocks have 5 consecutive lines, you can use the following approach:
var regBackups = new Regex(@"\r\n((?>\r\n.+){5})(?!\r\n.)",
RegexOptions.Compiled
);
foreach (Match m in regBackups.Matches(strBackups))
{
Console.WriteLine("=============================");
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine("=============================");
}
The expression \r\n((?>\r\n.+){5})(?!\r\n.) matches:
\r\n a cr+lf, followed by
\r\n.+ a cr+lf and a line with at least 1 character.
The (?> ... ){5} is to repeat the previous 5 times. It's an atomic group (more efficient in this case) with the quantifier at the end.
(?!\r\n.) not followed by a cr+lf and a character (ie. not followed by another line).
Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
input,
@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
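The whitespace-stripping variant mentioned above could look roughly like the following. This is a sketch, not the answer's exact code: it swaps the lookbehind for a simpler pattern anchored on a word boundary (an assumption on my part), and the names AccountMasker and Mask are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccountMasker
{
    // A word containing a digit, followed by any number of
    // whitespace-separated words that also contain a digit.
    static readonly Regex DigitWords =
        new Regex(@"\b\w*\d\w*(?:\s+\w*\d\w*)*");

    public static string Mask(string input)
    {
        return DigitWords.Replace(input, m =>
        {
            // Drop inner whitespace so the kept tail is always 4 real characters.
            string compact = Regex.Replace(m.Value, @"\s+", "");
            return "xxxx" + compact.Substring(Math.Max(0, compact.Length - 4));
        });
    }

    public static void Main()
    {
        string[] samples =
        {
            "123456789", "a123b456", "a1234b5678", "a1234 b5678",
            "111 22 3333", "this is a a1234 b5678 test string"
        };
        foreach (string s in samples)
        {
            Console.WriteLine(Mask(s));
        }
    }
}
```

Because the pattern starts at a word boundary rather than at optional whitespace, the space before the first matched word stays in the output.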
I don't think a regex is the best way to solve this problem, which is why I am posting this answer. For situations this complex, building the right regex is difficult and, worse, the result is much less clear and adaptable than a longer, code-based approach.
The code below delivers the functionality you are after, is clear enough, and can easily be extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
if (word.Any(char.IsDigit)) // requires: using System.Linq;
{
previousNum = true;
tempOutput = tempOutput + word;
}
else
{
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
tempOutput = ""; // reset so a later run of digit words starts from scratch
}
output = output + " " + word;
}
}
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
}
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
match.Value.Substring(Math.Max(0, match.Value.Length - 4)));
So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern ends with ".*", which matches any number of characters, i.e. the rest of the line. Remove the "*" and it will only match the "1". You may find an online regex tester useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}
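Since the question is mostly about storing the parsed pieces and getting at them later, here is a small sketch that extends the same idea with named groups and copies them into an object. The FlightRow class, the group names, and the assumption that a row starts with a line number followed by a two-letter airline code and a numeric flight number are all mine; the third sample row (3UA:US6352) does not fit that shape and is simply reported as not parsed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for the first few fields of a row.
public class FlightRow
{
    public int LineNumber;
    public string Airline;
    public string FlightNumber;
}

public static class FlightParser
{
    static readonly Regex RowPattern = new Regex(
        @"^\s*(?<line>\d{1,2})\s*(?<airline>[A-Z]{2})(?<flight>\d+)");

    // Returns null when the row does not have the assumed shape.
    public static FlightRow Parse(string line)
    {
        Match m = RowPattern.Match(line);
        if (!m.Success) return null;
        return new FlightRow
        {
            LineNumber = int.Parse(m.Groups["line"].Value),
            Airline = m.Groups["airline"].Value,
            FlightNumber = m.Groups["flight"].Value
        };
    }

    public static void Main()
    {
        string[] rows =
        {
            "1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21",
            "2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09",
            "3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48"
        };
        foreach (string row in rows)
        {
            FlightRow f = Parse(row);
            if (f == null)
                Console.WriteLine("not parsed: {0}", row);
            else
                Console.WriteLine("{0}: {1} {2}", f.LineNumber, f.Airline, f.FlightNumber);
        }
    }
}
```

Collecting the FlightRow objects in a List<FlightRow> would then give you something to query after the file has been read.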
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOption to Multiline when you are making the match.
I need help to parse some information from a mass of text, basically I am importing a PSD file and want to parse some data from it.
Amongst the text are strings such as this:
\r\nj78876 RANDOM TEXT STRINGS 75 £
Now what I want to do is grab all strings that fit this format (maybe the starting "\r\n" and ending "£" can be delimiters) and get the code at the start (j78876) and the price at the end (75). Note that the price may have more than 2 digits.
I want to then grab the code such as j78876 and the price for each string like this which is found as they will occur many times (different codes and prices).
Can anyone suggest a way to do this?
I am not very proficient with Regex so guidance would be great.
thanks.
Note: Here is a snippet of the actual text (there is a lot more in the actual file).
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9449A HP V1810-8G
Switch 139,00\r\nJ9450A HP V1810-24G Switch 359,00\r\nEdge Switches - Managed \r\nHP Layer
2 Switches - Managed Stackables and Chassis\r\nHP Switch 2510 Series\r\nRéférence Ancienne
référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9019B HP E2510-24 Switch 359,00\r
\nJ9020A HP E2510-48 Switch 599,00\r\nJ9279A HP E2510-24G Switch 779,00\r\nJ9280A HP
E2510-48G Switch 1 569,00\r\nHP Switch 2520 Series\r\nRéférence Ancienne référence
3Com/H3C Libellé Remarque Prix en €\r\nJ9137A HP E2520-8-PoE Switch 489,00\r\nJ9138A HP
E2520-24-PoE Switch 779,00\r\nJ9298A HP E2520-8G-PoE Switch 749,00\r\nJ9299A HP E2520-
24G-PoE Switch 1 569,00\r\nHP Layer 2 and 3 Switches - Managed Stackables and Chassis\r
\nThe RBP is a recommended price only. \r\nHP Switch 2600 Series\r\nRéférence Ancienne
Update
I found this:
[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]
Worked for me in regex browser testers but will not work in my C# code
Regex reg = new Regex(@"[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]", RegexOptions.IgnoreCase);
Match matched = reg.Match(str);
if (matched.Success)
{
string code = matched.Groups[1].Value;
string currencyAmt = matched.Groups[2].Value;
}
Final Update:
In the browser testers i had to double escape the \r\n - in my code it was not necessary. Then to loop the groups I used the looping answer.
foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
}
Regex reg = new Regex(@"\r\n([a-z]\d+\w)\s.*\s(\d+\,?\d+?)\r\n", RegexOptions.IgnoreCase);
string productCode, productCost;
foreach (Match match in reg.Matches(str))
{
productCode = match.Groups[1].Value;
productCost = match.Groups[2].Value;
//do something with values here
}
Edited because my original answer was wrong.
Based on your sample the above works.
Quick regex explanation of the first argument to new Regex(:
@ : makes the string a verbatim literal, which keeps me from having to add extra escapes everywhere.
\r\n : starts with.
([a-z]\d+\w)\s : matches your product code, I used the \s to frame it as it appears to be a consistent whitespace.
.* : matches your random string of product description.
\s(\d+\,?\d+?) : matches a whitespace followed by your second capture of currency of some sort.
\r\n : ends with.
If you provided a larger sample data set, I could fine tune the regex.
Alright, your question is a moving target. The actual text sample has (in contradiction to your question) no £ in it. Here's an adapted expression:
new Regex(@"\r\n(\w+?).*?\s+(\d+?,\d\d)")
In prose (this is a learning site after all): Match "\r\n" followed by any alphanumerics until you hit whitespace, then anything until you hit whitespace followed by a number with two digits behind the comma. The parts in italics are captured.
As I said, I don't do Obj-C and thus can't test it. See these C# docs (and other answers here) for how to use it.
I would use named groups to identify the groups easier. The ?<code> part of the expression identifies the group.
You will want to use Matches, as you say there will be several occurrences of the pattern in your text. This will loop through them all..
foreach ( Match match in Regex.Matches(text, @"\r\n(?<code>\S+).*?(?<price>\d+)£") )
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
Console.WriteLine(code);
Console.WriteLine(currencyAmt);
}
Final result was this:
foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
}
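One caveat with the price group: in the sample, prices above 999 are written with a separator inside the number ("1 569,00"), so \d+,\d\d captures only "569,00". The sketch below is one possible way around that, assuming the separator is a plain or non-breaking space, that codes look like J9449A, and that each row sits on its own line; PriceListParser and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PriceListParser
{
    // code ... price at the end of a line; \r?$ tolerates CRLF line endings.
    static readonly Regex RowPattern = new Regex(
        @"^(?<code>[A-Z]\d+[A-Z])\s.*?\s(?<price>\d{1,3}(?:[ \u00A0]\d{3})*,\d\d)\r?$",
        RegexOptions.Multiline);

    // Comma as decimal separator, independent of the machine's culture.
    static readonly NumberFormatInfo CommaDecimal =
        new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = " " };

    public static Dictionary<string, decimal> Parse(string content)
    {
        var prices = new Dictionary<string, decimal>();
        foreach (Match m in RowPattern.Matches(content))
        {
            // Strip the thousands separators before parsing.
            string digits = m.Groups["price"].Value
                .Replace(" ", "").Replace("\u00A0", "");
            prices[m.Groups["code"].Value] =
                decimal.Parse(digits, NumberStyles.AllowDecimalPoint, CommaDecimal);
        }
        return prices;
    }

    public static void Main()
    {
        string content =
            "J9449A HP V1810-8G Switch 139,00\r\n" +
            "Edge Switches - Managed\r\n" +
            "J9280A HP E2510-48G Switch 1 569,00\r\n";
        foreach (KeyValuePair<string, decimal> p in Parse(content))
        {
            Console.WriteLine("code: {0}; price: {1}", p.Key, p.Value);
        }
    }
}
```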
That sample data you added raises more questions than it answers. Are we supposed to treat those \r\n sequences as carriage-return+linefeed (CRLF), or as literal text? Also, it looks like space characters have been inserted at random positions--in some cases even between a \r and \n. Oh, and there are no pound symbols (£), only euro symbols (€), and they're never on the same line as a price, as you originally indicated.
If that sample really is representative of your data, you should try to clean it up (or have the people who supplied it to you clean it up) before you start searching it. I did just that so I could test my regex; if I've made any wrong assumptions, please let me know. And here it is:
Regex rgx = new Regex(@"^(\w+).*?(\d+,\d\d)(?:[\r\n]+|\z)", RegexOptions.Multiline);
string s = @"Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9449A HP V1810-8G Switch 139,00
J9450A HP V1810-24G Switch 359,00
Edge Switches - Managed
HP Layer 2 Switches - Managed Stackables and Chassis
HP Switch 2510 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9019B HP E2510-24 Switch 359,00
J9020A HP E2510-48 Switch 599,00
J9279A HP E2510-24G Switch 779,00
J9280A HP E2510-48G Switch 1 569,00
HP Switch 2520 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9137A HP E2520-8-PoE Switch 489,00
J9138A HP E2520-24-PoE Switch 779,00
J9298A HP E2520-8G-PoE Switch 749,00
J9299A HP E2520-24G-PoE Switch 1 569,00
HP Layer 2 and 3 Switches - Managed Stackables and Chassis
The RBP is a recommended price only.
HP Switch 2600 Series
Référence Ancienne";
foreach (Match m in rgx.Matches(s))
{
Console.WriteLine("code: {0}; price: {1}",
m.Groups[1].Value, m.Groups[2].Value);
}
output:
code: J9449A; price: 139,00
code: J9450A; price: 359,00
code: J9019B; price: 359,00
code: J9020A; price: 599,00
code: J9279A; price: 779,00
code: J9280A; price: 569,00
code: J9137A; price: 489,00
code: J9138A; price: 779,00
code: J9298A; price: 749,00
code: J9299A; price: 569,00
The ^ in multiline mode is sufficient to anchor the match at the beginning of a line; you don't have to match the line separator (\r\n) itself. You should be able to use $ at the end the same way, but that won't work because .NET doesn't regard \r as a line separator character. Instead I did it longhand: (?:[\r\n]+|\z)
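The point about $ and \r can be seen in a few lines. This is a minimal sketch with made-up names (LineAnchorDemo, Words): with RegexOptions.Multiline, $ matches before a \n but not before a \r, so on CRLF text a pattern ending in $ misses every line except the last unless an optional \r is allowed in front of it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineAnchorDemo
{
    // Returns group 1 of every match found in Multiline mode.
    public static List<string> Words(string text, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Multiline))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string text = "abc\r\ndef";

        // $ cannot match between "abc" and "\r", so only "def" is found.
        Console.WriteLine(string.Join(",", Words(text, @"^(\w+)$").ToArray()));

        // Allowing an optional \r before $ finds both lines.
        Console.WriteLine(string.Join(",", Words(text, @"^(\w+)\r?$").ToArray()));
    }
}
```

Keeping the \r outside the capture group means the captured values come back without a trailing carriage return.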