Difficult (for me) string parsing in C# (regex?) - c#

I need help to parse some information from a mass of text, basically I am importing a PSD file and want to parse some data from it.
Amongst the text are strings such as this:
\r\nj78876 RANDOM TEXT STRINGS 75 £
Now what I want to do is grab all strings that fit this format (maybe the starting "\r\n" and ending "£" can be delimiters) and get the code at the start (j78876) and the price at the end (75). Note that the price may have more than 2 digits.
I want to then grab the code such as j78876 and the price for each string like this which is found as they will occur many times (different codes and prices).
Can anyone suggest a way to do this?
I am not very proficient with Regex so guidance would be great.
thanks.
Note: Here is a snippet of the actual text (there is a lot more in the actual file).
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9449A HP V1810-8G
Switch 139,00\r\nJ9450A HP V1810-24G Switch 359,00\r\nEdge Switches - Managed \r\nHP Layer
2 Switches - Managed Stackables and Chassis\r\nHP Switch 2510 Series\r\nRéférence Ancienne
référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9019B HP E2510-24 Switch 359,00\r
\nJ9020A HP E2510-48 Switch 599,00\r\nJ9279A HP E2510-24G Switch 779,00\r\nJ9280A HP
E2510-48G Switch 1 569,00\r\nHP Switch 2520 Series\r\nRéférence Ancienne référence
3Com/H3C Libellé Remarque Prix en €\r\nJ9137A HP E2520-8-PoE Switch 489,00\r\nJ9138A HP
E2520-24-PoE Switch 779,00\r\nJ9298A HP E2520-8G-PoE Switch 749,00\r\nJ9299A HP E2520-
24G-PoE Switch 1 569,00\r\nHP Layer 2 and 3 Switches - Managed Stackables and Chassis\r
\nThe RBP is a recommended price only. \r\nHP Switch 2600 Series\r\nRéférence Ancienne
Update
I found this:
[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]
Worked for me in regex browser testers but will not work in my C# code
Regex reg = new Regex(@"[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]", RegexOptions.IgnoreCase);
Match matched = reg.Match(str);
if (matched.Success)
{
string code = matched.Groups[1].Value;
string currencyAmt = matched.Groups[2].Value;
}
Final Update:
In the browser testers I had to double-escape the \r\n; in my code it was not necessary. Then, to loop over the groups, I used the looping answer.
foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
}

Regex reg = new Regex(@"\r\n([a-z]\d+\w)\s.*\s(\d+\,?\d+?)\r\n", RegexOptions.IgnoreCase);
string productCode, productCost;
foreach (Match match in reg.Matches(str))
{
productCode = match.Groups[1].Value;
productCost = match.Groups[2].Value;
//do something with values here
}
Edited because my original answer was wrong.
Based on your sample the above works.
Quick regex explanation of the first argument to new Regex(:
@ : makes the string a verbatim literal and keeps me from having to add extra escapes everywhere.
\r\n : starts with.
([a-z]\d+\w)\s : matches your product code, I used the \s to frame it as it appears to be a consistent whitespace.
.* : matches your random string of production description.
\s(\d+\,?\d+?) : matches a whitespace followed by your second capture of currency of some sort.
\r\n : ends with.
If you provided a larger sample data set, I could fine tune the regex.

Alright, your question is a moving target. The actual text sample has (in contradiction to your question) no £ in it. Here's an adapted expression:
new Regex(@"\r\n(\w+?).*?\s+(\d+?,\d\d)")
In prose (this is a learning site after all): Match "\r\n" followed by any alphanumerics until you hit whitespace, then anything until you hit whitespace followed by a number with two digits behind the comma. The alphanumerics and the number are captured.
As I said, I don't do Obj-C and thus can't test it. See these C# docs (and other answers here) for how to use it.

I would use named groups to identify the groups easier. The ?<code> part of the expression identifies the group.
You will want to use Matches, as you say there will be several occurrences of the pattern in your text. This will loop through them all.
foreach ( Match match in Regex.Matches(text, @"\r\n(?<code>\S+).*?(?<price>\d+)£") )
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
Console.WriteLine(code);
Console.WriteLine(currencyAmt);
}

Final result was this:
foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
string code = match.Groups["code"].Value;
string currencyAmt = match.Groups["price"].Value;
}

That sample data you added raises more questions than it answers. Are we supposed to treat those \r\n sequences as carriage-return+linefeed (CRLF), or as literal text? Also, it looks like space characters have been inserted at random positions--in some cases even between a \r and \n. Oh, and there are no pound symbols (£), only euro symbols (€), and they're never on the same line as a price, as you originally indicated.
If that sample really is representative of your data, you should try to clean it up (or have the people who supplied it to you clean it up) before you start searching it. I did just that so I could test my regex; if I've made any wrong assumptions, please let me know. And here it is:
Regex rgx = new Regex(@"^(\w+).*?(\d+,\d\d)(?:[\r\n]+|\z)", RegexOptions.Multiline);
string s = @"Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9449A HP V1810-8G Switch 139,00
J9450A HP V1810-24G Switch 359,00
Edge Switches - Managed
HP Layer 2 Switches - Managed Stackables and Chassis
HP Switch 2510 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9019B HP E2510-24 Switch 359,00
J9020A HP E2510-48 Switch 599,00
J9279A HP E2510-24G Switch 779,00
J9280A HP E2510-48G Switch 1 569,00
HP Switch 2520 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9137A HP E2520-8-PoE Switch 489,00
J9138A HP E2520-24-PoE Switch 779,00
J9298A HP E2520-8G-PoE Switch 749,00
J9299A HP E2520-24G-PoE Switch 1 569,00
HP Layer 2 and 3 Switches - Managed Stackables and Chassis
The RBP is a recommended price only.
HP Switch 2600 Series
Référence Ancienne";
foreach (Match m in rgx.Matches(s))
{
Console.WriteLine("code: {0}; price: {1}",
m.Groups[1].Value, m.Groups[2].Value);
}
output:
code: J9449A; price: 139,00
code: J9450A; price: 359,00
code: J9019B; price: 359,00
code: J9020A; price: 599,00
code: J9279A; price: 779,00
code: J9280A; price: 569,00
code: J9137A; price: 489,00
code: J9138A; price: 779,00
code: J9298A; price: 749,00
code: J9299A; price: 569,00
The ^ in multiline mode is sufficient to anchor the match at the beginning of a line; you don't have to match the line separator (\r\n) itself. You should be able to use $ at the end the same way, but that won't work because .NET doesn't regard \r as a line separator character. Instead I did it longhand: (?:[\r\n]+|\z)
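Note that the output above shows "1 569,00" captured as "569,00": `\d+,\d\d` cannot cross the space used as a thousands separator. The following is only a sketch of one way around that, assuming the separator is a plain space or a non-breaking space and that codes look like a letter followed by digits. The names `PriceListParser` and `Parse` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PriceListParser
{
    // Code at the start of a line, then a price such as "139,00" or "1 569,00"
    // at the end. \r?$ is used because $ in Multiline mode only matches before \n.
    private static readonly Regex LineRegex = new Regex(
        @"^(?<code>[A-Z]\d+\w)\s.*?\s(?<price>\d{1,3}(?:[ \u00A0]\d{3})*,\d\d)\r?$",
        RegexOptions.Multiline);

    public static List<KeyValuePair<string, decimal>> Parse(string text)
    {
        // Assumption: comma is the decimal separator, as in the sample data.
        var format = new NumberFormatInfo
        {
            NumberDecimalSeparator = ",",
            NumberGroupSeparator = " "
        };

        var rows = new List<KeyValuePair<string, decimal>>();
        foreach (Match m in LineRegex.Matches(text))
        {
            // Drop the thousands separators before converting to decimal.
            string digits = m.Groups["price"].Value
                .Replace(" ", "")
                .Replace("\u00A0", "");
            decimal price = decimal.Parse(digits, NumberStyles.AllowDecimalPoint, format);
            rows.Add(new KeyValuePair<string, decimal>(m.Groups["code"].Value, price));
        }
        return rows;
    }
}
```

Calling `PriceListParser.Parse(s)` on the cleaned-up sample should then yield 1569.00 for J9280A and J9299A instead of 569,00. Header lines and series titles are skipped because they do not start with a letter followed by digits.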

Related

Extracting dollar prices and numbers with comma as thousand separator from PDF converted to text format

I am trying to redact some pdfs with dollar amounts using c#. Below is what I have tried
@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"
Here are some test cases that it needs to match
76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000
Here are some test cases that it should not target
666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717
Please note the c# portion is interfacing with iText 7 pdfSweep.
Guid g = new Guid();
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid+".pdf";
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
strategy.Add(new RegexBasedCleanupStrategy(regex));
PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Please share your wisdom
You may use
\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?
Or, if the prices occur on whole lines:
^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$
See the regex demo
Bonus: To obtain only price values, you need to remove the ? after \$ to make it obligatory:
\$([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)
(I added a capturing group in case you need to access the number value separately from the $ char).
If you need to support any currency char, not just $, replace \$ with \p{Sc}.
Details
^ - start of string
\$? - an optional dollar symbol
[0-9]{1,3} - one to three digits
(?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
(?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
$ - end of string.
C# check for a match:
if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"))
{
// there is a match
}
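As a rough check of the whole-line pattern against the test cases from the question, something like the sketch below could be used. `AmountFilter` and `KeepAmounts` are hypothetical names, and each candidate is assumed to be a single line without a trailing line break.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AmountFilter
{
    // Optional $, 1-3 digits, optional ",ddd" groups, optional fractional part.
    private static readonly Regex Amount = new Regex(
        @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$");

    // Returns only the candidates that look like an amount in their entirety.
    public static List<string> KeepAmounts(IEnumerable<string> candidates)
    {
        var kept = new List<string>();
        foreach (string candidate in candidates)
        {
            if (Amount.IsMatch(candidate))
            {
                kept.Add(candidate);
            }
        }
        return kept;
    }
}
```

With the inputs "76,249.25", "$1,000,000", "0.009", "666-257-6443", "2019.07.05", "280160717" and "6412K42", only the first three are kept. One caveat: a bare integer of up to three digits (for example "50") also satisfies this pattern, so it may need tightening if such values appear on their own line.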
pdfSweep notice:
Apply the fix from this answer. The point is that the line breaks are lost when parsing the text. The regex you need then is
@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"
where (?m) makes ^ and $ match start/end of lines and \r? is required as $ only matches before LF, not before CRLF in .NET regex.
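The effect of the `\r?` can be seen with a small experiment. This is just a sketch with a simplified number pattern; `LineEndingDemo` and `CountWholeLineMatches` are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Counts matches of a pattern with ^ and $ treated as line anchors.
    public static int CountWholeLineMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern, RegexOptions.Multiline).Count;
    }
}
```

For the CRLF-separated text `"7.09\r\n21.27\r\n420.42"`, the pattern `^[0-9]+\.[0-9]+$` finds only the last line, because on the first two lines a `\r` sits between the digits and the `\n`. The pattern `^[0-9]+\.[0-9]+\r?$` finds all three.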

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern includes an asterisk which matches any number of characters - ie. the whole line. Remove the "*" and it will only match the "1". You may find an online RegEx tester such as this useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline when you are making the match.
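To store more than the line number, named groups can carry each field. The sketch below pulls out the line number, airline and flight number from the start of a line. It assumes airline codes are two letters and that a prefix such as "UA:" can simply be skipped; neither is confirmed by the question. `FlightLineParser` and `TryParse` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlightLineParser
{
    // Line number (1-99), optional "XX:" prefix, two-letter airline, flight digits.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<line>[1-9]\d?)\s*(?:[A-Z]{2}:)?(?<airline>[A-Z]{2})(?<flight>\d+)");

    public static bool TryParse(string text, out int lineNumber,
                                out string airline, out string flight)
    {
        Match m = Header.Match(text);
        if (!m.Success)
        {
            lineNumber = 0;
            airline = "";
            flight = "";
            return false;
        }
        // Each named group is read back by name rather than by position.
        lineNumber = int.Parse(m.Groups["line"].Value);
        airline = m.Groups["airline"].Value;
        flight = m.Groups["flight"].Value;
        return true;
    }
}
```

For the first sample line this gives 1, "AA" and "2401"; for the third ("3UA:US6352 ...") it gives 3, "US" and "6352". The values can then be stored in whatever object or collection suits the rest of the program.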

How can I exclude the first match in a regular expression?

I have the following regex, so far:
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
to be used on strings such as:
2x Soup, 2x Meat Balls, 4x Iced Tea
My intent was to capture the number of times something was ordered, as well as the name of item ordered.
In this regular expression however, the multiplier 'x' gets captured before the title.
How can I make it so that the x is ignored, and what comes after the x (and a space) is captured?
You can't ignore something in the middle of the pattern. Therefore you do have your capturing groups.
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
^^^^^^^^             ^^^^^^^^^^^^^^^^^^^^^
The marked parts of your pattern are stored in capturing groups, because of the brackets around them.
Your number is in group 1 and the name is in group 2. The "x" is not captured in a group.
How you now access your groups depends on the language you are using.
Btw. the {1} is redundant.
So for c# try this:
string text = "2x Soup, 2x Meat Balls, 4x Iced Tea";
MatchCollection result = Regex.Matches(text, @"([0-9]+)\s*[xX]\s*([A-Za-z\./%\$\s\*]+)");
int counter = 0;
foreach (Match m in result)
{
counter++;
Console.WriteLine("Order {0}: " + m.Groups[1] + " " + m.Groups[2], counter);
}
Console.ReadLine();
Further I would change the regex to this, since it seems you want to match as name every character till the next comma
@"([0-9]+)\s*x\s*([^,]+)"
and use RegexOptions.IgnoreCase to avoid having to write [xX]
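Put together, the simplified pattern with RegexOptions.IgnoreCase might be used as in the sketch below. Named groups are used here instead of numbered ones; `OrderParser` and `Parse` are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderParser
{
    // Quantity, the multiplier "x" (any case), then everything up to the next comma.
    private static readonly Regex Item = new Regex(
        @"(?<qty>[0-9]+)\s*x\s*(?<name>[^,]+)", RegexOptions.IgnoreCase);

    // Returns (item name, quantity) pairs in the order they appear.
    public static List<KeyValuePair<string, int>> Parse(string text)
    {
        var items = new List<KeyValuePair<string, int>>();
        foreach (Match m in Item.Matches(text))
        {
            int quantity = int.Parse(m.Groups["qty"].Value);
            string name = m.Groups["name"].Value.Trim();
            items.Add(new KeyValuePair<string, int>(name, quantity));
        }
        return items;
    }
}
```

For "2x Soup, 2x Meat Balls, 4x Iced Tea" this returns Soup with 2, Meat Balls with 2 and Iced Tea with 4. The "x" is consumed by the pattern but never appears in either group.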

Regular Expression to find separate words?

Here's a quickie for your RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For instance, I'd like for it to find the first two words in any sentence.
Example "Hi there, how are you?" - Return would be "hi there"
Example "How are you doing?" - Return would be "How are"
Try this:
^\w+\s+\w+
Explanation: one or more word characters, then one or more whitespace characters, then one or more word characters, anchored at the start of the string.
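In C# the pattern above could be wrapped as follows. This is a minimal sketch; `FirstWords` and `FirstTwo` are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstWords
{
    // Returns the first two whitespace-separated words, or "" if there is no match.
    public static string FirstTwo(string sentence)
    {
        Match m = Regex.Match(sentence, @"^\w+\s+\w+");
        return m.Success ? m.Value : string.Empty;
    }
}
```

`FirstWords.FirstTwo("Hi there, how are you?")` returns "Hi there" and `FirstWords.FirstTwo("How are you doing?")` returns "How are". Punctuation between the first two words (as in "Hi, there") prevents a match, since only whitespace is allowed between them.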
Regular expressions could be used to parse language. Regular expressions are a more natural tool. After gathering the words, use a dictionary to see if they're actually words in a particular language.
The premise is to define a regular expression that will split out 99.9% of possible words, "word" being the key definition.
I assume C# is going to use a PCRE based on 5.8 Perl.
This is my ascii definition of how to split out words (expanded):
regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )'
and unicode (more has to be added/subtracted to suit specific encodings):
regex = '[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )'
To find ALL of the words, cat the regex string into a regex (i don't know c#):
@matches =~ /$regex/xg
where /xg are the expanded and global modifiers. Note that there is only capture group 1 in the regex string so the intervening text is not captured.
To find just the FIRST TWO:
@matches =~ /(?:$regex)(?:$regex)/x
Below is a Perl sample. Anyway, play around with it. Cheers!
use strict;
use warnings;
binmode (STDOUT,':utf8');
# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;
# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;
my $text = q(
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
);
print "\n**\n$text\n";
my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
# =======================================
my $junk = q(
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
);
print "\n\n**\n$junk\n";
# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
Output:
**
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included
**
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
First 2 words
--------------------
Hi
there
Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-
@Rubens Farias:
Per my comment, here's the code I used:
public int startAt = 0;
private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
Regex regex = new Regex(@"\b\w+\s+\w+\b"); //Start at word boundary, find one or more word chars, one or more whitespaces, one or more word chars, end at word boundary
if (startAt <= txtTest.Text.Length)
{
Match match = regex.Match(txtTest.Text, startAt);
if (match.Success)
{
MessageBox.Show(match.Value);
startAt = match.Index + match.Length; //move the starting position to the end of this match
}
}
}
Each time the button is clicked it grabs pairs of words quite nicely, proceeding through the text in the txtTest TextBox and finding the pairs sequentially until the end of the string is reached.
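If all pairs are wanted at once rather than one per click, Regex.Matches already walks the text sequentially, so no start position has to be tracked by hand. A sketch, with `WordPairs` and `All` as hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordPairs
{
    // Collects consecutive, non-overlapping pairs of whitespace-separated words.
    public static List<string> All(string text)
    {
        var pairs = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b\w+\s+\w+\b"))
        {
            pairs.Add(m.Value);
        }
        return pairs;
    }
}
```

For "one two three four five" this returns "one two" and "three four"; the trailing "five" has no partner and is left out.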
@sln: Thanks for the extremely detailed response!

How to select line number in TextBox Multiline

I have a large text in a System.Windows.Forms.TextBox control in my form (WinForms, VS 2008).
I want to find a piece of text and select the line number where I found it.
For example,
I have a big block of text, I search for "ERROR en línea", and I want to select that line number in the multiline TextBox.
string textoLogDeFuenteSQL = @"SQL*Plus: Release 10.1.0.4.2 - Production on Mar Jun 1 14:35:43 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
******** MORE TEXT ************
Conectado a:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Data Mining and Real Application Testing options
WHERE LAVECODIGO = 'CO_PREANUL'
ERROR en línea 2:
ORA-00904: ""LAVECODIGO"": identificador no v?lido
INSERT INTO COM_CODIGOS
ERROR en línea 1:
ORA-00001: restricción única (XACO.INX_COM_CODIGOS_PK) violada";
******** MORE TEXT ************
Any sample code about it ?
You might want to look at TextBoxBase.GetLineFromCharIndex method. This method retrieves the line number of character position within the textbox.
string str = textBox2.Text;
int index = textBox1.Text.IndexOf(str);
if (index !=-1)
{
int lineNo = textBox1.GetLineFromCharIndex(index);
}
"This method enables you to determine the line number based on the character index specified in the index parameter of the method. The first line of text in the control returns the value zero. The GetLineFromCharIndex method returns the physical line number where the indexed character is located within the control."
EDIT: This only finds the occurrences of the searched text. To compute the line numbers use Fredrik's answer.
using System.Text.RegularExpressions;
public static void FindErrorInText(string input)
{
Regex rgx = new Regex(@"ERROR en línea \d*", RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Value);
}
}
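Outside of a TextBox, the line number of each regex match can also be derived from Match.Index by counting the line breaks that precede it. The sketch below does that. It yields zero-based logical lines, which may differ from what GetLineFromCharIndex reports when the control wraps long lines. `ErrorLocator` and `FindErrorLines` are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ErrorLocator
{
    // Returns the zero-based line number of every "ERROR en línea N" occurrence.
    public static List<int> FindErrorLines(string text)
    {
        var lines = new List<int>();
        foreach (Match m in Regex.Matches(text, @"ERROR en línea \d+", RegexOptions.IgnoreCase))
        {
            // Count the '\n' characters before the match to get its line.
            int line = 0;
            for (int i = 0; i < m.Index; i++)
            {
                if (text[i] == '\n')
                {
                    line++;
                }
            }
            lines.Add(line);
        }
        return lines;
    }
}
```

For a text whose second and fourth lines contain the error message, the result is the list 1, 3.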
