Reading numbers from string in C#

What I want?
I want to display weather information on my page.
I want to display the result in the browser-specific culture.
What am I doing?
I use the MSN RSS feed for this purpose.
MSN returns the report in XML format. I parse the XML and display the results.
What problem am I facing?
When displaying the report, I have to parse an XML node, <data>, whose value differs from culture to culture.
For example:
en-US: "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%"
de-DE: "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%"
I want to read only the low, high and chance-of-precipitation values, i.e. 46, 67 and 20%.
Can somebody please give me a solution for this?
A regex or some other method is fine with me :-)
Thanks in advance!

You should consider always fetching the RSS using the same culture. That way, you'll have an easier task parsing the content. If you'll only be using the numbers, it shouldn't stop you from emitting culture-specific content to the end user.
So if you go for the en-US version, you could do it like this:
// Literal dots are escaped; lo, hi and prec are assumed to be declared elsewhere.
Regex re = new Regex(@"Lo: (\d+)°F\. Hi: (\d+)°F\. Chance of precipitation: (\d+)%");
var match = re.Match(forecast);
if (match.Success)
{
    var groups = match.Groups;
    lo = int.Parse(groups[1].Value);
    hi = int.Parse(groups[2].Value);
    prec = int.Parse(groups[3].Value);
}

If you only want the numbers, you can use a regular expression, for example the following:
(\d+).*?(\d+).*?(\d+%)
A quick test in PowerShell shows that it does work at least for your input data:
PS Home:\> function test ($re) {
>> $a -match $re; $Matches
>> $b -match $re; $Matches
>> }
>>
PS Home:\> $a = "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%"
PS Home:\> $b = "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%"
PS Home:\> test "(\d+).*?(\d+).*?(\d+%)"
True
Name Value
---- -----
3 20%
2 67
1 46
0 46°F. Hi: 67°F. Chance of precipitation: 20%
True
3 20%
2 67
1 46
0 46°F. Höchst: 67°F. Niederschlag %: 20%
However, it won't work anymore if any locale might use numbers in the description strings.
You can add other constraints, like requiring a colon before every match:
: (\d+).*?: (\d+).*?: (\d+%)
This should deal with spurious numbers elsewhere in the string. But the best way overall would be to get your data from a source that provides it for machine reading, not for human consumption.
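For the C# side of the question, the colon-constrained pattern could be used roughly as in the sketch below. The names ForecastParser and TryParse are made up for illustration, and the code assumes that every locale puts ": " in front of each of the three numbers, as the two sample strings do.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ForecastParser
{
    // Same idea as the pattern above: a colon and a space before each number.
    // Assumption: all locales format the three values this way.
    private static readonly Regex Pattern =
        new Regex(@": (\d+).*?: (\d+).*?: (\d+)%");

    public static bool TryParse(string forecast, out int lo, out int hi, out int precipitation)
    {
        lo = hi = precipitation = 0;
        Match m = Pattern.Match(forecast);
        if (!m.Success) return false;

        lo = int.Parse(m.Groups[1].Value);
        hi = int.Parse(m.Groups[2].Value);
        precipitation = int.Parse(m.Groups[3].Value);
        return true;
    }

    public static void Main()
    {
        string[] samples =
        {
            "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%",
            "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%"
        };
        foreach (string s in samples)
        {
            if (TryParse(s, out int lo, out int hi, out int p))
                Console.WriteLine($"{lo} {hi} {p}"); // 46 67 20 for both samples
        }
    }
}
```

The percent sign is kept outside the last group so that all three results can be parsed as integers.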

The following should extract the two numbers and chance of precipitation, as well as the units that are used (for culturally dependent units).
(?<lo>\d+°.).*?(?<hi>\d+°.).*?(?<precipitation>\d+)
If you don't want units extracted, then you can use
(?<lo>\d+)°.*?(?<hi>\d+)°.*?(?<precipitation>\d+)
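As a rough illustration of how the named groups can be read back in C#, here is a small sketch using the pattern without units. The class name NamedGroupDemo and the Extract helper are hypothetical, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns lo, hi and precipitation as strings, or an empty array if nothing matched.
    public static string[] Extract(string forecast)
    {
        Match m = Regex.Match(forecast,
            @"(?<lo>\d+)°.*?(?<hi>\d+)°.*?(?<precipitation>\d+)");
        if (!m.Success) return new string[0];

        // Named groups are looked up by name instead of by position.
        return new[]
        {
            m.Groups["lo"].Value,
            m.Groups["hi"].Value,
            m.Groups["precipitation"].Value
        };
    }

    public static void Main()
    {
        string[] parts = Extract("Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%");
        Console.WriteLine(string.Join(", ", parts)); // 46, 67, 20
    }
}
```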

Use a regex (but I don't know the regex formula ;) )
You can also loop over the sentence and check whether each char is a digit. Each time you encounter one, append it to a string. When you find something other than a digit, parse the string to an int, and voilà. Do this three times.
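The character loop described above might look like the following sketch. DigitScanner and ExtractNumbers are illustrative names, and the code assumes every run of digits is short enough to fit in an int.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class DigitScanner
{
    // Collects every run of consecutive ASCII digits as one integer.
    public static List<int> ExtractNumbers(string text)
    {
        var numbers = new List<int>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (c >= '0' && c <= '9')
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A non-digit ends the current run.
                numbers.Add(int.Parse(current.ToString()));
                current.Clear();
            }
        }

        // Handle a run that reaches the end of the string.
        if (current.Length > 0) numbers.Add(int.Parse(current.ToString()));
        return numbers;
    }

    public static void Main()
    {
        var values = ExtractNumbers("Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%");
        Console.WriteLine(string.Join(", ", values)); // 46, 67, 20
    }
}
```

Unlike the regex variants, this picks up every number in the text, so it relies on the description containing exactly the three values.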

It's quite odd that you are not getting XML with the values in separate nodes, which would make more sense to me (then you could pick which values to use for each locale).
But if you want to extract data from the given strings and are not a fan of regex, try this or something similar:
string dataUS = "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%";
string dataDE = "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%";
// Split on ": " so that each of the last three parts starts with a number.
string[] stringValues = dataUS.Split(new string[] {": "}, 4, StringSplitOptions.None);
List<int> values = new List<int>();
for (int i = 1; i < 4; i++)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in stringValues[i].Trim())
    {
        if (!Char.IsDigit(c))
        {
            break; // stop at the first non-digit
        }
        sb.Append(c);
    }
    // Adding after the loop also covers a part that ends in a digit.
    values.Add(Convert.ToInt32(sb.ToString()));
}
(I'm splitting on ": " instead of on digits.)

I suggest using a regex to get the values you want one by one, according to the UI culture language.
I mean you can have a regex to get the low temperature, "(Lo|Niedrig):\s*(\d+)", a regex to get the high temperature,
"(Hi|Höchst):\s*(\d+)", and a regex to get the chance of precipitation, and so on.
In all of the above examples you can get the number from the second capture group of the match.
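A minimal sketch of this label-by-label approach follows. LabelledValueReader and FindValue are hypothetical names, and the patterns allow optional whitespace after the colon because the sample strings have a space there.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelledValueReader
{
    // labelAlternatives is a regex alternation such as "Lo|Niedrig".
    // Returns the number that follows the label, or null if the label is absent.
    public static int? FindValue(string forecast, string labelAlternatives)
    {
        Match m = Regex.Match(forecast, "(" + labelAlternatives + @"):\s*(\d+)");
        // Group 1 is the label, group 2 is the number.
        return m.Success ? int.Parse(m.Groups[2].Value) : (int?)null;
    }

    public static void Main()
    {
        string de = "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%";
        Console.WriteLine(FindValue(de, "Lo|Niedrig"));                              // 46
        Console.WriteLine(FindValue(de, "Hi|Höchst"));                               // 67
        Console.WriteLine(FindValue(de, "Chance of precipitation|Niederschlag %"));  // 20
    }
}
```

Each new culture needs its labels added to the alternations, which is the main cost of this approach.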

Related

Extracting dollar prices and numbers with comma as thousand separator from PDF converted to text format

I am trying to redact some pdfs with dollar amounts using c#. Below is what I have tried
@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"
Here are some test cases that it needs to match
76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000
Here are some test cases that it should not target
666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717
Please note the c# portion is interfacing with iText 7 pdfSweep.
Guid g = new Guid();
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid+".pdf";
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
strategy.Add(new RegexBasedCleanupStrategy(regex));
PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Please share your wisdom

You may use
\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?
Or, if the prices occur on whole lines:
^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$
See the regex demo
Bonus: To obtain only price values, you need to remove the ? after \$ to make it obligatory:
\$([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)
(I added a capturing group in case you need to access the number value separately from the $ char).
If you need to support any currency char, not just $, replace \$ with \p{Sc}.
Details
^ - start of string
\$? - an optional dollar symbol
[0-9]{1,3} - one to three digits
(?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
(?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
$ - end of string.
C# check for a match:
if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"))
{
// there is a match
}
pdfSweep notice:
Apply the fix from this answer. The point is that the line breaks are lost when parsing the text. The regex you need then is
@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"
where (?m) makes ^ and $ match start/end of lines and \r? is required as $ only matches before LF, not before CRLF in .NET regex.
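To check the whole-string pattern against some of the test cases from the question, a small harness like the one below could be used. PriceMatcher and IsPrice are illustrative names. The sketch only exercises the regex itself and does not involve pdfSweep.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PriceMatcher
{
    // Whole-string version of the pattern from the answer.
    private static readonly Regex Price =
        new Regex(@"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$");

    public static bool IsPrice(string s) => Price.IsMatch(s);

    public static void Main()
    {
        // A few entries from each list in the question.
        string[] shouldMatch = { "76,249.25", "7.09", "0.009", "$9.0000", "$1,000,000" };
        string[] shouldNotMatch = { "666-257-6443", "F1A 5G9", "2019.07.05", "6412K42", "280160717" };

        foreach (string s in shouldMatch)
            Console.WriteLine($"{s}: {IsPrice(s)}");    // True for each
        foreach (string s in shouldNotMatch)
            Console.WriteLine($"{s}: {IsPrice(s)}");    // False for each
    }
}
```

Note that 280160717 is rejected only because it has more than three digits without thousands separators. A short bare integer such as 420 would still match.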

Display all possible matches for a regex pattern

I have the following RegEx pattern in order to determine some 3-digit exchanges of phone numbers:
(?:2(?:04|[23]6|[48]9|50)|3(?:06|43|65)|4(?:03|1[68]|3[178]|50)|5(?:06|1[49]|79|8[17])|6(?:0[04]|13|39|47)|7(?:0[59]|78|8[02])|8(?:[06]7|19|73)|90[25])
It looks pretty daunting, but it only yields around 40 or 50 numbers. Is there a way in C# to generate all numbers that match this pattern? Offhand, I know I can loop through the numbers 001 thru 999, and check each number against the pattern, but is there a cleaner, built-in way to just generate a list or array of matches?
ie - {"204","226","236",...}

No, there is no off the shelf tool to determine all matches given a regex pattern. Brute force is the only way to test the pattern.
Update
It is unclear why you are using (?: ), which means "match but don't capture". It is used to anchor a match; for example, take the text phone:303-867-5309, where we don't care about the phone: prefix but we want the number.
The pattern used would be
(?:phone\:)(\d{3}-\d{3}-\d{4})
which would match the whole line, but the only capture group returned would be the phone number 303-867-5309.
So (?: ), as mentioned, is used to anchor a capture at a specific point, with the text it matches thrown away.
With that said, I have redone your pattern with comments and a test to 2000:
string pattern = @"
^ # Start at beginning of line so no mid number matches erroneously found
(
2(04|[23]6|49|[58]0) # 2 series only match 204, 226, 236, 249, 250, 280
| # Or it is not 2, then match:
3(06|43|65) # 3 series only match 306, 343, 365
)
$ # Further Anchor it to the end of the string to keep it to 3 numbers";
// RegexOptions.IgnorePatternWhitespace allows us to put the pattern over multiple lines and comment it. Does not
// affect regex parsing/processing.
var results = Enumerable.Range(0, 2000) // Test to 2000 so we don't get any non 3 digit matches.
.Select(num => num.ToString().PadLeft(3, '0'))
.Where (num => Regex.IsMatch(num, pattern, RegexOptions.IgnorePatternWhitespace))
.ToArray();
Console.WriteLine ("These results found {0}", string.Join(", ", results));
// These results found 204, 226, 236, 249, 250, 280, 306, 343, 365

I took the advice of @LucasTrzesniewski and just looped through the possible values. Since I know I'm dealing with 3-digit numbers, I just looped through the strings "000" through "999" and checked for matches like this:
private static void FindRegExMatches(string pattern)
{
    for (var i = 0; i < 1000; i++)
    {
        var numberString = i.ToString().PadLeft(3, '0');
        if (!Regex.IsMatch(numberString, pattern)) continue;
        Console.WriteLine("Found a match: {0}", numberString);
    }
}
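If a list is wanted rather than console output, the same brute-force idea can be written as a LINQ query. The sketch below uses the hypothetical names ExchangeEnumerator and AllMatches, and it wraps the pattern in ^(?:...)$ so that only complete three-digit strings are counted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExchangeEnumerator
{
    // The pattern from the question, unchanged.
    public const string QuestionPattern =
        @"(?:2(?:04|[23]6|[48]9|50)|3(?:06|43|65)|4(?:03|1[68]|3[178]|50)|5(?:06|1[49]|79|8[17])|6(?:0[04]|13|39|47)|7(?:0[59]|78|8[02])|8(?:[06]7|19|73)|90[25])";

    // Tests every string from "000" to "999" and keeps the ones that match in full.
    public static List<string> AllMatches(string pattern)
    {
        var regex = new Regex("^(?:" + pattern + ")$");
        return Enumerable.Range(0, 1000)
            .Select(i => i.ToString("D3"))   // zero-pad to three digits
            .Where(s => regex.IsMatch(s))
            .ToList();
    }

    public static void Main()
    {
        List<string> exchanges = AllMatches(QuestionPattern);
        Console.WriteLine("{0} matches: {1}", exchanges.Count, string.Join(", ", exchanges));
    }
}
```

With only 1000 candidates the cost is negligible, so there is little to gain from anything more elaborate.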

Regular expression for maximum value

I have come up with the following regexp to accept values with a .25 interval or in quarter format, like 1.25, 10.75, 11.50, 12, 13.
Regular Expression
^\d+((\.0+)*|(\.250*)|(\.50*)|(\.750*))$
Example
Accepted Values = 0, 0.25, 0.50, 0.75, 3 , 1.25 , 1.50, 1.75, 5 , 10
Not Accepted Values = 0.15, 0.20, 0.26, 0.30, 1.30, 1.55
I have the following questions:
How can I make it not accept .25, but accept 0.25?
How can I limit the value to the maximum number? I want it to accept up to 15.5.

In my opinion, a regex is not the correct tool for that kind of work. All the values you want to accept are decimal values. Simply parse the entered value as a decimal and then check whether it is one of your accepted values:
decimal number;
if (Decimal.TryParse(value, out number))
{
    // Check that number is in the accepted range and a multiple of 0.25
}
It will be much simpler and less error-prone.
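One way the parse-then-check idea could be fleshed out is sketched below. QuarterValidator and IsValid are made-up names. The upper bound of 15.5 and the rejection of a bare ".25" come from the question, and the use of the invariant culture is an assumption about how the input is formatted.

```csharp
using System;
using System.Globalization;

public static class QuarterValidator
{
    public static bool IsValid(string input)
    {
        // Require a leading digit so that ".25" is rejected while "0.25" is accepted.
        if (string.IsNullOrEmpty(input) || !char.IsDigit(input[0])) return false;

        // AllowDecimalPoint alone rejects signs, spaces and thousands separators.
        if (!decimal.TryParse(input, NumberStyles.AllowDecimalPoint,
                CultureInfo.InvariantCulture, out decimal number))
            return false;

        // In range and an exact multiple of a quarter.
        return number >= 0m && number <= 15.5m && number % 0.25m == 0m;
    }

    public static void Main()
    {
        foreach (string s in new[] { "0.25", ".25", "3.75000", "15.5", "15.75", "1.30" })
            Console.WriteLine($"{s}: {IsValid(s)}");
    }
}
```

Because decimal arithmetic is exact for these values, the remainder check does not suffer from the rounding issues a double would have.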

This is the wrong way to do this, but a solution nonetheless:
^((1[0-4]|[0-9])?(\.(25|5|75)?0*)?|15(\.(25|5)?0*)?)$
Demo
Regex r = new Regex("^((1[0-4]|[0-9])?(\\.(25|5|75)?0*)?|15(\\.(25|5)?0*)?)$");
string[] arr = {".25", "17.545", "3.75000", "19.5", "10.500", "0.25"};
foreach(string s in arr) {
if (r.IsMatch(s)) {
Console.WriteLine(s);
}
}
It gives:
.25
3.75000
10.500
0.25

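Your pattern does not accept .25 because the \d+ at the beginning requires at least one digit before the ".". To cap the value at 15, you could replace \d+ with something like (0|1)?[1-5]? as a starting point, although that expression needs refining: it does not cover every integer from 0 to 15 (it misses 6 through 9, for example).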
Regex details can be found here.

You can use this regex:
^((?:1[0-4]|[0-9])?(?:\.(?:25|5|75))?0*|15(?:\.(?:25|5)?0*)?)$
Demo: http://regex101.com/r/eQ2pA1/7

Use ^(0|1)?[1-4]?(|\.0+|\.250*|\.50*|\.750*)$ to match the numbers 0-14, and add an alternative with | to match 15, 15.25 and 15.50* using the following:
15(|\.0+|\.250*|\.50*)
The final one will be (without spaces around the |, since they would be matched literally):
^(((0|1)?[1-4]?(|\.0+|\.250*|\.50*|\.750*))|(15(|\.0+|\.250*|\.50*)))$

I think Regex is the wrong way to do it, but if you insist then the following should suit you right.
^(\d|1[0-5]?)((\.0+)*|(\.250*)|(\.50*)|(\.750*))$

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).

Your regex pattern includes an asterisk which matches any number of characters - ie. the whole line. Remove the "*" and it will only match the "1". You may find an online RegEx tester such as this useful.

Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}

The following expression matches the first digit, which you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline when you are making the match, so that ^ matches at the start of every line if you run the pattern against the whole text.
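Since the question is mostly about storing and accessing the captured pieces, here is a hedged sketch that pulls the line number, airline code and flight number out of one line using named groups. FlightLineParser and the assumed layout (line number, an optional "XX:" prefix as in the third sample line, a two-letter airline code, then the flight number) are guesses based on the three sample lines, not a known format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlightLineParser
{
    // Assumed layout: line number, optional "XX:" prefix, airline code, flight number.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<line>\d{1,2})\s*(?:[A-Z]{2}:)?(?<airline>[A-Z]{2})(?<flight>\d+)");

    // Returns null when the line does not start with the assumed layout.
    public static (int Line, string Airline, int Flight)? Parse(string text)
    {
        Match m = Header.Match(text);
        if (!m.Success) return null;

        return (int.Parse(m.Groups["line"].Value),
                m.Groups["airline"].Value,
                int.Parse(m.Groups["flight"].Value));
    }

    public static void Main()
    {
        string[] lines =
        {
            "1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21",
            "3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48"
        };
        foreach (string line in lines)
        {
            var parsed = Parse(line);
            if (parsed.HasValue)
                Console.WriteLine($"{parsed.Value.Line} {parsed.Value.Airline} {parsed.Value.Flight}");
        }
    }
}
```

Each line is matched on its own here, so RegexOptions.Multiline is not needed. The remaining fields could be added as further named groups in the same way.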

Regex for checking for compass directions

I'm looking to match the 8 main directions as might appear in a street or location prefix or suffix, such as:
N Main
south I-22
124 Grover Ave SE
This is easy to code using a brute force list of matches and cycle through every match possibility for every street address, matching once with a start-of-string anchor and once with a end-of-string anchor. My blunt starting point is shown farther down, if you want to see it.
My question is if anyone has some clever ideas for compact, fast-executing patterns to accomplish the same thing. You can assume:
Compound directions always start with the north / south component. So I need to match South East but not EastSouth
The pattern should not match [direction]-ern words, like "Northern" or "Southwestern"
The match will always be at the very beginning or very end of the string.
I'm using C#, but I'm just looking for a pattern so I'm not emphasizing the language. /s(outh)?/ is just as good as @"s(outh)?" for me or future readers.
SO emphasizes real problems, so FYI this is one. I'm parsing a few hundred thousand nasty, unvalidated user-typed address strings. I want to check if the start or end of the "street" field (which is free-form jumble of PO boxes, streets, apartments, and straight up invalid junk) begins or ends with a compass direction. I'm trying to deconstruct these free form strings to find similar addresses which may be accidental or intentional variations and obfuscations.
My blunt attempt
Core pattern: /n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)/
In a function:
public static Tuple<Match, Match> MatchDirection(String value) {
    string patternBase = @"n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)";
    Match[] matches = new Match[2];
    string[] compassPatterns = new[] { @"^(" + patternBase + @")\b", @"\b(" + patternBase + @")$" };
    for (int i = 0; i < 2; i++) { matches[i] = Regex.Match(value, compassPatterns[i], RegexOptions.IgnoreCase); }
    return new Tuple<Match, Match>(matches[0], matches[1]);
}
In use, where sourceDt is a table with all the addresses:
var parseQuery = sourceDt.AsEnumerable()
.Select((DataRow row) => {
string addr = ((string)row["ADDR_STREET"]).Trim();
Tuple<Match, Match> dirMatches = AddressParser.MatchDirection(addr);
return new string[] { addr, dirMatches.Item1.Value, dirMatches.Item2.Value };
})

Edit: Actually, this is probably the wrong answer, so I am keeping it only so that people don't suggest the same thing. Figuring out the tokenization for "South East" is a task in itself. I also still doubt that a regex will be very usable either.
Original answer:
Don't... your initial regex attempt is already unreadable.
A dictionary lookup for each word you want from the tokenized string (the "brute force approach") already gives you time linear in the length of the string and constant time per word. It is also very easy to customize with new words.
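A rough sketch of the lookup idea is shown below, with one simple way of coping with two-word forms such as "South East": the first (or last) two tokens are joined and looked up before the single token is tried. DirectionLookup, Leading and Trailing are hypothetical names, and splitting on spaces only is a simplifying assumption.

```csharp
using System;
using System.Collections.Generic;

public static class DirectionLookup
{
    private static readonly HashSet<string> Directions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "n", "north", "e", "east", "s", "south", "w", "west",
            "ne", "northeast", "nw", "northwest",
            "se", "southeast", "sw", "southwest"
        };

    private static string[] Tokens(string street) =>
        street.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Direction at the start of the string, or null if there is none.
    public static string Leading(string street)
    {
        string[] t = Tokens(street);
        if (t.Length >= 2 && Directions.Contains(t[0] + t[1])) return t[0] + " " + t[1];
        if (t.Length >= 1 && Directions.Contains(t[0])) return t[0];
        return null;
    }

    // Direction at the end of the string, or null if there is none.
    public static string Trailing(string street)
    {
        string[] t = Tokens(street);
        int n = t.Length;
        if (n >= 2 && Directions.Contains(t[n - 2] + t[n - 1])) return t[n - 2] + " " + t[n - 1];
        if (n >= 1 && Directions.Contains(t[n - 1])) return t[n - 1];
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(Leading("N Main"));              // N
        Console.WriteLine(Trailing("124 Grover Ave SE"));  // SE
        Console.WriteLine(Trailing("12 Elm South East"));  // South East
    }
}
```

Words like "Northern" never match because only whole tokens are looked up, and new spellings can be supported by adding entries to the set.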

(^[nesw][^n\s]*)|([nesw][^n\s]*$)
So this will match a line that:
begins or ends with a word that:
Begins with a cardinal direction
Doesn't have an n otherwise in it (to get rid of the "-ern"s)

Perl/PCRE compatible expression:
(?xi)
(^)?
\b
(?:
n(?:orth)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
s(?:outh)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
e(?:ast)?
|
w(?:est)?
)
\b
(?(1)|$)
I think C# supports all the features used here.
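To try the expression from C#, something like the sketch below should work, on the assumption that the .NET regex engine accepts the inline (?xi) options and the (?(1)|$) conditional as written. CompassRegexDemo and Find are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompassRegexDemo
{
    // The expression from the answer, laid out over fewer lines.
    // (?x) ignores the whitespace in the pattern, (?i) ignores case.
    private static readonly Regex Direction = new Regex(@"(?xi)
        (^)?
        \b
        (?:
            n(?:orth)? (?:\s* (?: e(?:ast)? | w(?:est)? ))?
          | s(?:outh)? (?:\s* (?: e(?:ast)? | w(?:est)? ))?
          | e(?:ast)?
          | w(?:est)?
        )
        \b
        (?(1)|$)");

    // Returns the matched direction text, or null if there is none.
    public static string Find(string street)
    {
        Match m = Direction.Match(street);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        foreach (string s in new[] { "N Main", "south I-22", "124 Grover Ave SE", "Northern Blvd" })
            Console.WriteLine("{0} -> {1}", s, Find(s) ?? "(none)");
    }
}
```

Group 1 records whether the match started at the beginning of the string. If it did not, the conditional forces the match to end at the end of the string, which is how one pattern covers both prefix and suffix positions.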
