I am using regex to parse data from an OCR'd document and I am struggling to match the scenarios where a 1000s comma separator has been misread as a dot, and also where the dot has been misread as a comma!
So if the true value is 1234567.89 printed as 1,234,567.89 but being misread as:
1.234,567.89
1,234.567.89
1,234,567,89
etc
I could probably sort this in C# but I'm sure that a regex could do it. Any regex-wizards out there that can help?
UPDATE:
I realise this is a pretty dumb question, as a regex that catches all of these is straightforward; what matters is how I then choose to interpret the match, which will be done in C#. Thanks - sorry to waste your time on this!
I will mark Dmitry's answer as accepted, as it is close to what I was looking for. Thank you.
Please notice, that there's ambiguity since:
123,456 // thousand separator
123.456 // decimal separator
are both possible (123456 and 123.456). However, we can detect some cases:
Too many decimal separators 123.456.789
Wrong order 123.456,789
Wrong digits count 123,45
So we can set up a rule: a separator can be the decimal one only if it is the last one and is not followed by exactly three digits (see the ambiguity above); all other separators should be treated as thousand separators:
1?234?567?89
 ^   ^   ^
 |   |   the last one, followed by two digits (not three), thus decimal
 |   not the last one, thus thousand
 not the last one, thus thousand
Now let's implement a routine
// Requires: using System; using System.Linq;
private static String ClearUp(String value) {
  String[] chunks = value.Split(',', '.');

  // No separators
  if (chunks.Length <= 1)
    return value;

  // Let's look at the last chunk:
  // definitely decimal separator (e.g. "123,45")
  if (chunks[chunks.Length - 1].Length != 3)
    return String.Concat(chunks.Take(chunks.Length - 1)) +
           "." +
           chunks[chunks.Length - 1];

  // may be decimal or thousand
  if (value[value.Length - 4] == ',')
    return String.Concat(chunks);
  else
    return String.Concat(chunks.Take(chunks.Length - 1)) +
           "." +
           chunks[chunks.Length - 1];
}
Now let's try some tests:
String[] data = new String[] {
  // your tests
  "1.234,567.89",
  "1,234.567.89",
  "1,234,567,89",
  // my tests
  "123,456", // "," should be left intact, i.e. thousand separator
  "123.456", // "." should be left intact, i.e. decimal separator
};

String report = String.Join(Environment.NewLine, data
  .Select(item => String.Format("{0} -> {1}", item, ClearUp(item))));

Console.Write(report);
the outcome is
1.234,567.89 -> 1234567.89
1,234.567.89 -> 1234567.89
1,234,567,89 -> 1234567.89
123,456 -> 123456
123.456 -> 123.456
Try this Regex:
\b[\.,\d][^\s]*\b
\b = Word boundaries
containing: . or comma or digits
Not containing spaces
Responding to update/comments: you do not need regex to do this. Instead, if you can isolate the number string from the surrounding spaces, you can pull it into a string-array using Split(',','.'). Based on the logic you outlined above, you could then use the last element of the array as the fractional part, and concatenate the first elements together for the whole part. (Actual code left as an exercise...) This will even work if the ambiguous-dot-or-comma is the last character in the string: the last element in the split-array will be empty.
Caveat: This will only work if there is always a decimal point--otherwise, you would not be able to differentiate logically between a thousands-place comma and a decimal with thousandths.
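The split-based idea described above could be sketched roughly as follows. This is only an illustration, written with C# 9 top-level statements; NormalizeOcrNumber is a hypothetical helper name, and it assumes (as the caveat says) that the number always contains a decimal separator.

```csharp
using System;

// Hypothetical helper (not from the original answer): everything after the
// last ',' or '.' is taken as the fractional part, and all earlier chunks
// are joined to form the whole part.
string NormalizeOcrNumber(string raw)
{
    string[] parts = raw.Split(',', '.');
    if (parts.Length == 1)
        return raw; // no separator at all, nothing to repair

    string whole = string.Join("", parts, 0, parts.Length - 1);
    string fraction = parts[parts.Length - 1];

    // A trailing separator leaves an empty last element: keep the whole part only.
    return fraction.Length == 0 ? whole : whole + "." + fraction;
}

foreach (string sample in new[] { "1.234,567.89", "1,234.567.89", "1,234,567,89", "1,234," })
    Console.WriteLine($"{sample} -> {NormalizeOcrNumber(sample)}");
```

The first three samples come out as 1234567.89 and the last one as 1234. A value such as 123,456 would be turned into 123.456, which is exactly the ambiguity the caveat warns about.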
I am trying to redact some pdfs with dollar amounts using c#. Below is what I have tried
@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"
Here are some test cases that it needs to match
76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000
Here are some test cases that it should not target
666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717
Please note the c# portion is interfacing with iText 7 pdfSweep.
Guid g = Guid.NewGuid();
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid + ".pdf";
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
strategy.Add(new RegexBasedCleanupStrategy(regex));
PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Please share your wisdom
You may use
\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?
Or, if the prices occur on whole lines:
^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$
See the regex demo
Bonus: To obtain only price values, you need to remove the ? after \$ to make it obligatory:
\$([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)
(I added a capturing group in case you need to access the number value separately from the $ char).
If you need to support any currency char, not just $, replace \$ with \p{Sc}.
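As a rough illustration of the \p{Sc} variant with the capturing group, here is a small sketch using C# 9 top-level statements. The sample line and its amounts are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample text; \u20AC is the euro sign, which belongs to \p{Sc}.
string line = "Unit price \u20AC1,800.50, total $99,249.25, part no. 6412K42";

// A currency symbol is required; group 1 holds the number without it.
var priced = new Regex(@"\p{Sc}([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)");

foreach (Match m in priced.Matches(line))
    Console.WriteLine($"{m.Value} -> {m.Groups[1].Value}");
```

Only the two amounts preceded by a currency symbol are matched; the part number is skipped because no currency symbol comes before it.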
Details
^ - start of string
\$? - an optional dollar symbol
[0-9]{1,3} - one to three digits
(?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
(?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
$ - end of string.
C# check for a match:
if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"))
{
// there is a match
}
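To get a feel for the anchored pattern, the following sketch (C# 9 top-level statements) runs it over a handful of the sample values from the question. The two array names are just illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var price = new Regex(@"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$");

// A few of the values the question wants matched ...
string[] shouldMatch = { "76,249.25", "7.09", "32.848", "0.009", "$9.0000", "$1,000,000" };
// ... and a few it wants left alone.
string[] shouldNotMatch = { "666-257-6443", "F1A 5G9", "2019.07.05", "6412K42", "280160717" };

foreach (string s in shouldMatch)
    Console.WriteLine($"{s,-14} match: {price.IsMatch(s)}");
foreach (string s in shouldNotMatch)
    Console.WriteLine($"{s,-14} match: {price.IsMatch(s)}");
```

The first group reports True and the second False. Note that an unformatted amount with more than three leading digits, such as 1234.50, is rejected too, because the pattern insists on comma grouping.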
pdfSweep notice:
Apply the fix from this answer. The point is that the line breaks are lost when parsing the text. The regex you need then is
@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"
where (?m) makes ^ and $ match start/end of lines and \r? is required as $ only matches before LF, not before CRLF in .NET regex.
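The effect of \r? can be seen with a small sketch (C# 9 top-level statements). The sample text is an assumption: three lines with Windows (CRLF) line endings.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample text with CRLF line endings.
string text = "7.09\r\nF1A 5G9\r\n$1,800.0000\r\n";

string withoutCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
string withCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$";

int countWithoutCr = Regex.Matches(text, withoutCr).Count;
int countWithCr = Regex.Matches(text, withCr).Count;

Console.WriteLine($"without \\r?: {countWithoutCr} match(es)");
Console.WriteLine($"with \\r?:    {countWithCr} match(es)");

// The matched value includes the trailing \r, so trim it before use.
foreach (Match m in Regex.Matches(text, withCr))
    Console.WriteLine(m.Value.TrimEnd('\r'));
```

Without \r? nothing matches, because each amount is followed by \r rather than \n; with it, both amounts are found.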
I'm new to regex and I would like to know how to detect numbers with a regex in C#, so that they are always displayed in the format: #,###
Ex : 2 000,000 into 2,000
Ex : 15 000.000 into 15,000
Ex : 6.700 into 6,700
Ex : .3.3.3 into 0,300
These are some examples that I'm doing for validation
As the comments suggest, the question is not very clear.
To get your examples working, you can use e.g.
(?:(?<int>\d+)[ .,]?|[.,])
(?<frac>\d+)?
(?:[ .,]\d+)*
to match the "integer part" and the "fractional part" divided by ., , or a space (weird, but that is what I read out of your examples: since 15 000.000 => 15,000 and 6.700 => 6,700, I assume a comma separator everywhere).
I'm pretty sure I did not get it right! At least not entirely. The examples you provide look like numbers with different thousands separators, but there seems to be no system to them.
However, this is what you match with the regular expression above:
int | sep | frac | anything else
----+-----+------+--------------
2   | ' ' | 000  | ,000
15  | ' ' | 000  | .000
6   | .   | 700  |
    | .   | 3    | .3.3
In addition, it matches numbers without fractional part.
In Detail
(?:(?<int>\d+)[ .,]?|[.,])
Match digits (one or more) and store them in a group named int. Match an optional space, . or , thereafter.
OR
Match . or ,.
(?<frac>\d+)?
Optionally match the fraction part (one or more digits).
(?:[ .,]\d+)*
Match a space, . or , followed by one or more digits (repeat this zero or more times).
This last part prevents the trailing pieces of e.g. .3.3.3 from being matched in subsequent calls.
Next
Then you can use a MatchEvaluator-Function (here in form of a delegate) to replace the values.
var rx = new Regex(@"
    (?:(?<int>\d+)[ .,]?|[.,])
    (?<frac>\d+)?
    (?:[ .,]\d+)*
    ",
    RegexOptions.IgnorePatternWhitespace
);
var deDE = new System.Globalization.CultureInfo("de-DE");
text = rx.Replace(text, delegate(Match match) {
    int integral;
    int fraction;
    int fraclen = match.Groups["frac"].Length;
    int.TryParse(match.Groups["int"].Value, out integral);
    int.TryParse(match.Groups["frac"].Value, out fraction);
    var val = integral + fraction / Math.Pow(10, fraclen);
    return String.Format(deDE, "{0:0.000}", val);
});
The function is called for every match. Inside, I read out the groups, convert them into integers and then create the matched value with integral + fraction / Math.Pow(10, fraclen) (integral part + fraction part divided by 10^len where len is the string-length of the fraction part, thus "70" becomes 0.7 by calculating 70/10^2 == 70/100 == 0.7).
At the end, I return String.Format with CultureInfo de-DE. This is done because in Germany , is used as the decimal separator. There are other cultures that do this too, and there are many other ways to output such a number.
This is just an example.
Trying to find the last instance of numbers after last dash in a string so
test-123-2-456 would return 456
123-test would return ""
123-test-456 would return 456
123-test-456sdfsdf would return 456
123-test-asd456 would return 456
The expression, @"[^-]*$", does not match the numbers though, and I have tried using [\d] but to no avail.
Sure, the simplest solution would be something like this:
(\d+)[^-]*$
This will match one or more digits, captured in group 1, followed by zero or more of any character other than a hyphen, followed by the end of the string. In other words, it will match any sequence of digits as long as there are no hyphens between that sequence and the end of the string. You then just have to extract group 1 from the match. For example:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach (var str in inputs)
{
    var m = Regex.Match(str, @"(\d+)[^-]*$");
    Console.WriteLine("{0} --> {1}", str, m.Groups[1].Value);
}
Produces:
test-123-2-456 --> 456
123-test -->
123-test-456 --> 456
123-test-456sdfsdf --> 456
123-test-asd456 --> 456
Alternatively, you could use a negative lookahead like this:
\d+(?!.*-)
This will match one or more digit characters so long as no hyphen appears anywhere after them in the string. Only the digits will be included in the match.
Note that these two options behave differently if there are two or more sets of numbers after the last -, e.g. foo-123bar456. In this case it's not entirely clear what you want to happen, but the first pattern will simply match everything starting from the first sequence of digits to the end (123bar456) with group 1 only containing the first sequence of digits (123). If you'd like to change this so that it only captures the last sequence of digits, place a \d inside the character class (i.e. (\d+)[^\d-]*$). The second pattern would produce a separate match for each sequence of digits (in this example, 123 and 456), but the Regex.Match method will only give you the first match.
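The difference between the patterns on foo-123bar456 can be checked with a short sketch (C# 9 top-level statements; the variable names are only illustrative):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "foo-123bar456";

// Group 1 of the first pattern: the first digit run after the last hyphen.
string firstGroup = Regex.Match(input, @"(\d+)[^-]*$").Groups[1].Value;

// With \d added to the character class: the last digit run.
string lastGroup = Regex.Match(input, @"(\d+)[^\d-]*$").Groups[1].Value;

// The lookahead pattern yields one match per digit run; Regex.Matches returns all of them.
string[] lookaheadMatches = Regex.Matches(input, @"\d+(?!.*-)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(firstGroup);
Console.WriteLine(lastGroup);
Console.WriteLine(string.Join(", ", lookaheadMatches));
```

This prints 123, then 456, then "123, 456".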
I suggest to apply two regex-functions. Take the result of the first one as the input for the second one.
The first regex is:
-[0-9]+[^-]*$ // Take the last piece of your string led by a minus (-)
// followed by digits ([0-9]+)
// and an optional ugly rest that doesn't contain another minus ([^-]*$)
The second regex is:
-[0-9]+ // Separate the relevant digits from the ugly rest
// You know that there can only be one minus + digits part in it
Tested here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
The latest group from this RegEx can get the last number for you:
[^-A-z][0-9]+[^A-z]
If you are looking at groups, you can write this code by matching groups to get the latest number:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach (var str in inputs)
{
    var m = Regex.Match(str, @"([0-9]*)");
    if (m.Groups.Count > 1) //This will avoid the values starting with numbers only.
        Console.WriteLine("{0} --> {1}", str, m.Groups[m.Groups.Count - 1].Value);
}
I'm using a regular expression to get values such as (16.00 + 28.66 = 44.66) as 44.66 and (26.00) as 26.00.
I have trouble displaying the value when it is just (99), which should come out as 99 without any decimals.
This is the code I have used so far:
string amount = DropDownList1.SelectedItem.Text;
Regex regex = new Regex("(\\d+\\.\\d{2})(?=\\))",
    RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);
Could someone please tell me how I can display a value without any decimals?
E.g. (99) as 99
Does your drop down list contain values like these?
(20.01 + 20.01 = 40.02)
(40.02)
(40)
If yes, you can try this Regular Expression
(\\d+(\\.\\d{2})?)(?=\\))
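A quick way to try this pattern against the values from the question is sketched below (C# 9 top-level statements). It uses a verbatim string, so the backslashes are not doubled.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, written as a verbatim string.
var amount = new Regex(@"(\d+(\.\d{2})?)(?=\))");

foreach (string item in new[] { "(16.00 + 28.66 = 44.66)", "(26.00)", "(99)" })
    Console.WriteLine($"{item} -> {amount.Match(item).Groups[1].Value}");
```

The three items give 44.66, 26.00 and 99: the lookahead only accepts a number that sits directly before the closing parenthesis, and the decimal part is optional.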
You can do it without Regex by using a ToString format that fixes the number of decimal places, so that 99 becomes 99.00. You can read more about this in the documentation on custom numeric format strings.
string formatedNum = double.Parse(DropDownList1.SelectedItem.Text).ToString(".00");
The "0" custom format specifier serves as a zero-placeholder symbol.
If the value that is being formatted has a digit in the position where
the zero appears in the format string, that digit is copied to the
result string; otherwise, a zero appears in the result string. The
position of the leftmost zero before the decimal point and the
rightmost zero after the decimal point determines the range of digits
that are always present in the result string, MSDN.
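A minimal sketch of the formatting step follows (C# 9 top-level statements). It assumes the numeric text has already been isolated from the parentheses; Format2 is a hypothetical helper name, and the invariant culture and the "0.00" format are choices made for this example rather than part of the answer above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parses an already isolated number such as "99" or
// "44.66" and formats it with exactly two decimal places.
// The invariant culture makes '.' the decimal separator on every machine.
string Format2(string numericText) =>
    decimal.Parse(numericText, CultureInfo.InvariantCulture)
           .ToString("0.00", CultureInfo.InvariantCulture);

Console.WriteLine(Format2("99"));
Console.WriteLine(Format2("44.66"));
Console.WriteLine(Format2("0.5"));
```

This prints 99.00, 44.66 and 0.50. The zero before the decimal point in "0.00" is what keeps the leading 0 for values below 1, in line with the quoted description of the "0" specifier.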
My requirement is that the first two digits of the entered number must be in the range 00-32.
How can I check this through regex in C#?
I could not figure it out!
Do you really need a regex?
int val;
if (Int32.TryParse("00ABFSSDF".Substring(0, 2), out val))
{
    if (val >= 0 && val <= 32)
    {
        // valid
    }
}
Since this is almost certainly a learning exercise, here are some hints:
Your regex will be an "OR" | of two parts, both validating the first two characters
The first expression part will match if the first character is a digit 0..2, and the second character is a digit 0..9
The second expression part will match if the first character is digit 3, and the second character is a digit 0..2
To match a range of digits, use [A-B] range, where A is the lower and B is the upper bound for the digits to match (both bounds are inclusive).
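Put together, the hints above lead to something like the following sketch (C# 9 top-level statements). The pattern is one possible reading of the hints: it is anchored at the start only, so it checks just the first two characters.

```csharp
using System;
using System.Text.RegularExpressions;

// First alternative: 00..29, second alternative: 30..32.
// The group keeps ^ applying to both alternatives.
var firstTwoDigits = new Regex(@"^(?:[0-2][0-9]|3[0-2])");

foreach (string s in new[] { "00", "29", "32", "33", "3", "1234" })
    Console.WriteLine($"{s} -> {firstTwoDigits.IsMatch(s)}");
```

Here 00, 29, 32 and 1234 are accepted, while 33 and the single digit 3 are rejected.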
Try something like
Regex reg = new Regex(@"^([0-2]?[0-9]|3[0-2])$");
Console.WriteLine(reg.IsMatch("00"));
Console.WriteLine(reg.IsMatch("22"));
Console.WriteLine(reg.IsMatch("33"));
Console.WriteLine(reg.IsMatch("42"));
The [0-2]?[0-9] matches all numbers from zero to 29 and the 3[0-2] matches 30-32.
This will validate number from 0 to 32, and also allows for numbers with leading zero, eg, 08.
You should divide the range as in:
^([012]\d|3[012])
if (Regex.IsMatch("123456789", "^([0-2][0-9]|3[0-2])"))
    // match