C# - Getting multiple values with a single key, from a text file - c#

I store multiple values that share a single key in a text file. The text file looks like this:
Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto
Now I want to store the int value and the string value of each key (Brightness, for example).
I'm new to C# and couldn't find anything that works yet.
Thanks

I'd recommend using custom types like these to store the settings:
public enum DisplaySettingType
{
    Manual, Auto, None
}
public class DisplaySetting
{
    public string Name { get; set; }
    public decimal Value { get; set; }
    public DisplaySettingType Type { get; set; }
}
Then you could use the following LINQ query, which uses string.Split, to get all settings:
decimal value = 0;
DisplaySettingType type = DisplaySettingType.None;
IEnumerable<DisplaySetting> settings = File.ReadLines(path)
    .Select(l => l.Trim().Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries))
    // TryParse writes into the captured locals, which the final Select then reads
    .Where(arr => arr.Length >= 3 && decimal.TryParse(arr[1], out value) && Enum.TryParse(arr[2], out type))
    .Select(arr => new DisplaySetting { Name = arr[0], Value = value, Type = type });
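If the goal is to look the values up by key, the parsed lines can also go into a dictionary. This is only a minimal sketch, assuming the file format shown in the question; the names SettingsParser and Parse and the tuple layout are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class SettingsParser
{
    // Parses lines such as "Brightness 36 , Manual" into key -> (number, mode).
    // Lines that do not have exactly three tokens or a numeric second token are skipped.
    public static Dictionary<string, (int Value, string Mode)> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, (int Value, string Mode)>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 3 && int.TryParse(parts[1], out int number))
            {
                result[parts[0]] = (number, parts[2]);
            }
        }
        return result;
    }

    public static void Main()
    {
        var settings = Parse(new[] { "Brightness 36 , Manual", "WhiteBalance 5450 , Auto" });
        var brightness = settings["Brightness"];
        Console.WriteLine(brightness.Value + " " + brightness.Mode); // 36 Manual
    }
}
```

For a real file you would pass File.ReadLines(path) instead of the hardcoded array.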

With a regex and a little bit of LINQ you can do many things.
Here I assume you know how to read a text file.
Pros: if the file is not perfect, the regex simply ignores the misformatted lines and does not throw an error.
Here is a hardcoded version of your file. Note that a \r will appear in the matches because of it, depending on the way you read your file, but that should not be the case with File.ReadLines().
string input =
@"Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto";
string regEx = @"(.*) (\d+) , (.*)";
var RegexMatch = Regex.Matches(input, regEx).Cast<Match>();
var outputlist = RegexMatch.Select(x => new { setting = x.Groups[1].Value
                                            , value = x.Groups[2].Value
                                            , mode = x.Groups[3].Value });
Regex explanation: /(.*) (\d+) , (.*)/g
1st Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
the space character matches itself literally (case sensitive)
2nd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
, matches the characters , literally (case sensitive)
3rd Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Disclaimer:
Never trust an input! Even if it's a file some other program wrote, or one sent by a customer.
From my experience, you then have two ways of handling a bad format:
Read line by line, and register every bad line.
Or ignore them. You don't fit, you don't sit!
And don't tell yourself it won't happen, it will!
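The first option (read line by line and register every bad line) could look roughly like this. It is a sketch, not a definitive implementation: LineValidator and FindBadLines are hypothetical names, and the pattern is a slightly stricter, anchored variant of the one above (\S+ instead of .*), which is my own assumption about what a valid line is.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineValidator
{
    // Same shape as the pattern above, anchored so that the whole line must fit.
    private static readonly Regex LinePattern = new Regex(@"^(\S+) (\d+) , (\S+)$");

    // Returns the 1-based numbers of the lines that do not match the expected format.
    public static List<int> FindBadLines(IEnumerable<string> lines)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (!LinePattern.IsMatch(line.Trim()))
            {
                bad.Add(lineNumber);
            }
        }
        return bad;
    }

    public static void Main()
    {
        var bad = FindBadLines(new[] { "Gain 5 , Manual", "Gamma oops", "Hue 0 , Manual" });
        Console.WriteLine(string.Join(",", bad)); // 2
    }
}
```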

Related

Extracting dollar prices and numbers with comma as thousand separator from PDF converted to text format

I am trying to redact some PDFs with dollar amounts using C#. Below is what I have tried:
@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"
Here are some test cases that it needs to match
76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000
Here are some test cases that it should not target
666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717
Please note the c# portion is interfacing with iText 7 pdfSweep.
Guid g = Guid.NewGuid(); // new Guid() would always yield the all-zero GUID
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid + ".pdf";
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
strategy.Add(new RegexBasedCleanupStrategy(regex));
PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Please share your wisdom
You may use
\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?
Or, if the prices occur on whole lines:
^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$
See the regex demo
Bonus: To obtain only price values, you need to remove the ? after \$ to make it obligatory:
\$([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)
(I added a capturing group in case you need to access the number value separately from the $ char).
If you need to support any currency char, not just $, replace \$ with \p{Sc}.
Details
^ - start of string
\$? - an optional dollar symbol
[0-9]{1,3} - one to three digits
(?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
(?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
$ - end of string.
C# check for a match:
if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"))
{
    // there is a match
}
pdfSweep notice:
Apply the fix from this answer. The point is that the line breaks are lost when parsing the text. The regex you need then is
@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"
where (?m) makes ^ and $ match start/end of lines and \r? is required as $ only matches before LF, not before CRLF in .NET regex.
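To check the whole-line pattern against a few of the test cases from the question without involving pdfSweep, something like the following could be used. It is a plain .NET sketch; PriceCheck and IsPrice are illustrative names, and each sample is tested as a separate string, so no (?m) or \r? is needed here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PriceCheck
{
    // Whole-line variant of the pattern from the answer.
    private static readonly Regex Price =
        new Regex(@"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$");

    public static bool IsPrice(string text)
    {
        return Price.IsMatch(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "76,249.25", "$1,000,000", "0.009",        // should match
            "666-257-6443", "2019.07.05", "6412K42"    // should not match
        };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + IsPrice(s));
        }
    }
}
```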

Regex for decimal number dot instead of comma (.NET)

I am using regex to parse data from an OCR'd document and I am struggling to match the scenarios where a 1000s comma separator has been misread as a dot, and also where the dot has been misread as a comma!
So if the true value is 1234567.89 printed as 1,234,567.89 but being misread as:
1.234,567.89
1,234.567.89
1,234,567,89
etc
I could probably sort this in C# but I'm sure that a regex could do it. Any regex-wizards out there that can help?
UPDATE:
I realise this is a pretty dumb question as the regex is pretty straight forward to catch all of these, it is then how I choose to interpret the match. Which will be in C#. Thanks - sorry to waste your time on this!
I will mark the answer to Dmitry as it is close to what I was looking for. Thank you.
Please note that there's an ambiguity, since
123,456 // thousand separator
123.456 // decimal separator
are both possible (123456 and 123.456). However, we can detect some cases:
Too many decimal separators 123.456.789
Wrong order 123.456,789
Wrong digits count 123,45
So we can set up a rule: a separator can be the decimal one if it's the last one and is not followed by exactly three digits (see the ambiguity above); all the other separators should be treated as thousand separators:
1?234?567?89
 ^   ^   ^
 |   |   the last one, followed by two digits (not three), thus decimal
 |   not the last one, thus thousand
 not the last one, thus thousand
Now let's implement a routine
private static String ClearUp(String value) {
    String[] chunks = value.Split(',', '.');
    // No separators
    if (chunks.Length <= 1)
        return value;
    // Let's look at the last chunk
    // definitely decimal separator (e.g. "123,45")
    if (chunks[chunks.Length - 1].Length != 3)
        return String.Concat(chunks.Take(chunks.Length - 1)) +
               "." +
               chunks[chunks.Length - 1];
    // may be decimal or thousand
    if (value[value.Length - 4] == ',')
        return String.Concat(chunks);
    else
        return String.Concat(chunks.Take(chunks.Length - 1)) +
               "." +
               chunks[chunks.Length - 1];
}
Now let's try some tests:
String[] data = new String[] {
    // your tests
    "1.234,567.89",
    "1,234.567.89",
    "1,234,567,89",
    // my tests
    "123,456", // "," should be left intact, i.e. thousand separator
    "123.456", // "." should be left intact, i.e. decimal separator
};
String report = String.Join(Environment.NewLine, data
    .Select(item => String.Format("{0} -> {1}", item, ClearUp(item))));
Console.Write(report);
the outcome is
1.234,567.89 -> 1234567.89
1,234.567.89 -> 1234567.89
1,234,567,89 -> 1234567.89
123,456 -> 123456
123.456 -> 123.456
Try this Regex:
\b[\.,\d][^\s]*\b
\b = Word boundaries
containing: . or comma or digits
Not containing spaces
Responding to update/comments: you do not need regex to do this. Instead, if you can isolate the number string from the surrounding spaces, you can pull it into a string-array using Split(',','.'). Based on the logic you outlined above, you could then use the last element of the array as the fractional part, and concatenate the first elements together for the whole part. (Actual code left as an exercise...) This will even work if the ambiguous-dot-or-comma is the last character in the string: the last element in the split-array will be empty.
Caveat: This will only work if there is always a decimal point--otherwise, you would not be able to differentiate logically between a thousands-place comma and a decimal with thousandths.
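The code left as an exercise might look roughly like the sketch below. SeparatorFix and Normalize are made-up names, and the method deliberately inherits the caveat: it assumes the input always has a fractional part, so the last separator is always taken as the decimal point.

```csharp
using System;
using System.Linq;

public static class SeparatorFix
{
    // Treats the last separator as the decimal point and drops all earlier ones.
    // Assumes the input always has a fractional part (see the caveat above).
    public static string Normalize(string number)
    {
        string[] parts = number.Split(',', '.');
        if (parts.Length == 1)
        {
            return number; // no separator at all
        }
        string whole = string.Concat(parts.Take(parts.Length - 1));
        string fraction = parts[parts.Length - 1];
        return whole + "." + fraction;
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("1.234,567.89")); // 1234567.89
        Console.WriteLine(Normalize("1,234,567,89")); // 1234567.89
    }
}
```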

Regex for matching season and episode

I'm making a small app for myself, and I want to find strings that match a pattern, but I could not find the right regular expression.
Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi
That is an example of the strings I have, and I only want to know whether it contains a substring of the form S[NUMBER]E[NUMBER], with each number at most 2 digits long.
Can you give me a clue?
Regex
Here is the regex using named groups:
S(?<season>\d{1,2})E(?<episode>\d{1,2})
Usage
Then, you can get named groups (season and episode) like this:
string sample = "Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi";
Regex regex = new Regex(@"S(?<season>\d{1,2})E(?<episode>\d{1,2})");
Match match = regex.Match(sample);
if (match.Success)
{
    string season = match.Groups["season"].Value;
    string episode = match.Groups["episode"].Value;
    Console.WriteLine("Season: " + season + ", Episode: " + episode);
}
else
{
    Console.WriteLine("No match!");
}
Explanation of the regex
S // match 'S'
( // start of a capture group
?<season> // name of the capture group: season
\d{1,2} // match 1 to 2 digits
) // end of the capture group
E // match 'E'
( // start of a capture group
?<episode> // name of the capture group: episode
\d{1,2} // match 1 to 2 digits
) // end of the capture group
There's a great online test site here: http://gskinner.com/RegExr/
Using that, here's the regex you'd want:
S\d\dE\d\d
You can do lots of fancy tricks beyond that though!
Take a look at some of the media software like XBMC they all have pretty robust regex filters for tv shows
See here, here
The regex I would put for S[NUMBER1]E[NUMBER2] is
S(\d\d?)E(\d\d?) // (\d\d?) means one or two digit
You can get NUMBER1 by <matchresult>.group(1), NUMBER2 by <matchresult>.group(2).
I would like to propose a slightly more complex regex. It does not handle ". : - _"
because I replace them with spaces:
str_replace(
array('.', ':', '-', '_', '(', ')'), ' ',
This is the capturing regex that splits the title into title, season and episode:
(.*)\s(?:s?|se)(\d+)\s?(?:e|x|ep)\s?(\d+)
e.g. Da Vinci's Demons se02ep04 and variants
https://regex101.com/r/UKWzLr/3
The only case that I can't cover is a space between the season marker and the number, because the letter s or se then becomes part of the title, which does not work for me. I haven't seen such a case, but it is still an issue.
Edit:
I managed to get around it with a second line:
$title = $matches[1];
$title = preg_replace('/(\ss|\sse)$/i', '', $title);
This way I remove a trailing ' s' or ' se' from the title when the name is part of a series.

Regex for checking for compass directions

I'm looking to match the 8 main directions as might appear in a street or location prefix or suffix, such as:
N Main
south I-22
124 Grover Ave SE
This is easy to code using a brute force list of matches and cycle through every match possibility for every street address, matching once with a start-of-string anchor and once with a end-of-string anchor. My blunt starting point is shown farther down, if you want to see it.
My question is if anyone has some clever ideas for compact, fast-executing patterns to accomplish the same thing. You can assume:
Compound directions always start with the north / south component. So I need to match South East but not EastSouth
The pattern should not match [direction]-ern words, like "Northern" or "Southwestern"
The match will always be at the very beginning or very end of the string.
I'm using C#, but I'm just looking for a pattern so I'm not emphasizing the language. /s(outh)?/ is just as good as @"s(outh)?" for me or future readers.
SO emphasizes real problems, so FYI this is one. I'm parsing a few hundred thousand nasty, unvalidated user-typed address strings. I want to check if the start or end of the "street" field (which is free-form jumble of PO boxes, streets, apartments, and straight up invalid junk) begins or ends with a compass direction. I'm trying to deconstruct these free form strings to find similar addresses which may be accidental or intentional variations and obfuscations.
My blunt attempt
Core pattern: /n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)/
In a function:
public static Tuple<Match, Match> MatchDirection(String value) {
    string patternBase = @"n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)";
    Match[] matches = new Match[2];
    string[] compassPatterns = new[] { @"^(" + patternBase + @")\b", @"\b(" + patternBase + @")$" };
    for (int i = 0; i < 2; i++) { matches[i] = Regex.Match(value, compassPatterns[i], RegexOptions.IgnoreCase); }
    return new Tuple<Match, Match>(matches[0], matches[1]);
}
In use, where sourceDt is a table with all the addresses:
var parseQuery = sourceDt.AsEnumerable()
.Select((DataRow row) => {
string addr = ((string)row["ADDR_STREET"]).Trim();
Tuple<Match, Match> dirMatches = AddressParser.MatchDirection(addr);
return new string[] { addr, dirMatches.Item1.Value, dirMatches.Item2.Value };
})
Edit: Actually, this is probably the wrong answer, so I'm keeping it only so that people don't suggest the same thing. Figuring out the tokenization for "South East" is a task in itself. I also still doubt that a regex will be very usable.
Original answer:
Don't... your initial regex attempt is already unreadable.
A dictionary lookup for each word of the tokenized string (the "brute force approach") already gives you time linear in the length of the string and constant time per word, and it is very easy to extend with new words.
(^[nesw][^n\s]*)|([nesw][^n\s]*$)
So this will match a line that:
begins or ends with a word that:
Begins with a cardinal direction
Doesn't have an n otherwise in it (to get rid of the "-ern"s)
Perl/PCRE compatible expression:
(?xi)
(^)?
\b
(?:
n(?:orth)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
s(?:outh)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
e(?:ast)?
|
w(?:est)?
)
\b
(?(1)|$)
I think C# supports all the features used here.
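If you would rather not rely on the conditional group in .NET, the core of that expression can also be reused with two separately anchored patterns, as in the question. The following is a sketch under that assumption, not the answer's own method; CompassMatcher and Find are hypothetical names, and the core alternation is the one above written on a single line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompassMatcher
{
    // Core alternation taken from the expression above, written on one line.
    private const string Core =
        @"(?:n(?:orth)?|s(?:outh)?)(?:\s*(?:e(?:ast)?|w(?:est)?))?|e(?:ast)?|w(?:est)?";

    private static readonly Regex Prefix =
        new Regex(@"^(?:" + Core + @")\b", RegexOptions.IgnoreCase);
    private static readonly Regex Suffix =
        new Regex(@"\b(?:" + Core + @")$", RegexOptions.IgnoreCase);

    // Returns the direction found at the start and at the end ("" when absent).
    public static (string Start, string End) Find(string street)
    {
        return (Prefix.Match(street).Value, Suffix.Match(street).Value);
    }

    public static void Main()
    {
        foreach (string s in new[] { "N Main", "south I-22", "124 Grover Ave SE", "Northern Blvd" })
        {
            var m = Find(s);
            Console.WriteLine("'" + s + "' -> start='" + m.Start + "', end='" + m.End + "'");
        }
    }
}
```

The trailing \b on the prefix pattern is what keeps "Northern" from matching, since no word boundary follows "N", "North" or "Northe".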

How does this regex find triangular numbers?

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.
The first few triangular numbers are:
1 = 1
3 = 1 + 2
6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5
There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:
Given n, we first create a string of length n filled with the same character
We then match this string against the pattern ^(\1.|^.)+$
n is triangular if and only if this pattern matches the string
Here are some snippets to show that this works in several languages:
PHP (on ideone.com)
$r = '/^(\1.|^.)+$/';
foreach (range(0,50) as $n) {
    if (preg_match($r, str_repeat('o', $n))) {
        print("$n ");
    }
}
Java (on ideone.com)
for (int n = 0; n <= 50; n++) {
    String s = new String(new char[n]);
    if (s.matches("(\\1.|^.)+")) {
        System.out.print(n + " ");
    }
}
C# (on ideone.com)
Regex r = new Regex(@"^(\1.|^.)+$");
for (int n = 0; n <= 50; n++) {
    if (r.IsMatch("".PadLeft(n))) {
        Console.Write("{0} ", n);
    }
}
So this regex seems to work, but can someone explain how?
Similar questions
How to determine if a number is a prime with regex?
Explanation
Here's a schematic breakdown of the pattern:
from beginning…
|         …to end
|         |
^(\1.|^.)+$
 \______/|___match
  group 1    one-or-more times
The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.
Group 1 tries to match this|that alternates:
\1., that is, what group 1 matched (self reference!), plus one of "any" character,
or ^., that is, just "any" one character at the beginning
Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:
"Try to match what I matched last time, plus one more. That's what I'll match this time."
Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.
So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.
Further explorations
Note that for simplification, we used strings that consist of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.
Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.
Both of these points are shown in the following C# snippet (also seen on ideone.com):
Regex r = new Regex(@"^(\1.|^.)+$");
Console.WriteLine(r.IsMatch("aababc")); // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False
for (int n = 0; n <= 50; n++) {
    Match m = r.Match("".PadLeft(n));
    if (m.Success) {
        Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
    }
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)
Flavor notes
Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).
In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.
Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.
Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.
One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.
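As a small illustration of that last point, the sketch below lists the length of every intermediate capture of group 1 through Group.Captures. TriangularCaptures and CaptureLengths are names made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TriangularCaptures
{
    // Lengths of every capture made by group 1, in matching order.
    // The array is empty when the pattern does not match.
    public static int[] CaptureLengths(int n)
    {
        Match m = Regex.Match(new string('o', n), @"^(\1.|^.)+$");
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Length).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" + ", CaptureLengths(10))); // 1 + 2 + 3 + 4
    }
}
```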
Related questions
(Java) method matches not work well - with examples on how to do prefix/suffix/infix matching
Is there a regex flavor that allows me to count the number of repetitions matched by * and + (.NET!)
Bonus material: Using regex to find powers of two!!!
With a very slight modification, you can use the same techniques presented here to find powers of two.
Here's the basic mathematical property that you want to take advantage of:
1 = 1
2 = (1) + 1
4 = (1+2) + 1
8 = (1+2+4) + 1
16 = (1+2+4+8) + 1
32 = (1+2+4+8+16) + 1
The solution is given below (but do try to solve it yourself first!!!!)
(see on ideone.com in PHP, Java, and C#):
^(\1\1|^.)*.$
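A quick way to try the pattern in C#, in the same spirit as the triangular-number snippets. This is only a sketch; PowerOfTwo and IsPowerOfTwo are illustrative names. Each repetition of group 1 doubles the previous capture (1, 2, 4, ...), and the final dot supplies the "+ 1" from the table above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PowerOfTwo
{
    private static readonly Regex Pattern = new Regex(@"^(\1\1|^.)*.$");

    // True when a string of n identical characters matches the pattern.
    public static bool IsPowerOfTwo(int n)
    {
        return Pattern.IsMatch(new string('o', n));
    }

    public static void Main()
    {
        var hits = Enumerable.Range(0, 51).Where(IsPowerOfTwo);
        Console.WriteLine(string.Join(" ", hits)); // 1 2 4 8 16 32
    }
}
```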
