C# regex to parse columns in a txt file

I have a text file that looks like this:
FieldA FieldB FieldC FieldD FieldE
001 中文 15% 语言
002 法文 20 12% 外文
003 英文 21 外文
004 西班牙语 10% 外文
So basically I have read the file in and split it into lines. Now I would like to use a regex to split each line into fields. As you can see, some fields in a column are actually empty, and the fields may not be fixed width, but they are separated by at least one whitespace character. Some fields contain Chinese characters.
How can I do this? Thanks.

string s = "001 中文 15% 语言";
Match m = Regex.Match(s,
    @"(?<A>\d*)\s*" +     // Field A: any number of digits
    @"(?<B>\p{L}*)\s*" +  // Field B: any number of letters
    @"(?<C>\d*)\s+" +     // Field C: any number of digits
    @"(?<D>(\d+%)?)\s*" + // Field D: one or more digits followed by '%', or nothing
    @"(?<E>\p{L}*)");     // Field E: any number of letters
string fieldA = m.Groups["A"].Value; // "001"
string fieldB = m.Groups["B"].Value; // "中文"
string fieldC = m.Groups["C"].Value; // ""
string fieldD = m.Groups["D"].Value; // "15%"
string fieldE = m.Groups["E"].Value; // "语言"
All fields are optional. If a field is not present, it will be captured as an empty string, like in fieldC above.
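As a rough sketch of how the pattern above could be applied to every line of the sample file: the helper name ParseLine and the tuple it returns are illustrative assumptions, not part of the answer, and the sample lines are the ones from the question with a single space between columns.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: splits one line into the five fields using the pattern above.
(string A, string B, string C, string D, string E) ParseLine(string line)
{
    Match m = Regex.Match(line,
        @"(?<A>\d*)\s*" +     // Field A: digits
        @"(?<B>\p{L}*)\s*" +  // Field B: letters (\p{L} also covers Chinese characters)
        @"(?<C>\d*)\s+" +     // Field C: digits
        @"(?<D>(\d+%)?)\s*" + // Field D: digits followed by '%', or nothing
        @"(?<E>\p{L}*)");     // Field E: letters
    return (m.Groups["A"].Value, m.Groups["B"].Value, m.Groups["C"].Value,
            m.Groups["D"].Value, m.Groups["E"].Value);
}

string[] lines =
{
    "001 中文 15% 语言",
    "002 法文 20 12% 外文",
    "003 英文 21 外文",
    "004 西班牙语 10% 外文",
};

foreach (string line in lines)
{
    var f = ParseLine(line);
    // Missing fields come back as empty strings.
    Console.WriteLine("A=[{0}] B=[{1}] C=[{2}] D=[{3}] E=[{4}]", f.A, f.B, f.C, f.D, f.E);
}
```

The pattern tells fields C and D apart only by the trailing '%', so a line that has a value in just one of the two columns still lands in the right group.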

/\s*(\d*)\s*([^\d\s]*)\s*(\d*)\s\s*(\d*%?)\s*([^\d\s]*)/
Here is a regex that will capture all of the content you want; apply it to each line.
\s* //any number of whitespace
(\d*) //any number of digits
\s* //any number of whitespace
([^\d\s]*) //any number of characters that aren't whitespace or digits
\s* //any number of whitespace
(\d*)\s //any number of digits with a space after it
\s* //any number of whitespace
(\d*%?) //any number of digits with an optional %
\s* //any number of whitespace
([^\d\s]*) //any number of characters that aren't whitespace or digits

Related

How to match different scenarios with regex in c# and groups

I want to match these different scenarios with a regex pattern. The main delimiter is #:
1234-1111-234.011#333 => [id = 1234-1111-234.011 and code =333]
whatever text before 1234-1111-234.011#333 => [textb=whatever text before, id = 1234-1111-234.011 , code =333, texta=""]
1234-1111-234.011#333 whatever text after => [ textb="" id = 1234-1111-234.011 ,code =333 , texta=whatever text after]
Text can appear at the beginning, at the end, or both.
In every case the code can also carry a postfix letter W, like 1234-1111-234.011#333W => code=333E
textb = text with a maximum length of 15 characters; letters and numbers only.
id = 17 characters long, in the format XXXX-XXXX-XXX.XXX.
code = 3 or 4 characters long, depending on whether the W letter is present.
texta = text with a maximum length of 15 characters; letters and numbers only.
I am trying to match these scenarios with this pattern and these groups:
pattern ="
?(<textb>[\w\s]{15})#
?(<id>[\d\s]{17,17})#
(?<code>([A-Z]{0,1}\d{2,3}))
(?<wo>[W]{1})
?(<texta>[\w\s]{15})"
and
var textb = Regex.Match(mytext, pattern).Groups["textb"].Value;
var id= Regex.Match(mytext, pattern).Groups["id"].Value;
var code= Regex.Match(mytext, pattern).Groups["code"].Value;
var wo= Regex.Match(mytext, pattern).Groups["wo"].Value;
var texta= Regex.Match(mytext, pattern).Groups["texta"].Value;
A full example is "This is before text 234-1111-234.011#333E This is next text"
It does not match at all.
You could do it with one regular expression and then use Groups to get the parts you need.
void Main()
{
    var input = "Before text 1234-1111-234.011#333E After text";
    var pattern = @"(?<btext>[\w ]{0,15})(?<id>[\d\-\.]{17})#(?<code>[\d]{2,3})(?<wo>[A-Z]?)(?<atext>[\w ]{0,15})";
    var matches = Regex.Match(input, pattern);
    var btext = matches.Groups["btext"];
    var wo = matches.Groups["wo"];
    Console.WriteLine(btext.Value);
    Console.WriteLine(wo.Value);
    // etc.
}
(?<btext>[\w ]{0,15}) // Match letters, numbers and spaces, minimum 0 chars, maximum 15 chars
(?<id>[\d\-\.]{17}) // match numbers, '-' and '.'. Must be 17 chars
# // Match pound sign
(?<code>[\d]{2,3}) // Match numbers 2 or 3 chars long
(?<wo>[A-Z]?) // Match optional letter after code
(?<atext>[\w ]{0,15}) // Match letters, numbers and spaces, minimum 0 chars, maximum 15 chars
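A small sketch of reading all five groups from a single match, instead of calling Regex.Match once per group as in the question. The helper name ParseEntry and the tuple field names are assumptions made for illustration; the pattern is the one explained above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer above.
(string Before, string Id, string Code, string Suffix, string After) ParseEntry(string input)
{
    Match m = Regex.Match(input,
        @"(?<btext>[\w ]{0,15})(?<id>[\d\-\.]{17})#(?<code>[\d]{2,3})(?<wo>[A-Z]?)(?<atext>[\w ]{0,15})");
    // Trim() drops the space that separates the free text from the id and the code.
    return (m.Groups["btext"].Value.Trim(), m.Groups["id"].Value, m.Groups["code"].Value,
            m.Groups["wo"].Value, m.Groups["atext"].Value.Trim());
}

string[] samples =
{
    "1234-1111-234.011#333",
    "Before text 1234-1111-234.011#333E After text",
};

foreach (string sample in samples)
{
    var e = ParseEntry(sample);
    Console.WriteLine("before=[{0}] id=[{1}] code=[{2}] suffix=[{3}] after=[{4}]",
        e.Before, e.Id, e.Code, e.Suffix, e.After);
}
```

Because the free-text groups allow zero characters, the same call covers inputs with text before, after, both, or neither.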

Extracting dollar prices and numbers with comma as thousand separator from PDF converted to text format

I am trying to redact some PDFs with dollar amounts using C#. Below is what I have tried:
@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"
Here are some test cases that it needs to match
76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000
Here are some test cases that it should not target
666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717
Please note the c# portion is interfacing with iText 7 pdfSweep.
Guid g = Guid.NewGuid(); // new Guid() would produce the all-zero GUID on every run
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid + ".pdf";
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
strategy.Add(new RegexBasedCleanupStrategy(regex));
PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Please share your wisdom
You may use
\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?
Or, if the prices occur on whole lines:
^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$
Bonus: To obtain only price values, you need to remove the ? after \$ to make it obligatory:
\$([0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?)
(I added a capturing group in case you need to access the number value separately from the $ char).
If you need to support any currency char, not just $, replace \$ with \p{Sc}.
Details
^ - start of string
\$? - an optional dollar symbol
[0-9]{1,3} - one to three digits
(?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
(?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
$ - end of string.
C# check for a match:
if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"))
{
    // there is a match
}
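To illustrate, here is a minimal sketch that runs the anchored pattern over a handful of the test cases from the question. IsPrice is a hypothetical helper name, and the two arrays are just a subset of the asker's lists.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the whole string looks like a price or a grouped number.
bool IsPrice(string s) =>
    Regex.IsMatch(s, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$");

string[] shouldMatch = { "76,249.25", "7.09", "32.848", "0.009", "$9.0000", "$1,000,000" };
string[] shouldNotMatch = { "666-257-6443", "F1A 5G9", "6412K42", "2019.07.05", "2092-002.0180", "280160717" };

foreach (string s in shouldMatch)
    Console.WriteLine("{0,-15} {1}", s, IsPrice(s));     // each prints True

foreach (string s in shouldNotMatch)
    Console.WriteLine("{0,-15} {1}", s, IsPrice(s));     // each prints False
```

Values such as 280160717 are rejected because more than three leading digits are only allowed when they are separated by commas.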
pdfSweep notice:
Apply the fix from this answer. The point is that the line breaks are lost when parsing the text. The regex you need then is
@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"
where (?m) makes ^ and $ match start/end of lines and \r? is required as $ only matches before LF, not before CRLF in .NET regex.
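The effect of \r? can be checked in isolation, without pdfSweep. The sketch below assumes a text with Windows (CRLF) line endings, assembled from lines in the question, and compares the pattern with and without \r?.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample text with CRLF line endings (an assumption for this demonstration).
string text = "7.09\r\nBolt, Locating, M8 x 1.25 x 30 L\r\n$1,800.0000\r\n";

string withCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$";
string withoutCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";

// With \r?, both price lines match; the matched value includes the trailing \r,
// so it is trimmed before printing.
foreach (Match m in Regex.Matches(text, withCr))
    Console.WriteLine(m.Value.TrimEnd('\r'));

// Without \r?, $ cannot match in front of the \r, so nothing is found.
Console.WriteLine(Regex.Matches(text, withoutCr).Count); // prints 0
```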

display some part of string on a label

Input 1
string str=" 1 KAUSHAL DUTTA 46 Female WL 19 WL 2";
Input 2
string str1= "1 AYAN PAL 38 Male CNF S5 49 (LB) CNF S5 49 (LB)";
I have two different types of string. If the user enters str, the output should be (WL 2), and if the user enters str1, the output should be (CNF S5 49 (LB)).
All the values are dynamic except the shapes WL (number) and CNF (1 letter followed by 1 or 2 digits) number (LB).
If you frame your input string with a delimiter, you can easily split it, store the parts in an array, and proceed from there.
For example, frame your string as
string str="1#KAUSHAL DUTTA#46#Female#WL 19#WL 2";
After this, split the string like
string[] str1 = str.Split('#');
From the str1 array, you can take the last value, str1[5].
You can use Regex:
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
// Pattern for the first case: "WL" followed by one or more (+) digits (\d),
// followed by any number of characters (.*).
// The parentheses form a group so its content can be retrieved later.
string pattern1 = @"WL \d+ (.*)";
// Pattern for the second case: "CNF" followed by a word character (\w), one or two ({1,2})
// digits (\d), one or more (+) digits (\d), and then "(LB)".
// The backslashes make the parentheses in \(LB\) literal.
// Any number of characters (.*) follows, again captured in a group.
string pattern2 = @"CNF \w\d{1,2} \d+ \(LB\) (.*)";
string result = "";
// inputString is the string entered by the user.
if (Regex.IsMatch(inputString, pattern1))
{
    // Groups[0] is the entire match; Groups[1] is the content of the first parenthesized group.
    result = Regex.Match(inputString, pattern1).Groups[1].Value;
}
else if (Regex.IsMatch(inputString, pattern2))
{
    // Groups[0] is the entire match; Groups[1] is the content of the first parenthesized group.
    result = Regex.Match(inputString, pattern2).Groups[1].Value;
}
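The same logic can be sketched as a self-contained helper and run against the two inputs from the question. ExtractStatus is a hypothetical name; it also calls Regex.Match once per pattern and checks Success, rather than calling IsMatch and then Match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the trailing status, or "" when neither pattern matches.
string ExtractStatus(string input)
{
    Match wl = Regex.Match(input, @"WL \d+ (.*)");
    if (wl.Success)
        return wl.Groups[1].Value;

    Match cnf = Regex.Match(input, @"CNF \w\d{1,2} \d+ \(LB\) (.*)");
    if (cnf.Success)
        return cnf.Groups[1].Value;

    return "";
}

Console.WriteLine(ExtractStatus(" 1 KAUSHAL DUTTA 46 Female WL 19 WL 2"));            // WL 2
Console.WriteLine(ExtractStatus("1 AYAN PAL 38 Male CNF S5 49 (LB) CNF S5 49 (LB)")); // CNF S5 49 (LB)
```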

Regular expression match all numbers after the last dash?

Trying to find the last instance of numbers after last dash in a string so
test-123-2-456 would return 456
123-test would return ""
123-test-456 would return 456
123-test-456sdfsdf would return 456
123-test-asd456 would return 456
The expression, @"[^-]*$", does not match the numbers though, and I have tried using [\d] but to no avail.
Sure, the simplest solution would be something like this:
(\d+)[^-]*$
This will match one or more digits, captured in group 1, followed by zero or more of any character other than a hyphen, followed by the end of the string. In other words, it will match any sequence of digits as long as there are no hyphens between that sequence and the end of the string. You then just have to extract group 1 from the match. For example:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach(var str in inputs)
{
    var m = Regex.Match(str, @"(\d+)[^-]*$");
    Console.WriteLine("{0} --> {1}", str, m.Groups[1].Value);
}
Produces:
test-123-2-456 --> 456
123-test -->
123-test-456 --> 456
123-test-456sdfsdf --> 456
123-test-asd456 --> 456
Alternatively, you could use a negative lookahead like this:
\d+(?!.*-)
This will match one or more digit characters so long as they are not followed by a hyphen. Only the digits will be included in the match.
Note that these two options behave differently if there are two or more sets of numbers after the last -, e.g. foo-123bar456. In this case it's not entirely clear what you want to happen, but the first pattern will simply match everything from the first sequence of digits to the end (123bar456), with group 1 containing only the first sequence of digits (123). If you'd like it to capture only the last sequence of digits, place a \d inside the character class (i.e. (\d+)[^\d-]*$). The second pattern would produce a separate match for each sequence of digits (in this example, 123 and 456), but the Regex.Match method will only give you the first match.
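A short sketch of that difference on foo-123bar456, using Regex.Matches to see every match the lookahead pattern produces. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "foo-123bar456";

// First pattern: the match runs to the end, group 1 holds the first run of digits.
Match first = Regex.Match(input, @"(\d+)[^-]*$");
Console.WriteLine(first.Value);             // 123bar456
Console.WriteLine(first.Groups[1].Value);   // 123

// Variant with \d in the character class: group 1 holds the last run of digits.
Match lastRun = Regex.Match(input, @"(\d+)[^\d-]*$");
Console.WriteLine(lastRun.Groups[1].Value); // 456

// Lookahead pattern: one match per run of digits after the last hyphen.
string[] all = Regex.Matches(input, @"\d+(?!.*-)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
Console.WriteLine(string.Join(", ", all));  // 123, 456
```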
I suggest applying two regexes in sequence, taking the result of the first one as the input for the second one.
The first regex is:
-[0-9]+[^-]+$ // Take the last piece of your string led by a minus (-)
// followed by digits ([0-9]+)
// and some ugly rest that doesn't contain another minus ([^-]+$)
The second regex is:
-[0-9]+ // Separate the relevant digits from the ugly rest
// You know that there can only be one minus + digits part in it
Tested here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
The last group from this regex can get the last number for you:
[^-A-z][0-9]+[^A-z]
If you are working with groups, you can write code like this to get the last number from the matched groups:
var inputs = new[] {
    "test-123-2-456",
    "123-test",
    "123-test-456",
    "123-test-456sdfsdf",
    "123-test-asd456"
};
foreach (var str in inputs)
{
    var m = Regex.Match(str, @"([0-9]*)");
    if (m.Groups.Count > 1) // This will avoid the values starting with numbers only.
        Console.WriteLine("{0} --> {1}", str, m.Groups[m.Groups.Count - 1].Value);
}

What RegEx string will find the last (rightmost) group of digits in a string?

Looking for a regex string that will let me find the rightmost (if any) group of digits embedded in a string. We only care about contiguous digits. We don't care about sign, commas, decimals, etc. Those, if found, should simply be treated as non-digits, just like a letter.
This is for replacement/incrementing purposes, so we also need to grab everything before and after the detected number in order to reconstruct the string after incrementing the value. In other words, we need a tokenized regex.
Here are examples of what we are looking for:
"abc123def456ghi" should identify the '456'
"abc123def456ghi789jkl" should identify the '789'
"abc123def" should identify the '123'
"123ghi" should identify the '123'
"abc123,456ghi" should identify the '456'
"abc-654def" should identify the '654'
"abcdef" shouldn't return any match
As an example of what we want, it would be something like starting with the name 'Item 4-1a', extracting out the '1' with everything before being the prefix and everything after being the suffix. Then using that, we can generate the values 'Item 4-2a', 'Item 4-3a' and 'Item 4-4a' in a code loop.
Now, if I were looking for the first set, this would be easy. I'd just find the first contiguous block of 0 or more non-digits for the prefix, then the block of 1 or more contiguous digits for the number, then everything else to the end would be the suffix.
The issue I'm having is how to define the prefix as including all (if any) numbers except the last set. Everything I try for the prefix keeps swallowing that last set, even when I've tried anchoring it to the end by basically reversing the above.
How about:
^(.*?)(\d+)(\D*)$
Then increment the second group and concatenate all three.
Explanation:
^ : Beginning of string
( : start of 1st capture group
.*? : any number of any char not greedy
) : end group
( : start of 2nd capture group
\d+ : one or more digits
) : end group
( : start of 3rd capture group
\D* : any number of non digit char
) : end group
$ : end of string
The first capture group will match all characters up to the first digit of the last group of digits before the end of the string.
Or, if you can use named groups:
^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$
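A minimal sketch of the increment-and-rebuild step with the named-group pattern, generating the names from the question. IncrementLastNumber is a hypothetical helper; it assumes the digit run fits in a long and it does not preserve leading zeros.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: adds `step` to the rightmost run of digits.
// Returns the input unchanged when it contains no digits.
string IncrementLastNumber(string name, int step)
{
    Match m = Regex.Match(name, @"^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$");
    if (!m.Success)
        return name;

    long value = long.Parse(m.Groups["number"].Value) + step;
    return m.Groups["prefix"].Value + value + m.Groups["suffix"].Value;
}

for (int i = 1; i <= 3; i++)
    Console.WriteLine(IncrementLastNumber("Item 4-1a", i));
// Item 4-2a
// Item 4-3a
// Item 4-4a
```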
Try the following regex:
(\d+)(?!.*\d)
Explanation:
(\d+) # One or more digits.
(?!.*\d) # (zero-width) Negative look-ahead: the digits must not be followed by any characters and then another digit.
EDIT (off-topic for the question): This answer is incorrect, but the question has already been answered in other posts, so rather than delete this one I will show the same regex used another way. For example, in Perl it could be used like this to get the same result as in C# (incrementing the last number):
s/(\d+)(?!.*\d)/$1 + 1/e;
You can also try a slightly simpler version:
(\d+)[^\d]*$
This should do it:
Regex regexObj = new Regex(@"
# Grab last set of digits, prefix and suffix.
^ # Anchor to start of string.
(.*) # $1: Stuff before last set of digits.
(?<!\d) # Anchor start of last set of digits.
(\d+) # $2: Last set of one or more digits.
(\D*) # $3: Zero or more trailing non digits.
$ # Anchor to end of string.
", RegexOptions.IgnorePatternWhitespace);
What about not using regex? Here's a code snippet (for a console app):
string[] myStringArray = new string[] { "abc123def456ghi", "abc123def456ghi789jkl", "abc123def", "123ghi", "abcdef","abc-654def" };
char[] numberSet = new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
char[] filterSet = new char[] {'a','b','c','d','e','f','g','h','i','j','k','l','m',
    'n','o','p','q','r','s','t','u','v','w','x','y','z','-'};
foreach (string myString in myStringArray)
{
    Console.WriteLine("your string - {0}",myString);
    // Index of the last digit in the string, or -1 if there is none.
    int index1 = myString.LastIndexOfAny(numberSet);
    if (index1 == -1)
        Console.WriteLine("no number");
    else
    {
        // Everything up to and including the last digit.
        string mySubString = myString.Substring(0,index1 + 1);
        // Everything after the last digit is the suffix.
        string suffix = myString.Substring(index1 + 1);
        Console.WriteLine("suffix - {0}", suffix);
        // Last non-digit before the final run of digits
        // (only the characters listed in filterSet are recognized).
        int index2 = mySubString.LastIndexOfAny(filterSet);
        // Everything up to and including that character is the prefix.
        string prefix = myString.Substring(0, index2 + 1);
        Console.WriteLine("prefix - {0}",prefix);
        mySubString = mySubString.Substring(index2 + 1);
        Console.WriteLine("number - {0}",mySubString);
        Console.WriteLine("_________________");
    }
}
Console.Read();
