Using a Regex to find specifically formatted fields - c#

In an application I'm developing, someone thought it would be a good idea to include commas in the values of a CSV file. So what I'm trying to do is select these values and then strip the commas out of them. But the Regex I've built won't match anything.
The Regex pattern is: .*,\"<money>(.+,.+)\",?.*
And the sort of values I'm trying to match would be the 2700, 2650 and 2600 in "2,700","2,650","2,600".

Commas are allowed in CSVs and should not cause a problem if you use a text qualifier (usually a double quote, ").
Here are details: http://en.wikipedia.org/wiki/Comma-separated_values
On to the regex:
This code works for your sample data (but it only allows one comma per number, so basically thousands-separated numbers up to 999,999):
string ResultString = null;
try {
ResultString = Regex.Replace(myString, "([0-9]{1,3})(?:(,)?([0-9]{3})?)", "$1$3");
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
It takes this:
Test 12,205 26,000 Test. And the sort
of values I'm trying to match would be
the 2700, 2650 and 2600 in
"2,700","2,650","2,600"
and produces this:
Test 12205 26000 Test. And the sort
of values I'm trying to match would be
the 2700 2650 and 2600 in
"2700","2650","2600"

Thanks for the answer, Ryan. I should've kept my temper in check a bit more. Though still, commas in a CSV that don't separate values?
After asking around the office I got pointed towards a free Regex designer application that I used to build the pattern I needed. The application is Rad Software Regular Expression Designer.
Oh and for the answer purposes the pattern I came out with is:
\"(?<money>[0-9,.$]*)\"
Edit: Final Answer
Woah. Completely forgot about this question. I played around with the regex even more and came out with one that would match everything I needed. The regex is:
\"([0-9.$]+(,[0-9.]+)+)\"
That regex has been able to match any decimal string within double quotes that I've thrown at it.
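As a rough sketch of how that final pattern could be used to strip the commas, the snippet below runs it through Regex.Replace with a match evaluator. The class and method names (CsvCommaStripper, StripCommas) are made up for illustration and are not from the original post.

```csharp
using System.Text.RegularExpressions;

public static class CsvCommaStripper
{
    // Pattern from the answer above: a quoted run of digits, dots and $
    // that contains at least one comma.
    private static readonly Regex QuotedNumber =
        new Regex("\"([0-9.$]+(,[0-9.]+)+)\"");

    // Removes the commas inside each matched quoted value and leaves
    // everything else (including the field-separating commas) untouched.
    public static string StripCommas(string line)
    {
        return QuotedNumber.Replace(line, m => m.Value.Replace(",", ""));
    }
}
```

Calling CsvCommaStripper.StripCommas on the sample line "2,700","2,650","2,600" should leave the quotes and the separators in place and only drop the commas inside each value.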

Related

Regex groups expression not capturing content

I'm trying to create a large regex expression where the plan is to capture 6 groups.
It's going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail, from my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.

The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for its explanation, to see whether your expression matches up with what you want spelled out in words. However, for testing expressions for use in C# you should use something else, like http://regexstorm.net/tester
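A minimal sketch of the corrected expression applied to the sample log line, assuming the fields really are tab-separated as described in the question. The LogLineParser class and its Parse method are hypothetical names used only for this example.

```csharp
using System.Text.RegularExpressions;

public static class LogLineParser
{
    // The corrected expression from the answer: \w+ instead of \w{+}.
    private static readonly Regex LogLine = new Regex(
        @"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)");

    // Returns the six captured fields, or null when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = LogLine.Match(line);
        if (!m.Success) return null;

        var fields = new string[6];
        for (int i = 0; i < 6; i++)
        {
            // Group 0 is the whole match, so the captures start at index 1.
            fields[i] = m.Groups[i + 1].Value;
        }
        return fields;
    }
}
```

For the sample line, the third field should come back as CallingClass and the fourth and fifth as 17503 and 20870.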

How to Match a Comma Separated List and End with a Different Character

One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.

Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Explanation:
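The crucial part here is the if-else (conditional) construct, of the form (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if a type-parameter pair is followed by a comma and then another word character, consume the comma along with the surrounding whitespace; otherwise, consume only optional whitespace. This rejects a trailing comma before the closing parenthesis.
However, I feel that regex could turn out to be the wrong choice for your task. There are some lightweight parser frameworks available, e.g. Sprache.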

You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables. Note that it also accepts a trailing comma after the last argument.
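To show how the individual arguments could be read back out, here is a small sketch built on the pattern with the optional comma. It relies on the fact that .NET keeps every iteration of a repeated group in that group's Captures collection. The SignatureParser class and Arguments method are illustrative names, not part of either answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SignatureParser
{
    // The pattern from the answer above, with ,? inside the repeated group.
    private static readonly Regex Signature = new Regex(
        @"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)");

    // Returns each "type name" pair from the argument list, or an empty
    // list when the line is not a recognised function signature.
    public static List<string> Arguments(string line)
    {
        var args = new List<string>();
        Match m = Signature.Match(line);
        if (!m.Success) return args;

        // Group 3 repeats once per argument; Captures holds every iteration.
        foreach (Capture c in m.Groups[3].Captures)
        {
            // Drop the trailing whitespace and comma that the group consumed.
            args.Add(c.Value.Trim().TrimEnd(','));
        }
        return args;
    }
}
```

For void FunctionName(int a, string b) this should yield two entries, "int a" and "string b"; the return type and function name are available from groups 1 and 2 of the same match.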

C# Regex.Replace not matching and removing double quotes

I googled for an answer and I found some questions here on Stack Exchange asking similar questions, but they didn't help me. For example, I found "C# regex - not matching my string", but the answers given are way too complicated for me to understand. I don't know or understand regex. All I want to do is strip a double quote from a string.
To put my question simply, I have a string "\"123.456\"" and I need to remove the "\""
so I made my expression "[^\w\\"]" and after calling
string myString = Regex.Replace("\"123.456\"", "[^\\w\\\"]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
myString is "\"123.456\"". I just need to know what my expression should be. I won't be able to understand any lengthy discussions or lectures on learning regex.
I got my example directly from Microsoft at http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.110).aspx so basically all I did was replace the ".#-" with "\"".
UPDATE
Apparently trying to ask a simple question only attracts trolls. I didn't want to get too complicated because I didn't want all you hard working busy people to spend too much time answering the wrong question. I was trying to be nice.
We have a situation where we need to parse input files from several clients. Going forward, the number of clients will increase, and the number of files from each client will also increase.
We found that in several of our clients' transmitted files many fields will have various extra characters. We don't know how or why those characters are in there and our clients aren't telling. (if you want to know why they aren't telling, please move along, these aren't the questions you are looking for)
So, we have many files from many clients each with many rows with many fields of data and we need to strip out "bad" characters.
I took Microsoft's method and changed it a bit to be more dynamic.
private string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[^\w\{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
The goal here is to be able to strip out, dynamically, any characters that don't belong. But we can't just hard-code those characters, because not all fields will have any of these characters and, more importantly, some fields will have some bad characters along with other characters that are not considered bad for that field but may be considered bad for other fields.
With me so far?
So, in trying to get my work done by Friday (yes, tomorrow), I decided to start slowly with only a couple of known bad characters from 3 input files. So far, those characters are single quote, dash, double quote, dollar sign, comma. But not all the fields in my 3 files need these characters stripped, so I intend to call the CleanInput method only on those fields that need it, and only for the characters that we need stripped.
OK, so while I was testing, I discovered that on one field, where we want to strip the comma, single quote, double quote and dollar sign, it was not removing the double quotes (and apparently the backslashes too). So I debugged this issue by first passing in only the comma - that worked. Then I tried passing in only the single quote - that worked. Then I passed in the dollar sign - that worked. Then I passed in the escaped double quote - and that didn't work; the double quotes are still in the string. So I simplified my test in a new console project, hard-coded the string, and called my method just to make sure nothing else could be interfering with it.
I hope and pray no one spends hours of their precious time trying to reconfigure my input files or attempting to teach me the end all be all of regex programming. I have to get this done by tomorrow. Please, I only want to know how to strip the double quote (and apparently the backslashes too) from the given string.

Rather than getting regex involved, perhaps you can just use Replace?
var myString = "\\\"123.456\\\"";
var myCleanString = myString.Replace(@"\""", "");

You are matching on a negated character class (the [^] bit). This matches any character that is not in the square brackets and replaces it, so the characters you pass in are exactly the ones that get kept. You want to replace anything that is in the class, which you can do by placing the characters you wish to replace inside the square brackets and removing the negation (^):
private static string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
You would use the negative version if you knew what you wanted to include rather than exclude. For example if you knew you only wanted numbers and the period character you could do:
string myString = Regex.Replace("\"123.456\"", "[^\\d.]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
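One thing to watch when the characters to strip are passed in dynamically: a few characters (backslash, ], [, ^ and -) have special meaning inside square brackets, so they need escaping before they go into the pattern. The following is only a sketch of one way to do that; InputCleaner and StripChars are hypothetical names, and the manual escaping is my own assumption rather than something from the answers above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class InputCleaner
{
    // Hypothetical helper: removes every character listed in charsToStrip.
    public static string StripChars(string input, string charsToStrip)
    {
        if (string.IsNullOrEmpty(charsToStrip)) return input;

        var charClass = new StringBuilder("[");
        foreach (char c in charsToStrip)
        {
            // Escape the characters that are special inside a character class.
            if (c == '\\' || c == ']' || c == '[' || c == '^' || c == '-')
            {
                charClass.Append('\\');
            }
            charClass.Append(c);
        }
        charClass.Append(']');

        try
        {
            return Regex.Replace(input, charClass.ToString(), "",
                RegexOptions.None, TimeSpan.FromSeconds(1.5));
        }
        catch (RegexMatchTimeoutException)
        {
            // Same fallback as the method in the question.
            return string.Empty;
        }
    }
}
```

With this version, passing a double quote and a backslash as the characters to strip should remove both from the string while leaving the digits and the period alone.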

Extracting words from lines that match different patterns

I'm monitoring incoming e-mail subjects, and each subject may contain a particularly formatted code which I use to reference something else down the line.
These codes can be anywhere within the string, and sometimes not at all - and so the problem I'm having is my lack of RegEx skills (which I assume is the best option for this solution?).
An example of a subject would be:
"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"
The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".
The format of MZ5051CLA will be:
- Always starts with "MZ"
- Followed by a number
- Always ends with "CLA"
Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?
I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.
Edit:
As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.
Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.

Try this regex:
^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$

The below regex would match only the first string MZ5051CLA
\bMZ\d+CLA\b
But this would match the both strings MZ5051CLA and 11123MTX,
\b[A-Z0-9]+$
All alphanumeric characters at the end of a line are matched.
This would get you the alphanumeric string at the end of a line, or a string which starts with a number and ends with MTX:
(?:\b[A-Z0-9]+$|\b\d+MTX\b)

Both Codes in One Pattern
It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:
\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*
In the demo, see how only the correct groups are matched. Of course false positives are possible.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

So, in that case if you don't mind false positives, then use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. This will work well in general.
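Since the question's edit asks for a separate regex setting per mailbox, here is a minimal sketch of that idea using the two word-boundary patterns for the example codes. The mailbox names ("claims", "matters") and the SubjectCodeExtractor class are invented for illustration; in practice the patterns would presumably come from configuration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubjectCodeExtractor
{
    // Hypothetical mailbox names; each maps to the pattern for its code format.
    private static readonly Dictionary<string, Regex> PatternsByMailbox =
        new Dictionary<string, Regex>
        {
            { "claims", new Regex(@"\bMZ\d+CLA\b") },
            { "matters", new Regex(@"\b\d+MTX\b") }
        };

    // Returns the first code found in the subject, or null if there is none
    // or the mailbox has no pattern configured.
    public static string Extract(string mailbox, string subject)
    {
        Regex pattern;
        if (!PatternsByMailbox.TryGetValue(mailbox, out pattern)) return null;

        Match m = pattern.Match(subject);
        return m.Success ? m.Value : null;
    }
}
```

Because each pattern is anchored on word boundaries rather than on the start or end of the line, the code can sit anywhere in the subject.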

Pulling data out of quotes?

I'm looking for a regex that can pull out quoted sections in a string, both single and double quotes.
IE:
"This is 'an example', \"of an input string\""
Matches:
an example
of an input string
I wrote up this:
[\"|'][A-Za-z0-9\\W]+[\"|']
It works but does anyone see any flaws with it?
EDIT: The main issue I see is that it can't handle nested quotes.

How does it handle single quotes inside of double quotes (or vice versa)?
"This is 'an example', \"of 'quotes within quotes'\""
should match
an example
of 'quotes within quotes'
Use a backreference if you need to support this.
(\"|')[A-Za-z0-9\\W]+?\1
EDIT: Fixed to use a reluctant quantifier.

Like that?
"([\"'])(.*?)\1"
Your desired match would be in sub group 2, and the kind of quote in group one.
The flaw in your regex is 1) the greedy "+" and 2) [A-Za-z0-9] is not really matching an awful lot. Many characters are not in that range.

It works but doesn't match other characters in quotes (e.g., non-alphanumeric, like binary or foreign language chars). How about this:
[\"']([^\"']*)[\"']
My C# regex is a little rusty so go easy on me if that's not exactly right :)

@"([""'])(.*?)\1"
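A short sketch of the backreference approach in use, collecting the text between each pair of matching quotes. The QuoteExtractor class name is made up for this example. It assumes quotes are never escaped inside the quoted text, and an apostrophe inside an unquoted word (as in "don't") would be treated as an opening quote.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteExtractor
{
    // Group 1 is the opening quote, group 2 the content, and \1 requires
    // the closing quote to be the same character as the opening one.
    private static readonly Regex Quoted = new Regex(@"([""'])(.*?)\1");

    // Returns the content of every quoted section, without the quotes.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Quoted.Matches(input))
        {
            results.Add(m.Groups[2].Value);
        }
        return results;
    }
}
```

Because the closing quote must match the opening one, single quotes nested inside a double-quoted section stay part of that section's content.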

You might already have one of these, but, in case not, here's a free, open source tool I use all the time to test my regular expressions. I typically have the general idea of what the expression should look like, but need to fiddle around with some of the particulars.
http://renschler.net/RegexBuilder/
