Extracting words from lines that match different patterns - c#

I'm monitoring incoming e-mail subjects, and each subject may contain a specially formatted code which I use to reference something else further down the line.
These codes can be anywhere within the string, and sometimes not at all - and so the problem I'm having is my lack of RegEx skills (which I assume is the best option for this solution?).
An example of a subject would be:
"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"
The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".
The format of MZ5051CLA will be:
- Always starts with "MZ"
- Followed by a number
- Always ends with "CLA"
Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?
I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.
Edit:
As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.
Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.

Try this regex:
^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$

The regex below would match only the first code, MZ5051CLA:
\bMZ\d+CLA\b
But this would match both codes, MZ5051CLA and 11123MTX:
\b[A-Z0-9]+$
It matches any run of uppercase letters and digits at the end of a line.
This would get you the alphanumeric string which starts with MZ and ends with CLA, or starts with a number and ends with MTX:
(?:\bMZ\d+CLA\b|\b\d+MTX\b)
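Since the question's edit mentions one regex setting per mailbox, here is a minimal C# sketch of that idea. The mailbox names ("claims", "transport") and the class and method names are made up for illustration; only the two patterns come from the examples above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SubjectCodeExtractor
{
    // Hypothetical per-mailbox settings: mailbox name -> pattern for its code format.
    static readonly Dictionary<string, Regex> MailboxPatterns = new Dictionary<string, Regex>
    {
        ["claims"] = new Regex(@"\bMZ\d+CLA\b"),
        ["transport"] = new Regex(@"\b\d+MTX\b"),
    };

    // Returns every code found in the subject; the list is empty when there is none.
    public static List<string> ExtractCodes(string mailbox, string subject)
    {
        var codes = new List<string>();
        foreach (Match m in MailboxPatterns[mailbox].Matches(subject))
        {
            codes.Add(m.Value);
        }
        return codes;
    }
}
```

Calling SubjectCodeExtractor.ExtractCodes("claims", "Please refer to reference MZ5051CLA") should give a list with the single entry MZ5051CLA, and a subject without a code gives an empty list.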

Both Codes in One Pattern
It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:
\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*
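On the two sample subjects only the intended codes are matched. Of course false positives are possible: any token made of uppercase letters and digits that contains at least one of each will match.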
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
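A small C# sketch of how the lookahead pattern from this answer behaves on the question's subjects. The class and method names are hypothetical, and the "ROOM 4B" subject is an invented input to show a false positive.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MixedTokenFinder
{
    // Token of uppercase letters and digits containing at least one letter
    // (checked by the lookahead) and at least one digit (checked by the main pattern).
    static readonly Regex Token =
        new Regex(@"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*");

    // Returns all matching tokens in the subject, in order of appearance.
    public static List<string> Find(string subject)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(subject))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

Find("Attention for Mr Danshi, RE. 11123MTX") should return only 11123MTX, while Find("Meet in ROOM 4B") returns 4B, which is the kind of false positive mentioned above.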

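So, in that case, if you don't mind false positives, use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. Because of the ^ and $ anchors, it has to be tested against each word of the subject separately rather than against the whole subject line.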

Related

How to Match a Comma Separated List and End with a Different Character

One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.
Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Explanation:
The crucial part here is the if-else construct (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if a type-parameter pair is followed by a comma and another word character, consume the comma and the whitespace around it; otherwise consume only optional whitespace. A trailing comma before the closing parenthesis is therefore rejected.
However, I feel that regex could turn out to be the wrong choice for your task. There are some lightweight parser frameworks out there, e.g. Sprache.
You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables. Because the comma is optional, a trailing comma or a missing comma is also accepted.
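To actually read the arguments out in C#, one option is to walk the Captures of the third group, since a repeated group keeps every iteration in .NET. This is only a sketch built on the ,? regex above; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SignatureParser
{
    // The ",?" regex from the answer above, unchanged.
    static readonly Regex Signature = new Regex(
        @"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)");

    // Returns false when the line is not a recognised signature.
    // On success, parameters holds one "type name" string per argument.
    public static bool TryParseParameters(string line, out List<string> parameters)
    {
        parameters = new List<string>();
        Match m = Signature.Match(line);
        if (!m.Success)
        {
            return false;
        }
        // Group 3 repeats once per argument; each repetition is a separate Capture.
        foreach (Capture c in m.Groups[3].Captures)
        {
            // Drop the whitespace and the optional comma the regex swallowed.
            parameters.Add(c.Value.Trim().TrimEnd(','));
        }
        return true;
    }
}
```

For void FunctionName(int a, string b) this should yield the two entries "int a" and "string b"; the return type and function name are still available as groups 1 and 2 of the same match.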

Validate a renaming pattern with RegEx

I'm trying to validate a pattern used for renaming.
The user will enter a value like:
%1% - %3%%2%
I'm able to match it with a regex, and everything is OK:
[^%]*(%[\d]+%)+[^%]*
But before that I want to validate the string and be able to detect when the user has made mistakes like:
%1% - %3%2%
%1% - %3%%2
...
Whatever I try, I can get the corrected value, but I can't tell whether the string is well formatted or not, other than by checking manually.
Is there any way to solve this problem with regex? Or maybe I don't need regex for this...
EDIT FOR CLARIFICATION
For a good example, just take a program which renames your mp3 files.
You define a mapping between %1% and the track title, %2% for the artist, ...
Sorry, my mistake was to provide only one string. But the user can submit:
%1% - %3%%2%
%1%_%2%%3%
%1%%3% %2%
...
Whatever he wants. Parsing the string when everything is correct seems OK to me, unless I find a tricky bad example.
But before I save it, I want to validate it and refuse a string like
%1% - %3%%2
My problem was to find the wrong value. What I've done, and it doesn't seem clean to me, is to use my regex and then verify that the total number of "%" found in the string is even and that this total divided by 2 equals the number of groups found. But I'm not sure it always works (not sure if my last sentence is clear).
I think this regex is what you're trying to accomplish.
(%[\d]%) - (%[\d]%)*
"I don't know if the string is well formatted or not."
This pattern puts in a check for three consecutive %%%, which seems to catch a good number of bad-format scenarios. Then we can make the pattern validate only good items by adding the $ anchor, so that only fully formed valid patterns match.
The valid pattern of (%\d%) is what we seek:
^ # Start Anchor
(?!.+%%%) # Stop if 3 % anywhere.
%\d% # First \d
\s-\s # Dash and spaces
(%\d%)+ # Groups of numbers
$ # Stop Anchor
It works on the one example you gave %1% - %3%%2% and doesn't match on the 2 failure examples you provided.
Because this pattern is commented, you will need to use RegexOptions.IgnorePatternWhitespace as a regex option. Otherwise delete all comments and join the parts onto one line without spaces.
When one uses * (zero to many) it can create some ungodly backtracking scenarios which can actually fail a good pattern. Is there really going to be zero items?
Your examples don't show it; if not, why not use + (one to many)?
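For reference, this is roughly how the commented pattern above could be used from C# with RegexOptions.IgnorePatternWhitespace. The class and method names are hypothetical. Note that this pattern only accepts the "%n% - %n%..." layout of the first example, not the other layouts listed in the question's clarification.

```csharp
using System.Text.RegularExpressions;

static class RenamePatternValidator
{
    // The commented pattern from the answer; whitespace and "# ..." comments
    // are ignored because of RegexOptions.IgnorePatternWhitespace.
    static readonly Regex Valid = new Regex(@"
        ^            # start anchor
        (?!.+%%%)    # stop if 3 % anywhere
        %\d%         # first placeholder
        \s-\s        # dash and spaces
        (%\d%)+      # one or more further placeholders
        $            # stop anchor",
        RegexOptions.IgnorePatternWhitespace);

    // True when the whole string has the expected shape.
    public static bool IsValid(string pattern)
    {
        return Valid.IsMatch(pattern);
    }
}
```

IsValid("%1% - %3%%2%") should be true, while the two broken examples from the question, %1% - %3%2% and %1% - %3%%2, should both be rejected.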

Capture Text Surrounding Regex Match .NET

I am building an application, and I have a requirement to capture characters before and after matches. This seems to work okay, except when there are multiple matches within the surrounding capture.
Regex:
.{0,10}(?=abc)
This should capture up to 10 characters before the string "abc" is found.
The issue comes up if there is a recurrence of the match in the preceding text:
"qqqqabcabcqqq"
With the above text, I would expect two captures:
qqqq (the 4 characters before the first abc occurrence)
qqqqabc (the 7 characters before the second abc occurrence)
I am not, however getting these matches. The only match I get is:
qqqqabc
I am certain that I am missing something, but I am not sure what. I believe that my regex is somehow being too greedy, and so it is overlooking the first match in favor of the larger, second one. Here is what I need:
I need a regex that:
1. Is for .NET
2. Looks within a string for X characters before an exact match on string S.
3. Includes any secondary match on S (call it S') that is found within X characters before S
4. does not care in the slightest what these characters are.
I assure you, I tried looking for similar answers but I wasn't able to find anything that directly answers this question (which has been plaguing me for two days. Yes, I have to use regular expression). As for Regex flavor, I am working in .NET.
Thank you so much for any help.
Here it is:
(?<=(?<CharsBefore>.{0,10}))(?=abc)
Took me a while to remember that .NET allows variable-length positive lookbehinds.
I changed the way your initial version worked a bit: the match itself is now zero-length, and the preceding characters end up in the CharsBefore group.
Hope it helps!
PS: I've named the group, but you are obviously free to keep it nameless and work with numbered groups if you want a less cluttered regex, like so:
(?<=(.{0,10}))(?=abc)
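A minimal C# sketch of reading the captured text back out, using the named-group version of the regex. The class and method names are made up; the input is the string from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ContextCapture
{
    // Zero-width match in front of every "abc"; the lookbehind stores
    // up to 10 preceding characters in the CharsBefore group.
    static readonly Regex BeforeAbc =
        new Regex(@"(?<=(?<CharsBefore>.{0,10}))(?=abc)");

    // Returns the preceding text for each occurrence of "abc", left to right.
    public static List<string> CharsBefore(string input)
    {
        var results = new List<string>();
        foreach (Match m in BeforeAbc.Matches(input))
        {
            results.Add(m.Groups["CharsBefore"].Value);
        }
        return results;
    }
}
```

For "qqqqabcabcqqq" this should give qqqq and qqqqabc, the two captures the question asks for. Because each match is empty, the value has to be taken from the group, not from m.Value.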

Simple RegEx Example

I'm horrible at regex so please bear with me here:
I need a match where the first character can be anything and the next two have to be RS.
so...
XRS123445 - Match
I suggest you start reading this. Matching any character at a position is basically the simplest thing you can do with regular expressions. There are many different things you can use too:
Any alphanumeric character or underscore (\w)
Any character whatsoever (.)
A range of characters ([A-Z])
Any character in a certain Unicode range ([\uxxxx-\uxxxx])
and more. You should also be careful, as certain regex flavors have certain nuances and certain flags have to be set to get the same result. I won't go into more detail to avoid confusion here.
This is the regex you're looking for:
^.RS.*
This would match on any of these:
XRS123445
4RSabc
YRS
.RS.*
This should match, as . means any character, followed by RS as per your requirements. Without the ^ anchor it will also match when RS appears later in the string.
Use this pattern
var pattern = "^.RS";
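A quick C# check of the anchored pattern, as a sketch. The class and method names are hypothetical, and the two non-matching inputs are invented for contrast.

```csharp
using System.Text.RegularExpressions;

static class RsMatcher
{
    // True when the string has any first character followed directly by "RS".
    public static bool HasRsAfterFirstChar(string input)
    {
        return Regex.IsMatch(input, "^.RS");
    }
}
```

XRS123445, 4RSabc and YRS should all return true; RS123 (nothing before RS) and XXRS (RS in the wrong position) should return false.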

Finding optional groups with random order using regex

I'm trying to get the following using Regex.
This is sample input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emcc=ME#HOST.COM -embcc=YOU#HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear but sometimes they do and then I need to capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
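The loop described above could look roughly like this in C#. It is a sketch: the class name, the set of accepted keys and the choice of throwing an ArgumentException as the "error message" are assumptions, and the EMAIL_ prefix follows the group names the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmailOptionParser
{
    // The quoted-value regex from above: key in group 1, value in group 2.
    static readonly Regex Option =
        new Regex(@"-em([^\s=]+)=(""[^""]*""|[^\s]+)");

    // Assumed list of accepted keys, taken from the question's own attempt.
    static readonly HashSet<string> KnownKeys =
        new HashSet<string> { "to", "from", "cc", "bcc", "subject" };

    // Returns e.g. EMAIL_TO -> value for every option present, in any order.
    public static Dictionary<string, string> Parse(string args)
    {
        var options = new Dictionary<string, string>();
        foreach (Match m in Option.Matches(args))
        {
            string key = m.Groups[1].Value;
            if (!KnownKeys.Contains(key))
            {
                throw new ArgumentException("Unknown option: -em" + key);
            }
            // Strip the surrounding quotes of a quoted value.
            options["EMAIL_" + key.ToUpperInvariant()] = m.Groups[2].Value.Trim('"');
        }
        return options;
    }
}
```

Optional keys such as cc and bcc simply show up in the dictionary when present, so their position on the command line does not matter. A mistyped key like -emsubjcet=something is matched by the regex and then reported by the code.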
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.
