Finding optional groups with random order using regex

Finding optional groups with random order using regex - c#

I'm trying to get the following using Regex.
This is sample input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emcc=ME#HOST.COM -embcc=YOU#HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear but sometimes they do and then I need to
capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))

Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.

Related

Regex groups expression not capturing content

I'm trying to create a large regex expression where the plan is to capture 6 groups.
Is gonna be used to parse some Android log that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are Tab separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so regex I feel is still the best option even if heavier then performing a split.
Breaking down the groups in more detail from my though process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.

The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester

RegEx to capture text between two delimiter characters including 'shared'

If I have the following text...
The quick :brown:fox: jumped over the lazy :dog:.
I would like a regular expression to capture all the words that are between 2 : characters. In the above example it should return :brown:, :fox:, :dog:.
So far, I have this (\:{1}.\w*\s*\:{1}) which returns :brown: and :dog:. I can't quite figure out how to share the : between the 2 matching groups so that it will also return ':fox:'.

Here is a simple pattern which can be made to work:
(?<=:)(\w+)(?=:)
This uses lookarounds to make sure that one or more word characters are surrounded before and after by colons. Check the demo below to see it working.
The match would be available as the first capture group. Actually, it should also be available as the entire match itself, because lookarounds do not consume anything.
Demo
I like the above lookaround approach because it is clean and simple (at least in my mind). If, for some reason, you don't want any lookarounds, then just use the following pattern:
:(\w+):
But note that now you explicitly have to access the first capture group to obtain the matching word without colons on either side.

Validate a renaming pattern with RegEx

I'm trying to validate a pattern used for renaming.
The user will fill value like :
%1% - %3%%2%
I'm able to match with a regex, everything is ok:
[^%]*(%[\d]+%)+[^%]*
But before that I want to validate the string and be able to find when the user made mistakes like :
%1% - %3%2%
%1% - %3%%2
...
Whatever I try, I can get the corrected value but I don't know if the string is well formatted or not. Only to check manually.
Are there any way with regex to answer to this problem ? Or maybe I don't need regex for this...
EDIT FOR CLARIFICATION
For a good example, just take a program which rename your mp3 files.
You define a mapping between %1% and the track title, %2% for the artist, ...
Sorry, my mistake was to provide only one string. But the user can submit :
%1% - %3%%2%
%1%_%2%%3%
%1%%3% %2%
...
Whatever he want. My goal to parse the string if everything is correct, seems ok for me. Unless I find a tricky bad example.
But before I save it, I want to validate and refuse a string like
%1% - %3%%2
My problem was to find the wrong value. What I done, and seems to me not clean, is to use my regex, and then verify if the total of "%" found in the string is even and if this total divided by 2 is equal of the total of group found. But I'm not sure it works always (not sure if my last phrase is clear)

I think this regex is what you're trying to accomplish.
(%[\d]%) - (%[\d]%)*

I don't know if the string is well formatted or not.
This pattern puts in a check for three consecutive %%% which seems to catch a good number of failure bad format scenarios. Then we can require the pattern to validate* for only good items by adding the $ anchor to require only fully formed valid patterns.
The valid pattern of (%\d%) is what we seek:
^ # Start Anchor
(?!.+%%%) # Stop if 3 % anywhere.
%\d% # First \d
\s-\s # Dash and spaces
(%\d%)+ # Groups of numbers
$ # Stop Anchor
It works on the one example you gave %1% - %3%%2% and doesn't match on the 2 failure examples you provided.
Because this pattern is documented you will need to use IgnorePatternWhiteSpace as a regex option. Otherwise delete all comments and join onto one line without spaces.
When one uses * (zero to many) it can create some ungodly backtracking scenarios which can actually fail a good pattern. Is there really going to be zero items?
Your examples don't show it; if not why not use + 1 to many?

RegEx to split and extract based strict requirement

I’m using Nintex Workflows with a RegEx action. I believe the RegEx is based on .NET. I need to perform a RegEx on some data that is sent to me by users who input it in a different formats based on the person writing the data.
Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra
My goal is this.
Start the extract after Test:
Capture the last 4 of the alpha numeric before the parenthesis
Capture the numbers only inside the parenthesis
Split each data based on ;
End the whole capture when the word STOP is found.
End results
BC12 (1,2,3,4,5,6,7,8,9);
DE34 (1,2,3,4,5,6,7,8,9);
FG56 (1,2,3,4,5);
I have tried splitting the data, forward lookup and exclude and I can’t seem to get everything to work together. If I have to execute multiple RegEx to achieve my results I’m ok with that.
I’ve tried the following to achieve each one of my goals
(?s)(?<=^.*?Test:\s)[a-zA-Z0-9]+ this only capture the first ABC12 or A-BC12 then stops
[,;] split the data so it is easier to maintain. However the word Test: is captured.
I feel I'm going in the right direction, however I'm missing something or taking the wrong approach. Any help would be greatly appreciated.

If you need to omit the first group you can use this regex: Test:\s*A[^;]*;(.*?)STOP.
That way, you can take $1 and split it on ;.
Edit: Clarifications have rendered the above solution obsolete. I've made new stuff that will directly address your steps:
a. Start the extract after Test:
b. Capture the last 4 of the alpha numeric before the parenthesis
c. Capture the numbers only inside the parenthesis
d. Split each data based on ;
e. End the whole capture when the word STOP is found.
You're actually looking for something like:
Use Test:\s*(.*?)STOP. This addresses steps a and e.
Take $1 and use [A-Z0-9]{4}\s*\(([^)]*)\);. This addresses steps b and d.
Take the $1 from the previous step, and use ([0-9]+) to get the numbers. This will get all the numbers, and if given: 9,10 it will produce two matches: 9 and 10.
You may need to use modifiers, like i for case insensitive, s for single line, and g for global.
I hope this is finally what you're looking for!

Extracting words from lines that match different patterns

I'm monitoring incoming e-mail subjects, and each subject may contain a particularly formatted code inside it which I used to reference something else with down the line.
These codes can be anywhere within the string, and sometimes not at all - and so the problem I'm having is my lack of RegEx skills (which I assume is the best option for this solution?).
An example of a subject would be:
"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"
The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".
The format of MZ5051CLA will be:
- Always starts with "MZ"
- Follows by a number
- Always ends with "CLA"
Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?
I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.
Edit:
As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.
Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.

Try this regex:
^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$
Demo

The below regex would match only the first string MZ5051CLA
\bMZ\d+CLA\b
DEMO
But this would match the both strings MZ5051CLA and 11123MTX,
\b[A-Z0-9]+$
All alphanumeric characters present at the last of a line are matched.
DEMO
This would get you the Alphanumeric string which starts with MZ and ends with CLA or starts with a number and ends with mtx
(?:\b[A-Z0-9]+$|\b\d+MTX\b)
DEMO

Both Codes in One Pattern
It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:
\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*
In the demo, see how only the correct groups are matched. Of course false positives are possible.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

So, in that case if you don't mind false positives, then use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. This will work well in general.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Finding optional groups with random order using regex - c#

Related

Regex groups expression not capturing content

RegEx to capture text between two delimiter characters including 'shared'

Validate a renaming pattern with RegEx

RegEx to split and extract based strict requirement

Extracting words from lines that match different patterns

Categories

Resources