RegEx to split and extract based on strict requirements - C#

I’m using Nintex Workflows with a RegEx action. I believe the RegEx engine is based on .NET. I need to run a RegEx on data that users send me; the format varies depending on who wrote it.
Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra
My goal is this.
Start the extract after Test:
Capture the last 4 of the alpha numeric before the parenthesis
Capture the numbers only inside the parenthesis
Split each data based on ;
End the whole capture when the word STOP is found.
End results
BC12 (1,2,3,4,5,6,7,8,9);
DE34 (1,2,3,4,5,6,7,8,9);
FG56 (1,2,3,4,5);
I have tried splitting the data, forward lookup and exclude and I can’t seem to get everything to work together. If I have to execute multiple RegEx to achieve my results I’m ok with that.
I’ve tried the following to achieve each of my goals:
(?s)(?<=^.*?Test:\s)[a-zA-Z0-9]+ this only captures the first ABC12 or A-BC12, then stops.
[,;] this splits the data so it is easier to maintain. However, the word Test: is captured.
I feel I'm going in the right direction, however I'm missing something or taking the wrong approach. Any help would be greatly appreciated.

If you need to omit the first group you can use this regex: Test:\s*A[^;]*;(.*?)STOP.
That way, you can take $1 and split it on ;.
Edit: Clarifications have rendered the above solution obsolete. I've made new stuff that will directly address your steps:
a. Start the extract after Test:
b. Capture the last 4 of the alpha numeric before the parenthesis
c. Capture the numbers only inside the parenthesis
d. Split each data based on ;
e. End the whole capture when the word STOP is found.
You're actually looking for something like:
Use Test:\s*(.*?)STOP. This addresses steps a and e.
Take $1 and use [A-Z0-9]{4}\s*\(([^)]*)\);. This addresses steps b and d.
Take the $1 from the previous step, and use ([0-9]+) to get the numbers. This will get all the numbers, and if given: 9,10 it will produce two matches: 9 and 10.
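You may need to use modifiers, like i for case insensitive, s for single line (so that . also matches newlines), and g for global. In .NET these correspond to RegexOptions.IgnoreCase and RegexOptions.Singleline; there is no g flag, because Regex.Matches already returns every match.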
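The three steps above can be sketched end to end. This is only a minimal sketch in Python, used here as a stand-in for the .NET engine (the patterns use no engine-specific syntax). The function name extract_records, the SAMPLE constant, and the extra capture group around [A-Z0-9]{4} are assumptions added for illustration and are not part of the steps as written.

```python
import re

SAMPLE = """Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra"""


def extract_records(text):
    """Return one 'CODE (n,n,...);' string per record found before STOP."""
    # Steps a and e: keep only what lies between the first "Test:" and "STOP".
    # DOTALL plays the role of the "s" modifier so that . crosses line breaks.
    outer = re.search(r"Test:\s*(.*?)STOP", text, re.DOTALL)
    if outer is None:
        return []

    records = []
    # Steps b and d: the last 4 alphanumerics before "(", then the bracket body.
    # The group around [A-Z0-9]{4} is an assumption added to keep the code.
    record_re = r"([A-Z0-9]{4})\s*\(([^)]*)\);"
    for rec in re.finditer(record_re, outer.group(1), re.IGNORECASE):
        code, inside = rec.group(1), rec.group(2)
        # Step c: keep only the numbers inside the parentheses.
        numbers = re.findall(r"[0-9]+", inside)
        records.append(f"{code} ({','.join(numbers)});")
    return records


for line in extract_records(SAMPLE):
    print(line)
```

Running three small patterns in sequence keeps each one easy to check on its own, which fits a workflow tool where several RegEx actions can be chained.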
I hope this is finally what you're looking for!

Related

How to Match a Comma Separated List and End with a Different Character

One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.
Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Explanation:
The crucial part here is the if-else construct, a la (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if the type-parameter pair is followed by a comma and another word character, match the comma and the whitespace around it; otherwise match only optional whitespace.
However, I feel that using regex could turn out to be the wrong choice for your task at hand. There are some lightweight parser frameworks available, e.g. Sprache.
You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables. Note that, because the comma is optional, this also accepts a trailing comma, as in (int a,).
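To see how the optional-comma pattern behaves, and what a stricter alternative could look like, here is a minimal sketch in Python (used as a stand-in for .NET; conditional groups with a lookahead condition are not available in Python's re, so that variant is not shown). LENIENT is the ,? pattern from above. STRICT and parse_signature are hypothetical additions: STRICT uses the common "item, then zero or more separator-plus-item" shape instead of an optional comma.

```python
import re

TYPE = r"(?:bool|int|string|float)"
PARAM = rf"{TYPE}\s+\w+"

# The optional-comma pattern from the answer above, unchanged.
LENIENT = re.compile(
    r"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*"
    r"\(((?:bool|int|string|float)\s+\w+,?\s*)*\)"
)

# Hypothetical stricter variant: one parameter, then any number of
# ", parameter" repetitions, so a comma is required between parameters
# and not allowed after the last one.
STRICT = re.compile(
    rf"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*"
    rf"\(\s*((?:{PARAM}(?:\s*,\s*{PARAM})*)?)\s*\)"
)


def parse_signature(line):
    """Return (return_type, name, [(type, name), ...]) or None."""
    m = STRICT.search(line)
    if m is None:
        return None
    return_type, name, arg_list = m.groups()
    params = [tuple(p.split()) for p in arg_list.split(",")] if arg_list else []
    return return_type, name, params


print(parse_signature("void FunctionName(int a, string b)"))
print(LENIENT.search("void F(int a,)") is not None)   # lenient accepts this
print(parse_signature("void F(int a,)"))              # strict rejects it
```

Capturing the whole argument list in one group and splitting it afterwards avoids relying on a repeated capture group, which in Python only retains its last repetition.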

Validate a renaming pattern with RegEx

I'm trying to validate a pattern used for renaming.
The user will enter a value like:
%1% - %3%%2%
I'm able to match with a regex, everything is ok:
[^%]*(%[\d]+%)+[^%]*
But before that I want to validate the string and be able to detect when the user has made mistakes like:
%1% - %3%2%
%1% - %3%%2
...
Whatever I try, I can get the correct values, but I can't tell whether the string is well formatted or not, other than by checking manually.
Is there any way to solve this problem with regex? Or maybe I don't need regex for this...
EDIT FOR CLARIFICATION
For a good example, just take a program which renames your mp3 files.
You define a mapping between %1% and the track title, %2% for the artist, ...
Sorry, my mistake was to provide only one string. But the user can submit:
%1% - %3%%2%
%1%_%2%%3%
%1%%3% %2%
...
Whatever they want. Parsing the string when everything is correct seems OK to me, unless I find a tricky bad example.
But before I save it, I want to validate it and reject a string like
%1% - %3%%2
My problem is detecting the invalid value. What I did, which doesn't seem clean to me, is to apply my regex and then verify that the total number of "%" in the string is even and that this total divided by 2 equals the number of groups found. But I'm not sure it always works (and I'm not sure that last sentence is clear).
I think this regex is what you're trying to accomplish.
(%[\d]%) - (%[\d]%)*
"I don't know if the string is well formatted or not."
This pattern includes a check for three consecutive % characters (%%%), which seems to catch a good number of badly formatted inputs. Adding the $ anchor then makes the pattern validate only fully formed strings.
The valid pattern of (%\d%) is what we seek:
^ # Start Anchor
(?!.+%%%) # Stop if 3 % anywhere.
%\d% # First \d
\s-\s # Dash and spaces
(%\d%)+ # Groups of numbers
$ # Stop Anchor
It works on the one example you gave %1% - %3%%2% and doesn't match on the 2 failure examples you provided.
Because this pattern is commented, you will need to use RegexOptions.IgnorePatternWhitespace as a regex option. Otherwise, delete all comments and join everything onto one line without spaces.
When one uses * (zero to many), it can create some ungodly backtracking scenarios which can actually fail a good pattern. Is there really going to be zero items?
Your examples don't show it; if not, why not use + (one to many)?
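The commented pattern can be tried against the examples from the question. Below is a minimal sketch in Python, where re.VERBOSE plays the role of .NET's RegexOptions.IgnorePatternWhitespace. ANSWER_PATTERN is the pattern from this answer. GENERAL_PATTERN and is_valid are hypothetical additions, assuming that any text without % may surround the placeholders and that a placeholder is % + digits + %; they are not part of the answer.

```python
import re

# The documented pattern from the answer; VERBOSE allows the comments.
ANSWER_PATTERN = re.compile(
    r"""
    ^            # start anchor
    (?!.+%%%)    # fail if three percent signs appear in a row
    %\d%         # first placeholder
    \s-\s        # dash surrounded by whitespace
    (%\d%)+      # one or more further placeholders
    $            # end anchor
    """,
    re.VERBOSE,
)

# Hypothetical generalisation: free text (no %) around one or more
# well-formed placeholders, so other separators are accepted too.
GENERAL_PATTERN = re.compile(r"^(?:[^%]*%\d+%)+[^%]*$")


def is_valid(template, pattern=GENERAL_PATTERN):
    """Return True when the whole template matches the given pattern."""
    return pattern.search(template) is not None


for sample in ["%1% - %3%%2%", "%1%_%2%%3%", "%1% - %3%2%", "%1% - %3%%2"]:
    print(sample, is_valid(sample, ANSWER_PATTERN), is_valid(sample))
```

Both patterns are anchored at each end, so a match means the whole string is well formed, which is the yes/no answer the question asks for before saving.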

Finding optional groups with random order using regex

I'm trying to get the following using Regex.
This is sample input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emcc=ME#HOST.COM -embcc=YOU#HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear, but sometimes they do, and then I need to capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
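The iterate-and-check loop described above could look roughly like this. It is a minimal sketch in Python rather than .NET (the pattern itself is the quoted-value one from this answer and uses no engine-specific syntax). The names parse_email_options and KNOWN_KEYS, the EMAIL_ prefix handling, and the choice to raise ValueError for an unknown key are assumptions made for illustration.

```python
import re

# Key = everything after -em up to "=", value = quoted string or bare word.
OPTION = re.compile(r'-em([^\s=]+)=("[^"]*"|[^\s]+)')

# Assumed set of recognised keys, taken from the question's examples.
KNOWN_KEYS = {"to", "from", "cc", "bcc", "subject"}


def parse_email_options(args):
    """Map each recognised -emKEY=VALUE pair to EMAIL_KEY -> VALUE."""
    options = {}
    for m in OPTION.finditer(args):
        key, value = m.group(1), m.group(2)
        if key not in KNOWN_KEYS:
            # The regex accepted the shape; the code rejects the unknown key.
            raise ValueError(f"unrecognized option: -em{key}")
        options["EMAIL_" + key.upper()] = value.strip('"')
    return options


print(parse_email_options(
    '-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"'
))
```

Because the options are collected one match at a time, their order and presence do not matter: CC and BCC simply appear in the result when they were given.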
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.

why is this single-line regex not returning ALL matches?

I just asked a similar question to this one, and there was an excellent and accurate answer, but it turns out I now have a brand new problem. It turns out I have a single line of relevant input. I'm not sure how to ask this in an abstract way so I'll just jump right into my input:
(EDITED to provide a better example)
bear999bear888bear777bear666fox---bear222bear333bear444bear555fox
(The items between the markers are not necessarily numeric)
This is the expression (EDITED to match updated input example):
bear.*bear(?<matchString>(.(?!bear.*bear))*?)bear.*fox
It's returning 444. Is there a way that I can tweak this to return both 444 and 777? It seems to be skipping over the first match and favoring only the latter. I have the ! exclusion so that it matches only the innermost on the left side.
I've been testing here:
http://regexlib.com/RETester.aspx
This works great when I break it into two lines and turn on multi-line. Why does it stop working when the input is on a single line?
Any advice would be appreciated!
This should work (it does work in that regex tester you've linked in the question):
(?<=bear)(?:(?!bear).)*(?=bear(?:(?!bear).)*fox)
It reads like "let's match something that is preceded by bear, has no bear sequence within, and is followed by the bear - no bear - fox sequence".
The capturing groups are absent here; the whole match is what you need.
And yes, I just can't help wondering why this should be done with a single regex when it actually looks like a job for a tokenizer. For example, you can split your line by 'fox' first, then split each part by 'bear', and take the second-to-last item of each result.
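Both approaches from this answer can be compared side by side. This is a minimal sketch in Python standing in for .NET (the lookaround pattern is copied unchanged and is valid in both engines); the function names by_regex and by_splitting are assumptions, and dropping the chunk after the last fox is an added detail so that text not terminated by fox is ignored, as it is by the regex.

```python
import re

LINE = "bear999bear888bear777bear666fox---bear222bear333bear444bear555fox"

# Preceded by "bear", containing no "bear", and followed by
# "bear" + text without "bear" + "fox".
INNERMOST = re.compile(r"(?<=bear)(?:(?!bear).)*(?=bear(?:(?!bear).)*fox)")


def by_regex(line):
    """Return every whole match of the lookaround pattern."""
    return INNERMOST.findall(line)


def by_splitting(line):
    """Split on 'fox', then on 'bear', and keep the second-to-last item."""
    results = []
    # The final chunk comes after the last "fox", so it is left out.
    for chunk in line.split("fox")[:-1]:
        parts = chunk.split("bear")
        # At least two "bear" markers are needed to enclose an item.
        if len(parts) >= 3:
            results.append(parts[-2])
    return results


print(by_regex(LINE))
print(by_splitting(LINE))
```

The splitting version has no backtracking to reason about, which is the practical argument for treating this as a tokenizing job.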
Your first .* is greedy. This will work (with xxx and yyy standing in for the markers, here bear and fox):
xxx.*?xxx.*?xxx(?<matchString>.*?)xxx.*?yyy

RegEx for a specific string pattern

Using C#, I will be handling character arrays of info, looking for the following pattern:
a pipe (0x7C), 2 to 7 pairs of characters, followed by another pipe (0x7C).
Stated another way:
|1122[33][44][55][66][77]|
The character pairs consist of characters whose range is from 33-124 decimal ( '!' to '|').
Pairs 3 through 7 are optional, but occur in order, if they occur, so you could have
|1122| <---shortest
|112233|
|11223344|
|1122334455|
|112233445566|
|11223344556677| <---longest
I want to 1) find out if this pattern exists in the character array, 2) extract the individual pairs. These tasks can be separate. I think the best approach to this would be a RegEx, but so far I haven't been able to dream-up an expression to get the job done.
Is a RegEx the way to go and what would a solution for the RegEx itself be?
Is there a better way?
Chuck
If I understand your question correctly the correct pattern would be:
\|([!-|]{2}){2,7}\|
Or to capture each set
\|([!-|]{2})([!-|]{2})([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?\|
Not sure if the range will work directly like that or not, so you may need to do [A-Za-z!##$......] if the simplified range doesn't work.
Also, I think you don't want to include the pipe (|) in the range, as it could mess up the rest, so [!-{] might be better.
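Detection and extraction can be combined by capturing the whole payload once and slicing it into pairs afterwards. The following is a minimal sketch in Python as a stand-in for C# (the character-class range is written the same way in both engines). It assumes the narrower [!-{] range suggested above, so the pipe acts only as a delimiter; the names FRAME and extract_pairs are hypothetical.

```python
import re

# A pipe, 2 to 7 pairs of characters in the range '!' to '{', then a pipe.
# The outer group captures all pairs together; the inner one is non-capturing.
FRAME = re.compile(r"\|((?:[!-{]{2}){2,7})\|")


def extract_pairs(text):
    """Return the list of 2-character pairs, or None if no frame is found."""
    m = FRAME.search(text)
    if m is None:
        return None
    payload = m.group(1)
    # The payload length is always even, so it slices cleanly into pairs.
    return [payload[i:i + 2] for i in range(0, len(payload), 2)]


print(extract_pairs("xx|112233|yy"))
print(extract_pairs("|11|"))
```

Slicing the captured payload avoids the seven separate optional groups of the longer pattern while still answering both parts of the question: whether the pattern exists, and what the individual pairs are.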
