Validate a renaming pattern with RegEx - c#

I'm trying to validate a pattern used for renaming.
The user will enter a value like:
%1% - %3%%2%
I'm able to match with a regex, everything is ok:
[^%]*(%[\d]+%)+[^%]*
But before that, I want to validate the string and detect when the user has made mistakes like:
%1% - %3%2%
%1% - %3%%2
...
Whatever I try, I can get the corrected value, but I can't tell whether the string is well formatted or not, other than by checking manually.
Is there any way to solve this with regex? Or maybe I don't need regex for this...
EDIT FOR CLARIFICATION
For a good example, take a program that renames your mp3 files.
You define a mapping between %1% and the track title, %2% and the artist, ...
Sorry, my mistake was to provide only one string. The user can submit:
%1% - %3%%2%
%1%_%2%%3%
%1%%3% %2%
...
Whatever they want. Parsing the string when everything is correct seems to work fine, unless I find a tricky bad example.
But before I save it, I want to validate it and refuse a string like
%1% - %3%%2
My problem is finding the wrong values. What I have done, and it doesn't seem clean to me, is to use my regex and then verify that the total number of "%" characters in the string is even and that this total divided by 2 equals the number of groups found. But I'm not sure it always works (and I'm not sure that last sentence is clear).
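To make the counting check concrete, here is a minimal sketch of it. It is written in Python purely for illustration (the thread is about C#), and the function name is hypothetical; the placeholder pattern %\d+% is the one from my regex above.

```python
import re

# One placeholder: a percent sign, one or more digits, a percent sign.
PLACEHOLDER = re.compile(r"%\d+%")

def looks_well_formed(pattern: str) -> bool:
    """Counting check: even number of '%' and exactly half as many placeholders."""
    groups = PLACEHOLDER.findall(pattern)
    percent_count = pattern.count("%")
    return (
        bool(groups)
        and percent_count % 2 == 0
        and percent_count // 2 == len(groups)
    )

for candidate in ["%1% - %3%%2%", "%1% - %3%2%", "%1% - %3%%2"]:
    print(candidate, looks_well_formed(candidate))
```

Every "%" has to belong to a complete placeholder for the two counts to agree, which is why a stray or missing "%" makes the check fail.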

I think this regex is what you're trying to accomplish.
(%[\d]%) - (%[\d]%)*

"I don't know if the string is well formatted or not."
This pattern adds a check for three consecutive % characters (%%%), which seems to catch a good number of badly formatted inputs. We can then require the pattern to validate only good items by adding the $ anchor, so that only fully formed valid patterns match.
The valid pattern of (%\d%) is what we seek:
^ # Start Anchor
(?!.+%%%) # Stop if 3 % anywhere.
%\d% # First \d
\s-\s # Dash and spaces
(%\d%)+ # Groups of numbers
$ # Stop Anchor
It works on the one example you gave, %1% - %3%%2%, and doesn't match the two failure examples you provided.
Because this pattern is commented, you will need to use RegexOptions.IgnorePatternWhitespace as a regex option. Otherwise, delete all the comments and join the pattern onto one line without spaces.
When one uses * (zero to many), it can create some ungodly backtracking scenarios which can actually fail a good pattern. Is there really going to be zero items?
Your examples don't show it; if not, why not use + (one to many)?
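As a rough illustration of how a commented pattern is compiled, here is the same pattern in Python, where re.VERBOSE plays the role that RegexOptions.IgnorePatternWhitespace plays in .NET. This is only a sketch; the variable name is an assumption, and the pattern is the one shown above, unchanged.

```python
import re

# re.VERBOSE ignores whitespace and '#' comments inside the pattern,
# much like RegexOptions.IgnorePatternWhitespace does in .NET.
STRICT_PATTERN = re.compile(
    r"""
    ^            # Start anchor
    (?!.+%%%)    # Stop if 3 % anywhere
    %\d%         # First \d
    \s-\s        # Dash and spaces
    (%\d%)+      # Groups of numbers
    $            # Stop anchor
    """,
    re.VERBOSE,
)

for candidate in ["%1% - %3%%2%", "%1% - %3%2%", "%1% - %3%%2"]:
    print(candidate, bool(STRICT_PATTERN.match(candidate)))
```

Note that this pattern only accepts the "first placeholder, dash, more placeholders" layout with single-digit numbers, so it is stricter than the other layouts listed in the question.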

Related

Capture Text Surrounding Regex Match .NET

I am building an application, and I have a requirement to capture characters before and after matches. This seems to work okay, except when there are multiple matches within the surrounding capture.
Regex:
.{0,10}(?=abc)
This should capture up to 10 characters before the string "abc" is found.
The issue comes up if there is a recurrence of the match in the preceding text:
"qqqqabcabcqqq"
With the above text, I would expect two captures:
qqqq (the 4 characters before the first abc occurrence)
qqqqabc (the 7 characters before the second abc occurrence)
I am not, however, getting these matches. The only match I get is:
qqqqabc
I am certain that I am missing something, but I am not sure what. I believe that my regex is somehow being too greedy, and so it is overlooking the first match in favor of the larger, second one. Here is what I need:
I need a regex that:
1. Is for .NET
2. Looks within a string for X characters before an exact match on string S.
3. Includes any secondary match on S (call it S') that is found within X characters before S
4. Does not care in the slightest what these characters are.
I assure you, I tried looking for similar answers but I wasn't able to find anything that directly answers this question (which has been plaguing me for two days. Yes, I have to use regular expression). As for Regex flavor, I am working in .NET.
Thank you so much for any help.
Here it is:
(?<=(?<CharsBefore>.{0,10}))(?=abc)
Took me a while to remember that .NET allows variable-length positive lookbehinds.
I changed the way your initial version worked a bit.
Hope it helps!
PS: I've named the group, but you are obviously free to keep it nameless and work with numbered groups if you want a less cluttered regex, like so:
(?<=(.{0,10}))(?=abc)
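For readers who want to check the expected output outside .NET, here is a small Python sketch. Python's re module does not support variable-length lookbehind, so this swaps in a different technique: a zero-width lookahead locates each occurrence of the target, and slicing takes up to 10 characters before it. The function name and parameters are hypothetical.

```python
import re

def chars_before(text: str, target: str, limit: int = 10) -> list[str]:
    """Return up to `limit` characters preceding each occurrence of `target`."""
    # The zero-width lookahead finds every start position of target,
    # including occurrences that fall inside an earlier context window.
    positions = (m.start() for m in re.finditer(f"(?={re.escape(target)})", text))
    return [text[max(0, pos - limit):pos] for pos in positions]

print(chars_before("qqqqabcabcqqq", "abc"))  # ['qqqq', 'qqqqabc']
```

In .NET, the lookbehind version above gives the same two captures through the CharsBefore group.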

RegEx to split and extract based strict requirement

I’m using Nintex Workflows with a RegEx action. I believe the RegEx is based on .NET. I need to run a RegEx on data sent to me by users, who enter it in different formats depending on the person writing it.
Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra
My goal is this.
Start the extract after Test:
Capture the last 4 of the alpha numeric before the parenthesis
Capture the numbers only inside the parenthesis
Split each data based on ;
End the whole capture when the word STOP is found.
End results
BC12 (1,2,3,4,5,6,7,8,9);
DE34 (1,2,3,4,5,6,7,8,9);
FG56 (1,2,3,4,5);
I have tried splitting the data, forward lookup and exclude, and I can’t seem to get everything to work together. If I have to execute multiple RegEx actions to achieve my results, I’m OK with that.
I’ve tried the following to achieve each one of my goals:
(?s)(?<=^.*?Test:\s)[a-zA-Z0-9]+ This only captures the first ABC12 or A-BC12, then stops.
[,;] This splits the data so it is easier to maintain. However, the word Test: is captured.
I feel I'm going in the right direction, however I'm missing something or taking the wrong approach. Any help would be greatly appreciated.
If you need to omit the first group you can use this regex: Test:\s*A[^;]*;(.*?)STOP.
That way, you can take $1 and split it on ;.
Edit: Clarifications have rendered the above solution obsolete. I've made new stuff that will directly address your steps:
a. Start the extract after Test:
b. Capture the last 4 of the alpha numeric before the parenthesis
c. Capture the numbers only inside the parenthesis
d. Split each data based on ;
e. End the whole capture when the word STOP is found.
You're actually looking for something like:
1. Use Test:\s*(.*?)STOP. This addresses steps a and e.
2. Take $1 and use [A-Z0-9]{4}\s*\(([^)]*)\);. This addresses steps b and d.
3. Take the $1 from the previous step, and use ([0-9]+) to get the numbers. This will get all the numbers, and if given 9,10 it will produce two matches: 9 and 10.
You may need to use modifiers, like i for case insensitive, s for single line, and g for global.
I hope this is finally what you're looking for!
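The three steps above can be sketched end to end as follows. This is a Python illustration rather than a Nintex configuration; the function name is hypothetical, and one assumption is added to the step-two pattern: an extra capture group around [A-Z0-9]{4} so the four-character code can be reused in the output.

```python
import re

RAW = """Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra"""

def extract_records(raw: str) -> list[str]:
    # Steps a and e: keep what lies between the first "Test:" and "STOP".
    # DOTALL is the "single line" modifier, so . also matches newlines.
    body = re.search(r"Test:\s*(.*?)STOP", raw, re.DOTALL)
    if body is None:
        return []
    records = []
    # Steps b and d: last four alphanumerics before "(", one record per ";".
    for match in re.finditer(r"([A-Z0-9]{4})\s*\(([^)]*)\);", body.group(1)):
        # Step c: keep only the numbers found inside the parentheses.
        numbers = re.findall(r"[0-9]+", match.group(2))
        records.append(f"{match.group(1)} ({','.join(numbers)});")
    return records

for record in extract_records(RAW):
    print(record)
```

With the sample data from the question, this prints the three lines listed under "End results".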

Extracting words from lines that match different patterns

I'm monitoring incoming e-mail subjects, and each subject may contain a specially formatted code which I use to reference something else down the line.
These codes can be anywhere within the string, and sometimes are not there at all. The problem I'm having is my lack of RegEx skills (I assume RegEx is the best option for this?).
An example of a subject would be:
"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"
The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".
The format of MZ5051CLA will be:
- Always starts with "MZ"
- Followed by a number
- Always ends with "CLA"
Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?
I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.
Edit:
As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.
Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.
Try this regex:
^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$
The regex below would match only the first code, MZ5051CLA:
\bMZ\d+CLA\b
But this would match both codes, MZ5051CLA and 11123MTX:
\b[A-Z0-9]+$
It matches the run of uppercase letters and digits at the end of a line.
This would get you the alphanumeric string which starts with MZ and ends with CLA, or which starts with a number and ends with MTX:
(?:\bMZ\d+CLA\b|\b\d+MTX\b)
Both Codes in One Pattern
It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:
\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*
With the two sample subjects, only the codes themselves are matched. Of course, false positives are possible.
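A quick way to try the lookahead pattern against the two sample subjects is sketched below. It is in Python for illustration only; the third subject is a made-up example added to show the kind of false positive mentioned above.

```python
import re

# Lookahead requires at least one uppercase letter in the run;
# the main pattern requires at least one digit.
CODE = re.compile(r"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*")

subjects = [
    "Please refer to reference MZ5051CLA",
    "Attention for Mr Danshi, RE. 11123MTX",
    "Flight BA123 delayed",  # hypothetical subject: a false positive
]

for subject in subjects:
    print(subject, "->", CODE.findall(subject))
```

Ordinary words such as "Please" or "RE" are skipped because they contain no digit, while anything mixing uppercase letters and digits is picked up.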
So, if you don't mind false positives, use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. This will work well in general.
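The per-mailbox idea from the question's edit could be sketched like this: one regex setting per mailbox, looked up before scanning the subject. This is a Python illustration; the mailbox names and the function name are hypothetical, and the two patterns are the word-bounded ones given in the answers above.

```python
import re

# Hypothetical mailbox names, each mapped to its own code format.
MAILBOX_PATTERNS = {
    "claims": re.compile(r"\bMZ\d+CLA\b"),
    "metrics": re.compile(r"\b\d+MTX\b"),
}

def extract_codes(mailbox: str, subject: str) -> list[str]:
    """Return every code in `subject` that matches the mailbox's format."""
    return MAILBOX_PATTERNS[mailbox].findall(subject)

print(extract_codes("claims", "Please refer to reference MZ5051CLA"))
print(extract_codes("metrics", "Attention for Mr Danshi, RE. 11123MTX"))
```

Keeping each format in its own narrow pattern avoids the false positives that a catch-all pattern accepts.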

Match an expression in pattern matching 0 or more times C#

I am using C# pattern matching with Regular Expressions using Visual Studio 2010.
So my issue is I want to match strings like:
dog1dog235cat7 Winners
lizard2433cat23dog44 Losers
dog23 Winners
where I have letters followed by some digits then followed by 0 or more letter/digit combos. There will always be a space followed by some phrase.
I am trying to figure out how to discriminate against things like "dog7 bones and treats".
The pattern I currently came up with is:
[a-zA-Z]+[0-9]+([a-zA-Z]+[0-9]+)*\s\w
The issue is that I have not found any good information on testing for a block of a pattern that occurs 0 or more times. So I don't know if there is a good grouping character that indicates this block can occur 0 or more times. I am attempting this with the parentheses in ([a-zA-Z]+[0-9]+)*, though I believe that is normally used with the Group keyword to pull out instances of part of a pattern for later use.
So does anyone know how I can get the piece of the pattern that is [a-zA-Z]+[0-9]+ to be checked for occurring 0 or more times?
(I've looked around, but I haven't seen a C# version about matching a group of characters occurring 0 or more times.)
If it helps, I am comparing strings to the patterns. But again, I am just seeing if there's a way to discriminate against the extra stuff. "dog7 bones and treats" does have a segment that matches my pattern (dog7 bones), but I was wondering if there was a way to say that if there's anything extra after this, it's not a match (the extra being "and treats").
So I have been looking at this, and from what I see, pattern matching is more like "what matches what I'm hunting for" than "if something doesn't match this EXACTLY then reject". Based on what I've read on MSDN and what I've tried after looking at http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx multiple times, I have concluded that for my needs I will simply have to add additional code to split my strings on white space; if the split produces more than 2 parts, then I know there was additional input that I didn't want. Such as
string[] words = myInput.Split(' ');
if (words.Length > 2)
{
    // More than two space-separated parts: ignore this string.
}
Or I can do the opposite and check whether the length is 2, and if so, the string is good to work with.
I will use the pattern matching to make sure the string inputs are still what they need to be, but I'm going to have to use this additional check to discriminate against the extra unwanted stuff.
But again, unless someone else knows how to make strings like "dog7 bones and treats" ignorable when I'm looking for things like "AB23454 CD43", this is the solution to my problem.
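One alternative to splitting, offered only as a sketch, is to require the pattern to cover the whole string, so that anything left over after the phrase causes a rejection. The example below uses Python's re.fullmatch; in a .NET pattern the same effect is usually obtained by wrapping the pattern in ^ and $. It assumes the trailing phrase is a single word (\w+), and the function name is hypothetical.

```python
import re

# Letters+digits, then zero or more further letters+digits blocks,
# then one whitespace character and a single word.
ENTRY = re.compile(r"[a-zA-Z]+[0-9]+(?:[a-zA-Z]+[0-9]+)*\s\w+")

def is_valid_entry(text: str) -> bool:
    # fullmatch succeeds only if the pattern consumes the entire string.
    return ENTRY.fullmatch(text) is not None

for text in ["dog1dog235cat7 Winners", "dog23 Winners", "dog7 bones and treats"]:
    print(text, is_valid_entry(text))
```

The parenthesized block followed by * is what makes the letters-then-digits block repeat 0 or more times; the (?: ) form groups without capturing.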

Finding optional groups with random order using regex

I'm trying to get the following using Regex.
This is sample input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emcc=ME#HOST.COM -embcc=YOU#HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear, but sometimes they do, and then I need to capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
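A minimal sketch of that loop, written in Python for illustration, is shown below. The set of known keys is taken from the question's attempt (to, cc, bcc, from, subject); the function name, the EMAIL_ prefixing, the stripping of surrounding quotes, and raising an exception instead of printing an error and exiting are all assumptions.

```python
import re

# Key: anything but whitespace or '='. Value: a quoted string or a bare token.
OPTION = re.compile(r'-em([^\s=]+)=("[^"]*"|[^\s]+)')
KNOWN_KEYS = {"to", "from", "cc", "bcc", "subject"}

def parse_email_options(args: str) -> dict[str, str]:
    options = {}
    for match in OPTION.finditer(args):
        key, value = match.group(1), match.group(2)
        if key not in KNOWN_KEYS:
            # A mistyped option still matches, so it can be reported by name.
            raise ValueError(f"unknown option: -em{key}")
        options["EMAIL_" + key.upper()] = value.strip('"')
    return options

print(parse_email_options(
    '-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"'
))
```

Because the regex accepts any key, options may appear in any order and optional ones such as cc and bcc simply show up in the result when present.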
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.
