Advanced Regex - Capture Whole Group of Complex Statement inside Replace - c#

I'm working on a project and need to parse some data. The tool I work with is fully command-based and returns all kinds of output, so a regex comes in handy instead of guessing which line is which. I need to parse lines like this:
1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S
Depending on the conditions, the line may appear in many shapes, but hopefully this pattern will work:
.*((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
There is just one issue: I need to capture only this part:
YR VC MC, and there's no guarantee that there are always three of them. I tried grouping with parentheses as well as naming the group, as you can see. I don't know how to capture a group in C#, though I think it involves Regex.Replace, replacing the whole input with the selected group (here, the 'Class' group). However, it only matches the last repetition of the inner parentheses, not all of them: for the line above it returns "MC" instead of all three. I also tried replacing (\w{2}\s+)+ with (\w{2}\s+|\w{2}\s+\w{2}\s+|\w{2}\s+\w{2}\s+\w{2}\s+), but that didn't work either.
Can anyone help me with this?
Thank you.

Capture Groups
Let's back up a bit. First, we need to understand what capture groups are. Everything placed within plain parentheses is a capturing group. So, for instance, the regex (\d)(\d) applied to the string 89 will capture 8 in the first group and 9 in the second group. Let's say you make the second digit optional, so (\d)(\d?). Now, if you match just 8, the first group will be 8 and the second group will be an empty string. In this way, we can match all groups, even if some are 'missing'.
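As a small illustration of the paragraph above, here is a minimal C# sketch that runs (\d)(\d?) against both inputs. The class and method names are hypothetical and exist only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureGroupDemo
{
    // Returns the values of capture groups 1 and 2 for the pattern (\d)(\d?).
    public static string[] DigitGroups(string input)
    {
        Match m = Regex.Match(input, @"(\d)(\d?)");
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        foreach (string input in new[] { "89", "8" })
        {
            string[] g = DigitGroups(input);
            // For "8" the second group is present but empty.
            Console.WriteLine($"{input}: group 1 = \"{g[0]}\", group 2 = \"{g[1]}\"");
        }
    }
}
```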
Non-Capture Groups
Your regular expression seems to have a lot of unnecessary capture groups. If you don't need a group, don't use parentheses. For example, for (/)?, you can simply remove the parentheses. What if you want to match the string "123" ten times? You'd probably write something like (123){10}. But hey, that's another unneeded capture group! You can create a non-capture group by using (?:) instead of (). This way you won't capture whatever is within the parentheses, but you can still use them for grouping.
Your Regex
Removing all unnecessary capture groups from your regex, we end up with:
.*/?(\w{2}\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*
This includes the space within the capture group, so let's bring that out:
.*/?(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*
At this point, the capture group (\w{2}) only matches the MC in your sample string, so let's do what you did and split it into three different capture groups. Note that we can't use something like (\w{2}){1,3} (which matches \w{2} one to three times), because this still has only one set of parentheses, and therefore only one capture group, whose value is the last repetition. As such, we need to expand our (\w{2})\s+ to (\w{2})\s+(\w{2})\s+(\w{2})\s+. This regex will correctly capture your three strings.
Regex in C#
In C#, we have this handy Regex class in System.Text.RegularExpressions. This is how you would use it:
using System.Linq;
using System.Text.RegularExpressions;

string regex = @".*/?(\w{2})\s+(\w{2})\s+(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*";
string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
Match matches = Regex.Match (sample, regex);
// Index 0 is the whole match; indexes 1 to 3 are the capture groups
string[] stringGroups = matches.Groups
    .Cast<Group> ()
    .Select (el => el.Value)
    .ToArray ();
Here, stringGroups will be a string array with all the capture groups. stringGroups[0] will be the entire match (so in this case, 1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S), stringGroups[1] will be the first capture group (YR in this case), stringGroups[2] the second, and stringGroups[3] the third.
PS: I highly recommend Debuggex for testing this type of stuff.

Make it un-greedy:
.*?((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
  ^
Or remove both greedy dots from both ends. You don't need them:
/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}
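To see what the pattern without the greedy dots returns, here is a minimal C# sketch. WholeRun uses that pattern as written. EachCode is a variant that is not in the answers above: it names the inner \w{2} group and reads each repetition from the group's Captures collection, which .NET keeps for repeated groups. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ClassCodeDemo
{
    // The whole run of two-letter codes, read from the named group "Class".
    public static string WholeRun(string line)
    {
        Match m = Regex.Match(line, @"/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}");
        // The group includes the trailing whitespace, so trim it.
        return m.Success ? m.Groups["Class"].Value.Trim() : string.Empty;
    }

    // Variant (assumption, not from the answers): one capture per repetition.
    public static string[] EachCode(string line)
    {
        Match m = Regex.Match(line, @"/?(?:(?<Class>\w{2})\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}");
        return m.Groups["Class"].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
        Console.WriteLine(WholeRun(sample));
        Console.WriteLine(string.Join("|", EachCode(sample)));
    }
}
```

No Regex.Replace is needed in either case; the value is read directly from the match.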

Related

Match anything except for the character selected by the group previously

I have this regular expression:
(?=(([a-z]{1})([a-z]{1})\2))
Through which I am trying to fetch all the palindromic strings. So, if this is my string:
mnonopooo
My regular expression does select all the palindromic strings in the string, but it also selects ooo. I know the reason: it is this center part of my regular expression:
(?=(([a-z]{1}) "([a-z]{1})" \2))
This part should be like this, match everything except for the backreference group \2.
So I tried something like this, but it didn't work:
(?=(([a-z]{1}) (?!\2) \2))
So basically, my regular expression has three parts:
Select any single character (This is working)
Match any single character not equal to the character matched in point 1 (Not working)
Select the same character that is matched in point 1 (Working using backreference)
So, the second part I am not able to make.
Can anybody please help?
Just add a negative lookahead (i.e., (?!\2)) before the middle group to make sure the first matched letter is not repeated in the middle position, and keep the third group as is (you still need it):
(?=(([a-z])(?!\2)([a-z])\2))
Please note that the usage of {1} is redundant so I removed them.
Demo: https://regex101.com/r/BVvwnp/1
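The question does not name a language, so the following is only a sketch of how the pattern above could be exercised in C#. Because the whole pattern is a lookahead, every match is zero-length and the palindrome itself sits in group 1. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PalindromeDemo
{
    // Collects every three-letter palindrome whose middle letter differs from the outer ones.
    public static string[] Find(string text)
    {
        return Regex.Matches(text, @"(?=(([a-z])(?!\2)([a-z])\2))")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value) // the lookahead consumes nothing, so read group 1
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Find("mnonopooo")));
    }
}
```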

repeat a group of characters

I have the following input to be matched by a regex:
1.1.1.1
1.01.1.1
01.01.091.01
1.10.100.0010
So I always have four groups of digits. The first three lines should match; the last one should not.
So I wrote this regex:
^(\d*[1-9]+\.){4}$
In general, this regex should match only strings in which no group ends in a zero. Put more simply: I want to reject any number with trailing zeros.
However, this doesn't match anything. regex101.com reports this:
A repeated capturing group will only capture the last iteration. Put a
capturing group around the repeated group to capture all iterations or
use a non-capturing group instead if you're not interested in the data
But when I add a further capturing group I get the same message:
^((\d*[1-9]+\.)){4}$
The same applies to a non-capturing group:
^(?:\d*[1-9]+\.){4}$
Of course I could just write the same group four times, but that's fairly clumsy and hard to read.
As mentioned by others the dot is the point, so we have three identical groups and one without the dot.
So this regex does it for me:
(?:\d*[1-9]\.){3}(?:\d*[1-9])
You never specify the dot in your patterns. What you ask for is, in fact, not a repetition of four, it is a specific single pattern of four numbers separated with dots.
^(\d*[1-9]+\.\d*[1-9]+\.\d*[1-9]+\.\d*[1-9]+)$
The only thing in there you could consider a repetition is the "number + dot" part, but then you repeat that three times and add another number. Then the regex would become this:
^((\d*[1-9]+\.){3}\d*[1-9]+)$
However, your third line contains a space at the end, so you may want to add extra checks to trim those off.
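The question does not name a language, so this is only a sketch in C# of how the "three repetitions plus a final number" pattern could be checked against the four sample inputs. It trims each input first because of the trailing space mentioned above. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingZeroDemo
{
    // Three "number + dot" groups followed by a final number; each number must end in 1-9.
    private static readonly Regex NoTrailingZero =
        new Regex(@"^((\d*[1-9]+\.){3}\d*[1-9]+)$");

    public static bool IsValid(string input)
    {
        // Trim surrounding whitespace so a stray trailing space does not break the match.
        return NoTrailingZero.IsMatch(input.Trim());
    }

    public static void Main()
    {
        foreach (string s in new[] { "1.1.1.1", "1.01.1.1", "01.01.091.01 ", "1.10.100.0010" })
        {
            Console.WriteLine($"{s.Trim()} -> {IsValid(s)}");
        }
    }
}
```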
The problem with your regex is that, without the ., it fails to find four groups of digits, because they always have dots in between.
Try this instead:
(?:(\d*[1-9])\.?){4}

Weird Regex behavior in C#

I am trying to extract some alphanumeric expressions out of a longer word in C# using regular expressions. For example, I have the word "FooNo12Bee". I use the following regular expression code, which returns two results, "No12" and "No":
string alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"(No|Num)\d{1,3}");
If I use the following expression, without parentheses and without any alternative for "No", it works the way I expect and returns only "No12":
string alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"No\d{1,3}");
What is the difference between these two expressions? Why does using parentheses produce a redundant result for "No"?
Parentheses in a regex create capture groups, meaning that whatever is matched between them is captured and stored as a group.
If you don't want a capture group but still need a group for the alternation, use a non-capture group instead by putting ?: after the opening parenthesis:
Match m = Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");
Usually, if you don't want to change the regex for some reason, you can simply retrieve the group 0 from the match to get only the whole match (and thus ignore any capture groups); in your case, using m.Groups[0].Value.
Last, you can improve the efficiency of the regex by a notch using:
Match m = Regex.Match(alfaNumericWord, @"N(?:o|um)\d{1,3}");
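To make the difference visible, here is a minimal C# sketch that lists every entry of m.Groups for the patterns discussed above. Index 0 is always the whole match; further entries appear only for capturing groups. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupCountDemo
{
    // Returns the value of every group (index 0 = whole match) for the given pattern.
    public static string[] GroupValues(string pattern)
    {
        Match m = Regex.Match("FooNo12Bee", pattern);
        return m.Groups.Cast<Group>().Select(g => g.Value).ToArray();
    }

    public static void Main()
    {
        string[] patterns = { @"(No|Num)\d{1,3}", @"(?:No|Num)\d{1,3}", @"N(?:o|um)\d{1,3}" };
        foreach (string pattern in patterns)
        {
            Console.WriteLine($"{pattern} -> {string.Join(", ", GroupValues(pattern))}");
        }
    }
}
```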
I don't know the exact name for it, but it happens because putting parentheses around it creates a new group. It is well explained here:
Besides grouping part of a regular expression together, parentheses
also create a numbered capturing group. It stores the part of the
string matched by the part of the regular expression inside the
parentheses.
The regex Set(Value)? matches Set or SetValue. In the first case, the
first (and only) capturing group remains empty. In the second case,
the first capturing group matches Value.
It is because the parentheses create a group. You can make the group non-capturing with ?: like so:
Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");

Simple regex doesn't work

I want to match the strings "F1" to "F12". I only need the number. I'm out of practice; here is my first try:
var r = new Regex(@"^(?:[F])[\d]{1,2}$");
It matches, but it returns "F1" where I expected to get "1".
What have I done wrong?
Maybe you want to use lookbehind:
var r = new Regex(@"(?<=^F)\d\d?$");
Even though you are using a non-capturing group for the "F", the overall match for your Regex will return the entire string it matched. Groups are used to outline sub-expressions within your regular expression whose value you want to be able to extract. Non-capturing groups are used if you want to specify a sub-expression without having it stored in a group. They allow you to apply quantifiers to your sub-expression, but do not allow you to extract the resulting value after running the regex against a string. They are typically used for performance gains, since capturing groups add extra overhead.
If you want to get just the number, you need to put the number portion in a capturing group and look at the Groups property of the resulting Match (assuming you are calling the r.Match function).
The updated Regex would be:
var r = new Regex(@"^(?:[F])([\d]{1,2})$");
Since our number is inside the first set of parentheses associated with a capturing group, it will be group 1. You could also name your group to avoid confusion or possible errors if the regex gets updated at a later date.
Alternately, you can just use look-behind as M42 has suggested.
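Both approaches can be sketched in C# as follows. WithGroup uses a named capturing group (the name "number" is an arbitrary choice for this example), and WithLookbehind uses a look-behind written here as (?<=^F) so that the anchor sits inside the look-behind. Neither pattern restricts the number to 1 through 12; they accept any one or two digits, like the patterns above. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FunctionKeyDemo
{
    // Reads the digits from a named capturing group.
    public static string WithGroup(string key)
    {
        Match m = Regex.Match(key, @"^F(?<number>\d{1,2})$");
        return m.Success ? m.Groups["number"].Value : string.Empty;
    }

    // The look-behind checks for "F" at the start without making it part of the match.
    public static string WithLookbehind(string key)
    {
        Match m = Regex.Match(key, @"(?<=^F)\d{1,2}$");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        foreach (string key in new[] { "F1", "F12", "F123" })
        {
            Console.WriteLine($"{key}: group = \"{WithGroup(key)}\", look-behind = \"{WithLookbehind(key)}\"");
        }
    }
}
```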

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
It does the job by splitting the words on every occurrence of a ",". What I want to know is this: say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it this way?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the comma and any whitespace before and after it.
Again, we make it optional by using the '?' modifier; otherwise the last 'vXX' group would not match unless the string ended with a final ','.
In your trials, the Regex wouldn't match multiple times. Remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence, so you have to build your Regex not only to match the 'value_name' at the start of the string, but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    // Group 0 is the whole match, so start at 1 to visit only the capture groups
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it to get only v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas, and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
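Since the question is about C#, here is a minimal sketch of how the suggested pattern ^value_name\s+(?:([^,]+),?)* could be used there. It relies on the Captures collection, which in .NET keeps every repetition of a repeated group, so a single match is enough. The Trim call and the class and method names are assumptions made for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ValueListDemo
{
    // Returns every comma-separated value that follows "value_name".
    public static string[] Values(string line)
    {
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        return m.Groups[1].Captures
            .Cast<Capture>()
            .Select(c => c.Value.Trim()) // drop whitespace around each value
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Values("value_name v1,v2,v3,v4")));
    }
}
```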
