Simple regex doesn't work - c#

I want to match the strings "F1" to "F12". I only need the number. I'm out of practice, so here is my first try:
var r = new Regex(@"^(?:[F])[\d]{1,2}$");
It matches, but returns "F1", whereas I expect to get "1".
What have I done wrong?

Maybe you want to use lookbehind:
var r = new Regex(@"(?<=^F)\d\d?$");

Even though you are using a non-capturing group for the "F", the overall match for your Regex will return the entire string it matched. Groups are used to outline sub-expressions within your regular expression whose values you want to be able to extract. Non-capturing groups are used if you want to specify a sub-expression without having it stored in a group. They allow you to apply quantifiers to your sub-expression, but do not allow you to extract the resulting value after running the regex against a string. They are typically used for performance gains, since capturing groups add extra overhead.
If you want to get just the number, you need to put the number portion in a capturing group and look at the Groups property of the resulting Match (assuming you are calling the r.Match function).
The updated Regex would be:
var r = new Regex(@"^(?:[F])([\d]{1,2})$");
Since our number is inside the first set of parentheses associated with a capturing group, it will be group 1. You could also name your group to avoid confusion or possible errors if the regex gets updated at a later date.
Alternately, you can just use look-behind as M42 has suggested.
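As a rough illustration of the capturing-group approach described above, here is a minimal sketch using a named group. The class name FKeyDemo, the method ExtractNumber and the group name "num" are made up for this example, and the pattern accepts any one- or two-digit number, just like the original.

```csharp
using System;
using System.Text.RegularExpressions;

static class FKeyDemo
{
    // The "F" is matched but not captured; the named group "num" holds only the digits.
    static readonly Regex FKey = new Regex(@"^F(?<num>\d{1,2})$");

    // Returns the digits after "F", or an empty string when the input does not match.
    public static string ExtractNumber(string input)
    {
        Match m = FKey.Match(input);
        return m.Success ? m.Groups["num"].Value : string.Empty;
    }

    static void Main()
    {
        Console.WriteLine(ExtractNumber("F1"));   // 1
        Console.WriteLine(ExtractNumber("F12"));  // 12
    }
}
```

Note that m.Value would still be the whole string ("F12"); only the group holds the number.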

Related

Advanced Regex - Capture Whole Group of Complex Statement inside Replace

I'm working on a project where I need to parse related data. The tools I work with are fully command-based and return all kinds of output, so a regex comes in handy instead of guessing which line is which. I need to parse lines like this:
1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S
Depending on the conditions, the line may appear in many shapes, but hopefully this will work:
.*((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
There is just one issue: I need to capture only this part:
YR VC MC, and there's no guarantee that there are always three of them. I tried grouping with parentheses, as well as naming the group, as you can see. I don't know how to capture a group in C#, though I think it uses Regex.Replace to replace the whole input with the selected group (here, the 'Class' group). However, it only matches the last repetition of the inner parentheses, not the whole run: for the line above it returns "MC", not all three. I also tried replacing (\w{2}\s+)+) with (\w{2}\s+|\w{2}\s+\w{2}\s+|\w{2}\s+\w{2}\s+\w{2}\s+), but that didn't work either.
Can anyone help me with this?
Thank you.
Capture Groups
Let's back up a bit. First, we need to understand what capture groups are. Everything put within parentheses will be a capturing group. So, for instance, the regex (\d)(\d) with the string 89 will capture 8 in the first group and 9 in the second group. Let's say you make the second digit optional, so (\d)(\d?). Now, if you try to match just 8, the first group will be 8, and the second group will just be an empty string. In this way, we can match all groups, even if some are 'missing'.
Non-Capture Groups
Your regular expression seems to have a ton of unnecessary capture groups. If you don't need one, don't use parentheses. For example, for (/)?, you can simply remove the parentheses. What if you want to match the string "123" ten times? You'd probably do something like (123){10}. But hey, that's another unneeded capture group! You can create a non-capture group by using (?:) instead of (). This way, you won't be capturing whatever is within the parentheses, but you'll still be using them for grouping.
Your Regex
Removing all unnecessary capture groups from your regex, we end up with:
.*/?(\w{2}\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*.
Which includes the space within the capture group, so let's bring that out:
.*/?(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*.
At this point, the capture group (\w{2}) only matches the MC in your sample string, so let's do what you did and split it off into three different capture groups. Note that we can't do something like (\w{2}){1,3} (which will match \w{2} one to three times), because this still has only one set of parentheses, so it has only one capture group, whose value is the last repetition. As such, we will need to expand our (\w{2})\s+ to (\w{2})\s+(\w{2})\s+(\w{2})\s+. This regex will correctly capture your three strings.
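As a side note for .NET specifically: although a repeated group's Value only holds the last repetition, each repetition is recorded in the group's Captures collection. The sketch below is one way that could be used when the number of codes is not fixed; the class and method names are invented for illustration, and the pattern is a variant of the ones in this thread, not the asker's exact expression.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CapturesDemo
{
    // Group 1 sits inside a repeated non-capturing group, so it matches several times.
    const string Pattern = @"/?(?:(\w{2})\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}";

    // Returns every repetition of group 1, not just the last one.
    public static string[] ClassCodes(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return new string[0];
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    static void Main()
    {
        string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
        Console.WriteLine(string.Join(",", ClassCodes(sample)));  // YR,VC,MC
    }
}
```

Here m.Groups[1].Value alone would be "MC", which matches the behaviour described in the question.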
Regex in C#
In C#, we have this handy Regex class in System.Text.RegularExpressions. This is how you would use it:
using System.Linq;
using System.Text.RegularExpressions;

string regex = @".*/?(\w{2})\s+(\w{2})\s+(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*";
string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
Match matches = Regex.Match(sample, regex);
// Copy the value of every group (index 0 is the whole match) into an array.
string[] stringGroups = matches.Groups
    .Cast<Group>()
    .Select(el => el.Value)
    .ToArray();
Here, stringGroups will be a string array with all the capture groups. stringGroups[0] will be the entire match (so in this case, 1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S), stringGroups[1] will be the first capture group (YR in this case), stringGroups[2] the second, and stringGroups[3] the third.
PS: I highly recommend Debuggex for testing this type of stuff.
Make it un-greedy:
.*?((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
  ^
Or remove both greedy dots from both ends. You don't need them:
/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}
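To show how the named group in the last pattern could be read back in C#, here is a small sketch. The names ClassGroupDemo and ExtractClasses are hypothetical; the pattern is the one given just above, and the sample line is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class ClassGroupDemo
{
    // The named group "Class" wraps the whole run of two-character codes.
    const string Pattern = @"/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}";

    // Returns the run of codes with surrounding whitespace removed, or "" on no match.
    public static string ExtractClasses(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Success ? m.Groups["Class"].Value.Trim() : string.Empty;
    }

    static void Main()
    {
        string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
        Console.WriteLine(ExtractClasses(sample));  // YR VC MC
    }
}
```

The result can then be split on whitespace if the individual codes are needed.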

Weird Regex behavior in C#

I am trying to extract some alphanumeric expressions out of a longer word in C# using regular expressions. For example, I have the word "FooNo12Bee". I use the following regular expression code, which returns two matches, "No12" and "No", as results:
alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"(No|Num)\d{1,3}");
If I use the following expression, without parentheses and without any alternative for "No", it works the way I expect and returns only "No12":
alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"No\d{1,3}");
What is the difference between these two expressions? Why does using parentheses produce a redundant result for "No"?
Parentheses in a regex create capture groups, meaning that whatever is matched between the parentheses will be captured and stored as a capture group.
If you don't want a capture group but still need a group for the alternation, use a non-capture group instead, by putting ?: after the opening parenthesis:
Match m = Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");
Usually, if you don't want to change the regex for some reason, you can simply retrieve the group 0 from the match to get only the whole match (and thus ignore any capture groups); in your case, using m.Groups[0].Value.
Last, you can improve the efficiency of the regex by a notch using:
Match m = Regex.Match(alfaNumericWord, @"N(?:o|um)\d{1,3}");
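The difference is easy to see by inspecting the Match object. The sketch below (class and method names are made up for illustration) shows that there is a single match in both cases; what changes is the number of entries in Groups.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupCountDemo
{
    // Number of groups on the match, including group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return Regex.Match("FooNo12Bee", pattern).Groups.Count;
    }

    static void Main()
    {
        Match m = Regex.Match("FooNo12Bee", @"(No|Num)\d{1,3}");
        Console.WriteLine(m.Value);            // No12
        Console.WriteLine(m.Groups[1].Value);  // No
        Console.WriteLine(GroupCount(@"(No|Num)\d{1,3}"));    // 2
        Console.WriteLine(GroupCount(@"(?:No|Num)\d{1,3}"));  // 1
    }
}
```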
I can't say what it's called, but it happens because putting parentheses around it creates a new group. It is well explained here:
Besides grouping part of a regular expression together, parentheses
also create a numbered capturing group. It stores the part of the
string matched by the part of the regular expression inside the
parentheses.
The regex Set(Value)? matches Set or SetValue. In the first case, the
first (and only) capturing group remains empty. In the second case,
the first capturing group matches Value.
It is because the parentheses are creating a group. You can remove the group with ?: like so
Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");

Why does Regex.Match include noncapturing groups in the result?

In matching a regular expression, I want to exclude noncapturing groups from the result. I incorrectly assumed that they'd be excluded by default since, well, they're called noncapturing groups.
For some reason, though, Regex.Match behaves as though I hadn't even specified a noncapturing group. Try running this in the Immediate window:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?:\d)\w").Value
I expected the result to be
"a"
but it's actually
"3a"
This question suggested I look at the Groups, but there is only one Group in the result and it too is "3a". It contains one Capture, also "3a".
What's going on here? Is Regex bugged, or is there an option I need to set?
Matching is not the same thing as capturing. (?:\d) simply means match a subpattern containing \d, but don't bother putting it in a capture group. Your entire pattern (?:\d)\w looks for a (?:\d) followed by a \w; it's functionally equivalent to \d\w.
If you're trying to match a \w only when it is preceded by a \d, use a lookbehind assertion instead:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?<=\d)\w").Value
A non-capturing group means that it does not create a group. The text it matches is still included in the resulting string.
If you want to exclude that part, use something like a lookbehind assertion.
@"(?<=\d)\w"
You are misunderstanding the purpose of noncapturing groups.
In general, groups (defined by a pair of parentheses ()) mean two things:
The contained regular expression is grouped, so any quantifiers after the brackets apply to the whole expression rather than just the previous single character.
The substring matching the group is stored as a subcapture in the Groups property.
Sometimes, you do not want the second result for certain groups, which is why noncapturing groups were introduced: They allow you to group a sub-expression without having any matches of it stored in an item in the Groups property.
You have observed that your Groups property contains one item, though - which is true, as by default, the first group is always the capture of the complete expression. cf. in the docs:
If the regular expression engine can find a match, the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern.
You can still use groups to achieve what you want, by placing the string you want to capture into a group:
\d(\w)
(I have left out the noncapturing group again as it does not change anything in your above expression.)
With this modified expression, the Groups property in your match should have 2 items:
The complete match (of \d\w)
Only the part of the above string you seem to be interested in, matched by \w
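The three variants discussed in these answers can be compared side by side. This is only a sketch, with an invented class name and helper methods, run against the "b3a" input from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonCapturingDemo
{
    // Whole match: non-capturing groups still contribute to it.
    public static string WholeMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Value of capture group 1 only.
    public static string FirstGroup(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    static void Main()
    {
        Console.WriteLine(WholeMatch("b3a", @"(?:\d)\w"));   // 3a
        Console.WriteLine(WholeMatch("b3a", @"(?<=\d)\w"));  // a
        Console.WriteLine(FirstGroup("b3a", @"\d(\w)"));     // a
    }
}
```

The lookbehind changes what is matched, while the capturing group leaves the match as "3a" and exposes the "a" separately.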

C# Regex string parsing

I have the expression already written, but whenever I run the code I get the entire string and a whole bunch of null values:
Regex regex = new Regex(@"y=\([0-9]\)\([0-9]\)(\s|)\+(\s+|)[0-9]");
Match match = regex.Match("y=(4)(5)+6");
for (int i = 0; i < match.Length; i++)
{
MessageBox.Show(i+"---"+match.Groups[i].Value);
}
Expected output: 4, 5, 6 (in different MessageBoxes)
Actual output: y=(4)(5)+6
It finds if the entered string is correct, but once it does I can't get the specific values (the 4, 5, and 6). What can I do to possibly get that code? This is probably something very simple, but I've tried looking at the MSDN match.NextMatch article and that doesn't seem to help either.
Thank you!
As it currently is, you don't have any groups specified. (Except for around the spaces.)
You can specify groups using parentheses. The parentheses you are currently using are escaped with backslashes, so they are matched as literal characters. Add an extra set of parentheses inside of those.
Like so:
new Regex(@"y=\(([0-9]+)\)\(([0-9]+)\)\+([0-9]+)");
And with spaces:
new Regex(@"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)");
This will also allow for spaces between the parts to be optional, since * means 0 or more. This is better than (?:\s+|) that was given above, since you don't need a group for the spaces. It is also better since the pipe means 'or'. What you are saying with \s+| is "One or more spaces OR nothing". This is the same as \s*, which would be "Zero or more spaces".
Also, I used [0-9]+, because that means 1 or more digits. This allows numbers with multiple digits, like 10 or 100, to be matched. And another side note, using [0-9] is better than \d since \d refers to more than just the numbers we are used to.
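Putting the pattern with optional spaces to work might look like the sketch below. The class name FormulaDemo and the method Parse are hypothetical. Note that Match.Length is the length of the matched text, so the loop in the question runs past the last group; Groups.Count is the bound to use, starting at index 1 to skip the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

static class FormulaDemo
{
    static readonly Regex Formula =
        new Regex(@"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)");

    // Returns the three numbers from the formula, or an empty array when it does not match.
    public static int[] Parse(string input)
    {
        Match m = Formula.Match(input);
        if (!m.Success) return new int[0];

        var values = new int[m.Groups.Count - 1];
        for (int i = 1; i < m.Groups.Count; i++)  // index 0 is the whole match
        {
            values[i - 1] = int.Parse(m.Groups[i].Value);
        }
        return values;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Parse("y=(4)(5)+6")));          // 4, 5, 6
        Console.WriteLine(string.Join(", ", Parse("y = (10) (20) + 30")));  // 10, 20, 30
    }
}
```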
You need to name your groups so that you can pull them out later. See "How do I access named capturing groups in a .NET Regex?"
Regex regex = new Regex(@"y=\((?<left>[0-9])\)\((?<right>[0-9])\)(\s|)\+(\s+|)(?<offset>[0-9])");
Then you can pull them out like this:
regex.Match("y=(4)(5)+6").Groups["left"];
Use (named) capturing groups. You will also need to use (?:) instead of () for the groups you don't want to capture. Otherwise, they will be in the result groups, too.
Regex regex = new Regex(@"y=\(([0-9])\)\(([0-9])\)(?:\s|)\+(?:\s+|)([0-9])");
Match match = regex.Match("y=(4)(5)+6");
Console.WriteLine("1: " + match.Groups[1] + ", 2: " + match.Groups[2] + ", 3: " + match.Groups[3]);
If the pattern finds a match, the groups of that match are written into the Groups property, which can be accessed via an index (index 0 contains the complete match).
You can also name those groups to have more readable code:
Regex regex = new Regex(@"y=\((?<first>[0-9])\)\((?<second>[0-9])\)(?:\s|)\+(?:\s+|)(?<third>[0-9])");
Now, you can access the capturing groups by using match.Groups["first"] and so on.
C# is outside my area of expertise, but this may work:
@"y=\(([0-9])\)\(([0-9])\)(?:\s|)\+(?:\s+|)([0-9])"
It's basically your original regex, but with capturing groups around the numbers, and with the undesired capturing groups changed into non-capturing groups: (?: ... )
Groups[0] will always give you the string that was matched; the empty values are coming from (\s|).
This will work: y=\((\d)\)\((\d)\)\s*\+\s*(\d)
Groups from index 1 onward correspond to the parentheses you use, but escaped parentheses don't count (because you're telling the engine they're just text to match), so those digits need their own parentheses. It's also not a good idea to use (x|) when something like ? or * would be more suitable, since you're not capturing that bit.
This will probably be even better: y=\((\d+)\)\((\d+)\)\s*\+\s*(\d+), because it supports values with more than one digit.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does the job by splitting the words on every occurrence of a ",". What I want to know is this: say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state-machine implementation. Doesn't it work the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again, and each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the comma and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
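For comparison, the same pattern can be driven through Regex.Matches to collect the values into a list. This is a minimal sketch: the names ValueListDemo and Values are invented, and the Trim call is an addition that drops whitespace left before a comma. Because the value_name prefix is optional in this pattern, a line without it would still yield values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ValueListDemo
{
    const string Pattern = @"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?";

    // Collects group 1 of every successive match on the line.
    public static List<string> Values(string line)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(line, Pattern))
        {
            result.Add(m.Groups[1].Value.Trim());
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join("|", Values("value_name v1,v2,v3,v4")));  // v1|v2|v3|v4
    }
}
```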
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
