Weird Regex behavior in C# - c#

I am trying to extract some alphanumeric expressions from a longer word in C# using regular expressions. For example, I have the word "FooNo12Bee". I use the following regular expression code, which gives me two results, "No12" and "No":
alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"(No|Num)\d{1,3}");
If I use the following expression, without parentheses and without any alternative for "No", it works the way I expect and returns only "No12":
alfaNumericWord = "FooNo12Bee";
Match m = Regex.Match(alfaNumericWord, @"No\d{1,3}");
What is the difference between these two expressions? Why does using parentheses produce a redundant result for "No"?

Parentheses in a regex create a capture group: whatever the pattern between them matches is stored as a separate group in the result, in addition to the whole match.
If you don't want a capture group but still need grouping for the alternation, use a non-capturing group by putting ?: after the opening parenthesis:
Match m = Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");
Alternatively, if you don't want to change the regex for some reason, you can simply retrieve group 0 from the match to get only the whole match (and thus ignore any capture groups); in your case, using m.Groups[0].Value.
Lastly, you may be able to make the regex slightly more efficient by factoring out the common prefix:
Match m = Regex.Match(alfaNumericWord, @"N(?:o|um)\d{1,3}");
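To make the difference visible, here is a minimal sketch that lists every group of the first match for the capturing and the non-capturing pattern. The class and method names (GroupDemo, GroupValues) are made up for illustration; only Regex.Match and Match.Groups come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the value of every group in the first match.
    // Index 0 is always the whole match; capture groups start at index 1.
    public static string[] GroupValues(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        var values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        const string word = "FooNo12Bee";

        // Capturing group: whole match plus the text matched by (No|Num).
        Console.WriteLine(string.Join(", ", GroupValues(word, @"(No|Num)\d{1,3}")));

        // Non-capturing group: only the whole match is reported.
        Console.WriteLine(string.Join(", ", GroupValues(word, @"(?:No|Num)\d{1,3}")));
    }
}
```

The first line printed should contain two entries (the whole match and the captured "No"), the second only one.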

I don't know the official name for it, but it happens because putting parentheses around the alternation creates a new group. It is well explained here:
Besides grouping part of a regular expression together, parentheses
also create a numbered capturing group. It stores the part of the
string matched by the part of the regular expression inside the
parentheses.
The regex Set(Value)? matches Set or SetValue. In the first case, the
first (and only) capturing group remains empty. In the second case,
the first capturing group matches Value.

It is because the parentheses create a group. You can make the group non-capturing with ?: like so:
Regex.Match(alfaNumericWord, @"(?:No|Num)\d{1,3}");

Related

Regex to extract string between quotes

I'm trying to extract a string between two quotes, and I thought I had my regex working, but it's giving me two strings in my GroupCollection, and I can't get it to ignore the first one, which includes the first quote and ID=
The string that I want to parse is
Test ID="12345" hello
I want to return 12345 in a group, so that I can manipulate it in code later. I've tried the following regex: http://regexr.com/3bgtl, with this code:
nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
The problem is that the GroupCollection contains two entries:
ID="12345
12345
I just want it to return the second one.
Use positive lookbehind operator:
GroupCollection ids = Regex.Match(nodeValue, "(?<=ID=\")[^\"]*").Groups;
You also used a capturing group (the parentheses), which is why you get 2 results.
There are a few ways to accomplish this. I like named capture groups for readability.
Regex with named capture group:
"(?<capture>.*?)"
And your code would be:
match.Groups["capture"].Value
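As a rough sketch of the named-group approach applied to the string from the question: the group name "id" and the helper TryExtractId are illustrative choices, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Tries to pull the quoted value that follows ID= out of the input.
    // The capture group is named "id" so it can be read by name
    // instead of by position.
    public static bool TryExtractId(string input, out string id)
    {
        Match m = Regex.Match(input, "ID=\"(?<id>[^\"]*)\"");
        id = m.Success ? m.Groups["id"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string nodeValue = "Test ID=\"12345\" hello";
        if (TryExtractId(nodeValue, out string id))
        {
            Console.WriteLine(id);
        }
    }
}
```

Reading the group by name keeps the calling code stable if more groups are later added to the pattern.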
Your code is totally OK and is the most efficient of all the solutions suggested here. Capturing groups are a quick and cheap way to extract substrings from larger texts.
All you need to do with your regex is just access the captured group 1 that is defined by the round brackets. Like this:
var nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
Console.WriteLine(ids[1].Value);
// or just on one line
// Console.WriteLine(Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups[1].Value);
See IDEONE demo
Please have a look at Grouping Constructs in Regular Expressions:
Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string. You can use grouping constructs to do the following:
Match a subexpression that is repeated in the input string.
Apply a quantifier to a subexpression that has multiple regular expression language elements. For more information about quantifiers, see Quantifiers in Regular Expressions.
Include a subexpression in the string that is returned by the Regex.Replace and Match.Result methods.
Retrieve individual subexpressions from the Match.Groups property and process them separately from the matched text as a whole.
Note that if you do not need overlapping matches, the capturing group mechanism is the best solution here.

What is a regular expression for inserting a string into a Find and Replace match?

I need a regular expression to replace all instances of:
Session["ANYWORD"] ==
with
Session["ANYWORD"].ToString() ==
I have Session\["\w+"]\s==, which correctly finds the right matches, but I don't know how to insert .ToString() into the match.
Is there a regular expression that does what I need, and if so, what is it?
You will need to put the value that is between the square brackets into a capture group, and substitute that in your replacement.
In short, this will do it:
Regex.Replace(input, @"Session\[(""\w+"")]\s==", @"Session[$1].ToString() ==");
where $1 will insert the contents of your first capture group (determined by the parentheses in the pattern -> ()).
You can also use named groups if you like, then it becomes:
Regex.Replace(input, @"Session\[(?<anyword>""\w+"")]\s==", @"Session[${anyword}].ToString() ==");
Here is the MSDN doc for that particular overload of Regex.Replace.
For more information about capture group substitution in .NET, look here.
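A small self-contained sketch of the substitution above, run against a sample line of source text. The helper name AddToString and the sample input are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Inserts .ToString() between Session["..."] and the == that follows it.
    // Group 1 holds the quoted key, and $1 puts it back into the replacement.
    public static string AddToString(string source)
    {
        return Regex.Replace(
            source,
            @"Session\[(""\w+"")\]\s==",
            "Session[$1].ToString() ==");
    }

    public static void Main()
    {
        string before = "if (Session[\"UserId\"] == \"42\") { }";
        Console.WriteLine(AddToString(before));
    }
}
```

As in the original pattern, \s requires exactly one whitespace character before ==; use \s* or \s+ if the spacing in your code base varies.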

Simple regex doesn't work

I want to match the strings "F1" to "F12". I only need the number. I'm out of training - my first try:
var r = new Regex(@"^(?:[F])[\d]{1,2}$");
It matches, but returns "F1", whereas I expect to get "1".
What have I done wrong?
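Maybe you want to use lookbehind. Note that the ^ anchor has to go inside the lookbehind; with ^ in front of it, the pattern would require an "F" before the start of the string and could never match:
var r = new Regex(@"(?<=^F)\d\d?$");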
Even though you are using a non-capturing group for the "F", the overall match for your Regex will return the entire string it matched. Groups are used to outline sub-expressions within your regular expression whose values you want to be able to extract. Non-capturing groups are used if you want to specify a sub-expression without having it stored in a group. They allow you to apply quantifiers to your sub-expression, but do not allow you to extract its value after running the regex against a string. They are typically used for performance gains, since capturing groups add extra overhead.
If you want to get just the number, you need to put the number portion in a capturing group and look at the Groups property of the resulting Match (assuming you are calling the r.Match function).
The updated Regex would be:
var r = new Regex(@"^(?:[F])([\d]{1,2})$");
Since our number is inside the first set of parentheses associated with a capturing group, it will be group 1. You could also name your group to avoid confusion or possible errors if the regex gets updated at a later date.
Alternately, you can just use look-behind as M42 has suggested.
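Both approaches can be sketched side by side as follows. The method names are hypothetical, and the patterns are simplified equivalents of the ones discussed above ([F] written as F, [\d] as \d, with the anchor placed inside the lookbehind).

```csharp
using System;
using System.Text.RegularExpressions;

public static class FunctionKeyDemo
{
    // Capturing-group variant: the digits are read from group 1.
    public static string NumberViaGroup(string key)
    {
        Match m = Regex.Match(key, @"^F(\d{1,2})$");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Lookbehind variant: the "F" is checked but not consumed,
    // so the whole match is just the digits.
    public static string NumberViaLookbehind(string key)
    {
        Match m = Regex.Match(key, @"(?<=^F)\d{1,2}$");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(NumberViaGroup("F12"));
        Console.WriteLine(NumberViaLookbehind("F12"));
    }
}
```

Both helpers return an empty string when the input is not of the form F followed by one or two digits.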

Why does Regex.Match include noncapturing groups in the result?

In matching a regular expression, I want to exclude noncapturing groups from the result. I incorrectly assumed that they'd be excluded by default since, well, they're called noncapturing groups.
For some reason, though, Regex.Match behaves as though I hadn't even specified a noncapturing group. Try running this in the Immediate window:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?:\d)\w").Value
I expected the result to be
"a"
but it's actually
"3a"
This question suggested I look at the Groups, but there is only one Group in the result and it too is "3a". It contains one Capture, also "3a".
What's going on here? Is Regex bugged, or is there an option I need to set?
Matching is not the same thing as capturing. (?:\d) simply means match a subpattern containing \d, but don't bother putting it in a capture group. Your entire pattern (?:\d)\w looks for a (?:\d) followed by a \w; it's functionally equivalent to \d\w.
If you're trying to match a \w only when it is preceded by a \d, use a lookbehind assertion instead:
System.Text.RegularExpressions.Regex.Match("b3a", @"(?<=\d)\w").Value
A non-capturing group means that no capture group is created; the text it matches is still included in the overall match.
If you want to exclude that part, use something like a lookbehind assertion.
@"(?<=\d)\w"
You are misunderstanding the purpose of noncapturing groups.
In general, groups (defined by a pair of parentheses ()) mean two things:
The contained regular expression is grouped, so any quantifiers after the brackets apply to the whole expression rather than just the previous single character.
The substring matching the group is stored as a subcapture in the Groups property.
Sometimes, you do not want the second result for certain groups, which is why noncapturing groups were introduced: They allow you to group a sub-expression without having any matches of it stored in an item in the Groups property.
You have observed that your Groups property contains one item, though - which is true, as by default, the first group is always the capture of the complete expression. cf. in the docs:
If the regular expression engine can find a match, the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern.
You can still use groups to achieve what you want, by placing the string you want to capture into a group:
\d(\w)
(I have left out the noncapturing group again as it does not change anything in your above expression.)
With this modified expression, the Groups property in your match should have 2 items:
The complete match (of \d\w)
Only the part of the above string you seem to be interested in, matched by \w
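The group counts described above can be checked with a short sketch like the following. The helper names GroupCount and GroupValue are invented for the example; the input "b3a" is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVersusCapture
{
    // Number of entries in Match.Groups for the first match
    // (group 0, the whole match, is always counted).
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    // Value of one group of the first match.
    public static string GroupValue(string input, string pattern, int index)
    {
        return Regex.Match(input, pattern).Groups[index].Value;
    }

    public static void Main()
    {
        // Non-capturing group: one entry, the whole match "3a".
        Console.WriteLine(GroupCount("b3a", @"(?:\d)\w"));
        Console.WriteLine(GroupValue("b3a", @"(?:\d)\w", 0));

        // Capturing group around \w: two entries, "3a" and "a".
        Console.WriteLine(GroupCount("b3a", @"\d(\w)"));
        Console.WriteLine(GroupValue("b3a", @"\d(\w)", 1));
    }
}
```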

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
This does the job by splitting the words on every occurrence of a ",". What I want to know is this: say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    // Group 0 is the whole match, so start at group 1.
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas, and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
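In .NET specifically, a repeated capture group keeps every value it matched in Group.Captures, so the single-pass pattern suggested above can be read out as in this sketch. The helper name ParseValues is an assumption for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureListDemo
{
    // Returns every value matched by the repeated group 1.
    // Group.Value would only give the last repetition;
    // Group.Captures keeps all of them in order.
    public static List<string> ParseValues(string line)
    {
        var values = new List<string>();
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        if (m.Success)
        {
            foreach (Capture c in m.Groups[1].Captures)
            {
                values.Add(c.Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        foreach (string v in ParseValues("value_name v1,v2,v3,v4"))
        {
            Console.WriteLine(v);
        }
    }
}
```

Whitespace after a comma would end up inside the captured value with this pattern, so trim the results or extend the pattern if the input may contain "v1, v2".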
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
