Regex match one digit or two - c#

If this
(°[0-5])
matches °4
and this
((°[0-5][0-9]))
matches °44
Why does this
((°[0-5])|(°[0-5][0-9]))
match °4 but not °44?

Because when you use alternation (logical OR) in a regex, the engine accepts the first alternative that matches. Here the first alternative, °[0-5], matches °4 in °44, so the engine returns °4 and never tries the second alternative, °[0-5][0-9]:
((°[0-5])|(°[0-5][0-9]))
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
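A minimal C# sketch of that left-to-right behaviour, using the pattern from the question. The class and method names are made up for illustration, and swapping the two alternatives is only one way to let the two-digit branch win.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Returns the text of the first match of pattern in input ("" if there is none).
    public static string FirstMatch(string pattern, string input) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Shorter alternative listed first: it succeeds on "°4" and the engine stops there.
        Console.WriteLine(FirstMatch(@"((°[0-5])|(°[0-5][0-9]))", "°44")); // °4

        // Longer alternative listed first: it is tried first and consumes both digits.
        Console.WriteLine(FirstMatch(@"((°[0-5][0-9])|(°[0-5]))", "°44")); // °44

        // The reordered pattern still handles a single digit through its second branch.
        Console.WriteLine(FirstMatch(@"((°[0-5][0-9])|(°[0-5]))", "°4"));  // °4
    }
}
```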

Your alternation lists the shorter alternative first. Use this regex instead to match both strings:
°[0-5][0-9]?

Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression
(a|ab|abc)
when fed this input:
abcdefghi
will only ever match a. However, if the regular expression is changed to
(a|ab|abc)d
It will first match a. Since the next character is not d, the engine backtracks and tries the next alternative, matching ab. The next character is still not d, so it backtracks again and matches abc. Now the next character is d, so the match succeeds.
Why would you not reduce your regular expression from
((°[0-5])|(°[0-5][0-9]))
to this?
°[0-5][0-9]?
It's simpler and easier to understand.
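As a rough check of the points above, here is a small C# sketch. The class and method names are hypothetical; the patterns and inputs are the ones already quoted in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Returns the text of the first match of pattern in input ("" if there is none).
    public static string FirstMatch(string pattern, string input) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Nothing follows the group, so the first alternative wins immediately.
        Console.WriteLine(FirstMatch("(a|ab|abc)", "abcdefghi"));  // a

        // The trailing d forces backtracking through ab to abc.
        Console.WriteLine(FirstMatch("(a|ab|abc)d", "abcdefghi")); // abcd

        // The simplified pattern covers both the one-digit and two-digit case.
        Console.WriteLine(FirstMatch("°[0-5][0-9]?", "°44"));      // °44
        Console.WriteLine(FirstMatch("°[0-5][0-9]?", "°4"));       // °4
    }
}
```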

Related

How to match regular expression starting exactly at a given index?

With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
regular expression ab
input string: ababab
Now, I can search for matches for the regular expression (named expr in the following) in the input string, for instance, starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this.
As my string in which testing occurs might be ... "long" (think possibly 4 digit numbers of characters), I would, however, prefer to avoid the overhead that would presumably occur in all three cases one way or another:
1. Work-around: I could check the resulting match to see whether its Index property matches the supplied index.
   Drawback: Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached).
2. Work-around: I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index.
   Drawback: As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.)
3. Work-around: I could prepend "^.{#}" (with # being replaced with the character index to test) for each expression and match from the beginning, then fish out the actually interesting match with a capturing group.
   Drawback: I need to test the same regex on multiple possible start positions throughout my input string. As the number of skipped characters changes each time, that would mean compiling a new regex every time, rather than re-using the one that I have, which again feels somewhat unclean.
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful, as in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea about the expected length of a match in advance.
It appears you can use the \G anchor: the pattern \Gab will match when matching starts at index 2 and fail when it starts at index 1. See this C# demo:
Regex expr = new Regex(@"\Gab");
Console.WriteLine(expr.Match("ababab", 1)?.Success); // => False
Regex expr2 = new Regex(@"\Gab");
Console.WriteLine(expr2.Match("ababab", 2)?.Success); // => True
As per the documentation, \G operator matches like this:
The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started.
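Since the question mentions testing one expression at several start positions, the same idea can be sketched as a reusable helper. AnchorAtStart and MatchesAt are hypothetical names, and wrapping the pattern in \G(?:...) assumes the pattern is valid inside a non-capturing group (for example, no unterminated comment under IgnorePatternWhitespace).

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredMatchDemo
{
    // Builds a regex that can only match at the position where matching starts.
    public static Regex AnchorAtStart(string pattern) =>
        new Regex(@"\G(?:" + pattern + ")");

    // True only if a match begins exactly at index; no substring copy is made.
    public static bool MatchesAt(Regex anchored, string input, int index) =>
        anchored.Match(input, index).Success;

    public static void Main()
    {
        // One Regex instance is built once and reused for every start position.
        Regex expr = AnchorAtStart("ab");
        for (int i = 0; i < 6; i++)
        {
            // Prints True for 0, 2 and 4, and False for 1, 3 and 5.
            Console.WriteLine($"{i}: {MatchesAt(expr, "ababab", i)}");
        }
    }
}
```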

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible combinations of strings I get are:
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what I need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now if I want to try a new matching condition like this -
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?
You can modify your expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
The outer capturing group in conjunction with . (any single character) is trying to match a third character and then repeat the sequence twice because of the range quantifier {2} ...
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")
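A short C# sketch comparing the original pattern with the two expressions above. The class, the method names, and the extra test string "XYKPQL" are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixCheckDemo
{
    // True if the input starts with XY, AB or PQ.
    public static bool HasPrefix(string input) =>
        Regex.IsMatch(input, "^(?:XY|AB|PQ)");

    // True if the input starts with XY, AB or PQ followed by ZF.
    public static bool HasPrefixThenZF(string input) =>
        Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF");

    // The pattern from the question: (prefix + any character), repeated twice.
    public static bool OriginalPattern(string input) =>
        Regex.IsMatch(input, "^((XY|AB|PQ).){2}");

    public static void Main()
    {
        Console.WriteLine(HasPrefix("XYZF44DT508755"));        // True
        Console.WriteLine(HasPrefixThenZF("PQZF44DT508755"));  // True
        Console.WriteLine(HasPrefixThenZF("PQXF44DT508755"));  // False
        Console.WriteLine(OriginalPattern("XYZF44DT508755"));  // False
        Console.WriteLine(OriginalPattern("XYKPQL"));          // True
    }
}
```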
You want to test for just ^(XY|AB|PQ). Your RegEx means: search for either XY, AB or PQ, then any single character, and repeat that whole sequence twice. For example, "XYKPQL" would match your RegEx.
In ^(XY|AB|PQ):
^ forces the start of line,
(...) creates a capturing group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.
You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes; one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)

Regex alternation construct eats part of previous group

I am trying to fashion a Regex to capture a function-style argument list, which should be straightforward enough, but I'm encountering a behaviour I don't understand.
In the snippet below the first example behaves as you would expect, capturing the function name into the first group and the argument list into the second group.
In the second example I want to replace the 'zero or more' quantifier that captures the argument list with the 'one or more' quantifier so that the second group will fail if there are no arguments. I'm expecting the regex to capture just the function name, but for some reason the regex is eating the '1' off the end of the function name, and I cannot for the life of me see why it would be doing that. Can anyone see what's going wrong with this please?
// {func1} {blah, blah, blah}
Match m13 = Regex.Match("func1(blah, blah, blah)", @"(\w+) (?([(]) [(]([^)]*) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
// {func}
Match m14 = Regex.Match("func1()", @"(\w+) (?([(]) [(]([^)]+) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Your expression can be adjusted to:
(\w+) (?([(]) [(]([^)]*) )
(that is, use * rather than + after [^)])
The reason the expression returns an unexpected result relates to backtracking. The regex engine effectively takes the following steps:
1. (\w+) matches func1.
2. func1 is immediately followed by a (, matching the zero-width expression in the conditional matching construct.
3. The conditional construct requires a ( literal followed by one or more characters that are not ). This condition fails for the input func1(), since there are zero characters between ( and ).
4. The engine backtracks to step 1 and removes a character, so that (\w+) now matches func instead of func1.
5. func is immediately followed by a 1, which does not satisfy the zero-width expression in the conditional matching construct.
6. Since the conditional matching construct does not match and there is no alternate expression, the regex completes successfully with func in the first captured group, and no match in the second captured group.
The issue emerges in Step 3, where the expression fails to allow () as a legal argument list. Adjusting the expression to allow zero characters between the opening and closing parentheses (as illustrated above) allows this sequence. An expression such as ^(\w+)(?:\((.*)\))?$ may also address the underlying issue without the need for a conditional construct.
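The behaviour described above, and the alternative without a conditional construct, can be sketched in C# as follows. The class and method names are hypothetical; the patterns are the ones quoted in the question and in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalBacktrackDemo
{
    private const RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase;

    // The question's second pattern: conditional construct with [^)]+.
    public static Match WithPlus(string input) =>
        Regex.Match(input, @"(\w+) (?([(]) [(]([^)]+) )", Options);

    // The alternative from this answer: an optional parenthesised argument list.
    public static Match OptionalArgs(string input) =>
        Regex.Match(input, @"^(\w+)(?:\((.*)\))?$");

    public static void Main()
    {
        Match plus = WithPlus("func1()");
        Console.WriteLine(plus.Groups[1].Value);    // func (the 1 is given up by backtracking)
        Console.WriteLine(plus.Groups[2].Success);  // False

        Match empty = OptionalArgs("func1()");
        Console.WriteLine(empty.Groups[1].Value);   // func1
        Console.WriteLine(empty.Groups[2].Success); // True, with an empty value

        Match args = OptionalArgs("func1(blah, blah, blah)");
        Console.WriteLine(args.Groups[2].Value);    // blah, blah, blah
    }
}
```

Note that with the alternative pattern an empty argument list still makes the second group succeed with an empty value, so a caller who wants "no arguments" treated as a failure would check for an empty Groups[2].Value.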

Regular expression match text between tag

I need help with a regular expression, as I do not have good knowledge of them.
I have regular expression as:
Regex myregex = new Regex("testValue=\"(.+?)\"");
What does (.+?) indicate?
The string it matches is "testValue=123e4567" and returns 123e4567 as output.
Now I need help in regular expression to match a string "<helpMe>123e4567</helpMe>" where I need 123e4567 as output. How do I write a regular expression for it?
This means:
( Begin captured group
. Match any character
+ One or more times
? Non-greedy quantifier
) End captured group
In the case of your regex, the non-greedy quantifier ? means that your captured group will begin after the first double-quote, and then end immediately before the very next double-quote it encounters. If it were greedy (without the ?), the group would extend to the very last double-quote it encounters on that line (i.e., "greedily" consuming as much of the line as possible).
For your "helpMe" example, you'd want this regex:
<helpMe>(.+?)</helpMe>
Given this string:
<div>Something<helpMe>ABCDE</helpMe></div>
You'd get this match:
ABCDE
The value of the non-greedy quantifier is evident in this variation:
Regex: <helpMe>(.+)</helpMe>
String: <div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>
The greedy capture would look like this:
ABCDE</helpMe><helpMe>FGHIJ
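A minimal C# sketch of the lazy and greedy variants side by side. The class and helper names are made up for illustration; the patterns and the input string are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Returns the first capture group of the first match, or null if nothing matches.
    public static string FirstGroup(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string input = "<div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>";

        // Lazy: stops at the first closing tag.
        Console.WriteLine(FirstGroup("<helpMe>(.+?)</helpMe>", input)); // ABCDE

        // Greedy: runs on to the last closing tag on the line.
        Console.WriteLine(FirstGroup("<helpMe>(.+)</helpMe>", input));  // ABCDE</helpMe><helpMe>FGHIJ
    }
}
```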
There are some useful interactive tools to play with these variations:
Regex Tester
Regex Pal
Ken Redler has a great answer regarding your first question. For the second question try:
<(helpMe)>(.*?)</\1>
Using the back reference \1 you can find values between the set of matching tags. The first group finds the tag name, the second group matches the content itself, and the \1 back reference re-uses the first group's match (in this case the tag name).
Also, in C# you can use named groups, like: <(helpMe)>(?<value>.*?)</\1> where now match.Groups["value"].Value contains your value.
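For illustration, the named-group variant might be used like this. The class and method names are hypothetical. In .NET, unnamed groups are numbered before named ones, so \1 still refers to (helpMe) here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the text between <helpMe> and </helpMe>, or null if the tags are absent.
    public static string ExtractValue(string input)
    {
        Match match = Regex.Match(input, @"<(helpMe)>(?<value>.*?)</\1>");
        return match.Success ? match.Groups["value"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractValue("<helpMe>123e4567</helpMe>")); // 123e4567
    }
}
```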
What does (.+?) indicate?
It means match any character (.) one or more times, but as few times as possible (+?).
A simple regex to match your second string would be
<helpMe>([a-z0-9]+)<\/helpMe>
This will match any lowercase letter a-z and any digit between <helpMe> and </helpMe>.
The parentheses are used to capture a group. This is useful if you need to reference the value inside this group later.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is, say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the , and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times. Remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence. So you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas, and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
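Since the question is about C#, one possible way to use that last pattern in a single pass is .NET's Group.Captures collection, which keeps every value a repeated group captured rather than only the last one. This is a sketch under that assumption, not something the answer itself proposes; the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureListDemo
{
    // Returns every value that group 1 captured while the (?:...)* loop repeated.
    public static List<string> Values(string line)
    {
        var result = new List<string>();
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        if (m.Success)
        {
            foreach (Capture c in m.Groups[1].Captures)
            {
                result.Add(c.Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Prints v1, v2, v3 and v4 on separate lines.
        foreach (string value in Values("value_name v1,v2,v3,v4"))
        {
            Console.WriteLine(value);
        }
    }
}
```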
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
