Failing to recognize both keywords - c#

Let me know if I'm asking the question in a wrong way. Not sure if I'm approaching it from the right angle.
My regex looks like this.
^.+(ef?)|(mn?).+$
I'm trying to match lines 2 and 4 in the text below.
abcd
efgh
ijkl
mnop
qrst
As it seems, only the last one catches the editor's eye. What am I missing?
I've tried to follow some examples for detecting e.g. "ALPHA" and "BETA" words but, apparently, I don't understand how it works.

The regex engine splits the regex below into two parts at the top-level |.
^.+(ef?)|(mn?).+$
Part 1 | Part 2
First, part 1 is tried.
^.+(ef?)
.+ requires at least one character before the e, but in the second line the e is the first character, so part 1 fails to match it. It fails on all the other lines too, because none of them contains an e.
| OR
Now the regex engine moves to the second part,
(mn?).+$
This matches a string that contains the letter m, and m is present only in the fourth string. So it matches the m (and an optional n) plus the one or more characters that follow, because of .+.
The correct approach to match the 2nd and 4th strings is:
^.*(ef?).*$|^.*(mn?).*$
OR
^.*(?:(ef?)|(mn?)).*$
Use ^.*(?:(ef?)|(mn?)).+$ if at least one character must follow the e (and optional f) or the m (and optional n).
If you want to match strings that start with e or m, use the regex below.
^(ef?|mn?).+$
Note:
.* matches any character zero or more times.
.+ matches any character one or more times.
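To see the difference on the sample text from the question, here is a minimal C# sketch. It assumes the five lines are held in one string separated by \n and that RegexOptions.Multiline is used so ^ and $ work per line; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineMatchDemo
{
    // Returns the text of every match of `pattern` in `text`,
    // with ^ and $ anchored at line boundaries (Multiline).
    public static string[] MatchingLines(string pattern, string text) =>
        Regex.Matches(text, pattern, RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        string text = "abcd\nefgh\nijkl\nmnop\nqrst";

        // Original pattern: only the second alternative ever succeeds.
        Console.WriteLine(string.Join(",", MatchingLines(@"^.+(ef?)|(mn?).+$", text)));
        // prints: mnop

        // Corrected pattern: both alternatives sit inside one anchored group.
        Console.WriteLine(string.Join(",", MatchingLines(@"^.*(?:(ef?)|(mn?)).*$", text)));
        // prints: efgh,mnop
    }
}
```

If the text uses \r\n line endings, the . will also consume the \r, so the returned values may carry a trailing carriage return.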

Related

Is there a regular expression for matching a string that has no more than 2 repeating characters? [duplicate]

I want to match strings that do not contain more than 3 of the same character repeated in a row. So:
abaaaa [no match]
abawdasd [match]
abbbbasda [no match]
bbabbabba [match]
Yes, it would be much easier and neater to do a regex match for containing the consecutive characters, and then negate that in the code afterwards. However, in this case that is not possible.
I would like to open out the question to x consecutive characters so that it can be extended to the general case to make the question and answer more useful.
Negative lookahead is supported in this case.
Use a negative lookahead with back references:
^(?:(.)(?!\1\1))*$
(.) captures each character in group 1 and the negative look ahead asserts that the next 2 chars are not repeats of the captured character.
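As a quick check, this sketch runs that pattern over the four strings from the question. Note that it rejects any run of three or more identical characters. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatDemo
{
    // True when no character is immediately followed by two more copies of itself.
    public static bool NoTripleRun(string s) =>
        Regex.IsMatch(s, @"^(?:(.)(?!\1\1))*$");

    public static void Main()
    {
        foreach (string s in new[] { "abaaaa", "abawdasd", "abbbbasda", "bbabbabba" })
        {
            string verdict = NoTripleRun(s) ? "match" : "no match";
            Console.WriteLine(s + ": " + verdict);
        }
    }
}
```

Because . does not match a newline by default, a string containing \n would be rejected regardless of its runs; RegexOptions.Singleline is one way around that if it matters.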
To match strings not containing a character repeated more than 3 times consecutively:
^((.)\2?(?!\2\2))+$
How it works:
^ Start of string
(
(.) Match any character (not a new line) and store it for back reference.
\2? Optionally match one more exact copy of that character.
(?! Make sure the upcoming character(s) is/are not the same character.
\2\2 Repeat '\2' for as many times as you need
)
)+ Do ad nauseam
$ End of string
So, the number of \2 in your whole expression will be the number of times you allow a character to be repeated consecutively; any more and you won't get a match.
E.g.
^((.)\2?(?!\2\2\2))+$ will match all strings that don't repeat a character more than 4 times in a row.
^((.)\2?(?!\2\2\2\2))+$ will match all strings that don't repeat a character more than 5 times in a row.
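For the general "x consecutive characters" case, one option is to build the pattern from a number instead of typing out the back references. The sketch below swaps in a counted back reference, \1{n}, in place of the repeated \2\2... above; RunLimit and Build are made-up names, and maxRun is the longest run you want to allow.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RunLimit
{
    // Builds a regex that matches only strings in which no character
    // occurs more than maxRun times in a row.
    public static Regex Build(int maxRun)
    {
        if (maxRun < 1) throw new ArgumentOutOfRangeException(nameof(maxRun));
        // A character followed by maxRun more copies would form a run of maxRun + 1.
        return new Regex(@"^(?:(.)(?!\1{" + maxRun + "}))*$");
    }

    public static void Main()
    {
        Regex atMostThree = Build(3);
        Console.WriteLine(atMostThree.IsMatch("aaab"));   // run of 3 is allowed
        Console.WriteLine(atMostThree.IsMatch("abaaaa")); // run of 4 is rejected
    }
}
```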
Please be aware this solution uses negative lookahead, and not all regex flavors support it.
I'm answering this question :
Is there a regular expression for matching a string that has no more than 2 repeating characters?
which was marked as an exact duplicate of this question.
It's much quicker to negate the match instead:
if (!Regex.Match("hello world", @"(.)\1{2}").Success) Console.WriteLine("No dups");

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible combinations of strings I get are:
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what I need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now if I want to try a new matching condition like this -
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?
You can modify your expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
The outer capturing group in conjunction with . (any single character) is trying to match a third character and then match the whole sequence twice because of the quantifier {2} ...
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")
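Put together, the two checks might look like the sketch below. The helper names are invented, and the last two inputs are made-up variations on the strings from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDemo
{
    // Starts with XY, AB or PQ.
    public static bool HasPrefix(string input) =>
        Regex.IsMatch(input, "^(?:XY|AB|PQ)");

    // Starts with XY, AB or PQ, then Z as 3rd and F as 4th character.
    public static bool HasPrefixThenZF(string input) =>
        Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF");

    public static void Main()
    {
        foreach (string s in new[] { "XYZF44DT508755", "ABQF44DT508755", "ZZZF44DT508755" })
        {
            Console.WriteLine(s + ": " + HasPrefix(s) + " / " + HasPrefixThenZF(s));
        }
    }
}
```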
You want to test for just ^(XY|AB|PQ). Your RegEx means: Search for either XY, AB or PQ, then a random character, and repeat the whole sequence twice, for example "XYKPQL" would match your RegEx.
^ forces the start of line,
(...) creates a matching group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.
You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
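A small sketch makes that behavior visible: the original pattern accepts strings shaped like prefix, any character, prefix, any character, and rejects the real input. The class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // The pattern from the question, unchanged.
    public static bool OriginalMatches(string input) =>
        Regex.IsMatch(input, @"^((XY|AB|PQ).){2}");

    public static void Main()
    {
        Console.WriteLine(OriginalMatches("XY_AB_"));         // True
        Console.WriteLine(OriginalMatches("XYKPQL"));         // True
        Console.WriteLine(OriginalMatches("XYZF44DT508755")); // False: "F4" is not XY, AB or PQ
    }
}
```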
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)
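If you also want the matched word and its prefix back, the two capturing groups can be read from the Match. This is only a sketch: Parse is a made-up helper, and the trailing " shipped" in the sample input is invented to show where \w* stops.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCaptureDemo
{
    // Group 1 is the whole word, group 2 the two-letter prefix.
    // Returns empty strings when the input does not match.
    public static (string Word, string Prefix) Parse(string input)
    {
        Match m = Regex.Match(input, @"^((XY|AB|PQ)ZF\w*)");
        return m.Success ? (m.Groups[1].Value, m.Groups[2].Value) : ("", "");
    }

    public static void Main()
    {
        var result = Parse("XYZF44DT508755 shipped");
        Console.WriteLine(result.Word);   // XYZF44DT508755
        Console.WriteLine(result.Prefix); // XY
    }
}
```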

Match at least one character and one number, regardless of order, with no suffix?

I need a RegEx that fulfills this statement:
There is at least one character and one digit, regardless of order, and there is no suffix (i.e. domain name) at the end.
So I have this test list:
ra182
jas182
ra1z4
And I have this RegEx:
[a-z]+[0-9]+$
It's matching the first two fully, but it's only matching the z4 in the last one. Though it makes sense to me why it's only matching that piece of the last entry, I need a little help getting this the rest of the way.
You can check the first two conditions with lookaheads:
/^(?=.*[a-z])(?=.*[0-9])/i
... and if the third one is just about the absence of ., it's simple to check too:
/^(?=.*[a-z])(?=.*[0-9])[^.]+$/i
But I'd probably prefer to use three separate tests instead: the first checks for letters (are you sure it's enough to check just for a range - [a-z] - and not for a Unicode Letter property?), the second for digits, and the final one for this pesky dot, like this:
if (Regex.IsMatch(input, "[a-zA-Z]")
    && Regex.IsMatch(input, "[0-9]")
    && !Regex.IsMatch(input, @"\."))
{
    // input IS valid, proceed
}
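For comparison, the single-pattern lookahead version from the start of this answer could be wrapped like the sketch below. The /.../i flag is expressed as RegexOptions.IgnoreCase; IdCheck and IsValid are made-up names, and the inputs beyond the three from the question are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdCheck
{
    // At least one letter, at least one digit, and no dot anywhere.
    private static readonly Regex Pattern =
        new Regex(@"^(?=.*[a-z])(?=.*[0-9])[^.]+$", RegexOptions.IgnoreCase);

    public static bool IsValid(string input) => Pattern.IsMatch(input);

    public static void Main()
    {
        foreach (string s in new[] { "ra182", "jas182", "ra1z4", "9a", "ra182.example", "abc", "182" })
        {
            Console.WriteLine(s + ": " + IsValid(s));
        }
    }
}
```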
The regex in the question will try to match one or more symbols, followed by one or more digits; it obviously will fail for the strings like 9a.
I suggest using
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)(?!.*\.).*");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)(?!.*[.]).*");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)[^.]*$");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])[^.]*\d[^.]*$");
if (match.Success) ...
You need to match alphanumeric strings that have at least one letter and one number? Try something like this:
\w*[a-z]\w*[0-9]\w*
This will make sure you have at least one letter and one number, with the number after the letter. If you want to take into account numbers before letters, just write the corresponding expression (numbers before letters) and join the two with |.

regular expression greedy on left side only (.net)

I am trying to capture matches between two strings.
For example, I am looking for all text that appears between Q and XYZ, using the "soonest" match (not continuing to expand outwards). This string:
circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ
Should return:
Q SOMETEXT XYZ
But instead, it returns:
Q hello there Q SOMETEXT XYZ
Here is the expression I'm using:
Q.*?XYZ
It's going too far back to the left. It's working fine on the right side when I use the question mark after the asterisk. How can I do the same for the left side, and stop once I hit that first left Q, making it work the same as the right side works? I've tried question marks and other symbols from http://msdn.microsoft.com/en-us/library/az24scfc.aspx, but there's something I'm just not figuring out.
I'm a regex novice, so any help on this would be appreciated!
Well, the non Greedy match is working - it gets the shortest string that satisfies the regex. The thing that you have to remember is that regex is a left to right process. So it matches the first Q, then gets the shortest number of characters followed by an XYZ. If you want it not to go past any Qs, you have to use a negated character class:
Q[^Q]*?XYZ
[^Q] matches any one character that is not a Q. Mind that this will only work for a single-character delimiter. If your opening delimiter is multiple characters, you have to do it a different way. Why? Well, take the delimiter 'PQR' and the string
foo PQR bar XYZ
If you try to use the regex from before, but extend the character class to:
PQR[^PQR]*?XYZ
then you'll get
'PQR bar XYZ'
As you expected. But if your string is
foo PQR Party Time! XYZ
You'll get no matches. It's because [] delineates a "character class" - which matches exactly one character. Using these classes, you can match a range of characters, simply by listing them.
th[ae]n
will match both 'than' and 'then', but not 'thin'. Placing a caret ('^') at the beginning negates the class - meaning "match anything but these characters" - so by turning our one-character delimiter into [^PQR], rather than saying "not 'PQR'", you're saying "not 'P', 'Q', or 'R'". You can still use this if you want, but only if you're 100% sure that the characters from your delimiter will only be in your delimiter. If that's the case, it's faster to use greedy matching and only negate the first character of your delimiter. The regex for that would be:
PQR[^P]*XYZ
But, if you can't make that guarantee, then match with:
PQR(?:.(?!PQR))*?XYZ
Regex doesn't directly support negative string matching (because it's impossible to define, when you think about it), so you have to use a negative lookahead.
(?!PQR)
is just such a lookahead. It means "Assert that the next few characters are not this internal regex", without matching any characters, so
.(?!PQR)
matches any character not followed by PQR. Wrap that in a group so that you can lazily repeat it,
(.(?!PQR))*?
and you have a match for "string that doesn't contain my delimiter". The only thing I did was add a ?: to make it a non-capturing group.
(?:.(?!PQR))*?
Depending on the language you use to parse your regex, it may try to pass back every matched group individually (useful for find and replace). This keeps it from doing that.
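Here is a rough C# sketch that runs the character-class version and the lookahead version against the 'Party Time!' string, plus a made-up input containing the opening delimiter twice. The method names are hypothetical. One caveat, stated as an observation rather than a fix: since the lookahead is checked after each character, a second PQR sitting directly after the first is not caught; placing the lookahead first, as in (?:(?!PQR).)*?, is a common variant for that case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // Character-class attempt: refuses any P, Q or R between the delimiters.
    public static string WithCharClass(string input) => Find(input, @"PQR[^PQR]*?XYZ");

    // Lookahead version from the answer: refuses the string PQR between the delimiters.
    public static string WithLookahead(string input) => Find(input, @"PQR(?:.(?!PQR))*?XYZ");

    // Returns the first match, or an empty string when there is none.
    private static string Find(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string party = "foo PQR Party Time! XYZ";
        Console.WriteLine("[" + WithCharClass(party) + "]"); // []
        Console.WriteLine("[" + WithLookahead(party) + "]"); // [PQR Party Time! XYZ]

        // Hypothetical input with two opening delimiters.
        Console.WriteLine("[" + WithLookahead("a PQR b PQR c XYZ d") + "]"); // [PQR c XYZ]
    }
}
```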
Happy regexing!
The concept of greediness only works on the right side.
To make the expression only match from the last Q before XYZ, make it not match Q between them:
Q[^Q]*?XYZ
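Running both patterns on the string from the question shows the difference. This is a minimal sketch with made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeftmostDemo
{
    // Returns the first match of `pattern` in `input` (empty string if none).
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        string input = "circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ";

        Console.WriteLine(FirstMatch(input, @"Q.*?XYZ"));
        // prints: Q hello there Q SOMETEXT XYZ

        Console.WriteLine(FirstMatch(input, @"Q[^Q]*?XYZ"));
        // prints: Q SOMETEXT XYZ
    }
}
```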

C# Regex Phone Number Check

I have the following to check if the phone number is in the following format
(XXX) XXX-XXXX. The below code always return true. Not sure why.
Match match = Regex.Match(input, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");
// Below code always return true
if (match.Success) { ....}
The general complaint about regex patterns for phone numbers is that they require one to put in the truly optional characters as dashes and other items.
Why can't they be optional and have the pattern not care if they are there or not?
The below pattern makes dashes, periods and parenthesis optional for the user and focuses on the numbers as a result using named captures.
The pattern is commented (using # and spanning multiple lines), so use the Regex option IgnorePatternWhitespace unless you remove the comments. That flag does not affect how the input is processed; it only makes the engine ignore unescaped whitespace in the pattern and allows comments via the # character.
string pattern = @"
^ # From Beginning of line
(?:\(?) # Match but don't capture optional (
(?<AreaCode>\d{3}) # 3 digit area code
(?:[\).\s]?) # Optional ) or . or space
(?<Prefix>\d{3}) # Prefix
(?:[-\.\s]?) # optional - or . or space
(?<Suffix>\d{4}) # Suffix
(?!\d) # Fail if eleventh number found";
The above pattern just looks for 10 numbers and ignores any filler characters such as a ( or a dash - or a space or a tab or even a .. Examples are
(555)555-5555 (OK)
5555555555 (ok)
555 555 5555(ok)
555.555.5555 (ok)
55555555556 (not ok - match failure - too many digits)
123.456.789 (failure)
Different Variants of same pattern
Without comments, the pattern no longer needs IgnorePatternWhitespace:
^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)
Pattern when not using Named Captures
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)
Pattern if ExplicitCapture option is used
^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)
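As an illustration of what the named captures buy you, the sketch below reads the three groups and rebuilds the number in (XXX) XXX-XXXX form. It assumes the separator between prefix and suffix is optional, as in the other variants; PhoneParser and TryParse are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneParser
{
    private static readonly Regex Pattern = new Regex(
        @"^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)");

    // On success, `normalized` holds the number as (XXX) XXX-XXXX.
    public static bool TryParse(string input, out string normalized)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            normalized = string.Empty;
            return false;
        }
        string area = m.Groups["AreaCode"].Value;
        string prefix = m.Groups["Prefix"].Value;
        string suffix = m.Groups["Suffix"].Value;
        normalized = "(" + area + ") " + prefix + "-" + suffix;
        return true;
    }

    public static void Main()
    {
        foreach (string s in new[] { "(555)555-5555", "5555555555", "555.555.5555", "55555555556", "123.456.789" })
        {
            bool ok = TryParse(s, out string normalized);
            Console.WriteLine(s + " -> " + (ok ? normalized : "no match"));
        }
    }
}
```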
It doesn't always match, but it will match any string that contains three digits, followed by a hyphen, followed by four more digits. It will also match if there's something that looks like an area code on the front of that. So this is valid according to your regex:
%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%
To validate that the string contains a phone number and nothing else, you need to add anchors at the beginning and end of the regex:
@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$"
Alan Moore did a good job explaining what your expression is actually doing. +1
If you want to match exactly "(XXX) XXX-XXXX" and absolutely nothing else, then what you want is
@"^\(\d{3}\) \d{3}-\d{4}$"
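The effect of the anchors can be checked with a short sketch that compares the pattern from the question, the anchored version, and the strict (XXX) XXX-XXXX pattern. The class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Pattern from the question: matches anywhere inside the input.
    public static bool Loose(string s) =>
        Regex.IsMatch(s, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");

    // Same pattern, anchored so nothing else may surround the number.
    public static bool Anchored(string s) =>
        Regex.IsMatch(s, @"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$");

    // Exactly (XXX) XXX-XXXX.
    public static bool Strict(string s) =>
        Regex.IsMatch(s, @"^\(\d{3}\) \d{3}-\d{4}$");

    public static void Main()
    {
        string noisy = "%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%";
        Console.WriteLine(Loose(noisy));              // True
        Console.WriteLine(Anchored(noisy));           // False
        Console.WriteLine(Anchored("999-123-4567"));  // True
        Console.WriteLine(Strict("999-123-4567"));    // False
        Console.WriteLine(Strict("(999) 123-4567"));  // True
    }
}
```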
Here is the C# code I use. It is designed to get all phone numbers from a page of text. It works for the following patterns: 0123456789, 012-345-6789, (012)-345-6789, (012)3456789, 012 3456789, 012 345 6789, 012 345-6789, (012) 345-6789, 012.345.6789
List<string> phoneList = new List<string>();
Regex rg = new Regex(@"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
MatchCollection m = rg.Matches(html);
foreach (Match g in m)
{
    // Groups[0] is the whole match
    if (g.Groups[0].Value.Length > 0)
        phoneList.Add(g.Groups[0].Value);
}
None of the answers above takes care of international numbers like +33 6 87 17 00 11 (which is a valid phone number in France, for example).
I would do it in a two-step approach:
1. Remove all characters that are not numbers or '+' character
2. Check the + sign is at the beginning or not there. Check length (this can be very hard as it depends on local country number schemes).
Now if your number starts with +1 or you are sure the user is in USA, then you can apply the comments above.
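The two steps could be sketched roughly as follows. The digit bounds are assumptions, not part of the answer: 15 is the E.164 maximum and 7 is an arbitrary lower limit, and real validation depends on the country's numbering plan. IntlPhone, Strip and LooksPlausible are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IntlPhone
{
    // Step 1: remove every character that is not a digit or '+'.
    public static string Strip(string raw) => Regex.Replace(raw, @"[^0-9+]", "");

    // Step 2: '+' may appear only as the first character, and the digit
    // count must fall within the (assumed) bounds.
    public static bool LooksPlausible(string raw, int minDigits = 7, int maxDigits = 15)
    {
        string stripped = Strip(raw);
        string pattern = @"^\+?[0-9]{" + minDigits + "," + maxDigits + "}$";
        return Regex.IsMatch(stripped, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("+33 6 87 17 00 11"));          // +33687170011
        Console.WriteLine(LooksPlausible("+33 6 87 17 00 11")); // True
        Console.WriteLine(LooksPlausible("33+687170011"));      // False: '+' not at the start
        Console.WriteLine(LooksPlausible("12345"));             // False: too few digits
    }
}
```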
