What is the meaning of the Regular Expression ^(.)\1+$ - c#

Since I am new of regular expression I well versed with Regex.
Can someone please help with the meaning of this Regex?
^(.)\1+$

^(.)\1+$ is a clever regular expression that matches to any full line that has two or more letters all of which are identical. E.g.:
aaa
BBBBBBB
See comments for explanation of what each part of the regular expression does.

The short answer: The first character in the subject is repeated one or more times and occupies the entire test subject. Put differently, the entire subject is occupied by two or more of the same character.
Since you're learning:
'^' is the begin of subject "anchor." It does not consume any data, just asserts a position. Similar to the metacharacter sequence \A if you encounter that, although the latter is not affected by line mode (research regex "mode modifiers").
'$' is the end of subject anchor. Again, a non consuming assertion. Similar to \Z metacharacter, although the latter is not affected by line mode. Close cousin of \z (although the latter has no regard for newlines and line modes). Anytime you see a regex framed with ^...$ it's asserting that the match condition is front to back.
"()" are capturing parenthesis. This means you can "refer back" to what was captured using \N where N is 1-9 corresponding to the order of capturing parenthesis, left to right. In your example, we only have one capture group, so it's referred to as \1. In addition to capturing for reference, groups are used for quantification--for repeating a pattern a specified number of times, and for alternation of strings or patterns, e.g.: (^|$), where "|" is a logical "or," this regex would test to see if the subject begins or ends with an underscore.
'.' (dot) is a metacharacter that represents any single character. (Roughly speaking, look up various match "modes" to see how you can control whether or not dot matches line break. Don't confuse "character" with octet or byte. Locales and character set encodings are too grand a subject to elaborate, but just be aware that the definition of a "character" is somewhat contextual.)
'+' is the one-or-more quantifier. \1+ means whatever character was captured by capture group 1 repeats one-or-more times (i.e., occurs two or more times). "aa" would match, where the first 'a' is matched by the dot captured in group \1, and the second 'a' matches because it satisfies the one-or-more quantification.

Related

How to match string by using regular expression which will not allow same special character at same time?

I m trying to matching a string which will not allow same special character at same time
my regular expression is:
[RegularExpression(#"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
this solve my all requirement except the below two issues
this is my string : bracks
acceptable :
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (by the way above regular expression solved this)
not acceptable:
issue 1: b.. or bra..cks, b..racks, bra...cks (two or more any special character together),
issue 2: bra cks (two ore more white space together)
You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
The goal
From what i can tell by your description and pattern, you are trying to match text, which start and end with alphanumeric (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ inyour original pattern), and inside, you just don't want to have any two consecuive (adjacent) special characters, which, again, guessing from the regex, are . & ' -
What was wrong
^+ - i think here you wanted to assure that match starts at the beginning of the line/string, so you don't need + here
[a-zA-Z0-9.&' '-] - in this character class you doubled ' which is totally unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more text matching preceeding pattern
$ - anchor, match the end of the string
Regex demo
Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.

Is there a regular expression for matching a string that has no more than 2 repeating characters? [duplicate]

I want to match strings that do not contain more than 3 of the same character repeated in a row. So:
abaaaa [no match]
abawdasd [match]
abbbbasda [no match]
bbabbabba [match]
Yes, it would be much easier and neater to do a regex match for containing the consecutive characters, and then negate that in the code afterwards. However, in this case that is not possible.
I would like to open out the question to x consecutive characters so that it can be extended to the general case to make the question and answer more useful.
Negative lookahead is supported in this case.
Use a negative lookahead with back references:
^(?:(.)(?!\1\1))*$
See live demo using your examples.
(.) captures each character in group 1 and the negative look ahead asserts that the next 2 chars are not repeats of the captured character.
To match strings not containing a character repeated more than 3 times consecutively:
^((.)\2?(?!\2\2))+$
How it works:
^ Start of string
(
(.) Match any character (not a new line) and store it for back reference.
\2? Optionally match one more exact copies of that character.
(?! Make sure the upcoming character(s) is/are not the same character.
\2\2 Repeat '\2' for as many times as you need
)
)+ Do ad nauseam
$ End of string
So, the number of /2 in your whole expression will be the number of times you allow a character to be repeated consecutively, any more and you won't get a match.
E.g.
^((.)\2?(?!\2\2\2))+$ will match all strings that don't repeat a character more than 4 times in a row.
^((.)\2?(?!\2\2\2\2))+$ will match all strings that don't repeat a character more than 5 times in a row.
Please be aware this solution uses negative lookahead, but not all not all regex flavors support it.
I'm answering this question :
Is there a regular expression for matching a string that has no more than 2 repeating characters?
which was marked as an exact duplicate of this question.
Its much quicker to negate the match instead
if (!Regex.Match("hello world", #"(.)\1{2}").Success) Console.WriteLine("No dups");

Matching a number preceeded by a know string, followed by an unknown number of characters

[SOME_WORDS:200:1000]
Trying to match just the last 1000 part. Both numbers are variable and can contain an unknown number of characters (although they are expected to contain digits, I cannot rule out that they may also contain other characters). The SOME_WORDS part is known and does not change.
So I begin by doing a positive lookbehind for [SOME_WORDS: followed by a positive lookahead for the trailing ]
That gives us the pattern (?<=\[SOME_WORDS:).*(?=])
And captures the part 200:1000
Now because I don't know how many characters are after SOME_WORDS:, but I know that it ends with another : I use .*: to indicate any character any amount of time followed by :
That gives us the pattern (?<=\[SOME_WORDS:.*:).*(?=])
However at this point the pattern no longer matches anything and this is where I become confused. What am I doing wrong here?
If I assume that the first number will always be 3 characters long I can replace .* with ... to get the pattern (?<=\[SOME_WORDS:...:).*(?=]) and this correctly captures just the 1000 part. However I don't understand why replacing ... with .* makes the pattern not capture anything.
EDIT:
It seems like the online tool I was using to test the regex pattern wasn't working correctly. The pattern (?<=\[SOME_WORDS:.*:).*(?=]) matches the 1000 with no issues when actually done in .net
You usually cannot use a + or a * in a lookbehind, only in a lookahead.
If c# does allow these than you could use a .*? instead of a .* as the .* will eat the second :
Try this:
(?<=\[SOME_WORDS:)(?=\d+:(\d+)])
The match wil be in the first capture group
Quote from http://www.regular-expressions.info/lookaround.html
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.
As Robert Smit mentions this is due to the * being a greedy operator. Greedy operators consume as many characters as they possibly can when they are matched first. They only give up characters if the match fails. If you make the greedy operator lazy(*?), then matching consumes as little number of characters as possible for the match to succeed, so the : is not consumed by *. You can also use [^:]* which is match any character other than :.

Assist me on building my own regex

I'm completely new in this area, I need a regex that follows these rules:
Only numbers and symbols are allowed.
Must start with a number and ends with a number.
Must not contain more than 1 symbol in a row. (for example 123+-4567 is not accepted but 12+345-67 is accepted.
I tried ^[0-9]*[+-*/][0-9]*$ but I think it's a stupid try.
You were close with your attempt. This one should work.
^[0-9]+([+*/-][0-9]+)*$
explanation:
^ matches beginning of the string
[0-9]+ matches 1 or more digits.
[+*/-] matches one from specified symbols
([+*/-][0-9]+)* matches group of symbol followed by at least one digit, repeated 0 or more times
$ matches end of string
We'll build that one from individual parts and then we'll see how we can be smarter about that:
Numbers
\d+
will match an integer. Not terribly fancy, but we need to start somewhere.
Must start with a number and end with a number:
^\d+.*\d+$
Pretty straightforward. We don't know anything about the part in between, though (also the last \d+ will only match a single digit; we might want to fix that eventually).
Only numbers and symbols are allowed. Depending on the complexity of the rest of the regex this might be easier by explicitly spelling it out or using a negative lookahead to make sure there is no non-(number|symbol) somewhere in the string. I'll go for the latter here because we need that again:
(?!.*[^\d+*/-])
Sticking this to the start of the regex makes sure that the regex won't match if there is any non-(number|symbol) character anywhere in the string. Also note that I put the - at the end of the character class. This is because it has a certain special meaning when used between two other characters in a character class.
Must not contain more than one symbol in a row. This is a variation on the one before. We just make sure that there never is more than one symbol by using a negative lookahead to disallow two in sequence:
(?!.*[+/*-]{2})
Putting it all together:
(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$
Testing it:
PS Home:\> '123+-4567' -match '(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$'
False
PS Home:\> '123-4567' -match '(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$'
True
However, I only literally interpreted your rules. If you're trying to match arithmetic expressions that can have several operands and operators in sequence (but without parentheses), then you can approach that problem differently:
Numbers again
\d+
Operators
[+/*-]
A number followed by an operator
\d+[+/*-]
Using grouping and repetition to match a number followed by any number of repetitions of an operator and another number:
\d+([+/*-]\d+)*
Anchoring it so we match the whole string:
^\d+([+/*-]\d+)*$
Generally, for problems where it works, this latter approach works better and leads to more understandable expressions. The former approach has its merits, but most often only in implementing password policies (apart from »cannot repeat any of your previous 30689 passwords«).

What does .* do in regex?

After extensive search, I am unable to find an explanation for the need to use .* in regex. For example, MSDN suggests a password regex of
#\"(?=.{6,})(?=(.*\d){1,})(?=(.*\W){1,})"
for length >= 6, 1+ digit and 1+ special character.
Why can't I just use:
#\"(?=.{6,})(?=(\d){1,})(?=(\W){1,})"
.* just means "0 or more of any character"
It's broken down into two parts:
. - a "dot" indicates any character
* - means "0 or more instances of the preceding regex token"
In your example above, this is important, since they want to force the password to contain a special character and a number, while still allowing all other characters. If you used \d instead of .*, for example, then that would restrict that portion of the regex to only match decimal characters (\d is shorthand for [0-9], meaning any decimal). Similarly, \W instead of .*\W would cause that portion to only match non-word characters.
A good reference containing many of these tokens for .NET can be found on the MSDN here: Regular Expression Language - Quick Reference
Also, if you're really looking to delve into regex, take a look at http://www.regular-expressions.info/. While it can sometimes be difficult to find what you're looking for on that site, it's one of the most complete and begginner-friendly regex references I've seen online.
Just FYI, that regex doesn't do what they say it does, and the way it's written is needlessly verbose and confusing. They say it's supposed to match more than seven characters, but it really matches as few as six. And while the other two lookaheads correctly match at least one each of the required character types, they can be written much more simply.
Finally, the string you copied isn't just a regex, it's an XML attribute value (including the enclosing quotes) that seems to represent a C# string literal (except the closing quote is missing). I've never used a Membership object, but I'm pretty sure that syntax is faulty. In any case, the actual regex is:
(?=.{6,})(?=(.*\d){1,})(?=(.*\W){1,})
..but it should be:
(?=.{8,})(?=.*\d)(?=.*\W)
The first lookahead tries to match eight or more of any characters. If it succeeds, the match position (or cursor, if you prefer) is reset to the beginning and the second lookahead scans for a digit. If it finds one, the cursor is reset again and the third lookahead scans for a special character. (Which, by the way, includes whitespace, control characters, and a boatload of other esoteric characters; probably not what the author intended.)
If you left the .* out of the latter two lookaheads, you would have (?=\d) asserting that the first character is a digit, and (?=\W) asserting that it's not a digit. (Digits are classed as word characters, and \W matches anything that's not a word character.) The .* in each lookahead causes it to initially gobble up the whole string, then backtrack, giving back one character at a time until it reaches a spot where the \d or \W can match. That's how they can match the digit and the special character anywhere in the string.
The .* portion just allows for literally any combination of characters to be entered. It's essentially allowing for the user to add any level of extra information to the password on top of the data you are requiring
Note: I don't think that MSDN page is actually suggesting that as a password validator. It is just providing an example of a possible one.

Categories

Resources