Regular expression to match following criteria [duplicate] - c#

I am using the following regular expression without restricting any character length:
var test = /^(a-z|A-Z|0-9)*[^$%^&*;:,<>?()\""\']*$/ // Works fine
When I try to restrict the length to 15 characters as below, it throws an error:
var test = /^(a-z|A-Z|0-9)*[^$%^&*;:,<>?()\""\']*${1,15}/ // Uncaught SyntaxError: Invalid regular expression
How can I make the above regular expression work with a limit of 15 characters?

You cannot apply quantifiers to anchors. Instead, to restrict the length of the input string, use a lookahead anchored at the beginning:
// ECMAScript (JavaScript, C++)
^(?=.{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$
^^^^^^^^^^^
// Or, in flavors other than ECMAScript and Python
\A(?=.{1,15}\z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\z
^^^^^^^^^^^^^^^
// Or, in Python
\A(?=.{1,15}\Z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\Z
^^^^^^^^^^^^^^^
Also, I assume you wanted to match 0 or more letters or digits with (a-z|A-Z|0-9)*. It should look like [a-zA-Z0-9]* (i.e. use a character class here).
Why not use a limiting quantifier, like {1,15}, at the end?
Quantifiers apply only to the subpattern immediately to their left, be it a group, a character class, or a literal symbol. Thus, ^[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']{1,15}$ only restricts the second character class [^$%^&*;:,<>?()\"'] to 1 to 15 characters. The pattern ^(?:[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*){1,15}$ "restricts" the sequence of 2 subpatterns of unlimited length (as * and + can match an unlimited number of characters) to 1 to 15 repetitions, so the length of the whole input string is still not restricted.
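As a rough illustration of the difference, here is a JavaScript sketch (JavaScript because the question uses a JavaScript regex literal; the variable names and sample strings are made up for this example). It compares the three patterns on a 20-character input:

```javascript
// Quantifier on the second character class only: limits that class, not the string.
const classLimited = /^[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']{1,15}$/;
// Quantifier on a group of two unbounded subpatterns: still no overall limit.
const groupLimited = /^(?:[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*){1,15}$/;
// Lookahead anchored at the start: limits the whole string to 1-15 characters.
const lookaheadLimited = /^(?=.{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$/;

const tooLong = "abcdefghijklmnopqrst"; // 20 characters
const shortEnough = "abc_123";          // 7 characters

console.log(classLimited.test(tooLong));         // true (length not restricted)
console.log(groupLimited.test(tooLong));         // true (length not restricted)
console.log(lookaheadLimited.test(tooLong));     // false
console.log(lookaheadLimited.test(shortEnough)); // true
```

Only the lookahead version rejects the long input, because it is the only one that counts characters across the whole string.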
How does the lookahead restriction work?
The (?=.{1,15}$) / (?=.{1,15}\z) / (?=.{1,15}\Z) positive lookahead appears right after the ^/\A start-of-string anchor (note that in Ruby, \A is the only anchor that matches only at the start of the whole string). It is a zero-width assertion that only returns true or false after checking if its subpattern matches the subsequent characters. So, this lookahead tries to match any 1 to 15 (due to the limiting quantifier {1,15}) characters other than a newline, right up to the end of the string (due to the $/\z/\Z anchor). If we remove the $ / \z / \Z anchor from the lookahead, the lookahead will only require the string to start with 1 to 15 such characters, while the total string length can be anything.
If the input string can contain a newline sequence, you should use the [\s\S] portable any-character regex construct (it will work in JS and other common regex flavors):
// ECMAScript (JavaScript, C++)
^(?=[\s\S]{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$
^^^^^^^^^^^^^^^^^
// Or, in flavors other than ECMAScript and Python
\A(?=[\s\S]{1,15}\z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\z
^^^^^^^^^^^^^^^^^^
// Or, in Python
\A(?=[\s\S]{1,15}\Z)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*\Z
^^^^^^^^^^^^^^^^^^
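A small JavaScript sketch of the newline case (the sample strings and variable names are my own): with `.` the lookahead cannot get past a line break, while `[\s\S]` counts it like any other character.

```javascript
// Lookahead that counts with "." (does not match line breaks in JavaScript).
const dotVersion = /^(?=.{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$/;
// Lookahead that counts with [\s\S] (matches any character, including line breaks).
const anyCharVersion = /^(?=[\s\S]{1,15}$)[a-zA-Z0-9]*[^$%^&*;:,<>?()\"']*$/;

const withNewline = "abc\ndef";                // 7 characters including the newline
const longWithNewline = "abcdefgh\nijklmnopq"; // 18 characters including the newline

console.log(dotVersion.test(withNewline));         // false: "." stops at the newline
console.log(anyCharVersion.test(withNewline));     // true
console.log(anyCharVersion.test(longWithNewline)); // false: more than 15 characters
```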

Related

Regular expression that Must have at least one letter

I have a case where I am using a queue of regular expressions to filter out specific items in an Observer pattern. The filter will place the values in specific controls based on their values. However, one of the controls' patterns is that it can accept ANY ASCII character. Let me list the filters in their order with the regex:
Column     Rule                       Regex
Receiving  7 DIGITS                   @"^[1-9]([0-9]{6}$)"    --->Works
Count      2 digits, no leading 0     @"^[1-9]([0-9]{0,1})$"  --->Works
Producer   any ASCII char.            @".*"                   --->too broad
           MUST contain a letter
Is there a regular expression that will accept any set of ASCII characters, but 1 of them MUST be a letter (upper or lower case)?
@"^(?=.*[A-Za-z])$" -->Didn't work
Examples that would need to match the expression:
123 red
red
123 red123
red - 123
red
If you want to match the whole range of ASCII chars, you may use
@"^(?=[^A-Za-z]*[A-Za-z])[\x00-\x7F]*$"
If only printable chars are allowed, use
@"^(?=[^A-Za-z]*[A-Za-z])[ -~]*$"
Note the (?=[^A-Za-z]*[A-Za-z]) positive lookahead is located right after ^, that is, it is only triggered at the start of a string. It requires an ASCII letter after any 0 or more chars other than an ASCII letter.
Your ^(?=.*[A-Za-z])$ pattern did not work because you wanted to match an empty string (^$) that contains (?=...) at least one ASCII letter ([A-Za-z]) after any 0+ chars other than newline (.*).
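Here is a quick way to check these patterns, sketched in JavaScript (the lookahead and character class syntax used here is the same as in .NET; the variable names and the extra negative samples are my own):

```javascript
// Any ASCII characters, at least one of which is an ASCII letter.
const asciiWithLetter = /^(?=[^A-Za-z]*[A-Za-z])[\x00-\x7F]*$/;
// Printable ASCII only (space through tilde), at least one ASCII letter.
const printableWithLetter = /^(?=[^A-Za-z]*[A-Za-z])[ -~]*$/;
// The pattern from the question: the lookahead is fine, but ^$ demands an empty string.
const fromQuestion = /^(?=.*[A-Za-z])$/;

const samples = ["123 red", "red", "123 red123", "red - 123"];

for (const s of samples) {
  console.log(s, asciiWithLetter.test(s), printableWithLetter.test(s)); // true true
}
console.log(asciiWithLetter.test("12345"));       // false: no letter
console.log(printableWithLetter.test("red\t1"));  // false: tab is not printable
console.log(fromQuestion.test("red"));            // false: can never match
```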
You could try [A-Za-z]+.
It matches when there is at least one letter. You want something more specific?
How about
^.*[a-zA-Z]+.*$ ?
Between start and end of line, accept any number of any characters, then at least one a-z/A-Z character, then again any number of any characters.

Regex for first name

I am quite new to regex thing and need regex for first name which satisfies following conditions:
First Name must contain letters only. It may contain spaces, hyphens, or apostrophes.
It must begin with letters.
All other characters and numbers are not valid.
Special characters ‘ and – cannot be together (e.g. John’-s is not allowed)
An alphabet should be present before and after the special characters ‘ and – (e.g. John ‘s is not allowed)
Two consecutive spaces are not allowed (e.g. Annia  St is not allowed)
Can anyone help? I tried this ^([a-z]+['-]?[ ]?|[a-z]+['-]?)*?[a-z]$ but it's not working as expected.
Regexes are notoriously difficult to write and maintain.
One technique that I've used over the years is to annotate my regexes by using named capture groups. It's not perfect, but can greatly help with the readability and maintainability of your regex.
Here is a regex that meets your requirements.
^(?<firstchar>(?=[A-Za-z]))((?<alphachars>[A-Za-z])|(?<specialchars>[A-Za-z]['-](?=[A-Za-z]))|(?<spaces> (?=[A-Za-z])))*$
It is split down into the following parts:
1) (?<firstchar>(?=[A-Za-z])) This ensures the first character is an alpha character, upper or lowercase.
2) (?<alphachars>[A-Za-z]) We allow more alpha chars.
3) (?<specialchars>[A-Za-z]['-](?=[A-Za-z])) We allow special characters, but only with an alpha character before and after.
4) (?<spaces> (?=[A-Za-z])) We allow spaces, but only one space, which must be followed by alpha characters.
You should use a testing tool when writing regexes, I'd recommend https://regex101.com/
You can see from the screenshot below how this regex performs.
Take the regex I've given you, run it in https://regex101.com/ with samples you'd like to match against, and tweak it to fit your requirements. Hopefully I've given you enough information to be self sufficient in customising it to your needs.
You can use this link to run the regex https://regex101.com/r/O2wFfi/1/
Edit
I've updated the answer to address the issue in your comment. Rather than just give you the code, I will explain the problem and how I fixed it.
For your example "Sam D'Joe", if we run the original regex, the following happens.
^(?<firstchar>[A-Za-z])((?<alphachars>[A-Za-z])|(?<specialchars>[A-Za-z]['-][A-Za-z])|(?<spaces> [A-Za-z]))*$
1) ^ matches the start of the string
2) (?<firstchar>[A-Za-z]) matches the first character
3) (?<alphachars>[A-Za-z]) matches every character up to the space
4) (?<spaces> [A-Za-z]) matches the space and the subsequent alpha char
Matches consume the characters that they match
This is where we run into a problem. Our "specialchars" part of the regex matches an alpha char, our special char and then another alpha char ((?<specialchars>[A-Za-z]['-][A-Za-z])).
The thing you need to know about regexes, is each time you match a character, that character is then consumed. We've already matched the alpha char before the special character, so our regex will never match.
Each step actually looks like this:
1) ^ matches the start of the string
2) (?<firstchar>[A-Za-z]) matches the first character
3) (?<alphachars>[A-Za-z]) matches every character up to the space
4) (?<spaces> [A-Za-z]) matches the space and the subsequent alpha char
and then we're left with the following: 'Joe
We cannot match this, because one of our rules is "An alphabet should be present before and after the special characters ‘ and –".
Lookahead
Regex has a concept called "lookahead". A lookahead allows you to match a character without consuming it!
The syntax for a lookahead is (?=...), where ... is what you want to match. E.g. (?=[A-Z]) would look ahead for a single character that is an uppercase letter.
We can fix our regex, by using lookaheads.
1) ^ matches the start of the string
2) (?<firstchar>[A-Za-z]) matches the first character
3) (?<alphachars>[A-Za-z]) matches every character up to the space
4) We now change our "spaces" regex to look ahead to the alpha char, so we don't consume it. We change (?<spaces> [A-Za-z]) to (?<spaces> (?=[A-Za-z])). This matches the space and looks ahead to the subsequent alpha char, but doesn't consume it.
5) (?<specialchars>[A-Za-z]['-][A-Za-z]) matches the alpha char, the special char, and the subsequent alpha char.
6) We use the * quantifier to repeat matching our previous 3 rules multiple times, and we match until the end of the line.
I also added lookaheads to the "firstchar", "specialchars" and "spaces" capture groups. The updated regex is below.
^(?<firstchar>(?=[A-Za-z]))((?<alphachars>[A-Za-z])|(?<specialchars>[A-Za-z]['-](?=[A-Za-z]))|(?<spaces> (?=[A-Za-z])))*$
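To see the "consumed character" problem in action, here is a JavaScript sketch that runs the original and the lookahead-based regex side by side (named groups and lookaheads use the same syntax in JavaScript; the variable names and the sample names other than "Sam D'Joe" are my own):

```javascript
// Original version: every alternative consumes the letter that follows it.
const original =
  /^(?<firstchar>[A-Za-z])((?<alphachars>[A-Za-z])|(?<specialchars>[A-Za-z]['-][A-Za-z])|(?<spaces> [A-Za-z]))*$/;
// Fixed version: the following letter is only checked by a lookahead, not consumed.
const fixed =
  /^(?<firstchar>(?=[A-Za-z]))((?<alphachars>[A-Za-z])|(?<specialchars>[A-Za-z]['-](?=[A-Za-z]))|(?<spaces> (?=[A-Za-z])))*$/;

console.log(original.test("Sam D'Joe")); // false: the "D" was consumed with the space
console.log(fixed.test("Sam D'Joe"));    // true

// Inputs the rules in the question say must be rejected.
for (const name of ["John'-s", "John 's", "Annia  St", "1John", "John-"]) {
  console.log(name, fixed.test(name));   // false for each
}
console.log(fixed.test("Anne-Marie"));   // true
```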
This short regex should do it: ^([a-zA-Z]+?)([-\s'][a-zA-Z]+)*?$
([a-zA-Z]+?) - means the string should start with letters.
([-\s'][a-zA-Z]+)*? - means the string may continue with groups of a hyphen, space or apostrophe followed by letters.
^ and $ - denote start and end of string
Try this one
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))[A-Za-z- '.]{2,}$

Only replace pattern if the whole line matches regex

I am sure there is a trivial solution to this question but I can't seem to get it right:
I want to replace a specific pattern in a line only if the whole line matches the regex.
So in my case three pipes | should be replaced by underscores _ only if the whole line is numbers and pipes:
|||10|||-80|||-120|||400 ---> replace
|||10|||asdf|||-120|||400 ---> don't replace
|||10|||-80|||400 ---> replace
|||10|||-80|||-120|||400|||test ---> don't replace
Expected result:
___10___-80___-120___400
|||10|||asdf|||-120|||400
___10___-80___400
|||10|||-80|||-120|||400|||test
My attempts:
\|\|\|(?=\-?\d+)
replaces the pipes if followed by numbers as expected but of course also in the "invalid" lines
^(\|\|\|\-?\d+){1,}$
matches the whole line and therefore I can't replace only the pipes
I understand why my patterns don't work and perhaps I have to simply do it with two passes but it feels like this should totally be possible.
Without more details, it seems you can use
(?<=^(?:\|{3}-?\d+)*)\|{3}(?=-?\d+(?:\|{3}-?\d+)*$)
Or, if you need to process lines in a larger string:
(?m)(?<=^(?:\|{3}-?\d+)*)\|{3}(?=-?\d+(?:\|{3}-?\d+)*\r?$)
Details:
(?<=^(?:\|{3}-?\d+)*) - a positive lookbehind that requires that, immediately to the left of the current location, there is:
    ^ - start of string anchor
    (?:\|{3}-?\d+)* - zero or more sequences of 3 |s followed with an optional - (-?) and then 1 or more digits
\|{3} - 3 pipes
(?=-?\d+(?:\|{3}-?\d+)*$) - a positive lookahead that requires that, immediately to the right of the current location, there is:
    -?\d+ - an optional - and then 1+ digits
    (?:\|{3}-?\d+)* - 0 or more sequences of 3 |s + an optional - and then 1+ digits
    $ - end of string anchor.
C#:
var res = Regex.Replace(s, @"(?<=^(?:\|{3}-?\d+)*)\|{3}(?=-?\d+(?:\|{3}-?\d+)*$)", "___", RegexOptions.ECMAScript);
The RegexOptions.ECMAScript flag is used to make \d only match ASCII digits.
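For comparison, the same idea can be sketched in JavaScript, assuming a runtime with lookbehind support (for example a recent Node.js). JavaScript has no inline (?m), so the m flag is used instead, and \d is already ASCII-only there. The function and variable names are hypothetical.

```javascript
// m flag: ^ and $ work per line; g flag: replace every qualifying "|||".
const pipesInNumericLine =
  /(?<=^(?:\|{3}-?\d+)*)\|{3}(?=-?\d+(?:\|{3}-?\d+)*\r?$)/gm;

function replacePipes(text) {
  return text.replace(pipesInNumericLine, "___");
}

// The four sample lines from the question.
const input = [
  "|||10|||-80|||-120|||400",
  "|||10|||asdf|||-120|||400",
  "|||10|||-80|||400",
  "|||10|||-80|||-120|||400|||test",
].join("\n");

console.log(replacePipes(input));
```

The lookbehind checks everything from the start of the line up to the current pipes, and the lookahead checks everything from there to the end of the line, so a single invalid field anywhere in the line blocks all replacements in that line.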

Star vs. plus quantifier in the variable-width negative lookbehind

Silly question here... I'm trying to match white-space inside the line while ignoring the leading spaces/tabs and came up with these regex strings, but I can't figure out why only one is working (C# regex engine):
(?<!^[ \t]*)[ \t]+ // regex 1. (with *)
(?<!^[ \t]+)[ \t]+ // regex 2. (with +)
Note the star vs. plus repetitions in the negative lookbehind. When matching these against "  word1 word2" (2 leading spaces):
⎵⎵word1⎵word2
^ // 1 match for regex 1. (*)
⎵⎵word1⎵word2
^^ ^ // 2 matches for regex 2. (+)
^ ^ // why not match like this?
Why does only version 1. (star) work here and version 2. (plus) not match the second leading space?
I presume that it's because of the higher priority of the greedy + from [ \t]+ over the lookbehind's, but how can I rationalize to expect this?
In short:
The negative lookbehind just checks if the current position is not preceded with the lookbehind pattern and the result of the check is either true (yes, go on matching) or false (stop processing the pattern, go for the next match). The check is not affecting the regex index, the engine remains at one and the same location after performing the check.
In the current expressions, the lookbehind pattern is checked first (as the pattern is parsed from left to right, not vice versa), and only if the lookbehind check returns true the [ \t]+ pattern is tried. In the first expression, the negative lookbehind returns false as the lookbehind pattern finds a match (the start of string). The second expression negative lookbehind returns true because there is no start of string followed with 1 or more spaces/tabs at the beginning of a string.
Here is the logic behind the 2 expressions:
The lookbehind check is performed first. In the first expression, (?<!^[ \t]*) is first tried at the beginning of the string. At that position, the text to the left does match the lookbehind pattern: the start of string (^) followed by zero spaces or tabs. It is important to note that a lookbehind implementation in .NET checks the string in the opposite direction: it reads the string right to left, looking for zero or more spaces or tabs and then the string boundary. So, in case of (?<!^[ \t]*), the negative lookbehind returns false because there is a start position before 0 spaces or tabs (note we are still at the beginning of a string). The second expression lookbehind, (?<!^[ \t]+), returns true, because there is no tab or space before the current position at the 0th index in the string, and thus, the [ \t]+ consuming pattern grabs the leading horizontal whitespace. That moves the regex index further and another match is found later in the string.
After failure at the beginning of the string, the first expression tries to match after the first space. However, the (?<!^[ \t]*) returns false because there is beginning of string followed with 1 space (the first one). Same story repeats with the position after the second space. The only spaces matched with the first (?<!^[ \t]*)[ \t]+ expression are those that are not at the beginning of the string.
Lookahead analogy
Check the analogous lookahead patterns: a [ \t]+(?![ \t]+$) pattern will find both whitespace chunks in "bb bb ", while [ \t]+(?![ \t]*$) will not match those at the end of the string. The same logic applies: 1) the * version allows matching an empty string, so the end of string is found, the negative lookahead returns false, and the match fails; 2) when the + version encounters and consumes the trailing whitespace, the regex engine, staying at the end of string, cannot find 1 or more spaces/tabs followed with another end of string; thus, the negative lookahead returns true and the trailing whitespace is matched.
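Both pairs of patterns can be tried out with a short JavaScript sketch. This assumes a runtime with variable-width lookbehind support (for example a recent Node.js), which gives the same matches as described above for these inputs; the helper name findAll is hypothetical.

```javascript
// Collect every match with its start index.
function findAll(regex, text) {
  return Array.from(text.matchAll(regex), (m) => ({ index: m.index, text: m[0] }));
}

const input = "  word1 word2"; // 2 leading spaces

// Star: "start of string + zero or more blanks" precedes every leading position.
const star = /(?<!^[ \t]*)[ \t]+/g;
// Plus: nothing precedes index 0, so the check passes and [ \t]+ grabs both blanks.
const plus = /(?<!^[ \t]+)[ \t]+/g;

console.log(findAll(star, input)); // one match: index 7
console.log(findAll(plus, input)); // two matches: index 0 ("  ") and index 7

// The analogous lookahead pair on trailing whitespace.
const trailing = "bb bb ";
console.log(findAll(/[ \t]+(?![ \t]+$)/g, trailing)); // two matches: index 2 and 5
console.log(findAll(/[ \t]+(?![ \t]*$)/g, trailing)); // one match: index 2
```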

Better way to write this RegEx?

I have this password regex for an application that is being built. Its purpose is to:
Make sure users use between 6 - 12 characters.
Make sure users use either one special character or one number.
Also that it's case-insensitive.
The application is in .NET. I have the following regex for the password checker. It's a bit lengthy, but it's here for your viewing; if you feel any of this is wrong, please let me know.
^(?=.*\d)(?=.*[A-Za-z]).{6-12}$|^(?=.*[A-Za-z])(?=.*[!#$%&'\(\)\*\+-\.:;<=>\?@\[\\\]\^_`\{\|\}~0x0022]|.*\s).{6,12}$
Just a breakdown of the regex to make sure you're all happy it's correct.
^ = start of string ”^”
(?=.*\d) = must contain “?=” any set of characters “.*” but must include a digit “\d”.
(?=.*[A-Za-z]) = must contain “?=” any set of characters “.*” but must include an insensitive case letter.
.{6-12}$ = must contain any set of characters “.” but must have between 6-12 characters and end of string “$”.
|^ = or “|” start of string “^”
(?=.*[A-Za-z]) = must contain “?=” any set of characters “.*” but must include an insensitive case letter.
(?=.*[!#$%&'\(\)\*\+-\.:;<=>\?@\[\\\]\^_`\{\|\}~0x0022]|.*\s) = must contain “?=” any set of characters “.*” but must include at least one special character we have defined or a space ”|.*\s)”. “0x0022” is Unicode for the double quote " character.
.{6,12}$ = set of characters “.” must be between 6 – 12 and this is the end of the string “$”
It's quite long-winded and seems to be doing the job, but I want to know if there are simpler methods to write this sort of regex, and how I can shorten it if that's possible.
Thanks in advance.
Does it have to be regex? Looking at the requirements, all you need is String.Length and String.IndexOfAny().
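A minimal sketch of that non-regex approach, written in JavaScript for illustration. JavaScript has no String.IndexOfAny(), so a small hypothetical helper, containsAny, stands in for it. The DIGITS and SPECIALS sets are assumptions to be adjusted to your own rules, and only the two requirements listed in the question are checked.

```javascript
// Hypothetical character sets; adjust to whatever counts as "special" in your app.
const DIGITS = "0123456789";
const SPECIALS = "!#$%&'()*+-.:;<=>?@[\\]^_`{|}~\" ";

// Stand-in for checking String.IndexOfAny(...) >= 0 in .NET.
function containsAny(text, chars) {
  return Array.from(text).some((ch) => chars.includes(ch));
}

function isValidPassword(password) {
  const lengthOk = password.length >= 6 && password.length <= 12;
  return lengthOk && containsAny(password, DIGITS + SPECIALS);
}

console.log(isValidPassword("abc123"));        // true
console.log(isValidPassword("abcdef"));        // false: no digit or special char
console.log(isValidPassword("abc!"));          // false: too short
console.log(isValidPassword("abcdefghijkl1")); // false: 13 characters
```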
First, good job at providing comments for your regex. However, there is a much better way. Simply write your regex from the get-go in free-spacing mode with lots of comments. This way you can document your regex right in the source code (and provide indentation to improve readability when there are lots of parentheses). Here is how I would write your original regex in C# code:
if (Regex.IsMatch(usernameString,
    @"# Validate username having a digit and/or special char.
      ^                   # Either... Anchor to start of string.
      (?=.*\d)            # Assert there is a digit AND
      (?=.*[A-Za-z])      # assert there is an alpha.
      .{6-12}             # Match any name with length from 6 to 12.
      $                   # Anchor to end of string.
    | ^                   # Or... Anchor to start of string
      (?=.*[A-Za-z])      # Assert there is an alpha AND
      (?=.*               # assert there is either a special char
        [!#$%&'\(\)\*\+-\.:;<=>\?@\[\\\]\^_`\{\|\}~\x22]
      | .*\s              # or a space char.
      )                   # End specialchar-or-space assertion.
      .{6-12}             # Match any name with length from 6 to 12.
      $                   # Anchor to end of string.
    ", RegexOptions.IgnorePatternWhitespace)) {
    // Valid username.
} else {
    // Invalid username.
}
The code snippet above uses the preferable @"..." string syntax which simplifies the escaping of metacharacters. This original regex erroneously separates the two numbers of the curly brace quantifier using a dash, i.e. .{6-12}. The correct syntax is to separate these numbers with a comma, i.e. .{6,12}. (Maybe .NET allows using the .{6-12} syntax?) I've also changed the 0x0022 (the " double quote char) to \x22.
That said, yes the original regex can be improved a bit:
if (Regex.IsMatch(usernameString,
    @"# Validate username having a digit and/or special char.
    ^                     # Anchor to start of string.
    (?=.*?[A-Za-z])       # Assert there is an alpha.
    (?:                   # Group for assertion alternatives.
      (?=.*?\d)           # Either assert there is a digit
    |                     # or assert there is a special char
      (?=.*?[!#$%&'()*+-.:;<=>?@[\\\]^_`{|}~\x22\s])  # or space.
    )                     # End group of assertion alternatives.
    .{6,12}               # Match any name with length from 6 to 12.
    $                     # Anchor to end of string.
    ", RegexOptions.IgnorePatternWhitespace)) {
    // Valid username.
} else {
    // Invalid username.
}
This regex eliminates the global alternative and instead uses a non-capture group for the "digit or specialchar" assertion alternatives. Also, you can eliminate the non-capture group for the "special char or whitespace" alternatives by simply adding the \s to the list of special chars. I've also added a lazy modifier to the dot-stars in the assertions, i.e. .*? - (this may make the regex match a bit faster.) A bunch of unnecessary escapes were removed from the specialchar character class.
But as Stema cleverly pointed out, you can combine the digit and special char to simplify this even further:
if (Regex.IsMatch(usernameString,
    @"# Validate username having a digit and/or special char.
    ^                     # Anchor to start of string
    (?=.*?[A-Za-z])       # Assert there is an alpha.
                          # Assert there is a special char, space
    (?=.*?[!#$%&'()*+-.:;<=>?@[\\\]^_`{|}~\x22\s\d])  # or digit.
    .{6,12}               # Match any name with length from 6 to 12.
    $                     # Anchor to end of string.
    ", RegexOptions.IgnorePatternWhitespace)) {
    // Valid username.
} else {
    // Invalid username.
}
Other than that, there is really nothing wrong with your original regex with regard to accuracy. However, logically, this formula allows a username to end with whitespace which is probably not a good idea. I would also explicitly specify a whitelist of allowable chars in the name rather than using the overly permissive "." dot.
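For a quick sanity check of the combined version, here is a JavaScript sketch. JavaScript has no free-spacing mode, so the pattern is written on one line. The special-character class is copied as I read it from the answer, so treat the exact list as an assumption; the sample passwords are my own.

```javascript
// One lookahead for a letter, one for "special char, whitespace or digit",
// then 6 to 12 characters in total.
// Note: "+-." inside the class is a range (plus, comma, hyphen, dot).
const passwordPattern =
  /^(?=.*?[A-Za-z])(?=.*?[!#$%&'()*+-.:;<=>?@[\\\]^_`{|}~\x22\s\d]).{6,12}$/;

console.log(passwordPattern.test("abc123"));        // true: letter + digit
console.log(passwordPattern.test("pass word"));     // true: letter + whitespace
console.log(passwordPattern.test("abcdef"));        // false: no digit or special char
console.log(passwordPattern.test("123456"));        // false: no letter
console.log(passwordPattern.test("abc1"));          // false: too short
console.log(passwordPattern.test("abcdefghijkl1")); // false: 13 characters
```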
I am not sure if it makes sense what you are doing, but to achieve that, your regex can be simpler:
^(?=.*[A-Za-z])(?=.*[\d\s!#$%&'\(\)\*\+-\.:;<=>\?@\[\\\]\^_`\{\|\}~0x0022]).{6,12}$
Why use alternatives? Just add \d and \s to the character class.
