Difficulty finding where to insert "word exclusion" in a regex - c#

I know the regex for excluding words, roughly anyway, It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?

ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
For consision, let's make all the negative lookarounds more compact, replacing them with a single (?!.*(?: |-|'){2})
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes [a-zA-Z], so you can drop [a-zA-Z]. But are you sure you want to match all letters in any script? (Thai etc). If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)

use this:
^(?!.*(the|and|or|a|given|some|that|this|then|than))(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$
see demo

Related

How to match string by using regular expression which will not allow same special character at same time?

I m trying to matching a string which will not allow same special character at same time
my regular expression is:
[RegularExpression(#"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
this solve my all requirement except the below two issues
this is my string : bracks
acceptable :
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (by the way above regular expression solved this)
not acceptable:
issue 1: b.. or bra..cks, b..racks, bra...cks (two or more any special character together),
issue 2: bra cks (two ore more white space together)
You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
The goal
From what i can tell by your description and pattern, you are trying to match text, which start and end with alphanumeric (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ inyour original pattern), and inside, you just don't want to have any two consecuive (adjacent) special characters, which, again, guessing from the regex, are . & ' -
What was wrong
^+ - i think here you wanted to assure that match starts at the beginning of the line/string, so you don't need + here
[a-zA-Z0-9.&' '-] - in this character class you doubled ' which is totally unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more text matching preceeding pattern
$ - anchor, match the end of the string
Regex demo
Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.

Simple phone number regex to match numbers, spaces, etc

I'm trying to modify a fairly basic regex pattern in C# that tests for phone numbers.
The patterns is -
[0-9]+(\.[0-9][0-9]?)?
I have two questions -
1) The existing expression does work (although it is fairly restrictive) but I can't quite understand how it works. Regexps for similar issues seem to look more like this one -
/^[0-9()]+$/
2) How could I extend this pattern to allow brackets, periods and a single space to separate numbers. I tried a few variations to include -
[0-9().+\s?](\.[0-9][0-9]?)?
Although i can't seem to create a valid pattern.
Any help would be much appreciated.
Thanks,
[0-9]+(\.[0-9][0-9]?)?
First of all, I recommend checking out either regexr.com or regex101.com, so you yourself get an understanding of how regex works. Both websites will give you a step-by-step explanation of what each symbol in the regex does.
Now, one of the main things you have to understand is that regex has special characters. This includes, among others, the following: []().-+*?\^$. So, if you want your regex to match a literal ., for example, you would have to escape it, since it's a special character. To do so, either use \. or [.]. Backslashes serve to escape other characters, while [] means "match any one of the characters in this set". Some special characters don't have a special meaning inside these brackets and don't require escaping.
Therefore, the regex above will match any combination of digits of length 1 or more, followed by an optional suffix (foobar)?, which has to be a dot, followed by one or two digits. In fact, this regex seems more like it's supposed to match decimal numbers with up to two digits behind the dot - not phone numbers.
/^[0-9()]+$/
What this does is pretty simple - match any combination of digits or round brackets that has the length 1 or greater.
[0-9().+\s?](\.[0-9][0-9]?)?
What you're matching here is:
one of: a digit, round bracket, dot, plus sign, whitespace or question mark; but exactly once only!
optionally followed by a dot and one or two digits
A suitable regex for your purpose could be:
(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+
Enter this in one of the websites mentioned above to understand how it works. I modified it a little to also allow, for example +49 123 4567890.
Also, for simplicity, I didn't include spaces - so when using this regex, you have to remove all the spaces in your input first. In C#, that should be possible with yourString.Replace(" ", ""); (simply replacing all spaces with nothing = deleting spaces)
The + after the character set is a quantifier (meaning the preceeding character, character set or group is repeated) at least one, and unlimited number of times and it's greedy (matched the most possible).
Then [0-9().+\s]+ will match any character in set one or more times.

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check if the next string is match to this pattern in this code:
string str = "CRSSA.T,";
var pattern = #"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
the site: http://www.regexr.com/ says it's not match(everything match, except the last comma), but that code prints True. is it possible?
thanks ahead! :)
First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z_], e.g. in .NET \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.

What does .* do in regex?

After extensive search, I am unable to find an explanation for the need to use .* in regex. For example, MSDN suggests a password regex of
#\"(?=.{6,})(?=(.*\d){1,})(?=(.*\W){1,})"
for length >= 6, 1+ digit and 1+ special character.
Why can't I just use:
#\"(?=.{6,})(?=(\d){1,})(?=(\W){1,})"
.* just means "0 or more of any character"
It's broken down into two parts:
. - a "dot" indicates any character
* - means "0 or more instances of the preceding regex token"
In your example above, this is important, since they want to force the password to contain a special character and a number, while still allowing all other characters. If you used \d instead of .*, for example, then that would restrict that portion of the regex to only match decimal characters (\d is shorthand for [0-9], meaning any decimal). Similarly, \W instead of .*\W would cause that portion to only match non-word characters.
A good reference containing many of these tokens for .NET can be found on the MSDN here: Regular Expression Language - Quick Reference
Also, if you're really looking to delve into regex, take a look at http://www.regular-expressions.info/. While it can sometimes be difficult to find what you're looking for on that site, it's one of the most complete and begginner-friendly regex references I've seen online.
Just FYI, that regex doesn't do what they say it does, and the way it's written is needlessly verbose and confusing. They say it's supposed to match more than seven characters, but it really matches as few as six. And while the other two lookaheads correctly match at least one each of the required character types, they can be written much more simply.
Finally, the string you copied isn't just a regex, it's an XML attribute value (including the enclosing quotes) that seems to represent a C# string literal (except the closing quote is missing). I've never used a Membership object, but I'm pretty sure that syntax is faulty. In any case, the actual regex is:
(?=.{6,})(?=(.*\d){1,})(?=(.*\W){1,})
..but it should be:
(?=.{8,})(?=.*\d)(?=.*\W)
The first lookahead tries to match eight or more of any characters. If it succeeds, the match position (or cursor, if you prefer) is reset to the beginning and the second lookahead scans for a digit. If it finds one, the cursor is reset again and the third lookahead scans for a special character. (Which, by the way, includes whitespace, control characters, and a boatload of other esoteric characters; probably not what the author intended.)
If you left the .* out of the latter two lookaheads, you would have (?=\d) asserting that the first character is a digit, and (?=\W) asserting that it's not a digit. (Digits are classed as word characters, and \W matches anything that's not a word character.) The .* in each lookahead causes it to initially gobble up the whole string, then backtrack, giving back one character at a time until it reaches a spot where the \d or \W can match. That's how they can match the digit and the special character anywhere in the string.
The .* portion just allows for literally any combination of characters to be entered. It's essentially allowing for the user to add any level of extra information to the password on top of the data you are requiring
Note: I don't think that MSDN page is actually suggesting that as a password validator. It is just providing an example of a possible one.

Regex setting word characters and matching exact word

I need my C# regex to only match full words, and I need to make sure that +-/*() delimit words as well (I'm not sure if the last part is already set that way.) I find regexes very confusing and would like some help on the matter.
Currently, my regex is:
public Regex codeFunctions = new Regex("draw_line|draw_rectangle|draw_circle");
Thank you! :)
Try
public Regex codeFunctions = new Regex(#"\b(draw_line|draw_rectangle|draw_circle)\b");
The \b means match a word boundary, i.e. a transition from a non-word character to a word character (or vice versa).
Word characters include alphabet characters, digits, and the underscore symbol. Non-word characters include everything else, including +-/*(), so it should work fine for you.
See the Regex Class documentation for more details.
The # at the start of the string makes the string a verbatim string, otherwise you have to type two backslashes to make one backslash.
Do you want to match any words, or just the words listed above? To match an arbitrary word, substitute this for the bit that creates the Regex object:
new Regex (#"\b(\w+)\b");
In the future, if you want more characters to be treated as whitespace (for example, underscores), I would recommend String.Replace-ing them to a space character. There may be a clever way to get the same effect with regular expressions, but personally I think it would be too clever. The String.Replace version is obvious.
Also, I can't help but recommend that you read up on regular expressions. Yes, they look like line noise until you get used to them, but once you do they're convenient and there are plenty of good resources out there to help you.

Categories

Resources