I know that regex questions have been asked many times before, but I just can't get it to work the way I need. What I need is a regex that requires a minimum of 8 characters, containing at least one digit (digits can appear at the start, at the end or after other characters), and that supports Unicode, so that Hebrew, Arabic etc. characters can be used.
Here's the basic regex:
^(?=.*?\d).{8}
^.{8} will match any string that has at least 8 characters. (?=.*?\d) will assert there's a digit in there.
As for the Unicode support, that's up to the regex engine. If Unicode is supported, . should match a Unicode character. If you want to match graphemes instead, your regex flavor may support \X, which you could use in place of the dot.
If you want to allow non-latin digits, you may need to replace \d with \p{N} depending on your regex engine.
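As a rough illustration, here is the basic pattern tried out in Python 3, whose re module treats str patterns as Unicode by default, so both . and \d work on Hebrew or Arabic input. The helper name is_valid_password is made up for this sketch, and other engines may behave differently.

```python
import re

# The basic pattern from above: a lookahead for a digit, then 8 characters.
PASSWORD = re.compile(r"^(?=.*?\d).{8}")

def is_valid_password(text):
    """Return True if text has at least 8 characters and at least one digit."""
    return PASSWORD.match(text) is not None

print(is_valid_password("שלוםעולם1"))   # Hebrew letters plus an ASCII digit
print(is_valid_password("كلمةالسر٣"))   # Arabic letters plus an Arabic-Indic digit
print(is_valid_password("abcdefgh"))    # long enough, but no digit
print(is_valid_password("abc1"))        # has a digit, but too short
```

In Python 3 the \d class already covers non-Latin decimal digits for str patterns, which is why the Arabic-Indic digit is accepted without switching to another class.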
Update for the .NET flavor:
\d already matches Unicode decimal digits (by default it is equivalent to \p{Nd}), so you don't need to use \p{N}.
\X is not supported so you'll have to stick with . or use a workaround like (?>\P{M}\p{M}*).
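To show what the \P{M}\p{M}* workaround is counting, here is a small Python sketch. Python's built-in re module has no \p{M}, so this uses the unicodedata module instead; the function name count_base_plus_marks is hypothetical, and it assumes the text does not start with a combining mark. It is only an approximation of real grapheme clusters.

```python
import unicodedata

def count_base_plus_marks(text):
    """Count units of 'one non-mark character followed by any combining marks'."""
    # Every unit starts with exactly one non-mark character, so counting
    # the characters outside the Unicode M (mark) categories counts the units.
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(decomposed))                    # number of code points
print(count_base_plus_marks(decomposed))  # number of base-plus-marks units
```

With a plain dot, the decomposed form above would count as two characters toward the minimum length; with the workaround it counts as one.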
Assuming you are using a C#- or Java-like regex flavor, and that by "characters" you mean characters in the Unicode category "letter", you can use:
(?=\p{L}*?\p{Nd})[\p{L}\p{Nd}]{8,}
I'm trying to modify a fairly basic regex pattern in C# that tests for phone numbers.
The pattern is -
[0-9]+(\.[0-9][0-9]?)?
I have two questions -
1) The existing expression does work (although it is fairly restrictive) but I can't quite understand how it works. Regexps for similar issues seem to look more like this one -
/^[0-9()]+$/
2) How could I extend this pattern to allow brackets, periods and a single space to separate numbers? I tried a few variations, including -
[0-9().+\s?](\.[0-9][0-9]?)?
However, I can't seem to create a valid pattern.
Any help would be much appreciated.
Thanks,
[0-9]+(\.[0-9][0-9]?)?
First of all, I recommend checking out either regexr.com or regex101.com, so you yourself get an understanding of how regex works. Both websites will give you a step-by-step explanation of what each symbol in the regex does.
Now, one of the main things you have to understand is that regex has special characters. This includes, among others, the following: []().-+*?\^$. So, if you want your regex to match a literal ., for example, you would have to escape it, since it's a special character. To do so, either use \. or [.]. Backslashes serve to escape other characters, while [] means "match any one of the characters in this set". Some special characters don't have a special meaning inside these brackets and don't require escaping.
Therefore, the regex above will match any run of one or more digits, followed by an optional suffix (the group ending in ?), which has to be a dot followed by one or two digits. In fact, this regex looks more like it's supposed to match decimal numbers with up to two digits after the dot, not phone numbers.
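A quick way to convince yourself of that reading is to run the pattern against a few inputs. The sketch below uses Python rather than C#, and uses fullmatch so the whole string has to fit the pattern; the sample strings are made up.

```python
import re

# The original pattern: digits, then an optional dot plus one or two digits.
DECIMAL = re.compile(r"[0-9]+(\.[0-9][0-9]?)?")

for sample in ["42", "3.1", "3.14", "3.141", "3.", "555-1234"]:
    # fullmatch only succeeds if the entire string fits the pattern
    print(sample, bool(DECIMAL.fullmatch(sample)))
```

The inputs that look like decimal numbers with at most two fractional digits are accepted, while the phone-number-like string is not.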
/^[0-9()]+$/
What this does is pretty simple - match any combination of digits or round brackets that has the length 1 or greater.
[0-9().+\s?](\.[0-9][0-9]?)?
What you're matching here is:
one of: a digit, round bracket, dot, plus sign, whitespace or question mark; but exactly once only!
optionally followed by a dot and one or two digits
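The "exactly once" part is easy to see in a small experiment. This is a Python sketch of the attempted pattern, with made-up sample strings.

```python
import re

# The attempted pattern: ONE character from the set, then the optional suffix.
ATTEMPT = re.compile(r"[0-9().+\s?](\.[0-9][0-9]?)?")

first = ATTEMPT.match("(012) 345")
print(repr(first.group()))                     # only the first character is consumed
print(ATTEMPT.fullmatch("(012) 345") is None)  # the whole string does not fit
print(ATTEMPT.fullmatch("7.25") is not None)   # one digit plus a dot and two digits fits
```

Because the character set has no quantifier after it, anything longer than one character plus the optional suffix can never match as a whole.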
A suitable regex for your purpose could be:
(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+
Enter this in one of the websites mentioned above to understand how it works. I modified it a little to also allow, for example +49 123 4567890.
Also, for simplicity, I didn't include spaces - so when using this regex, you have to remove all the spaces in your input first. In C#, that should be possible with yourString.Replace(" ", ""); (simply replacing all spaces with nothing = deleting spaces)
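For illustration, here is the same idea sketched in Python instead of C#: strip the spaces first, then require the whole remaining string to match. The function name is_phone_number is made up, and using fullmatch (rather than adding ^ and $ to the pattern) is an assumption of this sketch.

```python
import re

# The suggested pattern: optional +CC, optional area code (with or without "(0)"),
# then the remaining digits.
PHONE = re.compile(r"(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+")

def is_phone_number(raw):
    """Remove spaces, then check that the whole string matches the pattern."""
    compact = raw.replace(" ", "")
    return PHONE.fullmatch(compact) is not None

print(is_phone_number("+49 123 4567890"))
print(is_phone_number("+49 (0)123 4567890"))
print(is_phone_number("call me"))
```

Note that the pattern is quite permissive: any non-empty run of digits passes, so it checks the general shape rather than whether the number is real.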
The + after the character set is a quantifier: the preceding character, character set or group is repeated at least once, up to an unlimited number of times. It is also greedy, meaning it matches as much as possible.
Then [0-9().+\s]+ will match any character in the set one or more times.
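A short Python sketch of the quantified set, with made-up inputs, shows both the greedy behaviour and the limitation of a pure character-set approach:

```python
import re

# One or more characters from the set: digits, brackets, dot, plus, whitespace.
LOOSE = re.compile(r"[0-9().+\s]+")

print(bool(LOOSE.fullmatch("(012) 345.6789")))          # every character is in the set
print(repr(LOOSE.match("(012) 345 ext 9").group()))     # greedy: stops at the first letter
print(bool(LOOSE.fullmatch(")))")))                     # also accepted: no structure is checked
```

The set only restricts which characters may appear, not their order, so strings like the last one pass as well.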
I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçÇµ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !##$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçÇµ] is \u00B5-\u00FF and not \u0080-\u009F?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so. The "old" range (\u0080-\u009F, as in the 860 Portuguese code page) was just one of many possible mappings of the 128 extended characters available in ANSI, where you would sometimes find the same character at different code points depending on the code page.
C# strings are unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
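The difference between a code page byte and a Unicode code point can be checked directly. The sketch below uses Python instead of C#, relying on its built-in cp860 and cp863 codecs; it also shows that the \u00B5-\u00FF range is slightly wider than "letters".

```python
import re

ch = "é"
print(hex(ord(ch)))        # Unicode code point of the character
print(ch.encode("cp860"))  # byte value in the Portuguese DOS code page
print(ch.encode("cp863"))  # byte value in the French Canadian DOS code page

# The range from the question, written with Unicode escapes.
RANGE = re.compile(r"[a-zA-Z0-9\u00B5-\u00FF]")
print(bool(RANGE.fullmatch("ç")))
print(bool(RANGE.fullmatch("×")))  # U+00D7 MULTIPLICATION SIGN is inside the range too
```

The regex works on code points, not on code page bytes, which is why the \u0080-\u009F range does not catch these letters. If symbols such as × and ÷ should be rejected, listing the letters literally or using \p{L} is the tighter choice.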
I use this ^[a-zA-Z''-'\s]{1,40}$ regex for name validator according to MSDN.
Now I want add NON-English characters to this.
How I can do this?
To support all BMP and astral planes, you need both \p{L} (all letters) and \p{M} (all diacritics) Unicode category classes:
^[\p{L}\p{M}\s'-]{1,40}$
Note that \p{L} already includes [a-zA-Z], and all lower- and uppercase letters.
Or, since \s matches newlines (I doubt you really need newline symbols to match), you can use \p{Zs} - Unicode separator class (various kinds of spaces):
^[\p{L}\p{M}\p{Zs}'-]{1,40}$
Placing the hyphen at the end is just best practice, although it would be handled as a literal hyphen in your regex, too.
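For readers who want to experiment outside .NET: Python's built-in re module does not support \p{L}, \p{M} or \p{Zs}, so the sketch below emulates the second pattern with the unicodedata module. The function name is_valid_name is hypothetical, and this is an approximation of the regex, not a drop-in replacement.

```python
import unicodedata

def is_valid_name(text):
    """Emulate ^[\\p{L}\\p{M}\\p{Zs}'-]{1,40}$ character by character."""
    if not 1 <= len(text) <= 40:
        return False
    for ch in text:
        category = unicodedata.category(ch)
        # L* = letters, M* = combining marks, Zs = space separators
        if category[0] in "LM" or category == "Zs" or ch in "'-":
            continue
        return False
    return True

print(is_valid_name("José O'Neil-Smith"))
print(is_valid_name("नेहा"))        # Devanagari: the vowel signs are marks, not letters
print(is_valid_name("John3"))       # digits are rejected
print(is_valid_name("Anna\nLee"))   # a newline is not a Zs space
```

The Devanagari example is the kind of input that fails if only the letter category is allowed, which is why the mark category is included.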
You can try this:
^[\p{L}'\s-]{1,40}$
Note that \p{L} is a Unicode property class, and it matches every character that has the property "letter".
I need help with a regular expression.
I have to match a string like this:
âãa34dc
The pattern that I have used:
\s*[a-zA-Z]+[a-zA-Z_0-9]*\s
but this pattern is not good enough to identify this kind of string, e.g. âãa34dc
P.S. âã are Swedish characters.
Please help me find the correct pattern for this kind of string.
Do you actually want to restrict it to Swedish characters? In other words, should a German character not match? If so, then you'll probably have to enumerate the whole alphabet, and include that.
If what you really want is to match every alphabetic character, use the regular expression terms for matching all letters.
\w matches any word character, but that includes numbers & some punctuation. That's close, but not exactly what you want for your second term.
For the first term, where you don't want to include numbers, specifying that the character should be a Unicode 'letter' class will work. \p{L} specifies all Unicode characters that are a letter. This includes [a-zA-Z], and all the Swedish characters, and German, and Russian, etc.
Therefore, I think this regular expression is what you want:
\s*[\p{L}][\p{L}_0-9]*\s
If you want to include digits from other character sets, and some other punctuation, then you can use [\w]* for the second term.
Please give a set of rules.
According to your question:
[X-Ya-zA-Z]{3}[0-9]{2}[a-zA-Z]{2}
Replace X with the first swedish letter
Replace Y with the last swedish letter
John Machin provides a great answer for this. Adapting his pattern, what you need is probably something similar to: \s*[^\W\d_]\w*\s*
P.S. I removed the + quantifier from your first part. Any subsequent letters would be matched by the subsequent quantified \w.
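The [^\W\d_] trick is a Python idiom, so it can be tried directly in Python 3, where \w is Unicode-aware for str patterns. The sample strings below are made up, and fullmatch is used so the whole string has to fit.

```python
import re

# [^\W\d_] = "a word character that is neither a digit nor an underscore",
# i.e. a letter; \w* then allows letters, digits and underscores.
IDENT = re.compile(r"\s*[^\W\d_]\w*\s*")

for sample in ["âãa34dc", "  åäö_1 ", "34abc", "_abc"]:
    print(repr(sample), bool(IDENT.fullmatch(sample)))
```

One caveat: this relies on the accented letters being single precomposed code points, since a separate combining mark is not matched by \w in Python.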
What is the regex for an alphanumeric word that is at least 6 characters long (but at most 50)?
/[a-zA-Z0-9]{6,50}/
You can use word boundaries at the beginning/end (\b) if you want to actually match a word within text.
/\b[a-zA-Z0-9]{6,50}\b/
\b\w{6,50}\b
\w is any 'word' character - depending on regex flavour it might be just [a-zA-Z0-9_] or it might include others (e.g. accented chars/etc).
{6,50} means between 6 and 50 (inclusive)
\b means word boundary (ensuring the match is not just part of a longer word that exceeds 50 characters at either end).
After re-reading, it appears that what you want to do is ensure the entire text matches? If so...
^\w{6,50}$
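The difference between the unanchored, word-boundary and whole-text variants shows up clearly on a 51-character word. This is a Python sketch with made-up sample text; is_whole_word is a hypothetical helper that uses fullmatch in place of ^ and $.

```python
import re

BARE = re.compile(r"[a-zA-Z0-9]{6,50}")
WORD = re.compile(r"\b[a-zA-Z0-9]{6,50}\b")

def is_whole_word(text):
    """Return True if the entire text is 6 to 50 word characters."""
    return re.fullmatch(r"\w{6,50}", text) is not None

too_long = "x" * 51
text = "abc secret42 " + too_long + " token9"

print(BARE.findall(too_long) == ["x" * 50])  # without \b, a 50-char slice still matches
print(WORD.findall(text))                    # with \b, only complete words qualify
print(is_whole_word("secret42"))
print(is_whole_word("secret 42"))            # the space is not a word character
```

In Python, fullmatch is a slightly stricter choice than ^...$, because $ also matches just before a trailing newline.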
With PCRE regex you could do this:
/[a-zA-Z0-9]{6,50}/
It would be very hard to do in regex without the min/max quantifiers so hopefully your language supports them.