Regex to allow non-ascii and foreign letters? - c#

Is it possible to create a regular expression that allows non-ASCII letters along with Latin letters, for example Chinese or Greek characters (e.g. A汉语AbN漢語 should be allowed)?
I currently have the following: ^[\w\d][\w\d_\-\.\s]*$, which only allows Latin letters.

In .NET,
^[\p{L}\d_][\p{L}\d_.\s-]*$
is equivalent to your regex, additionally allowing other Unicode letters.
Explanation:
\p{L} is a shorthand for the Unicode property "Letter".
Caveat: I think you wanted to not allow the underscore as initial character (evidenced by its presence only in the second character class). Since \w includes the underscore, your regex did allow it, though. You might want to remove it from the first character class in my solution (it's not included in \p{L}, of course).
In ECMAScript, things are not so easy. You would have to define your own Unicode character ranges. Fortunately, a fellow StackOverflow user has already risen to the occasion and designed a JavaScript regex converter:
https://stackoverflow.com/a/8933546/20670
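As a minimal C# sketch of the .NET pattern above (the class and method names are made up for illustration), the check could be wrapped like this:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; only the pattern itself comes from the answer.
public static class NameCheck
{
    // First character: a letter, digit or underscore.
    // Remaining characters: letters, digits, '_', '.', whitespace or '-'.
    private static readonly Regex Allowed =
        new Regex(@"^[\p{L}\d_][\p{L}\d_.\s-]*$");

    public static bool IsAllowed(string input) => Allowed.IsMatch(input);
}
```

With this sketch, NameCheck.IsAllowed("A汉语AbN漢語") accepts the sample from the question, while a string starting with a hyphen is rejected.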

Related

Domain Name Regex Including IDN Characters c#

I want my domain name to not contain more than one consecutive (.), '/' or any other special characters. But it can contain IDN Characters such as Á, ś, etc... I can achieve all requirements (except IDN) by using this regex:
@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
The problem is that this regex rejects IDN characters too. I want a regex that allows IDN characters. I did a lot of research but I can't figure it out.
Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general
category or named block specified by name.
Also, as a sidenote, I noticed your regex contains an unescaped dot (.). In regex the dot character has the special meaning of "any character" (except newline unless otherwise specified). You may need to change this to \. to ensure proper functionality.
Code
Editing your existing code to include Unicode character classes instead of simply the ASCII letters, you should attain the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
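A minimal C# sketch of the pattern above, with the dot escaped and the hyphen placed last in the character class so that it is unambiguously literal. The class and method names are hypothetical, and this only checks the shape of the name, not full IDN validity:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper built around the \p{L}/\p{N} version of the pattern.
public static class UnicodeDomainPattern
{
    // One or more dot-terminated labels, then a final label of 2+ characters.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$");

    public static bool LooksLikeDomain(string name) => Pattern.IsMatch(name);
}
```

Under these assumptions, names such as "śródmieście.pl" or "example.com" pass, while "exa..mple.com" and "exa/mple.com" are rejected.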
This site gives a quick and general overview of the most used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
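To see how a few of these categories apply to concrete characters, here is a small C# sketch. The helper names are invented for illustration. As far as I know, .NET regexes accept only the short category names (\p{Lu}, not \p{Uppercase_Letter}), so the sketch sticks to those:

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helpers for inspecting Unicode general categories.
public static class CategoryDemo
{
    // Returns the general category .NET assigns to a single UTF-16 char.
    public static UnicodeCategory CategoryOf(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c);

    // True when every character of the text is in the category given by
    // its short name, e.g. "Lu", "Nd" or "L".
    public static bool AllIn(string text, string shortName) =>
        Regex.IsMatch(text, @"^\p{" + shortName + @"}+$");
}
```

For example, CategoryOf('A') is UppercaseLetter (Lu), CategoryOf('汉') is OtherLetter (Lo), and CategoryOf('_') is ConnectorPunctuation (Pc).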
This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, IDN domain names start with xn--. This way extended UTF-8 characters are enabled in domain names, e.g. 大众汽车.cn is a valid domain name (volkswagen in Chinese). To validate this domain name using regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA).
Libidn comes with a CLI tool called idn that allows you to convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as ACE-encoded equivalent of UTF-8 URL.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
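The pattern can be exercised against a few of the samples above with a short C# sketch. The class and method names are hypothetical, and the input is assumed to be already in ASCII (ACE) form:

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the suggested pattern for ACE-encoded names.
public static class AceDomainCheck
{
    private static readonly Regex Pattern = new Regex(
        @"^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\." +
        @"(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$");

    public static bool IsValid(string asciiName) => Pattern.IsMatch(asciiName);
}
```

With this wrapper, stackoverflow.co.uk, a.net and xn--3oq18vl8pn36a.cn are accepted, while -0-0O.COM and 0-0O.-COM are rejected because of the (?!-) lookaheads.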
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
If you would like to use Unicode character classes \p{} instead, you should use the following as specified by the IDN FAQ:
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=T}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties
"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
* To accept only functional domains that resolve when probed through a DNS query
* To accept strings that can potentially act as a domain name (get registered and subsequently resolved, or serve only as information)
Depending on the nature of the need, the ways in which the domain name can be validated differ a great deal.
Validating domain names purely against the technical specification, regardless of whether they resolve in the DNS, is a slightly more complex problem than merely writing a regex with a certain number of Unicode classes.
A set of RFCs (5891, 5892, 5893, 5894 and 5895) together define the structure of a valid domain name (IDN in particular, domains in general). It involves not only various Unicode character classes but also some context-specific rules that need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names as per the latest IDNA protocol, i.e. IDNA 2008.
C# provides a class, System.Globalization.IdnMapping, that converts a domain name to its equivalent Punycode version. You can use it to check whether a user-submitted domain complies with the IDNA rules: if it does not, the conversion throws an exception, which tells you the submission is invalid.
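A minimal sketch of that approach follows. The IdnCheck class and the TryToAscii method are names invented here. Enabling UseStd3AsciiRules is an assumption about how strict the check should be, and the exact rules applied may vary with the .NET version and platform:

```csharp
using System;
using System.Globalization;

// Hypothetical helper: validates a domain by attempting the Punycode conversion.
public static class IdnCheck
{
    public static bool TryToAscii(string domain, out string ascii)
    {
        // Assumption: stricter STD3 rules are wanted for user input.
        var mapping = new IdnMapping { UseStd3AsciiRules = true };
        try
        {
            ascii = mapping.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii reports names it cannot convert via ArgumentException.
            ascii = string.Empty;
            return false;
        }
    }
}
```

Calling IdnCheck.TryToAscii("大众汽车.cn", out var ascii) should succeed and leave an xn-- name in ascii, whereas a name with an over-long label should be reported as invalid.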
If you are interested in digging deeper into the topic, please refer to the thoroughly researched documents produced by the Universal Acceptance Steering Group (https://uasg.tech/): "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those who want to understand the overall process, challenges and issues that can arise while implementing internationalized email can go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email)."

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so. The "old" range (\u0080-\u009F, as in the 860 Portuguese code page) was just one of many possible mappings of the available 128 extended characters in ANSI, where you would sometimes find the same character at different code points depending on the code page.
C# strings are Unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
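The difference between the explicit range and the "letter" category can be sketched in C# as follows. The EditMask class and its methods are hypothetical, and adding \p{Nd} for digits is an assumption based on the question allowing digits:

```csharp
using System.Text.RegularExpressions;

// Hypothetical filter helpers for an edit-control mask.
public static class EditMask
{
    // Keeps ASCII letters, digits and the Latin-1 range U+00B5..U+00FF.
    public static string FilterByRange(string input) =>
        Regex.Replace(input, @"[^a-zA-Z0-9\u00B5-\u00FF]", "");

    // Keeps any Unicode letter or decimal digit instead.
    public static string FilterByCategory(string input) =>
        Regex.Replace(input, @"[^\p{L}\p{Nd}]", "");
}
```

One design point worth noting: the range U+00B5..U+00FF also contains a few non-letters, such as the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7), so FilterByRange keeps them while FilterByCategory drops them.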

C# regex for English char and non-English

I use this ^[a-zA-Z''-'\s]{1,40}$ regex for name validator according to MSDN.
Now I want to add non-English characters to this.
How can I do this?
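To accept letters from any script, you need both \p{L} (all letters) and \p{M} (all diacritics) Unicode category classes. (.NET regexes work on UTF-16 code units, so characters outside the BMP are seen as surrogate pairs, which \p{L} does not match.)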
^[\p{L}\p{M}\s'-]{1,40}$
Note that \p{L} already includes [a-zA-Z], and all lower- and uppercase letters.
Or, since \s matches newlines (I doubt you really need newline symbols to match), you can use \p{Zs}, the Unicode space separator category (various kinds of spaces):
^[\p{L}\p{M}\p{Zs}'-]{1,40}$
Placing the hyphen at the end is just best practice, although it would be handled as a literal hyphen in your regex, too.
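A short C# sketch of the \p{Zs} variant (the PersonName class is a made-up name for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical name validator using the \p{Zs} variant of the pattern.
public static class PersonName
{
    // 1 to 40 characters: letters, combining marks, spaces, apostrophe, hyphen.
    private static readonly Regex Pattern =
        new Regex(@"^[\p{L}\p{M}\p{Zs}'-]{1,40}$");

    public static bool IsValid(string name) => Pattern.IsMatch(name);
}
```

This accepts "José María" and "O'Brien-Smith", including names written with combining accents, and rejects "R2D2" or a name with a line break in the middle. Note that in .NET $ also matches just before a final newline, so \z may be preferable if a trailing newline must be rejected.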
You can try this:
^[\p{L}'\s-]{1,40}$
Note that \p{L} is Unicode property and it matches everything that has the property letter.

How to make a regular expression for combining characters?

I am working on an application in which I need a regular expression to detect combining characters. I have made the following regex:
string regex = @"^([~.][a-z])";
I have to detect combining characters that are separated from their base character because they do not exist in the font, so I have to check two characters: one is the symbol and the other is any character, e.g. ~a.
The problem is that I am not able to paste the exact shape of the symbols. I am using this link:
http://en.wikipedia.org/wiki/Combining_character
When I paste them into the regex, their shape changes.
How do I make a regex that detects the specific combining characters provided in the regex?
Use Unicode properties:
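\p{L}\p{M}*
(The possessive form \p{L}\p{M}*+ works in engines such as PCRE or Java, but .NET does not support possessive quantifiers.)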
\p{L} any kind of letter from any language (but not combined ones!)
\p{M} a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
See regular-expressions.info/unicode for more details (chapter Unicode Categories)
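Since pasting the marks themselves is error-prone, one option is to write them as \u escapes. The following C# sketch assumes the marks of interest are the combining tilde (U+0303) and the combining dot above (U+0307). The class and method names are hypothetical:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helpers for spotting base letters followed by combining marks.
public static class CombiningMarks
{
    // Any letter followed by one or more combining marks of any kind.
    private static readonly Regex LetterWithMarks = new Regex(@"\p{L}\p{M}+");

    // A letter followed by one specific mark, given by code point:
    // U+0303 COMBINING TILDE or U+0307 COMBINING DOT ABOVE.
    private static readonly Regex LetterWithTildeOrDot =
        new Regex(@"\p{L}[\u0303\u0307]");

    public static bool HasCombiningSequence(string text) =>
        LetterWithMarks.IsMatch(text);

    public static bool HasTildeOrDot(string text) =>
        LetterWithTildeOrDot.IsMatch(text);
}
```

Here "a\u0303" (a followed by a combining tilde) is detected, while the precomposed letter "\u00E3" is a single \p{L} character and is not.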

Pattern matching for swedish character

I need help with a regular expression.
I have to match a string like this:
âãa34dc
The pattern that I have used:
\s*[a-zA-Z]+[a-zA-Z_0-9]*\s
but this pattern is not good enough to identify this kind of string, e.g. âãa34dc
P.S. âã are Swedish characters.
Please help me find the correct pattern for this kind of string.
Do you actually want to restrict it to Swedish characters? In other words, should a German character not match? If so, then you'll probably have to enumerate the whole alphabet, and include that.
If what you really want is to match every alphabetic character, use the regular expression terms for matching all letters.
\w matches any word character, but that includes numbers & some punctuation. That's close, but not exactly what you want for your second term.
For the first term, where you don't want to include numbers, specifying that the character should be a Unicode 'letter' class will work. \p{L} specifies all Unicode characters that are a letter. This includes [a-zA-Z], and all the Swedish characters, and German, and Russian, etc.
Therefore, I think this regular expression is what you want:
\s*[\p{L}][\p{L}_0-9]*\s
If you want to include digits from other character sets, and some other punctuation, then you can use [\w]* for the second term.
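A minimal C# sketch of this idea. The IdentifierCheck class is a made-up name, and anchoring with ^ and $ plus making the trailing whitespace optional are assumptions added here so that a whole string can be tested:

```csharp
using System.Text.RegularExpressions;

// Hypothetical whole-string check based on the \p{L} pattern above.
public static class IdentifierCheck
{
    // Optional surrounding whitespace, a letter first,
    // then letters, underscores or ASCII digits.
    private static readonly Regex Pattern =
        new Regex(@"^\s*\p{L}[\p{L}_0-9]*\s*$");

    public static bool IsIdentifier(string text) => Pattern.IsMatch(text);
}
```

With this sketch, IdentifierCheck.IsIdentifier("âãa34dc") is accepted, while a string that starts with a digit, such as "34abc", is rejected.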
Please give a set of rules.
According to your question:
[X-Ya-zA-Z]{3}[0-9]{2}[a-zA-Z]{2}
Replace X with the first Swedish letter.
Replace Y with the last Swedish letter.
John Machin provides a great answer for this. Adapting his pattern, what you need is probably something similar to: \s*[^\W\d_]\w*\s*
P.S. I removed the + quantifier from your first part. Any subsequent letters would be matched by the subsequent quantified \w.
