Domain Name Regex Including IDN Characters in C#

I want my domain name to not contain more than one consecutive (.), '/' or any other special characters. But it can contain IDN Characters such as Á, ś, etc... I can achieve all requirements (except IDN) by using this regex:
@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
The problem is that this regex rejects IDN characters too. I want a regex that allows IDN characters. I did a lot of research but I can't figure it out.

Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general category or named block specified by name.
Also, as a side note, I noticed your regex contains an unescaped dot (.). In regex, the dot character has the special meaning of "any character" (except newline, unless otherwise specified). You may need to change it to \. to ensure proper functionality.
Code
Editing your existing code to include Unicode character classes instead of simply the ASCII letters, you should attain the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*\.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
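This site gives a quick and general overview of the most used Unicode categories. Note that .NET accepts only the short names (for example \p{L} or \p{Lu}); the long names and \p{L&} listed below come from other regex flavors.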
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
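As a rough sketch (not a full IDN validator), the \p{L}/\p{N} pattern from this answer could be wrapped like this in C#. The class and method names are made up for illustration. The dot is escaped as suggested above, and the hyphen is moved to the end of the character class so that it is unambiguously a literal.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UnicodeDomainPattern
{
    // One or more "label." groups followed by a final label of 2+ characters.
    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$",
        RegexOptions.CultureInvariant);

    public static bool IsMatch(string candidate) => Pattern.IsMatch(candidate);
}
```

This only checks the character classes and the dot placement. It says nothing about whether the name is legal under the IDNA rules.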

This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, the ASCII form of an IDN label starts with xn--. This is how non-ASCII characters are enabled in domain names, e.g. 大众汽车.cn is a valid domain name (Volkswagen in Chinese). To validate this domain name using a regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车.cn) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA).
Libidn comes with a CLI tool called idn that converts a hostname in UTF-8 to its ACE encoding. The resulting string can then be used as the ACE-encoded equivalent of the UTF-8 hostname.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
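To try the expression against samples like the ones above from C#, a small wrapper along these lines should do. The class and method names are hypothetical; the pattern is copied verbatim from this answer and expects input that has already been converted to ACE.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AceDomainRegex
{
    // Pattern copied verbatim from the answer above.
    private static readonly Regex Pattern = new Regex(
        @"^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$");

    // Expects an ASCII (ACE) domain name such as xn--3oq18vl8pn36a.cn.
    public static bool IsValid(string aceDomain) => Pattern.IsMatch(aceDomain);
}
```

For example, AceDomainRegex.IsValid("stackoverflow.co.uk") is expected to succeed, while AceDomainRegex.IsValid("-a.dot") is expected to fail because of the leading hyphen.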
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
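If you would like to use Unicode character classes \p{} instead, you should use the following set as specified by the IDN FAQ. Note that it is written in UnicodeSet notation, so it cannot be pasted into a .NET regex as-is: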
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=T}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties

"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
To accept only functional domains, which resolve when probed through a DNS query
To accept strings that could potentially act as a domain name (get registered and subsequently resolved, or be used only for the sake of information)
Depending on the nature of the need, the ways in which the domain name can be validated differ a great deal.
Validating domain names purely against the technical specification, regardless of whether they resolve in the DNS, is a slightly more complex problem than merely writing a regex with a certain number of Unicode classes.
A set of RFCs (5891, 5892, 5893, 5894 and 5895) together defines the structure of a valid domain name (IDNs specifically, domain names in general). It involves not only various Unicode character classes, but also some context-specific rules which need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names as per the latest IDNA protocol, i.e. IDNA 2008.
C# provides the System.Globalization.IdnMapping class, which converts a domain name to its equivalent Punycode form. You can use this class to check whether the user-submitted domain complies with the IDNA norms. If it does not, the conversion throws an exception, and that is how the user submission gets validated.
If you are interested in digging deeper into the topic, please refer to the thoroughly researched documents produced by the Universal Acceptance Steering Group (https://uasg.tech/): "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those who want to understand the overall process, challenges and issues involved in implementing an internationalized email solution can go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email)."
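The System.Globalization.IdnMapping approach mentioned in this answer could look roughly like the sketch below. The wrapper class and method names are hypothetical. Setting UseStd3AsciiRules and clearing AllowUnassigned is an assumption about how strict you want to be, and which IDNA revision IdnMapping actually follows depends on the runtime and operating system, so treat the result as a first filter and not as proof of IDNA 2008 compliance.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class IdnValidator
{
    // Tries to convert a (possibly Unicode) domain name to its ASCII (ACE) form.
    // Returns false when IdnMapping rejects the input.
    public static bool TryToAscii(string domain, out string ascii)
    {
        ascii = string.Empty;
        if (string.IsNullOrWhiteSpace(domain))
        {
            return false;
        }

        // Assumption: strict settings (STD3 ASCII rules on, unassigned code points off).
        var mapping = new IdnMapping { UseStd3AsciiRules = true, AllowUnassigned = false };
        try
        {
            ascii = mapping.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii signals an invalid name with ArgumentException.
            return false;
        }
    }
}
```

A caller would typically keep the returned ASCII form for storage or DNS lookups and show the Unicode form to the user.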

Related

C# / .NET Core way to filter out non-Roman characters but allow all accents and diacritics of Roman letters in all languages that use them

I'm looking for an efficient way to validate website textbox and textarea input elements. The input is for human-readable text only, such as a name, address, comment, question, survey answer, etc. In addition, valid input should allow only the full variety of Roman/Latin characters, including those in the Latin1, Latin2, Latin3, and Latin4 character sets (see the Wikipedia article on the ISO-8859 parts). This is because our call center can only read Roman characters (no Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, Greek, etc.): when the language is not English they can at least use Google Translate, and when the text is used for an address it still makes sense on the address label or invoice.
Since it is web input, the UTF-8 characters transmitted via HTTP are converted by the C# system into Unicode (UTF-16) internally. I want a function returning a boolean that can say whether there is a non-Roman/Latin character in the string, but it should not be too stringent to disallow uncommon accented roman letter such as the French Œ, the German ẞ, the Irish Ṡ, the Finnish Ž, the Danish Ǿ, etc. (all those are not in Latin1, not to mention ASCII). Of course all punctuation marks should trigger a false; this should take care of HTML/JS/SQL injection issue. A second validator (not part of this question) will filter allowable punctuation mark like hyphen, period, apostrophe, etc.
I'm looking for ideas, not necessarily code. I have a feeling that there is a NuGet package out there or an already made function that uses .NET facility like System.Char.IsLetter and System.Globalization.UnicodeCategory enum.
The value of this question comes from other developers requiring the same kind of validation. Partial answers are welcome, and I will post the final solution on this question for everyone to use. (Let's see whether this question edit can redeem the current -2 vote for this question :-) )
EDIT:
Responding to negative comments below, I realize "non-Roman" is a little vague for computer geeks who like precision. But we are in the age of cloud where all people speaking all kinds of language are entering stuff into a web page. I want to restrict the input to all varieties of Roman / Latin characters. By "Roman" I mean anything derived from a,b,c,d,e,...x,y,z. Pretty common sense, don't you think? So I want to allow characters similar to those letters used by speakers of French, German, Danish, Norwegian, Bulgarians, etc. BUT excluding Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, Greek characters. Nothing wrong with them, but it's simply a business policy so the characters in the database are at least readable and sortable.
So I'm not looking for anything super precise here. A basic guideline is that the filter needs to include all letters defined in the Latin1, Latin2, Latin3, and Latin4 character sets, but it must detect them as Unicode (that is, by their Unicode code point values, not by their values in, say, the Latin3 character set). I think the criteria are specific enough.
You can try using regular expressions, which support named Unicode blocks.
Your regex may look something like
(\s|\p{IsBasicLatin}|\p{IsCombiningDiacriticalMarks})+
You could also have a broader range with exclusions. For example:
[\u0000-\u036F-[\p{P}\p{IsIPAExtensions}]]
Of course, you'd need to test and tweak the exact regex to allow/disallow punctuation and other character classes.
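For illustration, the second pattern could be anchored and wrapped as in the sketch below. The class and method names are made up. It relies on .NET character class subtraction ([base-[excluded]]), and the exact base range and exclusions are only the ones from this answer, not a vetted definition of "Roman" text.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LatinTextFilter
{
    // Everything from U+0000 to U+036F, minus punctuation (\p{P})
    // and minus the IPA Extensions block (U+0250..U+02AF).
    private static readonly Regex Allowed = new Regex(
        @"^[\u0000-\u036F-[\p{P}\p{IsIPAExtensions}]]+$");

    public static bool IsAllowed(string text) => Allowed.IsMatch(text);
}
```

Note that the base range still contains control characters and symbols such as < and >, so a real filter would need further exclusions.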
After reviewing tips from Sten, Scott Hannen and Prix, I decided to go with the following:
// Requires: using System.Text.RegularExpressions;
private static string AllowedCharacterRegexPattern = @"^([a-zA-Z0-9\(\)\+,\-\.'/#_@& ]|[\u00C0-\u024F]|[\u1E00-\u1EFF])+$";
public static bool AllowedCharacter(string s)
{
    // Decision: Characters to include:
    // Basic Latin: 0x0030-0x0039, 0x0041-0x005A, 0x0061-0x007A: 0-9, A-Z, a-z (https://unicode.org/charts/PDF/U0000.pdf)
    // Latin1: 0x00C0-0x00FF (https://unicode.org/charts/PDF/U0080.pdf)
    // Latin Extended-A: 0x0100-0x017F (https://unicode.org/charts/PDF/U0100.pdf)
    // Latin Extended-B: 0x0180-0x024F (https://unicode.org/charts/PDF/U0180.pdf)
    // Latin Extended Additional: 0x1E00-0x1EFF (https://unicode.org/charts/PDF/U1E00.pdf)
    // Some punctuation: ( ) + , - . ' / # _ @ &
    return Regex.IsMatch(s, AllowedCharacterRegexPattern);
}

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so. The "old" range (\u0080-\u009F, as in the Portuguese code page 860) was just one of many possible mappings of the 128 available extended characters in ANSI, where you would sometimes find the same character at different code points depending on the code page.
C# strings are Unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
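To make the difference between the fixed range and the "letter" class concrete, here is a small sketch. The class and method names are made up, and the sample characters are written as \u escapes so the code points are explicit.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LatinMaskDemo
{
    // The fixed range from the question: ASCII letters, digits, U+00B5..U+00FF.
    private static readonly Regex FixedRange = new Regex(@"^[a-zA-Z0-9\u00B5-\u00FF]+$");

    // The "letter" alternative: any Unicode letter plus ASCII digits.
    private static readonly Regex LetterClass = new Regex(@"^[\p{L}0-9]+$");

    public static bool MatchesFixedRange(string s) => FixedRange.IsMatch(s);

    public static bool MatchesLetterClass(string s) => LetterClass.IsMatch(s);
}
```

With this, "fa\u00E7ade" (façade) passes both checks. The multiplication sign U+00D7 passes the fixed range even though it is not a letter, and U+01F5 (ǵ, which appears in the question) passes only the \p{L} version because it lies outside \u00B5-\u00FF.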

Password Regex Validation: Preventing Spaces

Okay, so I'm trying to adhere to the following password rule:
Must be 6 to 15 characters, include at least one lowercase letter, one uppercase letter and at least one number. It should also contain no spaces.
Now, for everything but the spaces, I've got:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,15}$
Problem is, that allows spaces.
After looking around, I've tried using \s, but that messes up my lowercase and uppercase requirements. I also saw another suggestion to replace the * with a +, but that seemed to break the entire thing.
I've created a REFiddle if you want to have a live test.
To clarify, this is a client requirement unfortunately, I'm never usually this strict with passwords.
How about:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)\S{6,15}$
\S stands for any NON space character.
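A quick way to check the expression from C# (the class and method names are just for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PasswordRule
{
    // Three lookaheads require a lowercase letter, an uppercase letter and a digit;
    // \S{6,15} then requires 6 to 15 non-whitespace characters in total.
    private static readonly Regex Pattern = new Regex(
        @"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)\S{6,15}$");

    public static bool IsValid(string password) => Pattern.IsMatch(password);
}
```

Because \S{6,15} has to consume the whole input between ^ and $, a space anywhere in the password makes the match fail, while the lookaheads keep working exactly as before.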

UnicodeCategory.Otherletter block range for regex

I need to restrict a text field's length to a variable number of characters. I say variable because it needs to count CJK ideographs as 2 characters. For example, if I were restricting the length to 10, then I could have 10 Latin characters but only 5 ideographs, or 4 Latin characters and 3 CJK ideographs (4 + (3*2)).
I had this implemented well enough in c# by using:
if (char.GetUnicodeCategory(str, i) == UnicodeCategory.OtherLetter)
The thing is this was being checked on a form post, what I really want is to have a javascript implementation to check as the user is typing. I could use a regex to check each char but I cannot find out which unicode block ranges UnicodeCategory.OtherLetter uses.
This site seems really helpful for putting together the regex but I just need to know what I'm looking for to match the c# implementations behaviour.
C#
Firstly, if your goal is to count only the CJK ideographs as 2 characters, then the current C# code you have isn't quite right. The Unicode General Category OtherLetter is more or less intended for scripts that have no concept of letter case. This means that not only would CJK characters match, but so would Arabic, Hebrew, Khmer, Georgian, etc. In the Unicode data, the CJK characters are called the Han script.
Unfortunately, I could not find an easy solution within the .NET Framework to check for the script of a character. You can, however, use .NET Regex to match Unicode blocks. Just match the necessary CJK blocks in addition to the general category. Unfortunately, though Unicode tries to keep the blocks homogeneous, it makes no guarantee that errant characters from other scripts will not end up in the "wrong" blocks. I imagine this is unlikely with the CJK blocks though.
Also, a minor issue is that you might want to consider using System.Globalization.CharUnicodeInfo.GetUnicodeCategory(str, i) instead of char.GetUnicodeCategory(str, i). The CharUnicodeInfo version is meant to be up to date with the current version of Unicode, while the other may not be, for backwards compatibility reasons.
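A minimal sketch of the block-based idea in C# follows. The class and method names are hypothetical, and the choice of the three CJK blocks is an assumption about which characters should count double. The length is measured in UTF-16 code units, so it only covers ideographs in the BMP.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WeightedLength
{
    // Assumption: these three named blocks are the ones that should count as 2.
    private static readonly Regex CjkIdeograph = new Regex(
        @"[\p{IsCJKUnifiedIdeographs}\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKCompatibilityIdeographs}]");

    // Every UTF-16 code unit counts as 1; each matched ideograph adds 1 more.
    public static int Measure(string text) =>
        text.Length + CjkIdeograph.Matches(text).Count;
}
```

Unlike the UnicodeCategory.OtherLetter check, this counts Arabic or Hebrew letters as 1, which is closer to the stated goal of doubling only CJK ideographs.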
JavaScript
Unfortunately, JavaScript's Unicode support is not that good, especially when it comes to regexes. It has actually already been asked if there was a way to get the general category in JavaScript. It appears that there is not, but the answers there mention the XRegExp plugin, which can check for a character's general category, in addition to its script.
Mathias Bynens has a great article detailing JavaScript's current shortcomings with Unicode and improvements expected in the upcoming ECMAScript 6. He also provides links to polyfills for these improvements.
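While ECMAScript 6 provides much better support for astral characters, a quick glance at the current draft (Oct. 28, 2013, rev. 20) shows no sign of including support to match Unicode General Categories, blocks or scripts. (Unicode property escapes such as \p{L} and \p{Script=Han} were later standardized in ECMAScript 2018 for regexes that use the u flag.)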
Astral Characters
Astral characters are those found in planes beyond the Basic Multilingual Plane (BMP, Plane 0), that is, characters with values greater than 0xFFFF. Both C# and JavaScript use UTF-16 as their string encoding. This means that astral characters are actually formed with 2 code units (a surrogate pair) instead of the 1 used for BMP characters. My answer to a previous Unicode question goes into a little more detail about the encoding, but suffice it to say, this can wreak havoc. In particular, the string length of an astral character is 2, and regex engines have a hard time dealing with them.
Neither the C# blocks, nor the XRegExp solutions actually properly deal with astral characters. Many of the rarer CJK characters are located in the Supplementary Ideographic Plane (SIP, Plane 2). That said, "character" is an overloaded term, and has been used to mean "code unit", "code point", and "user-perceived character". For this answer, I've been using it to mean code point, but I can't tell which one you mean, so the best I can do is to make you aware of the issues of astral characters.
Note that though it hasn't yet been released, XRegExp's GitHub repository indicates that they have already implemented support for astral characters in the upcoming version 3.
Manually Matching
Given all the difficulties, it might just be best to use a regex to manually match all appropriate code points. The downfall of this of course is that it would have to be updated when new CJK characters are added to the standard. The code points for the CJK ideographs can be found in the Unicode script data by searching for the "Han" script and then taking the ranges indicated by Lo (Letter, other). The corresponding regex which should work (though not tested) in C# and JavaScript would be:
[\u3400-\u4DB5\u4E00-\u9FCC\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|[\uD86A-\uD86C][\uDC00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D]|\uD87E[\uDC00-\uDE1D]
Depending on your definition, the code points 3005, 3007, 3021-3029, 3038-303A, 303B may or may not be considered ideographs. They have the categories Lm and Nl for "Letter, modifier" and "Number, letter".

Regex to allow non-ascii and foreign letters?

Is it possible to create a regular expression to allow non-ascii letters along with Latin alphabets, for example Chinese or Greek symbols(eg. A汉语AbN漢語 allowed)?
I currently have the following ^[\w\d][\w\d_\-\.\s]*$ which only allows Latin alphabets.
In .NET,
^[\p{L}\d_][\p{L}\d_.\s-]*$
is equivalent to your regex, additionally allowing other Unicode letters.
Explanation:
\p{L} is a shorthand for the Unicode property "Letter".
Caveat: I think you wanted to not allow the underscore as initial character (evidenced by its presence only in the second character class). Since \w includes the underscore, your regex did allow it, though. You might want to remove it from the first character class in my solution (it's not included in \p{L}, of course).
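For reference, a small C# wrapper around the .NET pattern above (the class and method names are made up for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UnicodeNamePattern
{
    // First character: any Unicode letter, a digit or an underscore.
    // Remaining characters: the same, plus dot, whitespace and hyphen.
    private static readonly Regex Pattern = new Regex(@"^[\p{L}\d_][\p{L}\d_.\s-]*$");

    public static bool IsMatch(string input) => Pattern.IsMatch(input);
}
```

The sample string from the question, A汉语AbN漢語, matches, while a string that starts with a hyphen or a dot does not.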
In ECMAScript, things are not so easy. You would have to define your own Unicode character ranges. Fortunately, a fellow StackOverflow user has already risen to the occasion and designed a JavaScript regex converter:
https://stackoverflow.com/a/8933546/20670
