I need to restrict a text field's length to a variable number of characters. I say variable because it needs to count CJK ideographs as 2 characters. For example, if I were restricting the length to 10, then I could have 10 Latin characters but only 5 ideographs, or 4 Latin and 3 CJK ideographs (4 + (3*2)).
I had this implemented well enough in C# by using:
if (char.GetUnicodeCategory(str, i) == UnicodeCategory.OtherLetter)
The thing is, this was being checked on a form post; what I really want is a JavaScript implementation that checks as the user is typing. I could use a regex to check each character, but I cannot find out which Unicode block ranges UnicodeCategory.OtherLetter covers.
This site seems really helpful for putting together the regex, but I just need to know what I'm looking for to match the C# implementation's behaviour.
C#
Firstly, if your goal is to count only the CJK ideographs as 2 characters, then the current C# code you have isn't quite right. The Unicode general category OtherLetter is more or less intended for scripts that have no concept of letter case, so not only would CJK characters match, but so would Arabic, Hebrew, Khmer, Georgian, etc. In the Unicode data, the CJK ideographs belong to the Han script.
Unfortunately, I could not find an easy way within the .NET Framework to check the script of a character. You can, however, use .NET regex to match Unicode blocks: just match the necessary CJK blocks in addition to the general category. Note that although Unicode tries to keep the blocks homogeneous, it makes no guarantee that stray characters from other scripts won't end up in the "wrong" blocks. I imagine this is unlikely with the CJK blocks, though.
Also, a minor point: consider using System.Globalization.CharUnicodeInfo.GetUnicodeCategory(str, i) instead of char.GetUnicodeCategory(str, i). The CharUnicodeInfo version is meant to be kept up to date with the current version of Unicode, while the char version may lag behind for backwards-compatibility reasons.
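As an illustration, here is a minimal sketch of the weighted count in C# (the method name is mine; it simply treats every Lo code point as 2, per your current approach):

using System.Globalization;

static int WeightedLength(string str)
{
    int length = 0;
    for (int i = 0; i < str.Length; i++)
    {
        // Skip the trailing half of a surrogate pair; the pair is handled
        // at its leading code unit below.
        if (char.IsLowSurrogate(str, i))
            continue;

        // GetUnicodeCategory(string, int) evaluates the full code point,
        // astral ones included, when given the index of the lead surrogate.
        length += CharUnicodeInfo.GetUnicodeCategory(str, i) == UnicodeCategory.OtherLetter ? 2 : 1;
    }
    return length;
}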
JavaScript
Unfortunately, JavaScript's Unicode support is not that good, especially when it comes to regexes. It has already been asked whether there is a way to get the general category in JavaScript. It appears there is not, but the answers there mention the XRegExp plugin, which can check a character's general category as well as its script.
Mathias Bynens has a great article detailing JavaScript's current shortcomings with Unicode and improvements expected in the upcoming ECMAScript 6. He also provides links to polyfills for these improvements.
While ECMAScript 6 provides much better support for astral characters, a quick glance at the current draft (Oct. 28, 2013, rev. 20) shows no sign of support for matching Unicode general categories, blocks or scripts.
Astral Characters
Astral characters are those found in planes beyond the Basic Multilingual Plane (BMP, Plane 0), that is, characters with code points greater than 0xFFFF. Both C# and JavaScript use UTF-16 as their string encoding, which means an astral character is formed from 2 code units instead of the 1 used in the BMP. My answer to a previous Unicode question goes into a little more detail about the encoding, but suffice it to say, this can wreak havoc: in particular, the string length of an astral character is 2, and regex engines have a hard time dealing with them.
Neither the C# blocks nor the XRegExp solutions properly deal with astral characters. Many of the rarer CJK characters are located in the Supplementary Ideographic Plane (SIP, Plane 2). That said, "character" is an overloaded term, and has been used to mean "code unit", "code point", and "user-perceived character". For this answer, I've been using it to mean code point, but I can't tell which one you mean, so the best I can do is to make you aware of the issues of astral characters.
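To make the code-unit/code-point distinction concrete, here is a small C# illustration (the sample character, U+20000 from the SIP, is my choice):

var s = "\U00020000";                         // CJK UNIFIED IDEOGRAPH-20000
Console.WriteLine(s.Length);                  // 2 -- UTF-16 code units
Console.WriteLine(char.ConvertToUtf32(s, 0)); // 131072 (0x20000) -- one code point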
Note that though it hasn't yet been released, XRegExp's GitHub repository indicates that they have already implemented support for astral characters in the upcoming version 3.
Manually Matching
Given all these difficulties, it might be best to use a regex that manually matches all the appropriate code points. The downside, of course, is that it has to be updated whenever new CJK characters are added to the standard. The code points for the CJK ideographs can be found in the Unicode script data by searching for the "Han" script and taking the ranges marked Lo (Letter, other). The corresponding regex, which should work (though it is untested) in both C# and JavaScript, is:
[\u3400-\u4DB5\u4E00-\u9FCC\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|[\uD86A-\uD86C][\uDC00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D]|\uD87E[\uDC00-\uDE1D]
Depending on your definition, the code points U+3005, U+3007, U+3021-3029, U+3038-303A and U+303B may or may not be considered ideographs. They have the categories Lm and Nl, for "Letter, modifier" and "Number, letter".
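As a rough sketch (untested, like the pattern itself) of how the weighted count could be computed with that regex in C#; CjkPattern here stands for the full pattern above:

using System.Text.RegularExpressions;

static readonly Regex Cjk = new Regex(CjkPattern); // CjkPattern = the full regex above

static int WeightedLength(string str)
{
    int ideographs = 0, unitsInIdeographs = 0;
    foreach (Match m in Cjk.Matches(str))
    {
        ideographs++;
        unitsInIdeographs += m.Length; // 1 for a BMP match, 2 for a surrogate pair
    }
    // Every remaining UTF-16 code unit counts as 1; every ideograph counts as 2.
    // (Caveat: a non-CJK astral character would still count as 2 here.)
    return (str.Length - unitsInIdeographs) + 2 * ideographs;
}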
Related
When receiving an RTL string from a MySQL server that ends in a direction-agnostic character, the first char (string[0]) of the string switches to be the ending char, as in the following example (which will hopefully render in the correct order here):
String str = "קוד (לדוגמה)";
Char a = str[0];
Char b = str[1];
In this example, a=( and b=ק, which is incorrect. a should = ק and b should = ו.
Using substring for character extraction yields the same result.
After further examination, I've learned that RTL strings are kept in logical (LTR) order behind the scenes in most programming languages. Using the Unicode RTL mark did not change the outcome.
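A minimal diagnostic sketch in C# that prints each stored code unit; this shows the logical order directly, regardless of how the bidi algorithm reorders the string for display:

string str = "קוד (לדוגמה)";
for (int i = 0; i < str.Length; i++)
{
    // Printing the numeric code point sidesteps bidi reordering in the output.
    Console.WriteLine($"index {i}: U+{(int)str[i]:X4}");
}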
This presents a unique problem for us: our ETL process requires iterating through all chars (not searching, since it appears regex can handle that use case), and we can't differentiate whether the 1st char was indeed a bracket or other symbol, or whether it was the ending character.
Any ideas on how to solve this problem would be appreciated, as we couldn't find an answer relevant to our case thus far.
Edit:
It appears the example code has the same problem we encounter when displayed in certain browsers: the brackets are actually at the end of the string.
Correct order: https://files.logoscdn.com/v1/files/35323612/content.png?signature=pvAgUwSaLB8WGf8u868Cv1eOqiM
The bug, which also happens with Stack Overflow's display in some browsers: https://files.logoscdn.com/v1/files/35323580/content.png?signature=LNasMBU9NWEi_x3BeVSLG9FU5co
2nd edit:
After examining the MySQL binaries, it appears the string in MySQL starts with the bracket. However, I am unsure whether this is the proper way it should be stored, as every display we use (including but not limited to Visual Studio) shows it properly, and other than char manipulation, the string acts as if the brackets are at the end.
So to phrase the question better: how do all these systems, including MySQL Workbench (which AFAIK is written in C#), know whether to put the bracket at the beginning or the end?
After a lot of checking, it appears a common convention when using Unicode is to store the last character first, and vice versa, when it is an LTR/direction-neutral character in an RTL string.
The convention seems to differ a bit between text parsers, as is evident between browsers. However, the 1st char IS indeed the bracket in our case; and when it is the first character, it will end up being the LAST character.
I recommend just checking the handling of your own specific storage, parsers and libraries.
I'm looking for an efficient way to validate a website's textbox and textarea input elements. The input is for human-readable text only: name, address, comments, questions, survey answers, etc. In addition, valid input should allow only the full variety of Roman/Latin characters, including those in the Latin1, Latin2, Latin3, and Latin4 character sets (see the Wikipedia article on the parts of ISO-8859). This is because our call center can only read Roman characters (no Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, Greek, etc.): when the language is not English they can use Google Translate, and when the text input is used for an address, it can still make sense on the address label or invoice.
Since it is web input, the UTF-8 characters transmitted via HTTP are converted by the C# system into Unicode (UTF-16) internally. I want a function returning a boolean that says whether there is a non-Roman/Latin character in the string, but it should not be so stringent as to disallow uncommon accented Roman letters such as the French Œ, the German ẞ, the Irish Ṡ, the Finnish Ž, the Danish Ǿ, etc. (none of which are in Latin1, not to mention ASCII). Of course, all punctuation marks should trigger a false; this should take care of HTML/JS/SQL injection issues. A second validator (not part of this question) will filter allowable punctuation marks like hyphen, period, apostrophe, etc.
I'm looking for ideas, not necessarily code. I have a feeling there is a NuGet package out there, or an already-made function that uses .NET facilities like System.Char.IsLetter and the System.Globalization.UnicodeCategory enum.
The value of this question comes from other developers requiring the same kind of validation. Partial answers are welcome, and I will post the final solution on this question for everyone to use. (Let's see whether this edit can redeem the question's current -2 vote :-) )
EDIT:
Responding to negative comments below, I realize "non-Roman" is a little vague for computer geeks who like precision. But we are in the age of the cloud, where people speaking all kinds of languages are entering stuff into a web page. I want to restrict the input to all varieties of Roman/Latin characters. By "Roman" I mean anything derived from a, b, c, d, e, ... x, y, z. Pretty common sense, don't you think? So I want to allow characters similar to those letters as used by speakers of French, German, Danish, Norwegian, Bulgarian, etc., BUT excluding Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, and Greek characters. Nothing wrong with them; it's simply a business policy so the characters in the database are at least readable and sortable.
So I'm not looking for anything super precise here; a basic guideline is that it needs to include all letters defined in the Latin1, Latin2, Latin3, and Latin4 character sets, but I require the filter to detect them as Unicode (i.e., by the numerical value of the Unicode character, not the Latin3 encoding). I think the criteria are specific enough.
You can try using regular expressions, which support named Unicode blocks.
Your regex may look something like
(\s|\p{IsBasicLatin}|\p{IsCombiningDiacriticalMarks})+
You could also have a broader range with exclusions. For example:
[\u0000-\u036F-[\p{P}\p{IsIPAExtensions}]]
Of course, you'd need to test and tweak the exact regex to allow/disallow punctuation and other character classes.
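For instance, a quick sketch of how the first pattern might be tried in C# (input stands for the posted text; I've anchored the pattern so the whole string must match):

using System.Text.RegularExpressions;

bool ok = Regex.IsMatch(input,
    @"^(\s|\p{IsBasicLatin}|\p{IsCombiningDiacriticalMarks})+$");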
After reviewing tips from Sten, Scott Hannen and Prix, I decided to go with the following:
private static string AllowedCharacterRegexPattern = @"^([a-zA-Z0-9\(\)\+,\-\.'/#_#& ]|[\u00C0-\u024F]|[\u1E00-\u1EFF])+$";
public static bool AllowedCharacter(string s)
{
// Decision: Characters to include:
// Basic Latin: 0x0030-0x0039, 0x0041-0x005A, 0x0061-0x007A: 0-9, A-Z, a-z (https://unicode.org/charts/PDF/U0000.pdf)
// Latin-1 Supplement: 0x00C0-0x00FF (https://unicode.org/charts/PDF/U0080.pdf)
// Latin Extended-A: 0x0100-0x017F (https://unicode.org/charts/PDF/U0100.pdf)
// Latin Extended-B: 0x0180-0x024F (https://unicode.org/charts/PDF/U0180.pdf)
// Latin Extended Additional: 0x1E00-0x1EFF (https://unicode.org/charts/PDF/U1E00.pdf)
// Some punctuation: ( ) + , - . ' / # _ # &
return Regex.IsMatch(s, AllowedCharacterRegexPattern);
}
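A couple of illustrative calls (expected results assume the pattern above):

Console.WriteLine(AllowedCharacter("Cæsar-Œuvre")); // True: Latin-1 Supplement and Latin Extended-A letters
Console.WriteLine(AllowedCharacter("大众汽车"));     // False: Han ideographs are outside the allowed ranges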
I want to apply superscript to a String for display.
It works fine with superscript numbers, but doesn't work for letter characters.
Suggestions needed.
Works fine for:
var o2 = "O₂"; // or "O\x2082"
var unit2 = "unit²"; // or "unit\xB2"
Does not work for:
var xyz = "ABC365\xBTM"
Cannot get a superscript TM after the string ABC365.
Suggestions appreciated.
You seem to have completely misunderstood what is going on here, so I'll try a very basic explanation.
Unicode defines a large number of characters (the code space has 1,114,112 code points). These came from a large number of historic sources, and there's no great rhyme or reason about which characters made it in and which didn't. The available characters include some subscript and superscript digits: for example, x2082 is subscript 2, and x00B2 is superscript 2. They also include some special symbols, such as the trademark sign x2122, which are traditionally rendered with a superscript appearance.
But there's no general mechanism in Unicode to render an arbitrary character in superscript or subscript rendition. If you want to write X with a subscript n, Unicode won't help you: to achieve that, I had to resort to mechanisms outside Unicode, specifically HTML tagging. HTML allows you to render anything as subscript or superscript; Unicode only handles a few select cases.
C# recognizes the escape sequences \x followed by one to four hex digits and \u followed by exactly four hex digits, representing a character by its Unicode code point value. So since there is a code point x2082 meaning subscript 2, you can write it as \u2082 (or \x2082) in a string literal. But there's no code point for subscript-lowercase-italic n, so there's no way of representing that.
Now when you write \xBTM, it should be clear that's not what you want: \x grabs as many hex digits as follow it (here just the B), so you get the control character U+000B followed by the literal letters "T" and "M". If you want the trademark symbol, you can use \u2122. If you want the two characters "T" and "M" in superscript rendition, you're out of luck; if you need to pass that sort of thing around in your application, you will need to pass strings containing HTML markup rather than just plain Unicode.
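Assuming the goal was the string "ABC365™", the literal would be written like this:

var xyz = "ABC365\u2122"; // \u takes exactly four hex digits, so it cannot swallow the following letters the way \x can
Console.WriteLine(xyz);   // ABC365™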
You indicate that you're trying to create strings that will be used as input to an XSLT transformation. My suggestion would be to pass XML documents rather than plain strings, but I would need to understand the requirement in better detail before saying that's definitively the right solution.
I want my domain name to contain no more than one consecutive (.), '/' or any other special character, but it can contain IDN characters such as Á, ś, etc. I can achieve all requirements (except IDN) by using this regex:
#"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
The problem is that this regex denies IDN characters too. I want a regex which will allow IDN characters. I did a lot of research but I can't figure it out.
Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general
category or named block specified by name.
Also, as a side note, I noticed your regex contains an unescaped dot. In regex, the dot character . has the special meaning of "any character" (except newline, unless otherwise specified). You may need to change it to \. to ensure proper functionality.
Code
Editing your existing code to use Unicode character classes instead of just the ASCII letters gives the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*\.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
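A quick usage sketch in C# (dot escaped as advised; I've also moved the hyphen to the end of the character class to avoid any range ambiguity):

using System.Text.RegularExpressions;

bool ok = Regex.IsMatch("exämple.com",
    @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$"); // True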
This site gives a quick and general overview of the most used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, IDN domain names are encoded in ASCII with the xn-- prefix. This is how non-ASCII characters are enabled in domain names; e.g. 大众汽车.cn is a valid domain name (Volkswagen in Chinese). To validate this domain name using a regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车.cn) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA).
Libidn comes with a CLI tool called idn that allows you to convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as ACE-encoded equivalent of UTF-8 URL.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
If you would like to use Unicode character classes \p{} instead, you should use the following as specified by the IDN FAQ:
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=T}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties
"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
To accept only the functional domains which resolve when probed through a DNS query
To accept the strings which can potentially act (get registered and subsequently resolved, or only for the sake of information) as domain name
Depending on the nature of the need, the ways in which the domain name can be validated, differs a great deal.
For validating the domain names, only from pure technical specification point of view, regardless of it's resolvability vis-a-vis the DNS, is a slightly more complex problem than merely writing a Regex with certain number of Unicode classes.
There is a host of RFCs (5891, 5892, 5893, 5894 and 5895) that together define the structure of a valid domain name (IDN in particular, domain in general). It involves not only various Unicode character classes but also some context-specific rules that need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names as per the latest IDNA protocol, i.e. IDNA 2008.
C# provides System.Globalization.IdnMapping, a class that converts a domain name to its equivalent Punycode version. You can use it to check whether a user-submitted domain complies with the IDNA norms: if it does not, the conversion throws an exception, thereby validating the user submission.
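A minimal sketch of that check (the method name is mine):

using System;
using System.Globalization;

static bool IsIdnaCompliant(string domain)
{
    try
    {
        // GetAscii maps the name to its Punycode (ACE) form and throws
        // ArgumentException for names that violate the IDNA rules.
        new IdnMapping().GetAscii(domain); // e.g. "大众汽车.cn" -> "xn--3oq18vl8pn36a.cn"
        return true;
    }
    catch (ArgumentException)
    {
        return false;
    }
}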
If one is interested in digging deeper into the topic, please do refer to the thoroughly researched documents produced by the Universal Acceptance Steering Group (https://uasg.tech/): "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those interested in the overall process, challenges and issues one may come across while implementing an internationalized email solution can go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), and RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email).
I'm wondering why this is. I have two Unicode characters from the same category, Ll, which is allowed according to the spec: http://msdn.microsoft.com/en-us/library/aa664670%28VS.71%29.aspx
One of them works; the other gives a compile error, and I can't find any documentation on why:
This works:
U+0467 CYRILLIC SMALL LETTER LITTLE YUS ѧ
This doesn't:
U+04FF CYRILLIC SMALL LETTER HA WITH STROKE ӿ
Can you help me find the pattern?
U+0467 is from Unicode 1.1, whereas U+04FF is from Unicode 5.0. The page you refer to mentions Unicode 3.0. So the compiler's Unicode databases are just not new enough.
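To illustrate, with a compiler whose Unicode tables predate Unicode 5.0:

int ѧ = 1; // U+0467, in Unicode since 1.1 -- accepted
int ӿ = 2; // U+04FF, added in Unicode 5.0 -- rejected by such a compiler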