What is a String Culture - c#

Just trying to understand that - I have never used it before. How is a culture different to ToUpper() / ToLower()??

As SLaks says, different cultures handle casing differently.
A specific example from MSDN:
In most Latin alphabets, the character
i (Unicode 0069) is the lowercase
version of the character I (Unicode
0049). However, the Turkish alphabet
has two versions of the character I:
one with a dot and one without a dot.
In Turkish, the character I (Unicode
0049) is considered the uppercase
version of a different character ı
(Unicode 0131).

Different cultures have different rules for converting between uppercase and lowercase characters.
They also have different rules for comparing and sorting strings, and for converting numbers and dates to strings.

The Turkish I is the most common example of cultural differences in case mappings, but there are many others.
I recommend checking out the Unicode Consortium's information on this.
http://www.unicode.org/faq/casemap_charprop.html

Related

Categorizing this Thai character using the .NET framework

I'm trying to parse some Thai text according to the rules explained here http://www.thai-language.com/ref/spacing
Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.
To parse the text I tried simply looping, like
while( Char.IsLetterOrDigit(theText[i++]) ) {}
to find the next character that isn't a letter or digit. That works except for certain characters like this one
which is the second character in this word (I think that's the character 'superscripting' the first character in the word).
This character doesn't seem to be categorized as anything by the Char class, ie:
Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)
all return false.
This character might be a 'tone' - how would that be identified using .NET?
According to Unicode specifications the character is mai tho and is in the group “mark, nonspacing (Mn).”
You can use the Char.GetUnicodeCategory() method to check the type. For non-spacing marks the type is 5, or UnicodeCategory.NonSpacingMark

C# toUpper for language without Uppercase

When using String.toUpper() are there any additional precautions which must be taken when attempting to "format" a language which does not contain uppercase characters such as Arabic?
string arabic = "مرحبا بالعالم";
string upper= arabic.ToUpper();
Sidebar: Never call .ToUpper() or .ToLower() when localization matters because these methods do not accept an explicit IFormatProvider that makes your intent (about localization) clear. You should prefer CultureInfo.TextInfo.ToUpperCase instead.
But to answer your question: case-conversions do not affect characters not subject to casing, they are kept as-is. This also happens in en-US and other Latin-alphabet languages because characters like digits 0, 1, 2 etc don't have cases either - so your Arabic characters will be preserved as-is.
Note how the non-alphabetical and already-uppercase characters are ignored:
"abcDEF1234!##" -> "ABCDEF1234!##"
Another thing to be aware of is that some languages have characters that don't have a one-to-one mapping between lowercase and uppercase forms, namely the Turkish I, which is written up here: https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/ (and it's why FxCop yells at you if you ever use ToLower instead of ToUpper, and why you should use StringComparison.OrdinalIgnoreCase or CurrentCultureIgnoreCase and never str1.ToLower() == str2.ToLower() for case-insensitive string comparison.

Domain Name Regex Including IDN Characters c#

I want my domain name to not contain more than one consecutive (.), '/' or any other special characters. But it can contain IDN Characters such as Á, ś, etc... I can achieve all requirements (except IDN) by using this regex:
#"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
Problem is that this regex denies IDN charaters too. I want a regex which will allow IDN characters. I did a lof of research but I cant figure it out.
Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general
category or named block specified by name.
Also, as a sidenote, I noticed your regex contains an unescaped .. In regex the dot character . has a special meaning of any character (except newline unless otherwise specified). You may need to change this to \. to ensure proper functionality.
Code
Editing your existing code to include Unicode character classes instead of simply the ASCII letters, you should attain the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
This site gives a quick and general overview of the most used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is
capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g.
accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel
signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character is is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from
ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, IDN domain names start with xn--. This way extended UTF-8 characters are enabled in domain names, e.g. 大众汽车.cn is a valid domain name (volkswagen in Chinese). To validate this domain name using regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA), Doc/PDF.
Libidn comes with a CLI tool called idn that allows you to convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as ACE-encoded equivalent of UTF-8 URL.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws and I suggest the following regular expression, that should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
Demo
Visualization
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
If you would like to use Unicode character classes \p{} instead, you should use the following as specified by the IDN FAQ:
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=V}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties
"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
To accept only the functional domains which resolve when probed through a DNS query
To accept the strings which can potentially act (get registered and subsequently resolved, or only for the sake of information) as domain name
Depending on the nature of the need, the ways in which the domain name can be validated, differs a great deal.
For validating the domain names, only from pure technical specification point of view, regardless of it's resolvability vis-a-vis the DNS, is a slightly more complex problem than merely writing a Regex with certain number of Unicode classes.
There is a host of RFCs (5891,5892,5893,5894 and 5895) that together define, the structure of a valid domain ( IDN in specific, domain in general) name. It involves not only various Unicode Character classes, but also includes some context specific rules which need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate the domain names as per the latest IDNA Protocol i.e. IDNA 2008.
C# provides a library: System.Globalization.IdnMapping that provides conversion of a domain name to it's equivalent punycode version. You can use this library to check if the user submitted domain is compliant with the IDNA norms. If it is not, during the conversion you will encounter an error/exception, thereby validating the user submission.
If one is interested to dig deeper into the topic, please do refer the very thoroughly research document produced by the "Universal Acceptance Steering Group" (https://uasg.tech/), titled, "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/ as well as "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those who are interested to know the overall process, challenges and issues one may come across while implementing the Internationalized Email Solution, one can also go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email).)."

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
If found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !##$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in range \u00B5-\u00FF, because the Unicode standard says so. The "old" range (\u0080-\u009F as in the 860 portugese code page) was just one of many possible mappings of the available 128 extended characters in ANSI, where you would sometimes find the same character at different codepoints depending on codepage).
C# strings are unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.

Regex to allow non-ascii and foreign letters?

Is it possible to create a regular expression to allow non-ascii letters along with Latin alphabets, for example Chinese or Greek symbols(eg. A汉语AbN漢語 allowed)?
I currently have the following ^[\w\d][\w\d_\-\.\s]*$ which only allows Latin alphabets.
In .NET,
^[\p{L}\d_][\p{L}\d_.\s-]*$
is equivalent to your regex, additionally allowing other Unicode letters.
Explanation:
\p{L} is a shorthand for the Unicode property "Letter".
Caveat: I think you wanted to not allow the underscore as initial character (evidenced by its presence only in the second character class). Since \w includes the underscore, your regex did allow it, though. You might want to remove it from the first character class in my solution (it's not included in \p{L}, of course).
In ECMAScript, things are not so easy. You would have to define your own Unicode character ranges. Fortunately, a fellow StackOverflow user has already risen to the occasion and designed a JavaScript regex converter:
https://stackoverflow.com/a/8933546/20670

Categories

Resources