French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçÇµ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçÇµ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?

The characters [µùàçéèçÇ] fall in the range \u00B5-\u00FF because that is where the Unicode standard places them: its first 256 code points match ISO-8859-1 (Latin-1). The "old" range (\u0080-\u009F, as in the Portuguese code page 860) was just one of many possible mappings of the 128 extended characters available in an 8-bit code page, where the same character could sit at different code points depending on the code page.
C# strings are Unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
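A minimal C# sketch of the options above, side by side. The class and field names are hypothetical, and the `^...$` anchors and `+` quantifier are an assumption about how the mask would be applied to a whole input string.

```csharp
using System.Text.RegularExpressions;

public static class MaskPatterns
{
    // Explicit Unicode range from the question (U+00B5..U+00FF).
    // This range also admits non-letters such as × (U+00D7) and ÷ (U+00F7).
    public static readonly Regex RangeBased =
        new Regex(@"^[a-zA-Z0-9\u00B5-\u00FF]+$");

    // The accepted characters listed literally.
    public static readonly Regex Literal =
        new Regex(@"^[a-zA-Z0-9µùàçéèÇ]+$");

    // Any Unicode letter (category L) plus ASCII digits.
    public static readonly Regex AnyLetter =
        new Regex(@"^[\p{L}0-9]+$");
}
```

The range-based pattern is the loosest of the three for non-letters: `MaskPatterns.RangeBased.IsMatch("×")` succeeds, while the `\p{L}` version rejects the multiplication sign because it is a math symbol, not a letter.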

Related

C# / .NET Core way to filter out non-Roman characters but allow all accents and diacritics of Roman letters in all languages that use them

I'm looking for an efficient way to validate a website textbox and textarea input elements. The input is for human readable text only, like name, address, comments, question, survey answer, etc. In addition the valid input should only allow for all variety of Roman/Latin characters, including those included in Latin1, Latin2, Latin3, and Latin4 character sets (see wikipedia of ISO-8859 parts). This is because our call center can only read Roman characters (no Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, Greek, etc.), because at least when the language is not English, they can use Google translate, or when the text input is used for address, it can still make sense on the address label or invoice.
Since it is web input, the UTF-8 characters transmitted via HTTP are converted by the C# system into Unicode (UTF-16) internally. I want a function returning a boolean that can say whether there is a non-Roman/Latin character in the string, but it should not be so stringent as to disallow uncommon accented Roman letters such as the French Œ, the German ẞ, the Irish Ṡ, the Finnish Ž, the Danish Ǿ, etc. (none of those are in Latin1, not to mention ASCII). Of course all punctuation marks should trigger a false; this should take care of HTML/JS/SQL injection issues. A second validator (not part of this question) will filter allowable punctuation marks like hyphen, period, apostrophe, etc.
I'm looking for ideas, not necessarily code. I have a feeling that there is a NuGet package out there or an already made function that uses .NET facility like System.Char.IsLetter and System.Globalization.UnicodeCategory enum.
The value of this question comes from other developers requiring the same kind of validation. Partial answers are welcome, and I will post the final solution on this question for everyone to use. (Let's see whether this question edit can redeem the current -2 vote for this question :-) )
EDIT:
Responding to negative comments below, I realize "non-Roman" is a little vague for computer geeks who like precision. But we are in the age of cloud where all people speaking all kinds of language are entering stuff into a web page. I want to restrict the input to all varieties of Roman / Latin characters. By "Roman" I mean anything derived from a,b,c,d,e,...x,y,z. Pretty common sense, don't you think? So I want to allow characters similar to those letters used by speakers of French, German, Danish, Norwegian, Bulgarians, etc. BUT excluding Chinese, Korean, Japanese, Thai, Russian, Arabic, Hebrew, Greek characters. Nothing wrong with them, but it's simply a business policy so the characters in the database are at least readable and sortable.
So I'm not looking for anything super precise here, and a basic guideline is that it needs to include all letters defined in Latin1, Latin2, Latin3, and Latin4 character sets, but I require the filter to detect them as unicode (so has numerical value of a unicode character, not Latin3 character set). I think the criteria is specific enough.

You can try using regular expressions, which support named Unicode blocks.
Your regex may look something like
(\s|\p{IsBasicLatin}|\p{IsCombiningDiacriticalMarks})+
You could also have a broader range with exclusions. For example:
[\u0000-\u036F-[\p{P}\p{IsIPAExtensions}]]
Of course, you'd need to test and tweak the exact regex to allow/disallow punctuation and other character classes.
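A minimal sketch of both patterns from this answer in C#. The class and method names are hypothetical, and so is the decision to normalize to Form D first: without it, the first pattern would reject precomposed letters such as é (U+00E9), because they live in the Latin-1 Supplement block rather than in Basic Latin.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class LatinBlockCheck
{
    // Basic Latin plus combining marks, as in the first pattern above.
    private static readonly Regex BasicLatinPlusMarks = new Regex(
        @"^(\s|\p{IsBasicLatin}|\p{IsCombiningDiacriticalMarks})+$");

    // Broad range U+0000..U+036F minus punctuation and IPA Extensions,
    // using .NET character class subtraction.
    private static readonly Regex RangeMinusExclusions = new Regex(
        @"^[\u0000-\u036F-[\p{P}\p{IsIPAExtensions}]]+$");

    // Decompose first so that "é" becomes "e" + U+0301 (assumption of this sketch).
    public static bool MatchesBasicLatinPlusMarks(string s) =>
        BasicLatinPlusMarks.IsMatch(s.Normalize(NormalizationForm.FormD));

    public static bool MatchesRangeMinusExclusions(string s) =>
        RangeMinusExclusions.IsMatch(s);
}
```

Letters without a canonical decomposition, such as Œ, still fail the first pattern even after normalization, so the range-based variant (or additional named blocks such as `\p{IsLatinExtended-A}`) is probably closer to what the question asks for.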

After reviewing tips from Sten, Scott Hannen and Prix, I decided to go with the following:
using System.Text.RegularExpressions;

private static readonly string AllowedCharacterRegexPattern = @"^([a-zA-Z0-9\(\)\+,\-\.'/#_#& ]|[\u00C0-\u024F]|[\u1E00-\u1EFF])+$";

public static bool AllowedCharacter(string s)
{
    // Decision: Characters to include:
    // Basic Latin: 0x0030-0x0039, 0x0041-0x005A, 0x0061-0x007A: 0-9, A-Z, a-z : (https://unicode.org/charts/PDF/U0000.pdf)
    // Latin1: 0x00C0-0x00FF (https://unicode.org/charts/PDF/U0080.pdf)
    //   Note: this range also contains the signs U+00D7 (×) and U+00F7 (÷).
    // Latin Extended-A: 0x0100-0x017F (https://unicode.org/charts/PDF/U0100.pdf)
    // Latin Extended-B: 0x0180-0x024F (https://unicode.org/charts/PDF/U0180.pdf)
    // Latin Extended Additional: 0x1E00-0x1EFF (https://unicode.org/charts/PDF/U1E00.pdf)
    // Some punctuation: ( ) + , - . ' / # _ # &
    return Regex.IsMatch(s, AllowedCharacterRegexPattern);
}
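The question also mentions System.Char.IsLetter. A regex-free variant of the same idea could look like the sketch below; the class and method names are hypothetical, and the ranges are the ones from the comments above (everything up to U+024F plus Latin Extended Additional). It checks letters only, so digits and punctuation would need the separate validator the question describes.

```csharp
using System.Linq;

public static class LatinLetterCheck
{
    // True for letters in Basic Latin through Latin Extended-B (up to U+024F)
    // or in Latin Extended Additional (U+1E00..U+1EFF).
    public static bool IsLatinLetter(char c) =>
        char.IsLetter(c) &&
        (c <= '\u024F' || (c >= '\u1E00' && c <= '\u1EFF'));

    // True when the string is non-empty and consists only of such letters.
    public static bool AllLatinLetters(string s) =>
        !string.IsNullOrEmpty(s) && s.All(IsLatinLetter);
}
```

Because `char.IsLetter` consults the Unicode category, this variant rejects × and ÷ even though they sit inside the U+00C0..U+00FF range.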

Suggestions needed to Apply SuperScript to C# string For Xsl transformation

I want to apply superscript to a string for display.
It works fine for digits in superscript, but doesn't work for letters.
Suggestions needed.
Works fine for :
var o2 = "O₂"; // or "O\x2082"
var unit2 = "unit²"; // or "unit\xB2"
Does not work for :
var xyz = "ABC365\xBTM"
Can not get TM superscripted over string ABC365.
Suggestions appreciated.

You seem to have completely misunderstood what is going on here, so I'll try a very basic explanation.
Unicode defines a large number of characters (the code space has 1,114,112 code points, of which well over 100,000 are assigned). These came from a large number of historic sources, and there's no great rhyme or reason about which characters made it in and which didn't. The available characters include some subscript and superscript digits, for example x2082 is subscript 2, and x00B2 is superscript 2. It also includes some special symbols such as the trademark symbol x2122 which are traditionally rendered with a superscript appearance.
But there's no general mechanism in Unicode to render any character in superscript or subscript rendition. If you want to write X with a superscript n, Unicode won't help you: to achieve that you have to resort to mechanisms outside Unicode, specifically HTML tagging. HTML allows you to render anything as subscript or superscript; Unicode only handles a few select cases.
C# recognizes the escape sequences \uHHHH (exactly four hex digits) and \xH through \xHHHH (one to four hex digits), where H is any hex digit, to represent special characters by their Unicode code point value. So if there's a codepoint x2082 meaning subscript 2, you can write it as \u2082 (or \x2082) in a string literal. But there's no codepoint for subscript-lowercase-italic N, so there's no way of representing that.
Now when you write \xBTM, it does not do what you hope: the compiler reads \xB as the single character U+000B (a vertical tab), followed by the plain letters "T" and "M". If you want the trademark symbol, you can use \u2122. If you want the two characters "T" and "M" in superscript rendition, you're out of luck; if you need to pass that sort of thing around in your application, you will need to pass strings containing HTML markup, rather than just plain Unicode.
You indicate that you're trying to create strings that will be used as input to an XSLT transformation. My suggestion would be to pass XML documents rather than plain strings, but I would need to understand the requirement in better detail before saying that's definitely the right solution.
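A small sketch of both routes in C#: the code points that Unicode does provide, and markup for everything else. The class name and the element names "text" and "sup" are hypothetical; the real element names would depend on what the XSLT stylesheet expects.

```csharp
using System.Xml.Linq;

public static class SuperscriptDemo
{
    // Characters that exist as Unicode code points can be written with \uHHHH.
    public const string Subscript2 = "O\u2082";       // O with subscript 2
    public const string Superscript2 = "unit\u00B2";  // unit with superscript 2
    public const string Trademark = "ABC365\u2122";   // ABC365 with the trademark sign

    // Arbitrary raised text has no code point, so it is carried as markup
    // that a later XSLT step can render.
    public static XElement WithSuperscript(string baseText, string raised) =>
        new XElement("text", baseText, new XElement("sup", raised));
}
```

Calling `SuperscriptDemo.WithSuperscript("ABC365", "TM")` builds an XML fragment that can be handed to the transformation instead of a plain string.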

Domain Name Regex Including IDN Characters c#

I want my domain name to not contain more than one consecutive dot (.), '/', or any other special characters, but it can contain IDN characters such as Á, ś, etc. I can achieve all requirements (except IDN) by using this regex:
@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
The problem is that this regex rejects IDN characters too. I want a regex that allows IDN characters. I did a lot of research but I can't figure it out.

Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general
category or named block specified by name.
Also, as a side note, make sure the dot in your regex is escaped. In regex the dot character . has the special meaning of any character (except newline unless otherwise specified), so it must be written as \. to match a literal dot.
Code
Editing your existing code to include Unicode character classes instead of simply the ASCII letters, you should attain the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
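This site gives a quick and general overview of the most used Unicode categories. Note that .NET accepts only the short names (for example \p{L} or \p{Lu}); the long names and \p{L&} in the list below come from other regex flavors.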
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
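A minimal C# sketch of the Unicode-category approach described above, with the dot escaped and the hyphen placed last in the character class. The class and method names are hypothetical. It only checks the shape of the name; it does not apply IDNA rules, and letters written with separate combining marks (\p{M}) would not match.

```csharp
using System.Text.RegularExpressions;

public static class UnicodeDomainPattern
{
    // One or more labels of letters/numbers (any script), each followed by a
    // literal dot, then a final label of at least two letters/numbers.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$");

    public static bool LooksLikeDomain(string s) => Pattern.IsMatch(s);
}
```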

This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, IDN domain names start with xn--. This way extended UTF-8 characters are enabled in domain names, e.g. 大众汽车.cn is a valid domain name (volkswagen in Chinese). To validate this domain name using regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA).
Libidn comes with a CLI tool called idn that allows you to convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as ACE-encoded equivalent of UTF-8 URL.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn--fsqu00a.xn--0zwm56d
xn--fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
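If you would like to use Unicode character classes \p{} instead, the IDN FAQ specifies the following set. It is written in UnicodeSet notation and uses properties (such as Changes_When_NFKC_Casefolded and HST) that the .NET regex engine does not support, so it cannot be pasted into a C# Regex as-is: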
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=V}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties

"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
* To accept only functional domains that resolve when probed through a DNS query
* To accept strings that can potentially act as a domain name (get registered and subsequently resolved, or only for the sake of information)
Depending on the nature of the need, the ways in which the domain name can be validated differ a great deal.
Validating domain names purely against the technical specification, regardless of their resolvability in the DNS, is a slightly more complex problem than merely writing a regex with a certain number of Unicode classes.
A host of RFCs (5891, 5892, 5893, 5894 and 5895) together define the structure of a valid domain name (IDNs specifically, domains in general). It involves not only various Unicode character classes, but also some context-specific rules that need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names as per the latest IDNA protocol, i.e. IDNA 2008.
C# provides a class, System.Globalization.IdnMapping, that converts a domain name to its equivalent Punycode version. You can use this class to check whether the user-submitted domain complies with the IDNA norms. If it does not, the conversion throws an exception, thereby validating the user submission.
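The IdnMapping-based check described above can be sketched as follows. The class and method names are hypothetical, and the sketch uses IdnMapping with its default settings; properties such as UseStd3AsciiRules may need to be set depending on how strict the validation should be.

```csharp
using System;
using System.Globalization;

public static class IdnCheck
{
    // Tries to convert a Unicode domain name to its ASCII (Punycode) form.
    // Returns false when IdnMapping rejects the name.
    public static bool TryToAscii(string domain, out string ascii)
    {
        try
        {
            ascii = new IdnMapping().GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii throws ArgumentException (or the derived
            // ArgumentNullException) for names it cannot convert.
            ascii = string.Empty;
            return false;
        }
    }
}
```

A name that passes can then be run through an ASCII-only pattern, since the converted form contains only letters, digits, hyphens and dots.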
If you want to dig deeper into the topic, please refer to the very thoroughly researched documents produced by the "Universal Acceptance Steering Group" (https://uasg.tech/), titled "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those who want to know the overall process, challenges and issues one may come across while implementing an Internationalized Email solution can go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email)."

Regex to allow non-ascii and foreign letters?

Is it possible to create a regular expression to allow non-ascii letters along with Latin alphabets, for example Chinese or Greek symbols(eg. A汉语AbN漢語 allowed)?
I currently have the following ^[\w\d][\w\d_\-\.\s]*$ which only allows Latin alphabets.

In .NET,
^[\p{L}\d_][\p{L}\d_.\s-]*$
is equivalent to your regex, additionally allowing other Unicode letters.
Explanation:
\p{L} is a shorthand for the Unicode property "Letter".
Caveat: I think you wanted to not allow the underscore as initial character (evidenced by its presence only in the second character class). Since \w includes the underscore, your regex did allow it, though. You might want to remove it from the first character class in my solution (it's not included in \p{L}, of course).
In ECMAScript before ES2018, things are not so easy: you would have to define your own Unicode character ranges. Fortunately, a fellow StackOverflow user has already risen to the occasion and designed a JavaScript regex converter (since ES2018, \p{L} is also available natively when the u flag is set):
https://stackoverflow.com/a/8933546/20670
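A minimal C# sketch of the .NET pattern from this answer, using the sample input from the question. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class UnicodeNamePattern
{
    // First character: any Unicode letter, digit or underscore.
    // Remaining characters: the same plus dot, whitespace and hyphen.
    private static readonly Regex Pattern =
        new Regex(@"^[\p{L}\d_][\p{L}\d_.\s-]*$");

    public static bool IsValid(string s) => Pattern.IsMatch(s);
}
```

Note that in .NET `\d` matches decimal digits from any script by default; passing RegexOptions.ECMAScript or writing [0-9] restricts it to ASCII digits.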

Pattern matching for swedish character

I need a help regarding regular expression.
I have to match string like this:
âãa34dc
Pattern that i have used:
\s*[a-zA-Z]+[a-zA-Z_0-9]*\s
but this pattern does not match strings such as âãa34dc.
P.S. â and ã are Swedish characters.
Please help me find the correct pattern for this kind of string.

Do you actually want to restrict it to Swedish characters? In other words, should a German character not match? If so, then you'll probably have to enumerate the whole alphabet, and include that.
If what you really want is to match every alphabetic character, use the regular expression terms for matching all letters.
\w matches any word character, but that includes numbers & some punctuation. That's close, but not exactly what you want for your second term.
For the first term, where you don't want to include numbers, specifying that the character should be a Unicode 'letter' class will work. \p{L} specifies all Unicode characters that are a letter. This includes [a-zA-Z], and all the Swedish characters, and German, and Russian, etc.
Therefore, I think this regular expression is what you want:
\s*[\p{L}][\p{L}_0-9]*\s
If you want to include digits from other character sets, and some other punctuation, then you can use [\w]* for the second term.
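A minimal C# sketch applying the pattern from this answer. The class and method names are hypothetical, and the capture group around the identifier is an addition so that the surrounding whitespace can be dropped from the result.

```csharp
using System.Text.RegularExpressions;

public static class LetterIdentifier
{
    // Optional leading whitespace, a Unicode letter, then letters, underscores
    // or ASCII digits, followed by one mandatory whitespace character.
    private static readonly Regex Pattern =
        new Regex(@"\s*([\p{L}][\p{L}_0-9]*)\s");

    // Returns the first identifier found, or an empty string when none matches.
    public static string FirstIdentifier(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Because the pattern ends in `\s`, an identifier at the very end of the input with no trailing whitespace is not matched; replacing the final `\s` with `\s*` or `(\s|$)` would relax that.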

Please give a set of rules.
According to your question:
[X-Ya-zA-Z]{3}[0-9]{2}[a-zA-Z]{2}
Replace X with the first swedish letter
Replace Y with the last swedish letter

John Machin provides a great answer for this. Adapting his pattern, what you need is probably something similar to: \s*[^\W\d_]\w*\s*
P.S. I removed the + quantifier from your first part. Any subsequent letters would be matched by the subsequent quantified \w.
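A minimal C# sketch of the double-negation pattern above. The class and method names are hypothetical, and the `^` and `$` anchors are an assumption that the whole input should be a single identifier. `[^\W\d_]` reads as "a word character that is neither a digit nor an underscore", which in practice means a letter.

```csharp
using System.Text.RegularExpressions;

public static class WordIdentifier
{
    // A letter-like word character first, then any word characters,
    // with optional whitespace on either side.
    private static readonly Regex Pattern =
        new Regex(@"^\s*[^\W\d_]\w*\s*$");

    public static bool IsIdentifier(string s) => Pattern.IsMatch(s);
}
```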
