Regex to check that an input string contains only Persian characters - C#

I work with MVC and I am new to it. I want to check that input values contain only Persian characters, using [RegularExpression] validation.
I think I should use a regex that checks a Unicode range, but I don't know how to find the Unicode range of Persian characters. Am I right about using a regex? What do you suggest, and how can I find the Unicode range for Persian?

Persian characters are within the range: [\u0600-\u06FF]
Try:
Regex.IsMatch(value, @"^[\u0600-\u06FF]+$")
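Since the question asks about [RegularExpression], the same pattern can be attached to a model property. This is a minimal sketch: the class and property names (PersonViewModel, FirstName, PersonViewModelCheck) are hypothetical, and the helper only shows how the attribute could be exercised outside MVC model binding.

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Hypothetical view model; the names are illustrative only.
public class PersonViewModel
{
    // RegularExpressionAttribute requires the pattern to match the whole value,
    // so the ^ and $ anchors are redundant here but harmless.
    [RegularExpression(@"^[\u0600-\u06FF]+$",
        ErrorMessage = "Only characters from the Arabic block are allowed.")]
    public string FirstName { get; set; }
}

public static class PersonViewModelCheck
{
    // Runs the data-annotation validators by hand, the way model binding would.
    public static bool IsValid(PersonViewModel model)
    {
        var results = new List<ValidationResult>();
        var context = new ValidationContext(model);
        return Validator.TryValidateObject(model, context, results, validateAllProperties: true);
    }
}
```

Note that the attribute treats null and empty strings as valid; add [Required] if the field must be filled in.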

Check the range between the first and last letters of the Persian alphabet. I think something like this:
"^[آ-ی]+$"

Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")
This covers only the standard Arabic range, but Persian also includes 4 more characters:
ژ \uFB8A (the isolated presentation form; the base letter ژ is \u0698)
پ \u067E
چ \u0686
گ \u06AF
So you should use:
^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF]+$
If you want to match the zero-width non-joiner, you should add this too:
\u200C

TL;DR
All answers that say use \u0600-\u06FF or [آ-ی] are simply WRONG.
That is, \u0600-\u06FF contains 209 more characters than you need, and it includes digits too!
The character sets that Farsi actually requires are as follows:
Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters.
Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers.
Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels.
Or a union of those. You may want to add other Arabic letters like Hamza ء to your character set additionally.
This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not denote Persian / Farsi alphabet (neither does [آ-ی]):
[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏
۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ
ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ
ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ
ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]
255 characters fall in this range. The Farsi alphabet has 32 letters; adding the Farsi forms of the ten digits gives 42. If we add the vowels (originally Arabic vowels, rarely used in Farsi), Tanvin (ً, ٍِ ‬, ٌ ‬) and Tashdid (ّ ‬), which are both a subset of Arabic diacritics rather than Farsi, we end up with 46 characters. This means:
\u0600-\u06FF contains 209 more characters than you need!
۷ with codepoint 06F7 is a Farsi representation of number 7 and ٧ with codepoint 0667 is Arabic representation of the same number. ۶ is Farsi representation of number 6 and ٦ is Arabic representation of the same number. And all reside in 0600 through 06FF codepoints.
The shapes of the Persian digits four (۴), five (۵), and six (۶) are
different from the shapes used in Arabic and the other numbers have
different codepoints.
You can also see a number of other characters that do not exist in Farsi / Persian, and nobody wants to accept them while validating a first name or surname.
[آ-ی] also includes 117 characters, which is far more than anyone needs for validation. You can see them all using Unicode CLDR.
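The difference between the explicit character set and the whole block can be sketched in C#. This is an illustration under assumptions, not a definitive validator: the class and member names are hypothetical, and the letters are written as \u escapes (the same 33 characters as the letter list above, including آ) to avoid right-to-left editing problems.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class PersianText
{
    // The Persian letters from the list above, written as escapes:
    // U+0622, U+0627, U+0628, U+062A-U+063A, U+0641, U+0642, U+0644-U+0648,
    // plus the Persian-specific U+067E, U+0686, U+0698, U+06A9, U+06AF, U+06CC.
    private const string Letters =
        @"\u0622\u0627\u0628\u062A-\u063A\u0641\u0642\u0644-\u0648\u067E\u0686\u0698\u06A9\u06AF\u06CC";

    // Persian digits U+06F0 through U+06F9.
    private const string Digits = @"\u06F0-\u06F9";

    private static readonly Regex LettersOnly = new Regex("^[" + Letters + "]+$");
    private static readonly Regex LettersOrDigits = new Regex("^[" + Letters + Digits + "]+$");
    private static readonly Regex WholeBlock = new Regex(@"^[\u0600-\u06FF]+$");

    public static bool IsLetters(string s) => LettersOnly.IsMatch(s);
    public static bool IsLettersOrDigits(string s) => LettersOrDigits.IsMatch(s);

    // The broad range that the answer above argues against.
    public static bool IsInArabicBlock(string s) => WholeBlock.IsMatch(s);
}
```

With this sketch, the Arabic kaf (U+0643) and the Arabic digit seven (U+0667) pass IsInArabicBlock but are rejected by the explicit sets, while the Persian digit seven (U+06F7) is accepted by IsLettersOrDigits.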

I use this regex in my program and it works correctly. I hope it helps:
[پچجحخهعغفقثصضشسیبلاتنمکگوئدذرزطظژؤآإأءًٌٍَُِّ\s]+$

Persian characters are within the range: [\u0600-\u06FF] + [\s]
Try:
Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")
This pattern matches letters and space characters.
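One caveat with this pattern: a quantified group whose contents can match the same text in several ways, such as ([...]+\s?)+, may backtrack heavily on long inputs that fail near the end. A possible alternative that accepts the same strings without the nested repetition is sketched below; the class name and the 250 ms timeout are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the name and timeout value are illustrative only.
public static class PersianWords
{
    // One run of letters, then any number of "single whitespace + letters" runs,
    // then an optional trailing whitespace character.
    private static readonly Regex Words = new Regex(
        @"^[\u0600-\u06FF]+(?:\s[\u0600-\u06FF]+)*\s?$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // guard against pathological inputs

    public static bool IsMatch(string text)
    {
        try
        {
            return Words.IsMatch(text);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as a failed validation.
            return false;
        }
    }
}
```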

Related

Insufficient Hexadecimal Digits Regex Exception?

I am formulating a regex where it would match with all letters (including chinese) and some chosen punctuations (also including chinese).
Here's my regex
@"^[\p{L}\x{FF01}-\x{FF1E}\x{3008}-\x{30A9}0-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
It throws an "insufficient hexadecimal digits" exception. Can anybody please tell me what is wrong with it? I tried some online regex testers and it works there.
I'm using the Regex class in C# to parse it.
From the docs:
\x nn Uses hexadecimal representation to specify a character (nn consists of exactly two digits).
I think what you want is \u:
\u nnnn Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn).
Try this:
@"^[\p{L}\uFF01-\uFF1E\u3008-\u30A90-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
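The difference between the two escapes can be checked directly. The sketch below uses a hypothetical helper class and a shortened character class (only the fullwidth range) to show that the \x{...} form is rejected by the .NET parser while the \u form is accepted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class HexEscapes
{
    // Returns true when the .NET regex parser rejects the pattern.
    public static bool FailsToParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            // Regex parse errors derive from ArgumentException.
            return true;
        }
    }

    // \u takes exactly four hex digits, so this range parses as intended.
    public static bool IsFullwidthPunctuation(string s) =>
        Regex.IsMatch(s, @"^[\uFF01-\uFF1E]+$");
}
```

Here FailsToParse(@"[\x{FF01}-\x{FF1E}]") returns true, because \x must be followed by exactly two hex digits, while FailsToParse(@"[\uFF01-\uFF1E]") returns false.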

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).
I found the most working variant is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so. The "old" range (\u0080-\u009F, as in the 860 Portuguese code page) was just one of many possible mappings of the 128 extended characters available in an 8-bit code page, where you would sometimes find the same character at different code points depending on the code page.
C# strings are unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
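The three character classes discussed in this thread can be compared side by side. This is a sketch with hypothetical names; it assumes whole-string matching, which the original edit mask may or may not use.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class EditMask
{
    // The range the asker found to work: part of the Latin-1 Supplement block.
    public static bool Latin1Range(string s) =>
        Regex.IsMatch(s, @"^[a-zA-Z0-9\u00B5-\u00FF]+$");

    // The range suggested by the code page tables: C1 control characters in Unicode.
    public static bool C1Range(string s) =>
        Regex.IsMatch(s, @"^[a-zA-Z0-9\u0080-\u009F]+$");

    // Unicode letters and decimal digits, independent of any block layout.
    public static bool LettersAndDigits(string s) =>
        Regex.IsMatch(s, @"^[\p{L}\p{Nd}]+$");
}
```

For "é" (U+00E9), Latin1Range and LettersAndDigits accept it and C1Range rejects it. The fixed range also lets through non-letters such as the multiplication sign "×" (U+00D7), and it misses letters outside Latin-1 such as "ǵ" (U+01F5), which is where \p{L} behaves differently.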

At least one digit, minimum 8 chars length, with unicode

I know that regex questions have been asked many times before, but I just can't make it to work as I need. What I need is a regex, with a minimum of 8 characters, containing at least one digit (digits can appear in the start, end or after other characters), and supporting Unicode, so that Hebrew, Arabic etc. characters can be used.
Here's the basic regex:
^(?=.*?\d).{8}
^.{8} will match any string that has at least 8 characters. (?=.*?\d) will assert there's a digit in there.
As for the Unicode support, that's up to the regex engine. If Unicode is supported, . should match a Unicode character. If you want to match graphemes instead, your regex flavor may support \X, which you could use instead of ..
If you want to allow non-latin digits, you may need to replace \d with \p{N} depending on your regex engine.
Update for the .NET flavor:
\d already matches Unicode digits so you don't need to use \p{N}
\X is not supported so you'll have to stick with . or use a workaround like (?>\P{M}\p{M}*).
Assuming you are using a C#- or Java-like regex flavor, and that by "characters" you mean characters in the Unicode category "letter", you can use:
(?=\p{L}*?\p{Nd})[\p{L}\p{Nd}]{8,}
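Both patterns from the answers can be tried in .NET as follows. This is a sketch with hypothetical names; the second pattern is wrapped in ^ and $ here, which is an addition not present in the answer, so that strings containing other characters are rejected as a whole.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class PasswordRule
{
    // At least 8 characters of any kind, with at least one Unicode digit somewhere.
    private static readonly Regex AnyChars = new Regex(@"^(?=.*?\d).{8}");

    // Only letters and decimal digits, 8 or more, with at least one digit.
    // The ^ and $ anchors are an assumption added for whole-string validation.
    private static readonly Regex LettersDigits =
        new Regex(@"^(?=\p{L}*?\p{Nd})[\p{L}\p{Nd}]{8,}$");

    public static bool AtLeastEightWithDigit(string s) => AnyChars.IsMatch(s);
    public static bool LettersAndDigitsOnly(string s) => LettersDigits.IsMatch(s);
}
```

In .NET, \d matches any Unicode decimal digit by default, so a password ending in an Arabic-Indic digit such as U+0663 also satisfies the first rule.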

textbox validation expression property alphanumerical

I'm working with a simple RegularExpressionValidator. The textbox has to be 14 digits long (exactly 14). So, I use ValidationExpression="\d{14}"></asp:RegularExpressionValidator> but that instance just allow numbers, and I need letters also (to be clear, no special characters or dots, semi-colons, only numbers and letters).
What would fit better than "\d{14}" ?
thanks!
\d matches only digit characters.
Try \w, which matches letters as well as digits (note that it also matches the underscore).
So your expression should be:
<asp:RegularExpressionValidator ValidationExpression="\w{14}"></asp:RegularExpressionValidator>
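Because \w also accepts the underscore (and, in .NET, Unicode letters), an explicit class may be closer to "only numbers and letters". The sketch below compares the two on the server side; the class name is hypothetical and the ^ and $ anchors stand in for the whole-value matching that the validator performs.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class FourteenChars
{
    // Word characters: letters, digits and the underscore.
    public static bool WordChars(string s) => Regex.IsMatch(s, @"^\w{14}$");

    // ASCII letters and digits only.
    public static bool AsciiAlphanumeric(string s) => Regex.IsMatch(s, @"^[a-zA-Z0-9]{14}$");
}
```

For example, "abc_1234567890" is accepted by WordChars but rejected by AsciiAlphanumeric, so ValidationExpression="[a-zA-Z0-9]{14}" may be the safer choice if underscores must be excluded.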

How can I display a negative symbol in .NET?

I want to display a negative symbol from a string in .NET. I want a string that represents an equation that looks something like this:
7--5=12
But when displayed, I want the 2nd minus sign to be slightly raised so it looks more natural as a negative sign instead of just 2 minus signs in a row.
Is this possible?
Not sure if there's a character for what you want, but a simple solution (and one that would be easily understood and implemented) would be to surround your negative number in brackets:
7 - (-5) = 12
Use the Unicode character SUPERSCRIPT MINUS (U+207B) ⁻.
For example:
7-⁻5 = 12
EDIT: Or, with a MINUS SIGN (U+2212) − for the subtraction:
7 − ⁻5 = 12
Provided that you're using Unicode, you can use a true minus sign, "−" (U+2212), rather than a hyphen-minus, "-" (U+002D). Just be aware that it's not ASCII compatible.
Here's your example showing them:
7 - −5=12
Also, here's some fun wiki-articles on all sorts of dash-hyphen-minus lines:
http://en.wikipedia.org/wiki/Dash#Common_dashes
http://en.wikipedia.org/wiki/Minus_sign#Character_codes
This is a great resource on format strings in C#:
SteveX Compiled - Format Strings
You can choose how a negative number is displayed by using a format string with semicolon-separated sections. It takes the form:
{0:<PositiveFormat>;<NegativeFormat>;<ZeroFormat>}
For example, this is how to display a negative number in parenthesis and the word "Zero" for 0:
{0:#;(#);Zero}
Using this technique, I think you can try the superscript minus (Unicode code point U+207B) in the negative section of the format string.
{0:#;⁻#;Zero}
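A sketch of that idea in C#, assuming integer operands and hypothetical names. The superscript minus and the true minus sign are written as \u escapes so the source file stays ASCII, and the zero section prints "0" instead of the word "Zero" so it fits in an equation.

```csharp
using System.Globalization;

// Hypothetical helper; names are illustrative only.
public static class NegativeDisplay
{
    // Sections: positive ; negative ; zero.
    // The negative section prefixes the absolute value with U+207B.
    public static string Signed(int value) =>
        string.Format(CultureInfo.InvariantCulture, "{0:#;\u207B#;0}", value);

    // Builds "a − b = result" with U+2212 for the subtraction
    // and a raised minus on a negative second operand.
    public static string Subtraction(int a, int b) =>
        string.Format(CultureInfo.InvariantCulture,
            "{0} \u2212 {1:#;\u207B#;0} = {2}", a, b, a - b);
}
```

With these definitions, NegativeDisplay.Subtraction(7, -5) produces "7 − ⁻5 = 12". Whether the raised minus looks right depends on the font used to display it.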
HTH, Anderson
Traditionally in math typography you use an en dash U+2013 or minus U+2212 (but not a hyphen!) for both binary (subtraction) and unary (negation) minus, and they are differentiated with spacing (spaces before and after a binary minus, no space between a unary minus and the number it negates).
But if you want to further distinguish the unary, I'd recommend substituting the superscript minus U+207B (but keep the spacing around the subtraction minus):
7 − ⁻5 = 12
You can use the Unicode character U+2212 (Minus Sign): 7-−5=12
In the font I'm using, the minus sign is displayed slightly raised relative to the dash. Your results may vary.
Unicode "superscript minus" http://www.fileformat.info/info/unicode/char/207b/index.htm
char c = '\u207b';
