I am writing a regex that should match all letters (including Chinese) and a chosen set of punctuation characters (also including Chinese ones).
Here's my regex:
@"^[\p{L}\x{FF01}-\x{FF1E}\x{3008}-\x{30A9}0-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
It throws an exception of insufficient hexadecimal digits. Can anybody please tell me what is wrong with it? I tried some regex testers online and it works there.
I'm using the Regex class in C# to parse it.
From the docs:
\x nn Uses hexadecimal representation to specify a character (nn consists of exactly two digits).
I think what you want is \u:
\u nnnn Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn).
Try this:
@"^[\p{L}\uFF01-\uFF1E\u3008-\u30A90-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
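A minimal sketch of the difference, assuming .NET's System.Text.RegularExpressions. The names EscapeDemo, IsAllowed and Compiles are made up for illustration, and the character class is trimmed to the Unicode parts of the pattern above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Trimmed-down pattern: letters, the two CJK ranges, ASCII digits, whitespace.
    private static readonly Regex Allowed =
        new Regex(@"^[\p{L}\uFF01-\uFF1E\u3008-\u30A90-9\s]+$");

    public static bool IsAllowed(string input) => Allowed.IsMatch(input);

    // Returns true if the .NET regex parser accepts the pattern.
    public static bool Compiles(string pattern)
    {
        try { new Regex(pattern); return true; }
        catch (ArgumentException) { return false; }
    }

    public static void Main()
    {
        // Chinese letters, a fullwidth exclamation mark (U+FF01), ASCII letters and digits.
        Console.WriteLine(IsAllowed("你好\uFF01 abc 123"));
        // The Perl-style \x{...} form is not understood by the .NET parser.
        Console.WriteLine(Compiles(@"[\x{FF01}-\x{FF1E}]"));
        // The four-digit \u form is.
        Console.WriteLine(Compiles(@"[\uFF01-\uFF1E]"));
    }
}
```

Many online testers use PCRE or JavaScript engines, which would explain why the \x{...} form works there but not in .NET.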
Related
I know that regex questions have been asked many times before, but I just can't get this one to work as I need. What I need is a regex that requires a minimum of 8 characters, containing at least one digit (digits can appear at the start, at the end or after other characters), and that supports Unicode, so that Hebrew, Arabic etc. characters can be used.
Here's the basic regex:
^(?=.*?\d).{8}
^.{8} will match any string that has at least 8 characters. (?=.*?\d) will assert there's a digit in there.
As for the Unicode support, that's up to the regex engine. If Unicode is supported, . should match a Unicode character. If you want to match graphemes instead, your regex flavor may support \X, which you could use instead of ..
If you want to allow non-latin digits, you may need to replace \d with \p{N} depending on your regex engine.
Update for the .NET flavor:
\d already matches Unicode decimal digits (\p{Nd}), so you don't need to use \p{N}
\X is not supported so you'll have to stick with . or use a workaround like (?>\P{M}\p{M}*).
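As a rough sketch of the .NET behaviour described above (PasswordCheck and IsValid are hypothetical names, and the sample strings are made up):

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // At least 8 characters and at least one digit. In .NET, \d covers Unicode
    // decimal digits unless RegexOptions.ECMAScript is set.
    private static readonly Regex Rule = new Regex(@"^(?=.*?\d).{8}");

    public static bool IsValid(string candidate) => Rule.IsMatch(candidate);

    public static void Main()
    {
        Console.WriteLine(IsValid("abcdefg1"));       // ASCII digit at the end
        Console.WriteLine(IsValid("abcdefg\u0661"));  // Arabic-Indic digit one
        Console.WriteLine(IsValid("abcdefgh"));       // long enough, but no digit
        Console.WriteLine(IsValid("abc1"));           // has a digit, but too short
    }
}
```

Keep in mind that . in .NET matches UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two towards the 8.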
Assuming you are using a C#- or Java-like regex flavor, and that by "characters" you mean characters in the Unicode category "letter", you can use:
(?=\p{L}*?\p{Nd})[\p{L}\p{Nd}]{8,}
I want to print a Unicode character with 5 hexadecimal digits on the screen (for example to write it on a Windows Forms button).
For example, the codepoint of the Ace of Hearts character is U+1F0B1. I tried \x, but it accepts at most 4 hex digits.
You can use the \U escape sequence:
string text = "Ace of hearts: \U0001f0b1";
Of course, you'll have to be using a font which supports that character...
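A small sketch of what that escape produces, assuming nothing beyond the base class library (AceOfHearts and Card are illustrative names). Because the codepoint is above U+FFFF, the string holds a surrogate pair:

```csharp
using System;

public static class AceOfHearts
{
    // The eight-digit \U escape covers codepoints beyond U+FFFF.
    public static string Card() => "\U0001f0b1";

    public static void Main()
    {
        string viaEscape = Card();
        // char.ConvertFromUtf32 builds the same string from the numeric codepoint.
        string viaApi = char.ConvertFromUtf32(0x1F0B1);

        Console.WriteLine(viaEscape == viaApi);
        // Length counts UTF-16 code units, not characters.
        Console.WriteLine(viaEscape.Length);
        Console.WriteLine(char.IsSurrogatePair(viaEscape, 0));
    }
}
```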
As an aside, I'd strongly recommend avoiding the \x escape sequence, as it's hard to read. For example:
string good = "Bell: \x7Good compiler";
string bad = "Bell: \x7Bad compiler";
When presented together, at first glance it would seem that these are both "Bell: " followed by U+0007 followed by either "Good compiler" or "Bad compiler"... but because "Bad" is entirely composed of valid hex digits, the second string is actually "Bell: " followed by U+7BAD followed by " compiler".
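The two strings can be inspected directly. This is only a sketch (HexEscapeDemo is an illustrative name); it prints the codepoint of the character that follows "Bell: " in each case, plus a \u variant that cannot absorb the following letters:

```csharp
using System;

public static class HexEscapeDemo
{
    public static readonly string Good = "Bell: \x7Good compiler";
    public static readonly string Bad = "Bell: \x7Bad compiler";
    // \u always takes exactly four hex digits, so it stops before "Bad".
    public static readonly string Safe = "Bell: \u0007Bad compiler";

    public static void Main()
    {
        // Index 6 is the first character after "Bell: ".
        Console.WriteLine("U+{0:X4}", (int)Good[6]);
        Console.WriteLine("U+{0:X4}", (int)Bad[6]);
        Console.WriteLine("U+{0:X4}", (int)Safe[6]);
    }
}
```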
I work with MVC and I am new to it. I want to check, using [RegularExpression] validation, that input values contain only Persian characters.
So I plan to use a regex that checks a range of Unicode codepoints, but I don't know how to find the Unicode range for Persian characters. Am I right to use a regex for this? What do you suggest, and how can I find the Persian Unicode range?
Persian characters are within the range: [\u0600-\u06FF]
Try:
Regex.IsMatch(value, @"^[\u0600-\u06FF]+$")
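Since the question asks about [RegularExpression], here is a hedged sketch of the same range used as a data annotation. PersonModel, its Name property and the error message are hypothetical; the attribute itself comes from System.ComponentModel.DataAnnotations:

```csharp
using System;
using System.ComponentModel.DataAnnotations;

// Hypothetical view model; the property name and message are illustrative only.
public class PersonModel
{
    [RegularExpression(@"^[\u0600-\u06FF]+$",
        ErrorMessage = "Only Persian characters are allowed.")]
    public string Name { get; set; }
}

public static class AttributeDemo
{
    // Runs the same check the attribute performs during model validation.
    public static bool IsValid(string value) =>
        new RegularExpressionAttribute(@"^[\u0600-\u06FF]+$").IsValid(value);

    public static void Main()
    {
        Console.WriteLine(IsValid("\u0633\u0644\u0627\u0645")); // "salam" in Persian script
        Console.WriteLine(IsValid("hello"));
    }
}
```

RegularExpressionAttribute treats null and empty strings as valid, so pair it with [Required] if the field must be filled in.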
Check against the range from the first to the last letter of the Persian alphabet. I think something like this:
"^[آ-ی]$"
Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")
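This covers only the Arabic block. Persian also uses four letters that Arabic does not:
ژ \u0698 (\uFB8A is its isolated presentation form)
پ \u067E
چ \u0686
گ \u06AF
The four standard codepoints already fall inside \u0600-\u06FF; only the presentation form \uFB8A lies outside it. To accept that form as well, you can use:
^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF]+$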
If you also want to match the zero-width non-joiner (ZWNJ), add this too:
\u200C
TL;DR
All answers that say use \u0600-\u06FF or [آ-ی] are simply WRONG.
That is, \u0600-\u06FF contains 209 more characters than you need, and it includes numbers too!
The character sets that Farsi actually needs are as follows:
Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters.
Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers.
Use [ ٌ ًّ َ ِ ُ ْ ] for vowels.
Or a union of those. You may want to add other Arabic letters like Hamza ء to your character set additionally.
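The explicit sets above could be wired up as follows. This is a sketch only: FarsiValidator and its methods are hypothetical names, and the \u escapes are my transcription of the letters and digits listed above (an assumption worth checking against a Unicode chart), written as escapes so the source file does not depend on editor encoding:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FarsiValidator
{
    // Assumed codepoints for the letters listed above. Ranges are used only
    // where the listed letters are contiguous in Unicode.
    private const string Letters =
        @"\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698" +
        @"\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC";

    // Persian digits U+06F0..U+06F9, not the Arabic-Indic U+0660..U+0669.
    private const string Digits = @"\u06F0-\u06F9";

    private static readonly Regex LettersOnly = new Regex("^[" + Letters + "]+$");
    private static readonly Regex DigitsOnly = new Regex("^[" + Digits + "]+$");

    public static bool IsFarsiLetters(string s) => LettersOnly.IsMatch(s);
    public static bool IsFarsiDigits(string s) => DigitsOnly.IsMatch(s);

    public static void Main()
    {
        Console.WriteLine(IsFarsiLetters("\u0633\u0644\u0627\u0645")); // "salam"
        Console.WriteLine(IsFarsiLetters("\u0643\u062A\u0627\u0628")); // starts with Arabic kaf U+0643
        Console.WriteLine(IsFarsiDigits("\u06F1\u06F2\u06F3"));        // Persian 1 2 3
        Console.WriteLine(IsFarsiDigits("\u0661"));                    // Arabic-Indic 1
    }
}
```

Whether to accept spaces, ZWNJ (\u200C) or Hamza is a policy decision; each can be appended to the Letters constant if needed.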
This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not denote Persian / Farsi alphabet (neither does [آ-ی]):
[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏
۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ
ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ
ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ
ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]
255 characters fall in this range. The Farsi alphabet has 32 letters; adding the Farsi digits gives 42. If we also add the vowel marks (originally Arabic and rarely used in Farsi), Tanvin (ً, ٍِ , ٌ ) and Tashdid (ّ ), which are Arabic diacritics rather than Farsi ones, we end up with 46 characters. This means:
\u0600-\u06FF contains 209 more characters than you need!
۷ with codepoint 06F7 is a Farsi representation of number 7 and ٧ with codepoint 0667 is Arabic representation of the same number. ۶ is Farsi representation of number 6 and ٦ is Arabic representation of the same number. And all reside in 0600 through 06FF codepoints.
The shapes of the Persian digits four (۴), five (۵), and six (۶) are different from the shapes used in Arabic and the other numbers have different codepoints.
The range also contains many other characters that do not exist in Farsi / Persian, and nobody wants to accept them when validating a first name or surname.
[آ-ی] likewise includes 117 characters, far more than validation needs. You can see them all using Unicode CLDR.
I use this regex in my program, and it works correctly. I hope it helps:
[پچجحخهعغفقثصضشسیبلاتنمکگوئدذرزطظژؤآإأءًٌٍَُِّ\s]+$
Persian characters are within the range: [\u0600-\u06FF] + [\s]
Try:
Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")
This pattern matches letters and space characters.
I have the following string:
test123 test ödo 123teö"st 123 m.1212 123t.est
I only want to match whole words that have letters, digits and special characters mixed together. So the regex should match the following parts of the example above:
test123 test ödo 123teö"st 123 m.1212 123t.est
Could someone help me out please?
Update
Sorry for not giving a clear explanation of what I need.
I am using C#.
I need to find words that contain alphanumeric strings (e.g. abc123, 123abc, a1b2c3, 1abc23 etc.). I also need to find strings that contain any kind of symbol (symbol = anything other than word characters and digits) (e.g. abc"123, "abc, ab?dd, 100mm", 345t{asd]dd).
If I find a match, I need to "tokenize" these strings (separate digits, word characters and symbols with whitespace), so abc123 becomes abc 123 and 345t{asd]dd becomes 345 t { asd ] dd, etc.
Assuming you're using a regex flavor that supports lookaheads and Unicode properties, this should get you started:
(?!(?:\pL+|\pN+|\pP+)(?!\S))\S+
\S+ matches one or more non-whitespace characters, but only after the negative lookahead asserts that those characters are not all letters (\pL), digits (\pN), or punctuation (\pP). The inner negative lookahead--(?!\S)--ensures that the outer one examines all the characters in the word.
Although it might satisfy your requirements, this regex is really just a demonstration of the technique you'll probably want to use. As it is, it can be fooled by "words" with (for example) control characters or dingbats in them.
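One possible way to combine that regex with the tokenizing step from the question, sketched for .NET. MixedTokenizer, Tokenize and the Piece pattern are my own illustrative additions, not part of the answer. Note that .NET requires the braced form \p{L} rather than \pL:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MixedTokenizer
{
    // Same technique as the regex above, with \p{L}-style braces for .NET.
    private static readonly Regex MixedWord =
        new Regex(@"(?!(?:\p{L}+|\p{N}+|\p{P}+)(?!\S))\S+");

    // A run of letters, a run of digits, or any single other non-space character.
    private static readonly Regex Piece =
        new Regex(@"\p{L}+|\p{N}+|[^\p{L}\p{N}\s]");

    // Rewrites each mixed word as its pieces separated by single spaces;
    // words made of one class only are left untouched.
    public static string Tokenize(string input)
    {
        return MixedWord.Replace(input, word =>
            string.Join(" ", Piece.Matches(word.Value).Cast<Match>().Select(p => p.Value)));
    }

    public static void Main()
    {
        Console.WriteLine(Tokenize("abc123"));
        Console.WriteLine(Tokenize("345t{asd]dd"));
        Console.WriteLine(Tokenize("test 123 m.1212"));
    }
}
```

As the answer warns, words containing symbols or control characters are treated as mixed too, so a run such as "$$" would be split into separate pieces.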
The answer to the question you’ve actually asked is (?s).+, but perhaps you would care to refine your question.
I am maintaining some Java code that I am currently converting to C#.
The Java code is doing this:
sendString(somedata + '\000');
And in C# I am trying to do the same:
sendString(somedata + '\000');
But on the '\000' VS2010 tells me that "Too many characters in character literal". How can I use '\000' in C#? I have tried to find out what the character is, but it seems to be " " or some kind of newline-character.
Do you know anything about the issue?
Thanks!
'\0' will be just fine in C#.
What's happening is that C# sees \0 and converts that to a nul-character with an ASCII value of 0; then it sees two more 0s, which is illegal inside a character (since you used single quotes, not double quotes). The nul-character is typically not printable, which is why it looked like an empty string when you tried to print it.
What you've typed in Java is a character literal supporting an octal number. C# does not support octal literals in characters or numbers, in an effort to reduce programming mistakes.*
C# does support Unicode escapes of the form '\u0000', where 0000 is exactly four hexadecimal digits. (The variable-length form, with one to four hex digits, is \x.)
* In PHP, for example, if you type in a number with a leading zero that is a valid octal number, it gets translated. If it's not a legal octal number, it doesn't get translated correctly. <? echo 017; echo ", "; echo 018; ?> outputs 15, 1 on my machine.
That's a null character, also known as NUL. You can write it as '\0' in C#.
In C# the string "\000" represents three characters: the null character, followed by two zero digits. Since a character literal can only contain one character, this is why you get the error "Too many characters in character literal".
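A short sketch of the C# side of the port (NulDemo and Terminate are illustrative names; sendString from the question is not reproduced here):

```csharp
using System;

public static class NulDemo
{
    // Appends the NUL terminator that the Java code wrote as '\000'.
    public static string Terminate(string data) => data + '\0';

    public static void Main()
    {
        string payload = Terminate("somedata");
        Console.WriteLine(payload.Length);                     // one more than "somedata"
        Console.WriteLine((int)payload[payload.Length - 1]);   // numeric value of the last char
        Console.WriteLine('\0' == '\u0000');                   // two spellings of the same char
        Console.WriteLine("\000".Length);                      // NUL followed by the digits '0' '0'
    }
}
```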