How to make a regular expression for combining characters?

I am working on an application in which I need a regular expression to detect combining characters. I have written the following regex:
string regex = @"^([~.][a-z])";
I have to detect combining characters that are separated from their base character because they do not exist in the font, so I have to check two characters: one is the symbol and the other is any character, e.g. ~a.
The problem is that I am not able to paste the exact shape of the symbols. I am using this link:
http://en.wikipedia.org/wiki/Combining_character
When I paste them into the regex, their shape changes.
How can I make a regex that detects the specific combining characters provided in the regex?

Use Unicode properties:
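\p{L}\p{M}*
(Some flavors also accept the possessive form \p{L}\p{M}*+. The .NET engine has no possessive quantifiers, so use the plain * there, or an atomic group such as (?>\p{M}*).)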
\p{L} any kind of letter from any language (but not combined ones!)
\p{M} a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
See regular-expressions.info/unicode for more details (chapter Unicode Categories)
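To get around the pasting problem from the question, the marks can be written as \uXXXX escapes instead of literal characters. The following is only a sketch: the class and method names are made up for illustration, and it assumes the combining mark follows its base letter (the usual Unicode order) and that the marks of interest are U+0303 (combining tilde) and U+0307 (combining dot above).

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombiningMarks
{
    // Any letter followed by at least one combining mark.
    private static readonly Regex LetterWithMarks = new Regex(@"\p{L}\p{M}+");

    // Only specific marks, written as escapes so nothing has to be pasted:
    // U+0303 COMBINING TILDE, U+0307 COMBINING DOT ABOVE (assumed choice).
    private static readonly Regex TildeOrDotAbove = new Regex(@"[a-z][\u0303\u0307]");

    public static bool HasCombiningSequence(string s) => LetterWithMarks.IsMatch(s);

    public static bool HasTildeOrDotAbove(string s) => TildeOrDotAbove.IsMatch(s);

    public static void Main()
    {
        Console.WriteLine(HasCombiningSequence("a\u0303")); // True: a + combining tilde
        Console.WriteLine(HasCombiningSequence("\u00E3"));  // False: precomposed letter, no separate mark
        Console.WriteLine(HasTildeOrDotAbove("a\u0307"));   // True: a + combining dot above
        Console.WriteLine(HasTildeOrDotAbove("a\u0301"));   // False: combining acute is not in the list
    }
}
```

If the text may contain precomposed letters as well, normalizing with string.Normalize(NormalizationForm.FormD) before matching would split them into base letter plus mark.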

Related

How split text with parenthesis into words using regular expression

Here is a text with English words, CJK characters and fullwidth parenthesis(\uff08 and \uff09):
这是（一段测试）文字（start开始end）的结果
I want to split the text into words; for CJK characters, one character is a word. The special point is that I also want the fullwidth left parenthesis \uff08 to combine with the word after it, and the fullwidth right parenthesis \uff09 to combine with the word before it.
The expected result will be:
这
是
（一
段
测
试）
文
字
（start
开
始
end）
的
结
果
Currently, I use new Regex(@"(\s+)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])"); to split the text, but the fullwidth parentheses are not combined with the word before/after them.

You can add those special cases:
(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))
and
((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)
to your regex, giving you a complete regex of:
(\s+)|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])
Note they need to be added to the regex prior to the part of the regex that could match the word on its own, otherwise that match will take precedence.
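As a rough sketch of how the complete regex could be applied, the following collects matches with Regex.Matches rather than splitting, and drops whitespace-only matches. The class and method names are hypothetical, and skipping whitespace is an assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CjkTokenizer
{
    // The complete regex from the answer, one alternative per line.
    // The parenthesis alternatives come before the plain-word alternatives.
    private static readonly Regex Token = new Regex(
        @"(\s+)" +
        @"|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))" +
        @"|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)" +
        @"|([\u0000-\u001F\u0021-\u007F]+)" +
        @"|([^\u0000-\u007F])");

    public static List<string> Tokenize(string text)
    {
        var words = new List<string>();
        foreach (Match m in Token.Matches(text))
        {
            // Skip whitespace runs matched by the first alternative.
            if (!string.IsNullOrWhiteSpace(m.Value))
            {
                words.Add(m.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        string text = "这是（一段测试）文字（start开始end）的结果";
        // Prints: 这|是|（一|段|测|试）|文|字|（start|开|始|end）|的|结|果
        Console.WriteLine(string.Join("|", Tokenize(text)));
    }
}
```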

C# Regex Match whole word, with special characters

I have searched through some questions but couldn't find the exact answer I am looking for.
I have a requirement to search through large strings of text looking for keyword matches. I was using IndexOf; however, I need to find whole-word matches, e.g. if I search for Java but the text contains JavaScript, it shouldn't match. This works fine using \b{pattern}\b, but if I search for something like C#, it doesn't work.
Below are a few examples of the text strings that I am searching through:
languages include Java,JavaScript,MySql,C#
languages include Java/JavaScript/MySql/C#
languages include Java, JavaScript, MySql, C#
Obviously the issue is with the special character '#'; so this also doesn't work when searching for C++.

Escape the pattern using Regex.Escape and replace the context-dependent \b word boundaries with (?<!\w) / (?!\w) lookarounds:
var rx = $@"(?<!\w){Regex.Escape(pattern)}(?!\w)";
The (?<!\w) is a negative lookbehind that fails the match if there is a word char immediately before the current location (so it succeeds at the start of the string or after a non-word char), and (?!\w) is a negative lookahead that fails the match if there is a word char immediately after the current location (so it succeeds at the end of the string or before a non-word char).
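A minimal sketch of the lookaround approach, checked against the sample strings from the question. The helper name ContainsWholeWord is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordSearch
{
    // True if keyword occurs in text without a word character directly
    // before or after it. Regex.Escape makes characters such as + and #
    // in the keyword match literally.
    public static bool ContainsWholeWord(string text, string keyword)
    {
        string rx = $@"(?<!\w){Regex.Escape(keyword)}(?!\w)";
        return Regex.IsMatch(text, rx);
    }

    public static void Main()
    {
        string text = "languages include Java,JavaScript,MySql,C#";
        Console.WriteLine(ContainsWholeWord(text, "C#"));                             // True
        Console.WriteLine(ContainsWholeWord(text, "Java"));                           // True
        Console.WriteLine(ContainsWholeWord("languages include JavaScript", "Java")); // False
        Console.WriteLine(ContainsWholeWord("C, C++ and C#", "C++"));                 // True
    }
}
```

Note that a keyword ending in a word character is still found in front of punctuation, so searching for C would also match the C in C#; if that matters, the trailing lookahead needs to exclude those characters as well.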

Yeah, this is because there isn't a word boundary (a \b) after the #, because # isn't a "word" character. You could use a regular expression like the following, which searches for a character that isn't a part of a language name [^a-zA-Z+#] after the language:
\b{pattern}[^a-zA-Z+#]
Or, if you believe you can list all of the possible characters that aren't part of a language name (for example, whitespace, ,, ., and ;):
[\s,.;]{pattern}[\s,.;]
Alternately, if it is possible that a language name is at the very end of a string (depending on what you're getting the data from), you might need to also match the end of the string $ in addition to the separators, or similarly, the beginning of the string ^.
[\s,.;]{pattern}(?:[\s,.;]|$)

How to allow only first punctuation mark in string with different marks sequence between words

If I need to allow only first punctuation mark in string with different punctuation marks sequence between words, for example if string is:
string str = "hello,.,.,.world.,.?,.";
in result I want get this:
hello, world.
It would be good to know both how to process such a string after it has been entered and how to prevent more than one punctuation mark and one whitespace character between words from being typed directly into the textbox.

You can try this: (?<=[,.])[,.?]+.
See it working here: https://regex101.com/r/di5Ebw/1.
If you have a specific list of punctuation characters that you want to strip, adjust the [,.] class accordingly.
(So in the example I give you the match is on the chars you want to remove: just replace that match with empty string - as you can see in the SUBSTITUTION panel at the bottom)
[EDIT] Extend the match cases.
If you don't want to bother let this do it for you: (?<=\W)(?<! )\W+
See it working here: https://regex101.com/r/di5Ebw/2

.NET regular expressions have a punctuation class, so a simple way of achieving the required result is to search for the string (\w\p{P})\p{P}+ and replace with $1.
For a regular expression that handles exactly the few punctuation characters used in the question the regular expression (\w[.,?])[.,?]+ can be used.
(Note, the above shows the regular expressions. Their C# strings are "(\\w\\p{P})\\p{P}+" and "(\\w[.,?])[.,?]+".)
Explanation. This looks for a word character (\w) followed by one punctuation character and captures these two characters. Any immediately following punctuation characters are matched by the \p{P}+. The whole match is replaced by the capture.
The \p{name} construct is defined here as "Matches any single character in the Unicode general category or named block specified by name.".
The \p{P} category is defined here as "All punctuation characters". There are also several subcategories of punctuation, but it may be best to look at Unicode to understand them.
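A short sketch of the replacement in C#. The first step is the (\w\p{P})\p{P}+ replacement described above; on its own it yields "hello,world." for the sample string. The second step, which inserts a space after a mark that is directly followed by a word character, is an added assumption to reach the "hello, world." form from the question, and it would also affect input such as decimal numbers. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationCleaner
{
    public static string KeepFirstMark(string input)
    {
        // Step 1: keep the word character and the first mark, drop the rest of the run.
        string collapsed = Regex.Replace(input, @"(\w\p{P})\p{P}+", "$1");

        // Step 2 (assumption): add one space after a mark that touches the next word.
        return Regex.Replace(collapsed, @"(\p{P})(?=\w)", "$1 ");
    }

    public static void Main()
    {
        string str = "hello,.,.,.world.,.?,.";
        Console.WriteLine(KeepFirstMark(str)); // hello, world.
    }
}
```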

Regex to allow non-ascii and foreign letters?

Is it possible to create a regular expression to allow non-ascii letters along with Latin alphabets, for example Chinese or Greek symbols(eg. A汉语AbN漢語 allowed)?
I currently have the following ^[\w\d][\w\d_\-\.\s]*$ which only allows Latin alphabets.

In .NET,
^[\p{L}\d_][\p{L}\d_.\s-]*$
is equivalent to your regex, additionally allowing other Unicode letters.
Explanation:
\p{L} is a shorthand for the Unicode property "Letter".
Caveat: I think you wanted to not allow the underscore as initial character (evidenced by its presence only in the second character class). Since \w includes the underscore, your regex did allow it, though. You might want to remove it from the first character class in my solution (it's not included in \p{L}, of course).
In ECMAScript, things are not so easy. You would have to define your own Unicode character ranges. Fortunately, a fellow StackOverflow user has already risen to the occasion and designed a JavaScript regex converter:
https://stackoverflow.com/a/8933546/20670
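For the .NET pattern above, a quick check might look like the following sketch. The class name and the sample inputs other than A汉语AbN漢語 are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidator
{
    // First character: a letter from any script, a digit, or an underscore.
    // Remaining characters may also be a dot, whitespace, or a hyphen.
    private static readonly Regex Allowed = new Regex(@"^[\p{L}\d_][\p{L}\d_.\s-]*$");

    public static bool IsValid(string s) => Allowed.IsMatch(s);

    public static void Main()
    {
        Console.WriteLine(IsValid("A汉语AbN漢語")); // True: Latin and CJK letters
        Console.WriteLine(IsValid("αβγ-1.0"));     // True: Greek letters, hyphen, digits, dot
        Console.WriteLine(IsValid("-abc"));        // False: a hyphen is not allowed first
        Console.WriteLine(IsValid(""));            // False: at least one character is required
    }
}
```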

Regex for non-alphabets and non-numerals

Please provide a way to write the following regular expression in C#/.NET:
I need a regex for non-alphabetic (a to z; A to Z) and non-numeric (0 to 9) characters.
That is, the reverse case: a regular expression that matches anything other than letters and digits (0 to 9).
Kindly suggest a solution.

You can use a negated character class here:
[^a-zA-Z0-9]
Above regex will match a single character which can't be a latin lowercase or uppercase letter or a number.
The ^ at the start of the character class (the part between [ and ]) negates the complete class so that it matches anything not in the class, instead of normal character class behavior.
To make it useful, you probably want one of those:
Zero or more such characters
[^a-zA-Z0-9]*
The asterisk (*) here signifies that the preceding part can be repeated zero or more times.
One or more such characters
[^a-zA-Z0-9]+
The plus (+) here signifies that the preceding part can be repeated one or more times.
A complete (possibly empty) string, consisting only of such characters
^[^a-zA-Z0-9]*$
Here the characters ^ and $ have a meaning as anchors, matching the start and end of the string, respectively. This ensures that the entire string consists of characters not in that character class and no other characters come before or after them.
A complete (non-empty) string, consisting only of such characters
^[^a-zA-Z0-9]+$
Elaborating a bit, this won't (and can't) make sure that you won't use any other characters, possibly from other scripts. The string аеΒ would be completely valid with the above regular expression, because it uses letters from Greek and Cyrillic. Furthermore there are other pitfalls. The string á will pass above regular expression, while the string ́a will not (because it constructs the letter á from the letter a and a combining diacritical mark).
So negated character classes have to be taken with care at times.
I can also use numerals from other scripts, if I wanted to: ١٢٣ :-)
You can use the character class
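[^\p{L}\p{Nd}]
if you need to take care of the above things. (Flavors such as PCRE also offer \p{L&} for cased letters only; .NET does not recognize that name, so use \p{L}, or \p{Lu}\p{Ll}\p{Lt} for cased letters only.)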
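To make the pitfall concrete, here is a small comparison of the ASCII-only negated class with a Unicode-aware one. It is a sketch only: the names are hypothetical, and the Unicode-aware class uses \p{L} (all letters) with \p{Nd}. The non-Latin samples are written as escapes so the code points are unambiguous.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAlnum
{
    // Whole string must consist of characters outside a-z, A-Z, 0-9.
    private static readonly Regex AsciiOnly = new Regex(@"^[^a-zA-Z0-9]+$");

    // Whole string must contain no letter and no decimal digit from any script.
    private static readonly Regex UnicodeAware = new Regex(@"^[^\p{L}\p{Nd}]+$");

    public static bool PassesAscii(string s) => AsciiOnly.IsMatch(s);

    public static bool PassesUnicode(string s) => UnicodeAware.IsMatch(s);

    public static void Main()
    {
        string cyrillicGreek = "\u0430\u0435\u0392"; // Cyrillic a, Cyrillic ie, Greek capital Beta
        string arabicDigits = "\u0661\u0662\u0663";  // Arabic-Indic digits 1 2 3

        Console.WriteLine(PassesAscii(cyrillicGreek));   // True: none of them is in a-z, A-Z, 0-9
        Console.WriteLine(PassesUnicode(cyrillicGreek)); // False: all three are letters
        Console.WriteLine(PassesAscii(arabicDigits));    // True
        Console.WriteLine(PassesUnicode(arabicDigits));  // False: all three are decimal digits
        Console.WriteLine(PassesUnicode("!?#"));         // True
    }
}
```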

just negate the class:
[^A-Za-z0-9]

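To obey the locale setting, use a POSIX bracket expression (supported in POSIX-style flavors, but not by the .NET regex engine):
[^[:alnum:]]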
