Here is a text with English words, CJK characters and fullwidth parentheses (\uff08 and \uff09):
这是(一段测试)文字(start开始end)的结果
I want to split the text into words; for CJK characters, each character is a word. The special requirement is that the fullwidth left parenthesis \uff08 should be combined with the word after it, and the fullwidth right parenthesis \uff09 with the word before it.
The expected result will be:
这
是
(一
段
测
试)
文
字
(start
开
始
end)
的
结
果
Currently, I use new Regex(@"(\s+)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])"); to split the text, but the fullwidth parentheses are not combined with the word before/after them.
You can add those special cases:
(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))
and
((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)
to your regex, giving you a complete regex of:
(\s+)|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])
Demo on regex101
Note they need to be added to the regex prior to the part of the regex that could match the word on its own, otherwise that match will take precedence.
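To see the ordering in action, here is a minimal C# sketch that runs the complete regex over the sample text. The class and method names (CjkTokenizer, Tokenize) are made up for illustration, and it assumes the sample really contains the fullwidth parentheses \uff08 and \uff09.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CjkTokenizer
{
    // The parenthesis-aware alternatives come before the plain-word
    // alternatives, so they win when both could match at one position.
    private static readonly Regex Token = new Regex(
        @"(\s+)" +
        @"|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007F]+))" +
        @"|((?:[^\u0000-\u007F]|[\u0021-\u007F]+)\uff09)" +
        @"|([\u0000-\u001F\u0021-\u007F]+)" +
        @"|([^\u0000-\u007F])");

    // Returns the non-whitespace tokens in order of appearance.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(text))
        {
            if (!m.Groups[1].Success)   // group 1 is the whitespace run
                tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        string text = "这是\uff08一段测试\uff09文字\uff08start开始end\uff09的结果";
        Console.WriteLine(string.Join("\n", Tokenize(text)));
    }
}
```

This uses Regex.Matches rather than Regex.Split, which is a choice made for the sketch: every alternative is a token, so collecting the matches and skipping the whitespace group gives the word list directly.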
I need to keep only the first punctuation mark of each run of punctuation marks between words. For example, if the string is:
string str = "hello,.,.,.world.,.?,.";
the result I want is this:
hello, world.
It would be good to know both how to clean up such a string after it has been entered and how to prevent more than one punctuation mark and one whitespace character between words from being typed into the textbox in the first place.
You can try this: (?<=[,.])[,.?]+.
See it working here: https://regex101.com/r/di5Ebw/1.
If you need a different set of punctuation characters to strip, adjust the character classes (for example [,.]).
(In the example, the match covers the characters you want to remove: just replace that match with an empty string, as you can see in the SUBSTITUTION panel at the bottom.)
[EDIT] Extend the match cases.
If you don't want to list the characters yourself, this does it for you: (?<=\W)(?<! )\W+
See it working here: https://regex101.com/r/di5Ebw/2
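In C#, the "replace the match with an empty string" step could look like the sketch below. The class and method names are hypothetical; only Regex.Replace and the two patterns come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationTrimmer
{
    // Removes any run of ',', '.' or '?' that directly follows a ',' or '.'.
    public static string TrimListed(string input) =>
        Regex.Replace(input, @"(?<=[,.])[,.?]+", "");

    // Same idea without listing characters: removes a run of non-word
    // characters that follows a non-word character other than a space.
    public static string TrimAnyNonWord(string input) =>
        Regex.Replace(input, @"(?<=\W)(?<! )\W+", "");

    public static void Main()
    {
        string str = "hello,.,.,.world.,.?,.";
        Console.WriteLine(TrimListed(str));
        Console.WriteLine(TrimAnyNonWord(str));
    }
}
```

For the sample string both methods give hello,world. (no space is inserted; the patterns only remove characters).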
.NET regular expressions have a punctuation class, so a simple way of achieving the required result is to search for (\w\p{P})\p{P}+ and replace each match with $1.
For a regular expression that handles exactly the few punctuation characters used in the question the regular expression (\w[.,?])[.,?]+ can be used.
(Note, the above shows the regular expressions. Their C# strings are "(\\w\\p{P})\\p{P}+" and "(\\w[.,?])[.,?]+".)
Explanation. This looks for a word character (\w) followed by one punctuation character and captures these two characters. Any immediately following punctuation characters are matched by the \p{P}+. The whole match is replaced by the capture.
The \p{name} construct is defined here as "Matches any single character in the Unicode general category or named block specified by name."
The \p{P} category is defined here as "All punctuation characters". There are also several subcategories of punctuation, but it may be best to look at Unicode to understand them.
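A small sketch of the capture-and-replace approach, using C# verbatim strings so the backslashes do not need doubling. PunctuationCollapser and its method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationCollapser
{
    // Keeps the word character plus the first punctuation mark ($1)
    // and drops any punctuation that immediately follows.
    public static string Collapse(string input) =>
        Regex.Replace(input, @"(\w\p{P})\p{P}+", "$1");

    // Variant limited to the three marks used in the question.
    public static string CollapseListed(string input) =>
        Regex.Replace(input, @"(\w[.,?])[.,?]+", "$1");

    public static void Main()
    {
        string str = "hello,.,.,.world.,.?,.";
        Console.WriteLine(Collapse(str));
        Console.WriteLine(CollapseListed(str));
    }
}
```

Both calls turn the sample into hello,world. Neither pattern adds the space shown in the question's desired output, so that would need a separate step.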
I need to find all the words that have between 15 and 20 characters in a big string, and I want to avoid matching a long word with something else attached at the end (for example 1234567890abcdef#asdf.com). I don't want that to be a result, only words. Right now I'm splitting the string using whitespace as the delimiter, and I apply the following regular expression to each word:
^[a-zA-Z0-9]{15,20}$
Is there any chance to do both things using one regular expression?
I'm using C#.
Good examples to catch:
1234567890abcdeg
qwertyuiopasdfgh
1234567890abcdeg, (catch it but remove ",")
Examples to avoid: 1234567890abcdeg#gmail.com
Don't use start/end anchors (^/$), but word delimiters (\b):
\b[a-zA-Z0-9]{15,20}(?=[\s,]|$)
I used (?=[\s,]|$) instead of the end delimiter to force a space character or a comma or the end of the string. Expand it as needed.
You may want to do likewise for the first \b if you need to, for instance: (?<=\s|^).
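As a rough illustration, the sketch below runs both the \b version and the stricter lookbehind version over the whole string with Regex.Matches, so no manual splitting is needed. The class name and the strict flag are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LongWordFinder
{
    // Word boundary, 15-20 alphanumerics, then whitespace, a comma,
    // or the end of the input (checked but not consumed).
    private static readonly Regex Loose =
        new Regex(@"\b[a-zA-Z0-9]{15,20}(?=[\s,]|$)");

    // Variant that also requires whitespace or start of input before the word.
    private static readonly Regex Strict =
        new Regex(@"(?<=\s|^)[a-zA-Z0-9]{15,20}(?=[\s,]|$)");

    public static string[] Find(string text, bool strict = false) =>
        (strict ? Strict : Loose).Matches(text)
            .Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string text = "1234567890abcdeg qwertyuiopasdfgh 1234567890abcdeg, "
                    + "1234567890abcdeg#gmail.com";
        foreach (string word in Find(text))
            Console.WriteLine(word);
    }
}
```

Because the comma sits in a lookahead, it is never part of the match, which covers the "catch it but remove the comma" case. The strict variant differs for input such as foo#1234567890abcdeg, where \b alone would still accept the part after the #.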
Normally, you would use word boundaries (\b) before and after the alphanumerics:
\b[a-zA-Z0-9]{15,20}\b
However, there's a small detail to take into account: underscores ("_") are also considered word characters. The previous regex won't match the following text:
12345678901234567_
In order to avoid it, you can check if it's preceded and followed by either a \b or a "_", with lookarounds.
Regex:
(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)
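The difference between the two patterns can be checked with a short sketch like this one (UnderscoreAwareFinder and its members are hypothetical names; AwareMatch returns an empty string when nothing matches, which is a choice made for the example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnderscoreAwareFinder
{
    private static readonly Regex Plain =
        new Regex(@"\b[a-zA-Z0-9]{15,20}\b");

    // Accepts an underscore on either side in place of a word boundary.
    private static readonly Regex Aware =
        new Regex(@"(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)");

    public static bool PlainMatches(string s) => Plain.IsMatch(s);

    // Returns the first match, or "" when there is none.
    public static string AwareMatch(string s)
    {
        Match m = Aware.Match(s);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string s = "12345678901234567_";
        Console.WriteLine(PlainMatches(s));   // no boundary before '_'
        Console.WriteLine(AwareMatch(s));
    }
}
```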
I am working on an application in which I need a regular expression to detect combining characters. I have written the following regex:
string regex = @"^([~.][a-z])";
I have to detect combining characters that are separated from their base character because they do not exist in the font, so I have to check two characters: one is the symbol and the other is any character, e.g. ~a.
The problem is that I am not able to paste the exact shape of the symbols. I am using this link:
http://en.wikipedia.org/wiki/Combining_character
When I paste them into the regex, their shape changes.
How can I make a regex that detects the specific combining characters provided in the regex?
Use Unicode properties:
\p{L}\p{M}*+
\p{L} any kind of letter from any language (but not combined ones!)
\p{M} a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
See regular-expressions.info/unicode for more details (chapter Unicode Categories)
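A hedged C# sketch of the Unicode-property approach follows. As far as I know, the .NET regex engine does not accept the possessive *+ quantifier, so the sketch uses a plain quantifier instead; it also uses + rather than * because the goal here is to find letters that actually carry a mark. The marks are written as \u escapes in the sample string, which sidesteps the pasting problem. CombiningMarks and Find are invented names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombiningMarks
{
    // A base letter followed by one or more combining marks.
    private static readonly Regex LetterWithMarks = new Regex(@"\p{L}\p{M}+");

    public static string[] Find(string text) =>
        LetterWithMarks.Matches(text)
            .Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // U+0303 is COMBINING TILDE, U+0301 is COMBINING ACUTE ACCENT.
        string text = "a\u0303bc e\u0301";
        foreach (string hit in Find(text))
            Console.WriteLine(hit.Length);   // each hit is letter + mark(s)
    }
}
```

If only a specific set of marks should count, a range such as [\u0300-\u036F] (the Combining Diacritical Marks block) could replace \p{M}.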
I want to find the href of an anchor tag, so I have used this regex:
<a\s*[^>]*\s*href\s*\=\s*([^(\s*|\>)]*)\s*[^>]*>\s*Text\s*<\/a>
Options = IgnoreCase + Singleline
Example:
<a href="/abc/xzy/pqr.com" class="m">Text</a>
So Group[1]="/abc/xzy/pqr.com"
But if the content is like
<a href="/abc/xzy/ //Contains new line
pqr.com" class="m">Text</a>
so Group[1]="/abc/xzy/
So I want to know how to get "/abc/xzy/pqr.com" when the content contains a new line (\r\n).
Your capture group is a bit odd: [^(\s*|\>)]* is a negated character class, so it matches any character that is not (, whitespace (\s), an asterisk *, |, > or ).
What you can do however is to put quotes before and after the capture group:
<a\s*[^>]*\s*href\s*\=\s*"([^(\s*|\>)]*)"\s*[^>]*>\s*Text\s*<\/a>
                         ^              ^
And then change the character class to [^"] (not quotes):
<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>
                           ^^^^
regex101 demo.
That said, it would be better to use a proper HTML parser instead of regex. Building a suitable regex is tedious because it is easy to overlook many different scenarios, but if you're certain of how your data comes through, regex might be a quick way to get what you need.
If you want to consider single quotes and no quotes at all in some cases, you might try this instead:
<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>
Updated regex101.
This regex uses (?:[^ ]|[\n\r])+ instead, which accepts non-space characters and newlines (and carriage returns, just in case). Note that \s matches spaces, tabs, newlines and form feeds.
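For reference, a minimal C# sketch using the double-quoted pattern with the options from the question. The capture still contains the line break, so the sketch strips \r and \n afterwards; that clean-up step is my assumption about what the asker wants, and HrefExtractor/Extract are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Double quotes are doubled ("") inside a C# verbatim string.
    private static readonly Regex Anchor = new Regex(
        @"<a\s*[^>]*\s*href\s*\=\s*""([^""]*)""\s*[^>]*>\s*Text\s*<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href value with line breaks removed,
    // or "" when the input does not match.
    public static string Extract(string html)
    {
        Match m = Anchor.Match(html);
        if (!m.Success) return string.Empty;
        // Assumption: the line break inside the attribute is unwanted.
        return m.Groups[1].Value.Replace("\r", "").Replace("\n", "");
    }

    public static void Main()
    {
        string html = "<a href=\"/abc/xzy/\r\npqr.com\" class=\"m\">Text</a>";
        Console.WriteLine(Extract(html));
    }
}
```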
I'm really a n00b when it comes to regular expressions. I've been trying to Split a string wherever there's a [----anything inside-----] for example.
string s = "Hello Word my name_is [right now I'm hungry] Julian";
string[] words = Regex.Split( s, "------");
The outcome would be "Hello Word my name_is " and " Julian"
The regex you want to use is:
Regex.Split( s, "\\[.*?\\]" );
Square brackets are special characters (specifying a character group), so they have to be escaped with a backslash. Inside the square brackets, you want any sequence of characters EXCEPT a close square bracket. There are a couple of ways to handle that. One is to specify [^\]]* (explicitly specifying "not a close square bracket"). The other, as I used in my answer, is to specify that the match is not greedy by affixing a question mark after it. This tells the regular expression processor not to greedily consume as many characters as it can, but to stop as soon as the next expression is matched.
@"\[.*?\]" will match the bracketed text.
Another way to write it:
Regex.Split(str, @"\[[^]]*\]");
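Put together as a small runnable sketch (BracketSplitter and its method names are made up for the example), the lazy and the negated-class patterns can be compared side by side:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketSplitter
{
    // Lazy version: stop at the first closing bracket.
    public static string[] SplitLazy(string s) =>
        Regex.Split(s, @"\[.*?\]");

    // Negated-class version: anything except ']' between the brackets.
    public static string[] SplitNegated(string s) =>
        Regex.Split(s, @"\[[^]]*\]");

    public static void Main()
    {
        string s = "Hello Word my name_is [right now I'm hungry] Julian";
        foreach (string part in SplitLazy(s))
            Console.WriteLine("'" + part + "'");
    }
}
```

For the sample string both methods return the two pieces from the question, "Hello Word my name_is " and " Julian", with the surrounding spaces kept.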