I'm trying to parse some Thai text according to the rules explained here http://www.thai-language.com/ref/spacing
Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.
To parse the text I tried simply looping, like
while( Char.IsLetterOrDigit(theText[i++]) ) {}
to find the next character that isn't a letter or digit. That works except for certain characters like this one (code point 3657, U+0E49)
which is the second character in this word (I think that's the character 'superscripting' the first character in the word).
This character doesn't seem to be categorized as anything by the Char class, ie:
Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)
all return false.
This character might be a 'tone' - how would that be identified using .NET?
According to Unicode specifications the character is mai tho and is in the group “mark, nonspacing (Mn).”
You can use the Char.GetUnicodeCategory() method to check the type. For non-spacing marks the type is 5, or UnicodeCategory.NonSpacingMark
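As a rough sketch of how that check could be folded into the scanning loop from the question: the class and method names below (ThaiScanner, IsWordChar, FindTokenEnd) are made up for illustration, and treating spacing combining marks as word characters too is an assumption, not something the question asked for.

```csharp
using System;
using System.Globalization;

public static class ThaiScanner
{
    // Hypothetical helper: letters, digits and combining marks all count
    // as part of a token, so tone marks such as mai tho do not end it.
    public static bool IsWordChar(char c)
    {
        if (char.IsLetterOrDigit(c)) return true;
        UnicodeCategory cat = char.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark;
    }

    // Returns the index of the first character at or after 'start' that is
    // not part of a token, or text.Length if the token runs to the end.
    public static int FindTokenEnd(string text, int start)
    {
        int i = start;
        while (i < text.Length && IsWordChar(text[i])) i++;
        return i;
    }
}
```

Unlike the original while loop, this version also checks the index against text.Length, so it does not run off the end of the string.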
Related
I have searched through some questions but couldn't find the exact answer I am looking for.
I have a requirement to search through large strings of text looking for keyword matches. I was using IndexOf; however, I need to find whole-word matches, e.g. if I search for Java but the text contains JavaScript, it shouldn't match. This works fine using \b{pattern}\b, but if I search for something like C#, then it doesn't work.
Below are a few examples of the text strings that I am searching through:
languages include Java,JavaScript,MySql,C#
languages include Java/JavaScript/MySql/C#
languages include Java, JavaScript, MySql, C#
Obviously the issue is with the special character '#'; so this also doesn't work when searching for C++.
Escape the pattern using Regex.Escape and replace the context-dependent \b word boundaries with (?<!\w) / (?!\w) lookarounds:
var rx = $@"(?<!\w){Regex.Escape(pattern)}(?!\w)";
The (?<!\w) is a negative lookbehind that fails the match if there is a word char immediately before the current location (so it requires the start of the string or a non-word char there), and (?!\w) is a negative lookahead that fails the match if there is a word char immediately after the current location (so it requires the end of the string or a non-word char there).
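A minimal, self-contained sketch of the lookaround approach, for trying it against the sample strings from the question. The class and method names (KeywordSearch, ContainsWholeWord) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordSearch
{
    // True if 'keyword' occurs in 'text' with no word character
    // directly before it and no word character directly after it.
    public static bool ContainsWholeWord(string text, string keyword)
    {
        // Regex.Escape turns "C#" into "C\#" and "C++" into "C\+\+",
        // so the keyword is matched literally.
        string rx = $@"(?<!\w){Regex.Escape(keyword)}(?!\w)";
        return Regex.IsMatch(text, rx);
    }
}
```

One thing to be aware of: because + and # are not word characters, searching for plain C with this pattern would also match the C in C++ or C#.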
Yeah, this is because there isn't a word boundary (a \b) after the #, because # isn't a "word" character. You could use a regular expression like the following, which searches for a character that isn't a part of a language name [^a-zA-Z+#] after the language:
\b{pattern}[^a-zA-Z+#]
Or, if you believe you can list all of the possible characters that aren't part of a language name (for example, whitespace, ,, ., and ;):
[\s,.;]{pattern}[\s,.;]
Alternately, if it is possible that a language name is at the very end of a string (depending on what you're getting the data from), you might need to also match the end of the string $ in addition to the separators, or similarly, the beginning of the string ^.
[\s,.;]{pattern}(?:[\s,.;]|$)
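The separator-based variant could be sketched like this, with both the start and the end of the string allowed as boundaries. The names SeparatorSearch and ContainsDelimited are made up, and the separator set is only the one listed above; it is an assumption that this covers your data.

```csharp
using System.Text.RegularExpressions;

public static class SeparatorSearch
{
    // Assumed separator set: whitespace, comma, full stop, semicolon.
    // Extend it (for example with '/') if your data uses other delimiters.
    private const string Sep = @"[\s,.;]";

    public static bool ContainsDelimited(string text, string keyword)
    {
        // The keyword must be preceded by a separator or the start of the
        // string, and followed by a separator or the end of the string.
        string rx = $"(?:^|{Sep}){Regex.Escape(keyword)}(?:{Sep}|$)";
        return Regex.IsMatch(text, rx);
    }
}
```

With this separator set the slash-delimited sample string from the question would not match Java, since / is not listed; that is the trade-off of enumerating separators by hand.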
I have a text (string) that I want in all upper case, except the following:
Words starting with : (colon)
Words or strings surrounded by double quotation marks, ""
Words or strings surrounded by single quotation marks, ''
Everything else should be replaced with its upper case counterpart, and formatting (whitespaces, line breaks, etc.) should remain.
How would I go about doing this using Regex (C# style/syntax)?
I think you are looking for something like this:
text = Regex.Replace(text, @":\w+|""[^""]*""|'[^']*'|(.)",
    match => match.Groups[1].Success ?
        match.Groups[1].Value.ToUpper() : match.Value);
:\w+ - match words with a colon.
"[^"]*"|'[^']*' - match quoted text. For escaped quotes, you may use:
"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'
(.) - capture anything else (you can also try ([^"':]*|.), it might be faster).
Next, we use a callback for Regex.Replace to do two things:
Determine if we need to keep the text as-is, or
Return the upper-case version of the text.
Working example: http://ideone.com/ORFU8
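For reference, here is the same idea wrapped in a small self-contained class. The names SelectiveUpper and Convert are hypothetical, and using ToUpperInvariant rather than ToUpper is my own choice, made so that the result does not depend on the current culture.

```csharp
using System.Text.RegularExpressions;

public static class SelectiveUpper
{
    // Upper-cases everything except :words and quoted strings.
    public static string Convert(string text)
    {
        return Regex.Replace(
            text,
            @":\w+|""[^""]*""|'[^']*'|(.)",
            // Group 1 only participates when the "anything else" branch
            // matched; the other branches are returned unchanged.
            m => m.Groups[1].Success
                ? m.Groups[1].Value.ToUpperInvariant()
                : m.Value);
    }
}
```

Line breaks are left alone because . does not match a newline by default, so those characters are simply never replaced.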
You can start with this RegEx:
\b(?<![:"'])(\w+?)(?!["'])\b
But of course you have to improve it by yourself, if it is not enough.
For example, this will also not find "dfgdfg' (mismatched quotation marks).
The word that is found is in the first capture group ($1).
I need help with a regular expression.
I have to match a string like this:
âãa34dc
The pattern that I have used:
\s*[a-zA-Z]+[a-zA-Z_0-9]*\s
but this pattern is not good enough to match this kind of string, e.g. âãa34dc
P.S. âã are Swedish characters.
Please help me find the correct pattern for this kind of string.
Do you actually want to restrict it to Swedish characters? In other words, should a German character not match? If so, then you'll probably have to enumerate the whole alphabet, and include that.
If what you really want is to match every alphabetic character, use the regular expression terms for matching all letters.
\w matches any word character, but that includes digits and some punctuation, such as the underscore. That's close, but not exactly what you want for your second term.
For the first term, where you don't want to include numbers, specifying that the character should be a Unicode 'letter' class will work. \p{L} specifies all Unicode characters that are a letter. This includes [a-zA-Z], and all the Swedish characters, and German, and Russian, etc.
Therefore, I think this regular expression is what you want:
\s*[\p{L}][\p{L}_0-9]*\s
If you want to include digits from other character sets, and some other punctuation, then you can use [\w]* for the second term.
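A quick way to compare the original ASCII-only character classes with the \p{L} version is sketched below. The class name IdentifierMatch is made up, and I have anchored both patterns with ^ and $ instead of the surrounding \s terms so that each call tests a single token; that anchoring is an assumption for the sake of the example.

```csharp
using System.Text.RegularExpressions;

public static class IdentifierMatch
{
    // The character classes from the question: ASCII letters only.
    private static readonly Regex AsciiOnly =
        new Regex(@"^[a-zA-Z]+[a-zA-Z_0-9]*$");

    // \p{L} accepts any Unicode letter as the first character and later on.
    private static readonly Regex AnyLetter =
        new Regex(@"^\p{L}[\p{L}_0-9]*$");

    public static bool MatchesAscii(string s) => AsciiOnly.IsMatch(s);
    public static bool MatchesAnyLetter(string s) => AnyLetter.IsMatch(s);
}
```

With these definitions, the sample string from the question is rejected by MatchesAscii and accepted by MatchesAnyLetter, while a string starting with a digit is rejected by both.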
Please give a set of rules.
According to your question:
[X-Ya-zA-Z]{3}[0-9]{2}[a-zA-Z]{2}
Replace X with the first swedish letter
Replace Y with the last swedish letter
John Machin provides a great answer for this. Adapting his pattern, what you need is probably something similar to: \s*[^\W\d_]\w*\s*
P.S. I removed the + quantifier from your first part. Any subsequent letters would be matched by the subsequent quantified \w.
I need a regular expression to match only the words that meet the following conditions. I am using it in my C# program.
Can be any case
Should not have any numbers
May contain - and ' characters, but they are optional
Should start with a letter
I have tried using the expression ^([a-zA-Z][\'\-]?)+$ but it doesn't work.
Here is a list of a few words that are acceptable:
London (Case insensitive)
Jackson's
non-profit
Here is a list of a few words that are not acceptable:
12london (contains a number and does not start with a letter)
-to (does not start with a letter)
to: (contains the : character; no special character other than - and ' is allowed)
^[a-zA-Z][-'a-zA-Z]*$
This matches any word that starts with an alphabetical character, followed by any number of alphabetical characters, - or '.
Note that you don't need to escape the - and ' inside the character class [], as long as the dash is either the first or the last character in the class.
Note also that I've removed the round brackets from your example - if you don't want to capture the input, you'll get better performance by leaving them out.
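To check the pattern against the sample words from the question, a small sketch like the following could be used (WordValidator and IsValid are hypothetical names):

```csharp
using System.Text.RegularExpressions;

public static class WordValidator
{
    // A letter first, then any mix of letters, hyphens and apostrophes.
    private static readonly Regex Word =
        new Regex(@"^[a-zA-Z][-'a-zA-Z]*$");

    public static bool IsValid(string s) => Word.IsMatch(s);
}
```

This accepts London, Jackson's and non-profit, and rejects 12london, -to and to: as required.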
Try this one:
^[A-Za-z]+[A-Za-z'-]*$
First of all, try your regexes against tools such as http://www.regextester.com/
You are testing strings that both start with AND end with your pattern (^ means start of line, $ is the end), thus leaving out all of the words contained between two spaces.
You should use \b or \B.
Instead of looking for [a-zA-Z] you can use character classes such as '\D' (not digit).
Let me know if the above is working in your scenario.
\b\D[^\c][a-zA-Z]+[^\c]
It says: word boundaries with no digits, no control characters, one or more alphabetical lower or uppercase character, with no following control characters.
I had a question on here for a RegularExpressionValidator which I'm relatively new to. It was to accept all alphanumeric, apostrophe, hyphen, underscore, space, ampersand, comma, parentheses, full stop.
The answer I was given was:
"^([a-zA-Z0-9 '-_&,()\.])+$"
This seemed good at first, but it seems to accept, among other things, '*'.
Can anybody tell me what I have wrong here?
The problem appears to be the dash - inside a character class, if unescaped and not at the very end or very beginning of the character class, it denotes a range (A-Z would be a good example from your own regex).
Therefore '-_ is also interpreted as a range, and the characters between ASCII 39 (') and ASCII 95 (_) are ()*+,-./0-9:;<=>?@A-Z[\]^.
Put the dash at the end, and you should be fine:
^[a-zA-Z0-9 '_&,().-]+$
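The difference between the two character classes can be seen with a short sketch like this one (the class name CharClassRange and the field names Buggy and Fixed are made up for illustration):

```csharp
using System.Text.RegularExpressions;

public static class CharClassRange
{
    // Original pattern: '-_ is read as the range from ' (39) to _ (95),
    // which silently admits characters such as * and @.
    public static readonly Regex Buggy =
        new Regex(@"^([a-zA-Z0-9 '-_&,()\.])+$");

    // Dash moved to the end of the class, where it is a literal hyphen.
    public static readonly Regex Fixed =
        new Regex(@"^[a-zA-Z0-9 '_&,().-]+$");
}
```

Here Buggy accepts a*b while Fixed rejects it, and Fixed still accepts input that uses the intended punctuation, such as O'Neil-Smith & Co. (UK).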
Your character class is not quite correct. This part: '-_ creates a range from the apostrophe character to the underscore character. In the ASCII table, the * character falls in between. You need to either escape the hyphen:
^([a-zA-Z0-9 '\-_&,()\.])+$
Or move it somewhere "insignificant", such as the end of the character class:
^([a-zA-Z0-9 '_&,()\.-])+$
In addition to the '-_ issue touched on by other people you also have the + on the end in the wrong place.
The value of the capture group in this regex:
^([a-zA-Z0-9 '-_&,()\.])+$
in Expresso is the last character in the string.
If you want to capture the whole thing inside the regex then put the + straight after the ] like
^([a-zA-Z0-9 '-_&,()\.]+)$
If you are not bothered about extracting the value captured inside the ( ) then drop the ()
^[a-zA-Z0-9 '-_&,()\.]+$
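The effect of where the + sits can be checked in .NET with a small sketch. The names CaptureDemo and FirstGroupValue are hypothetical, and in the usage described below I use the variant of the character class with the dash moved to the end, so that only the position of the + differs.

```csharp
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns what group 1 holds after matching 'input' against 'pattern'.
    // When a group is repeated, Value reflects only its last iteration.
    public static string FirstGroupValue(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

For the input abc, the pattern ^([a-zA-Z0-9 '_&,().-])+$ leaves just c in group 1, whereas ^([a-zA-Z0-9 '_&,().-]+)$ leaves the whole string abc.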
As I also tripped up on the fact that this uses a character class in my initial answer, I dug around for more info. Found the following tutorial excerpt at http://www.regular-expressions.info/charclass.html
The only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Escaping the - with \- should solve your problem.