How to match a non-ASCII character using Regex in C#?

How can I match four characters, skip one unknown character (it could be anything, such as a Chinese or other special character), match four more, skip one again, and so on?
My check string: 1234 4567 7891 0934
It is a 16-digit number, with each group of 4 digits separated by a space.
Main string:
"ACCOUNT NUMBER NAME STATEMENT DATE PAYMENT DUE DATE 1234 4567 7891 0934 Jane Doe 01/01/2009 02/26/09 CREDIT LIMIT CREDIT AVAILABLE NEW BALANCE MINIMUM PAYMENT DUE ."
The main string above comes from a PDF document and was extracted by an OCR engine.
The main string contains my check string, but the groups are separated by some unknown character instead of a space. I tried replacing spaces with # in Visual Studio's Immediate window, but the space-like characters between the groups of the check string were not replaced. So I can say it is a non-ASCII character that merely looks like a space.
I was able to work around the issue with the code below:
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
But I would like to know a Regex solution.
Even if a non-ASCII character occurs between the groups, the regex should still match, so that I can check whether the check string exists.

If you aren't sure whether the character between the groups of 4 digits is a space, you can use ., which matches any single character (except a newline, by default). This regex matches the four groups of digits separated by an unknown character:
\d{4}.\d{4}.\d{4}.\d{4}
If you want to access each group of 4 digits, put them in capturing groups and read the four groups from the match:
(\d{4}).(\d{4}).(\d{4}).(\d{4})
Check this demo
Let me know if any of your query remains unresolved.
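As a minimal sketch of how the grouped pattern might be used from C#: the sample text below is hypothetical, and it assumes the unknown separator is U+00A0 (no-break space), which the question does not confirm. The code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical OCR output: the groups are separated by U+00A0 (no-break space),
// which looks like a space but is not the ASCII space character.
string ocrText = "ACCOUNT NUMBER 1234\u00A04567\u00A07891\u00A00934 Jane Doe";

// Four groups of four digits, each separated by exactly one arbitrary character.
Match m = Regex.Match(ocrText, @"(\d{4}).(\d{4}).(\d{4}).(\d{4})");

// Rebuild the number with plain ASCII spaces, or use an empty string when nothing matched.
string normalized = m.Success
    ? string.Join(" ", m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value)
    : string.Empty;

Console.WriteLine(normalized); // 1234 4567 7891 0934
```

Comparing the rebuilt string against the check string avoids having to know which character the OCR engine actually produced.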

Related

Categorizing this Thai character using the .NET framework

I'm trying to parse some Thai text according to the rules explained here http://www.thai-language.com/ref/spacing
Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.
To parse the text I tried simply looping, like
while( Char.IsLetterOrDigit(theText[i++]) ) {}
to find the next character that isn't a letter or digit. That works except for certain characters, such as the one with code point 3657 (U+0E49),
which is the second character in the word I'm parsing (I think it is the character 'superscripting' the first character in the word).
This character doesn't seem to be categorized as anything by the Char class, i.e.:
Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)
all return false.
This character might be a 'tone' - how would that be identified using .NET?
According to the Unicode specification, the character is mai tho and is in the group "mark, nonspacing (Mn)."
You can use the Char.GetUnicodeCategory() method to check the category. For non-spacing marks it returns UnicodeCategory.NonSpacingMark, whose numeric value is 5.
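A minimal sketch of how the scanning loop from the question could take this into account. The helper names IsWordChar and EndOfWord are hypothetical, and treating all three mark categories as word characters is an assumption that may or may not suit every Thai text.

```csharp
using System;
using System.Globalization;

// Treat combining marks (such as Thai tone marks) as part of a word.
static bool IsWordChar(char c)
{
    if (char.IsLetterOrDigit(c)) return true;
    UnicodeCategory cat = char.GetUnicodeCategory(c);
    return cat == UnicodeCategory.NonSpacingMark
        || cat == UnicodeCategory.SpacingCombiningMark
        || cat == UnicodeCategory.EnclosingMark;
}

// Returns the index of the first character at or after 'start' that is not part of a word.
static int EndOfWord(string text, int start)
{
    int i = start;
    while (i < text.Length && IsWordChar(text[i])) i++;
    return i;
}

// U+0E19, U+0E49 (mai tho, code point 3657), U+0E33, then a space and ASCII letters.
string sample = "\u0E19\u0E49\u0E33 abc";
Console.WriteLine(char.GetUnicodeCategory((char)3657)); // NonSpacingMark
Console.WriteLine(EndOfWord(sample, 0));                // 3
```

The bounds check in EndOfWord also avoids running past the end of the string, which the original one-line loop does not guard against.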

Regex to match more than one word

I have an ASP.NET MVC application containing a form field called 'First/last name'. I need to add some basic validation to ensure people enter at least two words. It doesn't need to be totally comprehensive in checking word length etc, we essentially just need to prevent people from entering just their first name which is what's happening currently. I don't want to limit to just alphabetic characters as some names include punctuation. I just want to ensure that people have entered at least two words separated by a space.
I have the following regex currently:
[RegularExpression(@"^((\b[a-zA-Z]{2,40}\b)\s*){2,}$", ErrorMessage = "Invalid first/last name")]
This works to an extent (it checks for 2 words) but it's invalid if punctuation is entered, which isn't what I'm looking for.
Could anyone suggest how to modify the above so that it doesn't matter if punctuation is used in the words? I'm not good with the regular expression syntax, hence asking here.
Thanks.
You want two words, so at least one space between them, and beyond that you want to allow everything else (e.g., punctuation). So keep it simple:
\w.*\s.*\w
Or if you must anchor it to start and end:
^.*\w.*\s.*\w.*$
These will match, for example, D' Addario (but not D'Artagnan by itself, since it counts as one word by the space criterion).
Maybe just:
@"\w\s\w"
That is: word character, whitespace, word character.
Hi, you can use this regex for validation:
^[a-zA-Z0-9]+ {1}[a-zA-Z0-9]+$
Demo http://rubular.com/r/YN8eFa1yFE
If you just want to allow a sequence of non-whitespace characters followed by 1 or more sequences of whitespace characters followed by non-whitespace characters, you can use
^\s*\S+(?:\s+\S+)+\s*$
See regex demo
It won't accept just First or First .
Regex breakdown:
^ - start of string
\s* - zero or more whitespace
\S+ - 1 or more non-whitespace symbols
(?:\s+\S+)+ - 1 or more sequences of:
    \s+ - 1 or more whitespace characters (remove + to allow only 1 whitespace character between words)
    \S+ - 1 or more non-whitespace symbols
\s* - zero or more whitespace
$ - end of string
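A small sketch of the last pattern checked from plain C# code, outside the RegularExpression attribute. The helper name HasAtLeastTwoWords and the sample names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// At least two runs of non-whitespace characters separated by whitespace.
static bool HasAtLeastTwoWords(string name) =>
    Regex.IsMatch(name, @"^\s*\S+(?:\s+\S+)+\s*$");

foreach (string name in new[] { "Jane Doe", "D' Addario", "Jean-Luc O'Neil", "Jane", "Jane " })
    Console.WriteLine("'" + name + "' -> " + HasAtLeastTwoWords(name));
```

The first three names are accepted and the last two are rejected, because punctuation counts as non-whitespace while a lone word (with or without a trailing space) has no second run.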

C# Regex Phone Number Check

I have the following code to check whether a phone number is in the format
(XXX) XXX-XXXX. The code below always returns true, and I'm not sure why.
Match match = Regex.Match(input, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");
// Below code always returns true
if (match.Success) { ....}
The general complaint about regex patterns for phone numbers is that they force the user to type characters that are truly optional, such as dashes.
Why can't those be optional, with the pattern not caring whether they are there?
The pattern below makes dashes, periods and parentheses optional for the user and focuses on the digits, which it exposes through named captures.
The pattern is commented (using #) and spans multiple lines, so use the regex option IgnorePatternWhitespace unless you remove the comments. That flag doesn't change how the regex matches; it only allows the pattern to be commented via the # character and split across lines.
string pattern = @"
^                     # From beginning of line
(?:\(?)               # Match but don't capture optional (
(?<AreaCode>\d{3})    # 3 digit area code
(?:[\).\s]?)          # Optional ) or . or space
(?<Prefix>\d{3})      # Prefix
(?:[-\.\s]?)          # Optional - or . or space
(?<Suffix>\d{4})      # Suffix
(?!\d)                # Fail if eleventh number found";
The above pattern just looks for 10 digits and ignores filler characters such as a (, a dash -, a space, a tab or a period. Examples:
(555)555-5555 (ok)
5555555555 (ok)
555 555 5555 (ok)
555.555.5555 (ok)
55555555556 (not ok - match failure - too many digits)
123.456.789 (failure)
Different variants of the same pattern
The pattern without comments no longer needs IgnorePatternWhitespace:
^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)
The pattern when not using named captures:
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)
The pattern if the ExplicitCapture option is used:
^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)
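A minimal sketch of running the commented pattern with IgnorePatternWhitespace and reading the named captures. The helper name Describe and the dash-joined output format are illustrative choices, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"
^                     # From beginning of line
(?:\(?)               # Match but don't capture optional (
(?<AreaCode>\d{3})    # 3 digit area code
(?:[\).\s]?)          # Optional ) or . or space
(?<Prefix>\d{3})      # Prefix
(?:[-\.\s]?)          # Optional - or . or space
(?<Suffix>\d{4})      # Suffix
(?!\d)                # Fail if eleventh number found";

// IgnorePatternWhitespace is required because the pattern contains comments and line breaks.
Regex phone = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Returns the three digit groups joined by dashes, or "no match".
static string Describe(Regex regex, string input)
{
    Match m = regex.Match(input);
    return m.Success
        ? string.Join("-", m.Groups["AreaCode"].Value, m.Groups["Prefix"].Value, m.Groups["Suffix"].Value)
        : "no match";
}

foreach (string s in new[] { "(555)555-5555", "555.555.5555", "55555555556", "123.456.789" })
    Console.WriteLine(s + " -> " + Describe(phone, s));
```

Because the digits are captured separately from the filler characters, every accepted input can be normalized to one canonical format.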
It doesn't always match, but it will match any string that contains three digits, followed by a hyphen, followed by four more digits. It will also match if there's something that looks like an area code on the front of that. So this is valid according to your regex:
%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%
To validate that the string contains a phone number and nothing else, you need to add anchors at the beginning and end of the regex:
@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$"
Alan Moore did a good job explaining what your expression is actually doing. +1
If you want to match exactly "(XXX) XXX-XXXX" and absolutely nothing else, then what you want is
@"^\(\d{3}\) \d{3}-\d{4}$"
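A short sketch comparing the three patterns from this thread on the same inputs. The variable names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

string loose    = @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}";    // original, unanchored
string anchored = @"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$";  // same pattern with anchors
string strict   = @"^\(\d{3}\) \d{3}-\d{4}$";                 // exactly (XXX) XXX-XXXX

string noisy = "%%%(999)123-4567%%%";

Console.WriteLine(Regex.IsMatch(noisy, loose));               // True: a phone number appears somewhere
Console.WriteLine(Regex.IsMatch(noisy, anchored));            // False: extra characters around it
Console.WriteLine(Regex.IsMatch("(999)123-4567", anchored));  // True: the space is optional here
Console.WriteLine(Regex.IsMatch("(999)123-4567", strict));    // False: the space is required
Console.WriteLine(Regex.IsMatch("(999) 123-4567", strict));   // True
```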
Here is the C# code I use. It is designed to get all phone numbers from a page of text. It works for the following patterns: 0123456789, 012-345-6789, (012)-345-6789, (012)3456789, 012 3456789, 012 345 6789, 012 345-6789, (012) 345-6789, 012.345.6789
List<string> phoneList = new List<string>();
// Optional parentheses around the area code; optional -, . or space between groups.
Regex rg = new Regex(@"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
MatchCollection m = rg.Matches(html);
foreach (Match g in m)
{
    // Groups[0] is the whole match, i.e. the phone number as written in the text.
    if (g.Groups[0].Value.Length > 0)
        phoneList.Add(g.Groups[0].Value);
}
None of the suggestions above takes care of international numbers like +33 6 87 17 00 11 (which is a valid phone number in France, for example).
I would use a two-step approach:
1. Remove all characters that are not digits or the '+' character.
2. Check that the + sign is either at the beginning or absent. Check the length (this can be very hard, as it depends on each country's numbering scheme).
If your number starts with +1, or you are sure the user is in the USA, then you can apply the suggestions above.
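The two steps above can be sketched as follows. The helper name LooksLikePhoneNumber is hypothetical, and the 7 to 15 digit range is an assumption for illustration only (15 is the maximum allowed by the E.164 numbering plan; the lower bound is arbitrary), since real length rules depend on the country.

```csharp
using System;
using System.Text.RegularExpressions;

static bool LooksLikePhoneNumber(string input)
{
    // Step 1: drop everything except ASCII digits and '+'.
    string cleaned = Regex.Replace(input, @"[^0-9+]", "");

    // Step 2: an optional '+' only at the start, then digits only.
    // The 7..15 length range is an assumed, country-agnostic sanity check.
    return Regex.IsMatch(cleaned, @"^\+?[0-9]{7,15}$");
}

foreach (string s in new[] { "+33 6 87 17 00 11", "+1 (555) 555-5555", "33+687170011", "12345" })
    Console.WriteLine(s + " -> " + LooksLikePhoneNumber(s));
```

The first two inputs pass; the third fails because the + is not at the beginning, and the fourth fails the assumed minimum length.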

Regular expression - match character, digits and special characters

I have the following string:
test123 test ödo 123teö"st 123 m.1212 123t.est
I only want to match strings as a whole that have either characters, digits and special character mixed together. So the regex should match the following string of the example above:
test123 test ödo 123teö"st 123 m.1212 123t.est
Could someone help me out please?
Update
Sorry for not giving a clear explanation of what I need.
I am using C#.
I need to find words that contain alphanumeric strings (e.g. abc123, 123abc, a1b2c3, 1abc23). I also need to find strings that contain any kind of symbol (symbol = anything other than word characters and digits) (e.g. abc"123, "abc, ab?dd, 100mm", 345t{asd]dd).
If I find a match, I need to "tokenize" these strings (separate digits, word characters and symbols with whitespace), so abc123 becomes abc 123 and 345t{asd]dd becomes 345 t { asd ] dd, etc.
Assuming you're using a regex flavor that supports lookaheads and Unicode properties, this should get you started:
(?!(?:\pL+|\pN+|\pP+)(?!\S))\S+
\S+ matches one or more non-whitespace characters, but only after the negative lookahead asserts that those characters are not all letters (\pL), digits (\pN), or punctuation (\pP). The inner negative lookahead--(?!\S)--ensures that the outer one examines all the characters in the word.
Although it might satisfy your requirements, this regex is really just a demonstration of the technique you'll probably want to use. As it is, it can be fooled by "words" with (for example) control characters or dingbats in them.
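A minimal C# sketch of this technique combined with the tokenizing step from the question. Note that .NET requires braces in Unicode property escapes, so \pL is written \p{L} here. The token pattern and the Tokenize helper are assumptions added for illustration; they are not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Words that are not purely letters, purely digits, or purely punctuation.
Regex mixedWord = new Regex(@"(?!(?:\p{L}+|\p{N}+|\p{P}+)(?!\S))\S+");

// A run of letters, a run of digits, or one single other non-whitespace character.
Regex token = new Regex(@"\p{L}+|\p{N}+|[^\p{L}\p{N}\s]");

// Replaces every mixed word with its tokens separated by single spaces.
string Tokenize(string text) =>
    mixedWord.Replace(text, m =>
        string.Join(" ", token.Matches(m.Value).Cast<Match>().Select(t => t.Value)));

Console.WriteLine(Tokenize("abc123 test 345t{asd]dd 100"));
// abc 123 test 345 t { asd ] dd 100
```

Words made only of letters or only of digits are left untouched, because the negative lookahead rejects them before \S+ can match.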
The answer to the question you’ve actually asked is (?s).+, but perhaps you would care to refine your question.

A regular expression for matching a simple word in C#?

I need a regular expression that matches only words meeting the following conditions. I am using it in my C# program.
Can be any case
Should not contain any numbers
May contain - and ' characters, but they are optional
Should start with a letter
I have tried using the expression ^([a-zA-Z][\'\-]?)+$ but it doesn't work.
Here is a list of a few words that are acceptable
London (Case insensitive)
Jackson's
non-profit
Here is a list of a few words that are not acceptable
12london (contains a number and does not start with a letter)
-to (does not start with a letter)
to: (contains a : character; no special character other than - and ' is allowed)
^[a-zA-Z][-'a-zA-Z]*$
This matches any word that starts with an alphabetical character, followed by any number of alphabetical characters, - or '.
Note that you don't need to escape the - and ' inside a [] character class, as long as the dash is either the first or last character in the class.
Note also that I've removed the round brackets from your example - if you don't want to capture the input, you'll get better performance by leaving them out.
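A quick sketch that runs this pattern against the sample words from the question. The variable name simpleWord is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// One letter, followed by any number of letters, dashes or apostrophes.
Regex simpleWord = new Regex(@"^[a-zA-Z][-'a-zA-Z]*$");

foreach (string w in new[] { "London", "Jackson's", "non-profit", "12london", "-to", "to:" })
    Console.WriteLine(w + ": " + simpleWord.IsMatch(w));
```

The first three words match and the last three do not, which agrees with the acceptable and unacceptable lists in the question.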
Try this one:
^[A-Za-z]+[A-Za-z'-]*$
First of all, try your regexes against tools such as http://www.regextester.com/
You are testing strings that both start with AND end with your pattern (^ means start of line, $ is the end), thus leaving out all of the words contained between two spaces.
You should use \b or \B.
Instead of looking for [a-zA-Z] you can use character classes such as '\D' (not digit).
Let me know if the above is working in your scenario.
\b\D[^\c][a-zA-Z]+[^\c]
It says: word boundaries with no digits, no control characters, one or more alphabetical lower or uppercase character, with no following control characters.
