Regular Expression group reversed order - c#

I am reading in a very messy file with very little (if any) format. I am looking for the following three items, two of which I have working properly.
Name (first and last) working
Email addresses (varying types (eg. .edu .net .com) There could be others as well.) working
Employee number (two capital letters followed by 5 digit values then the same two letters as the first but reversed) NOT Working
The code I have currently for the Employee regex:
string employeeNumber = @"(?<grp1>[A-Z]{2})[0-9]{5}[A-Z]{2}";
This finds the required values, but would also find invalid employee numbers since it is not actually looking for the first two capital chars in the opposite order.
What I would like in the end is to somehow use <grp1> again, but in reversed order.
Example of a valid employee number XY12345YX.
I could not find any good documentation on any type of regular expression group reversal. Any Ideas would be great!
EDIT:
This is an example of a line from a text document that I am reading in.
'Name list from PQP-97 system &%$ Bill Williams MK12345KM bwilliams01@msn.com ^ %20%
Fredericka Hanover GW22887WG freddie@verizon.net'

Try this:
/.*?([A-Z][a-z]*)\s+([A-Z][a-z]*)\s+(([A-Z])([A-Z])[0-9]{5}\5\4)\s+(\S+@\S+).*/g
Regex101 Demo: https://regex101.com/r/iB9vF2/2
Match1 = First Name
Match2 = Last Name
Match3 = Employee ID
Match4 = (ignore this; just used for finding employee id)
Match5 = (ignore this; just used for finding employee id)
Match6 = Email
Explanation:
.*? - ignore any rubbish before the first name
([A-Z][a-z]*) - first name begins with a capital followed by any number of lower case letters
\s+ - 1 or more spaces marks the end of the first name
([A-Z][a-z]*) - last name follows first name, and follows the same pattern
\s+ - last name terminated by space(s)
(([A-Z])([A-Z])[0-9]{5}\5\4) - employee id follows last name, in the format Capital1, Capital2 then 5 digits, then a repeat of Capital2 (match5) and Capital1 (match4)
\s+ - space(s) shows the end of the employee id
(\S+@\S+) - non space characters either side of an @ symbol make up the email*
.* - this just allows for junk on the end of the string. It won't match the mail, since the \S+ is greedy, but it will cater for any other character, thus also representing the end of the email.
* NB: the email regex is overly simple; should be enough for your needs, but this couldn't check for valid emails, since the rules around those are complex.
Further reading: Using a regular expression to validate an email address
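For the employee number alone, the same backreference idea can be sketched in C#. This is a minimal illustration, not the answerer's code: the class and method names are made up for the example, and the \b word boundaries are an added assumption so that the ID is only found as a standalone token.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmployeeIdDemo
{
    // Each capital is captured in its own group so that \2\1 can
    // require the same two letters in reversed order after the digits.
    private static readonly Regex EmployeeId =
        new Regex(@"\b([A-Z])([A-Z])[0-9]{5}\2\1\b");

    // True if the text contains at least one well-formed employee number.
    public static bool ContainsEmployeeId(string text)
    {
        return EmployeeId.IsMatch(text);
    }

    // Returns every employee number found in the text, in order.
    public static List<string> FindAll(string text)
    {
        var result = new List<string>();
        foreach (Match m in EmployeeId.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Calling EmployeeIdDemo.FindAll on the first sample line from the question should yield MK12345KM, while a string such as XY12345XY is rejected because the trailing letters are not reversed.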

Related

How to match with Non-Ascii character using Regex in C#?

How can I match 4 characters, then skip one character (which is unknown to me; it could be a Chinese character or some other special character), then match 4 more, skip one again, and so on?
My check string: 1234 4567 7891 0934
This is a 16-digit value, with each group of 4 digits separated by a space.
Main string:
"ACCOUNT NUMBER NAME STATEMENT DATE PAYMENT DUE DATE 1234 4567 7891 0934 Jane Doe 01/01/2009 02/26/09 CREDIT LIMIT CREDIT AVAILABLE NEW BALANCE MINIMUM PAYMENT DUE ."
The above text (Main string) comes from a PDF document and was extracted by an OCR engine.
The Main string contains my check string, but the groups are separated by some unknown character instead of a space. I tried replacing spaces with # in Visual Studio's Immediate window, but the separators inside the check string were not replaced. So I can say it is a non-ASCII character that merely looks like a space.
I was able to work around this issue with the code below:
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
But I would like to know a regex solution.
Even when a non-ASCII character occurs, the regex should still match so I can check whether the check string exists.
If you aren't sure whether the character between those groups of 4 digits is a space, you can use ., which matches any character except a newline, and match the groups of 4 digits separated by an unknown character with this regex:
\d{4}.\d{4}.\d{4}.\d{4}
If you want to access those groups of 4 digits, you can put each one in a capturing group and read all four groups from this regex:
(\d{4}).(\d{4}).(\d{4}).(\d{4})
Check this demo
Let me know if any part of your question remains unresolved.
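As a rough C# sketch of the grouping pattern above: the class and method names are hypothetical, and the non-breaking space (U+00A0) mentioned in the usage note is only an assumed stand-in for whatever separator the OCR engine actually produced.

```csharp
using System.Text.RegularExpressions;

public static class AccountNumberDemo
{
    // Four groups of four digits, each separated by exactly one
    // arbitrary character (anything except a newline).
    private static readonly Regex Groups =
        new Regex(@"(\d{4}).(\d{4}).(\d{4}).(\d{4})");

    // Returns the four groups joined by plain ASCII spaces,
    // or an empty string when the pattern is not found.
    public static string Normalize(string text)
    {
        Match m = Groups.Match(text);
        if (!m.Success)
        {
            return string.Empty;
        }
        return string.Join(" ",
            m.Groups[1].Value, m.Groups[2].Value,
            m.Groups[3].Value, m.Groups[4].Value);
    }
}
```

With input such as "DUE DATE 1234\u00A04567\u00A07891\u00A00934 Jane Doe", Normalize returns the check string with ordinary spaces, which can then be compared directly.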

Regex - Match first and last character within capturing group

I want to capture the first and last character within a capturing group.
My current RegEx is -
([\w\.]+)@([\w]+)\.com
For example, if there is an email address -
xyz@test.com
This is the output -
Full match 0-12 `xyz@test.com`
Group 1. 0-3 `xyz`
Group 2. 4-8 `test`
The email address can have alphanumeric and period values.
If I want to curtail the Group 1 such that it starts and ends with only alphanumeric values, how to do that?
I want to modify this capturing group -
([\w\.]+)
The required output is -
xyz.@test.com Invalid
.xyz@test.com Invalid
xy.z@test.com Valid
To tell the engine to match English alphanumeric characters at the start position and just before the @, you need to do this:
^([a-zA-Z0-9][\.a-zA-Z0-9]*[a-zA-Z0-9])@([a-zA-Z0-9]+)\.com$
Note: \w includes _ that you may not desire.
But this doesn't allow usernames that are only one character long. So you have to modify it a little:
^([a-zA-Z0-9]+(?:\.+[a-zA-Z0-9]+)*)@([a-zA-Z0-9]+)\.com$
Also, this shouldn't be considered a good email validator. Since you restrict matching to the .com TLD, I assume this is a very specific requirement; otherwise, note that it limits the domain name to alphanumerics and rejects many characters that would be valid in an email address according to RFC 822. This would be enough for capturing an email address from user input:
^[^\s@]+@[^\s@]+$
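The pattern that also allows one-character user names can be checked against the sample addresses from the question with a few lines of C#. This is only a sketch; the class and method names are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class LocalPartDemo
{
    // Local part: runs of alphanumerics separated by dots, so it can
    // neither start nor end with a dot. Domain: alphanumerics plus ".com".
    private static readonly Regex Address = new Regex(
        @"^([a-zA-Z0-9]+(?:\.+[a-zA-Z0-9]+)*)@([a-zA-Z0-9]+)\.com$");

    public static bool IsValid(string address)
    {
        return Address.IsMatch(address);
    }

    // Returns the local part (group 1), or an empty string if invalid.
    public static string LocalPart(string address)
    {
        Match m = Address.Match(address);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Here IsValid accepts xy.z@test.com and a@test.com, and rejects xyz.@test.com and .xyz@test.com.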
Try this regex - (^[\w][\w\.\w]+[\w])@([\w]+)\.com
This works:
^([0-9a-zA-Z][a-zA-Z0-9_\.]*)(?<!\.)@([a-zA-Z0-9_]+)\.com$
Demo
Basically, it matches an alphanumeric character at the start, then [a-zA-Z0-9_\.] zero or more times. Just before the @, a negative lookbehind checks that the preceding character is not a dot.

Regex to match more than one word

I have an ASP.NET MVC application containing a form field called 'First/last name'. I need to add some basic validation to ensure people enter at least two words. It doesn't need to be totally comprehensive in checking word length etc, we essentially just need to prevent people from entering just their first name which is what's happening currently. I don't want to limit to just alphabetic characters as some names include punctuation. I just want to ensure that people have entered at least two words separated by a space.
I have the following regex currently:
[RegularExpression(@"^((\b[a-zA-Z]{2,40}\b)\s*){2,}$", ErrorMessage = "Invalid first/last name")]
This works to an extent (it checks for 2 words) but it's invalid if punctuation is entered, which isn't what I'm looking for.
Could anyone suggest how to modify the above so that it doesn't matter if punctuation is used in the words? I'm not good with the regular expression syntax, hence asking here.
Thanks.
You want two words, so at least one space between them, and beyond that you want to allow everything else (e.g., punctuation). So keep it simple:
\w.*\s.*\w
Or if you must anchor it to start and end:
^.*\w.*\s.*\w.*$
These will match, for example, D' Addario (but not D'Artagnan by itself, since it counts as one word by the space criterion).
Maybe just:
@"\w\s\w"
word white space word
Hi, you can use this regex for validation:
^[a-zA-Z0-9]+ {1}[a-zA-Z0-9]+$
Demo http://rubular.com/r/YN8eFa1yFE
If you just want to allow a sequence of non-whitespace characters followed by 1 or more sequences of whitespace characters followed by non-whitespace characters, you can use
^\s*\S+(?:\s+\S+)+\s*$
See regex demo
It won't accept just "First" or "First " (with a trailing space).
Regex breakdown:
^ - start of string
\s* - zero or more whitespace
\S+ - 1 or more non-whitespace symbols
(?:\s+\S+)+ - 1 or more sequences of ...
\s+ - 1 or more whitespace characters (remove + to allow only 1 whitespace between words)
\S+ - 1 or more non-whitespace symbols
\s* - zero or more whitespace
$ - end of string
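A small C# sketch of the last pattern, with a hypothetical helper name, shows how it behaves on the names discussed in this thread:

```csharp
using System.Text.RegularExpressions;

public static class FullNameDemo
{
    // At least two runs of non-whitespace separated by whitespace;
    // leading and trailing whitespace is tolerated.
    private static readonly Regex TwoWords =
        new Regex(@"^\s*\S+(?:\s+\S+)+\s*$");

    public static bool HasAtLeastTwoWords(string name)
    {
        return TwoWords.IsMatch(name);
    }
}
```

HasAtLeastTwoWords accepts "First Last" and "D' Addario" and rejects "First", "First " and "D'Artagnan". In an ASP.NET MVC model the same pattern string could be passed to the RegularExpression attribute shown in the question.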

C# Regex Phone Number Check

I have the following to check if the phone number is in the following format
(XXX) XXX-XXXX. The code below always returns true. I'm not sure why.
Match match = Regex.Match(input, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");
// The check below always returns true
if (match.Success) { ....}
The general complaint about regex patterns for phone numbers is that they require the user to type characters that are truly optional, such as dashes.
Why can't those be optional, so that the pattern does not care whether they are there or not?
The pattern below makes dashes, periods and parentheses optional for the user and focuses on the numbers, using named captures.
The pattern is commented (using #) and spans multiple lines, so use the RegexOptions.IgnorePatternWhitespace option unless you remove the comments. That flag doesn't affect what the regex matches; it only allows the pattern to be commented via the # character and split across lines.
string pattern = @"
^                    # From beginning of line
(?:\(?)              # Match but don't capture optional (
(?<AreaCode>\d{3})   # 3 digit area code
(?:[\).\s]?)         # Optional ) or . or space
(?<Prefix>\d{3})     # Prefix
(?:[-\.\s]?)         # Optional - or . or space
(?<Suffix>\d{4})     # Suffix
(?!\d)               # Fail if eleventh number found";
The above pattern just looks for 10 digits and ignores a filler character such as a (, a dash -, a space, a tab or even a period. Examples:
(555)555-5555 (ok)
5555555555 (ok)
555 555 5555 (ok)
555.555.5555 (ok)
55555555556 (not ok - match failure - too many digits)
123.456.789 (not ok - match failure - too few digits)
Different Variants of same pattern
Pattern without comments, which no longer needs IgnorePatternWhitespace:
^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)
Pattern when not using Named Captures
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)
Pattern if the ExplicitCapture option is used
^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)
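To show how the named captures might be consumed, here is a short C# sketch built on the comment-free variant of the pattern. The class name, the method name and the AAA-PPP-SSSS output format are assumptions made for the example.

```csharp
using System.Text.RegularExpressions;

public static class PhoneDemo
{
    // Comment-free variant of the pattern, so no
    // IgnorePatternWhitespace option is required.
    private static readonly Regex Phone = new Regex(
        @"^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)");

    // Returns the number as AAA-PPP-SSSS, or an empty string
    // when the input does not match.
    public static string Normalize(string input)
    {
        Match m = Phone.Match(input);
        if (!m.Success)
        {
            return string.Empty;
        }
        return string.Join("-",
            m.Groups["AreaCode"].Value,
            m.Groups["Prefix"].Value,
            m.Groups["Suffix"].Value);
    }
}
```

For "(555)555-5555", "5555555555" and "555.555.5555" this returns 555-555-5555; for "55555555556" and "123.456.789" it returns an empty string.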
It doesn't always match, but it will match any string that contains three digits, followed by a hyphen, followed by four more digits. It will also match if there's something that looks like an area code on the front of that. So this is valid according to your regex:
%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%
To validate that the string contains a phone number and nothing else, you need to add anchors at the beginning and end of the regex:
@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$"
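The difference the anchors make can be seen by running both versions of the question's pattern side by side. This is a minimal sketch with made-up class and method names:

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // The pattern from the question: it may match anywhere in the input.
    private const string Unanchored = @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}";

    // The same pattern with ^ and $ added: the whole input must be a phone number.
    private const string Anchored = @"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$";

    public static bool MatchesAnywhere(string s)
    {
        return Regex.IsMatch(s, Unanchored);
    }

    public static bool MatchesWhole(string s)
    {
        return Regex.IsMatch(s, Anchored);
    }
}
```

For "%%%(999)123-4567%%%", MatchesAnywhere returns true while MatchesWhole returns false; for "(999) 123-4567" both return true.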
Alan Moore did a good job explaining what your expression is actually doing. +1
If you want to match exactly "(XXX) XXX-XXXX" and absolutely nothing else, then what you want is
@"^\(\d{3}\) \d{3}-\d{4}$"
Here is the C# code I use. It is designed to get all phone numbers from a page of text. It works for the following patterns: 0123456789, 012-345-6789, (012)-345-6789, (012)3456789, 012 3456789, 012 345 6789, 012 345-6789, (012) 345-6789, 012.345.6789
List<string> phoneList = new List<string>();
Regex rg = new Regex(@"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
MatchCollection m = rg.Matches(html);
foreach (Match g in m)
{
    // Groups[0] is the whole match, i.e. the phone number as written
    if (g.Groups[0].Value.Length > 0)
        phoneList.Add(g.Groups[0].Value);
}
None of the answers above takes care of international numbers like +33 6 87 17 00 11 (which is a valid phone number in France, for example).
I would do it in a two-step approach:
1. Remove all characters that are not numbers or '+' character
2. Check the + sign is at the beginning or not there. Check length (this can be very hard as it depends on local country number schemes).
Now if your number starts with +1 or you are sure the user is in USA, then you can apply the comments above.
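The two steps could be sketched in C# roughly as follows. The names are hypothetical, and the 7 to 15 digit range is purely an illustrative assumption; as noted above, real length rules depend on each country's numbering scheme.

```csharp
using System.Text.RegularExpressions;

public static class IntlPhoneDemo
{
    // Step 1: remove every character that is not a digit or '+'.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[^0-9+]", "");
    }

    // Step 2: allow an optional '+' only at the very start, followed by
    // digits. The 7-15 digit range is an assumed placeholder, not a rule
    // taken from any particular numbering plan.
    public static bool LooksPlausible(string input)
    {
        return Regex.IsMatch(Strip(input), @"^\+?[0-9]{7,15}$");
    }
}
```

Strip("+33 6 87 17 00 11") gives +33687170011, which LooksPlausible accepts; an input with a '+' in the middle, such as "33+687170011", is rejected.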

How can I get a regex to check that a string only contains alpha characters [a-z] or [A-Z]?

I'm trying to create a regex to verify that a given string only has alpha characters a-z or A-Z. The string can be up to 25 letters long. (I'm not sure if regex can check length of strings)
Examples:
1. "abcdef" = true;
2. "a2bdef" = false;
3. "333" = false;
4. "j" = true;
5. "aaaaaaaaaaaaaaaaaaaaaaaaaa" = false; //26 letters
Here is what I have so far... can't figure out what's wrong with it though
Regex alphaPattern = new Regex("[^a-z]|[^A-Z]");
I would think that would mean that the string could contain only upper or lower case letters from a-z, but when I match it to a string with all letters it returns false...
Also, any suggestions regarding efficiency of using regex vs. other verifying methods would be greatly appreciated.
Regex lettersOnly = new Regex("^[a-zA-Z]{1,25}$");
^ means "begin matching at start of string"
[a-zA-Z] means "match lower case and upper case letters a-z"
{1,25} means "match the previous item (the character class, see above) 1 to 25 times"
$ means "only match if cursor is at end of string"
I'm trying to create a regex to verify that a given string only has alpha
characters a-z or A-Z.
Easily done, as many of the others have indicated, using what are known as "character classes". Essentially, these allow us to specify a range of values to use for matching:
(NOTE: for simplification, I am assuming implicit ^ and $ anchors, which are explained later in this post)
[a-z] Match any single lower-case letter.
ex: a matches, 8 doesn't match
[A-Z] Match any single upper-case letter.
ex: A matches, a doesn't match
[0-9] Match any single digit zero to nine
ex: 8 matches, a doesn't match
[aeiou] Match only on a or e or i or o or u.
ex: o matches, z doesn't match
[a-zA-Z] Match any single lower-case OR upper-case letter.
ex: A matches, a matches, 3 doesn't match
These can, naturally, be negated as well:
[^a-z] Match anything that is NOT a lower-case letter
ex: 5 matches, A matches, a doesn't match
[^A-Z] Match anything that is NOT an upper-case letter
ex: 5 matches, A doesn't match, a matches
[^0-9] Match anything that is NOT a number
ex: 5 doesn't match, A matches, a matches
[^Aa69] Match anything as long as it is not A or a or 6 or 9
ex: 5 matches, A doesn't match, a doesn't match, 3 matches
To see some common character classes, go to:
http://www.regular-expressions.info/reference.html
The string can be up to 25 letters long.
(I'm not sure if regex can check length of strings)
You can absolutely check "length" but not in the way you might imagine. We measure repetition, NOT length strictly speaking using {}:
a{2} Match two a's together.
ex: a doesn't match, aa matches, aca doesn't match
4{3} Match three 4's together.
ex: 4 doesn't match, 44 doesn't match, 444 matches, 4434 doesn't match
Repetition has values we can set to have lower and upper limits:
a{2,} Match on two or more a's together.
ex: a doesn't match, aa matches, aaa matches, aba doesn't match, aaaaaaaaa matches
a{2,5} Match on two to five a's together.
ex: a doesn't match, aa matches, aaa matches, aba doesn't match, aaaaaaaaa doesn't match
Repetition extends to character classes, so:
[a-z]{5} Match any five lower-case characters together.
ex: bubba matches, Bubba doesn't match, BUBBA doesn't match, asdjo matches
[A-Z]{2,5} Match two to five upper-case characters together.
ex: bubba doesn't match, Bubba doesn't match, BUBBA matches, BUBBETTE doesn't match
[0-9]{4,8} Match four to eight numbers together.
ex: bubba doesn't match, 15835 matches, 44 doesn't match, 3456876353456 doesn't match
[a3g]{2} Match two characters together, each of which is an a OR a 3 OR a g.
ex: aa matches, ba doesn't match, 33 matches, 38 doesn't match, a3 matches
Now let's look at your regex:
[^a-z]|[^A-Z]
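Translation: Match any character that is NOT a lowercase letter, OR any character that is NOT an upper-case letter. No character is both, so every single character satisfies one side or the other.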
To fix it so it meets your needs, we would rewrite it like this:
Step 1: Remove the negation
[a-z]|[A-Z]
Translation: Find any lowercase letter OR uppercase letter.
Step 2: While not strictly needed, let's clean up the OR logic a bit
[a-zA-Z]
Translation: Find any lowercase letter OR uppercase letter. Same as above but now using only a single set of [].
Step 3: Now let's indicate "length"
[a-zA-Z]{1,25}
Translation: Find any lowercase letter OR uppercase letter repeated one to twenty-five times.
This is where things get funky. You might think you were done here and you may well be depending on the technology you are using.
Strictly speaking the regex [a-zA-Z]{1,25} will match one to twenty-five upper or lower-case letters ANYWHERE on a line:
[a-zA-Z]{1,25}
a matches, aZgD matches, BUBBA matches, 243242hello242552 MATCHES
In fact, every example I have given so far will do the same. If that is what you want then you are in good shape but based on your question, I'm guessing you ONLY want one to twenty-five upper or lower-case letters on the entire line. For that we turn to anchors. Anchors allow us to specify those pesky details:
^ beginning of a line
(I know, we just used this for negation earlier, don't get me started)
$ end of a line
We can use them like this:
^a{3} From the beginning of the line match a three times together
ex: aaa matches, 123aaa doesn't match, aaa123 matches
a{3}$ Match a three times together at the end of a line
ex: aaa matches, 123aaa matches, aaa123 doesn't match
^a{3}$ Match a three times together for the ENTIRE line
ex: aaa matches, 123aaa doesn't match, aaa123 doesn't match
Notice that aaa matches in all cases because it has three a's at the beginning and end of the line technically speaking.
So the final, technically correct solution, for finding a "word" that is "up to twenty-five characters long" on a line would be:
^[a-zA-Z]{1,25}$
The funky part is that some technologies implicitly put anchors in the regex for you and some don't. You just have to test your regex or read the docs to see if you have implicit anchors.
/// <summary>
/// Checks if string contains only letters a-z and A-Z and should not be more than 25 characters in length
/// </summary>
/// <param name="value">String to be matched</param>
/// <returns>True if matches, false otherwise</returns>
public static bool IsValidString(string value)
{
    string pattern = @"^[a-zA-Z]{1,25}$";
    return Regex.IsMatch(value, pattern);
}
The string can be up to 25 letters long.
(I'm not sure if regex can check length of strings)
Regexes certainly can check the length of a string, as can be seen from the answers posted by others.
However, when you are validating user input (say, a username), I would advise doing that check separately.
The problem is that a regex can only tell you whether a string matched or not. It won't tell you why it didn't match. Was the text too long, or did it contain disallowed characters? You can't tell. It's far from friendly when a program says: "The supplied username contained invalid characters or was too long". Instead, you should provide separate error messages for the different situations.
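One way to separate the checks is sketched below in C#. The class name, method name and message wording are all invented for the illustration; only the 25-character limit and the a-z/A-Z rule come from the question.

```csharp
using System.Text.RegularExpressions;

public static class UsernameDemo
{
    // Returns an empty string when the value is acceptable,
    // otherwise a message naming the specific rule that failed.
    public static string Validate(string value)
    {
        // Length is checked first, without a regex, so it gets its own message.
        if (value.Length == 0)
        {
            return "The value must not be empty.";
        }
        if (value.Length > 25)
        {
            return "The value must be at most 25 characters long.";
        }
        // Only the character rule is left for the regex.
        if (!Regex.IsMatch(value, @"^[a-zA-Z]+$"))
        {
            return "The value may contain only the letters a-z and A-Z.";
        }
        return string.Empty;
    }
}
```

With this split, "a2bdef" produces the message about letters, while a string of 26 letters produces the message about length.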
The regular expression you are using is an alternation of [^a-z] and [^A-Z]. And the expressions [^…] mean to match any character other than those described in the character set.
So overall your expression means to match either any single character other than a-z or other than A-Z.
But you rather need a regular expression that matches a-zA-Z only:
[a-zA-Z]
And to specify the length of that, anchor the expression with the start (^) and end ($) of the string and describe the length with the {n,m} quantifier, meaning at least n but not more than m repetitions:
^[a-zA-Z]{0,25}$
Do I understand correctly that it can only contain either uppercase or lowercase letters?
new Regex("^([a-z]{1,25}|[A-Z]{1,25})$")
A regular expression seems to be the right thing to use for this case.
By the way, the caret ("^") at the first place inside a character class means "not", so your "[^a-z]|[^A-Z]" would mean "not any lowercase letter, or not any uppercase letter" (disregarding that a-z are not all letters).
