C# validate phone number with Regex [duplicate] - c#

I'm trying to put together a comprehensive regex to validate phone numbers. Ideally it would handle international formats, but it must handle US formats, including the following:
1-234-567-8901 x1234
1-234-567-8901 ext1234
1 (234) 567-8901
I'll answer with my current attempt, but I'm hoping somebody has something better and/or more elegant.

Better option... just strip all non-digit characters on input (except 'x' and leading '+' signs), taking care because of the British tendency to write numbers in the non-standard form +44 (0) ... when asked to use the international prefix (in that specific case, you should discard the (0) entirely).
Then, you end up with values like:
Then, when you display, reformat to your heart's content, e.g.
1 (234) 567-8901
1 (234) 567-8901 x1234
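A rough sketch of the strip-on-input idea in Python. The function name normalize_phone and the decision to fold "ext"/"extension" into a bare "x" are assumptions made for illustration; the answer above only says to keep digits, 'x' and a leading '+', and to discard the British "(0)".

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce input to digits, an optional leading '+', and an optional 'x' extension marker."""
    text = raw.strip().lower()
    # Drop the British-style "(0)" written after an international prefix.
    if text.startswith("+"):
        text = re.sub(r"\(\s*0\s*\)", "", text)
    # Assumption: treat "ext", "ext." and "extension" the same as a bare "x".
    text = re.sub(r"ext(?:ension)?\.?", "x", text)
    plus = "+" if text.startswith("+") else ""
    # Keep only digits and the extension marker.
    return plus + re.sub(r"[^0-9x]", "", text)

print(normalize_phone("1-234-567-8901 ext1234"))
print(normalize_phone("+44 (0) 1234 567890"))
```

Storing the normalized value and formatting only at display time keeps the stored data independent of how any one user typed it.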

If the users want to give you their phone numbers, then trust them to get it right. If they do not want to give it to you then forcing them to enter a valid number will either send them to a competitor's site or make them enter a random string that fits your regex. I might even be tempted to look up the number of a premium rate horoscope hotline and enter that instead.
I would also consider any of the following as valid entries on a web site:
"123 456 7890 until 6pm, then 098 765 4321"
"123 456 7890 or try my mobile on 098 765 4321"
"ex-directory - mind your own business"

It turns out that there's something of a spec for this, at least for North America, called the NANP.
You need to specify exactly what you want. What are legal delimiters? Spaces, dashes, and periods? No delimiter allowed? Can one mix delimiters (e.g., +0.111-222.3333)? How are extensions (e.g., 111-222-3333 x 44444) going to be handled? What about special numbers, like 911? Is the area code going to be optional or required?
Here's a regex for a 7 or 10 digit number, with extensions allowed, delimiters are spaces, dashes, or periods:
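A minimal sketch of a pattern along those lines, written in Python. It is an illustration under stated assumptions (an optional 3-digit area code, at most one delimiter between groups, and an extension introduced by "x" or "ext"); it does not enforce the NANP rules on which digits are allowed in each position.

```python
import re

# Optional 3-digit area code, then 3 + 4 digits, then an optional extension.
# A single space, dash, or period may separate the digit groups.
PHONE_RE = re.compile(
    r"^(?:(\d{3})[ .-]?)?"             # optional area code
    r"(\d{3})[ .-]?(\d{4})"            # 7-digit local number
    r"(?:\s*(?:x|ext\.?)\s*(\d+))?$",  # optional extension
    re.IGNORECASE,
)

for candidate in ["555-1212", "234.567.8901 x1234", "111-222-3333 x 44444", "12345"]:
    match = PHONE_RE.match(candidate)
    print(candidate, "->", match.groups() if match else "no match")
```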

I would also suggest looking at the "libphonenumber" Google Library. I know it is not regex but it does exactly what you want.
For example, it will recognize that:
is a possible number but not a valid number. It also supports countries outside the US.
Highlights of functionality:
Parsing/formatting/validating phone numbers for all countries/regions of the world.
getNumberType - gets the type of the number based on the number itself; able to distinguish Fixed-line, Mobile, Toll-free, Premium Rate, Shared Cost, VoIP and Personal Numbers (whenever feasible).
isNumberMatch - gets a confidence level on whether two numbers could be the same.
getExampleNumber/getExampleNumberByType - provides valid example numbers for all countries/regions, with the option of specifying which type of example phone number is needed.
isPossibleNumber - quickly guessing whether a number is a possible phonenumber by using only the length information, much faster than a full validation.
isValidNumber - full validation of a phone number for a region using length and prefix information.
AsYouTypeFormatter - formats phone numbers on-the-fly when users enter each digit.
findNumbers - finds numbers in text input.
PhoneNumberOfflineGeocoder - provides geographical information related to a phone number.
The biggest problem with phone number validation is that it is very culturally dependent.
(408) 974–2042 is a valid US number
(999) 974–2042 is not a valid US number
0404 999 999 is a valid Australian number
(02) 9999 9999 is also a valid Australian number
(09) 9999 9999 is not a valid Australian number
A regular expression is fine for checking the format of a phone number, but it's not really going to be able to check the validity of a phone number.
I would suggest skipping a simple regular expression to test your phone number against, and using a library such as Google's libphonenumber.
Introducing libphonenumber!
Using one of your more complex examples, 1-234-567-8901 x1234, you get the following data out of libphonenumber's online demo:
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() true
Formatting Results:
E164 format +12345678901
Original format (234) 567-8901 ext. 123
National format (234) 567-8901 ext. 123
International format +1 234-567-8901 ext. 123
Out-of-country format from US 1 (234) 567-8901 ext. 123
Out-of-country format from CH 00 1 234-567-8901 ext. 123
So not only do you learn if the phone number is valid (which it is), but you also get consistent phone number formatting in your locale.
As a bonus, libphonenumber has a number of datasets to check the validity of phone numbers, as well, so checking a number such as +61299999999 (the international version of (02) 9999 9999) returns as a valid number with formatting:
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() true
Formatting Results
E164 format +61299999999
Original format 61 2 9999 9999
National format (02) 9999 9999
International format +61 2 9999 9999
Out-of-country format from US 011 61 2 9999 9999
Out-of-country format from CH 00 61 2 9999 9999
libphonenumber also gives you many additional benefits, such as grabbing the location that the phone number is detected as being, and also getting the time zone information from the phone number:
PhoneNumberOfflineGeocoder Results
Location Australia
PhoneNumberToTimeZonesMapper Results
Time zone(s) [Australia/Sydney]
But the invalid Australian phone number ((09) 9999 9999) returns that it is not a valid phone number.
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() false
Google's version has code for Java and JavaScript, but people have also implemented libraries for other languages that use the Google i18n phone number dataset:
PHP: https://github.com/giggsey/libphonenumber-for-php
Python: https://github.com/daviddrysdale/python-phonenumbers
Ruby: https://github.com/sstephenson/global_phone
C#: https://github.com/twcclegg/libphonenumber-csharp
Objective-C: https://github.com/iziz/libPhoneNumber-iOS
JavaScript: https://github.com/ruimarinho/google-libphonenumber
Elixir: https://github.com/socialpaymentsbv/ex_phone_number
Unless you are certain that you are always going to be accepting numbers from one locale, and they are always going to be in one format, I would heavily suggest not writing your own code for this, and using libphonenumber for validating and displaying phone numbers.

/^(?:(?:\(?(?:00|\+)([1-4]\d\d|[1-9]\d+)\)?)[\-\.\ \\\/]?)?((?:\(?\d{1,}\)?[\-\.\ \\\/]?)+)(?:[\-\.\ \\\/]?(?:#|ext\.?|extension|x)[\-\.\ \\\/]?(\d+))?$/i
This matches:
- (+351) 282 43 50 50
- 90191919908
- 555-8909
- 001 6867684
- 001 6867684x1
- 1 (234) 567-8901
- 1-234-567-8901 x1234
- 1-234-567-8901 ext1234
- 1-234 567.89/01 ext.1234
- 1(234)5678901x1234
- (123)8575973
- (0055)(123)8575973
On $n, it saves:
$1: Country indicator
$2: Phone number
$3: Extension
You can test it on https://regex101.com/r/kFzb1s/1

Although the answer to strip all whitespace is neat, it doesn't really solve the problem that's posed, which is to find a regex. Take, for instance, my test script that downloads a web page and extracts all phone numbers using the regex. Since you'd need a regex anyway, you might as well have the regex do all the work. I came up with this:
Here's a perl script to test it. When you match, $1 contains the area code, $2 and $3 contain the phone number, and $5 contains the extension. My test script downloads a file from the internet and prints all the phone numbers in it.
my $us_phone_regex =
my @tests =
(
"1-234-567-8901 x1234",
"1-234-567-8901 ext1234",
"1 (234) 567-8901",
"not a phone number"
);
foreach my $num (@tests)
{
if( $num =~ m/$us_phone_regex/ )
{
print "match [$1-$2-$3]\n" if not defined $4;
print "match [$1-$2-$3 $5]\n" if defined $4;
}
else
{
print "no match [$num]\n";
}
}
# Extract all phone numbers from an arbitrary file.
my $external_filename =
my @external_file = `curl $external_filename`;
foreach my $line (@external_file)
{
if( $line =~ m/$us_phone_regex/ )
{
print "match $1 $2 $3\n";
}
}
You can change \W* to \s*\W?\s* in the regex to tighten it up a bit. I wasn't thinking of the regex in terms of, say, validating user input on a form when I wrote it, but this change makes it possible to use the regex for that purpose.

I answered this on another SO question before deciding to also post it on this thread, because no one was addressing how to require or not require items; people were just handing out regexes:
Regex working wrong, matching unexpected things
From my post on that site, I've created a quick guide to assist anyone with making their own regex for their own desired phone number format, which I will caveat (like I did on the other site) that if you are too restrictive, you may not get the desired results, and there is no "one size fits all" solution to accepting all possible phone numbers in the world - only what you decide to accept as your format of choice. Use at your own risk.
Quick cheat sheet
Start the expression: /^
If you want to require a space, use: [\s] or \s
If you want to require parentheses, use: [(] and [)] . Using \( and \) is ugly and can make things confusing.
If you want anything to be optional, put a ? after it
If you want a hyphen, just type - or [-] . If you do not put it first or last in a series of other characters, though, you may need to escape it: \-
If you want to accept different choices in a slot, put brackets around the options: [-.\s] will require a hyphen, period, or space. A question mark after the last bracket will make all of those optional for that slot.
\d{3} : Requires a 3-digit number: 000-999. Shorthand for [0-9][0-9][0-9].
[2-9] : Requires a digit 2-9 for that slot.
(\+|1\s)? : Accept a "plus" or a 1 and a space (pipe character, |, is "or"), and make it optional. The "plus" sign must be escaped.
If you want specific numbers to match a slot, enter them: [246] will require a 2, 4, or 6. (?:77|78) will require 77 or 78. (A character class such as [77|78] does not do this: it matches a single 7, 8, or | character.)
$/ : End the expression
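To show how the pieces of the cheat sheet fit together, here is a small sketch in Python that assembles one possible pattern from them. The chosen format (optional "+" or "1 ", optional parentheses around an area code starting with 2-9, optional hyphen/period/space separators) is just one example of "your format of choice", not a recommendation.

```python
import re

pattern = re.compile(
    r"^"           # start the expression
    r"(\+|1\s)?"   # optional "plus", or a 1 followed by a space
    r"[(]?"        # optional opening parenthesis
    r"[2-9]\d{2}"  # area code: first digit 2-9, then any two digits
    r"[)]?"        # optional closing parenthesis
    r"[-.\s]?"     # optional hyphen, period, or space
    r"\d{3}"       # three digits
    r"[-.\s]?"     # optional hyphen, period, or space
    r"\d{4}"       # four digits
    r"$"           # end the expression
)

for candidate in ["(234) 567-8901", "1 234-567-8901", "123-456-7890"]:
    print(candidate, "->", bool(pattern.match(candidate)))
```

Because the parentheses are optional independently of each other, this sketch also accepts an unbalanced "(234 567-8901"; requiring both or neither needs an alternation.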

I wrote the simplest one (although I didn't need the dot in it).
^([0-9\(\)\/\+ \-]*)$
As mentioned below, it checks only which characters are present, not their structure or order.

Note that stripping () characters does not work for a common style of writing UK numbers: +44 (0) 1234 567890, which means dial either the international number +441234567890
or, in the UK, dial 01234567890

If you just want to verify you don't have random garbage in the field (i.e., from form spammers) this regex should do nicely:
Note that it doesn't have any special rules for how many digits, or what numbers are valid in those digits; it just verifies that only digits, parentheses, dashes, plus, space, pound, asterisk, period, comma, or the letters e, x, t are present.
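A sketch of such a character whitelist in Python, built only from the character list described above. The names ALLOWED_RE and looks_like_phone are made up for this example, and the pattern is a reconstruction from the description rather than the answer's own regex.

```python
import re

# Digits, parentheses, plus, space, pound, asterisk, period, comma,
# the letters e/x/t, and dash (placed last so it is a literal dash).
ALLOWED_RE = re.compile(r"[0-9()+ #*.,ext-]+")

def looks_like_phone(value: str) -> bool:
    """True if the value is non-empty and contains only whitelisted characters."""
    return ALLOWED_RE.fullmatch(value) is not None

print(looks_like_phone("1-234-567-8901 ext. 1234"))
print(looks_like_phone("<script>alert(1)</script>"))
```

As the answer says, this only filters out garbage; a string such as "text" still passes because it checks characters, not structure.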
It should be compatible with international numbers and localization formats. Do you foresee any need to allow square, curly, or angled brackets for some regions? (currently they aren't included).
If you want to maintain per digit rules (such as in US Area Codes and Prefixes (exchange codes) must fall in the range of 200-999) well, good luck to you. Maintaining a complex rule-set which could be outdated at any point in the future by any country in the world does not sound fun.
And while stripping all/most non-numeric characters may work well on the server side (especially if you are planning on passing these values to a dialer), you may not want to thrash the user's input during validation, particularly if you want them to make corrections in another field.

Here's a wonderful pattern that most closely matched the validation that I needed to achieve. I'm not the original author, but I think it's well worth sharing as I found this problem to be very complex and without a concise or widely useful answer.
The following regex will catch widely used number and character combinations in a variety of global phone number formats:
/^\s*(?:\+?(\d{1,3}))?([-. (]*(\d{3})[-. )]*)?((\d{3})[-. ]*(\d{2,4})(?:[-.x ]*(\d+))?)\s*$/gm
+42 555.123.4567
+7 555 1234567
(926) 1234567
926 1234567
495 1234567
469 123 45 67
8 (926) 1234567
202 555 4567
1 416 555 9292
926 3 4
8 800 600-APPLE
Original source: http://www.regexr.com/38pvb

Have you had a look over at RegExLib?
Entering US phone number brought back quite a list of possibilities.

My attempt at an unrestrictive regex:
/^[+#*\(\)\[\]]*([0-9][ ext+-pw#*\(\)\[\]]*){6,45}$/
+(01) 123 (456) 789 ext555
*44 123-456-789 [321]
*****++[](][((( 123456tteexxttppww
mob 07777 777777
1234 567 890 after 5pm
john smith
It is up to you to sanitize it for display. After validating it could be a number though.

I found this to work quite well:
^\(*\+*[1-9]{0,3}\)*-*[1-9]{0,3}[-. /]*\(*[2-9]\d{2}\)*[-. /]*\d{3}[-. /]*\d{4} *e*x*t*\.* *\d{0,4}$
It works for these number formats:
1-234-567-8901 x1234
1-234-567-8901 ext1234
1 (234) 567-8901
1-234-567-8901 ext. 1234
(+351) 282 433 5050
Make sure to use both the global AND multiline flags when testing several numbers at once, one per line.
Link: http://www.regexr.com/3bp4b

Here's my best try so far. It handles the formats above but I'm sure I'm missing some other possible formats.
^\d?(?:(?:[\+]?(?:[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?:[\d]{3})[\-/)]?(?:[ ]+)?)?(?:[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(i:ext[\.]?)){1,2}(?:[\d]{1,5}))?$

This is a simple Regular Expression pattern for Philippine Mobile Phone Numbers:
((\+[0-9]{2})|0)[.\- ]?9[0-9]{2}[.\- ]?[0-9]{3}[.\- ]?[0-9]{4}
((\+63)|0)[.\- ]?9[0-9]{2}[.\- ]?[0-9]{3}[.\- ]?[0-9]{4}
will match these:
+63 917 123 4567
The first one will match ANY two digit country code, while the second one will match the Philippine country code exclusively.
Test it here: http://refiddle.com/1ox

If you're talking about form validation, the regexp to validate correct meaning as well as correct data is going to be extremely complex because of varying country and provider standards. It will also be hard to keep up to date.
I interpret the question as looking for a broadly valid pattern, which may not be internally consistent: for example, accepting a plausible set of digits without checking that the trunk line, exchange, etc. fit the valid pattern for the country code prefix.
North America is straightforward, and for international I prefer to use an 'idiomatic' pattern which covers the ways in which people specify and remember their numbers:
^((((\(\d{3}\))|(\d{3}-))\d{3}-\d{4})|(\+?\d{2}((-| )\d{1,8}){1,5}))(( x| ext)\d{1,5}){0,1}$
The North American pattern makes sure that if one parenthesis is included both are. The international accounts for an optional initial '+' and country code. After that, you're in the idiom. Valid matches would be:
(xxx)xxx-xxxx x123
12 1234 123 1 x1111
12 12 12 12 12
12 1 1234 123456 x12345
+12 1234 1234
+12 12 12 1234
+12 1234 5678
+12 12345678
This may be biased as my experience is limited to North America, Europe and a small bit of Asia.

My gut feeling is reinforced by the number of replies to this topic: there is a virtually infinite number of solutions to this problem, none of which are going to be elegant.
Honestly, I would recommend you don't try to validate phone numbers. Even if you could write a big, hairy validator that would allow all the different legitimate formats, it would end up allowing pretty much anything even remotely resembling a phone number in the first place.
In my opinion, the most elegant solution is to validate a minimum length, nothing more.

You'll have a hard time dealing with international numbers with a single/simple regex, see this post on the difficulties of international (and even north american) phone numbers.
You'll want to parse the first few digits to determine what the country code is, then act differently based on the country.
Beyond that - the list you gave does not include another common US format - leaving off the initial 1. Most cell phones in the US don't require it, and it'll start to baffle the younger generation unless they've dialed internationally.
You've correctly identified that it's a tricky problem...

After reading through these answers, it looks like there wasn't a straightforward regular expression that can parse through a bunch of text and pull out phone numbers in any format (including international with and without the plus sign).
Here's what I used for a client project recently, where we had to convert all phone numbers in any format to tel: links.
So far, it's been working with everything they've thrown at it, but if errors come up, I'll update this answer.
/(\+*\d{1,})*([ |\(])*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})/
PHP function to replace all phone numbers with tel: links (in case anyone is curious):
function phoneToTel($number) {
$return = preg_replace('/(\+*\d{1,})*([ |\(])*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})/', '$1 ($3) $4-$5', $number); // includes international
return $return;
}

I believe the Number::Phone::US and Regexp::Common (particularly the source of Regexp::Common::URI::RFC2806) Perl modules could help.
The question should probably be specified in a bit more detail to explain the purpose of validating the numbers. For instance, 911 is a valid number in the US, but 911x isn't for any value of x. That's so that the phone company can calculate when you are done dialing. There are several variations on this issue. But your regex doesn't check the area code portion, so that doesn't seem to be a concern.
Like validating email addresses, even if you have a valid result you can't know if it's assigned to someone until you try it.
If you are trying to validate user input, why not normalize the result and be done with it? If the user puts in a number you can't recognize as a valid number, either save it as inputted or strip out undialable characters. The Number::Phone::Normalize Perl module could be a source of inspiration.

Do a replace on formatting characters, then check the remaining for phone validity. In PHP,
$replace = array( ' ', '-', '/', '(', ')', ',', '.' ); //etc; as needed
preg_match( '/1?[0-9]{10}((ext|x)[0-9]{1,4})?/i', str_replace( $replace, '', $phone_num ) );
Breaking up a complex regexp like this can be just as effective, but much simpler.

I work for a market research company and we have to filter these types of input alllll the time. You're complicating it too much. Just strip the non-alphanumeric chars, and see if there's an extension.
For further analysis you can subscribe to one of many providers that will give you access to a database of valid numbers as well as tell you if they're landlines or mobiles, disconnected, etc. It costs money.

I found this to be something interesting. I have not tested it but it looks as if it would work
string validate_telephone_number (string $number, array $formats)
function validate_telephone_number($number, $formats)
{
$format = trim(preg_replace("/[0-9]/", "#", $number));
return in_array($format, $formats);
}
/* Usage Examples */
// List of possible formats: You can add new formats or modify the existing ones
$formats = array('###-###-####', '####-###-###',
'(###) ###-###', '####-####-####',
'##-###-####-####', '####-####', '###-###-###',
'#####-###-###', '##########');
$number = '08008-555-555';
if(validate_telephone_number($number, $formats))
echo $number.' is a valid phone number.';
echo "<br />";
$number = '123-555-555';
if(validate_telephone_number($number, $formats))
echo $number.' is a valid phone number.';
echo "<br />";
$number = '1800-1234-5678';
if(validate_telephone_number($number, $formats))
echo $number.' is a valid phone number.';
echo "<br />";
$number = '(800) 555-123';
if(validate_telephone_number($number, $formats))
echo $number.' is a valid phone number.';
echo "<br />";
$number = '1234567890';
if(validate_telephone_number($number, $formats))
echo $number.' is a valid phone number.';
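The same format-mask idea can be sketched in Python: replace every digit with '#' and check whether the resulting layout is in a list of accepted layouts. The function name matches_known_format is an assumption for this example; the format list is the one from the PHP snippet above.

```python
import re

FORMATS = [
    "###-###-####", "####-###-###",
    "(###) ###-###", "####-####-####",
    "##-###-####-####", "####-####", "###-###-###",
    "#####-###-###", "##########",
]

def matches_known_format(number: str, formats=FORMATS) -> bool:
    """Mask each ASCII digit with '#' and compare the layout against the accepted ones."""
    mask = re.sub(r"[0-9]", "#", number.strip())
    return mask in formats

for number in ["08008-555-555", "(800) 555-123", "12-34"]:
    print(number, "->", matches_known_format(number))
```

This validates layout only; it says nothing about whether the digits themselves form an assignable number.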

You would probably be better off using a Masked Input for this. That way users can ONLY enter numbers and you can format however you see fit. I'm not sure if this is for a web application, but if it is, there is a very slick jQuery plugin that offers some options for doing this.
They even go over how to mask phone number inputs in their tutorial.

Here's one that works well in JavaScript. It's in a string because that's what the Dojo widget was expecting.
It matches a 10 digit North America NANP number with optional extension. Spaces, dashes and periods are accepted delimiters.
"^(\\(?\\d\\d\\d\\)?)( |-|\\.)?\\d\\d\\d( |-|\\.)?\\d{4,4}(( |-|\\.)?[ext\\.]+ ?\\d+)?$"

I was struggling with the same issue, trying to make my application future proof, but these guys got me going in the right direction. I'm not actually checking the number itself to see if it works or not, I'm just trying to make sure that a series of numbers was entered that may or may not have an extension.
Worst case scenario if the user had to pull an unformatted number from the XML file, they would still just type the numbers into the phone's numberpad 012345678x5, no real reason to keep it pretty. That kind of RegEx would come out something like this for me:
\d+ ?\w{0,9} ?\d+
01234467 extension 123456

My inclination is to agree that stripping non-digits and just accepting what's there is best. Maybe to ensure at least a couple digits are present, although that does prohibit something like an alphabetic phone number "ASK-JAKE" for example.
A couple simple perl expressions might be:
@f = /(\d+)/g;
Use the first one to keep the digit groups together, which may give formatting clues. Use the second one to trivially toss all non-digits.
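For comparison, the two operations described above (keeping digit groups together versus tossing all non-digits) might look like this in Python. The helper names digit_groups and only_digits are invented for the example.

```python
import re

def digit_groups(text: str) -> list:
    """Return each run of consecutive digits, preserving the grouping the user typed."""
    return re.findall(r"\d+", text)

def only_digits(text: str) -> str:
    """Discard every non-digit character."""
    return re.sub(r"\D", "", text)

sample = "555-1212 (wait for the beep) 123"
print(digit_groups(sample))
print(only_digits(sample))
```

The grouped form keeps hints such as a trailing extension as a separate element, which the fully stripped form loses.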
Is it a worry that there may need to be a pause and then more keys entered? Or something like 555-1212 (wait for the beep) 123?

Must end with a digit, can begin with ( or + or a digit, and may contain + - ( or )

For anyone interested in doing something similar with Irish mobile phone numbers, here's a straightforward way of accomplishing it:
$pattern = "/^(083|086|085|086|087)\d{7}$/";
$phone = "087343266";
if (preg_match($pattern,$phone)) echo "Match";
else echo "Not match";
There is also a JQuery solution on that link.
jQuery solution:
//original field values
var field_values = {
//id : value
'url' : 'url',
'yourname' : 'yourname',
'email' : 'email',
'phone' : 'phone'
var url =$("input#url").val();
var yourname =$("input#yourname").val();
var email =$("input#email").val();
var phone =$("input#phone").val();
$('input#url').inputfocus({ value: field_values['url'] });
$('input#yourname').inputfocus({ value: field_values['yourname'] });
$('input#email').inputfocus({ value: field_values['email'] });
$('input#phone').inputfocus({ value: field_values['phone'] });
//reset progress bar
$('#progress_text').html('0% Complete');
$('form').submit(function(){ return false; });
//remove classes
$('#first_step input').removeClass('error').removeClass('valid');
//ckeck if inputs aren't empty
var fields = $('#first_step input[type=text]');
var error = 0;
var value = $(this).val();
if( value.length<12 || value==field_values[$(this).attr('id')] ) {
$(this).effect("shake", { times:3 }, 50);
} else {
if(!error) {
if( $('#password').val() != $('#cpassword').val() ) {
$('#first_step input[type=password]').each(function(){
$(this).effect("shake", { times:3 }, 50);
return false;
} else {
//update progress bar
$('#progress_text').html('33% Complete');
//slide steps
} else return false;
//second section
//remove classes
$('#second_step input').removeClass('error').removeClass('valid');
var emailPattern = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$/;
var fields = $('#second_step input[type=text]');
var error = 0;
var value = $(this).val();
if( value.length<1 || value==field_values[$(this).attr('id')] || ( $(this).attr('id')=='email' && !emailPattern.test(value) ) ) {
$(this).effect("shake", { times:3 }, 50);
} else {
function validatePhone(phone) {
var a = document.getElementById(phone).value;
var filter = /^[0-9-+]+$/;
if (filter.test(a)) {
return true;
} else {
return false;
}
}
$('#phone').blur(function(e) {
if (validatePhone('txtPhone')) {
$('#spnPhoneStatus').css('color', 'green');
} else {
$('#spnPhoneStatus').css('color', 'red');
}
});
if(!error) {
//update progress bar
$('#progress_text').html('66% Complete');
//slide steps
} else return false;
//update progress bar
$('#progress_text').html('100% Complete');
//prepare the fourth step
var fields = new Array(
var tr = $('#fourth_step tr');
//alert( fields[$(this).index()] )
//slide steps
url =$("input#url").val();
yourname =$("input#yourname").val();
email =$("input#email").val();
phone =$("input#phone").val();
//send information to server
var dataString = 'url='+ url + '&yourname=' + yourname + '&email=' + email + '&phone=' + phone;
alert (dataString);//return false;
type: "POST",
url: "http://clients.socialnetworkingsolutions.com/infobox/contact/",
data: "url="+url+"&yourname="+yourname+"&email="+email+'&phone=' + phone,
cache: false,
success: function(data) {
console.log("form submitted");
return false;
//back button
var container = $(this).parent('div'),
previous = container.prev();
switch(previous.attr('id')) {
case 'first_step' : $('#progress_text').html('0% Complete');
case 'second_step': $('#progress_text').html('33% Complete');
case 'third_step' : $('#progress_text').html('66% Complete');
default: break;


Regex - How to replace each of those digits with one character without erasing any character just around them?

I have need to scrub out numeric data from some xml-like strings that are being logged to a minimally secure logging tool. The data logged may contain some personal information like Phone or Date of Birth, or may contain some financial information like income or other.
The data is in xml format, but is not guaranteed to be well-formed. But the data I'm scrubbing would be the xml data that would be between the tags.
For Instance:
<DOB>January 4 1988</DOB>
What we would like to do, is scrub all the numeric values from some data. So our result sets would look like:
<DOB>January N NNNN</DOB>
This would help in identifying issues with the calls made with the data, so we could see (for instance) that a phone number contained 9 numbers instead of 10, or DOB year only had 3 digits.
So far, I have this
Regex.Replace(xmlIn, @"(?<=<DOB>)\d+(?=</DOB>)", "N");
But this only works with numeric-only entries, like all 7 digits of phone number crammed together.
Try this regex:
Sample code
Regex regex = new Regex(
RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled
string regexReplace = "N";
string result = regex.Replace(InputText,regexReplace);
Sample Input
Given Examples
<DOB>January 4 1988</DOB>
Additional tests
<!-- This is a valid number: 99 -->
<DOB>14523 <-- Closing DOB tag is missing here...
Sample output
Given Examples
<DOB>January 4 NNNN</DOB>
Additional tests
<!-- This is a valid number: 99 -->
<DOB>14523 <-- Closing DOB tag is missing here...
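One way to get the digit-by-digit masking the question asks for is to match the text between a '>' and the next '<' and replace each digit inside it individually. The Python sketch below illustrates that approach under stated assumptions; it is not the regex from the answer above, the name scrub_digits is hypothetical, and it will also mask digits after an unclosed tag when another '<' follows.

```python
import re

def scrub_digits(xml_text: str) -> str:
    """Replace each digit in element content with 'N', leaving tags and other characters alone."""
    def mask(match):
        return re.sub(r"\d", "N", match.group(0))
    # Element content: characters preceded by '>' and followed by '<'.
    return re.sub(r"(?<=>)[^<>]+(?=<)", mask, xml_text)

print(scrub_digits("<DOB>January 4 1988</DOB>"))
print(scrub_digits("<Phone>555-1234</Phone>"))
```

Because each digit becomes exactly one N, the masked output still shows how many digits each field contained, which is the debugging property the question wants.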

Is there a regex to test if a string is for a locale? [closed]

I don't know anything about regular expressions, but I think I have to use them for my problem. I have some filenames that look like:
The idea is to test if my strings end with "[letter][letter]-[letter][letter]"
I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(
To cater for basic variants:
which consists of:
Language code: ISO 639 2 or 3, or 4 for future use, alpha.
Optional script code: ISO 15924 4 alpha.
Optional country code: ISO 3166-1 2 alpha or 3 digit.
Separated by underscores or dashes.
Valid examples are:
For the OP's specific question, this would need to be prefixed by /^MyResource[.] and suffixed by $/ to ensure the whole file name is for a valid resource file that ends in a locale.
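The structure listed above (language, optional script, optional country or region, joined by underscores or dashes) can be sketched in Python as follows. This is a reconstruction from that description, not the answer's own pattern; the names LOCALE, is_locale_shaped and is_resource_file are assumptions for the example.

```python
import re

LOCALE = (
    r"[A-Za-z]{2,4}"                       # language code: 2 to 4 letters
    r"(?:[_-][A-Za-z]{4})?"                # optional script code: 4 letters
    r"(?:[_-](?:[A-Za-z]{2}|[0-9]{3}))?"   # optional country: 2 letters or 3 digits
)

def is_locale_shaped(tag: str) -> bool:
    """True if the whole string has the basic language[-script][-country] shape."""
    return re.fullmatch(LOCALE, tag) is not None

def is_resource_file(name: str) -> bool:
    """True if the name is 'MyResource.' followed by a locale-shaped suffix."""
    return re.fullmatch(r"MyResource[.]" + LOCALE, name) is not None

for tag in ["en", "en-GB", "az-Cyrl-AZ", "es_419", "en-G"]:
    print(tag, "->", is_locale_shaped(tag))
```

This checks shape only, so any 2 to 4 letter suffix passes as a language code; confirming that a tag names a real locale needs a lookup against a locale database.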
Note that some programming language's functions may only accept particular forms, like only underscores and uppercase country code. PHP's intl functions accept either case and separators. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping; the non-wrapping _ can otherwise leave lines largely empty, especially in table cells.
The regex for the recommended basic format is:
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
function is_locale($locale=''){
$locales = resourcebundle_locales('');
return (array_search($locale,$locales)!==false);
}
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)".
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
// there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .Net, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture specific operations:
string fileName = "MyResource.en-GB";
string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
try {
CultureInfo cultureInfo = new CultureInfo(cultureName);
} catch (ArgumentException) {
// Invalid culture.
}
You could try something like this:
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = MyResource.whatever.... //Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
On a similar note, here is a useful list of two letter country codes.
I know this isn't really regex, but you didn't seem sure about needing to use it absolutely.
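var cultures = CultureInfo.GetCultures(System.Globalization.CultureTypes.AllCultures);
// Skip the invariant culture: its Name is empty, and EndsWith("") is always true.
var matches = cultures.Where(o => o.Name.Length > 0 && filename.EndsWith(o.Name));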
This might not be an answer to this question, but one may pass by and be looking for this answer.
To match locales like en_GB you can use this expression:
^[a-z]{2}_[A-Z]{2}$
I'll try to explain it here:
^[a-z] means the string starts with lowercase letters, and {2} means you expect exactly 2 of them
followed by _
[A-Z]{2}$ means the string ends with exactly 2 uppercase letters; $ means that these letters have to be at the end of the string.
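The explanation above can be tried out with a few lines of C#. This is only a sketch: the class name is hypothetical, and the pattern is simply the one the explanation describes (two lowercase letters, an underscore, two uppercase letters).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for trying the pattern out.
public static class UnderscoreLocale
{
    // Two lowercase letters, an underscore, two uppercase letters, nothing else.
    private const string Pattern = @"^[a-z]{2}_[A-Z]{2}$";

    public static bool IsMatch(string candidate)
    {
        return candidate != null && Regex.IsMatch(candidate, Pattern);
    }
}
```

With this, "en_GB" is accepted, while "en-GB" (hyphen), "EN_gb" (wrong case) and "eng_GB" (three letters) are rejected.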
An extension to the great answer by Patanjali, but also including named groups and support for private-use as defined in RFC 4647. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
I used this regex, and it works for locales where the '_' part is optional.
For example:
So the regex works if the locale has exactly two lowercase characters,
or two lowercase characters + _ + two characters (which can be uppercase).

C# - Finding which prefix applies

Just looking for a bit of help with the best way to proceed with the following problem:
I have a list of dialled numbers, for example:
006789 1234
006656 1234
006676 1234
006999 1234
007000 1234
006999 6789
Now, I also have a list of prefixes (the prefix being the bit dialled first, which also tells you where the call is going; that is the important bit). Also important: they have leading 0's, and they are of differing lengths.
For example:
006789 = australia
006789 = russia
006656 = france
006676 = austria
0069 = brazil
00700 = china
So what I am trying to do is write a C# algorithm to find which prefix applies.
The logic works as follows, say we have one dialled number and these prefixes
dialled number:0099876 5555 6565,
prefix1: 0099876 = Lyon (France)
prefix2: 0099 = France
Now both prefixes apply, except "the more detailed one" always wins. i.e. this call is to Lyon (France) and 0099876 should be result even though 0099 also applies.
Any help on getting me started with this algorithm would be great because, looking at it, I'm not sure if I should be comparing strings or ints! I have tried .Contains with strings, but as shown in my examples, that doesn't exactly work if the prefix appears later in the number:
6999 6978
6978 1234
Looks like a good match for a trie to me. Given that your prefixes are guaranteed to be short, this should be nice and quick to search. You could also find all matching prefixes at the same time. The longest prefix would be the last to match in the trie and would be O(m) to find (worst case), where m is the length of the prefix.
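A minimal sketch of the trie idea, assuming prefixes and dialled numbers are digit strings and that spaces in a dialled number can be ignored. The names PrefixTrie, Add and FindLongest are made up for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical helper: a character trie that returns the destination of the
// longest stored prefix of a dialled number.
public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Destination; // non-null only where a stored prefix ends
    }

    private readonly Node _root = new Node();

    public void Add(string prefix, string destination)
    {
        Node node = _root;
        foreach (char c in prefix)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Destination = destination;
    }

    // Walks the number one character at a time and remembers the last node
    // that ended a stored prefix. Returns null when no prefix applies.
    public string FindLongest(string number)
    {
        Node node = _root;
        string best = null;
        foreach (char c in number)
        {
            if (c == ' ') continue; // assumption: spaces are just formatting
            Node next;
            if (!node.Children.TryGetValue(c, out next)) break;
            node = next;
            if (node.Destination != null) best = node.Destination;
        }
        return best;
    }
}
```

After adding "0099876" as Lyon (France) and "0099" as France, looking up "0099876 5555 6565" passes the France node first and then the Lyon node, so the more detailed prefix wins without any sorting.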
I guess you could sort your prefixes by length (longest first).
Then when you need to process a number, you can run through the prefixes in order, and stop when yourNumber.StartsWith(prefix) is true.
Find longest. Use LINQ:
prefixes.Where(p => number.StartsWith(p)).OrderByDescending(p => p.Length).FirstOrDefault();
If you already know what prefixes you're looking for, you're better off using a HashMap (I believe it's a Dictionary in C#) to store the prefix and the country it corresponds to. Then, for every number that comes in, you can do a standard search on all the prefixes in the list. Store the ones that match in a list, and then pick the longest match.
Another approach would be to shorten the dialed number by one from the right and test if this number is within the list:
Dictionary<string, string> numbers = new Dictionary<string, string>();
//Build up the possible numbers from somewhere
numbers.Add("006789", "australia");
numbers.Add("006790", "russia");
numbers.Add("006656", "france");
numbers.Add("006676", "austria");
numbers.Add("0069", "brazil");
numbers.Add("00700", "china");
numbers.Add("0099876", "Lyon (France)");
numbers.Add("0099", "France");
//Get the dialed number from somewhere
string dialedNumber = "0099 876 1234 56";
//Remove all whitespaces (maybe minus signs, plus sign against double zero, remove brackets, etc)
string normalizedNumber = dialedNumber.Replace(" ", "");
string searchForNumber = normalizedNumber;
while (searchForNumber.Length > 0) {
    if (numbers.ContainsKey(searchForNumber)) {
        Console.WriteLine("The number '{0}' is calling from {1}", dialedNumber, numbers[searchForNumber]);
        return;
    }
    // No entry for this candidate: drop the last digit and try again.
    searchForNumber = searchForNumber.Remove(searchForNumber.Length - 1);
}
Console.WriteLine("The number '{0}' doesn't contain any valid prefix", dialedNumber);

Regular expression for parsing mailing addresses

I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.
Note: The STREETTYPES and QUADRANTS constants contain all of the relevant street types and quadrants respectively.
I've included a subset here:
HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.
private void Parse(string line1) {
    HouseNumber = string.Empty;
    Quadrant = string.Empty;
    StreetName = string.Empty;
    StreetType = string.Empty;
    if (!String.IsNullOrEmpty(line1)) {
        string noPeriodsLine1 = line1.Replace(".", "");
        string addressParseRegEx =
        (?:(?:\s+|-)(?<quadrant>" +
        (?:(?:\s+|-)(?<quadrant>" +
        (?:(?:\s+|-)(?<streettype>" + STREETTYPES +
        (?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
        (?:(?:\s+|-)(?<streettypequadrant>(" +
        QUADRANTS + @")))??
        Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
        if (match.Success) {
            HouseNumber = match.Groups["housenumber"].Value;
            Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
            if (match.Groups["streetname"].Captures.Count > 1) {
                foreach (Capture capture in match.Groups["streetname"].Captures) {
                    StreetName += capture.Value + " ";
                }
                StreetName = StreetName.Trim();
            } else {
                StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
            }
            StreetType = match.Groups["streettype"].Value;
            //if the matched street type is found
            //use the abbreviated version...especially for credit bureau calls
            string streetTypeAbbreviation;
            if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation)) {
                StreetType = streetTypeAbbreviation;
            }
        }
    }
}
Have fun with addresses and regexs, you're in for a long, horrible ride.
You're trying to lay order upon chaos.
For every "123 Simple Way", there's a "14 1/2 South".
Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".
Have fun with that.
There are more exceptions than rules when it comes to street addresses.
I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.
I think you should clarify your usage scenario.
Unless you're in a very, very limited scenario where you know that the addresses were entered following a strict schema, parsing addresses for content is an extremely hard problem to solve and, usually, quite futile (unless it's the raison d'être of your application).
If you're limited to a particular country that has very specific conventions for writing addresses, then using these regexes might get you 90% of the way.
However, as soon as you have to start accepting foreign addresses, you're screwed.
Even if you're a US-centric site, there is a good chance that you may have to be able to accept addresses from US citizens living abroad, for instance.
Again, it may be OK in a very narrow field, but it's almost always a bad idea to validate or split addresses that were not strictly validated and constrained at the time the user entered them.
When you do enforce some strict rules for users to enter their addresses, these end up being inadequate in a small portion of cases, even in the best address validation components out there.
Just a few things that mess up address parsing:
Postal codes (Zip codes) are sometimes placed before, sometimes after, and may not even exist at all.
Postal codes follow strict rules: a 10-digit Zip code is probably easy to spot as invalid, but what about a non-existent one? What about more complex codes, such as those used in the UK?
What about a place like Hong Kong where you could write the address in either English, Traditional Chinese or Mandarin?
What if it's perfectly fine to split your address and write it out of sequence?
Even if you're just parsing US addresses, there are at least a handful of ways to describe a PO box: you can also use poste restante or general delivery, and then need to add a 4-digit code to the Zip code, which would normally probably not be present at all...
Bottom line is
If getting addresses in a parseable format is really important, be 100% sure that you can get all possible combinations right, or you're going to have a percentage of failures that will mean frustrated users and lost sales.
If you don't have 100% case coverage then don't enforce strict rules on the user.
I can't count the number of websites I gave up purchasing from because they would require a Zip/Postal Code when the place I live in has none.
Sorry for the rant, but I think it's important that people wanting to do address validation and parsing think hard about what they're getting themselves into.
This actually works pretty well except that it doesn't pull apartment numbers. We're working on that. It also coughed a little when we had an address of 769 Branch Ave. Of course "branch" is one of the street types that it's looking for. It all goes back to that making-order-out-of-chaos thing. We know that it's going to break here and there.
If someone runs into this problem in 2013/2014 :)
You can use the Google Geocoding API. It provides more functionality than just a regex; you can even get the lat/long for an address. And it's free.
For an address example-
I tried to get this to work, but it seems as though you have a static member of a StreetTypes class that is not included. It seems to work except for that, but I can not do much testing without it.
I'll agree that your strictness is going to be a problem. I'm writing an address parser designed to strip addresses from classified ads where the format could be just about anything. For instance, for your quadrant matches, you're ignoring punctuation altogether. I have to search data that could represent NE in all these different ways:
"NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast"
so I am using the following pattern match which should catch all direction qualifiers no matter how they are expressed:
\b(?:(?:[nesw]\.? ?){0,2}|(?:north|no\.|east|south|so\.|west){0,2})\b
Of course, context is also important since "no" is going to be matched by this. But "NE" for Nebraska would be matched by either, so you really have to be careful about what's to the left and right in your larger expression. I'm having to compile lists of words that commonly appear interspersed in address texts which are not address components, such as "near, x-street, in, across", etc.
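One way to tame the variants listed above is to normalize a candidate token once it has been isolated, rather than searching free text. The sketch below does only that narrower job and is not the pattern quoted above: the class name and regex are assumptions, and it expects the caller to have already decided from context that the token is a direction (so "NE" for Nebraska is still the caller's problem).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: maps "N.E.", "north east", "Northeast", etc. to "NE".
public static class DirectionNormalizer
{
    // Optional north/south part, optional east/west part, each optionally
    // abbreviated to its first letter and optionally followed by a period.
    private static readonly Regex Direction = new Regex(
        @"^(?:(?<ns>n(?:orth)?|s(?:outh)?)\.?\s*)?(?:(?<ew>e(?:ast)?|w(?:est)?)\.?)?$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the one- or two-letter compass code, or null when the token
    // is not a recognizable direction.
    public static string Normalize(string token)
    {
        if (string.IsNullOrWhiteSpace(token)) return null;
        Match m = Direction.Match(token.Trim());
        if (!m.Success) return null;
        string ns = m.Groups["ns"].Success ? m.Groups["ns"].Value.Substring(0, 1) : "";
        string ew = m.Groups["ew"].Success ? m.Groups["ew"].Value.Substring(0, 1) : "";
        return (ns + ew).ToUpperInvariant();
    }
}
```

Because the pattern is anchored at both ends, a stray word such as "no" or "new" is rejected instead of being half-matched, which is the trade-off for requiring the token to be isolated first.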
It is a very tough problem, and I agree Salt Lake City is a bitch. In addition to having the double direction/coordinate format, they also compound it by referring to stuff like "3700 North 5300 East Arborville Way" where the streets can be referenced by name, number, or both.

Phone Number Formatting, OnBlur

I have a .NET WinForms textbox for a phone number field. After allowing free-form text, I'd like to format the text as a "more readable" phone number after the user leaves the textbox. (Outlook has this feature for phone fields when you create/edit a contact)
1234567 becomes 123-4567
1234567890 becomes (123) 456-7890
(123)456.7890 becomes (123) 456-7890
123.4567x123 becomes 123-4567 x123
A fairly simple-minded approach would be to use a regular expression. Depending on which type of phone numbers you're accepting, you could write a regular expression that looks for the digits (for US-only, you know there can be 7 or 10 total - maybe with a leading '1') and potential separators between them (period, dash, parens, spaces, etc.).
Once you run the match against the regex, you'll need to write the logic to determine what you actually got and format it from there.
EDIT: Just wanted to add a very basic example (by no means is this going to work for all of the examples you posted above). Geoff's suggestion of stripping non-numeric characters might help out a bit depending on how you write your regex.
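Regex regex = new Regex(@"(?<areaCode>([\d]{3}))?[\s.-]?(?<leadingThree>([\d]{3}))[\s.-]?(?<lastFour>([\d]{4}))[x]?(?<extension>[\d]{1,})?");
string phoneNumber = "701 123-4567x324";
Match phoneNumberMatch = regex.Match(phoneNumber);
if (phoneNumberMatch.Groups["areaCode"].Success)
    Console.WriteLine(phoneNumberMatch.Groups["areaCode"].Value);
if (phoneNumberMatch.Groups["leadingThree"].Success)
    Console.WriteLine(phoneNumberMatch.Groups["leadingThree"].Value);
if (phoneNumberMatch.Groups["lastFour"].Success)
    Console.WriteLine(phoneNumberMatch.Groups["lastFour"].Value);
if (phoneNumberMatch.Groups["extension"].Success)
    Console.WriteLine(phoneNumberMatch.Groups["extension"].Value);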
I think the easiest thing to do is to first strip any non-numeric characters from the string so that you just have a number then format as mentioned in this question
I thought about stripping any non-numeric characters and then formatting, but I don't think that works so well for the extension case (123.4567x123)
Lop off the extension, then strip the non-numeric characters from the remainder. Format it, then add the extension back on.
Start: 123.4567x123
Lop: 123.4567
Strip: 1234567
Format: 123-4567
Add: 123-4567 x123
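The lop / strip / format / add steps above could be sketched like this. The class and method names are hypothetical, and the sketch assumes US-style numbers of 7 or 10 digits with the extension introduced by the first 'x' or 'X'; anything else is returned unchanged.

```csharp
using System.Linq;

// Hypothetical helper following the four steps: lop, strip, format, add.
public static class PhoneFormatter
{
    public static string Format(string input)
    {
        if (input == null) return string.Empty;

        // Lop: split off everything from the first 'x' (assumed extension marker).
        int x = input.IndexOfAny(new[] { 'x', 'X' });
        string main = x >= 0 ? input.Substring(0, x) : input;
        string ext = x >= 0 ? input.Substring(x + 1) : string.Empty;

        // Strip: keep ASCII digits only.
        string digits = new string(main.Where(c => c >= '0' && c <= '9').ToArray());
        string extDigits = new string(ext.Where(c => c >= '0' && c <= '9').ToArray());

        // Format: only the 7- and 10-digit shapes are handled here.
        string formatted;
        if (digits.Length == 7)
            formatted = digits.Substring(0, 3) + "-" + digits.Substring(3);
        else if (digits.Length == 10)
            formatted = "(" + digits.Substring(0, 3) + ") " + digits.Substring(3, 3) + "-" + digits.Substring(6);
        else
            return input; // unknown shape: leave the text as the user typed it

        // Add: put the extension back on, if there was one.
        return extDigits.Length > 0 ? formatted + " x" + extDigits : formatted;
    }
}
```

Returning the input untouched for unrecognized shapes keeps the textbox free-form, so a user who types something unexpected does not lose it on blur.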
I don't know of any way other than doing it yourself by possibly making some masks and checking which one it matches and doing each mask on a case by case basis. Don't think it'd be too hard, just time consuming.
My guess is that you could accomplish this with a conditional statement to look at the input and then parse it into a specific format. But I'm guessing there is going to be a good amount of logic to investigate the input and format the output.
This works for me. Worth checking performance if you are doing this in a tight loop...
public static string FormatPhoneNumber(string phone) {
    // Keep digits only.
    phone = Regex.Replace(phone, @"[^\d]", "");
    if (phone.Length == 10)
        return Regex.Replace(phone,
            @"(?<ac>\d{3})(?<pref>\d{3})(?<num>\d{4})",
            "(${ac}) ${pref}-${num}");
    else if ((phone.Length < 16) && (phone.Length > 10))
        return Regex.Replace(phone,
            @"(?<ac>\d{3})(?<pref>\d{3})(?<num>\d{4})(?<ext>\d{1,5})",
            "(${ac}) ${pref}-${num} x${ext}");
    return string.Empty;
}

