I have the following code to check whether a phone number is in the format
(XXX) XXX-XXXX. The code below always returns true, and I'm not sure why.
Match match = Regex.Match(input, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");
// The check below always returns true
if (match.Success) { ....}
The general complaint about regex patterns for phone numbers is that they force the user to type truly optional characters such as dashes.
Why can't those be optional, so that the pattern doesn't care whether they are there or not?
The pattern below makes dashes, periods and parentheses optional for the user and focuses on the digits, which it exposes through named captures.
The pattern is commented (using # and spanning multiple lines), so use the Regex option IgnorePatternWhitespace unless you remove the comments. That flag doesn't affect how matching is performed; it only lets the pattern contain comments (from # to the line break) and insignificant whitespace.
string pattern = @"
^ # From Beginning of line
(?:\(?) # Match but don't capture optional (
(?<AreaCode>\d{3}) # 3 digit area code
(?:[\).\s]?) # Optional ) or . or space
(?<Prefix>\d{3}) # Prefix
(?:[-\.\s]?) # optional - or . or space
(?<Suffix>\d{4}) # Suffix
(?!\d) # Fail if eleventh number found";
The above pattern just looks for 10 digits and ignores any filler characters such as a (, a dash -, a space, a tab or even a period. Examples:
(555)555-5555 (ok)
5555555555 (ok)
555 555 5555 (ok)
555.555.5555 (ok)
55555555556 (not ok - match failure - too many digits)
123.456.789 (not ok - match failure - too few digits)
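As a rough sketch of how the commented pattern could be used, the named groups can be read back after a match made with RegexOptions.IgnorePatternWhitespace. The PhoneParser class and its Normalize method are names made up for this illustration; they are not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class PhoneParser
{
    // Same pattern as above; the # comments require IgnorePatternWhitespace.
    private static readonly Regex Pattern = new Regex(@"
        ^                        # from beginning of line
        (?:\(?)                  # optional opening parenthesis
        (?<AreaCode>\d{3})       # 3 digit area code
        (?:[\).\s]?)             # optional closing parenthesis, period or space
        (?<Prefix>\d{3})         # prefix
        (?:[-\.\s]?)             # optional dash, period or space
        (?<Suffix>\d{4})         # suffix
        (?!\d)                   # fail if an eleventh digit follows",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the ten digits, or null when the input does not match.
    public static string Normalize(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        return m.Groups["AreaCode"].Value
             + m.Groups["Prefix"].Value
             + m.Groups["Suffix"].Value;
    }
}
```

Calling PhoneParser.Normalize on any of the accepted examples above gives back the same ten digits regardless of the filler characters, which is the point of capturing the three parts separately.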
Different variants of the same pattern
Pattern without comments, which no longer needs IgnorePatternWhitespace:
^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)
Pattern when not using named captures:
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)
Pattern if the ExplicitCapture option is used:
^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)
It doesn't always match, but it will match any string that contains three digits, followed by a hyphen, followed by four more digits. It will also match if there's something that looks like an area code on the front of that. So this is valid according to your regex:
%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%
To validate that the string contains a phone number and nothing else, you need to add anchors at the beginning and end of the regex:
@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$"
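To make the difference concrete, here is a small sketch that runs the question's pattern with and without the anchors. The AnchorDemo class and its method names are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // The original pattern: succeeds if a phone number occurs anywhere in the input.
    private static readonly Regex Unanchored =
        new Regex(@"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");

    // The same pattern with ^ and $: the whole input must be a phone number.
    private static readonly Regex Anchored =
        new Regex(@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$");

    public static bool ContainsPhone(string input) => Unanchored.IsMatch(input);

    public static bool IsPhone(string input) => Anchored.IsMatch(input);
}
```

With the padded input %%%(999)123-4567%%%, ContainsPhone succeeds while IsPhone does not; both succeed for (999) 123-4567 on its own.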
Alan Moore did a good job explaining what your expression is actually doing. +1
If you want to match exactly "(XXX) XXX-XXXX" and absolutely nothing else, then what you want is
@"^\(\d{3}\) \d{3}-\d{4}$"
Here is the C# code I use. It is designed to get all phone numbers from a page of text. It works for the following patterns: 0123456789, 012-345-6789, (012)-345-6789, (012)3456789, 012 3456789, 012 345 6789, 012 345-6789, (012) 345-6789, 012.345.6789
List<string> phoneList = new List<string>();
Regex rg = new Regex(@"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
MatchCollection m = rg.Matches(html);
foreach (Match g in m)
{
    if (g.Groups[0].Value.Length > 0)
        phoneList.Add(g.Groups[0].Value);
}
None of the comments above takes care of international numbers like +33 6 87 17 00 11 (which is a valid phone number for France, for example).
I would use a two-step approach:
1. Remove all characters that are not digits or the '+' character.
2. Check that the + sign is either at the beginning or not there at all. Check the length (this can be very hard, as it depends on each country's numbering scheme).
Now if your number starts with +1, or you are sure the user is in the USA, you can apply the comments above.
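The two steps could be sketched roughly as follows. The class name, the method name and the 7 to 15 digit bounds are assumptions made for this illustration: 15 is the E.164 maximum, while the lower bound is arbitrary and would need adjusting per country.

```csharp
using System.Text.RegularExpressions;

public static class InternationalPhone
{
    // Assumed bounds: 15 is the E.164 maximum; 7 is an arbitrary lower limit for this sketch.
    private const int MinDigits = 7;
    private const int MaxDigits = 15;

    // Returns the cleaned number (digits with an optional leading +), or null if it looks invalid.
    public static string Clean(string raw)
    {
        // Step 1: drop everything that is not an ASCII digit or '+'.
        string cleaned = Regex.Replace(raw, "[^0-9+]", "");

        // Step 2: a '+' is only allowed as the very first character.
        if (cleaned.LastIndexOf('+') > 0) return null;

        bool hasPlus = cleaned.Length > 0 && cleaned[0] == '+';
        int digitCount = hasPlus ? cleaned.Length - 1 : cleaned.Length;
        return digitCount >= MinDigits && digitCount <= MaxDigits ? cleaned : null;
    }
}
```

The French example above comes out as +33687170011, and a US-style number such as (555) 555-5555 comes out as ten plain digits that can then be checked with the stricter patterns.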
Related
I am sure this has been asked before, but I cannot find the appropriate question(s).
Being new to C#'s Regex, I want to mimic what is possible e.g. with sed and awk, where I would write s/_(20[0-9]{2})[.0-9]{1}/\1/g to obtain a 4-digit year after 2000 that has an underscore as a prefix and a digit or a dot afterwards. The \1 refers to the value within the parentheses.
Example: both fx_201902.csv and fx_2019.csv should give me back myYear=2019. I was not successful with:
string myYear = Regex.Replace(Path.GetFileName(x), @"_20([0-9]{2})[.0-9]{1}", "\1");
How do I have to escape? Or is this kind of replacement not possible? If so, how would I do that?
Edit: My issue is how to do the \1 in C#, in other words how to extract a regex variable. Please forgive the typos in the original post; I am trying the new SO app and submitted earlier than intended.
I'd suggest more robust regex: _(20(?:0[1-9]|[1-9][0-9]))[\d.]
Explanation:
_ - match _ literally
(...) - first capturing group
20 - match 20 literally
(?:...) - non-capturing group
0[1-9]|[1-9][0-9] - alternation: match 0 and a digit other than 0 OR match a digit other than zero followed by any digit - this allows you to match ANY year after 2000
[\d.] - match dot or digit
And below is how you use capturing groups:
var regex = new Regex(@"_(20(?:0[1-9]|[1-9][0-9]))[\d.]");
regex.Match("fx_201902.csv").Groups[1].Value;
// "2019"
regex.Match("fx_20190.csv").Groups[1].Value;
// "2019"
regex.Match("fx_2019.csv").Groups[1].Value;
// "2019"
To extract the year using Regex.Replace, you need to capture only the year part of the string into a group and replace the entire string with just the capture group. That means you need to also match the characters before and after the year using (for example)
^.*_(20[0-9]{2})[.0-9].*$
That can then be replaced with $1 e.g.
Regex r = new Regex(@"^.*_(20[0-9]{2})[.0-9].*$");
string filename = "fx_201902.csv";
string myYear = r.Replace(filename, "$1");
Console.WriteLine(myYear);
filename = "fx_2019.csv";
myYear = r.Replace(filename, "$1");
Console.WriteLine(myYear);
Output:
2019
2019
If you want to exclude the year 2000 from your match, change the regex to
^.*_(20(?:0[1-9]|[1-9][0-9]))[.0-9].*$
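One thing to keep in mind with the Replace approach: when the pattern does not match at all, Regex.Replace returns the input unchanged, so a file name without a year comes back as-is rather than as an empty result. A minimal sketch that uses Regex.Match instead, so the no-match case is explicit, might look like this (YearExtractor and GetYear are hypothetical names):

```csharp
using System.Text.RegularExpressions;

public static class YearExtractor
{
    // Returns the 4-digit year, or null when the file name has no _20xx part.
    public static string GetYear(string fileName)
    {
        // Group 1 holds the year; the trailing class requires a digit or dot after it.
        Match m = Regex.Match(fileName, @"_(20[0-9]{2})[.0-9]");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Here no anchors or surrounding .* are needed, because only the captured group is read and the rest of the file name is ignored.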
You might use a capturing group for the first 4 digits and match what is before and after the 4 digits.
.*_(20[0-9]{2})[0-9]*\.\w+$
Explanation
.*_ Match the last underscore
(20[0-9]{2}) Match 20 and 2 digits
[0-9]*\. Match 0 or more occurrences of a digit followed by a dot
\w+$ Match 1 or more word chars till the end of the string.
In the replacement use:
$1
For example
string[] strings = {"fx_2019.csv", "fx_201902.csv"};
foreach (string s in strings)
{
    string myYear = Regex.Replace(s, @".*_(20[0-9]{2})[0-9]*\.\w+$", "$1");
    Console.WriteLine(myYear);
}
Output
2019
2019
Your second example does not contain the month's digits. If you still want to capture them, make that part optional:
Regex.Replace(Path.GetFileName(x), @"_20([1-9]{2})([.0-9]{2})?", "\1")
Note that I only added 3 characters to your query: (, ) and ?
To get the expected return value, change the replacement from \1 to $1 as documented (with the parentheses in the right place), and capture 2020, 2030, etc. (still excluding 2000) by using the alternation operator | with a combination of [0-9]{1} and [1-9]{1}:
Regex.Replace(Path.GetFileName(x), @"_(20(([1-9]{1})([0-9]{1})|([0-9]{1})([1-9]{1})))([.0-9]{2})?", "$1")
It is worth mentioning that $2 matches the last two digits of the year (i.e. either [1-9]{1}[0-9]{1} or [0-9]{1}[1-9]{1}), while $3 and $4 (or $5 and $6, when the second alternative matched) hold its second-to-last and last digit.
How can I match 4 characters, then skip one character (unknown to me; it could be anything, such as a Chinese or other special character), then match 4 characters again, skip one unknown character again, and so on?
My check string: 1234 4567 7891 0934
This is a 16-digit string, with each group of 4 characters separated by a space.
Main string:
"ACCOUNT NUMBER NAME STATEMENT DATE PAYMENT DUE DATE 1234 4567 7891 0934 Jane Doe 01/01/2009 02/26/09 CREDIT LIMIT CREDIT AVAILABLE NEW BALANCE MINIMUM PAYMENT DUE ."
The text above (the main string) comes from a PDF document and was extracted by an OCR engine.
The main string contains my check string, but its groups are separated by some unknown character instead of a space. I tried replacing spaces with # in Visual Studio's Immediate window, but the separators inside the check string were not replaced. So I can say it is a non-ASCII character that merely looks like a space.
I was able to work around the issue with the code below:
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
But I would like to know a regex solution.
Even when a non-ASCII character occurs, the regex should still match, so that I can check whether the number exists.
If you aren't sure whether the character between the groups of 4 digits is a space, you can use ., which matches any character (except a newline), to match the groups of 4 digits separated by the unknown character:
\d{4}.\d{4}.\d{4}.\d{4}
If you want to access the groups of 4 digits, put each of them in a capturing group and read all four from this regex:
(\d{4}).(\d{4}).(\d{4}).(\d{4})
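As an illustration, the grouped pattern could be applied like this to rebuild the number with plain spaces. The class and method names are made up, and the non-breaking space (U+00A0) mentioned in the comment is only an assumed stand-in for whatever character the OCR engine actually produced.

```csharp
using System.Text.RegularExpressions;

public static class OcrAccountNumber
{
    // Each . accepts exactly one separator character of any kind (except a newline).
    private static readonly Regex Pattern =
        new Regex(@"(\d{4}).(\d{4}).(\d{4}).(\d{4})");

    // Returns the four groups joined by ordinary spaces, or null when nothing matches.
    // Example input (assumed): "... DUE DATE 1234\u00A04567\u00A07891\u00A00934 Jane Doe ..."
    public static string Extract(string text)
    {
        Match m = Pattern.Match(text);
        if (!m.Success) return null;
        return string.Join(" ",
            m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value);
    }
}
```

The result can then be compared directly with the check string 1234 4567 7891 0934, whatever the separator in the source text was.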
Let me know if any part of your query remains unresolved.
First time posting, so please forgive the formatting. I'm not really a programmer; I work in C# with the Revit and AutoCAD APIs. That is worth noting because the Revit API is a bit of a mess, so the same code may produce different results in a different API.
I have three basic string patterns from which I want to return certain numbers, depending on their prefix and suffix. They could be surrounded by other text than what I show, and the actual numbers and their positions within the string may vary.
String 1: (12) #4x2'-0 @ 6 EF
String 2: (12) #4 @ 6 EF
String 3: STAGGER 2'-0, SPCG AT 6 AT 12 SLAB
The code I'm using:
if (LengthAsString.IsMatch(remarkdata) == true)
{
    Regex remarklength = new Regex(@"insertRegexPatternHere");
    if (remarklength.IsMatch(remarkdata))
    {
        remarkdata = remarklength.Replace(remarkdata, "${0}\u0022");
    }
}
remarkdata is one of the strings from above, and I'm adding inch marks " after each number match.
The patterns I've tested and what they return:
Pattern                    String 1    String 2    String 3
\d+(?!['-]|([(\d+)]))      0,6         4,6         0,6,12
(?<![#])\d+                12,2,0,6    12,6        2,9,6,12
\d+(?= @)|(?<=@ )\d+       0,6         6           no matches
Expected results           0,6         6           0,6,12
So I'm close, but no cigar. Thoughts?
Double edit: I'm looking for the numbers that are neither preceded by # nor between (). Ignore @ and x; they may or may not be there.
You seem to be looking for
(?<!#)\d+(?!.*(?:['-]|[#x]\d))
Details
(?<!#) - a negative lookbehind that fails the match if there is a # immediately to the left of the current location
\d+ - 1 or more digits (or [0-9]+ to only match ASCII digits)
(?!.*(?:['-]|[#x]\d)) - a negative lookahead that fails the match if, to the right of the current location, there are any 0+ chars other than newline (.*) followed by ', -, or a #/x that is itself followed by a digit.
Note that if your strings always have balanced, non-nested parentheses, and you may have (123) substrings after # or x1, you may also want to add [^()]*\) to the lookahead:
(?<!#)\d+(?!.*(?:['-]|[#x]\d)|[^()]*\))
to avoid matching digits inside the parentheses.
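As a sketch of how the first pattern plugs into the replacement from the question, the helper below appends an inch mark to every match. The InchMarks class is a made-up name, and the sample strings in the comments are written with @ as the spacing marker, which is an assumption about the original text.

```csharp
using System.Text.RegularExpressions;

public static class InchMarks
{
    // Digits not preceded by #, and with no ', - or #/x+digit anywhere later on the line.
    private static readonly Regex Pattern =
        new Regex(@"(?<!#)\d+(?!.*(?:['-]|[#x]\d))");

    // ${0} is the whole match; a double quote is appended as the inch mark.
    //   (12) #4x2'-0 @ 6 EF  becomes  (12) #4x2'-0" @ 6" EF
    //   (12) #4 @ 6 EF       becomes  (12) #4 @ 6" EF
    public static string Add(string remark) => Pattern.Replace(remark, "${0}\"");
}
```

Because Replace leaves the input untouched when nothing matches, the separate IsMatch check in the question's code is optional.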
I need a regular expression to find groups of exactly 8 numbers in a row. The closest I have gotten is:
[0-9]{8}
but it's not exactly what I need. If I had a number that was 9 long it will match the first 8 but I want it to ignore it if it's longer or shorter than 8.
Here are some examples
1234567890 <- no match, it's longer than 8
12345678 <- match: "12345678"
1234567809876543 <- match 1: "12345678", match 2: "09876543" (two groups of 8)
,,111-11-1234,12345678, <- match: "12345678"
To summarize, for every group of exactly 8 numbers make a match.
I'm working with some results of OCR (Optical Character Recognition) and I have to work with the shortcomings of the results so my input can be varied as in the examples above.
Here is some use case data: http://pastebin.com/uijF9K9n
You can use the following regex in .NET:
(?<=^|\D|(?:\d{8})+)\d{8}(?=$|\D|(?:\d{8})+)
It is based on variable-width lookbehind and a lookahead.
Regex breakdown:
(?<=^|\D|(?:\d{8})+) - only if at the string start (^), or preceded with not a digit (\D) or 1 or more sequences of 8 digits ((?:\d{8})+)...
\d{8} - match 8 digits that are followed by...
(?=$|\D|(?:\d{8})+) - either end of string ($) or not a digit (\D) or 1 or more sequences of 8 digits ((?:\d{8})+).
IMPORTANT:
If I got a downvote for the "extra" complexity compared with another answer, note our solutions are different: my regex matches 8-digit number in ID12345678, and the other one does not due to the word boundary.
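For reference, a small sketch that runs this pattern over the inputs from the question. EightDigitFinder is a hypothetical wrapper; it relies on .NET's support for variable-width lookbehind, as noted above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class EightDigitFinder
{
    private static readonly Regex Pattern =
        new Regex(@"(?<=^|\D|(?:\d{8})+)\d{8}(?=$|\D|(?:\d{8})+)");

    // Returns every 8-digit group found in the input, in order of appearance.
    public static string[] Find(string input) =>
        Pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
}
```

On the question's samples this gives no match for 1234567890, one match for 12345678, two matches for 1234567809876543, and one match for both ,,111-11-1234,12345678, and ID12345678.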
You can also try this regex
(?:\b|\G)\d{8}(?=(?:\d{8})*\b)
(?:\b|\G) - \b matches a word boundary, or (|) \G continues where the last match attempt ended
\d{8} - matches 8 digits [0-9], followed by a lookahead (?=...) that checks
(?:\d{8})*\b - whether any number of {8 digits} groups follow, up to another word boundary
It will match {8 digits}, alone or out of a sequence of such groups, if the sequence sits between two word boundaries.
\b[0-9]{8}\b will give you what you want.
For more details check this out
http://www.rexegg.com/regex-boundaries.html
I'm looking for a regex pattern that matches a number with a length of exactly x digits (say x is 2-4) and nothing else.
Examples:
"foo.bar 123 456789", "foo.bar 456789 123", " 123", "foo.bar123 " have to match only "123"
So: only digits; no spaces, letters or other stuff.
How do I do it?
EDIT: I want to use the Regex.Matches() function in C# to extract this 2-4 digit number and use it in additional code.
Any pattern followed by {m,n} is allowed to occur m to n times, so in your case use \d{m,n} with the required values of m and n. If the count has to be exactly one integer m, use \d{m}.
If you want to match 123 in x123y and not in 1234, use \d{3}(?=\D|$)(?<=(\D|^)\d{3})
It has a lookahead to ensure that the character following the 3 digits is a non-digit or nothing at all, and a lookbehind to ensure that the character before the 3 digits is a non-digit or nothing at all.
You can achieve this with a basic regex:
\b(\d\d\d)\b or \b(\d{3})\b - matches a number with exactly 3 digits
If you want a variable number of digits: \b(\d{2,4})\b
If you want to capture matches next to words: \D(\d{2,4})\D
\b is a word boundary (it does not consume any characters; it is a zero-width assertion)
\d matches only digits
\D matches any character that is NOT a digit
() captures whatever the enclosed pattern matches
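Since the question wants to use Regex.Matches(), here is a hedged sketch comparing two ways to bound the digits. The digit lookarounds (?<!\d) and (?!\d) are a swapped-in idiom in the spirit of the lookaround answer, not a pattern quoted from any answer, and the class and method names are invented.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ShortNumberFinder
{
    // A run of 2-4 digits with no digit directly before or after it.
    private static readonly Regex ByDigitLookaround =
        new Regex(@"(?<!\d)\d{2,4}(?!\d)");

    // The word-boundary version: letters next to the digits prevent a match.
    private static readonly Regex ByWordBoundary =
        new Regex(@"\b\d{2,4}\b");

    public static string[] Find(string input) =>
        ByDigitLookaround.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    public static string[] FindWithWordBoundary(string input) =>
        ByWordBoundary.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
}
```

The difference shows on the question's "foo.bar123 " example: the lookaround version finds 123, while the \b version finds nothing, because there is no word boundary between r and 1.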