Regex replace with bracket variable in C# - c#

I am sure that has been asked before, but I cannot find the appropriate question(s).
Being new to C#'s Regex, I want to mimic what is possible with e.g. sed and awk, where I would write s/_(20[0-9]{2})[.0-9]{1}/\1/g in order to obtain a 4-digit year number after 2000 which has an underscore as prefix and a digit or a dot afterwards. The \1 refers to the value within brackets.
Example: Both file names fx_201902.csv and fx_2019.csv should give me back myYear=2019. I was not successful with:
string myYear = Regex.Replace(Path.GetFileName(x), @"_20([0-9]{2})[.0-9]{1}", "\1")
How do I have to escape? Or is this kind of replacement not possible? If so, how would I do that?
Edit: My issue is how to do the \1 in C#, in other words how to extract a regex variable. Please forgive the typos in the original post - I am trying the new SO app and I submitted earlier than intended.

I'd suggest a more robust regex: _(20(?:0[1-9]|[1-9][0-9]))[\d.]
Explanation:
_ - match _ literally
(...) - first capturing group
20 - match 20 literally
(?:...) - non-capturing group
0[1-9]|[1-9][0-9] - alternation: match 0 followed by a digit other than 0, OR match a digit other than 0 followed by any digit - this allows you to match any year from 2001 to 2099
[\d.] - match dot or digit
And below is how you use capturing groups:
var regex = new Regex(@"_(20(?:0[1-9]|[1-9][0-9]))[\d.]");
regex.Match("fx_201902.csv").Groups[1].Value;
// "2019"
regex.Match("fx_20190.csv").Groups[1].Value;
// "2019"
regex.Match("fx_2019.csv").Groups[1].Value;
// "2019"

To extract the year using Regex.Replace, you need to capture only the year part of the string into a group and replace the entire string with just the capture group. That means you need to also match the characters before and after the year using (for example)
^.*_(20[0-9]{2})[.0-9].*$
That can then be replaced with $1 e.g.
Regex r = new Regex(@"^.*_(20[0-9]{2})[.0-9].*$");
string filename = "fx_201902.csv";
string myYear = r.Replace(filename, "$1");
Console.WriteLine(myYear);
filename = "fx_2019.csv";
myYear = r.Replace(filename, "$1");
Console.WriteLine(myYear);
Output:
2019
2019
If you want to exclude the year 2000 from your match, change the regex to
^.*_(20(?:0[1-9]|[1-9][0-9]))[.0-9].*$
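One caveat with the Replace approach: when the pattern does not match at all, Regex.Replace returns the input unchanged, so a file name without a year would come back as-is. A minimal sketch that guards against this by using Match and checking Success instead; the class and method names are hypothetical, and it uses only the core of the pattern above:

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class YearExtractor
{
    // Core of the pattern above: underscore, a year 20xx, then a digit or a dot.
    private static readonly Regex YearPattern =
        new Regex(@"_(20[0-9]{2})[.0-9]");

    // Returns the 4-digit year, or null when the file name contains none.
    public static string ExtractYear(string path)
    {
        Match m = YearPattern.Match(Path.GetFileName(path));
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling YearExtractor.ExtractYear with "fx_201902.csv" or "fx_2019.csv" gives "2019" in both cases, while a name such as "readme.txt" gives null rather than the original string.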

You might use a capturing group for the first 4 digits and match what is before and after the 4 digits.
.*_(20[0-9]{2})[0-9]*\.\w+$
Explanation
.*_ Match any chars up to and including the last underscore
(20[0-9]{2}) Match 20 and 2 digits
[0-9]*\. Match 0 or more occurrences of a digit followed by a dot
\w+$ Match 1 or more word chars till the end of the string.
In the replacement use:
$1
For example
string[] strings = {"fx_2019.csv", "fx_201902.csv"};
foreach (string s in strings)
{
    string myYear = Regex.Replace(s, @".*_(20[0-9]{2})[0-9]*\.\w+$", "$1");
Console.WriteLine(myYear);
}
Output
2019
2019

Your second example does not contain the month's digits. If you still want to match it, make that part optional:
Regex.Replace(Path.GetFileName(x), @"_20([1-9]{2})([.0-9]{2})?", "\1")
Note that I only added 3 characters to your pattern: (, ) and ?
To get the expected return value, change the replacement from \1 to $1 as documented (with the parentheses placed around the full year), and capture 2020, 2030, etc. (still excluding 2000) by using the alternation operator | with a combination of [0-9]{1} and [1-9]{1}:
Regex.Replace(Path.GetFileName(x), @"_(20(([1-9]{1})([0-9]{1})|([0-9]{1})([1-9]{1})))([.0-9]{2})?", "$1")
It is worth mentioning that $2 holds the last 2 digits of the year (the combination [1-9]{1}[0-9]{1} | [0-9]{1}[1-9]{1}). When the first alternative matches, $3 and $4 hold the second-to-last and the last digit; when the second alternative matches, $5 and $6 hold them instead.

Related

Get string after the last comma or the last number using Regex in C#

How can I get the string after the last comma or last number using regex for these examples:
"Flat 1, Asker Horse Sports" -- get string after "," result: "Asker Horse Sports"
"9 Walkers Barn" -- get string after "9" result: "Walkers Barn"
I need that regex to support both cases, or two different regex rules, one for each case.
I tried /,[^,]*$/ and (.*),[^,]*$ to get the strings after the last comma but no luck.
You can use
[^,\d\s][^,\d]*$
Details
[^,\d\s] - any char but a comma, digit and whitespace
[^,\d]* - any char but a comma and digit
$ - end of string.
In C#, you may also tell the regex engine to search for the match from the end of the string with the RegexOptions.RightToLeft option (to make regex matching more efficient, although it might not be necessary in this case if the input strings are short):
var output = Regex.Match(text, @"[^,\d\s][^,\d]*$", RegexOptions.RightToLeft)?.Value;
You were on the right track with the capture group in (.*),[^,]*$, but the group should be the part that you are looking for.
If there has to be a comma or digit present, you could match until the last occurrence of either of them, and capture what follows in the capturing group.
^.*[\d,]\s*(.+)$
^ Start of string
.* Match any char except a newline 0+ times
[\d,] Match either , or a digit
\s* Match 0+ whitespace chars
(.+) Capture group 1, match any char except a newline 1+ times
$ End of string
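As a rough illustration of how the capture group could be read in C#, here is a small sketch. The class and method names are made up for this example, and returning the input unchanged when there is no comma or digit is an assumption about the desired fallback, not something the question specifies:

```csharp
using System.Text.RegularExpressions;

public static class TailExtractor
{
    // Group 1 holds everything after the last comma or digit (and any whitespace).
    private static readonly Regex Tail = new Regex(@"^.*[\d,]\s*(.+)$");

    // Returns the trailing text, or the input unchanged if the pattern does not match
    // (assumed fallback for inputs without a comma or digit).
    public static string AfterLastCommaOrDigit(string input)
    {
        Match m = Tail.Match(input);
        return m.Success ? m.Groups[1].Value : input;
    }
}
```

With the two inputs from the question, this should return "Asker Horse Sports" and "Walkers Barn" respectively.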

Regex to take first set after Space and want to remove $ with same regex

My input string:-
" $440,765.12 12-108(e)\n3 "
Output String i want as:-
"440,765.12"
I have tried the regex below and it works, but I am not able to remove the $ with the same regex. Does anyone know how to do that with the regex below?
Regex rx = new Regex(@"^(.*? .*?) ");
var match = rx.Match(" $440,765.12 12-108(e)\n3 ");
var text = match.Groups[1].Value;
output after using above regex:-
$440,765.12
I know I can do the same task using string.replace function but I want to do the same with regex only.
You may use
var result = Regex.Match(s, @"\$(\d[\d,.]*)")?.Groups[1].Value;
Details
\$ - matches a $ char
(\d[\d,.]*) - captures into Group 1 ($1) a digit and then any 0 or more digits, , or . chars.
If you want a more "precise" pattern (just in case the match may appear within some irrelevant dots or commas), you may use
\$(\d{1,3}(?:,\d{3})*(?:\.\d+)?)
Here, \d{1,3}(?:,\d{3})*(?:\.\d+)? matches 1, 2 or 3 digits followed by 0 or more repetitions of , and 3 digits, followed by an optional sequence of a . char and 1 or more digits.
Also, if there can be any currency symbol other than $, replace \$ with the \p{Sc} Unicode category class, which matches any currency symbol:
\p{Sc}(\d{1,3}(?:,\d{3})*(?:\.\d+)?)
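A short sketch of how the \p{Sc} variant might be wrapped in C#. The class and method names are hypothetical, and returning null when no amount is found is a choice made for this example only:

```csharp
using System.Text.RegularExpressions;

public static class AmountExtractor
{
    // \p{Sc} matches any Unicode currency symbol; group 1 is the amount itself.
    private static readonly Regex Amount =
        new Regex(@"\p{Sc}(\d{1,3}(?:,\d{3})*(?:\.\d+)?)");

    // Returns the first amount without its currency symbol, or null if none is found.
    public static string FirstAmount(string s)
    {
        Match m = Amount.Match(s);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the input string from the question, FirstAmount should give 440,765.12, and the same method also accepts other currency symbols such as the euro sign.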

How to extract numbers from a string using regular expressions?

This little challenge just screams regular expressions to me, but so far I am stumped.
I have an arbitrary string that contains two numbers embedded in it. I need to extract those two numbers, which will be n and m digits long (n,m are unknown in advance). The format of the string is always
FixedWord[n digits]anotherfixedword[m digits]alotmorestuffontheend
The first number is of the format 1.2.3.4 (the number of digits varying) eg 5.3.20 or 5.3.10.1 or 5.4.
and the second is a simpler 'm' digits (eg 25 or 2)
eg "AppName5.2.6dbVer44Oracle.Group"
It shouts 'pattern matching' and hence "extraction using regexes". Can anyone guide me further?
TIA
The following pattern:
(\d+(?>\.\d+)*)\w+?(\d+)
Will match this:
AppName5.2.6dbVer44Oracle.Group
       \__________/    <-- match
       \___/     \/    <-- captures
And will capture the two values you're interested in in capture groups.
Use it like this:
var match = Regex.Match(input, @"(\d+(?>\.\d+)*)\w+?(\d+)");
if (match.Success)
{
var first = match.Groups[1].Value;
var second = match.Groups[2].Value;
// ...
}
Pattern explanation:
( # Start of group 1
\d+ # a series of digits
(?> # start of atomic group
\.\d+ # dot followed by digits
)* # .. 0 to n times
)
\w+? # some word characters (as few as possible)
(\d+) # a series of digits captured in group 2
Try this:
\w*?([\d.]+)\w*?(\d{1,4}).*
You could start from the following:
^[a-zA-Z]+((?:\d+\.)+\d+)[a-zA-Z]+(\d+).*$
I assumed that the fixed words are just made of letters and that you want to match the entire string. If you prefer, you could substitute the parts not in parentheses with the actual fixed words or change the character sets as desired. I recommend using a tool like https://regex101.com to fine-tune the expression.
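For illustration, substituting the actual fixed words could look like the sketch below. It assumes the fixed words are literally "AppName" and "dbVer", as in the example string; the class name, method name and group names are invented for this example:

```csharp
using System.Text.RegularExpressions;

public static class VersionInfoParser
{
    // "AppName" and "dbVer" are assumed to be the fixed words (taken from the example).
    private static readonly Regex Pattern =
        new Regex(@"^AppName(?<app>\d+(?:\.\d+)*)dbVer(?<db>\d+)");

    // Returns true when the input has the expected shape and the second number fits in an int.
    public static bool TryParse(string input, out string appVersion, out int dbVersion)
    {
        appVersion = null;
        dbVersion = 0;
        Match m = Pattern.Match(input);
        if (!m.Success) return false;
        appVersion = m.Groups["app"].Value;
        return int.TryParse(m.Groups["db"].Value, out dbVersion);
    }
}
```

Anchoring on the fixed words means a string with a different prefix is rejected instead of yielding whatever digits happen to appear in it.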
Keep it basic by specifying a capture group ( ) that looks for a digit \d, then zero or more * digits or periods in a set [\d.] (the set is \d -or- a literal period):
var data = "AppName5.2.6dbVer44Oracle.Group";
var pattern = @"(\d[\d.]*)";
// Outputs:
// 5.2.6
// 44
Console.WriteLine(string.Join(Environment.NewLine,
    Regex.Matches(data, pattern)
         .OfType<Match>()
         .Select(mt => mt.Groups[1].Value)));
Each match will be a number within the string. So if the count of numbers changes, the pattern will not fail and will dutifully report 1 to N numbers.
Simply look for the numbers, since you only care for the numbers and don't want to check the syntax of the whole input string.
MatchCollection matches = Regex.Matches(input, @"\d+(\.\d+)*");
if (matches.Count >= 2) {
string number1 = matches[0].Value;
string number2 = matches[1].Value;
} else {
// Less than two numbers found
}
The expression \d+(\.\d+)* means:
\d+ one or more digits.
( )* repeat zero, one or more times.
\.\d+ one decimal point (escaped with \) followed by one or more digits.
and
\d one digit.
( ) grouping.
+ repeat the expression to the left one or more times.
* repeat the expression to the left zero, one or more times.
\ escapes characters that have a special meaning in regex.
. any character (without escaping).
\. period character (".").

Regular expression for updating version number

I have version numbers as given below.
020. 000. 1234. 43567 (please note the whitespace after the dot(.))
020,000,1234,43567
20.0.1234.43567
20,0,1234,43567
I want a regular expression for updating the numbers after the last two dots (.) to, for example, 1298 and 45678 (any number)
020. 000. 1298. 43568 (please note the whitespace after the dot(.))
020,000,1298,45678
20.0.1298.45678
20,0,1298,45678
Thanks,
resultString = Regex.Replace(subjectString,
    @"(\d+)     # any number
    ([.,]\s*)   # dot or comma, optional whitespace
    (\d+)       # etc.
    ([.,]\s*)
    \d+
    ([.,]\s*)
    \d+",
    "$1$2$3${4}1298${5}43568", RegexOptions.IgnorePatternWhitespace);
Note the ${4} instead of $4 because otherwise the following 1 would be interpreted as belonging to the group number ($41).
Also note the difference between (\d+) and (\d)+. While both match 1234, the first one will capture 1234 into the group created by the parentheses. The second one will capture only 4 because the previous captures will be overwritten by the next.
To replace the last two parts of the version with 1298 and 43568:
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+(?<seperator>[.,]\s*)\d+$");
regex.Replace(source, "1298${seperator}43568");
This is because
(?<=) doesn't include the group in the match but requires it to exist before the match
^ match start of string
(?:\d+[.,]\s*) non-capturing group, match at least one digit followed by a . or , followed by 0 or more whitespace chars
{2} previous group should occur twice
\d+ first part of the match, 1 or more digits
(?<seperator>[.,]\s*) get the separator, a . or , followed by optional whitespace, into a named capture group called seperator
\d+ match one or more digits
$ match end of string
In the replacement string you are just providing the replacement version and using ${seperator} to insert the original separator.
If you are not bothered about preserving the separator you can just do
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+[.,]\s*\d+$");
regex.Replace(source, "1298.43568");
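To show the separator-preserving variant with arbitrary numbers rather than hard-coded ones, here is a small sketch. The class, method and parameter names (build, revision) are assumptions made for this example, and the named group is spelled separator here:

```csharp
using System.Text.RegularExpressions;

public static class VersionUpdater
{
    // The lookbehind skips the first two components; the named group keeps
    // the original separator between the last two components.
    private static readonly Regex LastTwo = new Regex(
        @"(?<=^(?:\d+[.,]\s*){2})\d+(?<separator>[.,]\s*)\d+$");

    // Replaces the last two components, leaving the first two and all separators as they were.
    public static string SetLastTwo(string version, int build, int revision)
    {
        return LastTwo.Replace(version, build + "${separator}" + revision);
    }
}
```

For example, SetLastTwo("020. 000. 1234. 43567", 1298, 45678) should give 020. 000. 1298. 45678, and the comma-separated forms keep their commas.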

C# Regex Phone Number Check

I have the following to check if the phone number is in the following format
(XXX) XXX-XXXX. The code below always returns true. Not sure why.
Match match = Regex.Match(input, @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}");
// The code below always returns true
if (match.Success) { ....}
if (match.Success) { ....}
The general complaint about regex patterns for phone numbers is that they force the user to type characters that are really optional, such as dashes.
Why can't they be optional and have the pattern not care if they are there or not?
The pattern below makes dashes, periods and parentheses optional for the user and, as a result, focuses on the digits, using named captures.
The pattern is commented (using # and spanning multiple lines), so use the Regex option IgnorePatternWhitespace unless you remove the comments. That flag does not change how the input is matched; it only makes the engine ignore unescaped whitespace in the pattern and allows comments via the # character.
string pattern = @"
^ # From Beginning of line
(?:\(?) # Match but don't capture optional (
(?<AreaCode>\d{3}) # 3 digit area code
(?:[\).\s]?) # Optional ) or . or space
(?<Prefix>\d{3}) # Prefix
(?:[-\.\s]?) # optional - or . or space
(?<Suffix>\d{4}) # Suffix
(?!\d) # Fail if eleventh number found";
The above pattern just looks for 10 digits and ignores filler characters such as a (, a dash -, a space, a tab or even a period. Examples are
(555)555-5555 (OK)
5555555555 (ok)
555 555 5555(ok)
555.555.5555 (ok)
55555555556 (not ok - match failure - too many digits)
123.456.789 (failure)
Different variants of the same pattern
Pattern without comments, which no longer needs IgnorePatternWhitespace:
^(?:\(?)(?<AreaCode>\d{3})(?:[\).\s]?)(?<Prefix>\d{3})(?:[-\.\s]?)(?<Suffix>\d{4})(?!\d)
Pattern when not using named captures:
^(?:\(?)(\d{3})(?:[\).\s]?)(\d{3})(?:[-\.\s]?)(\d{4})(?!\d)
Pattern if the ExplicitCapture option is used:
^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-\.\s]?(?<Suffix>\d{4})(?!\d)
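One possible use of the named captures is to normalize whatever the user typed into a single display format. The sketch below is only an illustration: the class and method names are hypothetical, and the (XXX) XXX-XXXX output format is borrowed from the question, not prescribed by this answer:

```csharp
using System.Text.RegularExpressions;

public static class PhoneNormalizer
{
    // Named-capture pattern from above, with all filler characters optional.
    private static readonly Regex Phone = new Regex(
        @"^\(?(?<AreaCode>\d{3})[\).\s]?(?<Prefix>\d{3})[-.\s]?(?<Suffix>\d{4})(?!\d)");

    // Returns the number as (XXX) XXX-XXXX, or null when the input does not match.
    public static string Normalize(string input)
    {
        Match m = Phone.Match(input);
        if (!m.Success) return null;
        return string.Format("({0}) {1}-{2}",
            m.Groups["AreaCode"].Value,
            m.Groups["Prefix"].Value,
            m.Groups["Suffix"].Value);
    }
}
```

Inputs such as 555.555.5555, 5555555555 and (555)555-5555 should all normalize to (555) 555-5555, while 55555555556 and 123.456.789 yield null.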
It doesn't always match, but it will match any string that contains three digits, followed by a hyphen, followed by four more digits. It will also match if there's something that looks like an area code on the front of that. So this is valid according to your regex:
%%%%%%%%%%%%%%(999)123-4567%%%%%%%%%%%%%%%%%
To validate that the string contains a phone number and nothing else, you need to add anchors at the beginning and end of the regex:
@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$"
Alan Moore did a good job explaining what your expression is actually doing. +1
If you want to match exactly "(XXX) XXX-XXXX" and absolutely nothing else, then what you want is
@"^\(\d{3}\) \d{3}-\d{4}$"
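The difference between the unanchored and the anchored pattern can be checked with a small sketch like the following. The class and method names are hypothetical; the pattern itself is the one from the question:

```csharp
using System.Text.RegularExpressions;

public static class PhoneCheck
{
    // Pattern from the question, without anchors.
    private const string Unanchored = @"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}";

    // True if the input contains something phone-like anywhere.
    public static bool ContainsPhone(string input)
    {
        return Regex.IsMatch(input, Unanchored);
    }

    // True only if the whole input is a phone number (pattern wrapped in ^...$).
    public static bool IsPhone(string input)
    {
        return Regex.IsMatch(input, "^(?:" + Unanchored + ")$");
    }
}
```

For the string %%%(999)123-4567%%%, ContainsPhone is true but IsPhone is false, which is the behaviour the question was missing.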
Here is the C# code I use. It is designed to get all phone numbers from a page of text. It works for the following patterns: 0123456789, 012-345-6789, (012)-345-6789, (012)3456789, 012 3456789, 012 345 6789, 012 345-6789, (012) 345-6789, 012.345.6789
List<string> phoneList = new List<string>();
Regex rg = new Regex(@"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
MatchCollection m = rg.Matches(html);
foreach (Match g in m)
{
if (g.Groups[0].Value.Length > 0)
phoneList.Add(g.Groups[0].Value);
}
None of the comments above takes care of international numbers like +33 6 87 17 00 11 (which is a valid phone number in France, for example).
I would do it in a two-step approach:
1. Remove all characters that are not digits or the '+' character
2. Check that the + sign is either at the beginning or not there at all. Check the length (this can be very hard, as it depends on each country's numbering scheme).
Now if your number starts with +1 or you are sure the user is in the USA, then you can apply the comments above.
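The two steps could be sketched roughly as follows. The names are hypothetical, and the length bounds of 7 to 15 digits are an assumption chosen for this example (15 being the usual E.164 maximum); real validation depends on the country:

```csharp
using System.Text.RegularExpressions;

public static class IntlPhone
{
    // Step 1: drop everything except ASCII digits and '+'.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[^0-9+]", "");
    }

    // Step 2: allow an optional leading '+', then require 7 to 15 digits.
    // The 7..15 bounds are an assumption for this sketch, not a rule from the answer.
    public static bool LooksValid(string input)
    {
        return Regex.IsMatch(Strip(input), @"^\+?[0-9]{7,15}\z");
    }
}
```

With this sketch, +33 6 87 17 00 11 is stripped to +33687170011 and accepted, whereas an input with a + in the middle is rejected.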
