Regex: Give priority to optional pattern - c#

Let's say I have a string like this:
555 3553 666 555
And a regex like this
var pat = new Regex("3?553?");
When the string above is matched with pat.Match(mystring), the result returned will be "55".
I need the result returned to be "3553" if possible, and if not, then only then I want the result to be "55". As in: The 3? is optional and doesn't have to be there, but if it is it will always be matched first.
So this 555 3553 666 555 will return 3553
And this 222 5555 777 will return 55
Is this possible to achieve without using two separate regex definitions?
Thank you.

Regex engines always go through the string from left to right (assuming a left-to-right script). In your case, the first two characters already match the regex, so the engine returns them and stops.
So, instead of stopping after the first match, you need to do all the matches and choose the longest one. However, there is a caveat: Regex matches can't overlap (every character can be matched only once). Therefore, in a string like
55553553
your regex would return 55, 553, and 553.
The solution is to use a lookahead assertion, combined with a capturing group:
var pat = new Regex("(?=(3?553?))");
and get all its matches
var candidates = new List<string>();
foreach (Match match in pat.Matches(subject))
{
    // the lookahead consumes nothing, so the matched text is in group 1
    candidates.Add(match.Groups[1].Value);
}
Then choose the longest match.
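A minimal C# sketch of the lookahead approach described above, assuming that "longest" is decided by string length and that ties go to the leftmost candidate. The class name LongestOverlappingMatch and the method Find are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LongestOverlappingMatch
{
    // The lookahead consumes no characters, so the engine retries at every
    // position and overlapping candidates all end up in capture group 1.
    private static readonly Regex Pat = new Regex("(?=(3?553?))");

    // Returns the longest candidate, or null when the pattern never matches.
    // OrderByDescending is a stable sort, so ties keep the leftmost candidate.
    public static string Find(string subject)
    {
        return Pat.Matches(subject)
                  .Cast<Match>()
                  .Select(m => m.Groups[1].Value)
                  .OrderByDescending(v => v.Length)
                  .FirstOrDefault();
    }

    public static void Main()
    {
        Console.WriteLine(Find("555 3553 666 555")); // 3553
        Console.WriteLine(Find("222 5555 777"));     // 55
        Console.WriteLine(Find("55553553"));         // 3553
    }
}
```

The last call is the overlapping case from the answer: without the lookahead, the 3 at index 4 is consumed by the preceding 553 and 3553 is never seen.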

I think you want to give one alternative priority over the other. If so, the code below may help:
var matches = Regex.Matches(txt, @"(?<G1>3553)|(?<G2>55)").OfType<Match>();
var res = matches
.GroupBy(x => x.Success)
.Select(x =>
new {
Success = x.Key,
G = !string.IsNullOrEmpty(x.Max(w => w.Groups["G1"].Value))
? x.Max(w => w.Groups["G1"].Value)
: x.Max(w => w.Groups["G2"].Value)
})
.SingleOrDefault();

Your regex matches 55 simply because that is the first match it can find. It has nothing to do with priorities.
I think what you want here is the longest match. You should use Matches to get all the matches and pick the longest one by checking Length.
var matches = Regex.Matches("555 3553 666 555", "3?553?");
var longestMatch = matches.Cast<Match>().OrderByDescending(x => x.Value.Length).First().Value;

Related

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?
Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]
I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();
There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find each position where the preceding character differs from the one that follows.
2. It creates a capture group for every match, and the captured text is then fed into the Split result as a delimiter. The capture group is required by the negative lookahead, specifically the \1 backreference, which basically means "the value of the first capture group in the pattern", so it cannot be omitted.
3. Regex.Split, given a pattern with one or more capture groups, includes the captured text in the result for every individual split operation.
Number 3 is why your string array looks weird: Split splits after the last a of the first run, which becomes split[0]. This is followed by the captured delimiter at split[1], and so on.
There is no way to override this behaviour when calling Split.
Either compensating as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();
You could do this easily with LINQ, but I don't think its runtime will be as good as the regex's.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee']
However, this won't work if you have a string like "aabbaa": you'll just get ["aaaa","bb"] instead of ["aa","bb","aa"].
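The GroupBy limitation comes from grouping by character value rather than by position. A minimal sketch that groups only consecutive characters with a plain loop, so "aabbaa" keeps its three runs; the names RunSplitter and SplitRuns are hypothetical, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RunSplitter
{
    // Splits the input into runs of identical consecutive characters.
    public static List<string> SplitRuns(string input)
    {
        var runs = new List<string>();
        var current = new StringBuilder();
        foreach (char c in input)
        {
            // A character that differs from the previous one closes the current run.
            if (current.Length > 0 && current[current.Length - 1] != c)
            {
                runs.Add(current.ToString());
                current.Clear();
            }
            current.Append(c);
        }
        // Flush the final run, if any.
        if (current.Length > 0)
        {
            runs.Add(current.ToString());
        }
        return runs;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SplitRuns("aabbaa")));   // aa,bb,aa
        Console.WriteLine(string.Join(",", SplitRuns("aabbbbcc"))); // aa,bbbb,cc
    }
}
```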

Regex Match multiple occurences with numbers in string C#

I've been searching for an answer to my problem but couldn't find one, so I'm asking here.
I want to take a string, for example "37513220102304920105590",
and find all matches for numbers of length 11 which start with 3 or 4.
I have been trying to do it like this:
string input = "37513220102304920105590";
var regex = new Regex("^[3-4][0-9]{10}$");
var matches = regex.Matches(input);
// I expect it to have 3 occurrences: "37513220102", "32201023049" and "30492010559"
// But my matches are empty.
foreach (Match match in matches)
{
var number = match.Value;
// do stuff
}
My question is: is my regex bad, or am I doing something wrong with the matching?
Use capturing inside a positive lookahead, and you need to remove anchors, too. Note the - between 3 and 4 is redundant.
(?=([34][0-9]{10}))
In C#, since the values are captured, you need to collect .Groups[1].Value contents, see C# code:
var s = "37513220102304920105590";
var result = Regex.Matches(s, @"(?=([34][0-9]{10}))")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();

Regex match not returning group when using wildcard

Why does this work (returns 25):
var match = Regex.Match("Age: 25 yrs.", @"(\d+)");
Console.WriteLine(match.Groups[1].Value);
But this doesn't (returns a blank group):
var match = Regex.Match("Age: 25 yrs.", @"(\d*)");
Console.WriteLine(match.Groups[1].Value);
There must be something fundamental about how .NET handles regular expressions that I'm missing.
The point is that \d* also matches the empty string, and Match returns only the first match. Since an empty string fits at every position, including the very start of the input, the first match is the empty one at index 0.
So if you do this, you get a whole series of matches, most of them empty, with 25 being one of them.
var matches = Regex.Matches("Age: 25 yrs.", @"(\d*)");
foreach (var match in matches.Cast<Match>())
{
Console.WriteLine(match.Index + ":" + match.Value);
}
(\d*) matches zero or more digits, so it is satisfied by the empty string at the very start of the input.
You meant to use (\d+), which requires one or more digits. Note that (\d)+ is not equivalent: its group 1 would hold only the last digit.
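A small sketch that makes the difference between the quantifier variants visible by reporting where the first match sits, how long it is, and what group 1 holds. The helper name Describe is invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyMatchDemo
{
    // Formats the first match as index:length:'group 1 value'.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Index + ":" + m.Length + ":'" + m.Groups[1].Value + "'";
    }

    public static void Main()
    {
        // Zero digits are enough, so the first match is empty and sits at index 0.
        Console.WriteLine(Describe("Age: 25 yrs.", @"(\d*)")); // 0:0:''
        // At least one digit is required, so the engine moves on to "25".
        Console.WriteLine(Describe("Age: 25 yrs.", @"(\d+)")); // 5:2:'25'
        // The group repeats, and Groups[1].Value keeps only the last capture.
        Console.WriteLine(Describe("Age: 25 yrs.", @"(\d)+")); // 5:2:'5'
    }
}
```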

Replace all alphanumeric characters in a string except pattern

I'm trying to obfuscate a string, but need to preserve a couple patterns. Basically, all alphanumeric characters need to be replaced with a single character (say 'X'), but the following (example) patterns need to be preserved (note that each pattern has a single space at the beginning)
QQQ"
RRR"
I've looked through a few samples on negative lookahead/lookbehind, but still haven't had any luck with this (only testing QQQ).
var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
var regex = new Regex(@"((?!QQQ)(?<!\sQ{1,3}))[0-9a-zA-Z]");
var result = regex.Replace(test, "X");
The correct result should be:
"XXXX XXXX XXXXX XXXXX QQQ""XXXXX""XXXXX"XXXXXXXX
This works for an exact match, but will fail with something like ' QQR"', which returns
"XXXX XXXX XXXXX XXXXX XQR""XXXXX""XXXXX"XXXXXXXX
You can use this:
var regex = new Regex(@"((?> QQQ|[^A-Za-z0-9]+)*)[A-Za-z0-9]");
var result = regex.Replace(test, "$1X");
The idea is to match all that must be preserved first and to put it in a capturing group.
Since the target characters are always preceded by zero or more things that must be preserved, you only need to write this capturing group before [A-Za-z0-9]
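The same idea can be extended to several preserved literals. The following is only a sketch: it assumes the literals are plain text (so they are passed through Regex.Escape), that they include the leading space and trailing quote from the question, and that the helper name Obfuscate is made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Obfuscator
{
    // Replaces every ASCII letter or digit with 'X', except inside the preserved literals.
    public static string Obfuscate(string input, params string[] preserved)
    {
        // Preserved literals are tried first (longest first); the last alternative
        // skips over any run of non-alphanumeric characters.
        var alternatives = preserved
            .OrderByDescending(p => p.Length)
            .Select(p => Regex.Escape(p))
            .Concat(new[] { "[^A-Za-z0-9]+" });

        // Group 1 collects everything that must survive in front of the next
        // alphanumeric character; "$1X" writes it back followed by the mask.
        var regex = new Regex("((?>" + string.Join("|", alternatives) + ")*)[A-Za-z0-9]");
        return regex.Replace(input, "$1X");
    }

    public static void Main()
    {
        var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
        Console.WriteLine(Obfuscate(test, " QQQ\"", " RRR\""));
        // "XXXX XXXX XXXXX XXXXX QQQ""XXXXX""XXXXX"XXXXXXXX
    }
}
```

Because the literal now includes the closing quote, an input fragment such as ' QQR"' is not preserved and is masked like any other text.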
Here's a non-regex solution. It works quite nicely, although it fails when a pattern occurs more than once in the input; it would need a better algorithm for finding occurrences. You can compare it with a regex solution on large strings.
public static string ReplaceWithPatterns(this string input, IEnumerable<string> patterns, char replacement)
{
    // IndexOf returns -1 when the pattern is absent; index 0 is a valid hit
    var patternsPositions = patterns.Select(p =>
            new { Pattern = p, Index = input.IndexOf(p) })
        .Where(i => i.Index >= 0)
        .ToList();
    // note: this masks every character, not only the alphanumeric ones
    var result = new string(replacement, input.Length);
    if (!patternsPositions.Any()) // no pattern in the input
        return result;
    foreach (var p in patternsPositions)
        // overwrite the masked span so the overall length stays unchanged
        result = result.Remove(p.Index, p.Pattern.Length).Insert(p.Index, p.Pattern);
    return result;
}

Regex to remove all (non numeric OR period)

I need for text like "joe ($3,004.50)" to be filtered down to 3004.50 but am terrible at regex and can't find a suitable solution. So only numbers and periods should stay - everything else filtered. I use C# and VS.net 2008 framework 3.5
This should do it:
string s = "joe ($3,004.50)";
s = Regex.Replace(s, "[^0-9.]", "");
The regex is:
[^0-9.]
You can cache the regex:
Regex not_num_period = new Regex("[^0-9.]");
then use:
string result = not_num_period.Replace("joe ($3,004.50)", "");
However, you should keep in mind that some cultures have different conventions for writing monetary amounts, such as: 3.004,50.
You are dealing with a string, and a string is an IEnumerable<char>, so you can use LINQ:
var input = "joe ($3,004.50)";
var result = String.Join("", input.Where(c => Char.IsDigit(c) || c == '.'));
Console.WriteLine(result); // 3004.50
For the accepted answer, MatthewGunn raises a valid point in that all digits, commas, and periods in the entire string will be condensed together. This will avoid that:
string s = "joe.smith ($3,004.50)";
Regex r = new Regex(@"(?:^|[^\w.,])(\d[\d,.]+)(?=\W|$)");
Match m = r.Match(s);
string v = null;
if (m.Success) {
v = m.Groups[1].Value;
v = Regex.Replace(v, ",", "");
}
The approach of removing offending characters is potentially problematic. What if there's another . in the string somewhere? It won't be removed, though it should!
Removing non-digits or periods, the string joe.smith ($3,004.50) would transform into the unparseable .3004.50.
Imho, it is better to match a specific pattern, and extract it using a group. Something simple would be to find all contiguous commas, digits, and periods with regexp:
[\d,\.]+
Sample test run:
Pattern understood as:
[\d,\.]+
Enter string to check if matches pattern
> a2.3 fjdfadfj34 34j3424 2,300 adsfa
Group 0 match: "2.3"
Group 0 match: "34"
Group 0 match: "34"
Group 0 match: "3424"
Group 0 match: "2,300"
Then, for each match, remove all commas and send the result to the parser. To handle something like 12.323.344, you could add another check that a matching substring contains at most one period.
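The steps above can be sketched in C# as follows. This assumes that commas are thousands separators and the period is the decimal separator (hence the invariant culture), and the names AmountExtractor and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class AmountExtractor
{
    // Finds runs of digits, commas and periods and parses the plausible ones.
    public static List<decimal> Extract(string input)
    {
        var amounts = new List<decimal>();
        foreach (Match m in Regex.Matches(input, @"[\d,.]+"))
        {
            // Commas are treated as thousands separators and dropped.
            string candidate = m.Value.Replace(",", "");

            // Reject runs such as 12.323.344 that contain more than one period.
            if (candidate.Count(ch => ch == '.') > 1)
            {
                continue;
            }

            // TryParse also filters out runs without any digit, such as a lone ".".
            decimal value;
            if (decimal.TryParse(candidate, NumberStyles.AllowDecimalPoint,
                                 CultureInfo.InvariantCulture, out value))
            {
                amounts.Add(value);
            }
        }
        return amounts;
    }

    public static void Main()
    {
        foreach (decimal amount in Extract("joe.smith ($3,004.50)"))
        {
            Console.WriteLine(amount.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

For input written with other conventions, such as 3.004,50, the separator handling would have to be swapped, so the culture should be an explicit decision rather than a default.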
