C# Regex Anomaly


I'm a bit perplexed here.
I have a regex that is meant to limit a number to two decimal places.
My second and third captures work as expected. But including the 1st capture ($1) corrupts the string and includes all the decimal places (I get the original string).
var t = "553.17765";
var from = @"(\d+)(\.*)(\d{0,2})";
var to = "$1$2$3";
var rd = Regex.Replace(t, from,to);
var r = Regex.Match(t, from);
Why can't I get the 553 in the $1 variable?

What is happening is that your pattern matches the number more than once: first 553.17, then the remaining digits 765. You could work around that by looking for the longest match, but it seems simpler to improve your regex instead:
(\d+\.?\d{0,2})
Steps are as follows
The capture group covers the whole number at once.
Look for digits, greedy match.
Look for a decimal point, either one or none.
Look for zero to two digits
Furthermore, if you want to replace using Regex.Replace you need something to match the rest of the string.
text = Regex.Replace(text, @".*?(\d+\.?\d{0,2}).*", "$1");

Your example does not work because the pattern matches twice. (\d+)(\.*)(\d{0,2}) splits the string 553.17765 as follows:
Match 1: 553.17
$1 = 553
$2 = .
$3 = 17
Replace 553.17 with 553.17
Match 2: 765
$1 = 765
$2 and $3 are empty
Replace 765 with 765
The first match includes, as expected, only two of the decimal places. That completes the match, and the regex engine starts looking for the next one, because Replace replaces all matches, not just the first. As a result, this regex replaces every match with itself and changes nothing.
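To see the two matches for yourself, here is a minimal sketch that lists what the original pattern finds. The variable names are mine, not from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "553.17765";
var pattern = @"(\d+)(\.*)(\d{0,2})";

// Regex.Matches returns every non-overlapping match, the same set Replace works on.
var found = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(" | ", found)); // 553.17 | 765

// Replacing each match with its own groups rebuilds the original string.
var replaced = Regex.Replace(input, pattern, "$1$2$3");
Console.WriteLine(replaced); // 553.17765
```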
Replace finds a match and substitutes the whole match with the replacement pattern, so there is no need to include the surrounding text. The problem is that your regex is too precise: it stops after the first two decimal places, so the match covers only those.
That means whatever you replace it with only replaces 553.17 and nothing more. That is fine for finding decimal numbers, but not for replacing them: there you want to match the whole number with all its decimal places and then replace it.
So a working replace regex would look like this: (\d+\.\d{1,2})\d*. First, there is only one capture group, as we don't intend to reorder any part of the number. Second, the decimal point is required, as we only want to replace numbers that actually have decimal places; for the same reason the group requires at least one and at most two decimal places. Any further decimal places are optional, but \d* consumes them greedily so that the match covers the whole number and it is replaced completely.
Match 1: 553.17765
$1 = 553.17
Replace 553.17765 with 553.17
This regex does not handle thousands-separators btw, if that is required.
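A small sketch of that replace in C#, assuming truncation (not rounding) is what you want. TruncateDecimals is a hypothetical helper name, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps at most two decimal places by truncating, not rounding.
string TruncateDecimals(string text) =>
    Regex.Replace(text, @"(\d+\.\d{1,2})\d*", "$1");

Console.WriteLine(TruncateDecimals("553.17765"));          // 553.17
Console.WriteLine(TruncateDecimals("a 1.5 b 2.999 c 42")); // a 1.5 b 2.99 c 42
```

Numbers without a decimal point (42 above) are left alone because the pattern requires the dot.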

Related

repeat a group of characters

I have the following input to be matched by a regex:
1.1.1.1
1.01.1.1
01.01.091.01
1.10.100.0010
So I always have four groups consisting of digits. The first three lines should match; the last one should not.
So I wrote this regex:
^(\d*[1-9]+\.){4}$
This regex should return all strings in which no group ends with a zero. Put more simply: I don't want to match numbers with trailing zeros.
However, this doesn't match anything. regex101.com tells me this:
A repeated capturing group will only capture the last iteration. Put a
capturing group around the repeated group to capture all iterations or
use a non-capturing group instead if you're not interested in the data
But when I add a further capturing group I get the same message:
^((\d*[1-9]+\.)){4}$
The same applies to a non-capturing group:
^(?:\d*[1-9]+\.){4}$
Of course I could just write the same group four times, but that's fairly clumsy and hard to read.

As mentioned by others the dot is the point, so we have three identical groups and one without the dot.
So this regex does it for me:
(?:\d*[1-9]\.){3}(?:\d*[1-9])

You never specify the dot in your patterns. What you ask for is, in fact, not a repetition of four, it is a specific single pattern of four numbers separated with dots.
^(\d*[1-9]+\.\d*[1-9]+\.\d*[1-9]+\.\d*[1-9]+)$
The only thing in there you could consider a repetition is the "number + dot" part, but then you repeat that three times and add another number. Then the regex would become this:
^((\d*[1-9]+\.){3}\d*[1-9]+)$
However, your third line contains a space at the end, so you may want to add extra checks to trim those off.
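Here is a quick sketch that checks the "three groups with a dot, then one without" idea against the four sample inputs. As written in the question, the pattern requires a dot after every group, including the last one, which is why nothing matched. The anchors and the Trim() call are my additions.

```csharp
using System;
using System.Text.RegularExpressions;

// Three "digits ending in 1-9, then a dot" groups plus a final group without the dot.
var pattern = new Regex(@"^(?:\d*[1-9]\.){3}\d*[1-9]$");

var samples = new[] { "1.1.1.1", "1.01.1.1", "01.01.091.01", "1.10.100.0010" };
foreach (var sample in samples)
{
    // Trim() guards against stray whitespace around an input line.
    Console.WriteLine($"{sample}: {pattern.IsMatch(sample.Trim())}");
}
// The first three inputs match; the last one does not.
```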

The problem with your regex is that, by not including the ., it fails to find four groups of digits, because they always have dots in between.
Try this instead:
(?:(\d*[1-9])\.?){4}

Split credit card number into 4 chunks using Regex lookahead?

I want to chunk a credit card number (in my case I always have 16 digits) into 4 chunks of 4 digits.
I've succeeded doing it via positive look ahead :
var s="4581458245834584";
var t=Regex.Split(s,"(?=(?:....)*$)");
Console.WriteLine(t);
But I don't understand why the result has two padded empty cells, one at the start and one at the end.
I already know that I can use the "Remove Empty Entries" flag, but I'm not after that.
However, if I change the regex to (?=(?:....)+$), then the empty cell at the end disappears but the one at the start remains.
Question
Why does the regex emit empty cells, and how can I fix my regex so it produces 4 chunks in the first place (without having to 'trim' those empty entries)?

But I don't understand why the result is with two padded empty cells:
Let's try breaking down your regex.
Regex: (?=(?:....)*$)
Explanation: a lookahead (?=...) for any four characters (?:....), repeated zero or more times, up to the end of the string. A lookahead consumes nothing, so every match is zero-width.
Since you are using the * quantifier, which means zero or more, it matches a zero-width position at the beginning of the string and also at the end of the string.
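A minimal sketch comparing the two quantifiers, using the string from the question. The variable names star and plus are mine.

```csharp
using System;
using System.Text.RegularExpressions;

var s = "4581458245834584";

// '*' also succeeds with zero groups ahead, so it matches at index 16 (the end) as well as index 0.
var star = Regex.Split(s, "(?=(?:....)*$)");
// '+' needs at least one group ahead, so only the match at index 0 remains as an edge.
var plus = Regex.Split(s, "(?=(?:....)+$)");

Console.WriteLine(string.Join(",", star)); // ,4581,4582,4583,4584,
Console.WriteLine(string.Join(",", plus)); // ,4581,4582,4583,4584
```

Regex.Split includes an empty string whenever a match sits at the very start or end of the input, which is where the padded cells come from.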
So How can I select only those 3 splitters in the middle ?
I don't know C# very well but this 3 step method might work for you.
Search with (\d{4}) and replace with -\1 (written -$1 in .NET). The result will be -4581-4582-4583-4584.
Now remove the leading - by searching with ^- and replacing it with nothing. The result will be 4581-4582-4583-4584.
Finally, split on -.
Alternative Solution Inspired from Royi's answer.
Regex: (?=(?!^)(?:\d{4})+$)
Explanation:
(?= // Look ahead for
(?!^) // Not the start of string
(?:\d{4})+$ // Multiple group of 4 digits till end of string
)
Since nothing is matched and only lookaround assertions are used, it will pinpoint Zero width after a group of 4 digits.

It seems like I've found an answer.
Looking at those split positions, I needed to get rid of the ones at the edges.
So I thought: how can I tell the regex engine "not at the start of the line"?
Which is exactly what (?!^) does.
So here is the new regex :
var s="4581458245834584";
var t=Regex.Split(s,"(?!^)(?=(?:....)+$)");
Console.WriteLine(t);
Result: 4581, 4582, 4583, 4584 (four chunks, no empty entries).

Umm, I don't know WHY you need Regex for this. You just overcomplicate things. Better way is to just split it manually:
var values = new List<int>();
for (int i = 0; i < 4; i++)
{
    // Take four characters at a time and parse them as a number.
    var value = int.Parse(s.Substring(i * 4, 4));
    values.Add(value);
}
Regex solution:
var s = "4581458245834584";
// Requires using System.Linq; each repetition of group 1 is kept in its Captures collection.
var separated = Regex.Match(s, "(.{4}){4}")
    .Groups[1].Captures
    .Cast<Capture>()
    .Select(x => x.Value)
    .ToArray();

It has been mentioned already that the * quantifier also matches at the end of the string, where there are zero groups ahead. To avoid matching at the start and end you can use \B, the non-word-boundary assertion, which only matches between two word characters and therefore never matches at the start or end here.
\B(?=(?:.{4})+$)
Because \B already rules out the start and end of the string, you could even use * here.
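A short sketch of the \B variant with both quantifiers. It assumes the input contains only word characters (digits here); with spaces or punctuation, \B would behave differently.

```csharp
using System;
using System.Text.RegularExpressions;

var s = "4581458245834584";

// \B only matches between two word characters, so index 0 and the end are excluded.
var withPlus = Regex.Split(s, @"\B(?=(?:.{4})+$)");
var withStar = Regex.Split(s, @"\B(?=(?:.{4})*$)");

Console.WriteLine(string.Join("-", withPlus)); // 4581-4582-4583-4584
Console.WriteLine(string.Join("-", withStar)); // 4581-4582-4583-4584
```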

Regex only returns value when anchor is provided

I'm using the following pattern to match numbers in a string. I know there is only one number in a given string I'm trying to match.
var str = "Store # 100";
var regex = new Regex(@"[0-9]*");
When I call regex.Match(str).Value, this returns an empty string. However, if I change it to:
var regex = new Regex(@"[0-9]*$");
It does return the value I wanted. What gives?

Ok I figured it out.
The problem with [0-9]*, or to make it simpler \d*, is that * makes the digits optional, so the pattern also produces a zero-length match at every position before the '100', and Match returns the first of those.
To rectify this you could use \d\d* (equivalent to \d+), which requires at least one digit and so rules out the zero-length matches.
Edit: The dollar version, e.g. \d*$ will only work if your number is at the end of the input string.

According to MSDN,
The quantifiers *, +, and {n,m} and their lazy counterparts never
repeat after an empty match when the minimum number of captures has
been found. This rule prevents quantifiers from entering infinite
loops on empty subexpression matches when the maximum number of
possible group captures is infinite or near infinite.
So, as the minimum number of captures is zero, the [0-9]* pattern returns a series of empty matches. [0-9]+ will capture 100 without any problems.
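To make the difference concrete, here is a small sketch that runs all three patterns against the string from the question and prints where each first match lands. The variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

var str = "Store # 100";

var star = Regex.Match(str, @"[0-9]*");      // succeeds immediately with an empty match
var anchored = Regex.Match(str, @"[0-9]*$"); // the anchor rejects every start position before the digits
var plus = Regex.Match(str, @"[0-9]+");      // needs at least one digit

Console.WriteLine($"'{star.Value}' at {star.Index}");         // '' at 0
Console.WriteLine($"'{anchored.Value}' at {anchored.Index}"); // '100' at 8
Console.WriteLine($"'{plus.Value}' at {plus.Index}");         // '100' at 8
```

Note that the first match is still a success (star.Success is true); it is just zero characters long.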

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible combinations of strings I get are:
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what I need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now I want to try a new matching condition like this:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?

You can modify your expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
The outer capturing group, in conjunction with . (any single character), tries to match a third character, and the quantifier {2} then requires that whole sequence twice.
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")
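A quick sketch comparing the corrected pattern with the one from the question. The sample strings CDZF44DT508755 and XYKPQL are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var prefix = new Regex("^(?:XY|AB|PQ)ZF");
var original = new Regex("^((XY|AB|PQ).){2}");

Console.WriteLine(prefix.IsMatch("XYZF44DT508755"));   // True
Console.WriteLine(prefix.IsMatch("CDZF44DT508755"));   // False
Console.WriteLine(original.IsMatch("XYZF44DT508755")); // False
Console.WriteLine(original.IsMatch("XYKPQL"));         // True
```

The original pattern only matches when a second XY, AB or PQ follows the third character, which is why it rejects the real inputs.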

You want to test for just ^(XY|AB|PQ). Your regex means: search for either XY, AB or PQ, then any single character, and require that whole sequence twice; for example, "XYKPQL" would match your regex.
Breaking down ^(XY|AB|PQ):
^ forces the start of line,
(...) creates a matching group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.

You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)

Match at least one character and one number, regardless of order, with no suffix?

I need a RegEx that fulfills this statement:
There is at least one character and one digit, regardless of order, and there is no suffix (i.e. domain name) at the end.
So I have this test list:
ra182
jas182
ra1z4
And I have this RegEx:
[a-z]+[0-9]+$
It's matching the first two fully, but it's only matching the z4 in the last one. Though it makes sense to me why it's only matching that piece of the last entry, I need a little help getting this the rest of the way.

You can check the first two conditions with lookaheads:
/^(?=.*[a-z])(?=.*[0-9])/i
... and if the third one is just about the absence of ., it's simple to check too:
/^(?=.*[a-z])(?=.*[0-9])[^.]+$/i
But I'd probably prefer to use three separate tests instead: the first for letters (are you sure a plain range like [a-z] is enough, rather than a Unicode letter property?), the second for digits, and the last for that pesky dot, like this:
// input is the string under test (string itself is a reserved keyword in C#)
if (Regex.IsMatch(input, "[a-zA-Z]")
    && Regex.IsMatch(input, "[0-9]")
    && !Regex.IsMatch(input, @"\."))
{
    // input IS valid, proceed
}
The regex in the question will try to match one or more symbols, followed by one or more digits; it obviously will fail for the strings like 9a.

I suggest to use
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)(?!.*\.).*");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)(?!.*[.]).*");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])(?=.*\d)[^.]*$");
or
Match match = Regex.Match(str, @"^(?=.*[a-zA-Z])[^.]*\d[^.]*$");
if (match.Success) ...
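As a rough check, this sketch runs the third variant over the test list from the question plus a few extra inputs of my own (9a, abc, 123, ra182.com).

```csharp
using System;
using System.Text.RegularExpressions;

// At least one letter, at least one digit, and no dot anywhere in the string.
var valid = new Regex(@"^(?=.*[a-zA-Z])(?=.*\d)[^.]*$");

foreach (var candidate in new[] { "ra182", "jas182", "ra1z4", "9a", "abc", "123", "ra182.com" })
{
    Console.WriteLine($"{candidate}: {valid.IsMatch(candidate)}");
}
// The first four inputs pass; abc, 123 and ra182.com are rejected.
```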

You need to match alphanumeric strings that have at least one letter and one number? Try something like this:
\w*[a-z]\w*[0-9]\w*
This will make sure you have at least one letter and one number, with the number after the letter. If you want to take into account numbers before letters, just write the corresponding expression (numbers before letters) and join the two with |.
