Using Regular Expression to find exact length match multiple times - c#

I need a regular expression to find groups of exactly 8 numbers in a row. The closest I have gotten is:
[0-9]{8}
but it's not exactly what I need. If a number is 9 digits long, this matches the first 8, but I want a number to be ignored when it is longer or shorter than 8 digits.
Here are some examples
1234567890 <- no match, it's longer than 8
12345678 <- match: "12345678"
1234567809876543 <- match 1: "12345678", match 2: "09876543" (two groups of 8)
,,111-11-1234,12345678, <- match: "12345678"
To summarize, for every group of exactly 8 numbers make a match.
I'm working with some results of OCR (Optical Character Recognition) and I have to work with the shortcomings of the results so my input can be varied as in the examples above.
Here is some use case data: http://pastebin.com/uijF9K9n

You can use the following regex in .NET:
(?<=^|\D|(?:\d{8})+)\d{8}(?=$|\D|(?:\d{8})+)
See regex demo
It is based on variable-width lookbehind and a lookahead.
Regex breakdown:
(?<=^|\D|(?:\d{8})+) - only if at the string start (^), or preceded with not a digit (\D) or 1 or more sequences of 8 digits ((?:\d{8})+)...
\d{8} - match 8 digits that are followed by...
(?=$|\D|(?:\d{8})+) - either end of string ($) or not a digit (\D) or 1 or more sequences of 8 digits ((?:\d{8})+).
IMPORTANT:
If I got a downvote for the "extra" complexity compared with another answer, note that our solutions differ: my regex matches the 8-digit number in ID12345678, while the other one does not because of the word boundary.
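A minimal sketch of how the lookaround pattern above could be exercised in C#. The class and method names (EightDigitGroups, Find, FindStrict) are hypothetical and only for illustration. FindStrict is not part of the answer: it is an alternative that assumes a run of digits should be split only when its whole length is a multiple of 8.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class EightDigitGroups
{
    // Pattern copied from the answer; it relies on .NET's variable-width lookbehind.
    private static readonly Regex Pattern = new Regex(
        @"(?<=^|\D|(?:\d{8})+)\d{8}(?=$|\D|(?:\d{8})+)");

    // Hypothetical helper: returns every 8-digit group the pattern finds.
    public static string[] Find(string input) =>
        Pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    // Alternative sketch (not from the answer): take each maximal run of
    // ASCII digits and split it into chunks of 8 only when the run length
    // is a multiple of 8.
    public static string[] FindStrict(string input) =>
        Regex.Matches(input, @"[0-9]+").Cast<Match>()
            .Where(m => m.Length % 8 == 0)
            .SelectMany(m => Enumerable.Range(0, m.Length / 8)
                .Select(i => m.Value.Substring(i * 8, 8)))
            .ToArray();
}
```

With the question's samples, Find returns two groups for 1234567809876543, one for ,,111-11-1234,12345678, and ID12345678, and none for 1234567890. One caveat worth checking against real OCR data: the lookarounds are not tied to the edges of the digit run, so a run of 17 digits still produces two matches with Find, while FindStrict returns none.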

You can also try this regex
(?:\b|\G)\d{8}(?=(?:\d{8})*\b)
(?:\b|\G) - match a word boundary (\b) or (|) continue where the previous match ended (\G)
\d{8} - match 8 digits, followed by a lookahead (?=...) that checks
(?:\d{8})*\b - that any number of further 8-digit groups follow, up to another word boundary
It matches a group of 8 digits, standalone or inside a sequence of such groups, provided the whole sequence sits between two word boundaries.
See demo at regexstorm
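As a rough illustration, the \G-based pattern can be tried against the question's samples like this. BoundaryGroups and Find are hypothetical names, not part of the answer.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryGroups
{
    // Pattern copied from the answer: \b starts a run, \G continues right
    // after the previous match, and the lookahead requires the rest of the
    // run to consist of whole 8-digit groups.
    public static string[] Find(string input) =>
        Regex.Matches(input, @"(?:\b|\G)\d{8}(?=(?:\d{8})*\b)")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
}
```

Find("1234567809876543") gives the two groups 12345678 and 09876543, and Find("1234567890") gives none. Because the pattern depends on word boundaries, a number glued to letters or an underscore, such as ID12345678, is not matched.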

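\b[0-9]{8}\b matches a standalone run of exactly 8 digits. Note that it does not split a 16-digit run into two groups, because there is no word boundary between two digits.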
For more details check this out
http://www.rexegg.com/regex-boundaries.html

Related

Regular expression that stops at first letter encountered

I want my regex expression to stop matching numbers of length between 2 and 10 after it encounters a letter.
So far I've come up with this: (\d{2,10})(?![a-zA-Z]). But it continues to match even after letters are encountered.
2216101225 /ROC/PL FCT DIN 24.03.2022 PL ERBICIDE' - this is the text I've been testing the regex on, but it matches 24 03 and 2022 also.
This is tested and intended for C#.
Can you help? Thanks
Another option is to anchor the pattern and to match any character except chars a-zA-Z or a newline, and then capture the 2-10 digits in a capture group.
Then get the capture group 1 value from the match.
^[^A-Za-z\r\n]*\b([0-9]{2,10})\b
Explanation
^ Start of string
[^A-Za-z\r\n]* Optionally match chars other than a-zA-Z or a newline
\b([0-9]{2,10})\b Capture 2-10 digits between word boundaries in group 1
See a regex demo.
Note that in .NET \d matches any Unicode decimal digit, not only 0-9.
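A small sketch of reading capture group 1 from the anchored pattern above. LeadingNumber and Extract are hypothetical names. Since the pattern is anchored with ^, it yields at most one match per input.

```csharp
using System.Text.RegularExpressions;

public static class LeadingNumber
{
    // Returns the group 1 value, or null when nothing matches.
    // Because the prefix [^A-Za-z\r\n]* is greedy, group 1 holds the last
    // 2-10 digit number that appears before the first letter.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, @"^[^A-Za-z\r\n]*\b([0-9]{2,10})\b");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the sample text from the question, Extract returns 2216101225. For an input such as "12 34 ab 56" it returns 34, so this form fits best when a single number is expected before the first letter.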
You can use the following .NET regex
(?<=^\P{L}*)(?<!\d)\d{2,10}(?!\d)
(?<=^[^a-zA-Z]*)(?<!\d)\d{2,10}(?!\d)
See the regex demo. Details:
(?<=^\P{L}*) - there must be no letters from the current position till the start of string ((?<=^[^a-zA-Z]*) only supports ASCII letters)
(?<!\d) - no digit immediately on the left is allowed.
\d{2,10} - two to ten digits
(?!\d) - no digit immediately on the right is allowed.
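To collect every number that appears before the first letter, the lookbehind pattern can be used with Regex.Matches, roughly as follows. NumbersBeforeLetters and Find are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NumbersBeforeLetters
{
    // Pattern copied from the answer: the lookbehind only succeeds while
    // no Unicode letter (\p{L}) occurs between the string start and the
    // current position.
    public static string[] Find(string input) =>
        Regex.Matches(input, @"(?<=^\P{L}*)(?<!\d)\d{2,10}(?!\d)")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
}
```

On the question's sample this yields only 2216101225. On "12 34 ab 56" it yields 12 and 34 and stops at the first letter.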

Regex to match 7 same digits in a number regardless of position

I want to match an 8-digit number. Currently, I have the following regex, but it is failing in some cases.
(\d+)\1{6}
It matches only when a number is different at the end such as 44444445 or 54444444. However, I am looking to match cases where at least 7 digits are the same regardless of their position.
It is failing in cases like
44454444
44544444
44444544
What modification is needed here?
It's probably a bad idea to use this in a performance-sensitive location, but you can use a capture reference to achieve this.
The Regex you need is as follows:
(\d)(?:.*?\1){6}
Breaking it down:
(\d) Capture group of any single digit
.*? means match any character, zero or more times, lazily
\1 means match the first capture group
We enclose that in a non-capturing group (?:
And add a quantifier {6} to match six times
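A quick sketch of the backreference pattern in use. SevenSameDigits and IsMatch are hypothetical names. Note that the pattern is unanchored and . matches any character, so this check assumes the input has already been reduced to the 8-digit number.

```csharp
using System.Text.RegularExpressions;

public static class SevenSameDigits
{
    // Pattern copied from the answer: capture one digit, then find the
    // same digit six more times with anything (lazily) in between.
    public static bool IsMatch(string number) =>
        Regex.IsMatch(number, @"(\d)(?:.*?\1){6}");
}
```

IsMatch returns true for 44454444, 44544444 and 44444544, and false for 44554444, which contains only six fours. It also returns true for "4 4 4 4 4 4 4", which shows why the input should be validated separately.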
You can sort the digits before matching
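using System.Linq;
using System.Text.RegularExpressions;

string input = "44444445 54444444 44454444 44544444 44444544";
string[] numbers = input.Split(' ');
foreach (var number in numbers)
{
    // Sort the characters so that identical digits become adjacent.
    // A foreach variable cannot be reassigned, so use a new local.
    string sorted = String.Concat(number.OrderBy(c => c));
    if (Regex.IsMatch(sorted, @"(\d+)\1{6}"))
    {
        // do something
    }
}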
Still not a good idea to use regex for this though
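Following that remark, here is one possible way to do the check without a regex: count how often each digit occurs. This is only a sketch. SameDigitCheck and HasSevenSame are hypothetical names, and the restriction to exactly 8 ASCII digits is an assumption taken from the question.

```csharp
using System.Linq;

public static class SameDigitCheck
{
    // True when the text is exactly 8 ASCII digits and a single digit
    // occurs at least 7 times, regardless of position.
    public static bool HasSevenSame(string number) =>
        number.Length == 8
        && number.All(c => c >= '0' && c <= '9')
        && number.GroupBy(c => c).Any(g => g.Count() >= 7);
}
```

HasSevenSame("44454444") is true, while HasSevenSame("44554444") and HasSevenSame("444544445") are false.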
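The pattern that you tried, (\d+)\1{6}, matches a run of digits followed by six immediate repetitions of it, so with a single digit in the group it needs seven identical digits in a row. If you want to stretch the match over the same digit in non-adjacent positions, you have to match optional digits in between.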
Note that in .NET \d matches more digits than 0-9 only.
If you want to match only digits 0-9 using C# without matching other characters in between the digits:
([0-9])(?:[0-9]*?\1){6}
The pattern matches:
([0-9]) Capture group 1
(?: Non capture group
[0-9]*?\1 Match optional digits 0-9 and a backreference to group 1
){6} Close non capture group and repeat 6 times
See a .NET Regex demo
If you want to match only 8 digits, you can use a positive lookahead (?= to assert 8 digits and word boundaries \b
\b(?=\d{8}\b)[0-9]*([0-9])(?:[0-9]*?\1){6}\d*\b
See another .NET Regex demo
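The 8-digit variant could be wrapped like this. EightDigitSevenSame and IsMatch are hypothetical names, and the pattern is copied unchanged from above.

```csharp
using System.Text.RegularExpressions;

public static class EightDigitSevenSame
{
    // The lookahead asserts a standalone run of exactly 8 digits; the
    // rest requires one digit to occur at least 7 times inside that run.
    private static readonly Regex Pattern = new Regex(
        @"\b(?=\d{8}\b)[0-9]*([0-9])(?:[0-9]*?\1){6}\d*\b");

    public static bool IsMatch(string input) => Pattern.IsMatch(input);
}
```

IsMatch is true for 44454444 and 44444444, and false for 12345678 and for the 9-digit 444544445.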

Regex with mixed character set and different counts for each

I'm trying to find a number with a fixed length with a regex. The problem is that in some cases the number is split into portions divided by spaces or dashes. Examples are:
123456789
123 456 789
12 34567 89
123-456-789
12-345678-9
I think you get what I mean. The Regex I'm currently using would only get the first number:
(?<=^|\D)([0-9]{9})(?=$|\D)
When I add spaces and dashes to my character list like this:
(?<=^|\D)([0-9 -]{9})(?=$|\D)
I still don't get the desired results, as the "numbers" containing them are longer than 9 characters. If I take more characters I would end up with a lot of false results.
What I would need is a way to tell the regex to take numbers, spaces and dashes but with the following restrictions:
The number can only be 9 characters long (without spaces and dashes)
no two spaces or dashes or a mix of them should be in a row
Additionally, it would be nice if the dashes and spaces weren't returned, but that's not that important
I suggest using
(?<!\d)\d(?:[ -]?\d){8}(?!\d)
See the regex demo. To only match ASCII digits, pass RegexOptions.ECMAScript option to the regex constructor.
Pattern details:
(?<!\d) - a negative lookbehind that fails the match if there is a digit symbol immediately to the left of the current location (same as (?<=^|\D)) (NOTE: to avoid matching 234.123 4567-89 replace this lookbehind with (?<!\d\.?))
\d - a digit
(?:[ -]?\d){8} - exactly 8 occurrences of an optional space or - followed by a digit
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current location (NOTE: to prevent matching 123456789.34, use (?!\.?\d) instead).
C# usage to extract matches:
var results = Regex.Matches(s, @"(?<!\d)\d(?:[ -]?\d){8}(?!\d)", RegexOptions.ECMAScript)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
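The question also asked for the spaces and dashes to be left out of the result. One way to get that, sketched here under the assumption that post-processing each match is acceptable, is to strip the separators after matching. NineDigitNumbers and Extract are hypothetical names, and [0-9] is written out instead of combining \d with RegexOptions.ECMAScript.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NineDigitNumbers
{
    // Same structure as the suggested pattern, with explicit ASCII digits.
    private static readonly Regex Pattern = new Regex(
        @"(?<![0-9])[0-9](?:[ -]?[0-9]){8}(?![0-9])");

    // Returns each 9-digit number with its spaces and dashes removed.
    public static string[] Extract(string s) =>
        Pattern.Matches(s).Cast<Match>()
            .Select(m => Regex.Replace(m.Value, "[ -]", ""))
            .ToArray();
}
```

Extract("123 456 789") and Extract("12-345678-9") both return a single entry, 123456789. A 10-digit input or one with two consecutive separators, such as "123  456 789", returns nothing.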
This regex works:
(\d\s?-?){9}
It's looking for 9 groups of any digit followed by optional whitespace and optional hyphen character.
So it would match all of your examples, but would also match the following:
1 2 3 4 5 6 7 8 9
1 -2 -3 -4 -5 -6 -7 -8 -9 -
etc.
It's a simple regex, but it might not meet your requirements if you want to exclude matches with a trailing space or hyphen. Wiktor Stribiżew's answer provides a more complex regex which may suit your needs more thoroughly.

Regex failing in max length

I want a regex that allows the following formats:
1234567-8
123456B
If the user enters the second pattern, the input should be limited to a maximum of 7 characters, so
1234568B
123456V1
are invalid.
I have tried
[0-9]{7}-[0-9]|[[0-9]{6}[A-z]{1}]{7,7}
but this fails
For the sample input you provided, you may use ^([0-9]{7}-[0-9]|[0-9]{6}[A-Za-z])$.
A bit more contracted version: ^[0-9]{6}(?:[0-9]-[0-9]|[A-Za-z])$.
Note that 1234567-8 has 7 digits and a hyphen followed with a digit, so the whole string length cannot be limited to just 7 characters all in all.
In .NET and almost all other regex flavors [A-z] is a mistake, as it can match more than just letters.
Placing a quantifier {1} into a character class makes it a simple symbol combination, so [{1}] matches either { or 1 or }.
The {7,7} (={7}) will not limit the whole string length to 7, as you do not have anchors (^ and $) around the expression and you "ruined" the preceding quantifiers by putting them into a character class.
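A minimal sketch that applies the anchored pattern to the inputs from the question, and shows why [A-z] is risky. CodeValidator, IsValid and LooseRangeMatches are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class CodeValidator
{
    // Pattern copied from the answer; ^ and $ make it validate the
    // whole input instead of a substring.
    public static bool IsValid(string s) =>
        Regex.IsMatch(s, @"^([0-9]{7}-[0-9]|[0-9]{6}[A-Za-z])$");

    // Demonstrates the [A-z] pitfall: the range also covers the
    // punctuation characters that sit between 'Z' and 'a' in ASCII.
    public static bool LooseRangeMatches(string s) =>
        Regex.IsMatch(s, @"^[A-z]$");
}
```

IsValid accepts 1234567-8 and 123456B and rejects 1234568B and 123456V1. LooseRangeMatches("_") returns true, since the underscore lies between Z and a.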
I think this is what you need:
^(\d{7}-\d|\d{6}[A-Z])$
7 digits, a dash and a digit, OR 6 digits and one uppercase Latin letter.
^\d{6}(?:\d-\d|[A-Z])$
It matches both of your formats above:
1234567-8
123456B

Get a number with exactly x digits from string

I'm looking for a regex pattern that matches a number with a length of exactly x (say x is 2-4) and nothing else.
Examples:
"foo.bar 123 456789", "foo.bar 456789 123", " 123", "foo.bar123 " has to match only "123"
So. Only digits, no spaces, letters or other stuff.
How do I have to do it?
EDIT: I want to use the Regex.Matches() function in c# to extract this 2-4 digit number and use it in additional code.
Any pattern followed by {m,n} is allowed to occur m to n times, so in your case use \d{m,n} with the required values of m and n. If it has to be exactly m digits, use \d{m}.
If you want to match 123 in x123y and not in 1234, use \d{3}(?=\D|$)(?<=(\D|^)\d{3})
It has a lookahead to ensure that the character following the 3 digits is a non-digit or nothing at all, and a lookbehind to ensure that the character before the 3 digits is a non-digit or nothing at all.
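For instance, the 3-digit pattern could be used with Regex.Matches as in this sketch. ExactThreeDigits and Find are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ExactThreeDigits
{
    // Pattern copied from the answer: 3 digits that are neither preceded
    // nor followed by another digit.
    public static string[] Find(string input) =>
        Regex.Matches(input, @"\d{3}(?=\D|$)(?<=(\D|^)\d{3})")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
}
```

Find("x123y") returns 123, and Find("1234") returns nothing.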
You can achieve this with basic RegEx:
\b(\d\d\d)\b or \b(\d{3})\b - for matching a number with exactly 3 digits
If you want variable digits: \b(\d{2,4})\b (explained demo here)
If you want to capture matches next to words: \D(\d{2,4})\D (explained demo here)
\b is a word boundary (a zero-width assertion: it matches a position and consumes no characters)
\d matches only digits
\D matches any character that is NOT a digit
() everything in round brackets will capture a match
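The patterns above behave differently on the question's samples, which a small comparison can make visible. PatternComparison and Groups are hypothetical names. The third pattern in the usage note, (?<!\d)(\d{2,4})(?!\d), is not taken from this answer and is included only as an assumed alternative for comparison.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Returns the group 1 value of every match of the given pattern.
    public static string[] Groups(string pattern, string input) =>
        Regex.Matches(input, pattern)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
}
```

Groups(@"\b(\d{2,4})\b", "foo.bar 123 456789") returns 123, but the same pattern finds nothing in "foo.bar123 " because there is no word boundary between r and 1. Groups(@"\D(\d{2,4})\D", " 123") finds nothing either, since \D needs a character after the digits. Groups(@"(?<!\d)(\d{2,4})(?!\d)", ...) returns 123 for both of those inputs.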
