How to create a repeating non-capturing group? - c#

I'm trying to create what I think is a repeating non-capturing group, and I just can't figure out how.
In plain words, I want to match:
Any number which is both
Preceded by any number of blocks that contain no spaces and are not just a number.
Followed by any number of blocks that contain no spaces and are not just a number.
Here is what I tried:
Pattern: (?:\w.)+(\d+)(?:.\w+)+
Test Set:
3.AAA
AAA.BBB
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.4
AAA.3.BBB.4.CCC
AAA.3.BBB.CCC
AAA.3.BBB.CCC.4
AAA.3.BBB.CCC.4.DDD
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.4
ZZZ.AAA.3.BBB.4.CCC
ZZZ.AAA.3.BBB.CCC
ZZZ.AAA.3.BBB.CCC.4
ZZZ.AAA.3.BBB.CCC.4.DDD
I would want it to match only to:
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC
Note: I saw some other posts asking the same-ish question, but I can't use the answers because they were all like "Instead of trying to repeat a group, just match 'this' and it will work for your specific case".

Code
See regex in use here
^(?:(?!(?:\.|^)\d+\.)\S)+\.\d+\.(?:(?!\.\d+(?:\.|$))\S)+$
Results
Input
3.AAA
AAA.BBB
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.4
AAA.3.BBB.4.CCC
AAA.3.BBB.CCC
AAA.3.BBB.CCC.4
AAA.3.BBB.CCC.4.DDD
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.4
ZZZ.AAA.3.BBB.4.CCC
ZZZ.AAA.3.BBB.CCC
ZZZ.AAA.3.BBB.CCC.4
ZZZ.AAA.3.BBB.CCC.4.DDD
Output
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC
Explanation
^ Assert position at the start of the line
(?:(?!(?:\.|^)\d+\.)\S)+ Match the following one or more times
(?!(?:\.|^)\d+\.) Negative lookahead ensuring what follows doesn't match
(?:\.|^) Match either of the following
\. Match a literal dot character .
^ Assert position at the start of the line
\d+ Match one or more digits
\. Match a literal dot character .
\S Match any non-whitespace character
\. Match a literal dot character .
\d+ Match one or more digits
\. Match a literal dot character .
(?:(?!\.\d+(?:\.|$))\S)+ Match the following one or more times
(?!\.\d+(?:\.|$)) Negative lookahead ensuring what follows doesn't match
\. Match a literal dot character .
\d+ Match one or more digits
(?:\.|$) Match either of the following
\. Match a literal dot character .
$ Assert position at the end of the line
\S Match any non-whitespace character
$ Assert position at the end of the line
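A minimal sketch of how the pattern above could be checked from C#. The class and method names (TemperedTokenDemo, Filter) are made up for illustration, and only a handful of the test strings are included:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemperedTokenDemo
{
    // Pattern copied unchanged from the answer above.
    private static readonly Regex Rx = new Regex(
        @"^(?:(?!(?:\.|^)\d+\.)\S)+\.\d+\.(?:(?!\.\d+(?:\.|$))\S)+$");

    // Keeps only the inputs that the pattern accepts as a whole.
    public static string[] Filter(string[] inputs) =>
        inputs.Where(s => Rx.IsMatch(s)).ToArray();

    public static void Main()
    {
        var inputs = new[]
        {
            "3.AAA", "AAA.BBB", "AAA.3.BBB", "AAA.3.BBB.4", "ZZZ.AAA.3.BBB.CCC"
        };
        foreach (var s in Filter(inputs))
            Console.WriteLine(s);   // AAA.3.BBB, then ZZZ.AAA.3.BBB.CCC
    }
}
```

Each \S is only consumed after the negative lookahead has confirmed that a numeric block does not start at that position, which is what stops the repeated group from running over a second number.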

There is a bit simpler solution:
^(?:(?!\d+\.)\w+\.)+\d+(?:\.(?!\d+(?=\.|$))\w+)+$
See the .NET regex demo. Since it is a multiline demo, \r? has to be added before $; this is not necessary when matching standalone strings.
Details
^ - start of string
(?:(?!\d+\.)\w+\.)+ - 1 or more occurrences (due to (?:...)+) of any 1+ word chars (letters, digits, _ - due to \w+) that are not all digits followed with a dot (note that to match only letters and digits, you need to use [\w-[_]] or [^\W_] instead of \w, or if you are really after matching the blocks that may even have symbols or punctuation, replace \w with [^\s.] - any char but whitespace or dot)
\d+ - 1 or more digits
(?:\.(?!\d+(?=\.|$))\w+)+ - 1 or more occurrences of
\. - a dot
(?!\d+(?=\.|$)) - not followed with 1+ digits (\d+) followed with a dot or end of string
\w+ - 1 or more word chars
$ - end of string.
C# demo:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var lst = new List<string> {"3.AAA", "AAA.BBB", "AAA.3.BBB", "AAA.3.B555B", "AAA.3.BBB.4",
            "AAA.3.BBB.4.CCC", "AAA.3.BBB.CCC", "AAA.3.BBB.CCC.4", "AAA.3.BBB.CCC.4.DDD",
            "ZZZ.AAA.3.BBB", "ZZZ.AAA.3.BBB.4", "ZZZ.AAA.3.BBB.4.CCC", "ZZZ.AAA.3.BBB.CCC",
            "ZZZ.AAA.3.BBB.CCC.4", "ZZZ.AAA.3.BBB.CCC.4.DDD"};
        // [^\s.]+ is used instead of \w+ so that a block may contain any char but whitespace or a dot
        var rx = new Regex(@"^(?:(?!\d+\.)[^\s.]+\.)+\d+(?:\.(?!\d+(?=\.|$))[^\s.]+)+$",
            RegexOptions.Compiled | RegexOptions.ECMAScript);
        foreach (var s in lst)
        {
            // Print only the strings that match as a whole
            if (rx.IsMatch(s))
                Console.WriteLine(s);
        }
    }
}
Results:
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC

Related

Regex to match a "global::" prefixed fully qualified C# type name

I'm trying to match fully qualified C# type names, but the + after \w+ captures too much:
global((::|\.)\w+(?!\s|\())+
Tried to play with quantifiers and negative lookahead but without success.
Online sandbox:
https://regex101.com/r/L6Y8kv/1
Sample:
public global::libebur128.EBUR128StateInternal D
{
get
{
var __result0 = global::libebur128.EBUR128StateInternal.__GetOrCreateInstance(((__Internal*)__Instance)->d, false);
return __result0;
}
Result:
global::libebur128.EBUR128StateInterna
global::libebur128.EBUR128StateInternal.__GetOrCreateInstanc
Expected:
global::libebur128.EBUR128StateInternal
global::libebur128.EBUR128StateInternal
For the example data, you might use:
\bglobal::[^\W_]+(?:\.[^\W_]+)*
The pattern matches:
\bglobal:: A word boundary, followed by matching global::
[^\W_]+ Match 1+ word characters excluding _
(?:\.[^\W_]+)* Optionally repeat matching . and 1+ word characters excluding _
See a regex101 demo.
If the last part should not be followed by ( and you don't want to take the underscore into account, you might add a word boundary and a negative lookahead:
\bglobal::\w+(?:\.\w+)*\b(?!\()
The pattern matches:
\b A word boundary
global:: Match literally
\w+ Match 1+ word chars
(?:\.\w+)* Optionally repeat . and 1+ word chars
\b A word boundary (to prevent backtracking to make the next assertion true)
(?!\() Negative lookahead, assert not ( directly to the right of the current position
regex101 demo
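A rough sketch of the second pattern used from C#, run against two lines taken from the sample. GlobalTypeNameDemo and FindTypeNames are illustrative names only:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlobalTypeNameDemo
{
    // Word boundary plus negative lookahead, as in the answer above.
    private static readonly Regex Rx =
        new Regex(@"\bglobal::\w+(?:\.\w+)*\b(?!\()");

    // Returns every global:: qualified name that is not directly followed by "(".
    public static string[] FindTypeNames(string source) =>
        Rx.Matches(source).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        var source =
            "public global::libebur128.EBUR128StateInternal D\n" +
            "var __result0 = global::libebur128.EBUR128StateInternal" +
            ".__GetOrCreateInstance(((__Internal*)__Instance)->d, false);";

        foreach (var name in FindTypeNames(source))
            Console.WriteLine(name);   // global::libebur128.EBUR128StateInternal (twice)
    }
}
```

On the second line the engine first reaches __GetOrCreateInstance, fails the (?!\() check, and cannot stop in the middle of that identifier because of \b, so it gives up the whole .__GetOrCreateInstance segment.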

Regex match multiple digits after '-'

This seems like it should be easy, but I'm not so good with regex, and this doesn't seem to be easy to find on google.
I need a regex that starts with the string 'SP-multiple digits' and ends with the string '- multiple digits'
For example, I have to match '-12' in "Sp-1234-12".
My attempt was: [^*-]*$ -> This case matches everything after the minus but i need the minus included.
For that digit and hyphen format, you could use a capture group for the part of the string that you want:
^Sp(?:-\d+)*(-\d+)$
Explanation
^ Start of string
Sp Match literally
(?:-\d+)* Optionally repeat - and 1+ digits
(-\d+) Capture group 1, match - and 1+ digits
$ End of string
Regex demo
Note that in C#, \d also matches decimal digits from other Unicode scripts; use [0-9] instead of \d (or pass RegexOptions.ECMAScript) to match only the digits 0-9.
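A small sketch of reading capture group 1 in C#. LastSegmentDemo and LastSegment are hypothetical names, and returning null for a non-match is just one possible choice:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastSegmentDemo
{
    // Pattern from the answer above; group 1 is the final "-digits" part.
    private static readonly Regex Rx = new Regex(@"^Sp(?:-\d+)*(-\d+)$");

    // Returns the last "-digits" segment, or null when the input does not match.
    public static string LastSegment(string input)
    {
        var m = Rx.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(LastSegment("Sp-1234-12"));            // -12
        Console.WriteLine(LastSegment("Sp-1234"));               // -1234
        Console.WriteLine(LastSegment("Sp") ?? "(no match)");    // (no match)
    }
}
```

The repeated non-capturing group first takes every "-digits" block and then backtracks by one block so that the capturing group has something to match.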

Regex start new match at specific pattern

Hello im kinda new to regex and have a small, maybe simple question.
I have the given text:
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
My current regex (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?(.*)
matches only up to "sleeping" but creates 3 matches correctly.
But I need the "Additional test" text in the second group as well.
I tried something like (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?([,.:\w\s]*) but now I have only one huge match because the second group takes everything until the end.
How can I match everything until a new line with a date starts and create a new match from there on?
If you are sure there is only one additional line to be matched you can use
(?m)^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s*(.*(?:\n.*)?)
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line
(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}) - Group 1: a datetime string
\s* - zero or more whitespaces
(.*(?:\n.*)?) - Group 2: any zero or more chars other than a newline char as many as possible and then an optional line, a newline followed with any zero or more chars other than a newline char as many as possible.
If there can be any amount of lines, you may consider
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)
See this regex demo. Here,
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2}) - matches the same as above, just \s is replaced with [\p{Zs}\t] that only matches horizontal whitespace
[\p{Zs}\t]* - 0+ horizontal whitespace chars
(?s) - now, . will match any chars including a newline
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\n\d{2}\.\d{2}\.\d{4}|\z) - up to the leftmost occurrence of a newline, followed with a date string, or up to the end of string.
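As a sketch, the any-number-of-lines pattern could be used from C# like this. The second record in the sample text is invented for the demo, and LazyRecordDemo and Records are illustrative names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyRecordDemo
{
    // Pattern from the answer above: lazy (.*?) stopped by a lookahead.
    private static readonly Regex Rx = new Regex(
        @"(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)");

    public static MatchCollection Records(string text) => Rx.Matches(text);

    public static void Main()
    {
        // Assumes \n line endings; the second record is made-up sample data.
        var text =
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test\n" +
            "17.11.2020 15:33 second entry\nLine two\nLine three";

        foreach (Match m in Records(text))
        {
            Console.WriteLine("timestamp: " + m.Groups[1].Value);
            Console.WriteLine("text: " + m.Groups[2].Value.Replace("\n", " / "));
        }
    }
}
```

Group 2 of the first match ends at "Additional test" because the lookahead sees a newline followed by a date; the last match runs to the end of the string via \z.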
You are using \s repeatedly using the * quantifier with the character class [,.:\w\s]* and \s also matches newlines and will match too much.
You can match the rest of the line with .* (which does not match a newline), then a newline, and then the next line, all in the same group: (.*\r?\n.*)
^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s?(.*\r?\n.*)
Regex demo
If multiple lines can follow, match all following lines that do not start with a date like pattern.
^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)
Explanation
^ Start of the string (or of each line when the multiline option is enabled)
( Capture group1
\d{2}\.\d{2}\.\d{4} Match a date like pattern
) Close group 1
\s* Match 0+ whitespace chars (Or match whitespace chars without newlines [^\S\r\n]*)
( Capture group 2
.* Match the whole line
(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)* Optionally repeat matching the whole line if it does not start with a date like pattern
) Close group 2
Regex demo
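A possible C# sketch of this line-by-line variant, assuming RegexOptions.Multiline so that ^ matches at every line start. The sample's second record is invented, and DateRecordDemo and Split are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateRecordDemo
{
    // Pattern from the answer above; Multiline is an assumption of this sketch.
    private static readonly Regex Rx = new Regex(
        @"^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)",
        RegexOptions.Multiline);

    // Splits the text into (date, body) pairs, one per dated record.
    public static List<(string Date, string Body)> Split(string text)
    {
        var records = new List<(string Date, string Body)>();
        foreach (Match m in Rx.Matches(text))
            records.Add((m.Groups[1].Value, m.Groups[2].Value));
        return records;
    }

    public static void Main()
    {
        var text =
            "17.11.2020 15:32 typical Pat. seems sleeping\nAdditional test\n" +
            "17.11.2020 15:33 second entry\nMore text\nEven more text";

        foreach (var (date, body) in Split(text))
        {
            var oneLine = body.Replace("\n", " | ");
            Console.WriteLine("[" + date + "] " + oneLine);
        }
    }
}
```

Because group 1 only covers the date here, the time ends up at the start of group 2. With \r\n line endings the . would also pick up the trailing \r, so the body may need trimming.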

Remove Dashes but Not Hyphens

I want to remove dashes before, after, and between spaced words, but not hyphenated words.
This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.
should become:
This is a test-sentence. Test One-Two--Three---Four.
Remove multiple dashes ---.
Keep multiple hyphens Three---Four.
I was trying to do it with this:
http://rextester.com/SXQ57185
string sentence = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
string regex = @"(?<!\w)\-(?!\-)|(?<!\-)\-(?!\w)";
sentence = Regex.Replace(sentence, regex, "");
Console.WriteLine(sentence);
But the output is:
This is a test-sentence. Test - One-TwoThree-Four--.
What I would recommend doing is a combination of both a positive lookbehind and a positive lookahead against the characters that you don't want the dashes to be next to. In your case, that would be spaces and full stops. If either the lookbehind or lookahead match, you want to remove that dash.
This would be: ((?<=[\s\.])\-+)|(\-+(?=[\s\.])).
Breaking this down:
((?<=[\s\.])\-+) - match hyphens that follow either a space or a full stop
| - or
(\-+(?=[\s\.])) - match hyphens that are followed by either a space or a full stop
Here's a JavaScript example showcasing that:
const string = 'This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.';
const regex = /((?<=[\s\.])\-+)|(\-+(?=[\s\.]))/g;
console.log(string.replace(regex, ''));
And this can also be seen on Regex101.
Note that you'll probably also want to trim the excess spaces after using this, which can simply be done with .Trim() in C#.
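For reference, a hedged C# sketch of the same idea. The second Regex.Replace, which collapses the runs of spaces left behind, is an addition of this sketch and not part of the answer; DashCleanupDemo and RemoveStrayDashes are illustrative names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashCleanupDemo
{
    public static string RemoveStrayDashes(string sentence)
    {
        // Same alternation as above: dashes right after, or right before,
        // whitespace or a full stop.
        var withoutDashes = Regex.Replace(sentence, @"(?<=[\s.])-+|-+(?=[\s.])", "");

        // Assumption of this sketch: collapse leftover runs of spaces, then trim the ends.
        return Regex.Replace(withoutDashes, @" {2,}", " ").Trim();
    }

    public static void Main()
    {
        var sentence = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
        Console.WriteLine(RemoveStrayDashes(sentence));
        // This is a test-sentence. Test One-Two--Three---Four.
    }
}
```

Without the second replacement the result keeps double spaces where " - " and "- --- " used to be, which .Trim() alone does not remove.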
You can use \b|\s for this task.
/(\b|\s)(-{3})(\b|\s)/g
DEMO
Breakdown shamelessly copied from regex101.com:
/(\b|\s)(-{3})(\b|\s)/g
1st Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Capturing Group (-{3})
-{3} matches the character - literally (case sensitive)
{3} Quantifier — Matches exactly 3 times
3rd Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
You may just match all hyphens in between word chars, and remove all others with a simple
Regex.Replace(s, @"\b(-+)\b|-", "$1")
See the regex demo
Details
\b(-+)\b - word boundary, followed with 1+ hyphens, and then again a word boundary (that is, hyphen(s) in between letters, digits and underscores)
| - or
- - a hyphen in other contexts (it will be removed).
See the C# demo:
var s = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
var result = Regex.Replace(s, @"\b(-+)\b|-", "$1");
Console.WriteLine(result);
// => This is a test-sentence. Test One-Two--Three---Four.

Regular expression for updating version number

I have version numbers as given below.
020. 000. 1234. 43567 (please note the whitespace after the dot(.))
020,000,1234,43567
20.0.1234.43567
20,0,1234,43567
I want a regular expression for updating the numbers after the last two dots (.) to, for example, 1298 and 45678 (any number)
020. 000. 1298. 43568 (please note the whitespace after the dot(.))
020,000,1298,45678
20.0.1298.45678
20,0,1298,45678
Thanks,
resultString = Regex.Replace(subjectString,
    @"(\d+)      # any number
    ([.,]\s*)    # dot or comma, optional whitespace
    (\d+)        # etc.
    ([.,]\s*)
    \d+
    ([.,]\s*)
    \d+",
    "$1$2$3${4}1298${5}43568", RegexOptions.IgnorePatternWhitespace);
Note the ${4} instead of $4 because otherwise the following 1 would be interpreted as belonging to the group number ($41).
Also note the difference between (\d+) and (\d)+. While both match 1234, the first one will capture 1234 into the group created by the parentheses. The second one will capture only 4 because the previous captures will be overwritten by the next.
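A short sketch that makes the difference visible. GroupQuantifierDemo and its two helper methods are made-up names; Groups and Captures are the standard .NET members:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupQuantifierDemo
{
    // Value of group 1, which is always the LAST thing the group captured.
    public static string LastCapture(string pattern, string input) =>
        Regex.Match(input, pattern).Groups[1].Value;

    // Number of times group 1 captured something during the match.
    public static int CaptureCount(string pattern, string input) =>
        Regex.Match(input, pattern).Groups[1].Captures.Count;

    public static void Main()
    {
        Console.WriteLine(LastCapture(@"(\d+)", "1234"));    // 1234
        Console.WriteLine(LastCapture(@"(\d)+", "1234"));    // 4
        Console.WriteLine(CaptureCount(@"(\d+)", "1234"));   // 1
        Console.WriteLine(CaptureCount(@"(\d)+", "1234"));   // 4
    }
}
```

In .NET the earlier captures of a repeated group are not lost entirely: they stay available through Captures, but $1 in a replacement string only sees the last one.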
To replace version with 1298 and 43568
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+(?<seperator>[.,]\s*)\d+$");
// Replace returns a new string; the source string itself is not modified
var result = regex.Replace(source, "1298${seperator}43568");
This is because
(?<=...) lookbehind: its contents are not included in the match but must be present immediately before it
^ match the start of the string
(?:\d+[.,]\s*) non-capturing group: one or more digits followed by a . or , followed by zero or more whitespace characters
{2} the previous group must occur exactly twice
\d+ first part of the match, one or more digits
(?<seperator>[.,]\s*) capture the separator, a . or , followed by optional whitespace, into a named capture group called seperator
\d+ match one or more digits
$ match the end of the string
In the replacement string you just provide the new version numbers and use ${seperator} to insert the original separator.
If you are not bothered about preserving the separator you can just do
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+[.,]\s*\d+$");
// Replace returns a new string; the source string itself is not modified
var result = regex.Replace(source, "1298.43568");
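A sketch of the separator-preserving variant applied to all four formats from the question. VersionUpdateDemo and Update are hypothetical names, and passing the new numbers as int parameters is just one way to do it:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionUpdateDemo
{
    // Pattern from the answer above; the group name keeps the answer's spelling.
    private static readonly Regex Rx = new Regex(
        @"(?<=^(?:\d+[.,]\s*){2})\d+(?<seperator>[.,]\s*)\d+$");

    // Replaces the last two numbers and reuses whatever separator was between them.
    public static string Update(string version, int build, int revision) =>
        Rx.Replace(version, build + "${seperator}" + revision);

    public static void Main()
    {
        var versions = new[]
        {
            "020. 000. 1234. 43567",
            "020,000,1234,43567",
            "20.0.1234.43567",
            "20,0,1234,43567"
        };
        foreach (var v in versions)
            Console.WriteLine(Update(v, 1298, 43568));
    }
}
```

The first two numbers are only inspected by the lookbehind, so their leading zeros and separators stay exactly as they were.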
