Regex - Extract second position digit from string

I have a regex:
var thisMatch = Regex.Match(result, @"(?-s).+(?=[\r\n]+The information appearing in this document)", RegexOptions.IgnoreCase);
This returns the line before "The information appearing in this document" just fine.
The output of my regex is
10 880 $10,000 $800 $25 $10
I need to extract 880, which will always be in the second position (the number before 880 can vary, so \d{0,2} cannot be used).
How can I grab the second position number?

You can use something like
(?<=^\S+[\p{Zs}\t]+)\d+(?=.*[\r\n]+The information appearing in this document)
See the .NET regex demo. In C#:
var output = Regex.Match(result, @"(?<=^\S+[\p{Zs}\t]+)\d+(?=.*[\r\n]+The information appearing in this document)", RegexOptions.Multiline)?.Value;
Or, you could capture the number and grab it from a group with
^\S+[\p{Zs}\t]+(\d+).*[\r\n]+The information appearing in this document
See this regex demo. In C#:
var output = Regex.Match(result, @"^\S+[\p{Zs}\t]+(\d+).*[\r\n]+The information appearing in this document", RegexOptions.Multiline)?.Groups[1].Value;
Regex details:
(?<= - start of a positive lookbehind that requires its pattern to match immediately to the left of the current location:
^ - start of a line (due to the RegexOptions.Multiline)
\S+ - one or more non-whitespace chars
[\p{Zs}\t]+ - one or more horizontal whitespaces
) - end of the lookbehind
\d+ - one or more digits (use \S+ if you are sure this will always be the non-whitespace char streak)
(?= - start of a positive lookahead that requires its pattern to match immediately to the right of the current location:
.* - the rest of the line (as . does not match an LF char)
[\r\n]+ - one or more CR/LF chars
The information appearing in this document - literal text
) - end of the lookahead.
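As a rough sketch of the capture-group variant, the snippet below runs the pattern against a made-up input. The sample string (including the "Some header" line and the text after the marker) is an assumption for illustration; only the middle line comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: the line of interest sits directly above the marker text.
string result =
    "Some header\r\n" +
    "10 880 $10,000 $800 $25 $10\r\n" +
    "The information appearing in this document is an example.";

Match m = Regex.Match(
    result,
    @"^\S+[\p{Zs}\t]+(\d+).*[\r\n]+The information appearing in this document",
    RegexOptions.Multiline);

// Regex.Match does not return null; a failed match has Success == false.
string second = m.Success ? m.Groups[1].Value : string.Empty;
Console.WriteLine(second);
```

Checking Success before reading the group makes the "marker line not found" case explicit instead of silently yielding an empty string.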

If you insert
\d+\s(\d+)
this will capture a leading number (\d+), separated by a whitespace (\s) from the number you're looking for ((\d+)), captured in a capture group so you can easily access it.
Check the tab Split List in this online demo

Related

Regular expression that stops at first letter encountered

I want my regex expression to stop matching numbers of length between 2 and 10 after it encounters a letter.
So far I've come up with (\d{2,10})(?![a-zA-Z]) this. But it continues to match even after letters are encountered.
2216101225 /ROC/PL FCT DIN 24.03.2022 PL ERBICIDE' - this is the text I've been testing the regex on, but it matches 24 03 and 2022 also.
This is tested and intended for C#.
Can you help ? Thanks

Another option is to anchor the pattern and to match any character except chars a-zA-Z or a newline, and then capture the 2-10 digits in a capture group.
Then get the capture group 1 value from the match.
^[^A-Za-z\r\n]*\b([0-9]{2,10})\b
Explanation
^ Start of string
[^A-Za-z\r\n]* Optionally match chars other than a-zA-Z or a newline
\b([0-9]{2,10})\b Capture 2-10 digits between word boundaries in group 1
See a regex demo.
Note that in .NET, \d matches any Unicode decimal digit, not only 0-9 (unless RegexOptions.ECMAScript is used).
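The difference between \d and [0-9] in .NET can be checked with a quick sketch like the one below. The test character (ARABIC-INDIC DIGIT THREE, U+0663) is just an arbitrary non-ASCII digit picked for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside the ASCII range.
string arabicThree = "\u0663";

bool matchesBackslashD = Regex.IsMatch(arabicThree, @"\d");
bool matchesAsciiClass = Regex.IsMatch(arabicThree, @"[0-9]");
bool matchesEcmaScript = Regex.IsMatch(arabicThree, @"\d", RegexOptions.ECMAScript);

Console.WriteLine($"\\d: {matchesBackslashD}, [0-9]: {matchesAsciiClass}, \\d (ECMAScript): {matchesEcmaScript}");
```

If only ASCII digits should be accepted, [0-9] or RegexOptions.ECMAScript are the two usual ways to restrict the match.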

You can use the following .NET regex
(?<=^\P{L}*)(?<!\d)\d{2,10}(?!\d)
(?<=^[^a-zA-Z]*)(?<!\d)\d{2,10}(?!\d)
See the regex demo. Details:
(?<=^\P{L}*) - there must be no letters from the current position till the start of string ((?<=^[^a-zA-Z]*) only supports ASCII letters)
(?<!\d) - no digit immediately on the left is allowed.
\d{2,10} - two to ten digits
(?!\d) - no digit immediately on the right is allowed.
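A minimal sketch of the lookbehind approach, run against the sample text from the question, might look like this. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "2216101225 /ROC/PL FCT DIN 24.03.2022 PL ERBICIDE";

// The lookbehind only succeeds while no letter has appeared since the start of the string.
MatchCollection matches = Regex.Matches(input, @"(?<=^\P{L}*)(?<!\d)\d{2,10}(?!\d)");
string[] found = matches.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", found));
```

The numbers in 24.03.2022 are skipped because letters occur between the start of the string and those positions.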

Regex start new match at specific pattern

Hello, I'm fairly new to regex and have a small, maybe simple, question.
I have the given text:
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
My current regex (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?(.*)
matches only up to "sleeping", but creates 3 matches correctly.
But I need the "Additional test" text in the second group as well.
I tried something like (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?([,.:\w\s]*), but now I have only one huge match because the second group takes everything until the end.
How can I match everything until a new line starting with a date, and create a new match from there on?

If you are sure there is only one additional line to be matched you can use
(?m)^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s*(.*(?:\n.*)?)
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line
(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}) - Group 1: a datetime string
\s* - zero or more whitespaces
(.*(?:\n.*)?) - Group 2: any zero or more chars other than a newline char as many as possible and then an optional line, a newline followed with any zero or more chars other than a newline char as many as possible.
If there can be any amount of lines, you may consider
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)
See this regex demo. Here,
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2}) - matches the same as above, just \s is replaced with [\p{Zs}\t] that only matches horizontal whitespace
[\p{Zs}\t]* - 0+ horizontal whitespace chars
(?s) - now, . will match any chars including a newline
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\n\d{2}\.\d{2}\.\d{4}|\z) - up to the leftmost occurrence of a newline, followed with a date string, or up to the end of string.
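The "any number of lines" pattern can be tried out with a sketch like the following. The second record and its extra lines are invented sample data, and the input is assumed to use \n line endings.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: two records, the second with two continuation lines.
string text =
    "17.11.2020 15:32 typical Pat. seems sleeping\n" +
    "Additional test\n" +
    "17.11.2020 15:40 second entry\n" +
    "More text\n" +
    "Even more text";

string pattern =
    @"(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)";

MatchCollection records = Regex.Matches(text, pattern);
foreach (Match record in records)
{
    // Group 1 is the timestamp, group 2 is the body including continuation lines.
    string body = record.Groups[2].Value.Replace("\n", " | ");
    Console.WriteLine("[" + record.Groups[1].Value + "] " + body);
}
```

If the input may contain \r\n line endings, the lookahead would need \r?\n instead of \n, and a trailing \r would otherwise end up in group 2.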

The character class [,.:\w\s]* contains \s, which also matches newlines, so with the * quantifier it runs across line boundaries and matches too much.
Instead, match the rest of the line with .* (which does not match a newline), then a newline and the next line, all in the same group: (.*\r?\n.*)
^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s?(.*\r?\n.*)
Regex demo
If multiple lines can follow, match all following lines that do not start with a date like pattern.
^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)
Explanation
^ Start of the string (or of each line when the multiline option is enabled, which is needed to match every record)
( Capture group1
\d{2}\.\d{2}\.\d{4} Match a date like pattern
) Close group 1
\s* Match 0+ whitespace chars (Or match whitespace chars without newlines [^\S\r\n]*)
( Capture group 2
.* Match the whole line
(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)* Optionally repeat matching the whole line if it does not start with a date like pattern
) Close group 2
Regex demo

Trying to capture a decimal number out of a string

In a text file, I'm looking for a part of a document that contains a piece like min ISO 1133 0.2-0.35. What I want to capture is the ranged decimal part of that piece of text (0.2-0.35). Since there are other ranged decimal numbers, I cannot simply use a regular expression to look only for the ranged part. Till now, I could make min.*(\d+)((?:\.)?)(\d*)-(\d+)((?:\.)?)(\d*) but the result is not correct and I'm stuck. Can anyone please help me with this?
Below, you can see the final result (yellow part):

Maybe the following would work for you?
\bmin\s.*?(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)
See the online demo
The answer is currently based on the assumption (looking at your current attempt) that you want these ranges in separate groups. If not, it can easily be adapted to capture the whole substring (or see @TheFourthBird's answer).
\b - Match word boundary.
min - Literally match 'min'.
\s - Match a whitespace character.
.*? - Match any character other than newline up to (lazy):
( - Open 1st capture group
\d+ - At least a single digit.
(?: - Open non-capturing group.
\.\d+ - Match a literal dot and at least a single digit.
)? - Close non-capturing group and make it optional.
) - Close 1st capture group.
- - Match a literal hyphen.
( - Open 2nd capture group
\d+ - At least a single digit.
(?: - Open non-capturing group.
\.\d+ - Match a literal dot and at least a single digit.
)? - Close non-capturing group and make it optional.
) - Close 2nd capture group.
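To illustrate the two-group pattern, here is a small sketch that pulls both bounds out of the sample text and parses them. The parsing step with the invariant culture is an addition for illustration and not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

string line = "min ISO 1133 0.2-0.35";

Match range = Regex.Match(line, @"\bmin\s.*?(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)");

string lowerText = range.Groups[1].Value;
string upperText = range.Groups[2].Value;

// InvariantCulture makes sure "." is treated as the decimal separator on every machine.
double lower = double.Parse(lowerText, CultureInfo.InvariantCulture);
double upper = double.Parse(upperText, CultureInfo.InvariantCulture);

Console.WriteLine(lowerText + " to " + upperText);
```

The integer 1133 is skipped because it is not followed by a hyphen, so the lazy .*? keeps expanding until it reaches 0.2-0.35.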

You could get the decimal part matching 1+ digit in the optional part and making the quantifier non greedy. The value is in capture group 1.
\bmin [A-Z]+ [0-9]+ ([0-9]+(?:\.[0-9]+)?-[0-9](?:\.[0-9]+)?)\b
Regex demo
Or a bit more specific pattern
\bmin [A-Z]+ [0-9]+ ([0-9]+(?:\.[0-9]+)?-[0-9]+(?:\.[0-9]+)?)\b
Regex demo

Remove Dashes but Not Hyphens

I want to remove dashes before, after, and between spaced words, but not hyphenated words.
This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.
should become:
This is a test-sentence. Test One-Two--Three---Four.
Remove multiple dashes ---.
Keep multiple hyphens Three---Four.
I was trying to do it with this:
http://rextester.com/SXQ57185
string sentence = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
string regex = @"(?<!\w)\-(?!\-)|(?<!\-)\-(?!\w)";
sentence = Regex.Replace(sentence, regex, "");
Console.WriteLine(sentence);
But the output is:
This is a test-sentence. Test - One-TwoThree-Four--.

What I would recommend is a combination of a positive lookbehind and a positive lookahead against the characters that you don't want the dashes to be next to. In your case, that would be spaces and full stops. If either the lookbehind or the lookahead matches, you want to remove that dash.
This would be: ((?<=[\s\.])\-+)|(\-+(?=[\s\.])).
Breaking this down:
((?<=[\s\.])\-+) - match hyphens that follow either a space or a full stop
| - or
(\-+(?=[\s\.])) - match hyphens that are followed by either a space or a full stop
Here's a JavaScript example showcasing that:
const string = 'This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.';
const regex = /((?<=[\s\.])\-+)|(\-+(?=[\s\.]))/g;
console.log(string.replace(regex, ''));
And this can also be seen on Regex101.
Note that you'll probably also want to trim the excess spaces after using this, which can simply be done with .Trim() in C#.

You can use \b|\s for this task.
/(\b|\s)(-{3})(\b|\s)/g
DEMO
Breakdown shamelessly copied from regex101.com:
/(\b|\s)(-{3})(\b|\s)/g
1st Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Capturing Group (-{3})
-{3} matches the character - literally (case sensitive)
{3} Quantifier — Matches exactly 3 times
3rd Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])

You may just match all hyphens in between word chars, and remove all others with a simple
Regex.Replace(s, @"\b(-+)\b|-", "$1")
See the regex demo
Details
\b(-+)\b - word boundary, followed with 1+ hyphens, and then again a word boundary (that is, hyphen(s) in between letters, digits and underscores)
| - or
- - a hyphen in other contexts (it will be removed).
See the C# demo:
var s = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
var result = Regex.Replace(s, @"\b(-+)\b|-", "$1");
Console.WriteLine(result);
// => This is a test-sentence. Test One-Two--Three---Four.

Regular expression for updating version number

I have version numbers as given below.
020. 000. 1234. 43567 (please note the whitespace after the dot(.))
020,000,1234,43567
20.0.1234.43567
20,0,1234,43567
I want a regular expression for updating the last two numbers (after the second dot or comma) to, for example, 1298 and 45678 (any number):
020. 000. 1298. 43568 (please note the whitespace after the dot(.))
020,000,1298,45678
20.0.1298.45678
20,0,1298,45678
Thanks,

resultString = Regex.Replace(subjectString,
@"(\d+) # any number
([.,]\s*) # dot or comma, optional whitespace
(\d+) # etc.
([.,]\s*)
\d+
([.,]\s*)
\d+",
"$1$2$3${4}1298${5}43568", RegexOptions.IgnorePatternWhitespace);
Note the ${4} instead of $4 because otherwise the following 1 would be interpreted as belonging to the group number ($41).
Also note the difference between (\d+) and (\d)+. While both match 1234, the first one will capture 1234 into the group created by the parentheses. The second one will capture only 4 because the previous captures will be overwritten by the next.
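The difference between (\d+) and (\d)+ can be seen in a short sketch like this one. The additional look at the Captures collection is an aside that goes slightly beyond the answer: .NET keeps the earlier iterations there even though the group's Value only reflects the last one.

```csharp
using System;
using System.Text.RegularExpressions;

Match whole = Regex.Match("1234", @"(\d+)");
Match repeated = Regex.Match("1234", @"(\d)+");

string wholeValue = whole.Groups[1].Value;        // the full run of digits
string repeatedValue = repeated.Groups[1].Value;  // only the last iteration
int iterationCount = repeated.Groups[1].Captures.Count;

Console.WriteLine(wholeValue + " / " + repeatedValue + " / " + iterationCount);
```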

To replace version with 1298 and 43568
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+(?<seperator>[.,]\s*)\d+$");
var updated = regex.Replace(source, "1298${seperator}43568");
This is because
(?<=) doesn't include the group in the match but requires it to exist before the match
^ match start of string followed by at least one digit
(?:\d+[.,]\s*) non capturing group, match at least one digit followed by a . or , followed by 0 or more spaces
{2} previous match should occur twice
\d+ first part of the capture, 1 or more digits
(?<seperator>[.,]\s*) get the separator (a . or , followed by optional spaces) into a named capture group called seperator
\d+ capture one or more digits
$ match end of string
In the replacement string you just provide the replacement version and use ${seperator} to insert the original separator.
If you are not bothered about preserving the separator, you can just do
var regex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+[.,]\s*\d+$");
var updated = regex.Replace(source, "1298.43568");
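As a quick check of the separator-preserving variant, the sketch below applies it to the four sample version strings from the question. The array and variable names are illustrative, and the group name keeps the spelling used in the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

var versionRegex = new Regex(@"(?<=^(?:\d+[.,]\s*){2})\d+(?<seperator>[.,]\s*)\d+$");

string[] versions =
{
    "020. 000. 1234. 43567",
    "020,000,1234,43567",
    "20.0.1234.43567",
    "20,0,1234,43567"
};

// Replace returns a new string; the original strings are left untouched.
string[] updatedVersions = Array.ConvertAll(
    versions,
    v => versionRegex.Replace(v, "1298${seperator}43568"));

foreach (string version in updatedVersions)
{
    Console.WriteLine(version);
}
```

Only the separator between the last two numbers is captured; the first two numbers and their separators are left alone because the lookbehind is not part of the match.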
