Regex - Find multiple matches - c#

I have the following string, 1010159552597, and I would like to find the numbers that start with 10, followed by 1 or 0, followed by 7 digits. I use the following regex to search:
(10[01][0-9]{7})
This gives the following result: 1010159552
But I also would have expected the following: 1015955259
How can I manage to get both results?
Thanks

Regular expressions consume characters and don't go back over previous matches. A way around this is to use zero-length assertions (see code below) to capture what you want.
Code
(?=(10[01]\d{7}))
Results are in capture group 1:
1010159552
1015955259
Explanation
(?=(10[01]\d{7})) Positive lookahead ensuring what follows matches
(10[01]\d{7}) Capture your original expression into capture group 1
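A minimal C# sketch of the lookahead approach, assuming the input string from the question; the variable names are illustrative only. Because the lookahead itself consumes no characters, the engine retries at every position, and each overlapping candidate is read from capture group 1.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later). Input taken from the question.
string input = "1010159552597";

// The outer lookahead is zero-width, so matches may overlap;
// the text of interest is held in capture group 1.
string[] overlapping = Regex.Matches(input, @"(?=(10[01]\d{7}))")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

foreach (string value in overlapping)
{
    Console.WriteLine(value);
}
```

Note that m.Value is empty for every match here, since the whole pattern is zero-width; only Groups[1] carries text.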

You're right that the value you expected also matches your regex. However, the engine returns the first match it finds and then resumes searching after the end of that match.
In your case the first match is:
10 - 1 - 0159552
so this is the result returned.
Since your results are overlapping, you might want to check out this article.
Overlapping matches in Regex

Related

Regex groups expression not capturing content

I'm trying to create a large regex expression where the plan is to capture 6 groups.
It is going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail from my thought process:
Matches the date (I'm considering changing this to a fixed number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.
The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
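A short C# sketch comparing the original and corrected patterns, assuming the sample line really is tab-separated as the question states (the tabs are rebuilt explicitly here because they are not visible in the post). Variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample line from the question, rebuilt with explicit tab separators (assumption).
string line = string.Join("\t",
    "2020-03-10T14:09:13.3250000", "VERB", "CallingClass",
    "17503", "20870", "Whatever content: this log line had (etc)");

// Original pattern: \w{+} needs literal '{' and '}' characters, so it cannot match.
string brokenPattern =
    @"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)";

// Corrected pattern from the answer: \w+ means one or more word characters.
string fixedPattern =
    @"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)";

bool brokenMatches = Regex.IsMatch(line, brokenPattern);
Match match = Regex.Match(line, fixedPattern);

Console.WriteLine($"Original pattern matches: {brokenMatches}");
for (int i = 1; i < match.Groups.Count; i++)
{
    // Group 0 is the whole match; groups 1 to 6 are the captured fields.
    Console.WriteLine($"Group {i}: {match.Groups[i].Value}");
}
```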

How to tell a RegEx to be greedy on an 'Or' Expression

Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achieve:
In our company, we're using a web service to report our working time. I want to develop a desktop application to easily keep an eye on the time worked. I successfully downloaded the server's response (with all the data necessary), but unfortunately this data is in quite a bad state for processing.
Therefore I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The regular expression posted above splits the day's dataset after the first time set (so after 10:33). Therefore I want the regex to handle the Or-part "greedily": if expression 1 (the larger one) matches, skip the second expression; if expression 1 does not match, use the second one.
I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double square brackets:
\[A\].*(?:\[\[\]\]|\[\])
You may use
\[A][\s\S]*?(?=\[A]|$)
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, @"(?!^)(?=\[A])")
The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If instead of A there can be any letter, replace A with [A-Z] or [A-Z]+.
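A minimal C# sketch of the split approach, assuming the two sample entries from the question are joined by a newline (the exact separator in the real data is not shown in the post). Variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample text from the question; the newline between entries is an assumption.
string s = "[A]I'm an example text [] But I want to be included [[]]\n"
         + "[A]I'm another text without a second part []";

// Split at every position that is followed by "[A]", except the start of the string.
string[] entries = Regex.Split(s, @"(?!^)(?=\[A])");

foreach (string entry in entries)
{
    // TrimEnd drops the trailing newline that stays attached to the first entry.
    Console.WriteLine(entry.TrimEnd());
}
```

Since the split point is zero-width, nothing is removed from the input: each piece keeps its leading [A] and everything up to the next one.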

Capture Text Surrounding Regex Match .NET

I am building an application, and I have a requirement to capture characters before and after matches. This seems to work okay, except when there are multiple matches within the surrounding capture.
Regex:
.{0,10}(?=abc)
This should capture up to 10 characters before the string "abc" is found.
The issue comes up if there is a recurrence of the match in the preceding text:
"qqqqabcabcqqq"
With the above text, I would expect two captures:
qqqq (the 4 characters before the first abc occurrence)
qqqqabc (the 7 characters before the second abc occurrence)
I am not, however getting these matches. The only match I get is:
qqqqabc
I am certain that I am missing something, but I am not sure what. I believe that my regex is somehow being too greedy, and so it is overlooking the first match in favor of the larger, second one. Here is what I need:
I need a regex that:
1. Is for .NET
2. Looks within a string for X characters before an exact match on string S.
3. Includes any secondary match on S (call S') that is found within X characters before S
4. does not care in the slightest what these characters are.
I assure you, I tried looking for similar answers but I wasn't able to find anything that directly answers this question (which has been plaguing me for two days. Yes, I have to use regular expression). As for Regex flavor, I am working in .NET.
Thank you so much for any help.
Here it is:
(?<=(?<CharsBefore>.{0,10}))(?=abc)
Took me a while to remember that .NET allows variable-length positive lookbehinds.
I changed the way your initial version worked a bit.
Hope it helps!
PS: I've named the group, but you are obviously free to keep it nameless and work with numbered groups if you want a less cluttered regex, like so:
(?<=(.{0,10}))(?=abc)
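A small C# sketch of how the named-group version might be used, assuming the sample string from the question. Each match is zero-width (it sits just before an "abc"), and the preceding characters are read from the CharsBefore group.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string input = "qqqqabcabcqqq";

// The lookbehind captures up to 10 characters before each position
// where the lookahead finds "abc".
string[] before = Regex.Matches(input, @"(?<=(?<CharsBefore>.{0,10}))(?=abc)")
    .Cast<Match>()
    .Select(m => m.Groups["CharsBefore"].Value)
    .ToArray();

foreach (string chunk in before)
{
    Console.WriteLine(chunk);
}
```

Because neither assertion consumes input, an earlier "abc" does not prevent a later one from being found, and it can itself appear inside the captured context of the later one.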

What is the correct RegEx to extract my substring

I have an input string like this
$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)
I want to be able to pull out each subset of strings matching $(anything inside here), so for the example above I would like to get 3 substrings.
the characters in between the brackets do not necessarily always match the same pattern.
I have tried using the following regex
(\$\([a-z]+.*\))
but this matches whole string, due to the fact it starts with '$', anything in middle, and ends with ')'
Hopefully this makes sense.
I should also note that I have very limited experience using regex.
Thanks
(\$\([a-z]+.*?\))
Use ? to make your search non-greedy. .* is greedy and consumes as much as it can. Adding ? to * makes it non-greedy, so it stops at the first instance of ).
See demo.
http://regex101.com/r/sU3fA2/28
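A brief C# sketch contrasting the greedy and non-greedy versions on the input string from the question; variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Input string from the question.
string input = "$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)";

// Greedy: .* runs to the last ')' in the string, giving one large match.
string[] greedy = Regex.Matches(input, @"\$\([a-z]+.*\)")
    .Cast<Match>().Select(m => m.Value).ToArray();

// Non-greedy: .*? stops at the first ')' after each "$(".
string[] lazy = Regex.Matches(input, @"\$\([a-z]+.*?\)")
    .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine($"Greedy matches: {greedy.Length}");
foreach (string token in lazy)
{
    Console.WriteLine(token);
}
```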
Try the pattern below with the global (g) flag enabled:
\((.*?)\)
For the given string $(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz) it returns the three substrings:
MATCH 1
1. [2-10] `xx.xx.xx`
MATCH 2
1. [18-29] `yyy.yyy.yyy`
MATCH 3
1. [38-51] `zzz.zz.zz.zzz`
http://regex101.com/r/bX7qR2/1

Regex logical OR

This is a purely academic exercise relating to regex and my understanding of grouping multiple patterns. I have the following example string
<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>
Using C# Regex I'm attempting to extract the groups of 3 capital letters.
At the moment if I use pattern >.+?</ I get
Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</
If I then use id=".+?"> I get
Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">
Now I'm trying to combine them by using logic OR | for each term on both sides id="|>.+?">|</
However, this isn't giving me the combined results of both patterns
My questions are:
Can someone explain why this isn't working as expected?
How can I correct the pattern to get both sets of results combined, in the order listed?
How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.
Thank you
Your regex doesn't know where to start or stop the alternative options separated by |. So you need to put them in subpatterns:
(id="|>).+?(">|</)
However, regex is not the right tool to parse XML.
Those round brackets also add capturing subpatterns. These can be returned by themselves. So this:
(id="|>)(.+?)(">|</)
will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:
(?:id="|>)(.+?)(?:">|</)
to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.
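For the C# side, a minimal sketch of reading index 1 with Regex.Matches, assuming the sample markup from the question with one element per line (joined with \n, so that . cannot run across lines). Variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question, one element per line.
string xml = string.Join("\n", new[]
{
    @"<xContext id=""ABC"">", @"<xData id=""DEF"">", @"<xData id=""GHI"">",
    "<ID>JKL</ID>", "<str>MNO</str>", "<str>PQR</str>", "<str>",
    @"<order id=""STU"">", "<str>VWX</str>", "</order>",
    @"<order id=""YZA"">", "<str>BCD</str>", "</order>",
    "</str>", "</xContext>"
});

// Non-capturing delimiters; the only capturing group (index 1) holds the letters.
string[] codes = Regex.Matches(xml, @"(?:id=""|>)(.+?)(?:"">|</)")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", codes));
```

m.Groups[0] (or m.Value) would give the whole match including the delimiters, which corresponds to index 0 in the description above.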
You need to group the alternatives together
(?:id="|>).+?(?:">|</)
And to get the letters only, use positive lookbehind and lookahead assertions
(?<=id="|>).+?(?=">|</)
The groups starting with ?<= and ?= are zero width assertions, that means, they don't match (what they match is not part of the result), they just "look" behind or ahead.
I would suggest you use the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)
Capturing groups FTW!
@">(?<content>.+?)<|id=""(?<content>.+?)"""
Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Calling Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
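A short C# sketch of the shared group name in use, run against a three-line excerpt of the question's sample (the excerpt and variable names are chosen here for illustration).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Excerpt of the sample input from the question, one element per line.
string xml = string.Join("\n", new[]
{
    @"<xContext id=""ABC"">",
    "<ID>JKL</ID>",
    "<str>MNO</str>"
});

// Both alternatives capture into the same named group, "content".
string pattern = @">(?<content>.+?)<|id=""(?<content>.+?)""";

string[] contents = Regex.Matches(xml, pattern)
    .Cast<Match>()
    .Select(m => m.Groups["content"].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", contents));
```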
