I have the following number, 1010159552597, and I would like to find the numbers that start with 10, followed by 1 or 0, and end with 7 digits. I use the following regex to search:
(10[01][0-9]{7})
The following result is given: 1010159552
But I would also have expected the following: 1015955259
How can I manage to get both results?
Thanks
Regular expressions consume characters and don't go back over previous matches. A way around this is to use zero-length assertions (see code below) to capture what you want.
Code
See regex in use here
(?=(10[01]\d{7}))
Results are in capture group 1:
1010159552
1015955259
Explanation
(?=(10[01]\d{7})) Positive lookahead ensuring what follows matches
(10[01]\d{7}) Capture your original expression into capture group 1
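As a minimal sketch of the same lookahead idea in Python (the helper name overlapping_matches is made up for illustration), wrapping the pattern in a capturing lookahead lets the scan advance one character at a time, so matches may overlap:

```python
import re

def overlapping_matches(pattern, text):
    """Return every (possibly overlapping) match of `pattern` in `text`."""
    # The lookahead itself matches the empty string, so the engine only
    # advances one character between attempts; group 1 records what the
    # original pattern would have matched at that position.
    wrapped = re.compile(r"(?=(" + pattern + r"))")
    return [m.group(1) for m in wrapped.finditer(text)]

plain = re.findall(r"10[01]\d{7}", "1010159552597")
overlapping = overlapping_matches(r"10[01]\d{7}", "1010159552597")
print(plain)
print(overlapping)
```

Here plain holds only the first number, while overlapping holds both numbers from the question.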
You're right that the string you expected matches your regex; however, the engine returns the first match it finds and then resumes searching after the end of that match.
In your case the first term is:
10 - 1 - 0159552
so this is the solution given.
Since your results are overlapping, you might want to check out this article.
Overlapping matches in Regex
I'm trying to create a large regular expression that captures 6 groups.
It is going to be used to parse some Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail, following my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why this is?
Thank you.
The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
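A rough Python sketch of the corrected pattern applied to the sample line from the question. Two details are assumptions of this sketch rather than part of the answer: the dot before the fractional seconds is escaped, and the trailing $ is dropped because fullmatch() already anchors the pattern at both ends.

```python
import re

# Corrected pattern, split per group for readability.
LOG_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t"  # timestamp
    r"([A-Za-z]{4})\t"                                  # block of 4 letters
    r"(\w+)\t"                                          # one or more word characters
    r"(\d{5})\t(\d{5})\t"                               # two blocks of 5 digits
    r"(.*)"                                             # rest of the line
)

sample = (
    "2020-03-10T14:09:13.3250000\tVERB\tCallingClass\t17503\t20870\t"
    "Whatever content: this log line had (etc)"
)

match = LOG_LINE.fullmatch(sample)
fields = match.groups() if match else None
print(fields)

# The original \w{+} is read as: one word character, one or more literal
# "{" characters, then a literal "}". Python's re module parses it the same way.
brace_text_matches = re.fullmatch(r"\w{+}", "a{{}") is not None
plain_word_matches = re.fullmatch(r"\w{+}", "CallingClass") is not None
print(brace_text_matches, plain_word_matches)
```

If the separator is not always a tab, the \t pieces would be the parts to make configurable.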
Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Demo
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achieve:
In our company, we're using a web service to report our working time. I want to develop a desktop application to easily keep an eye on the time worked. I successfully downloaded the server's response (with all the necessary data), but unfortunately this data is in quite a bad state for processing.
Therefore I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The regular expression posted above splits the day's dataset after the first time set (so after 10:33). Therefore I want the regex to handle the or-part "greedily": if expression 1 (the larger one) matches, skip the second expression; if expression 1 does not match, use the second one.
I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double Square brackets:
\[A\].*(?:\[\[\]\]|\[\])
You may use
\[A][\s\S]*?(?=\[A]|$)
See the regex demo.
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, @"(?!^)(?=\[A])")
See this .NET regex demo. The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If there can be any letter instead of A, replace A with [A-Z] or [A-Z]+.
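For comparison, a small Python sketch of both ideas from this answer, run on the two sample lines from the question. It assumes the lines arrive as one string joined by a newline; splitting on an empty match needs Python 3.7 or later.

```python
import re

text = (
    "[A]I'm an example text [] But I want to be included [[]]\n"
    "[A]I'm another text without a second part []"
)

# Lazy match from each "[A]" up to the next "[A]" or the end of the string.
records = re.findall(r"\[A\][\s\S]*?(?=\[A\]|$)", text)

# Equivalent split: cut at every position (except the very start) that is
# immediately followed by "[A]".
pieces = re.split(r"(?!^)(?=\[A\])", text)

for record in records:
    print(repr(record))
```

Both approaches keep the newline at the end of the first record, so a strip() may be wanted afterwards.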
I am building an application, and I have a requirement to capture characters before and after matches. This seems to work okay, except when there are multiple matches within the surrounding capture.
Regex:
.{0,10}(?=abc)
This should capture up to 10 characters before the string "abc" is found.
The issue comes up if there is a recurrence of the match in the preceding text:
"qqqqabcabcqqq"
With the above text, I would expect two captures:
qqqq (the 4 characters before the first abc occurrence)
qqqqabc (the 7 characters before the second abc occurrence)
I am not, however, getting these matches. The only match I get is:
qqqqabc
I am certain that I am missing something, but I am not sure what. I believe that my regex is somehow being too greedy, and so it is overlooking the first match in favor of the larger, second one. Here is what I need:
I need a regex that:
1. Is for .NET
2. Looks within a string for X characters before an exact match on string S.
3. Includes any secondary match on S (call S') that is found within X characters before S
4. does not care in the slightest what these characters are.
I assure you, I tried looking for similar answers but I wasn't able to find anything that directly answers this question (which has been plaguing me for two days. Yes, I have to use regular expression). As for Regex flavor, I am working in .NET.
Thank you so much for any help.
Here it is:
(?<=(?<CharsBefore>.{0,10}))(?=abc)
Took me a while to remember that .NET allows variable-length positive lookbehinds.
Regex test
Demo in C#
I changed the way your initial version worked a bit.
Hope it helps!
PS: I've named the group, but you are obviously free to keep it nameless and work with numbered groups if you want a less cluttered regex, like so:
(?<=(.{0,10}))(?=abc)
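Python's built-in re module only accepts fixed-width lookbehinds, so the pattern above cannot be reused there as written. A minimal sketch of the same result, assuming it is acceptable to find each occurrence with a lookahead and slice the preceding text (chars_before is a hypothetical helper name):

```python
import re

def chars_before(text, needle, width=10):
    """Collect up to `width` characters preceding every occurrence of `needle`."""
    results = []
    # A bare lookahead finds every start position of `needle`, including
    # occurrences that fall inside the context window of a later one.
    for m in re.finditer(r"(?=" + re.escape(needle) + r")", text):
        start = max(0, m.start() - width)
        results.append(text[start:m.start()])
    return results

contexts = chars_before("qqqqabcabcqqq", "abc")
print(contexts)
```

Unlike the . in the .NET pattern, the slice also includes newline characters, which may or may not be what is wanted.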
I have an input string like this
$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)
I want to be able to pull out each subset of strings matching $(anything inside here), so for the example above I would like to get 3 substrings.
The characters in between the brackets do not necessarily always match the same pattern.
I have tried using the following regex
(\$\([a-z]+.*\))
but this matches the whole string, because it starts with '$', has anything in the middle, and ends with ')'.
Hopefully this makes sense.
I should also note that I have very limited experience using regex.
Thanks
(\$\([a-z]+.*?\))
Use ? to make your search non-greedy. .* is greedy and consumes the most it can. Adding ? to * makes it non-greedy, so it will stop at the first instance of ).
See demo.
http://regex101.com/r/sU3fA2/28
Try the below (g is the global flag, not part of the pattern):
/\((.*?)\)/g
For the given string $(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz) it returns the three substrings:
MATCH 1
1. [2-10] `xx.xx.xx`
MATCH 2
1. [18-29] `yyy.yyy.yyy`
MATCH 3
1. [38-51] `zzz.zz.zz.zzz`
http://regex101.com/r/bX7qR2/1
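A short Python sketch contrasting the greedy and non-greedy patterns from this thread on the sample input; the variable names are illustrative only.

```python
import re

s = "$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)"

# Greedy: .* runs to the last ")" in the string, giving one big match.
greedy = re.findall(r"\$\([a-z]+.*\)", s)

# Non-greedy: .*? stops at the first ")" after each "$(".
lazy = re.findall(r"\$\([a-z]+.*?\)", s)

# Capturing only the text inside the parentheses, with its start/end offsets.
inner = [
    (m.start(1), m.end(1), m.group(1))
    for m in re.finditer(r"\((.*?)\)", s)
]

print(greedy)
print(lazy)
print(inner)
```

findall and finditer already return every match, so no separate global flag is needed in Python.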
This is a purely academic exercise relating to regex and my understanding of grouping multiple patterns. I have the following example string
<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>
Using C# Regex I'm attempting to extract the groups of 3 capital letters.
At the moment if I use pattern >.+?</ I get
Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</
If I then use id=".+?"> I get
Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">
Now I'm trying to combine them by using a logical OR | for each term on both sides: id="|>.+?">|</
However, this isn't giving me the combined results of both patterns.
My questions are:
Can someone explain why this isn't working as expected?
How can I correct the pattern to get both sets of results combined, in the order listed?
How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.
Thank you
Your regex doesn't know where to start or stop the alternative options separated by |. So you need to put them in subpatterns:
(id="|>).+?(">|</)
However, regex is not the right tool to parse XML.
Those round brackets also add capturing subpatterns, which can be returned on their own. So this:
(id="|>)(.+?)(">|</)
will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:
(?:id="|>)(.+?)(?:">|</)
to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.
You need to group the alternatives together
(?:id="|>).+?(?:">|</)
And to get the letters only, use positive lookbehind and lookahead assertions
(?<=id="|>).+?(?=">|</)
See it here on Regexr
The groups starting with ?<= and ?= are zero width assertions, that means, they don't match (what they match is not part of the result), they just "look" behind or ahead.
I would suggest using the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)
Test it here on RegExr.
Capturing groups FTW!
@">(?<content>.+?)<|id=""(?<content>.+?)"""
Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Calling Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
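To see the grouping issue from this thread in action, here is a hedged Python sketch run against the sample XML from the question (typed in without indentation, which is an assumption). Python's re module rejects a lookbehind whose alternatives differ in width, so the sketch uses the non-capturing-group form and the variant with two separate lookbehinds.

```python
import re

xml = """<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>"""

# Alternation binds loosest, so the ungrouped pattern means:
#   id="   OR   >.+?">   OR   </
ungrouped = re.findall(r'id="|>.+?">|</', xml)

# Grouping the delimiters and capturing only the text between them.
grouped = re.findall(r'(?:id="|>)(.+?)(?:">|</)', xml)

# Lookaround form with one fixed-width lookbehind per delimiter.
lookaround = re.findall(r'(?:(?<=id=")|(?<=>)).+?(?=">|</)', xml)

print(sorted(set(ungrouped)))
print(grouped)
print(lookaround)
```

On this input the ungrouped pattern only ever returns the bare delimiters, while the other two return the ten three-letter values in document order.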