Regex logical OR - c#

This is a purely academic exercise relating to regex and my understanding of grouping multiple patterns. I have the following example string
<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>
Using C# Regex I'm attempting to extract the groups of 3 capital letters.
At the moment if I use pattern >.+?</ I get
Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</
If I then use id=".+?"> I get
Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">
Now I'm trying to combine them by using a logical OR | for each term on both sides: id="|>.+?">|</
However, this isn't giving me the combined results of both patterns.
My questions are:
Can someone explain why this isn't working as expected?
How can I correct the pattern to get both results combined, in the order listed?
How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.
Thank you

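Your regex doesn't know where the alternatives separated by | start or stop. The | operator has the lowest precedence, so id="|>.+?">|</ is read as three separate alternatives: id=", >.+?"> and </. So you need to put the alternatives in subpatterns: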
(id="|>).+?(">|</)
However, regex is not the right tool to parse XML.
Those round brackets also create capturing subpatterns, whose contents can be returned separately. So this:
(id="|>)(.+?)(">|</)
will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:
(?:id="|>)(.+?)(?:">|</)
to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.
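As a rough illustration of the precedence point, here is a minimal sketch using Python's re module as a stand-in for the C# Regex class (in .NET the captured text would be read from match.Groups[1].Value). The sample string is a shortened, made-up version of the one in the question, so treat the lists as an illustration under that assumption.

```python
import re

# Shortened, made-up sample in the spirit of the question's input.
text = '<xData id="DEF">\n<ID>JKL</ID>\n<order id="STU">\n<str>VWX</str>'

# Ungrouped: "|" has the lowest precedence, so this is three separate
# alternatives:  id="   or   >.+?">   or   </
ungrouped = re.findall(r'id="|>.+?">|</', text)

# Grouped: each alternation is confined to its delimiter, and the
# middle capturing group holds just the wanted text.
grouped = [m.group(1) for m in re.finditer(r'(?:id="|>)(.+?)(?:">|</)', text)]

print(ungrouped)  # only bare delimiters are found
print(grouped)    # the three-letter groups, in document order
```

The ungrouped pattern only ever finds the bare delimiters, which is why the combined results never appear.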

You need to group the alternatives together
(?:id="|>).+?(?:">|</)
And to get the letters only, use positive lookbehind and lookahead assertions
(?<=id="|>).+?(?=">|</)
See it here on Regexr
The groups starting with ?<= and ?= are zero-width assertions. That means what they match is not part of the result; they just "look" behind or ahead.

I would suggest using the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)
Test it here on RegExr.
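A small sketch of the pattern with one lookbehind per alternative, again using Python's re as a stand-in and a shortened, made-up sample string. One practical difference worth noting: Python's re requires each lookbehind to have a fixed width, so it accepts this form but rejects a single lookbehind that mixes id=" (4 characters) with > (1 character). .NET does not have that restriction.

```python
import re

# Shortened, made-up sample in the spirit of the question's input.
text = '<xData id="DEF">\n<ID>JKL</ID>\n<order id="STU">\n<str>VWX</str>'

# Each lookbehind has a fixed width, which Python's re requires.
pattern = re.compile(r'(?:(?<=id=")|(?<=>)).+?(?=">|</)')
letters = pattern.findall(text)

# A single lookbehind with alternatives of different widths is
# rejected by Python's re at compile time.
try:
    re.compile(r'(?<=id="|>).+?(?=">|</)')
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False

print(letters)
print(single_lookbehind_ok)
```

Because the delimiters are only asserted, never consumed, each match is the bare text with no trimming needed afterwards.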

Capturing groups FTW!
@">(?<content>.+?)<|id=""(?<content>.+?)"""
Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Calling Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
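Reusing one group name is specific to the .NET flavor, as the answer says. For comparison, here is a hedged sketch of the same idea in Python's re, which rejects duplicate group names. It uses two hypothetical names, tag and attr, and picks whichever one took part in the match. The sample string is a shortened, made-up version of the question's input.

```python
import re

# Shortened, made-up sample in the spirit of the question's input.
text = '<xData id="DEF">\n<ID>JKL</ID>\n<order id="STU">\n<str>VWX</str>'

# "tag" and "attr" are hypothetical names; .NET would allow both
# groups to share the single name "content".
pattern = re.compile(r'>(?P<tag>.+?)<|id="(?P<attr>.+?)"')

# Exactly one of the two groups participates in any given match;
# the other is None, so "or" selects the one that matched.
content = [m.group("tag") or m.group("attr") for m in pattern.finditer(text)]

print(content)
```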

Related

Regex groups expression not capturing content

I'm trying to create a large regex expression where the plan is to capture 6 groups.
It is going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option, even if it is heavier than performing a split.
Breaking down the groups in more detail from my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.
The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
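To make the \w{+} point concrete, here is a minimal sketch in Python's re, which treats the malformed quantifier the same way for this pattern. The sample line is the one from the question with tabs written out. Escaping the dot before the fractional seconds is an extra tightening assumed here; it is not part of the answer above.

```python
import re

# Tab-separated sample line, based on the one in the question.
line = ("2020-03-10T14:09:13.3250000\tVERB\tCallingClass\t17503\t20870\t"
        "Whatever content: this log line had (etc)")

# "{+}" is not a valid quantifier, so \w{+} means: one word character,
# one or more literal "{", then a literal "}".
literal_braces = re.fullmatch(r"\w{+}", "a{{}") is not None
plain_word = re.fullmatch(r"\w{+}", "CallingClass") is not None

# Corrected pattern with \w+ (the "\." is an added assumption).
log_pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t([A-Za-z]{4})\t(\w+)"
    r"\t(\d{5})\t(\d{5})\t(.*$)"
)
fields = log_pattern.match(line).groups()

print(literal_braces, plain_word)
print(fields)
```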

Regex - Find multiple matches

I have the following string, 1010159552597, and I would like to find the numbers that start with 10, followed by 1 or 0, and end with 7 digits. I use the following regex to search:
(10[01][0-9]{7})
Following result is given: 1010159552
But I also would have expected the following: 1015955259
How can I manage to get both results?
Thanks
Regular expressions consume characters and don't go back over previous matches. A way around this is to use zero-length assertions (see code below) to capture what you want.
Code
See regex in use here
(?=(10[01]\d{7}))
Results are in capture group 1:
1010159552
1015955259
Explanation
(?=(10[01]\d{7})) Positive lookahead ensuring what follows matches
(10[01]\d{7}) Capture your original expression into capture group 1
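The difference between the plain pattern and the lookahead version can be sketched as follows, using Python's re for illustration. The variable names are made up for the example.

```python
import re

text = "1010159552597"

# A plain search consumes the first match, so the overlapping
# candidate that starts two characters later is never tried.
plain = re.findall(r"10[01]\d{7}", text)

# The lookahead matches an empty string at every position where the
# pattern could start; capture group 1 holds the digits themselves.
overlapping = re.findall(r"(?=(10[01]\d{7}))", text)

print(plain)
print(overlapping)
```

Since the lookahead itself consumes nothing, the engine advances one character at a time and can report candidates that share digits.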
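You're right that your expected value matches your regex. However, the engine reports the first match it finds and then continues searching after the end of that match, so a second match that overlaps the first is never found.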
In your case the first term is:
10 - 1 - 0159552
so this is the solution given.
Since your results are overlapping, you might want to check out this article.
Overlapping matches in Regex

What is the correct RegEx to extract my substring

I have an input string like this
$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)
I want to be able to pull out each subset of strings matching $(anything inside here), so for the example above I would like to get 3 substrings.
the characters in between the brackets do not necessarily always match the same pattern.
I have tried using the following regex
(\$\([a-z]+.*\))
but this matches whole string, due to the fact it starts with '$', anything in middle, and ends with ')'
Hopefully this makes sense.
I should also note that I have very limited experience using regex.
Thanks
(\$\([a-z]+.*?\))
Use ? to make your search non-greedy. .* is greedy and consumes as much as it can. Adding ? after * makes it non-greedy, so it stops at the first instance of ).
See demo.
http://regex101.com/r/sU3fA2/28
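A short sketch of greedy versus non-greedy matching on the question's input, using Python's re for illustration. The outer capturing parentheses are dropped here because the whole match is what is wanted.

```python
import re

text = "$(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz)"

# Greedy: .* runs to the end of the string and backtracks only as far
# as the last ")", so everything comes back as one match.
greedy = re.findall(r"\$\([a-z]+.*\)", text)

# Non-greedy: .*? grows one character at a time and stops at the
# first ")" that lets the pattern succeed.
lazy = re.findall(r"\$\([a-z]+.*?\)", text)

print(greedy)
print(lazy)
```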
Try the pattern below with the global (g) flag:
/\((.*?)\)/g
for the given string $(xx.xx.xx)abcde$(yyy.yyy.yyy)fghijk$(zzz.zz.zz.zzz) it returns the three substring..
MATCH 1
1. [2-10] `xx.xx.xx`
MATCH 2
1. [18-29] `yyy.yyy.yyy`
MATCH 3
1. [38-51] `zzz.zz.zz.zzz`
http://regex101.com/r/bX7qR2/1

Simple RegEx Example

I'm horrible at regex so please bear with me here:
I need a match where the first character can be anything and the next two have to be RS.
so...
XRS123445 - Match
I suggest you start reading this. Matching any character at a position is basically the simplest thing you can do with regular expressions. There are many different things you can use too:
Any word character, meaning a letter, digit or underscore (\w)
Any character whatsoever (.)
A range of characters ([A-Z])
Any character in a certain Unicode range ([\uxxxx-\uxxxx])
and more. You should also be careful, as certain regex flavors have their own nuances and certain flags have to be set to get the same result. I won't go into more detail here, to avoid confusion.
This is the regex you're looking for:
^.RS.*
This would match on any of these:
XRS123445
4RSabc
YRS
.RS.*
Should match as . means any character and then RS as per your requirements
Use this pattern
var pattern = "^.RS";
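A quick check of the anchored pattern against a few inputs, sketched with Python's re. The last two sample strings are made up to show non-matches.

```python
import re

pattern = re.compile(r"^.RS")

# The first three samples come from the answers above; the last two
# are hypothetical counterexamples.
samples = ["XRS123445", "4RSabc", "YRS", "RS123", "ABRS"]
results = {s: bool(pattern.search(s)) for s in samples}

print(results)
```

Without the ^ anchor, ABRS would also match, because .RS could start at the second character.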

What Regex would I use to match file names with the format 'number-text-text'?

I have code that searches a folder that contains SQL patch files. I want to add file names to an array if they match the following name format:
number-text-text.sql
What Regex would I use to match number-text-text.sql?
Would it be possible to make the Regex match file names where the number part of the file name is between two numbers? If so what would be the Regex syntax for this?
The following regex gets you halfway there:
\d+-[a-zA-Z]+-[a-zA-Z]+\.sql
Matching a number in a specific range is trickier, as regex doesn't have a simple way to handle numeric ranges. To limit the match to a filename with a number between 2 and 13, this is the regex (anchored so that a name such as 22-a-b.sql cannot match on its last digit):
^([2-9]|1[0-3])-[a-zA-Z]+-[a-zA-Z]+\.sql$
Your regular expression should be:
(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql
You would then use the first captured group to check if your number is between the two numbers you desire. Don't try to check if a number is within a range with a regular expression; do it in two steps. Your code will be much clearer for it.
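The two-step approach might look like the following minimal sketch in Python. The function name patch_in_range, the inclusive bounds and the ^...$ anchors are assumptions made for the example, not part of the answer.

```python
import re

# Anchored variant of the pattern above (the anchors are an assumption).
FILE_PATTERN = re.compile(r"^(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql$")


def patch_in_range(filename, low, high):
    """Return True if filename looks like number-text-text.sql
    and low <= number <= high (hypothetical helper)."""
    m = FILE_PATTERN.match(filename)
    # Step 1: the regex checks the shape. Step 2: plain code checks the range.
    return m is not None and low <= int(m.group(1)) <= high


print(patch_in_range("7-add-index.sql", 2, 13))
print(patch_in_range("14-add-index.sql", 2, 13))
```

Changing the range then means changing two numbers rather than rewriting the pattern.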
How about:
\d+-[^-]+-[^-]+\.sql
Edit: You want just letters, so here it is without specific ranges.
\d+-[a-z]+-[a-z]+\.sql - You'll also want to use the i flag, not sure how that's done in c#, but here it is in js/perl:
/\d+-[a-z]+-[a-z]+\.sql/i
Ranges are more difficult. Here's an example of how to match 0-255:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
So to match (0-255)-text-text.sql, you'd have this:
/^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])-[a-z]+-[a-z]+\.sql/i
(I put the digits in a non-capturing group and matched from the beginning of the string to prevent partial matches on the number and in case you're expecting numbered groups or something).
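One way to gain some confidence in a hand-written range pattern like this is to test it exhaustively over a small set of numbers. The sketch below does that in Python with re.IGNORECASE standing in for the i flag; the file name suffix is made up.

```python
import re

pattern = re.compile(
    r"^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])-[a-z]+-[a-z]+\.sql",
    re.IGNORECASE,
)

# Try every number from 0 to 999 with a made-up mixed-case suffix
# and keep the ones the pattern accepts.
accepted = [n for n in range(1000) if pattern.match(f"{n}-Fix-Index.sql")]

print(accepted[0], accepted[-1], len(accepted))
```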
Basically every time you need another digit of possibility, you'll need to add a new condition inside this case. The smaller the digit you'd like to match, the more cases you'll need as well. What is your desired min/max? AFAIK there's not a simple way to do this dynamically (although I'd love for someone to show me I'm wrong about that).
The simplest way to get around this would be to simply capture the digits, and use native syntax to see if it's in your range. Example in js:
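// Capture the leading number, then compare it numerically in plain code.
var match = filename.match(/(\d+)-[a-z]+-[a-z]+\.sql/i);
if (match) {
    var number = Number(match[1]);
    if (number > minimumNumber && number < maximumNumber) {
        doStuff();
    }
}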
This should work:
select '4-dfsg-asdfg.sql' ~ E'^[0-9]+-[a-zA-Z]+-[a-zA-Z]+\\.sql$'
This restricts the TEXT to simple ASCII characters. May or may not be what you want.
This is tested in PostgreSQL. Regular expression flavors differ a lot between implementations. You probably know that?
Anchors at begin ^ and end $ are optional, depending how you are going to do it.
