Single Regex with multiple match patterns - c#

Regex is hard :)
What I have right now is a string value in which I am successfully matching values on specific keys. I need to expand my regex to match using a value rather than a key.
https://www.me.com/?name=bob&identify=bob1&test=email@me.com&validKey=validValue
The current Regex being applied is
((name|identify|test)(=|%3D)[^&]*)
What I want to add is a match for any value that contains an @ symbol. I won't know what the key is, since it's dynamic, so I can't just add 'badKey' to the matched pattern. An example input would be:
https://www.me.com/?name=bob&identify=bob1&test=email@me.com&validKey=validValue&badKey=test2@test.com
Basically I want to match all the existing parts and then also the 'badKey' one. I know I can run the string through a second Regex, but for performance's sake I would like this to be a single pattern.
Any help here would be appreciated.

I found a pattern that specifically matches emails, not just values that contain an @ symbol:
(name|identify|test)(=|%3D)[^&|^%26]*|((=|%3D)(\\w+([-+.]\\w+)*[@|%40]\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*)[^&|^%26]*)
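For comparison, here is a minimal sketch of a looser single pattern that matches either a known key or any key whose value contains an at sign (literal or encoded as %40). It assumes pairs are separated by a literal "&" and that the query starts after "?"; the helper name FindSensitivePairs is made up for illustration and the pattern has not been tuned for performance.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Alternative 1: a known key (name, identify, test) followed by its value.
// Alternative 2: any key whose value contains "@" or its URL-encoded form "%40".
// "=" may also appear URL-encoded as "%3D".
const string Pattern =
    @"(?:name|identify|test)(?:=|%3D)[^&]*" +
    @"|[^&?=]+(?:=|%3D)[^&]*(?:@|%40)[^&]*";

// Hypothetical helper: returns every key=value pair matched by the pattern.
string[] FindSensitivePairs(string url) =>
    Regex.Matches(url, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

string sample = "https://www.me.com/?name=bob&identify=bob1&test=email@me.com"
              + "&validKey=validValue&badKey=test2@test.com";

foreach (string pair in FindSensitivePairs(sample))
    Console.WriteLine(pair);
// prints:
// name=bob
// identify=bob1
// test=email@me.com
// badKey=test2@test.com
```

The second alternative only fires when the first one fails at that position, so known keys are still reported once even when their value contains an at sign.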

You can use String.Split to break the string into key=value pairs (split on "&") and then process each pair.
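A rough sketch of that split-based approach, assuming the query string follows the first "?" and pairs are separated by "&". The helper name FilterPairs and the case-sensitive key comparison are assumptions for illustration, not something from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps the key=value pairs whose key is one of the known
// keys, or whose value contains "@" (or its URL-encoded form "%40").
List<string> FilterPairs(string url)
{
    var knownKeys = new HashSet<string>(StringComparer.Ordinal)
        { "name", "identify", "test" };
    var result = new List<string>();

    // Only look at the part after "?", if there is one.
    int queryStart = url.IndexOf('?');
    string query = queryStart >= 0 ? url.Substring(queryStart + 1) : url;

    foreach (string pair in query.Split('&'))
    {
        // Treat an encoded "%3D" like a literal "=" before separating key and value.
        string normalized = pair.Replace("%3D", "=");
        int eq = normalized.IndexOf('=');
        if (eq < 0) continue; // not a key=value pair

        string key = normalized.Substring(0, eq);
        string value = normalized.Substring(eq + 1);

        if (knownKeys.Contains(key) || value.Contains("@") || value.Contains("%40"))
            result.Add(pair);
    }
    return result;
}

string sample = "https://www.me.com/?name=bob&validKey=validValue&badKey=test2@test.com";
Console.WriteLine(string.Join(", ", FilterPairs(sample)));
// prints: name=bob, badKey=test2@test.com
```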

Related

Regex to get the word next to all of the given words

I need a regex to capture the word immediately next to all of the words I provide.
Example Sentence:
user="panda" is trying to access resource="system"
Words to be captured: panda & system (i.e., the word immediately next to the words 'user' & 'resource')
Currently, I use this regex (?<=name=\")(.*?)(?=\";) which returns the name 'panda'. I'm looking for a query that would capture both the user and the resource in the above sentence.
Can someone help with the regex query to do this?

Since .NET's regex supports non-fixed length Lookbehinds, you can just add all the words you want in a non-capturing group and use alternation:
(?<=(?:user|resource)=\").*?(?=\")
Demo.
You can also get rid of the Lookahead by using something like this:
(?<=(?:user|resource)=\")[^"]*
Demo #2
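In C#, the second pattern could be used roughly like this. It is only a sketch: the helper name ValuesAfterKeys is made up, and the pattern is written as a verbatim string, so each doubled quote stands for one literal quote.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the quoted value that follows each of the given keys (user=, resource=).
string[] ValuesAfterKeys(string text) =>
    Regex.Matches(text, @"(?<=(?:user|resource)="")[^""]*")
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string sentence = "user=\"panda\" is trying to access resource=\"system\"";
Console.WriteLine(string.Join(", ", ValuesAfterKeys(sentence)));
// prints: panda, system
```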

A simple regex with lazy matching should do the job:
user="(.*?)".*resource="(.*?)"
It gets more complicated if you need to match more than two words in any order. I wouldn't use a regex in that case at all; you would rather write a lexer. Make a class or procedure that tokenizes the sentence first, then a parser that extracts the information you want.

Regex logical OR

This is a purely academic exercise relating to regex and my understanding of grouping multiple patterns. I have the following example string
<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>
Using C# Regex I'm attempting to extract the groups of 3 capital letters.
At the moment if I use pattern >.+?</ I get
Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</
If I then use id=".+?"> I get
Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">
Now I'm trying to combine them by using logic OR | for each term on both sides id="|>.+?">|</
However, this isn't giving me the combined results of both patterns
My questions are:
Can someone explain why this isn't working as expected?
How can I correct the pattern to get both results shown combined in correct order listed
How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.
Thank you

Your regex doesn't know where the alternative options separated by | start or stop, so you need to put them in subpatterns:
(id="|>).+?(">|</)
However, regex is not the right tool to parse XML.
Those round brackets also create capturing subpatterns, which can be retrieved on their own. So this:
(id="|>)(.+?)(">|</)
will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:
(?:id="|>)(.+?)(?:">|</)
to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.

You need to group the alternatives together
(?:id="|>).+?(?:">|</)
And to get the letters only, use positive lookbehind and lookahead assertions
(?<=id="|>).+?(?=">|</)
See it here on Regexr
The groups starting with ?<= and ?= are zero width assertions, that means, they don't match (what they match is not part of the result), they just "look" behind or ahead.

I would suggest you to use regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)
Test it here on RegExr.

Capturing groups FTW!
@">(?<content>.+?)<|id=""(?<content>.+?)"""
Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Calling Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
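A small sketch of how that could look in practice, using the same pattern as a verbatim string. The helper name ExtractContent and the shortened sample are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Both alternatives reuse the group name "content", which .NET permits.
var regex = new Regex(@">(?<content>.+?)<|id=""(?<content>.+?)""");

// Returns the "content" group of every match, whichever alternative matched.
string[] ExtractContent(string xml) =>
    regex.Matches(xml)
         .Cast<Match>()
         .Select(m => m.Groups["content"].Value)
         .ToArray();

string sample = string.Join("\n",
    "<xContext id=\"ABC\">",
    "<ID>JKL</ID>",
    "<str>MNO</str>",
    "</xContext>");

Console.WriteLine(string.Join(", ", ExtractContent(sample)));
// prints: ABC, JKL, MNO
```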

What Regex would I use to match file names with the format 'number-text-text'?

I have code that searches a folder that contains SQL patch files. I want to add file names to an array if they match the following name format:
number-text-text.sql
What Regex would I use to match number-text-text.sql?
Would it be possible to make the Regex match file names where the number part of the file name is between two numbers? If so what would be the Regex syntax for this?

The following regex gets you halfway there:
\d+-[a-zA-Z]+-[a-zA-Z]+\.sql
Matching a specific numeric range is trickier, as regex doesn't have a simple way to handle ranges. To limit the match to file names with a number between 2 and 13, this is the regex:
([2-9]|1[0-3])-[a-zA-Z]+-[a-zA-Z]+\.sql

Your regular expression should be:
(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql
You would then use the first captured group to check if your number is between the two numbers you desire. Don't try to check if a number is within a range with a regular expression; do it in two steps. Your code will be much clearer for it.
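The two-step check could be sketched in C# as follows. The helper name IsPatchInRange, the anchors, and the inclusive bounds are assumptions for illustration; adjust them to your needs.

```csharp
using System;
using System.Text.RegularExpressions;

// Step 1: the regex checks the shape of the name and captures the number.
// Step 2: ordinary integer comparison checks the range (inclusive, by assumption).
bool IsPatchInRange(string fileName, int min, int max)
{
    Match m = Regex.Match(fileName, @"^(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql$");
    return m.Success
        && int.TryParse(m.Groups[1].Value, out int number)
        && number >= min
        && number <= max;
}

Console.WriteLine(IsPatchInRange("7-add-index.sql", 2, 13));   // True
Console.WriteLine(IsPatchInRange("42-add-index.sql", 2, 13));  // False (out of range)
Console.WriteLine(IsPatchInRange("7-add.sql", 2, 13));         // False (wrong shape)
```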

How about:
\d+-[^-]+-[^-]+\.sql
Edit: You want just letters, so here it is without specific ranges.
\d+-[a-z]+-[a-z]+\.sql - You'll also want to use the i flag, not sure how that's done in c#, but here it is in js/perl:
/\d+-[a-z]+-[a-z]+\.sql/i
Ranges are more difficult. Here's an example of how to match 0-255:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
So to match (0-255)-text-text.sql, you'd have this:
/^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])-[a-z]+-[a-z]+\.sql/i
(I put the digits in a non-capturing group and matched from the beginning of the string to prevent partial matches on the number and in case you're expecting numbered groups or something).
Basically every time you need another digit of possibility, you'll need to add a new condition inside this case. The smaller the digit you'd like to match, the more cases you'll need as well. What is your desired min/max? AFAIK there's not a simple way to do this dynamically (although I'd love for someone to show me I'm wrong about that).
The simplest way to get around this would be to simply capture the digits, and use native syntax to see if it's in your range. Example in js:
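// Capture the leading number, then compare it numerically.
var match = filename.match(/(\d+)-[a-z]+-[a-z]+\.sql/i);
if (match) {
    var number = parseInt(match[1], 10);
    if (number > minimumNumber && number < maximumNumber) {
        doStuff();
    }
}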

This should work:
select '4-dfsg-asdfg.sql' ~ E'^[0-9]+-[a-zA-Z]+-[a-zA-Z]+\\.sql$'
This restricts the TEXT to simple ASCII characters. May or may not be what you want.
This is tested in PostgreSQL. Regular expression flavors differ a lot between implementations. You probably know that?
Anchors at begin ^ and end $ are optional, depending how you are going to do it.

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.

You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using the C# verbatim string specifier @ for regular expression patterns, so that you do not need to double-escape things in regex patterns. I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes more complex regular expressions much easier to read.
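To see the difference between the three patterns discussed here, a quick sketch like the following can be run. CountTags is a made-up helper name, and the sample markup is taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Counts how many matches a pattern produces in the given markup.
int CountTags(string markup, string pattern) =>
    Regex.Matches(markup, pattern).Count;

string markup = "[BName] [BAddress]";

Console.WriteLine(CountTags(markup, @"\[B.+\]"));      // 1: greedy, spans both tags
Console.WriteLine(CountTags(markup, @"\[B.+?\]"));     // 2: lazy quantifier
Console.WriteLine(CountTags(markup, @"\[B[^\]]+\]"));  // 2: negated character class
```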

Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match as many characters as possible, whereas .+? (or .*?) will match the bare minimum number of characters while still satisfying the rest of the expression.

.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].

How Can I Check If a C# Regular Expression Is Trying to Match 1-(and-only-1)-Character Strings?

Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means I only allow users to search for 1-character strings. If a user tries to search for a multi-character string, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched string's length is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.

A regexp would be .{1}
This will allow any character, though. If you only want alphanumerics, you can use [a-z0-9]{1} or the shorthand \w{1} (note that \w also matches uppercase letters and the underscore).
Another option is to limit the number of characters a user can type in an input field: set a maxlength on it.
Yet another option is to save the form's input field to a char rather than a string, although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char?

You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.

Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
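A minimal sketch of that idea, assuming the user's pattern is valid regex syntax (an invalid one makes the Regex constructor throw) and that "character" means a single UTF-16 char. The helper name MatchSingleChars is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Tests the user's pattern against each character on its own. Anchoring with
// \A...\z means the pattern must match the entire one-character string, so a
// multi-character pattern such as "b2" never matches anything.
List<char> MatchSingleChars(string text, string userPattern)
{
    var anchored = new Regex(@"\A(?:" + userPattern + @")\z");
    var hits = new List<char>();
    foreach (char c in text)
    {
        if (anchored.IsMatch(c.ToString()))
            hits.Add(c);
    }
    return hits;
}

Console.WriteLine(string.Join("", MatchSingleChars("a1b2c3", "[0-9]")));  // 123
Console.WriteLine(MatchSingleChars("a1b2c3", "b2").Count);                 // 0
```

With this approach no error message is needed for multi-character patterns; they just produce no hits, which may or may not be the behaviour you want.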
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.
