regular expression greedy on left side only (.net)

regular expression greedy on left side only (.net) - c#

I am trying to capture matches between two strings.
For example, I am looking for all text that appears between Q and XYZ, using the "soonest" match (not continuing to expand outwards). This string:
circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ
Should return:
Q SOMETEXT XYZ
But instead, it returns:
Q hello there Q SOMETEXT XYZ
Here is the expression I'm using:
Q.*?XYZ
It's going too far back to the left. It's working fine on the ride side when I use the question mark after the asterisk. How can I do the same for the left side, and stop once I hit that first left Q, making it work the same as the right side works? I've tried question marks and other symbols from http://msdn.microsoft.com/en-us/library/az24scfc.aspx, but there's something I'm just not figuring out.
I'm a regex novice, so any help on this would be appreciated!

Well, the non Greedy match is working - it gets the shortest string that satisfies the regex. The thing that you have to remember is that regex is a left to right process. So it matches the first Q, then gets the shortest number of characters followed by an XYZ. If you want it not to go past any Qs, you have to use a negated character class:
Q[^Q]*?XYZ
[^Q] matches any one character that is not a Q. Mind that this will only work for a single character. If your opening delimeter is multiple characters, you have to do it a different way. Why? Well, take the delimiter 'PQR' and the string is
foo PQR bar XYZ
If you try to use the regex from before, but you extended the character class to :
PQR[^PQR]*?XYZ
then you'll get
'PQR bar XYZ'
As you expected. But if your string is
foo PQR Party Time! XYZ
You'll get no matches. It's because [] delineates a "character class" - which matches exactly one character. Using these classes, you can match a range of characters, simply by listing them.
th[ae]n
will match both 'than' and 'then', but not 'thin'. Placing a carat ('^') at the beginning negates the class - meaning "match anything but these characters" - so by turning our one-character delimiter into [^PQR], rather than saying "not 'PQR'", you're saying "not 'P', 'Q', or 'R'". You can still use this if you want, but only if you're 100% sure that the characters from your delimiter will only be in your delimiter. If that's the case, it's faster to use greedy matching and only negate the first character of your delimiter. The regex for that would be:
PQR[^P]*XYZ
But, if you can't make that guarantee, then match with:
PQR(?:.(?!PQR))*?XYZ
Regex doesn't directly support negative string matching (because it's impossible to define, when you think about it), so you have to use a negative lookahead.
(?!PQR)
is just such a lookahead. It means "Assert that the next few characters are not this internal regex", without matching any characters, so
.(?!PQR)
matches any character not followed by PQR. Wrap that in a group so that you can lazily repeat it,
(.(?!PQR))*?
and you have a match for "string that doesn't contain my delimiter". The only thing I did was add a ?: to make it a non-capturing group.
(?:.(?!PQR))*?
Depending on the language you use to parse your regex, it may try to pass back every matched group individually (useful for find and replace). This keeps it from doing that.
Happy regexing!

The concept of greediness only works on the right side.
To make the expression only match from the last Q before XYZ, make it not match Q between them:
Q[^Q]*?XYZ

Related

Match anything except for the character selected by the group previously

I have this regular expression:
(?=(([a-z]{1})([a-z]{1})\2))
Through which I am trying to fetch all the palindromic strings. So, if this is my string:
mnonopooo
My regular expression does select all the palindromic strings in the string but it selects ooo also and I know the reason, it is because of this center part of my regular expression:
(?=(([a-z]{1}) "([a-z]{1})" \2))
This part should be like this, match everything except for the backreference group \2.
So I tried something like this, but it didn't work:
(?=(([a-z]{1}) (?!\2) \2))
So basically, my regular expression has three parts:
Select any single character (This is working)
Match any single character not equal to the character matched in point 1 (Not working)
Select the same character that is matched in point 1 (Working using backreference)
So, the second part I am not able to make.
Can anybody please help

Just add a negative Lookahead (i.e., (?!\2)) to make sure the first matched letter is not repeated and keep the 3rd group as is (you still need it):
(?=(([a-z])(?!\2)([a-z])\2))
Please note that the usage of {1} is redundant so I removed them.
Demo: https://regex101.com/r/BVvwnp/1

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible combination of strings i get are,
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what i need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now if i want to try a new matching condition like this -
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?

You can modify you expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
The outer capturing group in conjuction with . (any single character) is trying to match a third character and then repeat the sequence twice because of the range quantifier {2} ...
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")

You want to test for just ^(XY|AB|PQ). Your RegEx means: Search for either XY, AB or PQ, then a random character, and repeat the whole sequence twice, for example "XYKPQL" would match your RegEx.
This is a screenshot of the matches on regex101:
^ forces the start of line,
(...) creates a matching group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.

You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)
Debuggex Demo

Failing to recognize both keywords

Let me know if I'm asking the question in a wrong way. Not sure if I'm approaching it from the right angle.
My regex looks like this.
^.+(ef?)|(mn?).+$
I'm trying to match line 2 and 4 in the text below.
abcd
efgh
ijkl
mnop
qrst
As it seem, only the last one catches the editors eye. What am I missing?
I've tried to follow some examples for detecting e.g. "ALPHA" and "BETA" words but, apparently, I'm too ignorant of how it works.

regex engine would split the below regex into two parts.
^.+(ef?)|(mn?).+$
Part 1| Part 2
At-first, part1 will be executed.
^.+(ef?)
.+ ensures that there must be atleast a single character present before e, but there isn't. So it fails to match the second one. And fails for all the others because there isn't a character e present in the remaining strings.
| OR
Now the regex engine moves to the second part,
(mn?).+$
Matches the string which contains the letter m. m is present only in the fourth string. So it matches the m plus the following one or more characters because of .+.
The correct approach to match the 2 and 4th strings is:
^.*(ef?).*$|^.*(mn?).*$
OR
^.*(?:(ef?)|(mn?)).*$
DEMO
Use ^.*(?:(ef?)|(mn?)).+$, if there must be a character follows e and an optional f or m and an optional n
If you want to match the strings starts with e or m, then use the below regex.
^(ef?|mn?).+$
Note:
.* matches any character zero or more times.
.+ matches any character one or more times.

Difficulty finding where to insert "word exclusion" in a regex

I know the regex for excluding words, roughly anyway, It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?

ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
For consision, let's make all the negative lookarounds more compact, replacing them with a single (?!.*(?: |-|'){2})
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes [a-zA-Z], so you can drop [a-zA-Z]. But are you sure you want to match all letters in any script? (Thai etc). If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)

use this:
^(?!.*(the|and|or|a|given|some|that|this|then|than))(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$
see demo

Regular expression to replace a string

I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, #"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.

As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since there is no statement of relative position, it can occur everywhere.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character set. It defines a quantity of characters, which maybe follow next. It is a placeholder for one single character... So lets see inside, which content is allowed:
^"~
The definition of the character class begins with an caret (^), which is a special character. Typing a caret after the opening square bracket will negate the character class. So it's "upside down": everything following, which does not match the class expression, matches and is a valid character.
In this case, every literal character is possible, except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to attempt to match the preceding token once or more.
So the defined character class should one or multiple times repeated to match the given expression.
4.) a single character
"
To match, the expression should contain furthermore one further apostrophe, which will be the corresponding apostrophe to the first one in 1.) since the character class in (2.) hence (3.) does not permit an apostrophe.
5.) a lookaround
([^~]|$)
The first structure here to examine is the ()-bracket. This is called a "Lookaround".
It is is a special kind of group. Lookaround matches a position. It does not expand the regex match.
So this means this part does not try to find any certain characters inside of an expression
rather then to localize them.
The localisation demands has two conditions, which are connected by a logical OR by the pipeline symbol: |
So the next character of the matched expression could either be
[^~] one single character out of the class everything excluding the character ~
or
$ the end of the line (or word, if multiline-mode is not used in regex engine)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
to "detect" a Asterisk/star in front/end of the line, you have to do following:
First it's a special character, so you have to escape it with an backslash: *
To define the position, you can use:
^ to look at the beginning of the line,
$ end of the line
The overall expression would be:
^* in front of the expression to search for an * at the beginning of
the line $* at the end of the regex to demand an * at the end.
.... in your case you can add the * in the last character class to detect an * in the end:
([^~]|$|$*)
and to force an * in the end, delete the other conditions:
($*)
PS:
(somehow my regex is swallowed up by formating engine, so my update is wrong...)

The # makes it necessary to escape all the " with a second ", so "". Without it to escape the " you would have used \", but I consider it better to always use # in regexes, because the \ is used quite often, and it's boring and unreadable to always have to escape it to \\.
Let's see what the regex really is:
Console.WriteLine(#"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

regular expression greedy on left side only (.net) - c#

The concept of greediness only works on the right side. To make the expression only match from the last Q before XYZ, make it not match Q between them: Q[^Q]*?XYZ

Related

Match anything except for the character selected by the group previously

RegEx : Find match based on 1st two chars

Failing to recognize both keywords

Difficulty finding where to insert "word exclusion" in a regex

Regular expression to replace a string

Categories

Resources