Match anything except for the character selected by the group previously - c#

I have this regular expression:
(?=(([a-z]{1})([a-z]{1})\2))
With it I am trying to find all three-letter palindromic substrings. For example, given this string:
mnonopooo
My regular expression does select all the palindromic substrings, but it also selects ooo. I know the reason: it is the middle part of my regular expression, quoted here:
(?=(([a-z]{1}) "([a-z]{1})" \2))
This part should match any character except the one captured by backreference group \2.
So I tried something like this, but it didn't work:
(?=(([a-z]{1}) (?!\2) \2))
So basically, my regular expression has three parts:
Select any single character (This is working)
Match any single character not equal to the character matched in point 1 (Not working)
Select the same character that is matched in point 1 (Working using backreference)
The second part is the one I cannot get working.
Can anybody please help?

Just add a negative lookahead, (?!\2), to make sure the middle letter is not the same as the first matched letter, and keep the 3rd group as is (you still need it to consume the middle letter):
(?=(([a-z])(?!\2)([a-z])\2))
Please note that {1} is redundant, so I removed it.
Demo: https://regex101.com/r/BVvwnp/1
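As a quick check in C#, the sketch below runs the corrected pattern over the sample string from the question. Because the whole pattern sits inside a lookahead, each match is zero-length, so the palindrome has to be read from group 1. The variable names are illustrative and not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Group 2 = first letter, (?!\2) rejects a middle letter equal to it,
// group 3 = middle letter, \2 = the first letter again.
string pattern = @"(?=(([a-z])(?!\2)([a-z])\2))";
string input = "mnonopooo";

// The lookahead consumes nothing, so overlapping palindromes are all reported.
string[] palindromes = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", palindromes)); // non, ono, opo
```

Note that ooo no longer appears, since its middle letter equals its first letter.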

Related

Regex groups expression not capturing content

I'm trying to create a large regular expression that captures 6 groups.
It is going to be used to parse Android log lines that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail, following my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.
The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
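To see the difference in .NET itself, here is a minimal sketch that applies both the original and the corrected pattern to the sample log line. It assumes the fields really are separated by single tab characters, as the question states; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample line from the question, assumed to be tab-separated.
string line = "2020-03-10T14:09:13.3250000\tVERB\tCallingClass\t17503\t20870\t" +
              "Whatever content: this log line had (etc)";

string broken = @"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)";
string corrected = @"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)";

// \w{+} expects a word character followed by literal braces, so this is false.
bool brokenMatches = Regex.IsMatch(line, broken);
Console.WriteLine($"Original pattern matches: {brokenMatches}");

Match logMatch = Regex.Match(line, corrected);
for (int i = 1; i < logMatch.Groups.Count; i++)
{
    // Group 0 is the whole match; groups 1 to 6 are the captured fields.
    Console.WriteLine($"Group {i}: {logMatch.Groups[i].Value}");
}
```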

How to tell a RegEx to be greedy on an 'Or' Expression

Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Demo
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achieve:
In our company, we're using a web service to report our working time. I want to develop a desktop application to easily keep an eye on the time worked. I successfully downloaded the server's response (with all the necessary data), but unfortunately this data is in quite a bad state for processing.
Therefore I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The regular expression posted above splits the day's dataset after the first time set (so after 10:33). Therefore I want the regex to handle the Or-part "greedily": if expression 1 (the larger one) matches, skip the second expression; if expression 1 does not match, use the second one.
I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double Square brackets:
\[A\].*(?:\[\[\]\]|\[\])
You may use
\[A][\s\S]*?(?=\[A]|$)
See the regex demo.
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, @"(?!^)(?=\[A])")
See this .NET regex demo. The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If instead of A there can be any letter, replace A with [A-Z] or [A-Z]+.
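A minimal sketch of the split approach follows. It assumes the two sample entries from the question arrive as one string joined by a newline, which the question does not state explicitly; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: both sample entries are in one string, separated by a newline.
string text = "[A]I'm an example text [] But I want to be included [[]]\n" +
              "[A]I'm another text without a second part []";

// Split at every position that is followed by "[A]", except the string start.
string[] entries = Regex.Split(text, @"(?!^)(?=\[A])");

foreach (string entry in entries)
{
    // TrimEnd drops the newline that stays attached to the first entry.
    Console.WriteLine(entry.TrimEnd());
}
```

Each entry keeps everything up to the next [A], so the "[[]]" part of the first entry is no longer cut off.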

Advanced Regex - Capture Whole Group of Complex Statement inside Replace

I'm working on a project in which I need to parse some data. The tool I work with is fully command-based and returns all kinds of output, so regex comes in handy instead of guessing which line is which. I need to parse lines like this:
1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S
which depending on the condition may appear on many shapes, but, this will work hopefully:
.*((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
There is just an issue, I need to capture only this part:
YR VC MC, and there's no guarantee that there are always three of them. I tried grouping with parentheses as well as naming the group, as you can see. I don't know how to capture a group in C#, though I think it involves Regex.Replace and replacing the whole input with the selected group (here, the 'Class' group). But it only matches the last repetition of the inner parentheses, not all of them: for the line above it returns "MC", not all three. I also tried replacing (\w{2}\s+)+ with (\w{2}\s+|\w{2}\s+\w{2}\s+|\w{2}\s+\w{2}\s+\w{2}\s+), but that didn't work either.
Can anyone help me with this?
Thank you.
Capture Groups
Let's back up a bit. First, we need to understand what capture groups are. Everything put within parentheses will be a capturing group. So, for instance, the regex (\d)(\d) with the string 89 will capture 8 in the first group and 9 in the second group. Let's say you make the second digit optional, so (\d)(\d?). Now, if you try to match just 8, the first group will be 8, and the second group will just be an empty string. In this way, we can match all groups, even if some are 'missing'.
Non-Capture Groups
Your regular expression seems to have a ton of unnecessary capture groups. If you don't need a group, don't use parentheses. For example, for (/)?, you can simply remove the parentheses. What if you want to match the string "123" ten times? You'd probably do something like (123){10}. But hey, that's another unneeded capture group! You can create a non-capture group by using (?:) instead of (). This way, you won't be capturing whatever is within the parentheses, but you'll still be using the parentheses for grouping.
Your Regex
Removing all unnecessary capture groups from your regex, we end up with:
.*/?(\w{2}\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*
This includes the space within the capture group, so let's bring that out:
.*/?(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*
At this point, the capture group (\w{2}) only matches the MC in your sample string, so let's do what you did and split it off into three different capture groups. Note that we can't do something like (\w{2}){1,3} (which will match \w{2} one to three times), because this still only has one single set of parentheses, so it only has one single capture group. As such, we will need to expand our (\w{2})\s+ to (\w{2})\s+(\w{2})\s+(\w{2})\s+. This regex will correctly capture your three strings.
Regex in C#
In C#, we have this handy Regex class in System.Text.RegularExpressions. This is how you would use it:
using System.Linq;
using System.Text.RegularExpressions;

string regex = @".*/?(\w{2})\s+(\w{2})\s+(\w{2})\s+(?:\w{2}\d{2}\w{3})?\s+\w{6}.*";
string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";
Match match = Regex.Match(sample, regex);
// Groups[0] is the whole match; Groups[1..3] are the three capture groups.
string[] stringGroups = match.Groups
    .Cast<Group>()
    .Select(el => el.Value)
    .ToArray();
Here, stringGroups will be a string array with all the capture groups. stringGroups[0] will be the entire match (so in this case, 1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S), stringGroups[1] will be the first capture group (YR in this case), stringGroups[2] the second, and stringGroups[3] the third.
PS: I highly recommend Debuggex for testing this type of stuff.
Make it un-greedy:
.*?((/)?(?<Class>(\w{2}\s+)+)(\w{2}\d{2}\w{3})?\s+\w{6}).*
  ^
Or remove both greedy dots from both ends. You don't need them:
/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}
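For reference, here is a hedged C# sketch of reading the Class group with the shortened pattern above. The second pattern in the sketch is not from any of the answers: it is a variant that moves the named group inside the repetition and relies on .NET keeping every repetition in Group.Captures, which also covers the case where there are not exactly three codes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string sample = "1 QB 1283 /YR VC MC MO22AUG IFNTHR 2240 2335 100 0 S";

// Pattern from the answer: the named group wraps the whole repetition.
Match whole = Regex.Match(sample, @"/?(?<Class>(?:\w{2}\s+)+)(?:\w{2}\d{2}\w{3})?\s+\w{6}");
string classes = whole.Groups["Class"].Value.Trim();
Console.WriteLine(classes); // YR VC MC

// Variant (an assumption, not from the answers): name the inner group and
// read each repetition from Captures instead of only the last value.
Match each = Regex.Match(sample, @"/?(?:(?<Class>\w{2})\s+)+(?:\w{2}\d{2}\w{3})?\s+\w{6}");
string[] parts = each.Groups["Class"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
Console.WriteLine(string.Join("|", parts)); // YR|VC|MC
```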

RegEx : Find match based on 1st two chars

I am new to RegEx and have a question about it. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible strings I get are:
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what I need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now I want to try a new matching condition like this:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?
You can modify your expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
In your pattern, the . (any single character) inside the outer capturing group matches a third character, and the quantifier {2} then requires that whole three-character sequence to appear twice.
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")
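The sketch below compares the three patterns on the sample inputs. The helper names are hypothetical, and "XYKPQL" is included only as an input that the original pattern happens to accept.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers wrapping the patterns discussed above.
bool HasPrefix(string s) => Regex.IsMatch(s, "^(?:XY|AB|PQ)");
bool HasPrefixThenZf(string s) => Regex.IsMatch(s, "^(?:XY|AB|PQ)ZF");
bool OriginalPattern(string s) => Regex.IsMatch(s, "^((XY|AB|PQ).){2}");

string[] inputs = { "XYZF44DT508755", "ABZF44DT508755", "PQZF44DT508755", "XYKPQL" };
foreach (string input in inputs)
{
    Console.WriteLine(
        $"{input}: prefix={HasPrefix(input)}, prefixZF={HasPrefixThenZf(input)}, original={OriginalPattern(input)}");
}
```

The original pattern rejects XYZF44DT508755 because, after XY and one more character, it demands another XY, AB or PQ.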
You want to test for just ^(XY|AB|PQ). Your RegEx means: Search for either XY, AB or PQ, then a random character, and repeat the whole sequence twice, for example "XYKPQL" would match your RegEx.
This is a screenshot of the matches on regex101:
^ forces the start of line,
(...) creates a matching group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.
You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)
Debuggex Demo

regular expression greedy on left side only (.net)

I am trying to capture matches between two strings.
For example, I am looking for all text that appears between Q and XYZ, using the "soonest" match (not continuing to expand outwards). This string:
circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ
Should return:
Q SOMETEXT XYZ
But instead, it returns:
Q hello there Q SOMETEXT XYZ
Here is the expression I'm using:
Q.*?XYZ
It's going too far back to the left. It's working fine on the right side when I use the question mark after the asterisk. How can I do the same for the left side, and stop once I hit that first left Q, making it work the same as the right side works? I've tried question marks and other symbols from http://msdn.microsoft.com/en-us/library/az24scfc.aspx, but there's something I'm just not figuring out.
I'm a regex novice, so any help on this would be appreciated!
Well, the non Greedy match is working - it gets the shortest string that satisfies the regex. The thing that you have to remember is that regex is a left to right process. So it matches the first Q, then gets the shortest number of characters followed by an XYZ. If you want it not to go past any Qs, you have to use a negated character class:
Q[^Q]*?XYZ
[^Q] matches any one character that is not a Q. Mind that this will only work for a single-character delimiter. If your opening delimiter is multiple characters, you have to do it a different way. Why? Well, take the delimiter 'PQR' and the string
foo PQR bar XYZ
If you take the regex from before and extend the character class to:
PQR[^PQR]*?XYZ
then you'll get
'PQR bar XYZ'
As you expected. But if your string is
foo PQR Party Time! XYZ
You'll get no matches. It's because [] delineates a "character class" - which matches exactly one character. Using these classes, you can match a range of characters, simply by listing them.
th[ae]n
will match both 'than' and 'then', but not 'thin'. Placing a caret ('^') at the beginning negates the class, meaning "match anything but these characters". So by turning our one-character delimiter into [^PQR], rather than saying "not 'PQR'", you're saying "not 'P', 'Q', or 'R'". You can still use this if you want, but only if you're 100% sure that the characters from your delimiter will only be in your delimiter. If that's the case, it's faster to use greedy matching and only negate the first character of your delimiter. The regex for that would be:
PQR[^P]*XYZ
But, if you can't make that guarantee, then match with:
PQR(?:.(?!PQR))*?XYZ
Regex doesn't directly support negative string matching (because it's impossible to define, when you think about it), so you have to use a negative lookahead.
(?!PQR)
is just such a lookahead. It means "Assert that the next few characters are not this internal regex", without matching any characters, so
.(?!PQR)
matches any character not followed by PQR. Wrap that in a group so that you can lazily repeat it,
(.(?!PQR))*?
and you have a match for "string that doesn't contain my delimiter". The only thing I did was add a ?: to make it a non-capturing group.
(?:.(?!PQR))*?
Depending on the language you use to parse your regex, it may try to pass back every matched group individually (useful for find and replace). This keeps it from doing that.
Happy regexing!
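The patterns above can be tried in C# with the sketch below, using the sample strings from this thread. The last two lines are an addition rather than part of the answer: they use a made-up input where the delimiter appears twice in a row, where .(?!PQR) lets the first character after the opening PQR through unchecked. Placing the lookahead before the dot, (?:(?!PQR).)*?, is a commonly used variant that covers that case.

```csharp
using System;
using System.Text.RegularExpressions;

// Single-character delimiter: the negated class stops at the nearest Q.
string single = "circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ";
string nearest = Regex.Match(single, @"Q[^Q]*?XYZ").Value;
Console.WriteLine(nearest); // Q SOMETEXT XYZ

// Multi-character delimiter: [^PQR] also rejects the 'P' in "Party".
string multi = "foo PQR Party Time! XYZ";
bool classMatches = Regex.IsMatch(multi, @"PQR[^PQR]*?XYZ");
string viaLookahead = Regex.Match(multi, @"PQR(?:.(?!PQR))*?XYZ").Value;
Console.WriteLine(classMatches);  // False
Console.WriteLine(viaLookahead);  // PQR Party Time! XYZ

// Hypothetical edge case: the delimiter immediately repeated.
string adjacent = "PQRPQR bar XYZ";
string lookaheadAfterDot = Regex.Match(adjacent, @"PQR(?:.(?!PQR))*?XYZ").Value;
string lookaheadBeforeDot = Regex.Match(adjacent, @"PQR(?:(?!PQR).)*?XYZ").Value;
Console.WriteLine(lookaheadAfterDot);  // PQRPQR bar XYZ
Console.WriteLine(lookaheadBeforeDot); // PQR bar XYZ
```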
The concept of greediness only works on the right side.
To make the expression only match from the last Q before XYZ, make it not match Q between them:
Q[^Q]*?XYZ
