Regular expression to match text between tags - C#

I need help with a regular expression, as I don't have good knowledge of regex.
I have this regular expression:
Regex myregex = new Regex("testValue=\"(.+?)\"");
What does (.+?) indicate?
The string it matches is testValue="123e4567", and it returns 123e4567 as output.
Now I need a regular expression to match the string "<helpMe>123e4567</helpMe>", where I need 123e4567 as output. How do I write a regular expression for it?

This means:
( Begin captured group
. Match any character
+ One or more times
? Non-greedy quantifier
) End captured group
In the case of your regex, the non-greedy quantifier ? means that your captured group will begin after the first double-quote, and then end immediately before the very next double-quote it encounters. If it were greedy (without the ?), the group would extend to the very last double-quote it encounters on that line (i.e., "greedily" consuming as much of the line as possible).
For your "helpMe" example, you'd want this regex:
<helpMe>(.+?)</helpMe>
Given this string:
<div>Something<helpMe>ABCDE</helpMe></div>
You'd get this match:
ABCDE
The value of the non-greedy quantifier is evident in this variation:
Regex: <helpMe>(.+)</helpMe>
String: <div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>
The greedy capture would look like this:
ABCDE</helpMe><helpMe>FGHIJ
There are some useful interactive tools to play with these variations:
Regex Tester
Regex Pal
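To see the greedy and non-greedy variants side by side in C#, here is a minimal sketch. The class name GreedyVsLazyDemo and the helper FirstCapture are made up for illustration; only Regex.Match and Match.Groups come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original question.
public static class GreedyVsLazyDemo
{
    // Returns the text captured by group 1, or an empty string when nothing matches.
    public static string FirstCapture(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        string input = "<div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>";

        // Non-greedy: the group ends at the first closing tag.
        Console.WriteLine(FirstCapture("<helpMe>(.+?)</helpMe>", input));
        // prints: ABCDE

        // Greedy: the group runs on to the last closing tag on the line.
        Console.WriteLine(FirstCapture("<helpMe>(.+)</helpMe>", input));
        // prints: ABCDE</helpMe><helpMe>FGHIJ
    }
}
```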

Ken Redler has a great answer regarding your first question. For the second question try:
<(helpMe)>(.*?)</\1>
Using the back reference \1 you can find values between the set of matching tags. The first group finds the tag name, the second group matches the content itself, and the \1 back reference re-uses the first group's match (in this case the tag name).
Also, in C# you can use named groups, like: <(helpMe)>(?<value>.*?)</\1> where now match.Groups["value"].Value contains your value.
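A small sketch of the backreference plus named group in a complete program. TagContentDemo and ExtractValue are names invented for this example; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class TagContentDemo
{
    // Group 1 captures the tag name, \1 requires the closing tag to repeat it,
    // and the named group "value" holds the text between the tags.
    private static readonly Regex TagRegex =
        new Regex(@"<(helpMe)>(?<value>.*?)</\1>");

    // Returns the text between the tags, or an empty string when nothing matches.
    public static string ExtractValue(string input)
    {
        Match match = TagRegex.Match(input);
        return match.Groups["value"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractValue("<helpMe>123e4567</helpMe>"));
        // prints: 123e4567
    }
}
```

In .NET, unnamed groups are numbered before named ones, so \1 still refers to (helpMe) even though a named group appears in the pattern.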

What does (.+?) indicate?
It means: match any character (.) one or more times (+), as few times as possible (?), and capture the result in a group.
A simple regex to match your second string would be
<helpMe>([a-z0-9]+)<\/helpMe>
This will match one or more lowercase letters (a-z) or digits between <helpMe> and </helpMe>.
The parentheses are used to capture a group. This is useful if you need to reference the value inside this group later.

Related

How do I find a match which has already been captured by another match?

How can I replace all occurrences of matches in a string if some parts have already been captured:
E.g. Given the pattern "AB|BC" and the target "ABC" we match "AB" but not "BC"
I've been trying to understand the various regex grouping options (Grouping Constructs in Regular Expressions) without success. I'm probably barking up the wrong tree. :-(
var test = Regex.Replace("(AB)(BC)(AC)(ABC)", @"AB|BC", string.Empty);
In the example, test evaluates to "()()(AC)(C)", but what I actually want is "()()(AC)()"

Ignoring the parentheses, you could use an alternation with an optional character, using the question mark.
Match AB with an optional C, or match an optional A followed by BC. In the replacement use an empty string.
ABC?|A?BC
Regex demo
Including the parentheses, you might use a capturing group or lookarounds to assert that what is on the left and on the right are an opening and a closing parenthesis.
(?<=\()(?:ABC?|A?BC)(?=\))
Explanation
(?<=\() Assert what is on the left is (
(?: Non capturing group
ABC? Match AB with optional C
| Or
A?BC Match optional A and BC
) Close non capturing group
(?=\)) Assert what is on the right is )
Regex demo
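As a rough check of the patterns above against the input from the question, here is a short C# sketch. OverlapReplaceDemo and RemoveAll are hypothetical names; Regex.Replace is the same call the question uses.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class OverlapReplaceDemo
{
    // Replaces every match of the pattern with an empty string.
    public static string RemoveAll(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty);
    }

    public static void Main()
    {
        string input = "(AB)(BC)(AC)(ABC)";

        // Original pattern: AB is consumed first, so the trailing C is left behind.
        Console.WriteLine(RemoveAll(input, @"AB|BC"));
        // prints: ()()(AC)(C)

        // Optional C / optional A lets one match cover the overlap.
        Console.WriteLine(RemoveAll(input, @"ABC?|A?BC"));
        // prints: ()()(AC)()

        // Same alternation, but only directly between ( and ).
        Console.WriteLine(RemoveAll(input, @"(?<=\()(?:ABC?|A?BC)(?=\))"));
        // prints: ()()(AC)()
    }
}
```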

In order to consume the overlap's buddy, it has to be matched.
Therefore, one side of the alternation has to include its buddy's last or first literal (doesn't have to be both).
AB|BC ~ ABC?|BC = A?BC|AB

Regular expression, match anything not enclosed in

Given the string foobarbarbarfoobar, I want to have everything between foo. So I used this expression for that and the result is: barbarbar. It's working great.
(?<=foo).*(?=foo)
Now I also want the opposite. So given the string foobarbarbarfoobar I want everything that is not enclosed by foo. I tried the following regular expression:
(?<!foo).*(?!foo)
I expected bar as result but instead it returns a match for foobarbarbarfoobar. It doesn't make sense to me. What am I missing?
The explanation from: https://regex101.com/ looks good to me?
(?<!foo).*(?!foo)
(?<!foo) Negative Lookbehind - Assert that it is impossible to match the regex below
foo matches the characters foo literally (case sensitive)
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?!foo) Negative Lookahead - Assert that it is impossible to match the regex below
foo matches the characters foo literally (case sensitive)
Any help is really appreciated

I'm hoping someone finds a better approach, but this abomination may do what you want:
(.*)foo(?<=foo).*(?=foo)foo(.*)
The text before the first foo is in capture group 1 (with your provided example this would be empty) and after is in capture group 2 (would be 'bar' in this case)
If you want the 'foo's included on either end, use this instead: (.*)(?<=foo).*(?=foo)(.*). This would result in 'foo' in group 1, and 'foobar' in group 2.

I found a solution for it:
^((?!foo).)+
Explanation from regex101
^ assert position at start of the string
1st Capturing group ((?!foo).)+
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
(?!foo) Negative Lookahead - Assert that it is impossible to match the regex below
foo matches the characters foo literally (case sensitive)
. matches any character (except newline)
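The reason the pattern in the question matches the whole string: at index 0 nothing precedes the position, so (?<!foo) succeeds; .* then greedily consumes everything; and at the end of the string nothing follows, so (?!foo) succeeds too. The sketch below shows this in C#, along with what ^((?!foo).)+ returns. The last method, OutsideFoo, is an assumption of mine and not from the thread: it simply deletes the foo...foo region with Regex.Replace. All class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original thread.
public static class LookaroundDemo
{
    // Pattern from the question: both negative lookarounds pass trivially
    // at the very start and the very end of the input.
    public static string QuestionPattern(string input)
    {
        return Regex.Match(input, @"(?<!foo).*(?!foo)").Value;
    }

    // Pattern from the self-answer: the characters from the start of the
    // string up to (not including) the first "foo".
    public static string BeforeFirstFoo(string input)
    {
        return Regex.Match(input, @"^((?!foo).)+").Value;
    }

    // Assumption, not from the thread: remove the foo...foo region, keep the rest.
    public static string OutsideFoo(string input)
    {
        return Regex.Replace(input, @"foo.*foo", string.Empty);
    }

    public static void Main()
    {
        string input = "foobarbarbarfoobar";

        Console.WriteLine(QuestionPattern(input));
        // prints: foobarbarbarfoobar

        Console.WriteLine(BeforeFirstFoo("bazfoobar"));
        // prints: baz

        Console.WriteLine(OutsideFoo(input));
        // prints: bar
    }
}
```

Note that ^((?!foo).)+ needs at least one character before the first foo; for an input that begins with foo, such as the sample string, it finds no match.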

Regex - Get matches of #[SomeText] in a string

I want to get all matches of #[SomeText] pattern in a string.
For example, for this string:
here is #[text1] some text #[text2]
I want #[text1] and #[text2].
I'm using Regex Hero to check my pattern matching online,
and my pattern works fine when there's one expression to match,
For example:
here is #[text1] text
but with more than one, I get both matches with the text in the middle.
This is my regex:
#\[.*\]
I would appreciate assistance in isolating the occurrences.

The problem here is that you are using a greedy quantifier (*). To capture everything you need, use a lazy quantifier (*?) with a global modifier:
/(#\[.*?\])/g
Take a look here https://regex101.com/r/pH0gA5/1

This should work :
#\[(.*?)\]
Details :
(.*?) : match everything in a non-greedy way and capture it.
Because the *? quantifier is lazy (non-greedy), it matches as few characters as possible to allow the overall match attempt to succeed, i.e. text1. For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.

.* is greedy by default, so it only finds one match, treating "text1] some text #[text2" as the text between the two square brackets.
If you add a question mark after the .* then it will find the minimum number of characters before reaching a ].
So the regex #\[.*?\] does what you want.
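In C# there is no global modifier to set: Regex.Matches already returns every match. A minimal sketch, assuming the sample string from the question (HashTagDemo and FindAll are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answers.
public static class HashTagDemo
{
    // Collects the text of every match of the pattern in the input.
    public static List<string> FindAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        string input = "here is #[text1] some text #[text2]";

        // Lazy quantifier: two separate matches.
        Console.WriteLine(string.Join(" | ", FindAll(input, @"#\[.*?\]")));
        // prints: #[text1] | #[text2]

        // Greedy quantifier: one match spanning both.
        Console.WriteLine(string.Join(" | ", FindAll(input, @"#\[.*\]")));
        // prints: #[text1] some text #[text2]
    }
}
```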

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible strings I get are:
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what I need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now suppose I want to try a new matching condition like this:
The string starts with "XY" or "AB" or "PQ", the 3rd character is "Z", and the 4th character is "F".
How do I write the regex for that?

You can modify your expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
In your pattern, the outer capturing group in conjunction with . (any single character) tries to match a third character, and the whole sequence is then repeated twice because of the quantifier {2}.
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")

You want to test for just ^(XY|AB|PQ). Your RegEx means: Search for either XY, AB or PQ, then any single character, and repeat the whole sequence twice, for example "XYKPQL" would match your RegEx.
In ^(XY|AB|PQ):
^ forces the start of line,
(...) creates a capturing group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.

You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)
Debuggex Demo

Regex replace/search using values/variables in search text

What is the regex syntax to use part of a matched expression in the subsequent part of the search?
So, for example, if I have:
"{marker=1}some text{/marker=1}"
or
"{marker=2}some text{/marker=2}"
I want to use the first digit found in the pattern to find the second digit. So in
"{marker=1}{marker=2}some text{/marker=2}{/marker=1}"
the regex would match the 1's and then the 2's.
So far I've come up with {marker=(\d)}(.*?){/marker=(\d)} but don't know how to specify the second \d to refer to the value found in the first \d.
I'm doing this in C#.

try:
{marker=(\d)}(.*?){/marker=(\1)}

A numbered backreference is just \n (a backslash followed by the group number), so \1 should work here:
Regex re = new Regex(@"\{marker=(\d)\}(.*?)\{/marker=(\1)\}");
// expected to match (prints True)
Console.WriteLine(re.IsMatch(@"{marker=1}some text{/marker=1}"));
// expected to fail, because the end marker is different (prints False)
Console.WriteLine(re.IsMatch(@"{marker=1}some text{/marker=2}"));
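For the nested case from the question, a possible sketch: match the outer pair first, then run the same regex again on the captured content to reach the inner pair. MarkerDemo and Inner are hypothetical names, and re-applying the regex to the captured group is my assumption about how one might handle nesting, not something the answers state.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answers.
public static class MarkerDemo
{
    // Group 1 is the marker digit, group 2 the content; \1 forces the
    // closing marker to carry the same digit as the opening one.
    private static readonly Regex MarkerRegex =
        new Regex(@"\{marker=(\d)\}(.*?)\{/marker=\1\}");

    // Returns the content of the first matching marker pair, or "" when none matches.
    public static string Inner(string input)
    {
        return MarkerRegex.Match(input).Groups[2].Value;
    }

    public static void Main()
    {
        string nested = "{marker=1}{marker=2}some text{/marker=2}{/marker=1}";

        string level1 = Inner(nested);
        Console.WriteLine(level1);
        // prints: {marker=2}some text{/marker=2}

        Console.WriteLine(Inner(level1));
        // prints: some text
    }
}
```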
