Simple RegEx Example - c#

I'm horrible at regex so please bear with me here:
I need a match where the first character can be anything and the next two have to be RS.
so...
XRS123445 - Match

I suggest you start reading this. Matching any character at a position is basically the simplest thing you can do with regular expressions. There are many different things you can use too:
Any alphanumeric character(\w)
Any character whatsoever(.)
A range of characters ([A-Z])
Any character in a certain unicode range ([\uxxx-\uxxx])
and more. You should also be careful, as regex flavors differ in certain nuances and some need particular flags set to give the same result. I won't go into more detail here, to avoid confusion.

This is the regex you're looking for:
^.RS.*
This would match on any of these:
XRS123445
4RSabc
YRS
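A quick way to check the anchored pattern against a few inputs. This is only a sketch in Python's re module; the sample strings "RS123" and "ABRS1" are made up for illustration, and the same pattern string should behave the same way with .NET's Regex.IsMatch.

```python
import re

# Anchored at the start: any one character, then the literal "RS".
pattern = re.compile(r"^.RS")

# The last two samples are invented negative cases.
samples = ["XRS123445", "4RSabc", "YRS", "RS123", "ABRS1"]
results = {s: bool(pattern.search(s)) for s in samples}

for s, ok in results.items():
    print(s, "match" if ok else "no match")
```

"RS123" fails because the dot has to consume one character before RS, and "ABRS1" fails because RS is not in positions two and three.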

.RS.*
Should match as . means any character and then RS as per your requirements

Use this pattern
var pattern = "^.RS";


The value of regex match groups remain empty [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
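To see the three variants side by side, here is a small sketch using Python's re module on the tag from the question (the variable names are just for illustration):

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs to the end of the line, then backs up to the LAST quote.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: .*? stops at the first quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)

# Negated class: [^"]* can never cross a quote, so no backtracking is needed.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(greedy)
print(lazy)
print(negated)
```

The negated class also works in engines that have no lazy quantifiers.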
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using the lazy quantifier ? with no global flag is the answer.
With the global flag /g, it would instead have matched every shortest-length match.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
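A rough illustration of that one-character difference, assuming an input with two location attributes (the sample text is invented). Python's re.sub writes the replacement as \1 where other tools write $1.

```python
import re

# Invented input with two location attributes.
text = '<a location="first/path"> <b location="last/path">'

# Lazy prefix: stops at the first location="..." it can find.
first = re.sub(r'[\s\S]*?(?:location="([^"]*)")[\s\S]*', r'\1', text)

# Greedy prefix: runs to the end, then backtracks to the last location="...".
last = re.sub(r'[\s\S]*(?:location="([^"]*)")[\s\S]*', r'\1', text)

print(first)
print(last)
```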
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
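For the single-character case, a small check that the two forms agree. This is a sketch in Python's re module with an invented sample string; Python does support lazy quantifiers, so it is used here only to compare the results.

```python
import re

# Invented sample: one "a", then two candidate closing "b" characters.
text = "xx a123b yy b"

greedy = re.search(r"a.*b", text).group(0)       # runs to the last b
lazy = re.search(r"a(.*?)b", text).group(0)      # stops at the first b
negated = re.search(r"a[^ab]*b", text).group(0)  # same result, no lazy quantifier

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```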
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern and, as described in the Perl docs,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
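import regex  # third-party "regex" package; the standard "re" module also works for this pattern
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
# The lazy (.*?) stops at the first "back" after "call"
for match in regex.finditer(p, text):
    print(match.group(1))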
Output:
Mary

Extracting words from lines that match different patterns

I'm monitoring incoming e-mail subjects, and each subject may contain a specially formatted code which I use to reference something else down the line.
These codes can be anywhere within the string, and sometimes not at all - and so the problem I'm having is my lack of RegEx skills (which I assume is the best option for this solution?).
An example of a subject would be:
"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"
The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".
The format of MZ5051CLA will be:
- Always starts with "MZ"
- Followed by a number
- Always ends with "CLA"
Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?
I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.
Edit:
As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.
Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.
Try this regex:
^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$
The below regex would match only the first string MZ5051CLA
\bMZ\d+CLA\b
But this would match both strings, MZ5051CLA and 11123MTX:
\b[A-Z0-9]+$
All alphanumeric characters at the end of a line are matched.
This would get you the alphanumeric string which starts with MZ and ends with CLA, or starts with a number and ends with MTX:
(?:\b[A-Z0-9]+$|\b\d+MTX\b)
Both Codes in One Pattern
It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:
\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*
In the demo, see how only the correct groups are matched. Of course false positives are possible.
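To try the lookahead pattern on the two subjects from the question, here is a rough sketch in Python's re module. The helper name codes and the "Room B12" subject are made up; the last call shows the kind of false positive mentioned above.

```python
import re

# The lookahead demands an uppercase letter somewhere in the run;
# the main pattern then demands a digit somewhere in the same run.
CODE = re.compile(r"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*")

def codes(subject):
    """Return every run of uppercase letters and digits containing both kinds."""
    return CODE.findall(subject)

print(codes("Please refer to reference MZ5051CLA"))
print(codes("Attention for Mr Danshi, RE. 11123MTX"))
print(codes("Room B12 is booked"))  # invented subject: a false positive
```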
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
So, in that case if you don't mind false positives, then use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. This will work well in general.
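Given the edit in the question (one code format per mailbox), a per-mailbox setting could look roughly like this. It is a sketch only: the mailbox names "claims" and "transport" and the helper extract_codes are hypothetical, and the two patterns are the word-boundary forms of the codes shown in the question.

```python
import re

# Hypothetical mailbox names mapped to the code format each one uses.
MAILBOX_PATTERNS = {
    "claims": re.compile(r"\bMZ\d+CLA\b"),
    "transport": re.compile(r"\b\d+MTX\b"),
}

def extract_codes(mailbox, subject):
    """Return all codes in the subject that fit this mailbox's format."""
    return MAILBOX_PATTERNS[mailbox].findall(subject)

print(extract_codes("claims", "Please refer to reference MZ5051CLA"))
print(extract_codes("transport", "Attention for Mr Danshi, RE. 11123MTX"))
print(extract_codes("claims", "No code in this subject"))
```

Keeping one narrow pattern per mailbox avoids the false positives that a catch-all pattern allows.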

Regex logical OR

This is a purely academic exercise relating to regex and my understanding of grouping multiple patterns. I have the following example string
<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>
Using C# Regex I'm attempting to extract the groups of 3 capital letters.
At the moment if I use pattern >.+?</ I get
Found 5 matches:
>JKL</
>MNO</
>PQR</
>VWX</
>BCD</
If I then use id=".+?"> I get
Found 5 matches:
id="ABC">
id="DEF">
id="GHI">
id="STU">
id="YZA">
Now I'm trying to combine them by using logic OR | for each term on both sides id="|>.+?">|</
However, this isn't giving me the combined results of both patterns
My questions are:
Can someone explain why this isn't working as expected?
How can I correct the pattern to get both results shown combined in correct order listed
How can I further enhance the combined pattern to just give letters only? I'm hoping it's still ?<= and ?=< but just want to check.
Thank you
Your regex doesn't know where to start or stop the alternative options separated by |. So you need to put them in subpatterns:
(id="|>).+?(">|</)
However, regex is not the right tool to parse XML.
Those round brackets also add capturing subpatterns. These can be returned by themselves. So this:
(id="|>)(.+?)(">|</)
will return the whole match at index 0, the front-delimiter at index 1, the actual match you want at index 2, and the last delimiter at index 3. In most regex engines you can do this:
(?:id="|>)(.+?)(?:">|</)
to avoid capturing the delimiters. Now index 0 will have the whole match, and index 1 only the 3 letters. Unfortunately, I can't tell you how to retrieve them in C#.
You need to group the alternatives together
(?:id="|>).+?(?:">|</)
And to get the letters only, use positive lookbehind and lookahead assertions
(?<=id="|>).+?(?=">|</)
See it here on Regexr
The groups starting with ?<= and ?= are zero width assertions, that means, they don't match (what they match is not part of the result), they just "look" behind or ahead.
I would suggest using the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)
Test it here on RegExr.
Capturing groups FTW!
@">(?<content>.+?)<|id=""(?<content>.+?)"""
Specifically, named capturing groups, because the .NET regex flavor lets you use the same group name as many times as you want in the same regex. Calling Groups["content"] on the Match object will return the content without regard to its location (i.e., between two tags or in an id attribute).
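To see why the ungrouped alternation misbehaves and what the grouped version returns, here is a sketch in Python's re module. Python does not allow the same group name twice the way .NET does, so this uses non-capturing groups for the delimiters and one ordinary capture group for the content.

```python
import re

xml = """<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>"""

# Ungrouped: | splits the WHOLE pattern into three alternatives:
#   id="   or   >.+?">   or   </
broken = re.findall(r'id="|>.+?">|</', xml)

# Grouped: the alternation is confined to each delimiter, and only the
# middle part is captured, so findall returns just the letters.
matches = re.findall(r'(?:id="|>)(.+?)(?:">|</)', xml)

print(sorted(set(broken)))
print(matches)
```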

Need some C# Regular Expression Help

I'm trying to come up with a regular expression that will stop at the first occurrence of </ol>. My current RegEx sort of works, but only if </ol> has spaces on either end. For instance, instead of stopping at the first instance in the line below, it'd stop at the second
some random text and HTML</ol></b> bla </ol>
Here's the pattern I'm currently using: string pattern = @"some random text(.|\r|\n)*</ol>";
What am I doing wrong?
string pattern = @"some random text(.|\r|\n)*?</ol>";
Note the question mark after the star -- that tells it to be non greedy, which basically means that it will capture as little as possible, rather than the greedy as much as possible.
Make your wild-card "ungreedy" by adding a ?. e.g.
some random text(.|\r|\n)*?</ol>
                          ^- Addition
This will make regex match as few characters as possible, instead of matching as many (standard behavior).
Oh, and regex shouldn't parse [X]HTML
While not a Regex, why not simply use the Substring functions, like:
string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>"));
That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
This regex matches everything from the beginning of the string up to the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so is quite efficient:
Regex pattern = new Regex(
    @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
    RegexOptions.IgnoreCase);
resultString = pattern.Match(text).Value;
Others have already explained the missing ? to make the quantifier non-greedy. I want to suggest another change as well.
I don't like your (.|\r|\n) part. When an alternation contains only single characters, a character class is usually simpler and easier to read (I don't know compiler-wise, maybe it's also more efficient). Note, though, that inside a character class the dot is a literal period, so [.\r\n] would not do the same thing; a class such as [\s\S] is the usual way to match any character including newlines.
BUT in your special case when the alternatives to the . are only newline characters, this is also not the correct way. Here you should do this:
Regex A = new Regex(@"some random text.*?</ol>", RegexOptions.Singleline);
Use the Singleline modifier. It just makes the . match also newline characters.
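The same idea can be tried in Python, where re.DOTALL plays the role of RegexOptions.Singleline. This is only a sketch: the sample text is the line from the question with a newline added to show the effect of the flag.

```python
import re

# The question's sample line, with a newline inserted for illustration.
html = "some random text and\nHTML</ol></b> bla </ol>"

# Greedy: runs to the last </ol>.
greedy = re.search(r"some random text.*</ol>", html, re.DOTALL).group(0)

# Lazy: stops at the first </ol>.
lazy = re.search(r"some random text.*?</ol>", html, re.DOTALL).group(0)

# Without DOTALL the dot cannot cross the newline, so nothing matches.
no_flag = re.search(r"some random text.*?</ol>", html)

print(repr(greedy))
print(repr(lazy))
print(no_flag)
```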

C# Regex: only letters followed by an optional

I am looking for a way to get words out of a sentence. I am pretty far with the following expression:
\b([a-zA-Z]+?)\b
but there are some occurrences that it counts a word when I want it not to. E.g a word followed by more than one period like "text..". So, in my regex I want to have the period to be at the end of a word zero or one time. Inserting \.? did not do the trick, and variations on this have not yielded anything fruitful either.
Hope someone can help!
A single dot means any character. You must escape it as
\.?
Maybe you want an expression like this:
\w+\.?
or
\p{L}+\.?
You need to add \.? (and not .?) because the period has special meaning in regexes.
To avoid a match on your example "text..", it is not enough to add \.? to check whether the first character after the word is a dot; you also need to look one character further and check the second character after the word.
I did end up with something like this
\w{2,}\.?[^.]
You should also consider that a sentence does not always end with a . but may also end with ! or ? and the like.
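One way to express "look one character further" is a negative lookahead. Note this is a different pattern from the one above, offered only as a sketch in Python's re module; the sample sentence is invented, and it handles only the period case from the question, not ! or ?.

```python
import re

# Letters, then at most one period, and whatever follows must not be a period.
# For "text.." the lookahead fails both with and without the optional period,
# so the word is skipped entirely.
WORD = re.compile(r"\b[A-Za-z]+\b\.?(?!\.)")

sentence = "Hello world. This is text.. ok"
words = WORD.findall(sentence)

print(words)
```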
I usually use rubulator.com to quickly test a regexp
