Not that lazy Regex match? - c#

I have a C# Regex class matching multiple subgroups such as
(?<g1>abc)|(?<g2>def)|(?<g3>ghi)
but with much more complicated sub-patterns. I basically want to match anything that doesn't belong to any of those groups, in addition to existing groups.
I tried
(?<g1>abc)|(?<g2>def)|(?<g3>ghi)|(.+?)
but it turned out too slow. I can't do negation because I don't want to copy those complex subpatterns redundantly. Using just (.+) overrides all other groups as expected.
Is there any other way? If that doesn't work I'll have to write an ad-hoc parser.
Additional details: All these groups are evaluated against a MatchEvaluator. So a Regex class behavior that sends "unmatched strings" to the MatchEvaluator will also work.
A sample text would be
.......abc........ghi.....def.....abc....def...ghi......abc.......
I want to catch the parts in between.

Your regex generates a separate match for every single character outside g1, g2 and g3. So when you use it with a MatchEvaluator, it generates lots of evaluator calls. That's why it's slow.
If you try the following regex:
(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)
you will get a single "rest" group match for each entire fragment of text that doesn't contain a "g" group.
Regex C# code:
Regex regex = new Regex(
    @"(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)",
    RegexOptions.Singleline
    | RegexOptions.Compiled
);
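As a rough sketch of how the "rest" group could be consumed from a MatchEvaluator, the helper below reuses the pattern above (with the outer group made non-capturing). The class name RestGroupDemo, the method names, and the bracket/tag output format are all hypothetical and only there for illustration; the real evaluator logic would go in their place.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class RestGroupDemo
{
    // Same shape as the pattern above: a lazy "rest" run followed by
    // one of the named groups or the end of the input.
    private static readonly Regex Tokenizer = new Regex(
        @"(?<rest>.*?)(?:(?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Hypothetical evaluator: wraps unmatched text in brackets and
    // replaces each known token with a numbered tag.
    private static string Evaluate(Match m)
    {
        var sb = new StringBuilder();
        string rest = m.Groups["rest"].Value;
        if (rest.Length > 0) sb.Append('[').Append(rest).Append(']');

        if (m.Groups["g1"].Success) sb.Append("<1>");
        else if (m.Groups["g2"].Success) sb.Append("<2>");
        else if (m.Groups["g3"].Success) sb.Append("<3>");

        // The final empty match at the end of the input appends nothing.
        return sb.ToString();
    }

    public static string Rewrite(string input)
    {
        return Tokenizer.Replace(input, new MatchEvaluator(Evaluate));
    }
}
```

Calling RestGroupDemo.Rewrite("..abc...def.") invokes the evaluator once per fragment rather than once per character, plus one last time for an empty match at the end of the input, which the evaluator has to tolerate.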

but it turned out too slow. I can't do
negation because I don't want to copy
those complex subpatterns redundantly.
Why not something like:
const string COMPLEX_REGEX_PATTERN = "\Gobbel[dy]go0\k"

Have you tried setting the regex option to be compiled? I find using a static compiled regex can speed things up considerably.

If your regex is four pages long, writing a state machine yourself would probably be a better idea...

Related

Regex to get the word next to all of the given words

I need a regex to capture the word immediately next to all of the words I provide.
Example Sentence:
user="panda" is trying to access resource="system"
Words to be captured: panda & system (i.e., the word immediately next to the words 'user' & 'resource')
Currently, I use this regex (?<=name=\")(.*?)(?=\";) which returns the name 'panda'. I'm looking for a query that would capture both the user and the resource in the above sentence.
Can someone help with the regex query to do this?
Since .NET's regex supports non-fixed length Lookbehinds, you can just add all the words you want in a non-capturing group and use alternation:
(?<=(?:user|resource)=\").*?(?=\")
You can also get rid of the Lookahead by using something like this:
(?<=(?:user|resource)=\")[^"]*
A simple regex with lazy matching should do the job:
user="(.*?)".*resource="(.*?)"
It gets more complicated if you need to match more than two words in any order. I wouldn't use a regex in that case at all; I would write a lexer instead: a class or procedure that tokenizes the sentence first, followed by a parser that extracts the information you want.
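For reference, the lookbehind pattern from the first answer could be applied along these lines. This is only a sketch: the class name KeyValueDemo and the method ExtractValues are made up for the example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    // Lookbehind variant from the answer above: match the text that
    // directly follows user=" or resource=", up to the next quote.
    private static readonly Regex ValueRegex =
        new Regex(@"(?<=(?:user|resource)="")[^""]*");

    public static string[] ExtractValues(string sentence)
    {
        return ValueRegex.Matches(sentence)
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToArray();
    }
}
```

For the example sentence in the question, ExtractValues returns the two values panda and system. Adding another keyword only requires extending the alternation inside the lookbehind.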

regex to highlight XML values

DISCLAIMER: I know that using regex on XML is risky and generally a bad idea, but I can only feed regex into my syntax highlighting engine, and I can't spend the resources required to create a new system just for XML-based languages.
So I'm trying to use regex to get the values inside XML tags, as such:
<LoremIpsum>I NEED THIS PART</LoremIpsum>
I thought this would be nice and easy, and I could just use (>.*<\/). It works perfectly on any online regex tester; however, as soon as I try using it in .NET, it completely messes up, and I end up getting completely unpredictable output. What would be the correct way to do this in a single regex, considering I'm using .NET's System.Text.RegularExpressions?
This is probably because the * quantifier is greedy by default. My suggestion would be to use the non-greedy .*? or the negated class [^<]* instead of .*:
(>.*?<\/)
(>[^<]*<\/)
That way the match can't run across a <.
You never define what "completely messes up" means, but try doing this:
(>.*?<\/)
The ? in .*? makes it a non-greedy match. By default, regular expression quantifiers are greedy, meaning they match as much as possible. The non-greedy form matches as little as possible. To see the difference, match text such as <b>is <a>test</a> of</b> against both forms: (>.*<\/) matches >is <a>test</a> of</, whereas (>.*?<\/) matches only >is <a>test</.
If you want to avoid any XML tags in the match, then you should use @ThomasWeller's solution.
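To compare the three patterns discussed in these answers side by side, a small helper like the following could be used. The class GreedyLazyDemo, its constants, and the sample input in the note below are assumptions made for the illustration; they do not come from the question.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    public const string Greedy  = @">.*<\/";     // as much as possible
    public const string Lazy    = @">.*?<\/";    // as little as possible
    public const string Negated = @">[^<]*<\/";  // never crosses a '<'

    // Returns every match of the given pattern as plain strings.
    public static string[] MatchAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

On the input <a>one</a><b>two</b>, the greedy pattern yields the single match >one</a><b>two</, the lazy pattern yields >one</ and ><b>two</, and the negated class yields >one</ and >two</. The second lazy match still swallows a tag, which is why the negated class is the safer choice for highlighting values only.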

Regex that returns all integers in C# "111; 222; 3333" and "213" in a string with alpha

I am extracting all numbers used in an XML file. The numbers are written in the following two patterns:
<Environment Id="11" StringId="8407" DescriptionId="5014" RemoteControlAppStringId="8119; 8118" EnvironmentType="BlueToothBridge" AlternateId="1" XML_NAME_ID="BTBSpeechPlusM" FactoryGainType="LIN18">
<Offsets />
</Environment>
I am using the regexes "\"\d*;\"" and "\"\d*\"" to extract all numbers.
When I ran the regex "\"\d*\"" on the above using
Regex.Match(myString, "\"\\d*\"")
it returned 8407, 11 and 5014, but not 8119 and 8118.
Your regex fails to match 8119; 8118 because your pattern only finds a single number enclosed directly in quotes.
Try:
\b\d+\b
\b specifies that \d+ matches only at word boundaries, so the 18 in LIN18 will not match.
Depending on whether you can assume that the provided input is valid XML, you could use the following regular expression:
Regex.Match(myString, "(?<=\")\\d+(?=\")|(?<=\")\\d+(?=; ?\\d+\")|(?<=\"\\d+; ?)\\d+(?=\")")
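A minimal sketch of the word-boundary approach is shown below. Note that Regex.Match only returns the first match, so the sketch uses Regex.Matches to enumerate all of them. The class name IntegerDemo and the shortened input in the note are assumptions for the example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class IntegerDemo
{
    // Returns every run of digits that is delimited by word boundaries.
    public static string[] ExtractIntegers(string input)
    {
        return Regex.Matches(input, @"\b\d+\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For an attribute list such as Id="11" RemoteControlAppStringId="8119; 8118" FactoryGainType="LIN18", this returns 11, 8119 and 8118. The 18 in LIN18 is skipped because there is no word boundary between the N and the 1.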
The main idea behind this is that it takes the three possible situations into account:
"[number]"
"[number]; [other_number]" (With or without a space before [other_number])
"[other_number]; [number]" (With or without a space before [number])
There are two new concepts I included in the regular expression:
Positive lookahead: (?=[regex])
Positive lookbehind: (?<=[regex])
These concepts allow the regular expression to check if something specific is before or after it, without putting it in the match.
This regular expression could easily be optimised, but this is meant as an example of a basic approach.
One good tip for developing a regular expression like this is to use a tool (online or offline) to test your regular expression. The tool I used was .NET Regex Tester.
As @poke stated in the comment, it's because your regex doesn't match the string. Change your regex to capture specific matches and account for the possibility of the ';'.
Something like below should probably do the trick.
EDIT: (\b\d+\b)|(\b\d+[;*]\d+\b)

Need some C# Regular Expression Help

I'm trying to come up with a regular expression that will stop at the first occurrence of </ol>. My current regex sort of works, but only if </ol> has spaces on either end. For instance, instead of stopping at the first instance in the line below, it stops at the second:
some random text and HTML</ol></b> bla </ol>
Here's the pattern I'm currently using: string pattern = @"some random text(.|\r|\n)*</ol>";
What am I doing wrong?
string pattern = @"some random text(.|\r|\n)*?</ol>";
Note the question mark after the star. It makes the quantifier non-greedy, which means it captures as little as possible, rather than as much as possible as the greedy form does.
Make your wildcard "ungreedy" by adding a ?, e.g.
some random text(.|\r|\n)*?</ol>
                          ^- Addition
This will make regex match as few characters as possible, instead of matching as many (standard behavior).
Oh, and regex shouldn't parse [X]HTML
While not a Regex, why not simply use the Substring functions, like:
// Everything before the first "</ol>"; IndexOf returns -1 if it is absent.
string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>"));
That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
This regex matches everything from the beginning of the string up to the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so is quite efficient:
Regex pattern = new Regex(
    @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
    RegexOptions.IgnoreCase);
string resultString = pattern.Match(text).Value;
Others have already explained the missing ? that makes the quantifier non-greedy. I also want to suggest another change.
I don't like your (.|\r|\n) part. If an alternation contains only single characters, it is simpler to write a character class such as [ab\r\n]. It does the same thing and is easier to read (I don't know about the compiled form, but it may also be more efficient).
BUT that does not work in your case, because inside a character class the . is a literal dot, not a wildcard. When the alternatives to the . are only newline characters, you should do this instead:
Regex A = new Regex(@"some random text.*?</ol>", RegexOptions.Singleline);
Use the Singleline modifier. It just makes the . also match newline characters.
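Putting the non-greedy quantifier and the Singleline option together, a sketch might look like this. The class FirstCloseTagDemo and the method Extract are hypothetical names chosen for the example.

```csharp
using System.Text.RegularExpressions;

public static class FirstCloseTagDemo
{
    // Lazy quantifier stops at the first </ol>; Singleline lets '.'
    // match newline characters as well.
    private static readonly Regex UpToFirstOl = new Regex(
        @"some random text.*?</ol>",
        RegexOptions.Singleline);

    // Returns the matched text, or an empty string if nothing matches.
    public static string Extract(string html)
    {
        Match m = UpToFirstOl.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

With the sample line from the question, Extract returns the text up to and including the first </ol>, and the same holds when line breaks occur before that tag.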

Referring to an already existing group in regex, c#

I have a regex where
%word% can occur multiple times, separated by a "<"
%word% is defined as ".*?"|[a-zA-Z]+
So I wrote
(".*"|[a-zA-Z]+)([<](".*"|[a-zA-Z]+))*
Is there any way I can shrink it using capturing groups? For example:
(".*"|[a-zA-Z]+)([<]\1)*
But I don't think \1 can be used here, because it means "repeat what the first group captured", and I don't know in advance what was captured: it could be a quoted string or a word.
Is there anything similar I can use to refer back to the pattern of the previously written group? I'm working in C#.
Use String.Format to avoid the repetition. And no, there is no way to reuse a regex group's pattern literally:
String.Format("{0}([<]{0})*", @"("".*""|[a-zA-Z]+)")
Since the feature is not supported yet, I wrote a string replacer: I mark the specific words that need to be replaced with %%, and the program substitutes the regular expression defined for each of them.
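The String.Format idea can be sketched as follows. The ^ and $ anchors, the non-capturing groups, and the names WordListDemo, Word and IsWordList are additions for the example, not part of the original pattern.

```csharp
using System.Text.RegularExpressions;

public static class WordListDemo
{
    // The %word% sub-pattern from the question, written once.
    private const string Word = @"(?:"".*?""|[a-zA-Z]+)";

    // word(<word)*, anchored here so the whole input must be a word list.
    public static readonly Regex WordList =
        new Regex(string.Format("^{0}(?:<{0})*$", Word));

    public static bool IsWordList(string s)
    {
        return WordList.IsMatch(s);
    }
}
```

The sub-pattern is defined in a single place, so a change to what counts as a word only has to be made once, which is the effect the question was after with \1.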
