C# Regex Escape Sequences - c#

Is there a complete list of regex escape sequences somewhere? I found this, but it was missing \\ and \e for starters. Thus far I have come up with this regex pattern that hopefully matches all the escape sequences:
#"\\([bBdDfnreasStvwWnAZG\\]|x[A-Z0-9]{2}|u[A-Z0-9]{4}|\d{1,3}|k<\w+>)"

Alternatively, if you only want to escape a string correctly, you could just depend on Regex.Escape() which will do the necessary escaping for you.
Hint: There is also a Regex.Unescape()

This MSDN page (Regular Expression Language Elements) is a good starting place, with this subpage specifically about escape sequences.

Don't forget the zillions of possible unicode categories: \p{Lu}, \P{Sm} etc.
There are too many of these for you to match individually, but I suppose you could use something along the lines of \\[pP]\{[A-Za-z0-9 \-_]+?\} (untested).
And there's also the simpler stuff that's missing from your list: \., \+, \*, \? etc etc.
If you're simply trying to unescape an existing regex then you could try Regex.Unescape. It's not perfect, but it's probably better than anything you or I could knock up in a short space of time.

Related

regex to highlight XML values

DISCLAIMER: I know that using regex on xml is risky and generally a bad idea, but I can only feed regex into my syntax highlighting engine, and I can't spend the ressources required to create a new system just for xml-based languages.
So I'm trying to use regex to get the values inside XML tags, as such:
<LoremIpsum>I NEED THIS PART</LoremIpsum>
I thought this would be nice and easy, and I could just use (>.*<\/). It works perfectly on any online regex tester, however, as soon as I try using it in .NET, it completely messes up, and I end up getting a completely unpredictable output. What would be the correct way to do this, in one regex expression, considering I'm using .NETs System.Text.RegularExpressions?
This is probably because .NET Regex are greedy. My suggestion would be to use non greedy .*? or [^<] instead of .:
(>.*?<\/)
(>[^<]*<\/)
Like that it can't move over a <.
You never define what it completely messed up means, but try doing this:
(>.*?<\/)
The ? in .*? makes it a non-greedy match. By default, regular expressions operators greedy meaning they will match as much as possible. The non-greedy form matches as little as possible. To see the difference, match 'is test of' against both forms: With (>.*<\/) you will match: is <a>test</a> of. With (>.*?<\/) you will match is <a>test.
If you want to avoid any XML tags in the match, then you should use #ThomasWeller's solution.

Need some C# Regular Expression Help

I'm trying to come up with a regular expression that will stop at the first occurence of </ol>. My current RegEx sort of works, but only if </ol> has spaces on either end. For instance, instead of stopping at the first instance in the line below, it'd stop at the second
some random text and HTML</ol></b> bla </ol>
Here's the pattern I'm currently using: string pattern = #"some random text(.|\r|\n)*</ol>";
What am I doing wrong?
string pattern = #"some random text(.|\r|\n)*?</ol>";
Note the question mark after the star -- that tells it to be non greedy, which basically means that it will capture as little as possible, rather than the greedy as much as possible.
Make your wild-card "ungreedy" by adding a ?. e.g.
some random text(.|\r|\n)*?</ol>
^- Addition
This will make regex match as few characters as possible, instead of matching as many (standard behavior).
Oh, and regex shouldn't parse [X]HTML
While not a Regex, why not simply use the Substring functions, like:
string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>") - 1);
That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
This regex matches everything from the beginning of the string up to the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so is quite efficient:
Regex pattern = new Regex(
#"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
RegexOptions.IgnoreCase);
resultString = pattern.Match(text).Value;
Others had already explained the missing ? to make the quantifier non greedy. I want to suggest also another change.
I don't like your (.|\r|\n) part. If you have only single characters in your alternation, its simpler to make a character class [.\r\n]. This is doing the same thing and its better to read (I don't know compiler wise, maybe its also more efficient).
BUT in your special case when the alternatives to the . are only newline characters, this is also not the correct way. Here you should do this:
Regex A = new Regex(#"some random text.*?</ol>", RegexOptions.Singleline);
Use the Singleline modifier. It just makes the . match also newline characters.

Regex : replace a string

I'm currently facing a (little) blocking issue. I'd like to replace a substring by one another using regular expression. But here is the trick : I suck at regex.
Regex.Replace(contenu, "Request.ServerVariables("*"))",
"ServerVariables('test')");
Basically I'd like to replace whatever is between the " by "test". I tried ".{*}" as a pattern but it doesn't work.
Could you give me some tips, I'd appreciate it!
There are several issues you need to take care of.
You are using special characters in your regex (., parens, quotes) -- you need to escape these with a slash. And you need to escape the slashes with another slash as well because we 're in a C# string literal, unless you prefix the string with # in which case the escaping rules are different.
The expression to match "any number of whatever characters" is .*. In this case, you would want to match any number of non-quote characters, which is [^"]*.
In contrast to (1) above, the replacement string is not a regular expression so you don't want any slashes there.
You need to store the return value of the replace somewhere.
The end result is
var result = Regex.Replace(contenu,
#"Request\.ServerVariables\(""[^""]*""\)",
"Request.ServerVariables('test')");
Based purely on my knowledge of regex (and not how they are done in C#), the pattern you want is probably:
"[^"]*"
ie - match a " then match everything that's not a " then match another "
You may need to escape the double-quotes to make your regex-parser actually match on them... that's what I don't know about C#
Try to avoid where you can the '.*' in regex, you can usually find what you want to get by avoiding other characters, for example [^"]+ not quoted, or ([^)]+) not in parenthesis. So you may just want "([^"]+)" which should give you the whole thing in [0], then in [1] you'll find 'test'.
You could also just replace '"' with '' I think.
Taryn Easts regex includes the *. You should remove it, if it is just a placeholder for any value:
"[^"]"
BTW: You can test this regex with this cool editor: http://rubular.com/r/1MMtJNF3kM

Regex setting word characters and matching exact word

I need my C# regex to only match full words, and I need to make sure that +-/*() delimit words as well (I'm not sure if the last part is already set that way.) I find regexes very confusing and would like some help on the matter.
Currently, my regex is:
public Regex codeFunctions = new Regex("draw_line|draw_rectangle|draw_circle");
Thank you! :)
Try
public Regex codeFunctions = new Regex(#"\b(draw_line|draw_rectangle|draw_circle)\b");
The \b means match a word boundary, i.e. a transition from a non-word character to a word character (or vice versa).
Word characters include alphabet characters, digits, and the underscore symbol. Non-word characters include everything else, including +-/*(), so it should work fine for you.
See the Regex Class documentation for more details.
The # at the start of the string makes the string a verbatim string, otherwise you have to type two backslashes to make one backslash.
Do you want to match any words, or just the words listed above? To match an arbitrary word, substitute this for the bit that creates the Regex object:
new Regex (#"\b(\w+)\b");
In the future, if you want more characters to be treated as whitespace (for example, underscores), I would recommend String.Replace-ing them to a space character. There may be a clever way to get the same effect with regular expressions, but personally I think it would be too clever. The String.Replace version is obvious.
Also, I can't help but recommend that you read up on regular expressions. Yes, they look like line noise until you get used to them, but once you do they're convenient and there are plenty of good resources out there to help you.

Pulling data out of quotes?

I'm looking for a regex that can pull out quoted sections in a string, both single and double quotes.
IE:
"This is 'an example', \"of an input string\""
Matches:
an example
of an input string
I wrote up this:
[\"|'][A-Za-z0-9\\W]+[\"|']
It works but does anyone see any flaws with it?
EDIT: The main issue I see is that it can't handle nested quotes.
How does it handle single quotes inside of double quotes (or vice versa)?
"This is 'an example', \"of 'quotes within quotes'\""
should match
an example
of 'quotes within quotes'
Use a backreference if you need to support this.
(\"|')[A-Za-z0-9\\W]+?\1
EDIT: Fixed to use a reluctant quantifier.
Like that?
"([\"'])(.*?)\1"
Your desired match would be in sub group 2, and the kind of quote in group one.
The flaw in your regex is 1) the greedy "+" and 2) [A-Za-z0-9] is not really matching an awful lot. Many characters are not in that range.
It works but doesn't match other characters in quotes (e.g., non-alphanumeric, like binary or foreign language chars). How about this:
[\"']([^\"']*)[\"']
My C# regex is a little rusty so go easy on me if that's not exactly right :)
#"(\"|')(.*?)\1"
You might already have one of these, but, in case not, here's a free, open source tool I use all the time to test my regular expressions. I typically have the general idea of what the expression should look like, but need to fiddle around with some of the particulars.
http://renschler.net/RegexBuilder/

Categories

Resources