convert any string to be used as match expression in regex - c#

If you have a string with special characters that you want to match with:
System.Text.RegularExpressions.Regex.Matches(theTextToCheck, myString);
This gives wrong results if myString contains special characters such as "%" or "\".
The idea is to convert myString so that every special character is replaced by its escaped form.
Does anyone know how to solve that, or does someone have a regex for it? :)
Update:
The following characters have a special meaning that I should turn off by adding a leading backslash: \, &, ~, ^, %, [, ], {, }, ?, +, *, (, ), |, $
Are there any others I should replace?

As @Kobi links to in the comments, you need to use Regex.Escape to ensure that the regular expression string is properly escaped.
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
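A minimal sketch of that advice, assuming the goal is to count literal occurrences of a string. The names `LiteralSearch` and `CountLiteral` are made up for illustration; only `Regex.Escape` and `Regex.Matches` come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralSearch
{
    // Counts non-overlapping literal occurrences of `needle` in `text`.
    // Regex.Escape turns metacharacters such as \ . [ into escaped literals,
    // so the needle is matched character for character.
    public static int CountLiteral(string text, string needle)
    {
        string pattern = Regex.Escape(needle);
        return Regex.Matches(text, pattern).Count;
    }
}
```

For example, `LiteralSearch.CountLiteral(@"C:\a.b and C:\a.b", @"C:\a.b")` finds both occurrences, whereas the unescaped pattern would read `\a` as the bell character and `.` as "any character".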

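If you want to escape all characters that carry a special meaning in regex, you could simply put a backslash in front of every non-word character, that is, anything other than letters, digits, and the underscore (there is no harm in escaping punctuation that doesn't need it, but a backslash before a word character either changes its meaning, as in \d or \1, or is rejected as an unrecognized escape).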
But if you do, why are you using Regex at all instead of string.IndexOf?
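If only literal matching is needed, a loop over `string.IndexOf` avoids regex entirely. This is a sketch under that assumption; `PlainSearch` and `FindAll` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class PlainSearch
{
    // Returns the start index of every non-overlapping occurrence of `needle`.
    public static List<int> FindAll(string text, string needle)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(needle)) return hits; // an empty needle would loop forever

        int pos = 0;
        // Ordinal comparison: match the exact characters, no culture rules.
        while ((pos = text.IndexOf(needle, pos, StringComparison.Ordinal)) >= 0)
        {
            hits.Add(pos);
            pos += needle.Length; // continue after the current hit
        }
        return hits;
    }
}
```

Because nothing is interpreted as a pattern, characters like "%" or "\" in the needle need no escaping at all.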

Regex.Escape will do that for you. Somewhere in the MSDN docs it reads:
Escape converts a string so that the regular expression engine will interpret any metacharacters that it may contain as character literals
which is much more informative than the function description.
This is left for search/replace reference.
Use this as your regex:
(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)
This captures the characters of interest in a numbered group.
And this as your replacement string:
\$1
This replaces each match with a backslash plus the group content.
Sample code:
Regex re = new Regex(@"(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)");
string replaced = re.Replace(@"Look for (special {characters} and escape [100%] of them)", @"\$1");

Related

Convert Python re.sub to C#

I'm trying to do a conversion from Python to C#
sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert)
I couldn't find a C#.Net equivalent to this function to make it easy.
From the Python Manual
re.sub(pattern, repl, string, count=0, flags=0) Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes such as \j are left
alone. Backreferences, such as \6, are replaced with the substring
matched by group 6 in the pattern.
You are looking for the Regex.Escape method:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
The sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert) code escapes the characters listed in the [.$+?{}()\[\]\\] character class so that they match the literal characters they denote.
Note that Regex.Escape also escapes spaces. If you do not want that, use a custom replacement:
var input = "|^.$+?{}()[]\\-";
var escaped = Regex.Replace(input, @"[|^.$+?{}()\[\]\\-]", "\\$&");
Console.Write(escaped);
// => \|\^\.\$\+\?\{\}\(\)\[\]\\\-
I suggest adding |, - and ^ to the set as well, as the example above does.

Continue un-escaping when encountering unrecognized escape sequence

I have a system that processes some provided data.
Before storing the data, I am unescaping the characters like so:
Regex.Unescape(text);
I ran into a bunch of ArgumentException: <str> includes an unrecognized escape sequence because some of the data contained text like:
\m/ or \o/ or even ¯\_(ツ)_/¯.
Is there any way that I can ignore the unrecognized sequences and continue to unescape the rest of the input?
You cannot rely on Regex.Unescape when your string comes from an unknown source. See the MSDN reference:
Unescape cannot reverse an escaped string perfectly because it cannot deduce precisely which characters were escaped.
Since
It reverses the transformation performed by the Escape method by removing the escape character ("\") from each character escaped by the method. These include the \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space characters. In addition, the Unescape method unescapes the closing bracket (]) and closing brace (}) characters.
and
It replaces the representation of unprintable characters with the characters themselves. For example, it replaces \a with \x07. The character representations it replaces are \a, \b, \e, \n, \r, \f, \t, and \v.
You can emulate the first of these steps, while leaving unrecognized sequences alone, like this:
var unescaped = Regex.Replace(input, @"\\([\\*+?|{}[\]()^$. #])", "$1");
If a backslash precedes a character from the \, *, +, ?, |, {, [, (, ), ^, $, ., #, space, } and ] set, the backslash gets removed. Any other backslash is left in place.
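The replacement above does not convert sequences such as \n or \t. A slightly fuller sketch, assuming you also want the simple character escapes from the MSDN quote handled and everything unrecognized kept verbatim, could use a match evaluator. `LenientUnescape` is a hypothetical helper, and it deliberately ignores the \u, \x, \c and octal forms that Regex.Unescape also understands.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LenientUnescape
{
    // A backslash followed by any single character (including a newline).
    private static readonly Regex EscapePair =
        new Regex(@"\\(.)", RegexOptions.Singleline);

    public static string Unescape(string input)
    {
        return EscapePair.Replace(input, m =>
        {
            char c = m.Groups[1].Value[0];
            switch (c)
            {
                // Representations of unprintable characters.
                case 'a': return "\a";
                case 'b': return "\b";
                case 'e': return "\u001B";
                case 'f': return "\f";
                case 'n': return "\n";
                case 'r': return "\r";
                case 't': return "\t";
                case 'v': return "\v";
            }
            // Characters that Regex.Escape escapes, plus ] and }.
            if (@"\*+?|{}[]()^$.# ".IndexOf(c) >= 0) return c.ToString();
            // Unrecognized sequence such as \m or \_: keep it untouched.
            return m.Value;
        });
    }
}
```

With this, `\m/`, `\o/` and `¯\_(ツ)_/¯` pass through unchanged instead of raising an ArgumentException, while `a\.b` still becomes `a.b`.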

Regex Escaping. Explanation and Example

I would like a simple explanation of regex's escaping structure in C#. I've read the MSDN pages, but it seems that I cannot write a working Regex.Escape() call.
Additionally, a working example of escaping "(", ")" and "." characters would be great. For example somestring = Regex.Escape("("+"(.*?))");
Thanks
As stated in the documentation:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
This basically means that, in the regular expression language, some characters are special. These include operators such as ?, *, ., and +.
To have the regular expression treat, for instance, the + as the literal character + and not as the one-or-more operator, we escape it like so: \+. This tells the parsing engine to treat the + as is.
What the Escape method does is add that extra backslash to these characters.
Thus, given Regex.Escape("("+"(.*?))");, the output string would be \(\(\.\*\?\)\), which means: match the literal string ((.*?)).
A variable may contain regex metacharacters when you want to use its value as a pattern to search for a particular substring. In that case, pass the variable through Regex.Escape so that any special characters in it are escaped automatically.
Regex.Escape("("+"(.*?))")
Essentially any meta-character in the input gets a backslash in front of it. So:
\(\(\.\*\?\)\)
But of course, anything that shows the string as if it were in C# source code (like the VS debugger tool windows) will itself escape the backslashes, hence a display something like:
\\(\\(\\.\\*\\?\\)\\)
(Which is why verbatim strings are so useful with regexes.)
PS. Do not write your own Regex.Escape: you'll just miss some edge cases of the syntax (and there are lots). The Framework method is there to use, so use it.
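A small sketch to make the two representations concrete. `EscapeDemo` and its method names are made up for illustration; the calls themselves are the framework's `Regex.Escape` and `Regex.IsMatch`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The escaped form of the question's example input "((.*?))".
    public static string EscapedParens()
    {
        return Regex.Escape("(" + "(.*?))");
    }

    // True if `text` contains `literal` character for character.
    public static bool ContainsLiteral(string text, string literal)
    {
        return Regex.IsMatch(text, Regex.Escape(literal));
    }
}
```

The returned string is the same whether you compare it with the verbatim literal `@"\(\(\.\*\?\)\)"` or the regular literal `"\\(\\(\\.\\*\\?\\)\\)"`; the doubled backslashes exist only in source code and debugger displays.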

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double-backslash thing to me, please. @_@
Additional info: I got this regex in C#, where Regex.IsMatch is used to check whether my username string matches the regex. It's for an ASP website.
My guess is that it's simply escaping the \, since backslash is the escape character in C#.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
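A quick way to convince yourself that the two forms agree, assuming the pattern allows the at-sign as the explanation above describes. `UsernameCheck` is a hypothetical helper and the sample names are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UsernameCheck
{
    // Alternation form, written as a verbatim string.
    public const string Alternation = @"^(\w|@|\-| |\[|\]|\.)+$";

    // Character-class form; the trailing hyphen is literal inside [...].
    public const string CharClass = @"^([\w@ \[\].-])+$";

    public static bool IsValid(string name, string pattern)
    {
        return Regex.IsMatch(name, pattern);
    }
}
```

A name like `john.doe@example [x]-1` passes both patterns, while `bad!name` fails both because `!` is not in the allowed set.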
The double backslashes stand for single backslashes. A backslash is doubled to escape the backslash itself, because backslashes introduce other escape sequences in a C# string, e.g. \n stands for a new line.
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Breaking this regex down: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \. (that is, any word character (letter, digit or underscore), @, -, space, [, ] or .). Note that the remaining backslash is the regex escape, used on -, [, ] and . because they can have special meanings in a regex.
And + means the previous token (i.e. the group \w|@|\-| |\[|\]|\.) repeated one or more times.
So the entire thing means one or more of any combination of word characters, @, -, space, [, ] and . characters.
There are online tools to analyze regexes. One such tool is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    CapturingGroup
      GroupNumber:1
      OR: match either of the followings
        WordCharacter
        @
        -
        [
        ]
        .
    one or more times
  EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.
^ means the beginning of the line.
The parentheses are used for grouping.
\w is a word character.
| means OR.
@ matches the @ character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex is used to match a string that contains only word characters, @, hyphens, spaces, square brackets or dots.
Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but the backslash is being escaped here.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double backslashes aren't necessary if you prefix the string with @, like this:
@"\w" is the same as "\\w"

Regex escape questionmark and double quotes

I have data with several occurrences of the following string:
<a href="default.asp?itemID=987">
in which the itemID is always different. I am using C# and I want to get all those itemIDs with a Regular Expression.
At first I tried this
"<a href=\"default.asp?itemID=([0-9]*)\">"
But the question mark is a reserved character. I considered using the @ prefix to disable escaping of characters, but there are still some double quotes that really need escaping. So then I would go for
"<a href=\"default.asp\\?itemID=([0-9]*)\">"
which should be translated (as a string) to
<a href="default.asp\?itemID=([0-9]*)">
But the Regex.Match method gets no success. I tried the very same regex here and it worked. What am I doing wrong?
? and . are special characters in a regex, but they can't be escaped "as is" in a string literal.
If you put a single \ in front of them, the C# string literal is invalid, and if you put no backslash at all, the regex engine treats them as special characters. So double the backslash:
"<a href=\"default\\.asp\\?itemID=([0-9]*)\">";
When using the @ prefix, you write a double quote as "".
You also need to escape certain special characters in the regex, in this case . and ?
Try this:
@"<a href=""default\.asp\?itemID=([0-9]*)"">"
Try escaping the dot '.' character with \.
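Putting the answers together, a sketch of collecting every itemID from the input. `ItemIdExtractor` is a hypothetical name, and the pattern uses `[0-9]+` rather than `[0-9]*` on the assumption that an empty ID should not count as a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ItemIdExtractor
{
    // Verbatim string: "" is a literal double quote, \. and \? are regex escapes.
    private static readonly Regex Link =
        new Regex(@"<a href=""default\.asp\?itemID=([0-9]+)"">");

    // Returns the captured digits of every matching anchor tag, in order.
    public static List<string> Extract(string html)
    {
        var ids = new List<string>();
        foreach (Match m in Link.Matches(html))
        {
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }
}
```

Calling `ItemIdExtractor.Extract(data)` on the text from the question yields one string per link, such as "987".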
