Convert Python re.sub to C# - c#

I'm trying to do a conversion from Python to C#
sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert)
I couldn't find a C#.Net equivalent to this function to make it easy.
From the Python Manual
re.sub(pattern, repl, string, count=0, flags=0) Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes such as \j are left
alone. Backreferences, such as \6, are replaced with the substring
matched by group 6 in the pattern.

You are looking for the Regex.Escape method:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
The sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert) code puts a backslash in front of every character listed in the [.$+?{}()\[\]\\] character class, so that the resulting pattern matches those characters literally.
Note that Regex.Escape also escapes spaces. If you do not want that, use a custom replacement:
var input = "|^.$+?{}()[]\\-";
var escaped = Regex.Replace(input, @"[|^.$+?{}()\[\]\\-]", "\\$&");
Console.Write(escaped);
// => \|\^\.\$\+\?\{\}\(\)\[\]\\\-
I suggest adding |, - and ^, too.
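For comparison, here is a minimal sketch of a literal port of the Python line, keeping its exact character class and its single capture group. The class and method names (ReSubPort, EscapeLikePython) are made up for illustration. It relies on the fact that in a .NET replacement string only $ is special, so \$1 means a backslash followed by group 1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReSubPort
{
    // Mirrors re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", s):
    // each matched character is replaced by a backslash plus itself.
    public static string EscapeLikePython(string s)
    {
        return Regex.Replace(s, @"([.$+?{}()\[\]\\])", @"\$1");
    }

    public static void Main()
    {
        // The port leaves spaces alone and escapes the closing bracket.
        Console.WriteLine(EscapeLikePython("a.b (c) [d]")); // a\.b \(c\) \[d\]

        // Regex.Escape escapes spaces but not ] or }.
        Console.WriteLine(Regex.Escape("a.b (c) [d]"));     // a\.b\ \(c\)\ \[d]
    }
}
```

Both results match the original text literally when used as a pattern; they differ only in which characters carry a backslash.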

Related

Make Regex Match word containing special characters

My Code is like this:
string currentPageSlug = "securities/EBR#03L$ZZZ";
string patern= @"securities/(\w+)[\#\$]";
string res = Regex.Match(currentPageSlug, patern).Value;
Console.WriteLine(res);
which gives me this result:
securities/EBR#
but I want to get:
securities/EBR#03L$ZZZ
whole word including all special characters (# and $ and maybe others too)
my regex pattern does not seem to work.
Your regex matches words followed by a single special character. You need to include [#$] in the repeating construct +, like this:
string patern= @"securities/((?:\w|[#$])+)";
Note that since # and $ are used inside a character class, it is not necessary to escape them with a backslash \.
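As a quick check, the sketch below runs the slug from the question through a slightly shorter variant that folds \w into the same character class, which should be equivalent to (?:\w|[#$])+. The names SlugMatch and MatchSlug are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugMatch
{
    // [\w#$]+ is one character class: word characters, # and $, repeated.
    public static string MatchSlug(string input)
    {
        return Regex.Match(input, @"securities/[\w#$]+").Value;
    }

    public static void Main()
    {
        Console.WriteLine(MatchSlug("securities/EBR#03L$ZZZ")); // securities/EBR#03L$ZZZ
    }
}
```

If other special characters can appear in the slug, they would have to be added to the class as well.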

Regex Escaping. Explanation and Example

I would like a simple explanation of how regex escaping works in C#. I've read the MSDN pages, but I cannot seem to write a working Regex.Escape() call.
Additionally, a working example of escaping "(", ")" and "." characters would be great. For example somestring = Regex.Escape("("+"(.*?))");
Thanks
As stated in the documentation:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #,
and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters
literally rather than as metacharacters.
Which basically means that, in regular expression language, some characters are special. These include operators such as ?, *, ., +, etc.
To have a regular expression treat, for instance, the + as the character + and not as the "one or more of the previous" operator, we escape it like so: \+. This tells the parsing engine to treat the + as is.
What the escape method does is add the extra backslash to these characters.
Thus, given Regex.Escape("("+"(.*?))");, the output string would be \(\(\.\*\?\)\), a pattern that matches the literal text ((.*?)).
A variable whose value you want to use as a regex for finding a substring may itself contain regex metacharacters. In that case, pass the variable through Regex.Escape so that any special characters in it are escaped automatically.
Regex.Escape("("+"(.*?))")
Essentially any meta-character in the input gets a backslash in front of it. So:
\(\(\.\*\?\)\)
But of course, anything that shows the string as if it were in C# source code (like the VS debugger tool windows) will itself escape the backslashes, so the display looks something like:
\\(\\(\\.\\*\\?\\)\\)
(Which is why verbatim strings are so useful with regexes.)
PS. Do not write your own Regex.Escape: you'll just miss some edge cases of the syntax (and there are lots). The Framework method is there to use, so use it.
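A small sketch of the points above, using the string from the question. EscapeDemo and ToPattern are illustrative names only. It prints the escaped pattern and then compares it with the same text written as a verbatim literal and as a regular literal with doubled backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Returns the escaped form of a literal so it can be embedded in a pattern.
    public static string ToPattern(string literal)
    {
        return Regex.Escape(literal);
    }

    public static void Main()
    {
        string pattern = ToPattern("(" + "(.*?))");

        Console.WriteLine(pattern);                             // \(\(\.\*\?\)\)

        // Both literals spell the same 14-character string.
        Console.WriteLine(pattern == @"\(\(\.\*\?\)\)");        // True
        Console.WriteLine(pattern == "\\(\\(\\.\\*\\?\\)\\)");  // True

        // The escaped pattern matches the original text literally.
        Console.WriteLine(Regex.IsMatch("x((.*?))y", pattern)); // True
    }
}
```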

Trying to understand this regex

I have this regex
^(\\w|#|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double backslashes to me, please.
Additional info: I got this regex in C#, where Regex.IsMatch is used to check whether my username string matches the regex. It's for an ASP website.
My guess is that it's simply escaping the \, since the backslash is the escape character in C#.
string pattern = "^(\\w|#|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|#|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, hash sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w# \[\].-])+$";
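The following sketch checks those claims: the doubled-backslash literal and the verbatim literal are the same string, and the alternation and character-class forms agree on a few sample usernames. The class name, constant names and sample inputs are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UsernameCheck
{
    public const string Escaped   = "^(\\w|#|\\-| |\\[|\\]|\\.)+$";
    public const string Verbatim  = @"^(\w|#|\-| |\[|\]|\.)+$";
    public const string CharClass = @"^([\w# \[\].-])+$";

    public static bool IsValid(string name, string pattern)
    {
        return Regex.IsMatch(name, pattern);
    }

    public static void Main()
    {
        // Same characters, two ways of writing the literal.
        Console.WriteLine(Escaped == Verbatim); // True

        foreach (string name in new[] { "john.doe", "a-b [c]", "bad!name", "" })
        {
            // Both forms should give the same answer for every input.
            Console.WriteLine("'{0}': {1} / {2}",
                name, IsValid(name, Verbatim), IsValid(name, CharClass));
        }
    }
}
```

The empty string is rejected because + requires at least one character.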
Each double backslash stands for a single backslash. The backslash is doubled to escape the backslash itself, because backslashes introduce escape sequences in C# string literals, e.g. \n stands for a new line.
With double slashes sorted out, it becomes ^(\w|#|\-| |\[|\]|\.)+$
Break down this regex: | means OR, so \w|#|\-| |\[|\]|\. means \w or # or \- or space or \[ or \] or \.. That is, any word character (letter, digit or underscore), #, -, space, [, ] and . characters. Note that the backslash here is a regex escape, used on -, [, ] and . because these characters can have special meanings in a regex.
And, + means the previous token (i.e. \w|#|\-| |\[|\]|\.) repeated one or more times
So, the entire thing means one or more of any combination of word characters, #, -, space, [, ] and . characters.
There are online tools to analyze regexes. One such is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    CapturingGroup
      GroupNumber:1
      OR: match either of the followings
        WordCharacter
        #
        -
        [
        ]
        .
    one or more times
  EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.
^ means the beginning of the line.
The parentheses are used for grouping.
\w matches a word character.
| means OR.
# matches the # character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex is used to match a string that contains only word characters, #, hyphens, spaces, square brackets or dots.
Here's what it means:
^(\\w|#|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is written as \w, but the backslash is doubled here to escape it inside the C# string literal.
# - Indicates a # symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe the escape is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double backslashes aren't necessary if you prefix the string with @, which makes it a verbatim string:
@"\w" is the same as "\\w"

convert any string to be used as match expression in regex

If you have a string with special characters that you want to match with:
System.Text.RegularExpressions.Regex.Matches(theTextToCheck, myString);
It will obviously give you wrong results, if you have special characters inside myString like "%" or "\".
The idea is to convert myString by replacing every occurrence of a special character such as "%" with its escaped equivalent.
Does anyone know how to solve that or does someone have a RegEx for that? :)
Update:
The following characters have a special meaning that I should turn off by adding a leading backslash: \, &, ~, ^, %, [, ], {, }, ?, +, *, (, ), |, $
Are there any others I should replace?
As @Kobi links to in the comments, you need to use Regex.Escape to ensure that the regular expression string is properly escaped.
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
If you want to escape all characters that carry a special meaning in regex, it is tempting to simply put a backslash in front of every character. That is harmless only for punctuation: a backslash before a letter or digit usually forms a construct of its own (\d, \w, \b, \1), so the meaning of the pattern would change.
But if you do, why are you using Regex at all instead of string.IndexOf?
Regex.Escape will do that for you. Somewhere in the MSDN documentation it reads:
Escape converts a string so that the regular expression engine will interpret any metacharacters that it may contain as character literals
which is much more informative than the function description.
This is left for search/replace reference.
Use this as your regex:
(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)
This captures the characters of interest in a numbered group.
And this as your replacement string:
\$1
This replaces each match with a backslash followed by the group's content.
Sample code:
Regex re = new Regex(@"(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)");
string replaced = re.Replace(@"Look for (special {characters} and scape [100%] of them)", @"\$1");
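To tie this back to the original Regex.Matches call, here is a minimal sketch that escapes the search string with Regex.Escape before matching. LiteralSearch and CountOccurrences are hypothetical names, and the sample texts are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralSearch
{
    // Counts literal occurrences of 'literal' in 'text', whatever
    // metacharacters the search string happens to contain.
    public static int CountOccurrences(string text, string literal)
    {
        return Regex.Matches(text, Regex.Escape(literal)).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountOccurrences("100% sure, [100%] done", "100%")); // 2

        // Unescaped, "a+b" would mean "one or more a, then b" and hit "aab".
        Console.WriteLine(CountOccurrences("a+b a+b aab", "a+b"));             // 2
    }
}
```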

How do I specify a wildcard (for ANY character) in a c# regex statement?

Trying to use a wildcard in C# to grab information from a webpage source, but I cannot seem to figure out what to use as the wildcard character. Nothing I've tried works!
The wildcard only needs to allow for numbers, but as the page is generated the same every time, I may as well allow for any characters.
Regex statement in use:
Regex guestbookWidgetIDregex = new Regex("GuestbookWidget(' INSERT WILDCARD HERE ', '(.*?)', 500);", RegexOptions.IgnoreCase);
If anyone can figure out what I'm doing wrong, it would be greatly appreciated!
The wildcard character is . (a dot).
To match any number of arbitrary characters, use .* (which means zero or more .) or .+ (which means one or more .).
Note that you need to escape your parentheses as \\( and \\) (or \( and \) in an @"" string).
On the dot
In regular expressions, the dot . matches almost any character. The only characters it doesn't normally match are the newline characters. For the dot to match all characters, you must enable what is called the single line mode (aka "dot all").
In C#, this is specified using RegexOptions.Singleline. You can also embed this as (?s) in the pattern.
References
regular-expressions.info/The Dot Matches (Almost) Any Character
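A short sketch of the dot's behavior around a newline, with and without single line mode. The class name DotDemo and the test string are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // True if "a", any single character, "b" spans the whole input.
    public static bool Matches(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "^a.b$", options);
    }

    public static void Main()
    {
        string input = "a\nb";

        Console.WriteLine(Matches(input, RegexOptions.None));       // False
        Console.WriteLine(Matches(input, RegexOptions.Singleline)); // True

        // The inline (?s) flag has the same effect as the option.
        Console.WriteLine(Regex.IsMatch(input, "(?s)^a.b$"));       // True
    }
}
```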
On metacharacters and escaping
The . isn't the only regex metacharacter. The full set is:
( ) { } [ ] ? * + - ^ $ . | \
Depending on where they appear, if you want these characters to mean literally (e.g. . as a period), you may need to do what is called "escaping". This is done by preceding the character with a \.
Of course, a \ is also an escape character for C# string literals. To get a literal \, you need to double it in your string literal (i.e. "\\" is a string of length one). Alternatively, C# also has what is called @-quoted string literals, where escape sequences are not processed. Thus, the following two strings are equal:
"c:\\Docs\\Source\\a.txt"
@"c:\Docs\Source\a.txt"
Since \ is used a lot in regular expressions, @-quoting is often used to avoid excessive doubling.
References
regular-expressions.info/Metacharacters
MSDN - C# Programmer's Reference - string
On character classes
Regular expression engines allow you to define character classes, e.g. [aeiou] is a character class containing the 5 vowel letters. You can also use the - metacharacter to define a range, e.g. [0-9] is a character class containing all 10 digit characters.
Since digit characters are so frequently used, regex also provides a shorthand notation for it, which is \d. In C#, this will also match decimal digits from other Unicode character sets, unless you're using RegexOptions.ECMAScript where it's strictly just [0-9].
References
regular-expressions.info/Character Classes
MSDN - Character Classes - Decimal Digit Character
Related questions
.NET regex: What is the word character \w
Putting it all together
It looks like the following will work for you:
  @-quoting                 digits  anything but ', captured
          |                   / \    /     \
new Regex(@"GuestbookWidget\('\d*', '([^']*)', 500\);", RegexOptions.IgnoreCase);
                           \/                     \/
                        escape (               escape )
Note that I've modified the pattern slightly so that it uses a negated character class instead of reluctant wildcard matching. This causes a slight difference in behavior if you allow ' to be escaped in your input string, but neither pattern handles this case perfectly. If you're not allowing ' to be escaped, however, this pattern is definitely better.
References
regular-expressions.info/An Alternative to Laziness and Capturing Groups
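As a usage sketch, the pattern above can be wrapped in a small helper that pulls out the captured value. The page fragment, the values in it and the names GuestbookDemo and TryExtractId are all made up for illustration; the real page source may be laid out differently.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuestbookDemo
{
    static readonly Regex WidgetRegex = new Regex(
        @"GuestbookWidget\('\d*', '([^']*)', 500\);",
        RegexOptions.IgnoreCase);

    // Returns true and the text of capture group 1 when the call is found.
    public static bool TryExtractId(string source, out string id)
    {
        Match m = WidgetRegex.Match(source);
        id = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // Hypothetical fragment of page source.
        string html = "<script>GuestbookWidget('12345', 'abc-xyz', 500);</script>";

        string id;
        if (TryExtractId(html, out id))
        {
            Console.WriteLine(id); // abc-xyz
        }
    }
}
```

Returning a bool alongside the value keeps "no match" distinct from a match whose captured text is empty.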
