Continue un-escaping when encountering unrecognized escape sequence

I have a system that processes some provided data.
Before storing the data, I am unescaping the characters like so:
Regex.Unescape(text);
I ran into a bunch of ArgumentException: <str> includes an unrecognized escape sequence because some of the data contained text like:
\m/ or \o/ or even ¯\_(ツ)_/¯.
Is there any way that I can ignore the unrecognized sequences and continue to unescape the rest of the input?

You cannot rely on Regex.Unescape when your string comes from an unknown source. See the MSDN reference:
Unescape cannot reverse an escaped string perfectly because it cannot deduce precisely which characters were escaped.
Since
It reverses the transformation performed by the Escape method by removing the escape character ("\") from each character escaped by the method. These include the \, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space characters. In addition, the Unescape method unescapes the closing bracket (]) and closing brace (}) characters.
and
It replaces the representation of unprintable characters with the characters themselves. For example, it replaces \a with \x07. The character representations it replaces are \a, \b, \e, \n, \r, \f, \t, and \v.
You can emulate the metacharacter part of Regex.Unescape like this:
var unescaped = Regex.Replace(input, @"\\([\\*+?|{}[\]()^$. #])", "$1");
See regex demo
If there is an escaped character from the \, *, +, ?, |, {, [, (, ), ^, $, ., #, } and ] set, the backslash is removed.
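As a minimal sketch of how that replacement behaves on the inputs from the question (the class and method names here are hypothetical, not part of any library), the following strips the backslash only in front of the listed metacharacters and leaves sequences such as \m or \_ alone:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LenientUnescape
{
    // Hypothetical helper. Removes the backslash only when it precedes one of
    // the characters that Regex.Escape escapes (plus ] and }).
    // Any other backslash sequence, such as \m or \o, is left untouched.
    public static string Unescape(string input)
    {
        return Regex.Replace(input, @"\\([\\*+?|{}[\]()^$. #])", "$1");
    }

    public static void Main()
    {
        Console.WriteLine(Unescape(@"1\+1 \m/ \o/")); // prints 1+1 \m/ \o/
        Console.WriteLine(Unescape(@"¯\_(ツ)_/¯"));    // prints ¯\_(ツ)_/¯
    }
}
```

Note that this sketch does not convert control-character escapes such as \n or \t the way Regex.Unescape does; it only covers the metacharacter set.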

Related

Why are slashes doubled in pathname in C# programs?

In the C# program that I am reading, the slashes in the pathnames are doubled, for example:
"C:\\Users\\Tim\\Download"
Why are the slashes doubled in pathnames in C# programs, and is this necessary?

When using strings in C#, you need to escape certain characters using escape sequences.
Escape Sequences
Character combinations consisting of a backslash (\) followed by a
letter or by a combination of digits are called "escape sequences." To
represent a newline character, single quotation mark, or certain other
characters in a character constant, you must use escape sequences. An
escape sequence is regarded as a single character and is therefore
valid as a character constant.
Escape sequences are typically used to
specify actions such as carriage returns and tab movements on
terminals and printers. They are also used to provide literal
representations of nonprinting characters and characters that usually
have special meanings, such as the double quotation mark ("). The
following table lists the ANSI escape sequences and what they
represent.
Note that the question mark preceded by a backslash (\?)
specifies a literal question mark in cases where the character
sequence would be misinterpreted as a trigraph. See Trigraphs for more
information.
\a Bell (alert)
\b Backspace
\f Formfeed
\n New line
\r Carriage return
\t Horizontal tab
\v Vertical tab
\' Single quotation mark
\" Double quotation mark
\\ Backslash
\? Literal question mark
\ ooo ASCII character in octal notation
\x hh ASCII character in hexadecimal notation
\x hhhh Unicode character in hexadecimal notation if this escape sequence is used in a wide-character constant or a Unicode string literal.
For example, WCHAR f = L'\x4e00' or WCHAR b[] = L"The Chinese character for one is \x4e00".

Slashes are not doubled - they are just escaped, because the backslash has a special meaning in C# strings. A backslash followed by certain characters is called an escape sequence. Escape sequences are used to represent nonprintable characters, actions like carriage returns, and characters which have a special meaning, like double quotes or backslashes.
Samples of escape sequences :
\n - new line
\t - horizontal tab
\" - double quotes
\\ - backslash
So if you want to have backslash character in your string:
"C:\Users\Tim\Download"
you should either use corresponding escape sequence:
"C:\\Users\\Tim\\Download"
Or you can use a verbatim string. In a verbatim string, escape sequences are not processed:
@"C:\Users\Tim\Download"
Further reading: Escape Sequences
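A short sketch to confirm that both spellings produce the same string (the class and field names are made up for illustration):

```csharp
using System;

public static class PathLiterals
{
    // Regular string literal: each \\ is an escape sequence for one backslash.
    public static readonly string Escaped = "C:\\Users\\Tim\\Download";

    // Verbatim string literal: backslashes are taken as-is.
    public static readonly string Verbatim = @"C:\Users\Tim\Download";

    public static void Main()
    {
        Console.WriteLine(Escaped);             // prints C:\Users\Tim\Download
        Console.WriteLine(Escaped == Verbatim); // prints True
        Console.WriteLine(Escaped.Length);      // prints 21
    }
}
```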

Convert Python re.sub to C#

I'm trying to do a conversion from Python to C#
sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert)
I couldn't find a C#.Net equivalent to this function to make it easy.
From the Python Manual
re.sub(pattern, repl, string, count=0, flags=0) Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes such as \j are left
alone. Backreferences, such as \6, are replaced with the substring
matched by group 6 in the pattern.

You are looking for the Regex.Escape method:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
The sconvert = re.sub(r"([.$+?{}()\[\]\\])", r"\\\1", sconvert) code puts a backslash in front of every character listed in the [.$+?{}()\[\]\\] character class, so that each one is matched as a literal character.
Note that Regex.Escape also escapes spaces. If you do not want that, use your custom replace:
var input = "|^.$+?{}()[]\\-";
var escaped = Regex.Replace(input, @"[|^.$+?{}()\[\]\\-]", "\\$&");
Console.Write(escaped);
// => \|\^\.\$\+\?\{\}\(\)\[\]\\\-
I suggest adding |, - and ^, too. See IDEONE demo
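To illustrate the difference regarding spaces, here is a small sketch that compares Regex.Escape with the custom replace above. EscapeNoSpaces is a hypothetical helper name chosen for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeComparison
{
    // Hypothetical helper mirroring the custom replace: escapes only the
    // characters in the class and leaves white space alone.
    public static string EscapeNoSpaces(string input)
    {
        return Regex.Replace(input, @"[|^.$+?{}()\[\]\\-]", "\\$&");
    }

    public static void Main()
    {
        var input = "a.b (c)";
        Console.WriteLine(Regex.Escape(input));   // prints a\.b\ \(c\)
        Console.WriteLine(EscapeNoSpaces(input)); // prints a\.b \(c\)
    }
}
```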

Regex to match backslash inside a string

I'm trying to match the following strings:
this\test_
_thistes\t
_t\histest
In other words, the allowed strings have ONLY a backslash, splitting 2 substrings which can contain numbers, letters and _ characters.
I tried the following regex, testing it on http://regexhero.net/tester/:
^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$
Unfortunately, it recognizes also the following not allowed strings:
this\\
_\
_\w\s\x
Any help please?

Don't make the \ optional. The regex below won't allow two or more backslashes, and it asserts that there must be at least one word character before and after the \ symbol.
@"^\w+\\\w+$"
OR
@"^[A-Za-z0-9_]+\\[A-Za-z0-9_]+$"
DEMO

The best way to fix up your regex is the following:
^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$
This breaks down to:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
^                        the beginning of the string
--------------------------------------------------------------------------------
[a-zA-Z0-9_]+            any character of: 'a' to 'z', 'A' to 'Z',
                         '0' to '9', '_' (1 or more times (matching
                         the most amount possible))
--------------------------------------------------------------------------------
\\                       '\'
--------------------------------------------------------------------------------
[a-zA-Z0-9_]+            any character of: 'a' to 'z', 'A' to 'Z',
                         '0' to '9', '_' (1 or more times (matching
                         the most amount possible))
--------------------------------------------------------------------------------
$                        before an optional \n, and the end of the
                         string
Explanation courtesy of http://rick.measham.id.au/paste/explain.pl
As you can see we have the same pattern before and after the backslash (since you indicated they should both be letters, numbers and underscores) with the + modifier meaning at least one. Then in the middle there is just the backslash which is compulsory.
Since it is unclear whether when you said "letters" you meant the basic alphabet or if you meant anything that is letter like (most obviously accented characters but also any other alphabet, etc.) then you may want to expand your set of characters by using something like \w as Avinash Raj suggests. See http://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx#WordCharacter for more info on what the "word character" covers.

Your regex can mean two things, depending on whether you are declaring it as a raw string or as a normal string.
Using:
"^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$"
Will not match any of your test examples, since this will match, in order:
^ beginning of string,
[a-zA-Z_] 1 alpha character or underscore,
[\\\]? 1 optional backslash,
[a-zA-Z0-9_]+ at least 1 alphanumeric and/or underscore characters,
$ end of string
If you use it as a raw (verbatim) string, which is how regexhero interpreted it and which is indicated by the @ sign before the string starts, it is:
@"^[a-zA-Z_][\\\]?[a-zA-Z0-9_]+$"
^ beginning of string,
[a-zA-Z_] 1 alpha character or underscore,
[\\\]?[a-zA-Z0-9_]+ one or more characters being; backslash, ], ?, alphanumeric and underscore,
$ end of string.
So what you actually need is either:
"^[a-zA-Z0-9_]+\\\\[a-zA-Z0-9_]+$"
(Two pairs of backslashes become two literal backslashes, which will be interpreted by the regex engine as an escaped backslash; hence 1 literal backslash)
Or
@"^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$"
(No backslash substitution performed, so the regex engine directly interprets the escaped backslash)
Note that I added the numbers in the first character class to allow it to match numbers like you requested and added the + quantifier to allow it to match more than one character before the backslash.
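A quick way to check the corrected pattern against the strings from the question is sketched below. The class and member names are invented for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackslashPattern
{
    // Verbatim form: the regex engine sees \\ and reads it as one literal backslash.
    public const string Pattern = @"^[a-zA-Z0-9_]+\\[a-zA-Z0-9_]+$";

    // Regular-string form of the same pattern, with every backslash doubled again.
    public const string PatternEscaped = "^[a-zA-Z0-9_]+\\\\[a-zA-Z0-9_]+$";

    public static bool IsAllowed(string s)
    {
        return Regex.IsMatch(s, Pattern);
    }

    public static void Main()
    {
        // The first three should be accepted, the last three rejected.
        string[] samples =
        {
            @"this\test_", @"_thistes\t", @"_t\histest",
            @"this\\", @"_\", @"_\w\s\x"
        };
        foreach (var s in samples)
        {
            Console.WriteLine(s + " -> " + IsAllowed(s));
        }
    }
}
```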

Pretty sure this should work, if I understood everything you wanted.
^([a-zA-Z0-9_]+\\[a-zA-Z0-9_]+)

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double backslashes thing to me, please. @_@
Additional info: I got this regex in C# code that uses Regex.IsMatch to check whether my username string matches the regex. It's for an ASP website.

My guess is that it's simply escaping the \ since backslash is the escape character in c#.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
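As a rough check of that equivalence, the sketch below runs the alternation form and the character-class form (both with the at-sign) over a few sample usernames. The class name and the sample values are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class UsernamePattern
{
    // Alternation form, written as a verbatim string.
    public const string Alternation = @"^(\w|@|\-| |\[|\]|\.)+$";

    // Character-class form that accepts the same characters.
    public const string CharClass = @"^([\w@ \[\].-])+$";

    public static void Main()
    {
        // "\\w" in a regular string is the same two characters as @"\w".
        Console.WriteLine("\\w" == @"\w"); // prints True

        string[] samples = { "john.doe-1 [x]@", "bad!name", "" };
        foreach (var s in samples)
        {
            // Both patterns should agree on every sample.
            Console.WriteLine(Regex.IsMatch(s, Alternation) + " " + Regex.IsMatch(s, CharClass));
        }
    }
}
```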

Double backslashes stand for a single backslash. The backslash is doubled to escape the backslash itself, because in a C# string a backslash introduces other escape sequences, e.g. \n stands for a new line.
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Breaking down this regex: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \.. That is, any alphanumeric character (\w also covers the underscore), @, -, space, [, ] or . character. Note that this backslash is the regex escape, used to escape the -, [, ] and . characters, as they all have special meanings in regex context.
And + means the previous token (i.e. \w|@|\-| |\[|\]|\.) repeated one or more times.
So the entire thing means one or more of any combination of alphanumeric characters, @, -, space, [, ] and . characters.

There are online tools to analyze regexes. One such tool is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    CapturingGroup
      GroupNumber:1
      OR: match either of the followings
        WordCharacter
        @
        -
        [
        ]
        .
    one or more times
  EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.

^ means the beginning of the line.
The parentheses are used for grouping.
\w is a word character.
| means OR.
@ matches the @ character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex is used to match a string which contains only word characters, an @, a hyphen, a space, square brackets or a dot.

Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but this is being escaped.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double backslashes aren't necessary if you just prefix the string with @, like this:
@"\w" is the same as "\\w"

convert any string to be used as match expression in regex

If you have a string with special characters that you want to match with:
System.Text.RegularExpressions.Regex.Matches(theTextToCheck, myString);
It will obviously give you wrong results, if you have special characters inside myString like "%" or "\".
The idea is to convert myString by replacing all occurrences of special characters like "%" with their escaped equivalents.
Does anyone know how to solve that or does someone have a RegEx for that? :)
Update:
The following characters have a special meaning that I should turn off by adding a leading backslash: \, &, ~, ^, %, [, ], {, }, ?, +, *, (, ), |, $
Are there any others I should replace?

As @Kobi links to in the comments, you need to use Regex.Escape to ensure that the regular expression string is properly escaped.
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes. This instructs the regular expression engine to interpret these characters literally rather than as metacharacters.
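A minimal sketch of this, assuming the goal is to count literal occurrences of myString in a text (CountLiteral is a hypothetical helper name):

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralSearch
{
    // Hypothetical helper: escapes the search text first so that every
    // character in it is matched literally.
    public static int CountLiteral(string text, string literal)
    {
        return Regex.Matches(text, Regex.Escape(literal)).Count;
    }

    public static void Main()
    {
        // Unescaped, the dot matches any character, so "axb" also counts.
        Console.WriteLine(Regex.Matches("a.b axb", "a.b").Count); // prints 2
        Console.WriteLine(CountLiteral("a.b axb", "a.b"));        // prints 1
        Console.WriteLine(CountLiteral("100% (sure), 100% (sure)", "100% (sure)")); // prints 2
    }
}
```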

If you want to escape all characters that carry a special meaning in regex, you could simply escape every character with a backslash (There is no harm in escaping characters that don't need to be escaped).
But if you do, why are you using Regex at all instead of string.IndexOf?

Regex.Escape will do that for you. Somewhere in the MSDN docs it reads:
Escape converts a string so that the regular expression engine will interpret any metacharacters that it may contain as character literals
which is much more informative than the function description.
This is left for search/replace reference.
Use this as your regex:
(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)
This captures your characters of interest in a numbered group.
And this as your replacement string:
\$1
replaces the matches with backslash plus the group content
Sample code:
Regex re = new Regex(@"(\\|\&|\~|\^|\%|\[|\]|\{|\}|\?|\+|\*|\(|\)|\||\$)");
string replaced = re.Replace(@"Look for (special {characters} and scape [100%] of them)", @"\$1");
