Regex to identify escaped characters issue - c#

Let's suppose we have the following string:
@"Hello m\u00e9 name is Mat\u00bfQu"
I am using the regex:
private static readonly Regex ESCAPING_REGEX = new Regex("\\+[^\"][a-zA-Z0-9]*", RegexOptions.Compiled);
However, this regex doesn't seem to return any matches:
MatchCollection matches = ESCAPING_REGEX.Matches(text);
// matches.Count == 0
I tried the regex on Regex101 and it does return the two matches that I was looking for.
How can I fix my regular expression to achieve expected behavior? (Any tips for improvement are gladly accepted.)

Your regex declaration is faulty because it requires a literal + at the start of the match. In a regular string literal, "\\+" reaches the regex engine as \+. Here is how the engine sees your pattern:
\+ - Matches a literal +
[^"] - Matches any character other than "
[a-zA-Z0-9]* - Matches 0 or more characters that are digits or Latin letters.
If you use a verbatim string literal to create your regex, e.g.
Regex.Matches(str, @"\\+[^""][a-zA-Z0-9]*");
you'd get 2 matches. \\ in a verbatim string literal will match a literal \, and + will be treated as a quantifier.
Actually, you do not even need the + (it only lets the pattern match a run of several backslashes) or the [^""] (unless a " can follow the \ and you do not want to match that). You can use
@"\\[a-zA-Z0-9]+"
to match your substrings (\\ matches a \, [a-zA-Z0-9]+ will match 1 or more characters from the range).
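The difference between the two literals can be checked with a minimal sketch like the one below. The class and field names (EscapeDemo, Regular, Verbatim) are hypothetical and only serve the illustration; the patterns and the sample text are the ones from the question and the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular literal from the question: the compiler turns "\\+" into \+,
    // so the regex engine looks for a literal plus sign.
    public static readonly Regex Regular =
        new Regex("\\+[^\"][a-zA-Z0-9]*", RegexOptions.Compiled);

    // Verbatim literal: the engine receives \\[a-zA-Z0-9]+,
    // i.e. one literal backslash followed by one or more letters or digits.
    public static readonly Regex Verbatim =
        new Regex(@"\\[a-zA-Z0-9]+", RegexOptions.Compiled);

    public static void Main()
    {
        string text = @"Hello m\u00e9 name is Mat\u00bfQu";

        Console.WriteLine(Regular.Matches(text).Count); // 0: the text contains no +

        foreach (Match m in Verbatim.Matches(text))
            Console.WriteLine(m.Value);                 // \u00e9, then \u00bfQu
    }
}
```

Note that the second match is \u00bfQu rather than \u00bf, because [a-zA-Z0-9]+ keeps consuming the letters that follow the escape.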

Related

Escape special characters in xml

Using Regex, how can I escape special characters in xml attribute values?
Given the following xml as string:
"<node attr=\"<Sample>\"></node>"
I want to get:
"<node attr=\"&lt;Sample&gt;\"></node>"
System.Security.SecurityElement.Escape won't work on the whole string, as it escapes every special character (including the angle brackets that open and close tags).
string text = "<node attr=\"<Sample>\"></node>";
string pattern = @"(?<=\b\w+\s*=\s*"")<\w+>(?="")";
string result = Regex.Replace(text, pattern, m => SecurityElement.Escape(m.Value));
Console.WriteLine(text);
Console.WriteLine(result);
Where:
?<= - positive lookbehind
\b - start the match at a word boundary
\w+ - match one or more word characters
\s* - match zero or more white-space characters
?= - positive lookahead

Regex expression to match whole word with special characters not working ? [duplicate]

I was going through this question:
C#, Regex.Match whole words
It says that to match a whole word, use "\bpattern\b".
This works fine for whole words without special characters, since \b is meant for word characters only.
I need an expression that also matches words with special characters. My code is as follows:
class Program
{
    static void Main(string[] args)
    {
        string str = Regex.Escape("Hi temp% dkfsfdf hi");
        string pattern = Regex.Escape("temp%");
        var matches = Regex.Matches(str, "\\b" + pattern + "\\b", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
But it fails because of %. Do we have any workaround for this?
There can be other special characters like 'space','(',')', etc
If your pattern contains non-word characters, you cannot rely on \b. You can use the following instead:
@"(?<=^|\s)" + pattern + @"(?=\s|$)"
Edit: As Tim mentioned in comments, your regex is failing precisely because \b fails to match the boundary between % and the white-space next to it because both of them are non-word characters. \b matches only the boundary between word character and a non-word character.
See more on word boundaries here.
Explanation
@"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
# Match either the regular expression below (attempting the next alternative only if this one fails)
^ # Assert position at the beginning of the string
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
)
temp% # Match the characters “temp%” literally
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
"
If the pattern can contain characters that are special to Regex, run it through Regex.Escape first.
This you did, but do not escape the string that you search through - you don't need that.
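Putting the two pieces of advice together (lookarounds instead of \b, and escaping only the pattern), a minimal sketch could look like this. The class and method names (WholeTokenDemo, CountWholeToken) are hypothetical, and the sketch assumes that "whole word" means "delimited by whitespace or by the start or end of the string".

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeTokenDemo
{
    // Counts occurrences of token that are delimited by whitespace
    // or by the start/end of the input. Only the token is escaped;
    // the input string is searched as-is.
    public static int CountWholeToken(string input, string token)
    {
        string pattern = @"(?<=^|\s)" + Regex.Escape(token) + @"(?=\s|$)";
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWholeToken("Hi temp% dkfsfdf hi", "temp%"));  // 1
        Console.WriteLine(CountWholeToken("Hi temp%x dkfsfdf hi", "temp%")); // 0
        Console.WriteLine(CountWholeToken("Hi temp% dkfsfdf hi", "hi"));     // 2
    }
}
```

The last call returns 2 because RegexOptions.IgnoreCase lets both "Hi" at the start and "hi" at the end count as matches.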
output = Regex.Replace(output, "(?<!\w)-\w+", "")
output = Regex.Replace(output, " -"".*?""", "")

Regex for matching C# string literals

I am trying to write a regular expression that will match a string that contains name-value pairs of the form:
<name> = <value>, <name> = <value>, ...
Where <value> is a C# string literal. I already know the <name>s that I need to find via this regular expression. So far I have the following:
regex = new Regex(fieldName + @"\s*=\s*""(.*?)""");
This works well, but of course it fails when the string I am trying to match contains a <value> with an escaped quote. I am struggling to work out how to solve this. I think I need a lookahead, but I need a few pointers. As an example, I would like to be able to match the value of the 'difficult' named value below:
difficult = "\\\a\b\'\"\0\f \t\v", easy = "one"
I would appreciate a decent explanation with your answers, I want to learn, rather than copy ;-)
Try this to capture the key and value:
(\w+)\s*=\s*(@"(?:[^"]|"")*"|"(?:\\.|[^\\"])*")
As a bonus, it also works on verbatim strings.
C# examples: https://dotnetfiddle.net/vQP4rn
Here's an annotated version:
string pattern = @"
    (\w+)\s*=\s*        # key =
    (                   # capturing group for the string
        @""             # verbatim string - match literal at-sign and a quote
        (?:
            [^""]|""""  # match a non-quote character, or two quotes
        )*              # zero times or more
        ""              # literal quote
    |                   # OR - regular string
        ""              # string literal - opening quote
        (?:
            \\.         # match an escaped character,
            |[^\\""]    # or a character that isn't a quote or a backslash
        )*              # zero times or more
        ""              # string literal - closing quote
    )";
MatchCollection matches = Regex.Matches(s, pattern,
    RegexOptions.IgnorePatternWhitespace);
Note that the regular string allows all characters to be escaped, unlike in C#, and allows newlines. It should be easy to correct if you need validation, but it should be fine for parsing.
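As a usage sketch, the same alternation can be run against the 'difficult' example from the question. The names PairParser and Parse are hypothetical, and returning a dictionary is only one possible design (it assumes each name appears once).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Group 1 is the name, group 2 is the whole literal including its quotes.
    private static readonly Regex Pair = new Regex(@"
        (\w+)\s*=\s*                # key =
        (
            @""(?:[^""]|"""")*""    # verbatim string: doubled quotes stay inside
          |
            ""(?:\\.|[^\\""])*""    # regular string: backslash escapes stay inside
        )", RegexOptions.IgnorePatternWhitespace);

    public static Dictionary<string, string> Parse(string s)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(s))
            result[m.Groups[1].Value] = m.Groups[2].Value;
        return result;
    }

    public static void Main()
    {
        // The input line from the question, written as a verbatim literal.
        string input = @"difficult = ""\\\a\b\'\""\0\f \t\v"", easy = ""one""";

        foreach (var pair in Parse(input))
            Console.WriteLine(pair.Key + " => " + pair.Value);
    }
}
```

Each captured value keeps its surrounding quotes and its escape sequences exactly as written, so turning it into the actual string contents would be a separate unescaping step.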
This should match only the string literal part (you can tack on whatever else you want to the beginning/end):
Regex regex = new Regex("\"((\\\\.)|[^\\\\\"])*\"");
and if you want a pattern that rejects multi-line string literals (regular C# string literals cannot span lines):
Regex regex = new Regex("\"((\\\\[^\n\r])|[^\\\\\"\n\r])*\"");
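You can use this:
@" \s* = \s* (?<!\\)"" (.* ) (?<!\\)"""
It's almost like yours, but instead of "" I used (?<!\\)"", which matches a quote only when it is not preceded by a \, so it won't match escaped quotes. Because the pattern contains spaces for readability, compile it with RegexOptions.IgnorePatternWhitespace.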

.NET RegEx for letters and spaces

I am trying to create a regular expression in C# that allows only alphanumeric characters and spaces. Currently, I am trying the following:
string pattern = @"^\w+$";
Regex regex = new Regex(pattern);
if (regex.IsMatch(value) == false)
{
    // Display error
}
What am I doing wrong?
If you just need English, try this regex:
"^[A-Za-z ]+$"
The brackets specify a set of characters
A-Z: All capital letters
a-z: All lowercase letters
' ': Spaces
If you need Unicode / internationalization, you can try this regex:
@"^[\p{L}\s]+$"
See https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#word-character-w
This regex will match all unicode letters and spaces, which may be more than you need, so if you just need English / basic Roman letters, the first regex will be simpler and faster to execute.
Note that for both regex I have included the ^ and $ operator which mean match at start and end. If you need to pull this out of a string and it doesn't need to be the entire string, you can remove those two operators.
Try this for all letters and spaces:
@"[\p{L} ]+$"
The character class \w does not match spaces. Try replacing it with [\w ] (there's a space after the \w) to match word characters and spaces. You could also replace the space with \s if you want to match any whitespace.
If, other than 0-9, a-z and A-Z, you also need to cover accented letters like ï, é, æ, Ć or Ş, then you should use the Unicode properties \p{...} for matching, i.e. (note the space):
string pattern = @"^[\p{L}\p{Nd} ]+$";
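A minimal validation sketch along these lines, assuming the goal is "letters, decimal digits and plain spaces only, at least one character". It uses the general-category names \p{L} (letters) and \p{Nd} (decimal digits); the class and method names (TagValidator, IsValid) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagValidator
{
    // One or more Unicode letters, decimal digits or spaces, and nothing else.
    private static readonly Regex Allowed = new Regex(@"^[\p{L}\p{Nd} ]+$");

    public static bool IsValid(string value)
    {
        return Allowed.IsMatch(value);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("Hello World 42"));    // True
        Console.WriteLine(IsValid("Caf\u00e9 au lait")); // True: the accented e is a letter
        Console.WriteLine(IsValid("no_underscores"));    // False: _ is in \w but not in this class
        Console.WriteLine(IsValid("tab\there"));         // False: only the plain space is allowed
        Console.WriteLine(IsValid(""));                  // False: + requires at least one character
    }
}
```

Unlike \w, this class rejects the underscore, which is usually what "alphanumeric" is meant to exclude.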
This regex works great for me.
Regex rgx = new Regex("[^a-zA-Z0-9_ ]+");
if (rgx.IsMatch(yourstring))
{
    var err = "Special characters are not allowed in Tags";
}

C# Regular Expression

Can you explain the meaning of this regular expression? What strings would match it?
Regex(@"/Type\s*/Page[^s]");
What is the @ symbol? Thanks in advance.
Please provide a full explanation.
The @ symbol designates a verbatim string literal:
A verbatim string literal consists of an @ character followed by a double-quote character, zero or more characters, and a closing double-quote character. A simple example is @"hello". In a verbatim string literal, the characters between the delimiters are interpreted verbatim, the only exception being a quote-escape-sequence. In particular, simple escape sequences and hexadecimal and Unicode escape sequences are not processed in verbatim string literals. A verbatim string literal may span multiple lines.
As for the regular expression it breaks down like this:
/Type match this string exactly
\s* match any whitespace character zero or more times
/Page match this string exactly
[^s] match any character that isn't "s"
@ says that the string literal is verbatim.
The regex matches:
/Type followed by zero or more whitespaces, followed by /Page and a character that is not s
It will match strings like /Type/Pagex, /Type /Page3, /Type /Page?
@ starts a C# verbatim string, in which the compiler doesn't process escape sequences, making it easier to write expressions with lots of \ characters.
Both of the following match:
/Type /Page4
/Type /Pagex
Your regular expression matches any string containing the following:
A "/" character
The word "Type" (case sensitive)
Optionally, some whitespace
Another "/"
The word "Page" (case sensitive)
Any character that isn't an "s"
Examples would be "/Type /Paged" or "/Type/Pager".
If you want to match either "Page" or "Pages" at the end, you probably want this instead:
Regex(@"/Type\s*/Pages?");
Here is a good online C# regex tester.
Roughly, it matches: /Type{optional space}/Page{not an 's'}
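The behaviour described in the answers above can be checked with a short sketch. The class and field names (PageTypeDemo, PageType) are hypothetical; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageTypeDemo
{
    // /Type, optional whitespace, /Page, then exactly one character that is not s.
    public static readonly Regex PageType = new Regex(@"/Type\s*/Page[^s]");

    public static void Main()
    {
        Console.WriteLine(PageType.IsMatch("/Type /Page4")); // True
        Console.WriteLine(PageType.IsMatch("/Type/Pagex"));  // True: \s* also allows no whitespace
        Console.WriteLine(PageType.IsMatch("/Type /Pages")); // False: the next character is s
        Console.WriteLine(PageType.IsMatch("/Type /Page"));  // False: [^s] needs a character to consume
    }
}
```

The last case is easy to overlook: [^s] is not "optionally not an s", so a string that ends right after /Page does not match.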
