Escape special characters in xml - c#

Using Regex, how can I escape special characters in xml attribute values?
Given the following xml as string:
"<node attr=\"<Sample>\"></node>"
I want to get:
"<node attr=\"&lt;Sample&gt;\"></node>"
The System.Security.SecurityElement.Escape function won't work on the whole string, as it escapes every special character (including the tags' opening/closing angle brackets).

string text = "<node attr=\"<Sample>\"></node>";
string pattern = @"(?<=\b\w+\s*=\s*"")<\w+>(?="")";
string result = Regex.Replace(text, pattern, m => SecurityElement.Escape(m.Value));
Console.WriteLine(text);
Console.WriteLine(result);
Where:
?<= - positive lookbehind
\b - start the match at a word boundary
\w+ - match one or more word characters
\s* - match zero or more white-space characters
?= - positive lookahead
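The pattern above only handles a value of the exact form <word>. As a rough sketch of the same lookbehind/lookahead idea applied to any double-quoted attribute value, something like the following might work. The class and method names are hypothetical, and it assumes double-quoted values that are not already escaped and no name="value"-looking text outside of tags.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;

public static class AttributeEscaper
{
    // Lookbehind: an attribute name, '=', and the opening double quote.
    // Body: everything up to the closing quote, which the lookahead asserts.
    private static readonly Regex AttrValue =
        new Regex(@"(?<=\b\w+\s*=\s*"")[^""]*(?="")");

    // Escapes only the text between the quotes; the tag brackets are untouched.
    public static string EscapeAttributeValues(string xml)
    {
        return AttrValue.Replace(xml, m => SecurityElement.Escape(m.Value));
    }

    public static void Main()
    {
        string text = "<node attr=\"<Sample>\" other=\"a&b\"></node>";
        Console.WriteLine(EscapeAttributeValues(text));
    }
}
```

Running it twice on the same string would double-escape the values, so it is only suitable as a one-off repair step.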

Related

Matching a string with out matching a specific word

I need a regular expression to match a string but exclude specific words from it.
For example, given
dfm HSBC12323
I need to extract
HSBC12323
without dfm. If the string is just HSBC12323, it should be matched as is, since dfm may not be present.
For dfm123213 I need to match 123213.
For adx 212321 I need to match 212321, not adx.
For adx hsbc123uy I need to match hsbc123uy.
hsbc1237 should be matched as is.
(?<!dfm\s*?|adx\s*?|\w)\d+
but it doesn't work the way I want.
Actual string : dfm HSBC12323 expected HSBC12323
Actual string : HSBC12323 expected HSBC12323
Actual string : dfm123213 expected 123213
Actual string : adx 212321 expected 212321
Actual string : usa1237 expected usa1237
Your pattern (?<!dfm\s*?|adx\s*?|\w)\d+ matches 1+ digits only if what is on the left is not dfm or adx (each optionally followed by whitespace chars) and not a word character. You don't have to make \s*? non-greedy, as it cannot consume any of the digits that \d+ matches.
None of your examples match: wherever the digits are not preceded by dfm or adx, they are preceded by a word character, so the \w alternative of the lookbehind rejects them. A string such as $22 would match.
One option to match your values could be using an alternation in combination with a positive lookbehind and a negative lookahead.
(?<=\b(?:dfm|adx) *)\w+|\b(?!(?:dfm|adx))\w+
Explanation
(?<= Positive lookbehind, assert what is on the left
\b(?:dfm|adx) * Word boundary, match either dfm or adx followed by 0+ times a space
) Close positive lookbehind
\w+ Match 1+ word chars
| Or
\b Word boundary
(?! Negative lookahead, assert what is directly on the right is not
(?:dfm|adx) Match either dfm or adx
) Close negative lookahead
\w+ Match 1+ word characters
See a .NET regex demo
You might also add (?!\S) after matching \w+ if the match should not be followed by a non whitespace char.
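As a quick check of the alternation above, here is a small sketch that runs it against the sample strings from the question. The class and method names are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripper
{
    // First alternative: word chars right after dfm/adx (and optional spaces).
    // Second alternative: a whole word that does not start with dfm/adx.
    private const string Pattern =
        @"(?<=\b(?:dfm|adx) *)\w+|\b(?!(?:dfm|adx))\w+";

    // Returns the first match, or an empty string if nothing matches.
    public static string Extract(string input)
    {
        return Regex.Match(input, Pattern).Value;
    }

    public static void Main()
    {
        string[] samples =
        {
            "dfm HSBC12323", "HSBC12323", "dfm123213", "adx 212321", "usa1237"
        };
        foreach (string s in samples)
        {
            Console.WriteLine("{0} -> {1}", s, Extract(s));
        }
    }
}
```

Note that the lookbehind contains a quantifier ( *), which .NET supports but many other regex engines reject.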
My guess is that with this expression, or one similar to it, we can capture what we want step by step, and then strengthen the expression with additional constraints if needed, just to be safe:
(?=dfm\s+|adx\s+)(?:dfm\s+([A-Z0-9]+)|adx\s+([0-9]+))|(?=dfm)dfm([0-9]+)|[A-Za-z0-9]+
Demo
Test
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"(?=dfm\s+|adx\s+)(?:dfm\s+([A-Z0-9]+)|adx\s+([0-9]+))|(?=dfm)dfm([0-9]+)|[A-Za-z0-9]+";
        // One sample per line.
        string input = @"dfm HSBC12323
HSBC12323
dfm123213
adx 212321
usa1237";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            // m.Value is the whole match, including any dfm/adx prefix;
            // the wanted part is in whichever capture group participated.
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

Remove Dashes but Not Hyphens

I want to remove dashes before, after, and between spaced words, but not hyphenated words.
This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.
should become:
This is a test-sentence. Test One-Two--Three---Four.
Remove multiple dashes ---.
Keep multiple hyphens Three---Four.
I was trying to do it with this:
http://rextester.com/SXQ57185
string sentence = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
string regex = @"(?<!\w)\-(?!\-)|(?<!\-)\-(?!\w)";
sentence = Regex.Replace(sentence, regex, "");
Console.WriteLine(sentence);
But the output is:
This is a test-sentence. Test - One-TwoThree-Four--.
What I would recommend is a combination of a positive lookbehind and a positive lookahead against the characters that you don't want the dashes to be next to. In your case, that would be spaces and full stops. If either the lookbehind or the lookahead matches, you want to remove that dash.
This would be: ((?<=[\s\.])\-+)|(\-+(?=[\s\.])).
Breaking this down:
((?<=[\s\.])\-+) - match hyphens that follow either a space or a full stop
| - or
(\-+(?=[\s\.])) - match hyphens that are followed by either a space or a full stop
Here's a JavaScript example showcasing that:
const string = 'This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.';
const regex = /((?<=[\s\.])\-+)|(\-+(?=[\s\.]))/g;
console.log(string.replace(regex, ''));
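And this can also be seen on Regex101.
Note that you'll probably also want to clean up the excess spaces after using this. In C#, .Trim() only removes leading and trailing whitespace, so the doubled spaces left inside the sentence need a second replace (for example, replacing runs of two or more spaces with a single space).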
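For reference, a C# sketch of this approach might look like the following: the same lookbehind/lookahead pattern (without the capture groups), followed by a second replace that collapses the leftover runs of spaces. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashCleaner
{
    public static string Clean(string sentence)
    {
        // Remove runs of dashes that touch whitespace or a full stop.
        string noDashes = Regex.Replace(sentence, @"(?<=[\s.])-+|-+(?=[\s.])", "");

        // Collapse the doubled spaces left behind, then trim the ends.
        return Regex.Replace(noDashes, @" {2,}", " ").Trim();
    }

    public static void Main()
    {
        string sentence =
            "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
        Console.WriteLine(Clean(sentence));
        // => This is a test-sentence. Test One-Two--Three---Four.
    }
}
```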
You can use \b|\s for this task.
/(\b|\s)(-{3})(\b|\s)/g
DEMO
Breakdown shamelessly copied from regex101.com:
/(\b|\s)(-{3})(\b|\s)/g
1st Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Capturing Group (-{3})
-{3} matches the character - literally (case sensitive)
{3} Quantifier — Matches exactly 3 times
3rd Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
You may just match all hyphens in between word chars, and remove all others with a simple
Regex.Replace(s, @"\b(-+)\b|-", "$1")
See the regex demo
Details
\b(-+)\b - word boundary, followed with 1+ hyphens, and then again a word boundary (that is, hyphen(s) in between letters, digits and underscores)
| - or
- - a hyphen in other contexts (it will be removed).
See the C# demo:
var s = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
var result = Regex.Replace(s, @"\b(-+)\b|-", "$1");
Console.WriteLine(result);
// => This is a test-sentence. Test One-Two--Three---Four.

Regex replace special character

I need help in my regex.
I need to remove the special character found in the start of text
for example I have a text like this
.just a $#text this should not be incl#uded
The output should be like this
just a text this should not be incl#uded
I've been testing my regex here but I can't make it work:
([\!-\/\;-\@]+)[\w\d]+
How do I limit the regex to check only the text that starts in special characters?
Thank you
Use \B[!-/;-@]+\s*\b:
var result = Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
See the regex demo
Details
\B - the position other than a word boundary (there must be start of string or a non-word char immediately to the left of the current position)
[!-/;-@]+ - 1 or more ASCII punctuation or symbol chars (the ranges ! to / and ; to @)
\s* - 0+ whitespace chars
\b - a word boundary, there must be a letter/digit/underscore immediately to the right of the current location.
If you plan to remove all punctuation and symbols, use
var result = Regex.Replace(s, @"\B[\p{P}\p{S}]+\s*\b", "");
See another regex demo.
Note that \p{P} matches any punctuation symbols and \p{S} matches any symbols.
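To see both variants side by side, here is a small sketch that applies them to the sample text from the question. The class and method names are illustrative only, and the ASCII variant assumes the character class [!-/;-@].

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingSymbolStripper
{
    // ASCII punctuation/symbols at the start of a word.
    public static string StripAscii(string s)
    {
        return Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
    }

    // Any Unicode punctuation or symbol at the start of a word.
    public static string StripUnicode(string s)
    {
        return Regex.Replace(s, @"\B[\p{P}\p{S}]+\s*\b", "");
    }

    public static void Main()
    {
        string input = ".just a $#text this should not be incl#uded";
        Console.WriteLine(StripAscii(input));
        Console.WriteLine(StripUnicode(input));
        // Both print: just a text this should not be incl#uded
    }
}
```

The # inside incl#uded survives because the position between l and # is a word boundary, so \B cannot match there.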
Use a lookbehind:
(^[.$#]+|(?<= )[.$#]+)
The ^[.$#]+ part matches the special characters at the start of a line.
The (?<= )[.$#]+ part matches the special characters at the start of a word inside the sentence.
Add your special characters in the character group [] as you need.
Here are two possible options based on the details in your question. Hope they help.
string input = ".just a $#text this should not be incl#uded";
// Option 1: remove all special characters from the whole string.
string output1 = Regex.Replace(input, @"[^0-9a-zA-Z\ ]+", "");
// Option 2: remove leading special characters from each word and keep the others.
// Select requires "using System.Linq;".
var split = input.Split();
string output2 = string.Join(" ", split.Select(s => Regex.Replace(s, @"^[^0-9a-zA-Z]+", "")).ToArray());
A negative lookahead is fine here:
(?![\.\$#].*)[\S]+
https://regex101.com/r/i0aacp/11/
[\S]+ matches one or more non-whitespace characters
(?![\.\$#].*) is a negative lookahead, meaning the run matched by [\S]+ must not start with any of . $ #

Regex to identify escaped characters issue

Let's suppose we have the following string:
@"Hello m\u00e9 name is Mat\u00bfQu"
I am using the regex:
private static readonly Regex ESCAPING_REGEX = new Regex("\\+[^\"][a-zA-Z0-9]*", RegexOptions.Compiled);
However, this regex doesn't seem to return any matches:
MatchCollection matches = ESCAPING_REGEX.Matches(text);
// matches.Count == 0
I tried the regex on Regex101 and it does return the two matches that I was looking for.
How can I fix my regular expression to achieve expected behavior? (Any tips for improvement are gladly accepted.)
Your regex declaration is faulty because you require a literal + to be in the beginning of the match. Look what your regex looks like for a regex engine:
\+ - Matches a literal +
[^"] - Matches any character other than "
[a-zA-Z0-9]* - Matches 0 or more characters that are digits or Latin letters.
If you use a verbatim string literal to create your regex, e.g.
Regex.Matches(str, @"\\+[^""][a-zA-Z0-9]*");
you'd get 2 matches. \\ in a verbatim string literal will match a literal \, and + will be treated as a quantifier.
Actually, you do not even need the + (it only lets the pattern match a run of several backslashes) or the [^""] (unless there can be some "s right after \ and that is not what you want to match), so you can use
@"\\[a-zA-Z0-9]+"
to match your substrings (\\ matches a \, [a-zA-Z0-9]+ will match 1 or more characters from the ranges).
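The difference between the two string literal styles can be checked with a short sketch like this one. The class and method names are hypothetical; the input and both patterns are the ones discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSequenceFinder
{
    // Regular literal: the engine receives \+[^"][a-zA-Z0-9]*,
    // i.e. it looks for a literal '+' character.
    public static int CountWithRegularLiteral(string text)
    {
        return new Regex("\\+[^\"][a-zA-Z0-9]*").Matches(text).Count;
    }

    // Verbatim literal: the engine receives \\[a-zA-Z0-9]+,
    // i.e. a backslash followed by letters/digits.
    public static int CountWithVerbatimLiteral(string text)
    {
        return new Regex(@"\\[a-zA-Z0-9]+").Matches(text).Count;
    }

    public static void Main()
    {
        string text = @"Hello m\u00e9 name is Mat\u00bfQu";
        Console.WriteLine(CountWithRegularLiteral(text));  // 0
        Console.WriteLine(CountWithVerbatimLiteral(text)); // 2
    }
}
```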

Regex expression to match whole word with special characters not working ? [duplicate]

This question already has an answer here:
Regex expression to match whole word ?
(1 answer)
Closed 4 years ago.
I was going through this question
C#, Regex.Match whole words
It says for match whole word use "\bpattern\b"
This works fine for match whole word without any special characters since it is meant for word characters only!
I need an expression to match words with special characters also. My code is as follows
class Program
{
static void Main(string[] args)
{
string str = Regex.Escape("Hi temp% dkfsfdf hi");
string pattern = Regex.Escape("temp%");
var matches = Regex.Matches(str, "\\b" + pattern + "\\b" , RegexOptions.IgnoreCase);
int count = matches.Count;
}
}
But it fails because of %. Do we have any workaround for this?
There can be other special characters like 'space','(',')', etc
If you have non-word characters then you cannot use \b. You can use the following
@"(?<=^|\s)" + pattern + @"(?=\s|$)"
Edit: As Tim mentioned in comments, your regex is failing precisely because \b fails to match the boundary between % and the white-space next to it because both of them are non-word characters. \b matches only the boundary between word character and a non-word character.
See more on word boundaries here.
Explanation
@"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
# Match either the regular expression below (attempting the next alternative only if this one fails)
^ # Assert position at the beginning of the string
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
)
temp% # Match the characters “temp%” literally
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
"
If the pattern can contain characters that are special to Regex, run it through Regex.Escape first.
This you did, but do not escape the string that you search through - you don't need that.
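Putting the advice together, a minimal sketch could look like this: escape only the search term, leave the input string alone, and use the whitespace lookarounds in place of \b. The class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordMatcher
{
    // Counts occurrences of 'word' delimited by whitespace or string ends.
    public static int CountWholeWord(string input, string word)
    {
        string pattern = @"(?<=^|\s)" + Regex.Escape(word) + @"(?=\s|$)";
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string input = "Hi temp% dkfsfdf hi";

        Console.WriteLine(CountWholeWord(input, "temp%")); // 1
        Console.WriteLine(CountWholeWord(input, "hi"));    // 2 (case-insensitive)
        Console.WriteLine(CountWholeWord(input, "temp"));  // 0 ('%' follows)

        // For comparison, \b cannot match between '%' and the space after it.
        Console.WriteLine(Regex.Matches(input, @"\btemp%\b").Count); // 0
    }
}
```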
output = Regex.Replace(output, "(?<!\w)-\w+", "")
output = Regex.Replace(output, " -"".*?""", "")
