Decoding Special Characters - c#

I was analysing a piece of code (written by someone else) in AngularJS and came across the block below, which performs some string operations on special characters. What do the following expressions mean? It would be great if someone could shed some light on these:
str = str.replace(/&/g, "&amp;");
str = str.replace(/</g, "&lt;");
str = str.replace(/>/g, "&gt;");
str = str.replace(/"/g, "&quot;");
str = str.replace(/'/g, "&apos;");
where str is a string object
Thanks in advance

This is about escaping special characters for HTML.
And the regular expression literal syntax (/.../g) suggests this is JavaScript rather than C#.

It's doing XML string escaping by hand instead of calling one of the many provided functions that do it for you and do it correctly and much, much more efficiently:
SecurityElement.Escape (best by far, no dependencies)
HttpUtility.HtmlEncode (worse, lots of dependencies)
And of course, all the xml writers like XDocument or XmlTextWriter
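As a minimal sketch of the first option, the snippet below calls SecurityElement.Escape, which replaces the five characters that are special in XML (&, <, >, " and '). The class name EscapeDemo and the helper EscapeForXml are made up for this illustration.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Hypothetical helper: thin wrapper around SecurityElement.Escape,
    // which escapes & < > " ' as XML entity references.
    public static string EscapeForXml(string raw)
    {
        return SecurityElement.Escape(raw);
    }

    public static void Main()
    {
        // Prints: Tom &amp; Jerry &lt;&quot;quoted&quot;&gt; &apos;single&apos;
        Console.WriteLine(EscapeForXml("Tom & Jerry <\"quoted\"> 'single'"));
    }
}
```

This does the same job as the five hand-written replace calls in the question, in one call.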

Related

How can I process a variable string with escape characters in it in C#?

In my C# code I need, sometimes, to process strings (Mostly XML strings) of the following format:
Example:
string str = "<ObjectConnectionSettings><ContextFields><ContextField Name="Server" Type="Text" Value="{$Profile.Server}" />";
The first double quote inside the content terminates the string literal, which leads to compile errors.
What is the best workaround for this issue?
Thank you.
The solution is to escape the double quotes that you do not want the compiler to treat as the end of the string:
string str = "<ObjectConnectionSettings><ContextFields><ContextField Name=\"Server\" Type=\"Text\" Value=\"{$Profile.Server}\" />";
That way the compiler knows they are part of the string and not its delimiters. The double quotes will still be there at runtime, when the string is evaluated by somebody else's database access code.
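An alternative worth knowing is a verbatim string literal (prefixed with @), where a double quote is written by doubling it instead of using a backslash. The sketch below uses a shortened version of the XML from the question; the class and constant names are hypothetical.

```csharp
using System;

public static class QuoteDemo
{
    // Regular literal: each embedded quote is escaped with a backslash.
    public const string Escaped = "<ContextField Name=\"Server\" Type=\"Text\" />";

    // Verbatim literal: each embedded quote is doubled.
    public const string Verbatim = @"<ContextField Name=""Server"" Type=""Text"" />";

    public static void Main()
    {
        // Both literals produce exactly the same runtime string.
        Console.WriteLine(Escaped == Verbatim); // prints True
    }
}
```

Which form to use is mostly a readability choice; verbatim literals tend to be easier on the eye when the text also contains backslashes.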

Most efficient way to replace HTML special entities to custom string

I want to replace ALL HTML special entities like &gt; and &lt; with a custom string.
Let's say I have the following string:
string str = "<div>&gt;hello&lt;</div>";
and method:
Method(string str, string replaceStr)
After calling Method(str, ":)") result should be
<div>:)hello:)</div>
The problem is that there are too many special characters, and I'm wondering what the most efficient way to accomplish this is.
EDIT:
String.Replace will not do the job, and using Regex to parse HTML is not really a good approach.
Judging by the downvotes on this question, there probably isn't a clean solution, so I decided to go with the following algorithm:
create a txt file with the valid HTML special characters (like &para;)
parse the file into an array of strings
use HtmlAgilityPack to parse the HTML, get the raw text, and replace all entities.
I know that this is not very efficient for big HTML strings, but it should do the job for now.
You can try:
string str = "<div>&gt;hello&lt;</div>";
string output = Regex.Replace(str, "&gt;|&lt;", ":)");
You can also use HtmlDecode
string str = "<div>&gt;hello&lt;</div>";
string output = WebUtility.HtmlDecode(str);
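Note that HtmlDecode turns entities back into the characters they stand for, so on its own it does not insert a custom string. If a regex is acceptable for this narrow task, one possible sketch is to match anything shaped like an entity reference (named, decimal or hexadecimal) rather than listing every entity. The class name EntityReplacer, the method name ReplaceEntities and the pattern are assumptions made for this example; the pattern checks only the shape and does not verify that a name is a real HTML entity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityReplacer
{
    // Matches named (&gt;), decimal (&#62;) and hexadecimal (&#x3E;) references.
    // Assumption: shape check only, no lookup against the list of valid entity names.
    private static readonly Regex EntityPattern =
        new Regex(@"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    public static string ReplaceEntities(string input, string replacement)
    {
        // A MatchEvaluator is used so that a "$" in the replacement is taken literally.
        return EntityPattern.Replace(input, m => replacement);
    }

    public static void Main()
    {
        // Prints: <div>:)hello:)</div>
        Console.WriteLine(ReplaceEntities("<div>&gt;hello&lt;</div>", ":)"));
    }
}
```

The tags themselves are untouched because the pattern only looks at text that starts with & and ends with ;.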

Strip out content between and including h2 tag

I am trying to strip the content from between the h2 tags in a string using a Regex in C#:
<h2>content needs removing</h2> other content...
I have the following Regex, which according to the Regex buddy software I used to test it, should work, but it doesn't:
myString = Regex.Replace(myString, @"<h[0-9]>.*</h[0-9]>", String.Empty);
I have another Regex that is run after this to remove all other HTML tags, it is called in the same way and works fine. Can anyone help me out with why this isn't working?
Don't use Regular Expressions.
HTML is not a Regular Language, thus it can't be parsed correctly with a Regular Expression.
For example, your Regex would match:
<h2>sample</h1>
which is not valid. When dealing with nested structures, this would lead to unexpected results (.* is greedy and matches everything until the last closing h[0-9] tag in your input HTML string)
You can use XMLDocument (HTML is not XML but that would be sufficient for what you're trying to do) or you can use Html Agility Pack.
Try this code:
String sourcestring = "<h2>content needs removing</h2> other content...";
String matchpattern = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";
String replacementpattern = @"";
MessageBox.Show(Regex.Replace(sourcestring,matchpattern,replacementpattern));
[^<]+ is safer than .+ because it stops collecting where it sees a <.
This works fine for me:
string myString = "<h2>content needs removing</h2> other content...";
Console.WriteLine(myString);
myString = Regex.Replace(myString, "<h[0-9]>.*</h[0-9]>", string.Empty);
Console.WriteLine(myString);
Displays:
<h2>content needs removing</h2> other content...
other content...
As expected.
If your problem is that your real case has several different heading tags, then you have an issue with the greedy * quantifier. It will create the longest match that it can. For example, if you have:
<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff
You will match everything from <h2> to </h3> and replace it. To fix this, you need to use a lazy quantifier:
myString = Regex.Replace(myString, "<h[0-9]>.*?</h[0-9]>", string.Empty);
Will leave you with:
other content... and some other stuff
Note, however, that this will not fix nested <h> tags. As @fardjad said, using Regex for HTML isn't generally a good idea.
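If a regex is used anyway, the lazy quantifier can be combined with a backreference so that the closing tag must have the same level as the opening one, which also avoids matching mismatched pairs such as <h2>sample</h1>. This is only a sketch under those assumptions: the class name HeadingStripper is made up, the pattern restricts levels to 1-6, and RegexOptions.Singleline is added so that headings spanning several lines are matched too. It still cannot handle nested headings or tags with attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingStripper
{
    public static string StripHeadings(string html)
    {
        // ([1-6]) captures the heading level; \1 requires the same digit in the closing tag.
        // .*? is lazy, so each match stops at the first matching closing tag.
        return Regex.Replace(html, @"<h([1-6])>.*?</h\1>", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input = "<h2>content needs removing</h2> other content..."
                     + "<h3>some more headings</h3> and some other stuff";
        Console.WriteLine(StripHeadings(input));
    }
}
```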

Finding text between tags and replacing it along with the tags

I am using the following regex pattern to find text between [code] and [/code] tags:
(?<=\[code\]).*?(?=\[/code\])
It returns me anything which is enclosed between these 2 tags, e.g. this: [code]return Hi There;[/code] gives me return Hi There;.
I need help with regex to replace entire text along with the tags.
Use this:
var s = "My temp folder is: [code]Path.GetTempPath()[/code]";
var result = Regex.Replace(s, @"\[code](.*?)\[/code]",
    m =>
    {
        var codeString = m.Groups[1].Value;
        // then evaluate this string
        return EvaluateMyCode(codeString);
    });
I would use an HTML parser for this. I can see that what you are trying to do is simple; however, these things have a habit of getting much more complicated over time. The end result is much pain for the poor soul who has to maintain the code in the future.
Take a look at this question about HTML Parsers
What is the best way to parse html in C#?
[Edit]
Here is a much more relevant answer to the question asked.
@Milad Naseri's regex is correct; you just need to do something like:
string matchCodeTag = @"\[code\](.*?)\[/code\]";
string textToReplace = "[code]The Ape Men are comming[/code]";
string replaceWith = "Keep Calm";
string output = Regex.Replace(textToReplace, matchCodeTag, replaceWith);
Check out these web sites for more examples:
http://www.dotnetperls.com/regex-replace
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Hope this helps
You need to use back referencing, i.e. replace \[code\](.*?)\[/code\] with something like <code>$1</code> which will give you what's been enclosed by the [code][/code] tags enclosed in -- for this example -- <code></code> tags.
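A minimal sketch of that back-reference replacement is shown below. In a .NET replacement string, $1 stands for the text captured by the first group. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeTagConverter
{
    public static string ToHtmlCodeTags(string text)
    {
        // Group 1 captures what sits between [code] and [/code];
        // $1 re-inserts it between <code> and </code>.
        return Regex.Replace(text, @"\[code\](.*?)\[/code\]", "<code>$1</code>");
    }

    public static void Main()
    {
        // Prints: <code>return Hi There;</code>
        Console.WriteLine(ToHtmlCodeTags("[code]return Hi There;[/code]"));
    }
}
```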

parsing XML with ampersand

I have a string which contains XML. I just want to parse it into an XElement, but it has an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these calls to replace those characters, but I am still getting an XmlException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
I even tried it with this:
string newContent = SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, and the list can grow.
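To make the negative-lookahead idea concrete, here is a self-contained sketch that escapes only the bare ampersands and then parses the result. The class name AmpersandFixer is made up, and the pattern is extended (as an assumption of this example, not part of the answer above) to also leave numeric character references such as &#38; alone.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // An "&" is rewritten only when it does NOT start one of the five predefined
    // XML entities or a numeric character reference (the numeric part is an added assumption).
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|apos|quot|lt|gt|#[0-9]+|#[xX][0-9a-fA-F]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    public static void Main()
    {
        string test = "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
        XElement doc = XElement.Parse(EscapeBareAmpersands(test));

        // The parser decodes the entity again, so this prints: wow&
        Console.WriteLine(doc.Element("SubXML").Element("XmlEntry").Attribute("value").Value);
    }
}
```

Already-escaped input is left as it is, so running the fix twice does not produce &amp;amp;.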
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;", the decode process would return "&wow&", and re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML; that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>
HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;), which is an XML entity reference. The predefined XML entity references are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But you might also get things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first by making sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.
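When you control how the XML is produced, one way to avoid the problem altogether is to build it with LINQ to XML instead of concatenating strings: XAttribute escapes attribute values when the tree is serialized, and XCData wraps element text in a CDATA section. The sketch below reuses the element names from the question; the class name XmlBuilderDemo and the method BuildEntry are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class XmlBuilderDemo
{
    public static XElement BuildEntry(string value, string text)
    {
        return new XElement("MyXML",
            new XElement("SubXML",
                new XElement("XmlEntry",
                    new XAttribute("Element", "test"),
                    new XAttribute("value", value), // escaped on serialization
                    new XCData(text))));            // emitted as a CDATA section
    }

    public static void Main()
    {
        string xml = BuildEntry("wow&", "a < b & c").ToString(SaveOptions.DisableFormatting);
        Console.WriteLine(xml);

        // Parsing the serialized text gives back the original, unescaped values.
        XElement entry = XElement.Parse(xml).Element("SubXML").Element("XmlEntry");
        Console.WriteLine(entry.Attribute("value").Value);
        Console.WriteLine(entry.Value);
    }
}
```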
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool or in VB/C#/PHP/Delphi/Lisp/etc. to remove it or to translate it to &amp;.
This is the simplest and best approach. It works with all characters and lets you parse XML for any web service call, e.g. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '&amp;'.
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.
