How to escape xml content in a raw string? - c#

I am getting a string of 'xml' that contains some content that is unescaped. Here is a trivial example:
<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />
The problem I have is that when you try to load the above string using XmlDocument.LoadXml(), LoadXml() throws an exception because the inner quotes in the content of the 'text' attribute are not escaped. Is there a relatively painless way to escape just that content? Or am I going to have to parse it, escape it and rebuild it myself?
I'm not generating this text; I just receive it from another process as a string like this:
"<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />"

You need to use the HTML character encoding, where " is &quot;
But since your input is malformed XML, you have to find a way to parse that text and replace the inner quotes with their encoded form. Maybe some regex parsing.
Please consider this just a creative way to get the job done. I know it's dirty, but it will work in most cases:
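// Requires: using System.Text; using System.Text.RegularExpressions;
private static string XmlEncodeQuotes(string target)
{
    var result = new StringBuilder(target.Length);
    for (int i = 0; i < target.Length; i++)
    {
        if (target[i] == '"')
        {
            // A quote directly after '=' opens an attribute value
            bool opensValue = i > 0 && target[i - 1] == '=';
            // A quote followed by the next attribute or by the end of the tag closes a value
            bool closesValue = Regex.IsMatch(target.Substring(i), @"^""(\s+[a-zA-Z]+=""|\s*/?>)");
            if (!opensValue && !closesValue)
            {
                result.Append("&quot;");
                continue;
            }
        }
        result.Append(target[i]);
    }
    return result.ToString();
}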

Have you tried wrapping that portion of the XML document in a CDATA section?

Will System.Security.SecurityElement.Escape() work for you? If not, there is an XmlTextWriter as well.
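A minimal sketch of the SecurityElement.Escape suggestion, assuming you can get at the raw attribute value before it is placed into the markup. The names AttributeEscaping and BuildLink are made up for illustration. Escape replaces <, >, ", ' and & with their XML entities, so it has to be applied to the value only, not to the whole element string:

```csharp
using System;
using System.Security;

public static class AttributeEscaping
{
    // Hypothetical helper: builds the <link> element from a raw text value.
    public static string BuildLink(string rawText)
    {
        // Escape only the attribute value; escaping the whole string would
        // also turn the markup's own '<', '>' and '"' into entities.
        string safe = SecurityElement.Escape(rawText);
        return "<link text=\"" + safe + "\" linktype=\"external\" />";
    }
}
```

The result of AttributeEscaping.BuildLink(...) can then be passed to XmlDocument.LoadXml(). This does not help when the only thing you receive is the already assembled, malformed string.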

If you're simply asking how to escape a quote, that's done with
&quot;
I'm not sure what you're dealing with, but the root of your problem is the fact that the data you are receiving is malformed.
Option 1) Unless you clean up the data, you will have a hard time getting most parsers to load invalid XML data. Some are more forgiving than others. You might have some luck with the HTML Agility Pack
Option 2) Use Regular Expressions to fix your XML.
Option 3) If coding a parsing solution is not an option, use XSLT. Simply create a transform and then add a template to fix the issues.

Related

Using C# to convert incorrect html string to real html

My original issue is that I am trying to serialize a string containing html tags to an XML element.
hello World, this
is
a nice
test
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
However, I have 2 issues:
1. Serializing HTML to XML: I did not succeed in defining the Serializable class so that it serializes correctly with XmlSerializer, so I decided that using CDATA sections might be the better way. However, this is not correctly deserialized by the target tool (which I have no influence on). What I need is plain and correct HTML (XHTML?) within the XML output file.
2. The string looks e.g. as above, but is not fully correct HTML (no <p> tags, no <br> tags).
Now I would like to replace the newlines by a p or br tag. I have had a look here and used the suggested solution:
string result = "<p>" + text
.Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
.Replace(Environment.NewLine, "<br />")
.Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";
However, this does not generate valid HTML in all cases. In the example above, it would create <br />s between the <li> tags or put <ul> tags inside <p> tags, neither of which is allowed.
Target would be to have a result like the following (line breaks are only for better readability and don't matter here)
<p>hello World, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
Do you have any suggestion how to solve this either with a string.Replace, Regex, or better solution (HtmlDocument)?
Please note: I have no influence on deserialization; the XML output is evaluated by a tool I have no influence on, and it has to be UTF-8 encoded.
Thank you!
EDIT: Clearly separated the 2 issues
EDIT2: No influence on deserialization
EDIT3: Added target output
What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms that into a valid DOM, that a HTML parser can handle.
You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.
Or you can just encode the input HTML in such a way that it doesn't interfere with the XML that you're trying to put it in, like a CDATA section or base64-encoding the input would also suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
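The two encodings mentioned above can be sketched with LINQ to XML as follows. This is only an illustration: the class name HtmlPayload and the element name "content" are assumptions, and whether the consuming tool accepts either form has to be checked separately.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class HtmlPayload
{
    // Wraps the HTML in a CDATA section so the XML parser treats it as plain text.
    public static XElement AsCData(string html)
    {
        return new XElement("content", new XCData(html));
    }

    // Stores the UTF-8 bytes of the HTML as base64 text.
    public static XElement AsBase64(string html)
    {
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(html));
        return new XElement("content", encoded);
    }

    // Reverses AsBase64 on the receiving side.
    public static string FromBase64(XElement element)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
    }
}
```

In both cases the receiver gets the original string back unchanged, including characters such as & and < that would otherwise break the XML.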
I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:
1) Replace line breaks with HTML line breaks
string result = text.Replace(Environment.NewLine, "<br />");
2) Use the HTMLAgility pack to fix any invalid HTML
var doc = new HtmlDocument();
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
doc.OptionFixNestedTags = false;
doc.LoadHtml(result);
if (doc.ParseErrors.Count() > 0)
{
    // throw error
}
else
{
    // get fixed html
    result = doc.DocumentNode.OuterHtml;
}

Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also made use of httpUtility and Security Utility to escape the illegal characters, only problem is, it will escape < > which is needed to parse the XML.
My question is, how do I parse XML when the data contained in it contains illegal non-XML characters? (& -> &amp;)
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
If & is your only invalid character, then you can use a regex to replace it with &amp;. We use a regex to avoid replacing already existing &amp;, &quot;, &#111;, etc. symbols.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Here is a more general solution than the regex. First declare an array and store in it each invalid character that you want to replace with its encoded version:
var invalidChars = new[] { '&' /* other characters go here */ };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
    .Select(x =>
    {
        // Encode only the characters listed in invalidChars
        if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x.ToString());
        return x.ToString();
    }));
Then parse it using XDocument.Parse, that's all.

How to escape invalid characters inside XML string in C#

I have an XML string in C#. This XML has several tags. In some of these tags there are invalid characters like '&' in the text. I need to escape these characters inside the text from the whole long XML string but I want to keep the tags.
I have tried HttpUtility.HtmlEncode and a few other available methods, but they encode the whole string rather than just the text inside the tags. For example,
<node1>This is a string & so is this</node1> should be converted to
<node1>This is a string &amp; so is this</node1>
Any ideas? thanks
P.S. I know similar questions have been asked before, but I have not found a complete solution for this problem.
I guess the simplest solution is to load the whole Xml document in memory as an XmlDocument and then go through the elements and replace the values with their html encoded form.
You can use a CDATA section, like this:
<YourXml>
  <Id>1</Id>
  <Content>
    <![CDATA[
    your special characters
    ]]>
  </Content>
</YourXml>
I don't get what the big deal is here. When you have the entire XML as a string, the easiest way to achieve what you want is to use the Replace function.
For example, if the whole XML is in the string str, then all you have to do is
str = str.Replace("&", "&amp;");
That's it. Sometimes very simple solutions exist for big problems. Hope this helps.
XDocument or XmlDocument is the way to go. If, for some crazy reason outside your control, you need to encode just the text blocks inside an XmlElement:
using System.Text;
using System.Xml;
static string EncodeText(string unescapedText)
{
    if (string.IsNullOrEmpty(unescapedText))
    {
        return unescapedText;
    }
    var builder = new StringBuilder(unescapedText.Length);
    var settings = new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment };
    // Create is a static factory method declared on XmlWriter
    using (var writer = XmlWriter.Create(builder, settings))
    {
        writer.WriteValue(unescapedText);
    }
    return builder.ToString();
}

parsing XML with ampersand

I have a string which contains XML. I just want to parse it into an XElement, but it has an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
I even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&", then re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML; that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>
HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;). That one happens to be an XML entity reference; the XML entity references are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But you might also get things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first: make sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool, or code in VB/C#/PHP/Delphi/Lisp/etc., to remove it or to translate it to &amp;.
This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '&amp;'.
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

How Can I strip HTML from Text in .NET?

I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a Full Text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
EDIT
See my answer.
I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">", "> ");
    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // strip html decoded text from html
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}
Take a look at this Strip HTML tags from a string using regular expressions
Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method
TextReader tr = new StreamReader(@"Filepath");
string str = tr.ReadToEnd();
str = Regex.Replace(str, "<(.|\n)*?>", string.Empty);
But you need to have this namespace referenced:
System.Text.RegularExpressions
Just adapt this logic for your website.
If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:
public static string StripTags(string value)
{
    if (value == null)
        return string.Empty;
    // replace character entities such as &amp; with a space
    string pattern = @"&.{1,8};";
    value = Regex.Replace(value, pattern, " ");
    // remove the tags themselves
    pattern = @"<(.|\n)*?>";
    return Regex.Replace(value, pattern, string.Empty);
}
It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
You could:
Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
Use TinyMCE's built-in configuration options for stripping unwanted HTML.
Use HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.
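The server-side option from the list above could be sketched like this. It is a rough illustration rather than a robust HTML parser: the class name TagStripper is made up, WebUtility.HtmlDecode from System.Net is swapped in for HtmlDecode so the sketch has no dependency on System.Web, and each tag is replaced with a space (instead of an empty string) so that adjacent words do not run together.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        // Replace each tag with a space so adjacent words stay separated.
        string withoutTags = Regex.Replace(html, "<[^>]+>", " ");
        // Decode entities such as &amp; once the tags are gone.
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // Collapse runs of whitespace and trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

A regex like this can be fooled by malformed markup or by a > inside an attribute value, so for untrusted or messy input a real HTML parser remains the safer choice.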
As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?
You can use HTQL COM, and query the source with a query:
<body> &tx;
You can use something like this:
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);
