Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
When I try to parse the payload as XML, illegal-character exceptions are thrown.
I also tried HttpUtility and SecurityElement to escape the illegal characters; the only problem is that they also escape < and >, which are needed to parse the XML.
My question is: how do I parse XML when the data in it contains characters that are illegal in XML (for example, a bare & that should be &amp;)?
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>

If & is the only invalid character, you can use a regex to replace it with &amp;. The regex avoids replacing references that are already escaped, such as &amp;, &quot;, &#111;, etc.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);

Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.

Here is a more general solution than a regex. First declare an array holding each invalid character that you want to replace with its encoded form:
var invalidChars = new[] { '&' /* other characters go here */ };
Then read the whole XML file as text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid characters using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
    .Select(x =>
    {
        // HtmlEncode works on strings, so convert the char first
        if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x.ToString());
        return x.ToString();
    }));
Then parse it using XDocument.Parse. Note that this also re-encodes the & of entities that are already present (&amp; becomes &amp;amp;), so it only suits input that contains no escaped entities.

Related

How to write '&' in xml?

I am using XmlTextWriter to create the XML.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', and XmlTextWriter automatically writes it as "&amp;".
Is there any workaround for this?
I am creating the XML by reading a doc file. If I read "–", then I need to write "&ndash;" in the XML, but it is written out as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good–bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.

In a proper XML file, you cannot have a standalone & character unless it starts an entity or character reference. So if you need an XML node to contain good&ndash;bad, then it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the file as plain text exactly how you want it, but then it cannot be read by an XML parser because it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good&ndash;bad</node>");
// other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result is not a valid XML file.
As I mentioned, if you tried to read this with an XML parser, it would fail because &ndash; is not a predefined XML entity, so the file is not valid XML.
&ndash; is an HTML character entity; the en dash itself has no special meaning in XML, so escaping it in XML should not normally be necessary.
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use a bare & character because it has a special meaning, and a single & would be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Wikipedia has a reference list of all the character entities in XML (and HTML). Each is always represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
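The escaping behaviour described above can be seen in a minimal sketch. The class and method names (EscapeDemo, WriteNode) are hypothetical and not from the question; the sketch only assumes the standard XmlWriter API and that the output goes to a string.

```csharp
using System.Text;
using System.Xml;

public static class EscapeDemo
{
    // Writes <node>text</node> and returns the serialized fragment.
    public static string WriteNode(string text)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("node");
            // WriteString escapes markup characters such as & and < itself
            writer.WriteString(text);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Calling EscapeDemo.WriteNode("good & bad") yields <node>good &amp; bad</node>, and a parser reading that back returns the original text. An en dash passed the same way is written as the literal character, which is valid XML without any named entity.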

In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entity again. &lt; for < and &gt; for > are other examples; related markup languages such as HTML define many more of these.

I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

How to escape xml content in a raw string?

I am getting a string of 'xml' that contains some content that is unescaped. Here is a trivial example:
<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />
The problem I have is when you try to convert the above as a string using XmlDocument.LoadXml(), LoadXml() throws an exception because of the lack of escaping on the inner quotes for the content held by attribute 'text'. Is there a relatively painless way to escape the content specifically? Or am I just going to have to parse it/escape it/rebuild it myself?
I'm not generating this text; I just get it from another process as a string like this:
"<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />"

You need to use the character entity for the quote: " becomes &quot;
But since your input is malformed XML text, you have to find a way to parse that text and replace the inner quotes with their encoded form. Maybe some regex parsing.
Please consider this just a creative way to do the job. I know it's dirty, but it will work in most cases:
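// requires: using System.Text; using System.Text.RegularExpressions;
private static string XmlEncodeQuotes(string target) {
    var result = new StringBuilder();
    for (int i = 0; i < target.Length; i++)
    {
        // A quote is kept as markup when it opens a value (follows '=') or closes one
        // (followed by the next attribute name or by the end of the tag);
        // any other quote is treated as content and encoded.
        if (target[i] == '"' && i > 0 && target[i - 1] != '='
            && !Regex.IsMatch(target.Substring(i), @"^""(\s+[a-zA-Z]+=""|\s*/?>)"))
        {
            result.Append("&quot;");
            continue;
        }
        result.Append(target[i]);
    }
    return result.ToString();
}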

Have you tried wrapping that portion of the XML document in a CDATA section?

Will System.Security.SecurityElement.Escape() work for you? If not, there is an XmlTextWriter as well.

If you're simply asking how to escape a quote, that's done with
&quot;
I'm not sure what you're dealing with, but the root of your problem is the fact that the data you are receiving is malformed.
Option 1) Unless you clean up the data, you will have a hard time getting most parsers to load invalid XML data. Some are more forgiving than others. You might have some luck with the HTML Agility Pack.
Option 2) Use regular expressions to fix your XML.
Option 3) If coding a parsing solution is not an option, use XSLT. Simply create a transform and then add a template to fix the issues.

Having trouble taking out all the newline, tab, and carriage return between two tags

I have been working on this for almost a day now, but I'm not able to take out all the newlines, tabs, and carriage returns between ">" and "<".
This is a sample XML file I'm reading:
<Consequence_Note>
<Text>In some cases, integer coercion errors can lead to exploitable buffer
overflow conditions, resulting in the execution of arbitrary
code.</Text>
</Consequence_Note>
and this
<Consequence_Scope>Availability</Consequence_Scope>
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
My goal is to take out all the newlines, tabs, and carriage returns between a tag's closing ">" and the next "<". The only thing I'm able to achieve is to take out all the \n\t\r between ">" and "<" when there's nothing else in between the two tags. I'm not able to take out the \n\t\r when there are other characters in between the two tags.
I need help with a regular expression that will take out all the newlines, tabs, and carriage returns between ">" and "<".
For example:
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
What I would like to have is:
<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>
This is my code (I'm reading from a xml file):
String file = @"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, @">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);

Don't parse HTML/XML with regular expressions (see "RegEx match open tags except XHTML self-contained tags")!
Use an XML reader for XML, or HtmlAgilityPack (or some other HTML tool) for HTML.
XML/HTML documents are complex enough that a regex will not always do the job correctly (it works in some cases, but not in general).

If you first read the document using an XmlReader with IgnoreWhitespace enabled, it will drop the insignificant whitespace between tags. Then you can simply write it back out with the correct writer settings.
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
A regex alternative can probably be built, but it will still have lots and lots of issues with XML containing CDATA, comments and other constructs which make XML hard to parse to begin with. If your XML is very structured, machine generated and unchanging, you could create a regex to fix it, but on the other hand, you might also be able to fix the generator. The simplest regex that might work:
\s{2,}
replace with
[ ]
That finds any run of whitespace longer than one character and replaces it with a single space. There is no need to treat any other whitespace inside tags differently; the XmlReader should take care of that.
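As an alternative to running a regex over the raw file, the same collapsing can be applied only to text nodes after parsing. This is a sketch under assumptions: it uses LINQ to XML rather than the XmlReader settings linked above, the names WhitespaceNormalizer and Normalize are hypothetical, and it presumes the input is well-formed.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class WhitespaceNormalizer
{
    // Collapses every run of whitespace inside text nodes to a single space.
    public static string Normalize(string xml)
    {
        // With the default LoadOptions, whitespace-only text between tags is dropped
        var doc = XDocument.Parse(xml);

        // CDATA sections are left alone because their content is meant to be literal
        var textNodes = doc.DescendantNodes()
                           .OfType<XText>()
                           .Where(t => !(t is XCData))
                           .ToList();

        foreach (var text in textNodes)
        {
            text.Value = Regex.Replace(text.Value, @"\s+", " ");
        }

        // DisableFormatting keeps the serializer from adding new indentation
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the regex only ever sees text-node values, tags, comments and attributes are never touched. Leading and trailing spaces are deliberately not trimmed, since in mixed content they can be significant.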

Ascii to XML Character set conversion

Are there any classes to convert ASCII to the XML character set, preferably open source? I will be using the class in either VC++ or C#.
My ASCII text has some printable characters which are not in the XML character set.
I tried to send a resume which is in the ASCII character set and store it in an online CRM, and I got this error message:
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "&#x13" is an invalid XML character.]
Thanks in advance

I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by replacing the offending characters with a question mark. You could replace them with any string, or with an empty string to remove them.
The XML spec allows 0x09 (TAB), 0x0A (LF, line feed or new line), and 0x0D (CR, carriage return). The regex above takes care not to remove those.
The XML 1.1 Spec allows you to escape some of these characters.
For example: using &#x3; for 0x03 shows up as a non-printable character in HTML and as L in Office documents and Notepad.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the escape markup. So Mike&#x3;TeeVee still shows up as Mike&#x3;TeeVee in Excel instead of something like MikeTeeVee or MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is this is a bug in the current OpenXML which encodes the allowed XML ASCII characters, but allows the unsupported ASCII characters to slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
    delegate(Match m)
    {
        int code = m.Value[0];
        // 0x00 is not supported in 1.0 or 1.1; surrogates and U+FFFE/U+FFFF cannot be escaped either
        return (code == 0 || code > 0x1F)
            ? ""
            : ("&#x" + code.ToString("X") + ";"); // the reference must use the hexadecimal value
    });
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
    delegate(Match m)
    {
        int code = m.Value[0];
        // 0x00 is not supported in 1.0 or 1.1; surrogates and U+FFFE/U+FFFF are dropped as well
        return (code == 0 || code > 0x1F)
            ? ""
            : ("_x" + code.ToString("X4") + "_"); // four hexadecimal digits, as in _x0000_
    });

Your text won't have any printable characters which aren't available in XML - but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab, carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
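If dropping the offending characters is acceptable, a filter along these lines is one option. It is a sketch, assuming .NET 4.0 or later for XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair; the names XmlSanitizer and StripInvalidXmlChars are hypothetical.

```csharp
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Returns the input with every character that XML 1.0 does not allow removed.
    public static string StripInvalidXmlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Ordinary valid character (includes tab, LF and CR)
                sb.Append(c);
            }
            else if (i + 1 < input.Length
                     && XmlConvert.IsXmlSurrogatePair(input[i + 1], c))
            {
                // Valid surrogate pair: keep both halves and skip the low one
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            // Anything else (control characters, lone surrogates) is dropped
        }
        return sb.ToString();
    }
}
```

For example, StripInvalidXmlChars("a\u0013b\tc") returns "ab\tc": the U+0013 control character from the error message is removed while the tab survives.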

The character reference &#x13; does indeed refer to a character that is not valid in XML. You probably want either &#xD; or &#13;.

Out of curiosity, I took a few minutes to write a simple routine in C# to pump out an XML string of the 128 ASCII characters. To my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway, here is the code (comments are welcome):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000>&#x0;</ASCII_000>
<ASCII_001>&#x1;</ASCII_001>
<ASCII_002>&#x2;</ASCII_002>
<ASCII_003>&#x3;</ASCII_003>
<ASCII_004>&#x4;</ASCII_004>
<ASCII_005>&#x5;</ASCII_005>
<ASCII_006>&#x6;</ASCII_006>
<ASCII_007>&#x7;</ASCII_007>
<ASCII_008>&#x8;</ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011>&#xB;</ASCII_011>
<ASCII_012>&#xC;</ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014>&#xE;</ASCII_014>
<ASCII_015>&#xF;</ASCII_015>
<ASCII_016>&#x10;</ASCII_016>
<ASCII_017>&#x11;</ASCII_017>
<ASCII_018>&#x12;</ASCII_018>
<ASCII_019>&#x13;</ASCII_019>
<ASCII_020>&#x14;</ASCII_020>
<ASCII_021>&#x15;</ASCII_021>
<ASCII_022>&#x16;</ASCII_022>
<ASCII_023>&#x17;</ASCII_023>
<ASCII_024>&#x18;</ASCII_024>
<ASCII_025>&#x19;</ASCII_025>
<ASCII_026>&#x1A;</ASCII_026>
<ASCII_027>&#x1B;</ASCII_027>
<ASCII_028>&#x1C;</ASCII_028>
<ASCII_029>&#x1D;</ASCII_029>
<ASCII_030>&#x1E;</ASCII_030>
<ASCII_031>&#x1F;</ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&amp;</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060>&lt;</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>&gt;</ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>@</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML declaration with "us-ascii" encoding

Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.

You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the standard .NET library. Can you give a simple example?

parsing XML with ampersand

I have a string which contains XML, and I just want to parse it into an XElement, but it has an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XmlException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
I even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);

Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&", then re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);

Your string doesn't contain valid XML; that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>

HtmlEncode will not do the trick; it will probably create even more ampersands. For instance, a " might become &quot;, which is an XML entity reference. The predefined ones are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But you might also get things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first by making sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.

Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;

The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool or in VB/C#/PHP/Delphi/Lisp/etc. to remove it or to translate it to &amp;.

This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}

If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
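Given that strictness, one pragmatic pattern is to attempt a strict parse first and repair only when it fails. The sketch below reuses the bare-ampersand regex idea from the first thread on this page; the names LenientXml and ParseLenient are hypothetical, and it assumes stray ampersands are the only defect in the input.

```csharp
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class LenientXml
{
    // Matches an & that does not start a predefined entity or a numeric character reference
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-fA-F\d]+);)");

    public static XElement ParseLenient(string text)
    {
        try
        {
            // Well-formed input goes through untouched
            return XElement.Parse(text);
        }
        catch (XmlException)
        {
            // Escape stray ampersands and try once more; any other defect still throws
            return XElement.Parse(BareAmpersand.Replace(text, "&amp;"));
        }
    }
}
```

With this, LenientXml.ParseLenient("<a>x & y</a>").Value is "x & y", while input that is already valid is never rewritten.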

You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you, as it will encode your '<' and '>' symbols as well, and your string will no longer be XML.
I think that for this case the best solution would be to replace '&' with '&amp;'.

Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.
