I am trying to parse the markup below in C# with XmlDocument, but I can't load it: it reports invalid characters. Even the browser doesn't display it correctly and complains about invalid characters. I need to loop through all elements in this string.
Can someone please advise what's wrong here?
<div><b>Q1.
What is your name?:</b> BTB (Build the bank)</div>
<div><b>Q2.
How old are you?:</b> 29</div>
code is this:
XmlDocument xml = new XmlDocument();
xml.Load(item.Summary);
error is: "Illegal characters in path."
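XmlDocument.Load expects a file name (or URL) to load the XML from, which is why you get a path error. Try LoadXml, which parses a string.
The text "BTB (Build the bank)" does not need its own tag: text mixed with child elements is well-formed XML.
However, XML must have a single top-level (root) element, and this fragment has two <div> elements at the top level.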
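As a minimal sketch of the points above, the fragment can be wrapped in a synthetic root element and parsed with LoadXml, after which the elements can be looped over. The class name FragmentLoader, its method names, and the wrapper element name "root" are hypothetical choices for illustration, and the sketch assumes the fragment is otherwise well-formed (no unescaped & and no HTML-only entities).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FragmentLoader
{
    // Wraps a multi-root fragment in a synthetic root element
    // (the name "root" is an arbitrary choice) and parses it.
    public static XmlDocument LoadFragment(string fragment)
    {
        var doc = new XmlDocument();
        // LoadXml parses a string; Load would treat it as a file name.
        doc.LoadXml("<root>" + fragment + "</root>");
        return doc;
    }

    // Collects the text of every top-level <div> in the fragment.
    public static List<string> DivTexts(string fragment)
    {
        var texts = new List<string>();
        foreach (XmlNode node in LoadFragment(fragment).DocumentElement.ChildNodes)
        {
            if (node.NodeType == XmlNodeType.Element && node.Name == "div")
                texts.Add(node.InnerText);
        }
        return texts;
    }
}
```

Calling FragmentLoader.DivTexts(item.Summary) would then return one string per question/answer pair.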
Question Background:
I have an XML response from a web service (that I am unable to control the content of) that I would like to validate. For example, often the response will have a URL in it that has query string parameters using a "&".
Code:
The following code gives an example of escaping an XML string with illegal characters. This will indeed produce an escaped string:
string xml = "<node>it's my \"node\" & i like it<node>";
string encodedXml = System.Security.SecurityElement.Escape(xml);
// RESULT: &lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it&lt;node&gt;
If I now attempt to load this escaped XML into a new XmlDocument, I receive an error saying that the first character of the XML is not valid:
var doc = new XmlDocument();
// Error will occur here.
doc.LoadXml(encodedXml);
Error output:
Data at the root level is invalid. Line 1, position 1.
How do I load this escaped XML into an XML Document object?
This is not a valid XML document:
&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it&lt;node&gt;
When you escape the angle brackets on the tags, they are no longer treated as tags by the XML parser. It's all just text in an element -- but there's no element containing it. In XML, there must be a root element. That's a requirement. It may be an arbitrary requirement, and that may be unjust, but you'll never win an argument with a parser.
What you're doing is like giving this to a C# compiler:
string s = \"foo\" bar\";
The outer quotes shouldn't be escaped.
This is what you want:
string xml = "<node>it's my &quot;node&quot; &amp; i like it</node>";
Note also that your original XML was broken already:
string xml = "<node>it's my \"node\" & i like it<node>";
Your "closing" tag isn't a closing tag. It should be </node>, not <node>.
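To illustrate the distinction between markup and text, here is a small sketch that escapes only the text content and leaves the tags alone. NodeBuilder and BuildNode are hypothetical names; SecurityElement.Escape is the same call used in the question.

```csharp
using System;
using System.Security;
using System.Xml;

public static class NodeBuilder
{
    // Builds <node>...</node> around arbitrary text and parses it.
    public static XmlDocument BuildNode(string text)
    {
        // Escape only the text content; the tags stay as real markup.
        string xml = "<node>" + SecurityElement.Escape(text) + "</node>";
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc;
    }
}
```

After loading, DocumentElement.InnerText gives back the original, unescaped text, because the parser resolves the entities.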
If you are receiving a response from another web application / API / service, it is likely that the contents are Html encoded.
Take a look at the WebUtility class, particularly, HtmlDecode and UrlDecode. This is likely to convert your "string" data to proper Xml.
If you're receiving valid XML back from the service you can convert the response using something like this:
//...
XDocument doc;
using (WebResponse response = request.GetResponse())
using (var reader = new System.IO.StreamReader(response.GetResponseStream()))
{
    // Read the whole response body and parse it as XML.
    doc = XDocument.Parse(reader.ReadToEnd());
}
If you're receiving invalid XML from a service which should return valid XML, contact whoever owns/provides that service / raise a support ticket with them in the appropriate way.
Any other action is a hack. Sometimes that may be required (e.g. when you're dealing with a legacy system that's no longer supported with bugs that have never been corrected), but pursue the non-hacky routes first.
I am using Windows.Data.Xml.Dom.XmlDocument to parse an xml string.
The code is simple
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlString);
The problem is that it throws an exception when it encounters some specific characters. An example is below. (Yes the XML I am parsing is actually html but it has to be parsed as XML)
This string throws the exception
<div>So schnell. So vielf&auml;ltig. Soo lecker!</div>
These do not
<div>So schnell. So vielfltig. Soo lecker!</div>
<div>So schnell. So vielf&lt;ltig. Soo lecker!</div>
These are the message and type of the exception.
Exception from HRESULT: 0xC00CE002 System.Exception
I don't know why only specific characters trigger the exception. Can anybody help?
XML does not support HTML's named character entities, and the one you are using is an HTML-only entity. The entities predefined in XML and in HTML are listed here:
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML
XML only predefines five named entities: quot, amp, apos, lt and gt.
For any other special character you will have to use a numeric character reference (for example the hex form &#xE4; for ä), or the literal character itself, in order for the string to load as XML.
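One way to apply this is to rewrite HTML-only named entities as hex character references before parsing. The sketch below is an assumption-laden illustration, not a complete HTML-to-XML converter: HtmlEntityFixer and ToNumericReferences are hypothetical names, and it relies on System.Net.WebUtility.HtmlDecode to look up the entity names it knows about.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlEntityFixer
{
    // Matches named entities such as &auml; (not numeric ones like &#228;).
    private static readonly Regex NamedEntity = new Regex("&([A-Za-z][A-Za-z0-9]*);");

    public static string ToNumericReferences(string input)
    {
        return NamedEntity.Replace(input, match =>
        {
            string name = match.Groups[1].Value;
            // The five entities XML predefines are left untouched.
            if (name == "quot" || name == "amp" || name == "apos" || name == "lt" || name == "gt")
                return match.Value;

            string decoded = WebUtility.HtmlDecode(match.Value);
            // Names HtmlDecode does not recognise come back unchanged;
            // leave them so the XML parser still reports them.
            if (decoded == match.Value)
                return match.Value;

            // Emit one hex reference per code point of the decoded text.
            var sb = new StringBuilder();
            for (int i = 0; i < decoded.Length; i += char.IsSurrogatePair(decoded, i) ? 2 : 1)
                sb.Append("&#x").Append(char.ConvertToUtf32(decoded, i).ToString("X")).Append(';');
            return sb.ToString();
        });
    }
}
```

The returned string can then be passed to LoadXml. Decoding the whole string with HtmlDecode instead would also decode &lt; and &amp;, which could break the markup, so the sketch converts only the HTML-specific names.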
I ran into a problem reading an XML file. I found a solution, but I still have some questions. The incorrect XML file is encoded in UTF-8 and declares this in its header, but it also includes one character, 'é', encoded in UTF-16. This code was used to read the XML file in order to validate its content:
var xDoc = XDocument.Load(taxFile);
For the incorrect XML file it raises an exception: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the incorrect file, but the 'é' character is loaded as �. My first question is: why does it work?
Another point is that XmlReader doesn't raise an exception until the node containing 'é' is read.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workaround with StreamReader helps, which raises the same question.
The fix doesn't seem good enough, because one day an XML file in another encoding may appear and be processed incorrectly. However, I tried processing a UTF-16 encoded XML file and it worked fine (with the reader configured for UTF-8).
The final question is whether XDocument/XmlReader offer any options to ignore character encoding errors, or something similar.
Looking forward to your replies. Thanks in advance.
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without an exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
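The difference between the two fallback behaviours can be sketched as follows. DecodingDemo and its members are hypothetical names, and the sample bytes are an assumption made for illustration: they contain a single 0xE9 byte (an 'é' in Latin-1), which is not a valid UTF-8 sequence in that position, rather than the exact bytes of the file in the question.

```csharp
using System;
using System.Text;

public static class DecodingDemo
{
    // "caf" + 0xE9 + "!" : the 0xE9 byte is not valid UTF-8 here.
    public static readonly byte[] Sample = { 0x63, 0x61, 0x66, 0xE9, 0x21 };

    // Encoding.UTF8 uses replacement fallback: invalid bytes are
    // swapped for a replacement character instead of failing.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // A UTF8Encoding created with throwOnInvalidBytes: true throws a
    // DecoderFallbackException when it meets invalid input.
    public static string DecodeStrict(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        return strict.GetString(bytes);
    }
}
```

Passing the strict encoding to a StreamReader should surface the error again instead of hiding it, which is usually preferable when validating files.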
I am trying to load an XML file into an XmlDocument object. The problem is that it says "invalid character in the given encoding", which is correct, because the declaration of the XML file I have as input says only <?xml version="1.0" ?>, and no encoding is specified.
The question mark is an actual question mark. I made a little utility to search through the document and find the character that is giving the trouble, when I find it and display it on a label, it is a question mark surrounded by a black box..
What I am asking is, I still need to load this file and analyze it, any help on how to do it?
Any configuration of my xmlDocument object that I must specify ?
Thanks!
Doing this worked :
StreamReader sr = new StreamReader(TxtPath.Text, true);
XDocument document = XDocument.Load(sr);
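Passing true as the second argument makes the StreamReader detect the encoding from the file's byte order mark (defaulting to UTF-8 when there is none). Bytes that are invalid in that encoding are replaced rather than rejected, which is why the strange characters no longer stop an XML file with no declared encoding from loading.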
In C#, is there a way to work out an XmlNode's position in the original XML 'text', when the document is loaded from a file or string? I want to be able to report problems with an XML document that I am processing.
e.g:
"Error in foo.xml - value of attribute 'pet' must be a species of fluffy mammal, at line 27, column 13 [snippet of original XML text here...]"
Edit:
The checks can't be done using schema validation. Here is another, less frivolous sample error message to illustrate: "specified addin type 'Addins.LogWindow' must be public"
Well you're not supposed to write your own XmlParser but in the Compact Framework we have no choice as XmlDocument is as slow as the Dalai Lama on ketamine so we use an XmlReader when parsing an Xml file.
We throw an exception whenever we find something messed up or inconsistent and we pass the XmlReader to the exception. We then can extract the line position by casting the XmlReader into a IXmlLineInfo object which contains properties for the line and position.
I don't know if this will help. Generally I wouldn't write my own XML parser on the desktop, which is why I'm reluctant to suggest this as a solution.
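The IXmlLineInfo approach described above might look roughly like this. It is a sketch only: PositionReporter, FindBadPet, and the rule being checked (the 'pet' attribute must equal one allowed value) are made up for illustration, and a reader created over a non-text source may not provide line information at all.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PositionReporter
{
    // Returns a message for the first element whose 'pet' attribute is
    // not the allowed value, or null if every element passes the check.
    public static string FindBadPet(string xml, string allowed)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element) continue;
                string pet = reader.GetAttribute("pet");
                if (pet == null || pet == allowed) continue;

                // Readers that track positions also implement IXmlLineInfo.
                var info = reader as IXmlLineInfo;
                if (info != null && info.HasLineInfo())
                    return $"invalid pet '{pet}' at line {info.LineNumber}, column {info.LinePosition}";
                return $"invalid pet '{pet}' (no position available)";
            }
        }
        return null;
    }
}
```

The same interface is implemented by LINQ to XML nodes when the document is loaded with LoadOptions.SetLineInfo, which may be an option when a full reader loop is not wanted.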
Would a XML Schema work for you?
http://support.microsoft.com/kb/318504
Sorry, there are very few DOM implementations that will remember the original parsed location of a Node for you. Most only report any position information on a parsing error. For example in DOM Level 3 LS you only get a reference to a DOMLocator when there is a DOMError.
The only implementation I know of that keeps track after parsing is pxdom, and that's for Python, so not of much use to you.