C#: shield XmlTextReader from an occasional Unicode character

C#: shield XmlTextReader from an occasional Unicode character - c#

In C#, I have a XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);

It is not a Unicode character, it is an invalid character (not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to either
Fix the server side to properly encode characters
Pre-process the text to do it yourself
According to UTF8, all such characters ("é") are encoded with 2 or 3 bytes (or more). You can use a hex editor to verify it.

What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.

Related

XmlException: Invalid character in the given encoding

I have an UTF-8 encoded xml
<?xml version="1.0" encoding="UTF-8"?>
When using below version of xml reader. I am assuming this uses UTF-8 enoding to parse xml file.
using (XmlReader reader = XmlReader.Create(inputUri))
I am getting below exception.
System.Xml.XmlException occurred
HResult=-2146232000
LineNumber=18750
LinePosition=13
Message=Invalid character in the given encoding. Line 18750, position 13.
But when using below version of xmlreader
using (XmlReader reader = XmlReader.Create(new StreamReader(inputUri,Encoding.UTF8)))
The xml gets parsed successfully. Why such differences between these two versions given both uses same encoding to parse the given xml file??
PS: I am pretty much sure the first version uses UTF-8 endoding.
Below is the snippet from XmlTextReaderImpl.cs whose instance is returned by the first version.
private void SetupEncoding( Encoding encoding ) {
if ( encoding == null ) {
Debug.Assert( ps.charPos == 0 );
ps.encoding = Encoding.UTF8;
ps.decoder = new SafeAsciiDecoder(); // This falls back to UTF-8 decoder
}
}

I got the answer in msdn forum.
"XmlReader will mark any illegal character as illegal because the XML format is broken.
On the second case, because StreamReader is a general purpose Text reader, when it encounters data that is not within range defined by Encoding, it replace the character with a replacement fallback. And therefore when you pass the resulting stream to XmlReader, all characters it can see now falls in legal range defined by the encoding."

using (XmlReader reader = XmlReader.Create(inputUri))
The above will use the encoding of the XmlReader and will ignore the encoding declaration of the file.
Which is why the exception occurs, and is why the second method works - as you provide a UTF-8 encoding to use.
N.B. I think that the default encoding is UTF-16

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

I faced a problem with reading the XML. The solution was found, but there are still some questions. The incorrect XML file is in encoded in UTF-8 and has appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read XML file for validating its content:
var xDoc = XDocument.Load(taxFile);
It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".
Another point is using XmlReader doesn't raise exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workout with StreamReader helps. The same question.
It seems like the fix solution is not good enough, cause one day :) XML encoded in another format may appear and it could be proceed in the wrong way. BUT I've tried to process UTF-16 formatted XML file and it worked fine (configured to UTF-8).
The final question is if there are any options to be provided for XDocument/XmlReader to ignore characters encoding or smth like this.
Looking forward for your replies. Thanks in advance

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallbackProperty. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.

Error parsing XML string to XDocument

I have this XML string bn:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
I am trying to convert it to an XDocument like this:
var doc = XDocument.Parse(bn);
However, I get this error:
Data at the root level is invalid. Line 1, position 1.
Am I missing something?
UPDATE:
This is the method I use to create the xml string:
public static string SerializeObjectToXml(Root rt)
{
var memoryStream = new MemoryStream();
var xmlSerializer = new XmlSerializer(typeof(Root));
var xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
xmlSerializer.Serialize(xmlTextWriter, rt);
memoryStream = (MemoryStream)xmlTextWriter.BaseStream;
string xmlString = ByteArrayToStringUtf8(memoryStream.ToArray());
xmlTextWriter.Close();
memoryStream.Close();
memoryStream.Dispose();
return xmlString;
}
It does add to the start that I have to remove. Could I change something to make it correct from the start?

There is two characters at the beginning of your string that, although you can't see them, are still there and make the string fail. Try this instead:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
The character in question is this. This is a byte-order mark, basically telling the program reading it if it's big or little endian. It seems like you copied and pasted this from a file that wasn't decoded properly.
To remove it, you could use this:
yourString.Replace(((char)0xFEFF).ToString(), "")

You have two unprintable characters (Zero-Width No-break Space) at the beginning of your string.
XML does not allow text outside the root element.

The accepted answer does unnecessary string processing, but, in its defense, it's because you're unnecessarily dealing in string when you don't have to. One of the great things about the .NET XML APIs is that they have robust internals. So instead of trying to feed a string to XDocument.Parse, feed a Stream or some type of TextReader to XDocument.Load. This way, you aren't fooling with manually managing the encoding and any problems it creates, because the internals will handle all of that stuff for you. Byte-order marks are a pain in the neck, but if you're dealing in XML, .NET makes it easier to handle them.

Loading XML to an XDocument with a URL containing an ampersand

XDocument xd = XDocument.Load("http://www.google.com/ig/api?weather=vilnius&hl=lt");
The ampersand & isn't a supported character in a string containing a URL when calling the Load() method. This error occurs:
XmlException was unhandled: Invalid character in the given encoding
How can you load XML from a URL into an XDocument where the URL has an ampersand in the querystring?

You need to URL-encode it as &:
XDocument xd = XDocument.Load(
"http://www.google.com/ig/api?weather=vilnius&hl=lt");
You might be able to get away with using WebUtility.HtmlEncode to perform this conversion automatically; however, be careful that this is not the intended use of that method.
Edit: The real issue here has nothing to do with the ampersand, but with the way Google is encoding the XML document using a custom encoding and failing to declare it. (Ampersands only need to be encoded when they occur within special contexts, such as the <a href="…" /> element of (X)HTML. Read Ampersands (&'s) in URLs for a quick explanation.)
Since the XML declaration does not specify the encoding, XDocument.Load is internally falling back to default UTF-8 encoding as required by XML specification, which is incompatible with the actual data.
To circumvent this issue, you can fetch the raw data and decode it manually using the sample below. I don’t know whether the encoding really is Windows-1252, so you might need to experiment a bit with other encodings.
string url = "http://www.google.com/ig/api?weather=vilnius&hl=lt";
byte[] data;
using (WebClient webClient = new WebClient())
data = webClient.DownloadData(url);
string str = Encoding.GetEncoding("Windows-1252").GetString(data);
XDocument xd = XDocument.Parse(str);

There is nothing wrong with your code - it is perfectly OK to have & in the query string, and it is how separate parameters are defined.
When you look at the error you'll see that it fails to load XML, not to query it from the Url:
XmlException: Invalid character in the given encoding. Line 1, position 473
which clearly points outside of your query string.
The problem could be "Apsiniaukę" (notice last character) in the XML response...

instead of "&" use "&" or "&" . and it will work fine .

.NET XmlDocument LoadXML and Entities

When loading XML into an XmlDocument, i.e.
XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);
is there any way to stop the process from replacing entities? I've got a strange problem where I've got a TM symbol (stored as the entity #8482) in the xml being converted into the TM character. As far as I'm concerned this shouldn't happen as the XML document has the encoding ISO-8859-1 (which doesn't have the TM symbol)
Thanks

This is a standard misunderstanding of the XML toolset. The whole business with "&#x", is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters - it has been freed of character encoding issues - instead it contains an abstract model of XML type data. Words for this include DOM and InfoSet, I'm not sure exactly which is accurate.
The "&#x" gubbins won't exist in this model because the whole issue is irrelevant, it will return - if appropriate - when you transform the Info Set back into a character stream in some specific encoding.
This misunderstanding is sufficiently common to have made it into academic literature as part of a collection of similar quirks. Take a look at "Xml Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795

What are you writing it to? A TextWriter? a Stream? what?
The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects the unicode and uses that instead:
XmlDocument doc = new XmlDocument();
doc.LoadXml(#"<xml>™</xml>");
using (MemoryStream ms = new MemoryStream())
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriter xw = XmlWriter.Create(ms, settings);
doc.Save(xw);
xw.Close();
Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
}
Outputs:
<?xml version="1.0" encoding="iso-8859-1"?><xml>™</xml>

I confess things get a little confusing with XML documents and encodings, but I'd hope that it would get set appropriate when you save it again, if you're still using ISO-8859-1 - but that if you save with UTF-8, it wouldn't need to. In some ways, logically the document really contains the symbol rather the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)
What are you doing with the document after loading it?

I beleive if you enclose the entity contents in the CDATA section it should leave it all alone e.g.
<root>
<testnode>
<![CDATA[some text ™]]>
</testnode>
</root>

Entity references are not encoding specific. According to the W3C XML 1.0 Recommendation:
If the character reference begins with
"&#x", the digits and letters up to
the terminating ; provide a
hexadecimal representation of the
character's code point in ISO/IEC
10646.

The &#xxxx; entities are considered to be the character they represent. All XML is converted to unicode on reading and any such entities are removed in favor of the unicode character they represent. This includes any occurance for them in unicode source such as the string passed to LoadXML.
Similarly on writing any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point trying to preserve them.
A common mistake is expect to get a String from a DOM by some means that uses an encoding other then unicode. That just doesn't happen regardless of what the

Thanks for all of the help.
I've fixed my problem by writing a HtmlEncode function which actually replaces all of the characters before it spits them out to the webpage (instead of relying on the somewhat broken HtmlEncode() .NET function which only seems to encode a small subset of the characters necessary)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C#: shield XmlTextReader from an occasional Unicode character - c#

Related

XmlException: Invalid character in the given encoding

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

Error parsing XML string to XDocument

Loading XML to an XDocument with a URL containing an ampersand

.NET XmlDocument LoadXML and Entities

Categories

Resources