XmlException: Invalid character in the given encoding - c#

I have a UTF-8 encoded XML file:
<?xml version="1.0" encoding="UTF-8"?>
When I use the XmlReader.Create overload below, I assume it uses UTF-8 encoding to parse the XML file.
using (XmlReader reader = XmlReader.Create(inputUri))
I get the following exception:
System.Xml.XmlException occurred
HResult=-2146232000
LineNumber=18750
LinePosition=13
Message=Invalid character in the given encoding. Line 18750, position 13.
But when I use the following overload of XmlReader.Create:
using (XmlReader reader = XmlReader.Create(new StreamReader(inputUri,Encoding.UTF8)))
The XML is parsed successfully. Why do these two versions behave differently, given that both use the same encoding to parse the XML file?
PS: I am fairly sure the first version uses UTF-8 encoding.
Below is a snippet from XmlTextReaderImpl.cs, an instance of which is returned by the first version.
private void SetupEncoding( Encoding encoding ) {
    if ( encoding == null ) {
        Debug.Assert( ps.charPos == 0 );
        ps.encoding = Encoding.UTF8;
        ps.decoder = new SafeAsciiDecoder(); // This falls back to UTF-8 decoder
    }
}

I got the answer in msdn forum.
"XmlReader will mark any illegal character as illegal because the XML format is broken.
On the second case, because StreamReader is a general purpose Text reader, when it encounters data that is not within range defined by Encoding, it replace the character with a replacement fallback. And therefore when you pass the resulting stream to XmlReader, all characters it can see now falls in legal range defined by the encoding."
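The difference described in the quoted answer can be reproduced in memory. The following is a minimal sketch, not code from the original posts: the names EncodingDemo, BuildBrokenXml, ReadStrict and ReadLenient are hypothetical, and the tiny document with one stray 0xE9 byte is an assumed stand-in for the real file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingDemo
{
    // Builds a document that declares UTF-8 but contains a single 0xE9 byte
    // ('é' in Latin-1), which is not a valid UTF-8 sequence in this position.
    public static byte[] BuildBrokenXml()
    {
        byte[] head = Encoding.ASCII.GetBytes(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>caf");
        byte[] tail = Encoding.ASCII.GetBytes("</r>");
        byte[] all = new byte[head.Length + 1 + tail.Length];
        head.CopyTo(all, 0);
        all[head.Length] = 0xE9;
        tail.CopyTo(all, head.Length + 1);
        return all;
    }

    // XmlReader decodes the raw bytes itself and rejects the invalid byte.
    public static string ReadStrict(byte[] xml)
    {
        using (var stream = new MemoryStream(xml))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    // StreamReader decodes first, replacing the invalid byte, so XmlReader
    // only ever sees characters that are legal in XML.
    public static string ReadLenient(byte[] xml)
    {
        using (var stream = new MemoryStream(xml))
        using (var text = new StreamReader(stream, Encoding.UTF8))
        using (XmlReader reader = XmlReader.Create(text))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

In this sketch ReadStrict throws an XmlException, while ReadLenient returns the text with U+FFFD in place of the bad byte. The lenient path therefore hides the problem rather than fixing it: the original character is lost.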

using (XmlReader reader = XmlReader.Create(inputUri))
The above will use the encoding of the XmlReader and will ignore the encoding declaration of the file.
Which is why the exception occurs, and is why the second method works - as you provide a UTF-8 encoding to use.
N.B. I think that the default encoding is UTF-16

Related

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

I ran into a problem reading an XML file. I found a solution, but some questions remain. The faulty XML file is encoded in UTF-8 and declares this in its header, but it also contains a character encoded in UTF-16: 'é'. This code was used to read the XML file in order to validate its content:
var xDoc = XDocument.Load(taxFile);
It raises an exception for the faulty XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the faulty file, but the 'é' character is loaded as �. My first question is: why does it work?
Another point is that XmlReader doesn't raise the exception until the node containing 'é' is read.
XmlReader xmlTax = XmlReader.Create(filePath);
Again, the StreamReader workaround helps. The same question applies.
The fix doesn't seem good enough, because one day an XML file in another encoding may appear and be processed incorrectly. However, I tried processing a UTF-16 encoded XML file and it worked fine (with the reader configured for UTF-8).
The final question is whether XDocument/XmlReader offer any option to ignore character encoding errors, or something similar.
Looking forward to your replies. Thanks in advance.
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without an exception using StreamReader, it's because Encoding contains settings that control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that understands the UTF-16 byte sequence.
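To make the fallback behaviour concrete, here is a minimal sketch contrasting Encoding.UTF8 with a UTF8Encoding instance created with throwOnInvalidBytes set to true. The class name FallbackDemo and the three-byte sample are assumptions for illustration; whether XDocument.Load() is built exactly this way internally is not confirmed here.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // 'a', a stray 0xE9 byte (not valid UTF-8 in this position), 'b'.
    public static readonly byte[] Sample = { 0x61, 0xE9, 0x62 };

    // Encoding.UTF8 uses a replacement fallback, so decoding never throws;
    // the undecodable byte is replaced instead.
    public static string DecodeLenient(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // A UTF8Encoding created with throwOnInvalidBytes: true uses an
    // exception fallback and throws on the same input.
    public static string DecodeStrict(byte[] data)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        return strict.GetString(data);
    }
}
```

With the sample bytes, DecodeLenient yields a three-character string whose middle character is the replacement character U+FFFD (the � seen in the question), while DecodeStrict throws a DecoderFallbackException, which derives from ArgumentException.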

Loading XML to an XDocument with a URL containing an ampersand

XDocument xd = XDocument.Load("http://www.google.com/ig/api?weather=vilnius&hl=lt");
The ampersand & isn't a supported character in a string containing a URL when calling the Load() method. This error occurs:
XmlException was unhandled: Invalid character in the given encoding
How can you load XML from a URL into an XDocument where the URL has an ampersand in the querystring?
You need to URL-encode it as &amp;:
XDocument xd = XDocument.Load(
"http://www.google.com/ig/api?weather=vilnius&amp;hl=lt");
You might be able to get away with using WebUtility.HtmlEncode to perform this conversion automatically; however, be careful that this is not the intended use of that method.
Edit: The real issue here has nothing to do with the ampersand, but with the way Google is encoding the XML document using a custom encoding and failing to declare it. (Ampersands only need to be encoded when they occur within special contexts, such as the <a href="…" /> element of (X)HTML. Read Ampersands (&'s) in URLs for a quick explanation.)
Since the XML declaration does not specify the encoding, XDocument.Load internally falls back to the default UTF-8 encoding, as required by the XML specification, which is incompatible with the actual data.
To circumvent this issue, you can fetch the raw data and decode it manually using the sample below. I don’t know whether the encoding really is Windows-1252, so you might need to experiment a bit with other encodings.
string url = "http://www.google.com/ig/api?weather=vilnius&hl=lt";
byte[] data;
// Download the raw bytes so the encoding can be chosen manually.
using (WebClient webClient = new WebClient())
    data = webClient.DownloadData(url);
string str = Encoding.GetEncoding("Windows-1252").GetString(data);
XDocument xd = XDocument.Parse(str);
There is nothing wrong with your code - it is perfectly OK to have & in the query string, and it is how separate parameters are defined.
When you look at the error you'll see that it fails while loading the XML, not while querying it from the URL:
XmlException: Invalid character in the given encoding. Line 1, position 473
which clearly points outside of your query string.
The problem could be "Apsiniaukę" (notice last character) in the XML response...
Instead of "&", use an escaped form such as "&amp;" and it will work fine.

C#: shield XmlTextReader from an occasional Unicode character

In C#, I have an XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);
It is not a Unicode character, it is an invalid character (not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to do one of the following:
Fix the server side to encode characters properly.
Pre-process the text to do it yourself.
In UTF-8, every non-ASCII character such as "é" is encoded with two or more bytes. You can use a hex editor to verify this.
What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.
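As a rough illustration of both suggestions (checking the bytes, then decoding with the encoding the data really uses before parsing), here is a minimal sketch. The names PreDecode, Hex and ParseAs are hypothetical, and ISO-8859-1 is only an assumed example of what the server might actually be sending.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class PreDecode
{
    // Shows how a string is laid out in bytes under a given encoding,
    // as an alternative to opening the file in a hex editor.
    public static string Hex(string s, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(s));
    }

    // Decodes the raw bytes with the encoding they really use, then parses
    // the resulting string. Because the parser receives characters rather
    // than bytes, a mismatched encoding declaration no longer matters.
    public static XDocument ParseAs(byte[] raw, Encoding actual)
    {
        string text = actual.GetString(raw);
        return XDocument.Parse(text);
    }
}
```

For example, PreDecode.Hex("\u00E9", Encoding.UTF8) gives "C3-A9", whereas the same character in ISO-8859-1 is the single byte "E9". A lone E9 byte in a file declared as UTF-8 is what a strict reader rejects.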

XmlDocument.Load() method fails to decode € (euro)

I have an XML document, file.xml, which is encoded in ISO-8859-15 (also known as Latin-9):
<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
<f>€.txt</f>
</root>
From my favorite text editor, I can tell this file is correctly encoded in ISO-8859-15 (it is not UTF-8).
My software is written in C# and wants to extract the element f.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml");
In real life, I have an XmlResolver to set credentials, but basically my code is as simple as that. Loading goes smoothly and no exception is raised.
Now, my problem arises when I extract the value:
// xnsm is the XmlNamespaceManager
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
// A declaration cannot be the sole body of an if statement, so assign conditionally.
string filename = (n != null) ? n.InnerText : null;
The Visual Studio debugger displays filename = □.txt
This could be just a Visual Studio display issue. Unfortunately, File.Exists(filename) returns false, even though the file actually exists.
What's wrong?
If I remember correctly the XmlDocument.Load(string) method always assumes UTF-8, regardless of the XML encoding.
You would have to create a StreamReader with the correct encoding and use that as the parameter.
xmlDoc.Load(new StreamReader(
    File.OpenRead("file.xml"), // File.Open would also need a FileMode argument
    Encoding.GetEncoding("iso-8859-15")));
EDIT:
I just stumbled across KB308061 from Microsoft. There's an interesting passage:
Specify the encoding declaration in
the XML declaration section of the XML
document. For example, the following
declaration indicates that the
document is in UTF-16 Unicode encoding
format:
<?xml version="1.0" encoding="UTF-16"?>
Note that this declaration only
specifies the encoding format of an
XML document and does not modify or
control the actual encoding format of
the data.
Don't just use the debugger or the console to display the string as a string.
Instead, dump the contents of the string, one character at a time. For example:
foreach (char c in filename)
{
Console.WriteLine("{0}: {1:x4}", c, (int) c);
}
That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.
Use the Unicode code charts to look up the characters specified.
Does your XML define its encoding correctly? It declares encoding="iso-8859-15"; is that the encoding you call Iso-latin-15?
Ideally, you should put your content inside a CDATA section, so the XML would look like <f><![CDATA[€.txt]]></f>
Ideally, you should also escape all special characters with their equivalent URL-encoded (or HTTP-encoded) values, because XML is typically used for communicating over HTTP.
I don't know the exact escape code for €, but it would be something of this sort:
<f><![CDATA[%3E.txt]]></f>
The above should make € be communicated correctly through the xml.

How to change character encoding of XmlReader

I have a simple XmlReader:
XmlReader r = XmlReader.Create(fileName);
while (r.Read())
{
Console.WriteLine(r.Value);
}
The problem is that the XML file contains ISO-8859-9 characters, which makes XmlReader throw an "Invalid character in the given encoding." exception. I can solve this by adding a <?xml version="1.0" encoding="ISO-8859-9" ?> line at the beginning, but I'd like to solve it another way in case I can't modify the source file. How can I change the encoding of XmlReader?
To force .NET to read the file in as ISO-8859-9, just use one of the many XmlReader.Create overloads, e.g.
using(XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9")))) {
while(r.Read()) {
Console.WriteLine(r.Value);
}
}
However, that may not work because, IIRC, the W3C XML standard says something about when the XML declaration line has been read, a compliant parser should immediately switch to the encoding specified in the XML declaration regardless of what encoding it was using before. In your case, if the XML file has no XML declaration, the encoding will be UTF-8 and it will still fail. I may be talking nonsense here so try it and see. :-)
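One way to "try it and see" without touching real files is an in-memory check. The sketch below is an assumption-laden illustration rather than a definitive answer: ForcedEncodingDemo and ReadRootText are hypothetical names, the document deliberately has no XML declaration, and ISO-8859-1 stands in for ISO-8859-9 because it is available on every .NET runtime (on .NET Core and later, other code pages may first require registering CodePagesEncodingProvider).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ForcedEncodingDemo
{
    // Reads the text of the root element, decoding the bytes with the
    // caller-supplied encoding instead of letting XmlReader detect it.
    public static string ReadRootText(byte[] xml, Encoding forced)
    {
        using (var stream = new MemoryStream(xml))
        using (var text = new StreamReader(stream, forced))
        using (XmlReader reader = XmlReader.Create(text))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    // Builds a declaration-less document whose only non-ASCII character
    // is 'ü', stored as the single byte 0xFC.
    public static byte[] SampleBytes()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetBytes("<r>\u00FC</r>");
    }
}
```

Calling ReadRootText(SampleBytes(), Encoding.GetEncoding("ISO-8859-1")) returns the expected character, whereas passing the same bytes to XmlReader.Create as a raw stream fails, because without a declaration or byte order mark the reader assumes UTF-8.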
The XmlTextReader class (which is what the static Create method is actually returning, since XmlReader is the abstract base class) is designed to automatically detect encoding from the XML file itself - there's no way to set it manually.
Simply ensure that you include the following XML declaration in the file you are reading:
<?xml version="1.0" encoding="ISO-8859-9"?>
If you can't ensure that the input file has the right header, you could look at one of the other 11 overloads to the XmlReader.Create method.
Some of these take an XmlReaderSettings variable or XmlParserContext variable, or both. I haven't investigated these, but there is a possibility that setting the appropriate values might help here.
There is the XmlReaderSettings.CheckCharacters property - the help for this states:
Instructs the reader to check characters and throw an exception if any characters are outside the range of legal XML characters. Character checking includes checking for illegal characters in the document, as well as checking the validity of XML names (for example, an XML name may not start with a numeral).
So setting this to false might help. However, the help also states:
If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.
So further investigation is warranted.
Use an XmlTextReader instead of an XmlReader:
System.Text.Encoding.UTF8.GetString(YourXmlTextReader.Encoding.GetBytes(YourXmlTextReader.Value))
