XmlDocument.Load() method fails to decode € (euro) - c#

I have an XML document file.xml which is encoded in ISO-8859-15 (aka Latin-9):
<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
<f>€.txt</f>
</root>
From my favorite text editor, I can tell this file is correctly encoded in ISO-8859-15 (it is not UTF-8).
My software is written in C# and wants to extract the element f.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml");
In real life, I have an XmlResolver to set credentials. But basically, my code is as simple as that. The loading goes smoothly; no exception is raised.
Now, my problem when I extract the value:
//xnsm is the XmlNameSpace manager
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
if (n != null)
    String filename = n.InnerText;
The Visual Studio debugger displays filename = □.txt
It could just be a Visual Studio display issue. Unfortunately, File.Exists(filename) returns false, whereas the file actually exists.
What's wrong?

If I remember correctly the XmlDocument.Load(string) method always assumes UTF-8, regardless of the XML encoding.
You would have to create a StreamReader with the correct encoding and use that as the parameter.
xmlDoc.Load(new StreamReader(
    File.Open("file.xml", FileMode.Open),
    Encoding.GetEncoding("iso-8859-15")));
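For reference, here is a self-contained sketch of the StreamReader approach. The file contents, the `d` namespace prefix, and the code-pages registration are assumptions for the sake of a runnable example; note also that because the sample document declares a default namespace, the original `//root/f` XPath needs a mapped prefix before it can match anything.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class EuroXmlDemo
{
    public static string RoundTrip()
    {
        // Assumption: on modern .NET, legacy code pages such as iso-8859-15
        // require the code-pages provider; on .NET Framework they are
        // available without this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin9 = Encoding.GetEncoding("iso-8859-15");

        // Write a test file: the euro sign is the single byte 0xA4 in iso-8859-15.
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-15\"?>\n"
                   + "<root xmlns=\"http://stackoverflow.com/demo\"><f>\u20AC.txt</f></root>";
        File.WriteAllText("file.xml", xml, latin9);

        // Load through a StreamReader that decodes iso-8859-15 explicitly.
        XmlDocument xmlDoc = new XmlDocument();
        using (StreamReader reader = new StreamReader("file.xml", latin9))
        {
            xmlDoc.Load(reader);
        }

        // The document declares a default namespace, so the XPath needs a
        // mapped prefix to match anything at all.
        XmlNamespaceManager xnsm = new XmlNamespaceManager(xmlDoc.NameTable);
        xnsm.AddNamespace("d", "http://stackoverflow.com/demo");
        XmlNode n = xmlDoc.SelectSingleNode("//d:root/d:f", xnsm);
        return n == null ? null : n.InnerText;
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip()); // €.txt
    }
}
```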
EDIT:
I just stumbled across KB308061 from Microsoft. There's an interesting passage:
Specify the encoding declaration in the XML declaration section of the XML document. For example, the following declaration indicates that the document is in UTF-16 Unicode encoding format:
<?xml version="1.0" encoding="UTF-16"?>
Note that this declaration only specifies the encoding format of an XML document and does not modify or control the actual encoding format of the data.

Don't just use the debugger or the console to display the string as a string.
Instead, dump the contents of the string, one character at a time. For example:
foreach (char c in filename)
{
    Console.WriteLine("{0}: {1:x4}", c, (int) c);
}
That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.
Use the Unicode code charts to look up the characters specified.
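A runnable version of the loop above, with a sample string standing in for the value read from the XML (if the decode worked, the first character should be U+20AC, not U+003F '?' or U+FFFD):

```csharp
using System;

class CharDump
{
    static void Main()
    {
        // Sample string; in the question this would be n.InnerText.
        string filename = "\u20AC.txt";
        foreach (char c in filename)
        {
            // Prints each char alongside its Unicode code point in hex,
            // e.g. the first line here is "€: 20ac".
            Console.WriteLine("{0}: {1:x4}", c, (int)c);
        }
    }
}
```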

Does your XML define its encoding correctly? encoding="iso-8859-15" .. is that ISO-Latin-15?
Ideally, you should put your content inside a CDATA section, so the XML would look like <f><![CDATA[€.txt]]></f>.
Ideally, you should also escape all special characters with equivalent URL-encoded values, because XML is typically used for communicating over HTTP.
I don't know the exact escape code for € .. but it would be something of this sort:
<f><![CDATA[%3E.txt]]></f>
The above should make € be communicated correctly through the XML.

Related

Unicode to Windows-1251 Conversion with XML(HTML)-escaping

I have an XML file and need to produce an HTML file with Windows-1251 encoding by applying an XSL transformation. The problem is that Unicode characters of the XSL file are not converted to HTML Unicode escape sequences like "&#1171;" during the XSL transformation; only a "?" sign is written instead of them. How can I ask the XslCompiledTransform.Transform method to do this conversion? Or is there any method to write an HTML string into a Windows-1251 HTML file with HTML Unicode escape sequences applied, so that I can perform the XSL transformation to a string and then use this method to write it to a file with Windows-1251 encoding and with HTML escaping of all Unicode characters (something like Convert("ғ") would return "&#1171;")?
XmlReader xmlReader = XmlReader.Create(new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>"));
XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");
using (XmlTextWriter xmlWriter = new XmlTextWriter("result.html", Encoding.GetEncoding("Windows-1251")))
{
    xslTrans.Transform(xmlReader, xmlWriter); // it writes a Windows-1251 HTML file but does not escape Unicode characters, just writes "?" signs
}
Thanks to everyone for the help!
UPDATE
My output configuration tag in XSL-file:
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />
I do not even hope now that XSL will satisfy my needs. But I wonder why there is no method to check whether a character is representable in a specified encoding. Something like
Char.IsEncodable('ғ', Encoding.GetEncoding("Windows-1251"))
My current solution is to convert all characters greater than 127 (c > 127) to &#dddd; escape strings, but my chief is not satisfied with the solution, because the source of the generated HTML file is not readable.
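Nothing like `Char.IsEncodable` exists in the BCL, but a small helper along those lines can be built from the encoder fallback machinery. A sketch (the helper name is made up; us-ascii is used in the demo so it runs without registering the legacy code-pages provider, but with the provider registered "windows-1251" works the same way):

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: returns true if every character of s can be
    // represented in the given encoding without falling back.
    public static bool IsEncodable(string s, Encoding encoding)
    {
        // Clone the encoding with a throwing fallback instead of the
        // default one, which silently substitutes '?'.
        Encoding strict = Encoding.GetEncoding(
            encoding.WebName,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}

class Probe
{
    static void Main()
    {
        Console.WriteLine(EncodingProbe.IsEncodable("abc", Encoding.ASCII));    // True
        Console.WriteLine(EncodingProbe.IsEncodable("\u0493", Encoding.ASCII)); // False
    }
}
```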
Do note that XML is both a data model and a serialization format. The data can use a different character set than the serialization of this data.
It looks like the key reason for your problem is that your serialization process is trying to limit the character set of the data model, whereas you would like to set the character set of the serialization format. Let's have an example: <band>Motörhead</band> and <band>Mot&#xF6;rhead</band> are equal XML documents. They have the same structure and exactly the same data. Because of the heavy metal umlaut, the character set of the data is Unicode (or something bigger than ASCII), but because of the use of a character reference (&#xF6;), the character set of the latter serialization form of the document is ASCII. In order to process this data, your XML tools still need to be Unicode aware in both cases, but when using the latter serialization, the I/O and file transfer tools don't need to be Unicode aware.
My guess is that by telling the XmlTextWriter to use Windows-1251 encoding, it in practice tries to limit the character set of the data to the characters contained in Windows-1251, discarding all the characters outside this character set and writing a ? character instead.
However, since you produce your XML document by an XSL transformation, you can control the character set of the serialization directly in your XSLT document. This is done by adding an encoding attribute to the xsl:output element. Modify it to look like this:
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="windows-1251"/>
Now the XSLT processor takes care of the serialization to the reduced character set and outputs a character reference for all characters in the data that are not included in windows-1251.
If changing the character set of the data is really what you need, then you need to process your data with a suitable character conversion library that can guess the most suitable replacement character (like ö -> o).
try to complement your XSL file with replacement rules à la
<xsl:value-of select="replace(.,'ғ','&amp;#1171;')"/>
you may wish to do this using regex patterns instead:
<xsl:value-of select="replace(.,'&#(\d+);','&amp;#$1;')"/>
your problem originates with the XML parser, which substitutes the numeric entity references with the corresponding Unicode chars before the transformation takes place; thus the unknown chars (resp. '?') end up in your converted document.
hope this helps,
best regards,
carsten
The correct solution would be to write the file in a Unicode encoding (such as UTF-8) and forget about CP-1251 and all other legacy encodings.
But I will assume that this is not an option for some reason.
The best alternative that I can devise is to do the character replacements in the string before handing it to the XmlReader. You should use the Encoding class to convert the string to an array of bytes in CP-1251, and create your own encoder fallback mechanism. The fallback mechanism can then insert the XML escape sequences. This way you are guaranteed to handle all (and exactly those) characters that are not in CP-1251.
Then you can convert the array of bytes (in CP-1251) into a normal .NET String (in UTF-16) and hand it to your XmlReader. The values that need to be escaped will already be escaped, so the final file should be written correctly.
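The fallback mechanism described above can be sketched with a custom EncoderFallback that replaces each unencodable character with an XML numeric character reference (&#dddd;). All class names here are made up, and us-ascii stands in for CP-1251 so the sketch runs without registering the legacy code-pages provider:

```csharp
using System;
using System.Text;

// Fallback that emits "&#dddd;" for any character the target encoding
// cannot represent.
class NcrFallback : EncoderFallback
{
    public override int MaxCharCount { get { return 12; } } // "&#1114111;" fits
    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new NcrFallbackBuffer();
    }
}

class NcrFallbackBuffer : EncoderFallbackBuffer
{
    private string replacement = "";
    private int pos;

    public override bool Fallback(char charUnknown, int index)
    {
        replacement = "&#" + (int)charUnknown + ";";
        pos = 0;
        return true; // tell the encoder to drain this buffer instead
    }

    public override bool Fallback(char high, char low, int index)
    {
        replacement = "&#" + char.ConvertToUtf32(high, low) + ";";
        pos = 0;
        return true;
    }

    public override char GetNextChar()
    {
        return pos < replacement.Length ? replacement[pos++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (pos == 0) return false;
        pos--;
        return true;
    }

    public override int Remaining { get { return replacement.Length - pos; } }
}

class NcrDemo
{
    public static string Escape(string s, string encodingName)
    {
        Encoding enc = Encoding.GetEncoding(
            encodingName, new NcrFallback(), new DecoderExceptionFallback());
        // ASCII decode is safe here because us-ascii output is pure ASCII;
        // for a real code page, decode with that code page instead.
        return Encoding.ASCII.GetString(enc.GetBytes(s));
    }

    static void Main()
    {
        Console.WriteLine(Escape("The Wizard \u0493", "us-ascii")); // The Wizard &#1171;
    }
}
```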
UPDATE
I just realized the flaw of this method. The XmlWriter will further escape the & characters as &amp;, so the escapes themselves will appear in the final document rather than the characters they represent.
This may require some very complicated solution!
ANOTHER UPDATE
Ignore that last update. Since you are reading the string in as XML, the escapes should be interpreted correctly. This is what I get for trying to post quickly rather than thinking through the problem!
My proposed solution should work fine.
Have you tried specifying the encoding in the xsl:output?
(http://www.w3schools.com/xsl/el_output.asp)
The safest and most interoperable way to do this is to specify encoding="us-ascii" in your xsl:output element. Most XSLT processors support writing this encoding.
US-ASCII is a completely safe encoding as it is a compatible subset of UTF-8 (you may elect to label the emitted XML as having a "utf-8" encoding, as this will also be true: this can be done by specifying omit-xml-declaration="yes" for your xsl:output and manually prepending an "<?xml version='1.0' encoding='utf-8'?>" declaration to your output).
This approach works because when using US-ASCII encoding, a serializer is forced to use XML's escaping mechanism for characters beyond U+007F, and so will emit them as numeric character references (the "&#.....;" form).
When dealing with environments in which non-standard encodings are required, it is generally a good defensive technique to produce this kind of XML as it is completely conformant and works in practice with even some buggy consuming software.
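A minimal sketch of the us-ascii approach with XslCompiledTransform (the identity stylesheet and the sample data are assumptions, not the asker's actual files):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

class XslOutputDemo
{
    // Identity transform whose xsl:output forces us-ascii serialization.
    const string Xslt =
        "<xsl:stylesheet version=\"1.0\" " +
        "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\" encoding=\"us-ascii\"/>" +
        "<xsl:template match=\"/\"><xsl:copy-of select=\".\"/></xsl:template>" +
        "</xsl:stylesheet>";

    public static string Transform(string inputXml)
    {
        XslCompiledTransform t = new XslCompiledTransform();
        t.Load(XmlReader.Create(new StringReader(Xslt)));

        using (XmlReader input = XmlReader.Create(new StringReader(inputXml)))
        using (MemoryStream ms = new MemoryStream())
        {
            // When the output is a Stream, the serializer honors xsl:output
            // and emits numeric character references for anything the
            // encoding cannot represent.
            t.Transform(input, null, ms);
            return Encoding.ASCII.GetString(ms.ToArray());
        }
    }

    static void Main()
    {
        // The Cyrillic \u0493 comes out as a character reference.
        Console.WriteLine(Transform("<Name>The Wizard \u0493</Name>"));
    }
}
```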

How do I stop XElement.Save from escaping characters?

I'm populating an XElement with information and writing it to an XML file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these characters escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
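A quick illustration of the Value property round trip (the element name and content are made up):

```csharp
using System;
using System.Xml.Linq;

class ValueDemo
{
    public static string ReadPassword()
    {
        // The serialized form stores '>' escaped as &gt;; Value hands
        // back the unescaped text.
        XElement e = XElement.Parse("<pwd>p&gt;ss</pwd>");
        return e.Value;
    }

    static void Main()
    {
        Console.WriteLine(ReadPassword()); // p>ss
    }
}
```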
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test>Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
Considering that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that is processing the XML output of your program.

How do I avoid reading the byte order mark (BOM) in a Resources file in Visual Studio?

I am trying to use the Visual Studio editor to create XML files in the Resources area of an assembly in C#. The files appear perfectly correct in the XML editor and honour my schema (recognising the elements and attributes). However, when I try to read them (from the Resources) they fail because they consistently have 3 spurious bytes at the start of the file (#EF #BB #BF).
These characters do NOT appear in the editor but they are there in an external binary editor. When I remove them manually the files behave properly.
What can I do to create XML files reliably in the Resources area?
After first 2 replies I modified the question to
"How do I read a resources file to avoid including the byte order mark?"
The XML editor creates an XML file by default with the encoding UTF-8 and adds the XML declaration:
<?xml version="1.0" encoding="utf-8" ?>
Presumably it also adds the BOM (which in UTF-8 is the 3 bytes above). The following method (found by a friend) reads the bytes without having to know the encoding:
String ss = new StreamReader( new MemoryStream(bytes), true ).ReadToEnd();
and this now does not try to parse the BOM as part of the content.
They're not spurious. They're the byte order mark indicating UTF-8.
The editor is placing the well-known Unicode marker known as the BOM (byte order mark) at the start of the file. This is used to show which Unicode encoding the file is using - in this case UTF-8, but depending on the actual encoding the byte values will be different.
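The StreamReader trick from the question can be demonstrated in isolation; the byte array below is a hypothetical resource ("<a/>" prefixed with the UTF-8 BOM bytes EF BB BF):

```csharp
using System;
using System.IO;
using System.Text;

class BomDemo
{
    public static string ReadWithoutBom()
    {
        // "<a/>" preceded by the UTF-8 BOM.
        byte[] bytes = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };

        // detectEncodingFromByteOrderMarks: true - the reader recognizes
        // the BOM, selects the encoding accordingly, and does not return
        // the BOM as part of the content.
        using (StreamReader reader = new StreamReader(new MemoryStream(bytes), true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadWithoutBom()); // <a/>
    }
}
```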

How to change character encoding of XmlReader

I have a simple XmlReader:
XmlReader r = XmlReader.Create(fileName);
while (r.Read())
{
    Console.WriteLine(r.Value);
}
The problem is, the Xml file has ISO-8859-9 characters in it, which makes XmlReader throw "Invalid character in the given encoding." exception. I can solve this problem with adding <?xml version="1.0" encoding="ISO-8859-9" ?> line in the beginning but I'd like to solve this in another way in case I can't modify the source file. How can I change the encoding of XmlReader?
To force .NET to read the file in as ISO-8859-9, just use one of the many XmlReader.Create overloads, e.g.
using (XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9"))))
{
    while (r.Read())
    {
        Console.WriteLine(r.Value);
    }
}
However, that may not work because, IIRC, the W3C XML standard says that once the XML declaration has been read, a compliant parser should immediately switch to the encoding specified there, regardless of what encoding it was using before. In your case, if the XML file has no XML declaration, the encoding will default to UTF-8 and it will still fail. I may be talking nonsense here, so try it and see. :-)
The XmlTextReader class (which is what the static Create method is actually returning, since XmlReader is the abstract base class) is designed to automatically detect encoding from the XML file itself - there's no way to set it manually.
Simply ensure that you include the following XML declaration in the file you are reading:
<?xml version="1.0" encoding="ISO-8859-9"?>
If you can't ensure that the input file has the right header, you could look at one of the other 11 overloads to the XmlReader.Create method.
Some of these take an XmlReaderSettings variable or XmlParserContext variable, or both. I haven't investigated these, but there is a possibility that setting the appropriate values might help here.
There is the XmlReaderSettings.CheckCharacters property - the help for this states:
Instructs the reader to check characters and throw an exception if any characters are outside the range of legal XML characters. Character checking includes checking for illegal characters in the document, as well as checking the validity of XML names (for example, an XML name may not start with a numeral).
So setting this to false might help. However, the help also states:
If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.
So further investigation is warranted.
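Following up on the XmlParserContext idea above: one of its constructors takes an Encoding that is used until (unless) an XML declaration says otherwise. A sketch, with the input bytes built by hand ('ş', U+015F, is the single byte 0xFE in ISO-8859-9; the code-pages provider registration is needed on modern .NET but not on .NET Framework):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class ParserContextDemo
{
    public static string ReadText()
    {
        // "<a>ş</a>" encoded in ISO-8859-9, with no XML declaration.
        byte[] bytes = { (byte)'<', (byte)'a', (byte)'>', 0xFE,
                         (byte)'<', (byte)'/', (byte)'a', (byte)'>' };

        // Legacy code pages need the provider on modern .NET.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The context supplies the default encoding for the fragment.
        XmlParserContext context = new XmlParserContext(
            null, null, null, XmlSpace.None, Encoding.GetEncoding("ISO-8859-9"));

        using (XmlReader r = new XmlTextReader(
            new MemoryStream(bytes), XmlNodeType.Element, context))
        {
            while (r.Read())
            {
                if (r.NodeType == XmlNodeType.Text)
                {
                    return r.Value;
                }
            }
        }
        return null;
    }

    static void Main()
    {
        Console.WriteLine(ReadText()); // ş
    }
}
```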
Use an XmlTextReader instead of an XmlReader:
System.Text.Encoding.UTF8.GetString(YourXmlTextReader.Encoding.GetBytes(YourXmlTextReader.Value))

.NET XmlDocument LoadXML and Entities

When loading XML into an XmlDocument, i.e.
XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);
is there any way to stop the process from replacing entities? I've got a strange problem where a TM symbol (stored as the entity &#8482;) in the XML is being converted into the ™ character. As far as I'm concerned this shouldn't happen, as the XML document has the encoding ISO-8859-1 (which doesn't have the ™ symbol).
Thanks
This is a standard misunderstanding of the XML toolset. The whole business with "&#x…;" is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters - it has been freed of character-encoding issues - instead it contains an abstract model of XML-type data. Words for this include DOM and Infoset; I'm not sure exactly which is accurate.
The "&#x…;" gubbins won't exist in this model because the whole issue is irrelevant; it will return - if appropriate - when you transform the Infoset back into a character stream in some specific encoding.
This misunderstanding is sufficiently common to have made it into academic literature as part of a collection of similar quirks. Take a look at "Xml Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795
What are you writing it to? A TextWriter? a Stream? what?
The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects the unicode and uses that instead:
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<xml>&#8482;</xml>");
using (MemoryStream ms = new MemoryStream())
{
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
    XmlWriter xw = XmlWriter.Create(ms, settings);
    doc.Save(xw);
    xw.Close();
    Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
}
Outputs:
<?xml version="1.0" encoding="iso-8859-1"?><xml>&#8482;</xml>
I confess things get a little confusing with XML documents and encodings, but I'd hope that it would get set appropriately when you save it again, if you're still using ISO-8859-1 - but that if you save with UTF-8, it wouldn't need to. In some ways, logically the document really contains the symbol rather than the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)
What are you doing with the document after loading it?
I believe if you enclose the entity contents in a CDATA section it should leave it all alone, e.g.
<root>
  <testnode>
    <![CDATA[some text &#8482;]]>
  </testnode>
</root>
Entity references are not encoding-specific. According to the W3C XML 1.0 Recommendation:
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646.
The &#xxxx; entities are considered to be the character they represent. All XML is converted to Unicode on reading, and any such entities are removed in favor of the Unicode character they represent. This includes any occurrence of them in Unicode source such as the string passed to LoadXml.
Similarly, on writing, any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point trying to preserve them.
A common mistake is to expect to get a String from a DOM by some means that uses an encoding other than Unicode. That just doesn't happen, regardless of what the original document's encoding was.
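A minimal check of the read-side behavior described above (the element name is arbitrary):

```csharp
using System;
using System.Xml;

class EntityDemo
{
    public static string Read()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<xml>&#8482;</xml>");
        // In the DOM the reference is already resolved to the character.
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        Console.WriteLine(Read() == "\u2122"); // True
    }
}
```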
Thanks for all of the help.
I've fixed my problem by writing an HtmlEncode function which actually replaces all of the characters before it spits them out to the webpage (instead of relying on the somewhat broken HtmlEncode() .NET function, which only seems to encode a small subset of the necessary characters).
