XslCompiledTransform uses UTF-16 encoding - c#

I have the following code, which should output XML data using UTF-8 encoding, but it always outputs the data as UTF-16:
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(XmlReader.Create(new StringReader(xsltString), new XmlReaderSettings()));
StringBuilder sb = new StringBuilder();
XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.Encoding = Encoding.UTF8;
writerSettings.Indent = true;
xslt.Transform(XmlReader.Create(new StringReader(inputXMLToTransform)), XmlWriter.Create(sb, writerSettings));

The XML declaration in the output is based on the encoding of the underlying output target, not the encoding specified in the settings. A StringBuilder holds .NET string data, which is UTF-16, so the declaration will say UTF-16. The workaround is to suppress the declaration and add it yourself instead:
writerSettings.OmitXmlDeclaration = true;
Then when you get the result from the StringBuilder:
string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n" + sb.ToString();

If you use a MemoryStream in place of the StringBuilder, the XmlWriter will respect the encoding you specify in the XmlWriterSettings, since the MemoryStream doesn't have an inherent encoding like the StringBuilder does.
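A minimal sketch of the MemoryStream approach, assuming a small made-up stylesheet and input (the `xsltString` and `inputXml` values below are placeholders, not the asker's data). It uses `new UTF8Encoding(false)` so that no BOM ends up at the start of the decoded string; that choice is an assumption about what the caller wants.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical stylesheet and input, for illustration only.
string xsltString = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""/""><out><xsl:value-of select=""/in""/></out></xsl:template>
</xsl:stylesheet>";
string inputXml = "<in>caf\u00E9</in>";

var xslt = new XslCompiledTransform();
using (var styleReader = XmlReader.Create(new StringReader(xsltString)))
{
    xslt.Load(styleReader);
}

var settings = new XmlWriterSettings
{
    Encoding = new UTF8Encoding(false), // UTF-8 without a BOM
    Indent = true
};

string xml;
using (var ms = new MemoryStream())
{
    using (var input = XmlReader.Create(new StringReader(inputXml)))
    using (var writer = XmlWriter.Create(ms, settings))
    {
        // The writer targets a byte stream, so the declaration follows settings.Encoding.
        xslt.Transform(input, writer);
    }
    // Decode the bytes with the same encoding that was used to write them.
    xml = Encoding.UTF8.GetString(ms.ToArray());
}

Console.WriteLine(xml);
```

The string is only produced at the end by decoding the bytes, so the declaration and the actual byte encoding stay in agreement.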

Related

XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter

I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly, and are instead replaced with "??".
Here is sample code that demonstrates how the XML is written from the XmlDocument:
MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream,enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();
What can be done to correctly encode Kanji in this specific situation?
As mentioned in comments, the ? character is showing up because Kanji characters are not supported by the encoding ISO-8859-1, so it substitutes ? as a fallback character. Encoding fallbacks are discussed in the Documentation Remarks for Encoding:
Note that the encoding classes allow errors (unsupported characters) to:
Silently change to a "?" character.
Use a "best fit" character.
Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.
This is the behavior you are seeing.
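The fallback behavior can be reproduced without any XML involved. The following is a small sketch, assuming the default fallback for the ISO-8859-1 encoding object (which substitutes "?") and, for contrast, an explicitly strict encoding object built with `EncoderFallback.ExceptionFallback`; the variable names are illustrative.

```csharp
using System;
using System.Text;

// Default ISO-8859-1 encoding: unsupported characters are replaced.
var latin1 = Encoding.GetEncoding("ISO-8859-1");
byte[] bytes = latin1.GetBytes("\u7551 hatake"); // U+7551 is the Kanji for "field"
string roundTripped = latin1.GetString(bytes);
Console.WriteLine(roundTripped);

// Strict variant: unsupported characters raise an exception instead.
var strict = Encoding.GetEncoding(
    "ISO-8859-1",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

bool threw = false;
try
{
    strict.GetBytes("\u7551");
}
catch (EncoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw);
```

Using the strict variant during testing is one way to find out early that data is being lost.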
However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding on XmlWriterSettings.Encoding like so:
MemoryStream mStream = new MemoryStream();
var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
Encoding = enc,
CloseOutput = false,
// Remove to enable the XML declaration if you want it. XmlTextWriter doesn't include it automatically.
OmitXmlDeclaration = true,
};
using (var writer = XmlWriter.Create(mStream, settings))
{
doc.WriteTo(writer);
}
mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();
By setting the Encoding property of XmlWriterSettings, the XML writer will be made aware whenever a character is not supported by the current encoding and automatically replace it with an XML character entity reference rather than some hardcoded fallback.
E.g. say you have XML like the following:
<Root>
<string>畑 はたけ hatake "field of crops"</string>
</Root>
Then your code will output the following, mapping all Kanji to the single fallback character:
<Root><string>? ??? hatake "field of crops"</string></Root>
Whereas the new version will output:
<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>
Notice that the Kanji characters have been replaced with character entities such as &#x7551;? All compliant XML parsers will recognize and reconstruct those characters, and thus no information will be lost despite the fact that your preferred encoding does not support Kanji.
Finally, as an aside, note that the documentation for XmlTextWriter states:
Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.
So replacing it with an XmlWriter is a good idea in general.
Sample .Net fiddle demonstrating usage of both writers and asserting that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.

How do I write XML with decimal-style entities with XmlWriter

For compatibility reasons I want to generate the exact same XML as another application with C# XmlWriter.
However, I cannot control the way the XML entities are written: XmlWriter uses hexadecimal style (&#x20AC;) and I want decimal style (&#8364;).
How do I output decimal style with XmlWriter?
Sample Code
XmlWriterSettings writerSettings = new XmlWriterSettings
{
Encoding = Encoding.GetEncoding("iso-8859-1"),
};
using (XmlWriter writer = XmlWriter.Create(stream, writerSettings))
{
writer.WriteElementString("test", "€");
}
Output
<?xml version="1.0" encoding="iso-8859-1"?><test>&#x20AC;</test>
You can try this:
using (XmlWriter writer = XmlWriter.Create(stream, writerSettings))
{
string s = AntiXssEncoder.XmlEncode("€");
writer.WriteStartElement("test");
writer.WriteRaw(s);
writer.WriteEndElement();
}
as the AntiXssEncoder uses decimal encoding. You would have to do it with WriteRaw because the XmlEncode already encodes everything in xml.
Don't do:
string s = AntiXssEncoder.XmlEncode("€");
writer.WriteElementString("test", s);
Or your output will escape the '&' character, producing &amp;#8364;
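If taking a dependency on AntiXssEncoder is not an option, the same WriteRaw idea can be sketched with a hand-written helper. `ToDecimalEntities` below is a hypothetical function, not part of any library: it escapes the XML markup characters and writes every non-ASCII code point as a decimal character reference. It does not filter characters that are illegal in XML, so treat it as a starting point only.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var latin1 = Encoding.GetEncoding("iso-8859-1");
var settings = new XmlWriterSettings { Encoding = latin1 };

string output;
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream, settings))
    {
        writer.WriteStartElement("test");
        // WriteRaw bypasses the writer's own escaping, so the helper must escape everything itself.
        writer.WriteRaw(ToDecimalEntities("\u20AC")); // U+20AC is the euro sign
        writer.WriteEndElement();
    }
    output = latin1.GetString(stream.ToArray());
}

Console.WriteLine(output);

// Hypothetical helper: escapes &, <, > and emits decimal references for non-ASCII code points.
static string ToDecimalEntities(string value)
{
    var sb = new StringBuilder();
    for (int i = 0; i < value.Length; i++)
    {
        int codePoint = char.ConvertToUtf32(value, i);
        if (char.IsSurrogatePair(value, i))
        {
            i++; // skip the low surrogate that was just consumed
        }

        switch (codePoint)
        {
            case '&': sb.Append("&amp;"); break;
            case '<': sb.Append("&lt;"); break;
            case '>': sb.Append("&gt;"); break;
            default:
                if (codePoint > 0x7F)
                {
                    sb.Append("&#").Append(codePoint).Append(';');
                }
                else
                {
                    sb.Append((char)codePoint);
                }
                break;
        }
    }
    return sb.ToString();
}
```")) throw new Exception("unexpected writer output: " + output);
</test>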

Escaping Unicode string in XmlElement despite writing XML in UTF-8

For a given XmlElement, I need to be able to set the inner text to an escaped version of the Unicode string, despite the document ultimately being encoded in UTF-8. Is there any way of achieving this?
Here's a simple version of the code:
const string text = "&#xF1;";
var document = new XmlDocument {PreserveWhitespace = true};
var root = document.CreateElement("root");
root.InnerXml = text;
document.AppendChild(root);
var settings = new XmlWriterSettings {Encoding = Encoding.UTF8, OmitXmlDeclaration = true};
using (var stream = new FileStream("out.xml", FileMode.Create))
using (var writer = XmlWriter.Create(stream, settings))
document.WriteTo(writer);
Expected:
<root>&#xF1;</root>
Actual:
<root>ñ</root>
Using an XmlWriter directly and calling WriteRaw(text) works, but I only have access to an XmlDocument, and the serialization happens later. On the XmlElement, InnerText escapes the & to &amp;, as expected, and setting Value throws an exception.
Is there some way of setting the inner text of an XmlElement to the escaped ASCII text, regardless of the encoding that is ultimately used? I feel like I must be missing something obvious, or it's just not possible.
If you ask XmlWriter to produce ASCII output, it should give you character references for all non-ASCII content.
var settings = new XmlWriterSettings {Encoding = Encoding.ASCII, OmitXmlDeclaration = true};
The output is still valid UTF-8, because ASCII is a subset of UTF-8.
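A short sketch of that suggestion, assuming the document is written to a MemoryStream rather than to out.xml so the result can be inspected in memory. The element content is set with InnerText here, which is an illustrative choice.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var document = new XmlDocument { PreserveWhitespace = true };
var root = document.CreateElement("root");
root.InnerText = "\u00F1"; // U+00F1 is the letter n with tilde
document.AppendChild(root);

// ASCII cannot represent U+00F1, so the writer falls back to a character reference.
var settings = new XmlWriterSettings { Encoding = Encoding.ASCII, OmitXmlDeclaration = true };

string output;
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream, settings))
    {
        document.WriteTo(writer);
    }
    output = Encoding.ASCII.GetString(stream.ToArray());
}

Console.WriteLine(output);
```

Every byte written is below 0x80, so a consumer that reads the file as UTF-8 sees the same text.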

XDocument: saving XML to file without BOM

I'm generating an utf-8 XML file using XDocument.
XDocument xml_document = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement(ROOT_NAME,
new XAttribute("note", note)
)
);
...
xml_document.Save(file_path);
The file is generated correctly and validated with an xsd file with success.
When I try to upload the XML file to an online service, the service says that my file is wrong at line 1; I have discovered that the problem is caused by the BOM on the first bytes of the file.
Do you know why the BOM is appended to the file and how can I save the file without it?
As stated in Byte order mark Wikipedia article:
While Unicode standard allows BOM in UTF-8 it does not require or recommend it. Byte order has no meaning in UTF-8 so a BOM only serves to identify a text stream or file as UTF-8 or that it was converted from another format that has a BOM
Is it an XDocument problem or should I contact the guys of the online service provider to ask for a parser upgrade?
Use an XmlTextWriter and pass that to the XDocument's Save() method, that way you can have more control over the type of encoding used:
var doc = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("root", new XAttribute("note", "boogers"))
);
using (var writer = new XmlTextWriter(".\\boogers.xml", new UTF8Encoding(false)))
{
doc.Save(writer);
}
The UTF8Encoding class constructor has an overload that specifies whether or not to use the BOM (Byte Order Mark) with a boolean value, in your case false.
The result of this code was verified using Notepad++ to inspect the file's encoding.
First of all: the service provider MUST handle it, according to XML spec, which states that BOM may be present in case of UTF-8 representation.
You can force to save your XML without BOM like this:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = new UTF8Encoding(false); // The false means, do not emit the BOM.
using (XmlWriter w = XmlWriter.Create("my.xml", settings))
{
doc.Save(w);
}
(Googled from here: http://social.msdn.microsoft.com/Forums/en/xmlandnetfx/thread/ccc08c65-01d7-43c6-adf3-1fc70fdb026a)
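To confirm the effect of the `UTF8Encoding` constructor argument without opening the file in an editor, the bytes can be checked directly. This is a sketch that saves to a MemoryStream instead of a file; `SaveToBytes` and `HasUtf8Bom` are hypothetical helper names introduced for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var doc = new XDocument(
    new XDeclaration("1.0", "utf-8", null),
    new XElement("root", new XAttribute("note", "example")));

byte[] withBom = SaveToBytes(doc, new UTF8Encoding(true));
byte[] withoutBom = SaveToBytes(doc, new UTF8Encoding(false));

Console.WriteLine(HasUtf8Bom(withBom));
Console.WriteLine(HasUtf8Bom(withoutBom));

// Hypothetical helper: serializes the document with the given encoding and returns the raw bytes.
static byte[] SaveToBytes(XDocument document, Encoding encoding)
{
    var settings = new XmlWriterSettings { Encoding = encoding };
    using (var stream = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(stream, settings))
        {
            document.Save(writer);
        }
        return stream.ToArray();
    }
}

// Hypothetical helper: checks for the three-byte UTF-8 BOM (EF BB BF).
static bool HasUtf8Bom(byte[] bytes)
{
    return bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
}
```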
The most expedient way to get rid of the BOM when using XDocument is to save the document, read the file straight back in, then write it back out. The File routines will strip the BOM for you:
XDocument xTasks = new XDocument();
XElement xRoot = new XElement("tasklist",
new XAttribute("timestamp",lastUpdated),
new XElement("lasttask",lastTask)
);
...
xTasks.Add(xRoot);
xTasks.Save("tasks.xml");
// read it straight in, write it straight back out. Done.
string[] lines = File.ReadAllLines("tasks.xml");
File.WriteAllLines("tasks.xml",lines);
(it's hokey, but it works for the sake of expediency - at least you'll have a well-formed file to upload to your online provider) ;)
For UTF-8 documents:
String XMLDec = xDoc.Declaration.ToString();
StringBuilder sb = new StringBuilder(XMLDec);
sb.Append(xDoc.ToString());
Encoding encoding = new UTF8Encoding(false); // false = without BOM
File.WriteAllText(outPath, sb.ToString(), encoding);

Why can't I set the XDocument XDeclaration encoding type to iso-8859-1?

Why doesn't the following code set the XML declaration encoding type? It always sets the encoding to utf-16 instead. Am I missing something very obvious?
var xdoc = new XDocument(
new XDeclaration("1.0", "iso-8859-1", null),
new XElement("root", "")
);
output:
<?xml version="1.0" encoding="utf-16"?>
<root></root>
See the answer about specifying the TextWriter's encoding.
As an aside: Unicode is a character set, and UTF-16 is one encoding of that character set into a sequence of bytes. ISO-8859-1 names both a character set and its single-byte encoding, so it is a valid value for an XML encoding declaration; the declaration is being overwritten here, not rejected. Note that Unicode is the native character set and UTF-16 is the native Unicode encoding for both .NET and Java String classes and text-based or string-based operations.
As stated, the .NET XML/Stream writing implementation 'picks up' or interprets the encoding from somewhere other than the declared XML encoding. I have successfully tested a working solution, as described at the URL contained within the earlier Stackoverflow post
XDocument xmlDoc = new XDocument(
new XDeclaration("1.0", "utf-8", "no"),
new XElement("foo", "bar"));
MemoryStream memstream = new MemoryStream();
XmlTextWriter xmlwriter = new XmlTextWriter(memstream, new UTF8Encoding());
//'Write' (save) XDocument XML to MemoryStream-backed XmlTextWriter instance
xmlDoc.Save(xmlwriter);
//Read back XML string from stream
xmlwriter.Flush();
memstream.Seek(0, SeekOrigin.Begin); //OR "stream.Position = 0"
StreamReader streamreader = new StreamReader(memstream);
string xml = streamreader.ReadToEnd();
Console.WriteLine(xml);
I hope this helps somebody.
Cheers
I somehow can't find any working answer here, so here is an actual solution which will output the wanted encoding in the header:
private void CreateXml()
{
XmlTextWriter xmlwriter = new XmlTextWriter("c:\\test.xml", Encoding.GetEncoding("iso-8859-1"));
XDocument xdoc = new XDocument(
new XElement("Test")
);
xdoc.Save(xmlwriter);
xmlwriter.Close();
}
The reason why you are getting UTF-16 is that strings are encoded with UTF-16 in memory, and as long as you don't specify an encoding for the output of the XML, it will override the encoding in the XML header to match the actual encoding being used. Using an XmlTextWriter is one method of specifying a different encoding.
You can also let the XmlTextWriter write to a MemoryStream and then transform it back to string if you need to perform the whole operation in memory.
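The in-memory variant might look like the sketch below. It swaps XmlTextWriter for `XmlWriter.Create` with `XmlWriterSettings.Encoding`, which is an assumption on my part rather than the answerer's exact code, and the element content is a placeholder.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var latin1 = Encoding.GetEncoding("iso-8859-1");
var xdoc = new XDocument(new XElement("Test", "caf\u00E9"));

string xml;
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream, new XmlWriterSettings { Encoding = latin1 }))
    {
        // The writer targets bytes, so the declaration reports the writer's encoding.
        xdoc.Save(writer);
    }
    // Decode with the same encoding to get the text back as a .NET string.
    xml = latin1.GetString(stream.ToArray());
}

Console.WriteLine(xml);
```

Once the result is a string again it is UTF-16 in memory, so the declaration is only truthful if the string is later written out as ISO-8859-1 bytes.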
