I have the following bit of code in C# to convert an XML file to another using XSLT/
string xmlInput = #"<?xml version='1.0' encoding='UTF-8'?><catalog><cd><title> Empire Burlesque </title ><artist> Bob Dylan </artist><country> USA </country><company> Columbia </company><price> 10.90 </price><year> 1985 </year></cd></catalog>";
///////////////////////////////////////////////////////////////
string xmlOutput = String.Empty;
using (StringReader sri = new StringReader(xmlInput))
{
using (XmlReader xri = XmlReader.Create(sri))
{
XslCompiledTransform xslt = new XslCompiledTransform();
//xslt.Load(xrt);
xslt.Load(#"XSLT/slide2.xslt");
using (StringWriter sw = new StringWriter())
using (XmlWriter xwo = XmlWriter.Create(sw, new XmlWriterSettings { Encoding = Encoding.UTF8 }))
{
xslt.Transform(xri, xwo);
xmlOutput = sw.ToString();
}
}
}
xmlOutput gives me "<?xml version=\"1.0\" encoding=\"utf-16\"?><root> Empire Burlesque </root>"
How can I get utf-8 and no slashes?
.NET strings are sequences of UTF-16 encoded characters and StringWriter/StringBuilder default to that encoding. (source https://forums.asp.net/post/3240311.aspx)
So you need to make a class which inherits of the default stringwriter:
public class StringWriterWithEncoding : StringWriter
{
Encoding myEncoding;
public override Encoding Encoding
{
get
{
return myEncoding;
}
}
public StringWriterWithEncoding(Encoding encoding) : base()
{
myEncoding = encoding;
}
public StringWriterWithEncoding(Encoding encoding) : base(CultureInfo.CurrentCulture)
{
myEncoding = encoding;
}
public StringWriterWithEncoding(StringBuilder sb, Encoding encoding) : base(sb, CultureInfo.CurrentCulture)
{
myEncoding = encoding;
}
}
and create an instance of that e.g. StringWriterWithEncoding utf8Writer = new StringWriterWithEncoding(Encoding.UTF8); and pass that as the third argument to the Transform method of your XslCompiledTransform.
use like this:
StringBuilder sb = new StringBuilder();
using (StringWriterWithEncoding sw = new StringWriterWithEncoding(sb, Encoding.UTF8))
{
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(#"XSLT/slide2.xslt");
xslt.Transform(xri, sw);
}
xmlOutput = sb.ToString();
First issue is caused by StringWriter
using (StringWriter sw = new StringWriter())
using (XmlWriter xwo = XmlWriter.Create(sw, new XmlWriterSettings { Encoding = Encoding.UTF8 }))
Even though you specifically set XmlWriterSettings.Encoding to UTF-8, you specify output stream to be StringWriter and since .NET strings are UTF-16, XmlWriter is forced to use UTF-16.
If you use for instance FileStream instead of StringWriter, output will be in UTF-8 or whatever encoding you specify.
Slashes issue is just your IDE escaping it. If you print xmlOutput to Console you will see it contains no extra slashes.
You can include this line in your XSLT stylesheet:
<xsl:output encoding="utf-8"/>
(or of course whichever encoding you prefer), and it will automatically set the output settings to the utf-8 encoding.
I believe that using the MemoryStream is a better way to handle this. .net strings are utf-16 internally and that's how they are encoded when you write to a StringWriter of StringBuilder object. Using a memory stream, you avoid that pitfall.
string xmlDoc = "";
// Use a memory stream to avoid the .net internal string utf-16 encoding pitfall.
using (MemoryStream xmlStream = new MemoryStream())
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlAsText)))
using (XmlReader xsltReader = XmlReader.Create(new StringReader(xsltAsText)))
{
// Transform XML string to new XML based on the XSLT
// Load the XSLT and transform source XML to target XML
XslCompiledTransform myXslTrans = new XslCompiledTransform();
myXslTrans.Load(xsltReader);
myXslTrans.Transform(xmlReader, null, xmlStream);
// Using the encoding from the xslt, transform the xml stream bytes to the xml string.
// If no encoding in xslt, defaults to UTF-8.
xmlDoc = myXslTrans.OutputSettings.Encoding.GetString(xmlStream.ToArray());
// Remove the BOM if it exists
string byteOrderMark = myXslTrans.OutputSettings.Encoding.GetString(myXslTrans.OutputSettings.Encoding.GetPreamble());
if (xmlDoc.StartsWith(byteOrderMark, StringComparison.Ordinal))
{
xmlDoc = xmlDoc.Remove(0, byteOrderMark.Length);
}
}
If there is no encoding attribute in the stylesheet, then you get UTF-8 by default, not UTF-16. This is the same as if you were to write it directly to file. I can't say how it works for different cultures, sorry.
I'm suggesting this method over the other methods where they are hard coding UTF-8. This works with whatever (valid) encoding is in the style sheet, such as ISO-8859-1.
Related
I am using this code to store my class:
FileStream stream = new FileStream(myPath, FileMode.Create);
XmlSerializer serializer = new XmlSerializer(typeof(myClass));
serializer.Serialize(stream, myClass);
stream.Close();
This writes a file that I can read alright with XmlSerializer.Deserialize. The generated file, however, is not a proper text file. XmlSerializer.Serialize doesn't store a BOM, but still inserts multibyte characters. Thus it is implicitely declared an ANSI file (because we expect an XML file to be a text file, and a text file without a BOM is considered ANSI by Windows), showing ö as ö in some editors.
Is this a known bug? Or some setting that I'm missing?
Here is what the generated file starts with:
<?xml version="1.0"?>
<SvnProjects xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
The first byte in the file is hex 3C, i.e the <.
Having or not having a BOM is not a definition of a "proper text file". In fact, I'd say that the most typical format these days is UTF-8 without BOM; I don't think I've ever seen anyone actually use the UTF-8 BOM in real systems! But: if you want a BOM, that's fine: just pass the correct Encoding in; if you want UTF-8 with BOM:
using (var writer = XmlWriter.Create(myPath, s_settings))
{
XmlSerializer serializer = new XmlSerializer(typeof(MyClass));
serializer.Serialize(writer, obj);
}
with:
static readonly XmlWriterSettings s_settings =
new XmlWriterSettings { Encoding = new UTF8Encoding(true) };
The result of this is a file that starts EF-BB-BF, the UTF-8 BOM.
If you want a different encoding, then just replace new UTF8Encoding with whatever you did want, remembering to enable the BOM.
(note: the static Encoding.UTF8 instance has the BOM enabled, but IMO it is better to be very explicit here if you specifically intend to use a BOM, just like you should be very explicit about what Encoding you intended to use)
Edit: the key difference here is that Serialize(Stream, object) ends up using:
XmlTextWriter xmlWriter = new XmlTextWriter(stream, encoding: null) {
Formatting = Formatting.Indented,
Indentation = 2
};
which then ends up using:
public StreamWriter(Stream stream) : this(stream,
encoding: UTF8NoBOM, // <==== THIS IS THE PROBLEM
bufferSize: 1024, leaveOpen: false)
{
}
so: UTF-8 without BOM is the default if you use that API.
you must xml an instance not a class definition
for getting Unicode you must declare a XmlWriter or TextWriter
FileStream stream = new FileStream(myPath, FileMode.Create);
XmlSerializer serializer = new XmlSerializer(typeof(myClass));
XmlWriter writer = new XmlTextWriter(fs, Encoding.Unicode);
serializer.Serialize(writer, myClass);
stream.Close();
I am trying to set encoding type to ISO-8859-1 while serializing an object. The code runs without throwing any exception but the returned coding type is always set to "UTF-16". I have searched many examples, there are 100s of samples, but I am simply unable to force the desired encoding type.
My questions is that how can I force it to set encoding to ISO-8859-1?
Thanks in advance.
The code is:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = Encoding.GetEncoding("ISO-8859-1")
};
using (var stringWriter = new StringWriter())
{
using (var xmlWriter = XmlWriter.Create(stringWriter, xmlWriterSettings))
{
serializer.Serialize(xmlWriter, obj, ns);
}
return stringWriter.ToString();
}
When you write XML to a TextWriter (e.g. StringWriter, StreamWriter), it always uses the encoding specified by the TextWriter, and ignores the one specified in XmlWriterSettings. For StringWriter, the encoding is always UTF-16, because that's the encoding used by .NET strings in memory.
The workaround is to write to a MemoryStream and read the string from there:
var encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = encoding
};
using (var stream = new MemoryStream())
{
using (var xmlWriter = XmlWriter.Create(stream, xmlWriterSettings))
{
serializer.Serialize(xmlWriter, obj, ns);
}
return encoding.GetString(stream.ToArray());
}
Note that if you have the XML in a string, when you write it to a file you need to make sure to write it with the same encoding.
I am using a routine that serializes <T>. It works, but when downloaded to the browser I see a blank page. I can view the page source or open the download in a text editor and I see the xml, but it is in UTF-16 which I think is why browser pages show blank?
How do I modify my serializer routine to return UTF-8 instead of UTF-16?
The XML source returned:
<?xml version="1.0" encoding="utf-16"?>
<ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<string>January</string>
<string>February</string>
<string>March</string>
<string>April</string>
<string>May</string>
<string>June</string>
<string>July</string>
<string>August</string>
<string>September</string>
<string>October</string>
<string>November</string>
<string>December</string>
<string />
</ArrayOfString>
An example call to the serializer:
DateTimeFormatInfo dateTimeFormatInfo = new DateTimeFormatInfo();
var months = dateTimeFormatInfo.MonthNames.ToList();
string SelectionId = "1234567890";
return new XmlResult<List<string>>(SelectionId)
{
Data = months
};
The Serializer:
public class XmlResult<T> : ActionResult
{
private string filename = DateTime.Now.ToString("ddmmyyyyhhss");
public T Data { private get; set; }
public XmlResult(string selectionId = "")
{
if (selectionId != "")
{
filename = selectionId;
}
}
public override void ExecuteResult(ControllerContext context)
{
HttpContextBase httpContextBase = context.HttpContext;
httpContextBase.Response.Buffer = true;
httpContextBase.Response.Clear();
httpContextBase.Response.AddHeader("content-disposition", "attachment; filename=" + filename + ".xml");
httpContextBase.Response.ContentType = "text/xml";
using (StringWriter writer = new StringWriter())
{
XmlSerializer xml = new XmlSerializer(typeof(T));
xml.Serialize(writer, Data);
httpContextBase.Response.Write(writer);
}
}
}
You can use a StringWriter that will force UTF8. Here is one way to do it:
public class Utf8StringWriter : StringWriter
{
// Use UTF8 encoding but write no BOM to the wire
public override Encoding Encoding
{
get { return new UTF8Encoding(false); } // in real code I'll cache this encoding.
}
}
and then use the Utf8StringWriter writer in your code.
using (StringWriter writer = new Utf8StringWriter())
{
XmlSerializer xml = new XmlSerializer(typeof(T));
xml.Serialize(writer, Data);
httpContextBase.Response.Write(writer);
}
answer is inspired by Serializing an object as UTF-8 XML in .NET
Encoding of the Response
I am not quite familiar with this part of the framework. But according to the MSDN you can set the content encoding of an HttpResponse like this:
httpContextBase.Response.ContentEncoding = Encoding.UTF8;
Encoding as seen by the XmlSerializer
After reading your question again I see that this is the tough part. The problem lies within the use of the StringWriter. Because .NET Strings are always stored as UTF-16 (citation needed ^^) the StringWriter returns this as its encoding. Thus the XmlSerializer writes the XML-Declaration as
<?xml version="1.0" encoding="utf-16"?>
To work around that you can write into an MemoryStream like this:
using (MemoryStream stream = new MemoryStream())
using (StreamWriter writer = new StreamWriter(stream, Encoding.UTF8))
{
XmlSerializer xml = new XmlSerializer(typeof(T));
xml.Serialize(writer, Data);
// I am not 100% sure if this can be optimized
httpContextBase.Response.BinaryWrite(stream.ToArray());
}
Other approaches
Another edit: I just noticed this SO answer linked by jtm001. Condensed the solution there is to provide the XmlSerializer with a custom XmlWriter that is configured to use UTF8 as encoding.
Athari proposes to derive from the StringWriter and advertise the encoding as UTF8.
To my understanding both solutions should work as well. I think the take-away here is that you will need one kind of boilerplate code or another...
To serialize as UTF8 string:
private string Serialize(MyData data)
{
XmlSerializer ser = new XmlSerializer(typeof(MyData));
// Using a MemoryStream to store the serialized string as a byte array,
// which is "encoding-agnostic"
using (MemoryStream ms = new MemoryStream())
// Few options here, but remember to use a signature that allows you to
// specify the encoding
using (XmlTextWriter tw = new XmlTextWriter(ms, Encoding.UTF8))
{
tw.Formatting = Formatting.Indented;
ser.Serialize(tw, data);
// Now we get the serialized data as a string in the desired encoding
return Encoding.UTF8.GetString(ms.ToArray());
}
}
To return it as XML on a web response, don't forget to set the response encoding:
string xml = Serialize(data);
Response.ContentType = "application/xml";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.Output.Write(xml);
This should hopefully be a simple one.
I am serializaing a List<> of C# objects to an XML document. Everything is going great however my XML document has ASCII encoding (spaces are represented as X0020 for example) and the client is complaining so I want to change the encoding to UTF8 like so:
private void SerializeToXML(List<ResponseData> finalXML)
{
XmlSerializer serializer = new XmlSerializer(typeof(List<ResponseData>));
TextWriter textWriter = new StreamWriter(txtFileLocation.Text, Encoding.UTF8);
serializer.Serialize(textWriter, finalXML);
textWriter.Close();
}
Intellisense is telling me this should work...
...but is complaining when I try it...
What am I doing wrong?
Thanks
There is no (string, Encoding) method signature for the StreamWriter constructor.
There is a (Stream, Encoding) signature for the constructor.
here is a snippet that is working like a charm:
using (Stream stream = File.Open(SerializeXmlFileName, FileMode.Create))
{
using (TextWriter writer = new StreamWriter(stream, Encoding.UTF8))
{
XmlSerializer xmlFormatter = new XmlSerializer(this.Member.GetType());
xmlFormatter.Serialize(writer, this.Member);
writer.Close();
}
stream.Close();
}
using System;
public class clsPerson
{
public string FirstName;
public string MI;
public string LastName;
}
class class1
{
static void Main(string[] args)
{
clsPerson p=new clsPerson();
p.FirstName = "Jeff";
p.MI = "A";
p.LastName = "Price";
System.Xml.Serialization.XmlSerializer x = new System.Xml.Serialization.XmlSerializer(p.GetType());
x.Serialize(Console.Out, p);
Console.WriteLine();
Console.ReadLine();
}
}
taken from http://support.microsoft.com/kb/815813
1)
System.Xml.Serialization.XmlSerializer x = new System.Xml.Serialization.XmlSerializer(p.GetType());
What does this line do? what is GetType()?
2) how do I get the encoding to
<?xml version="1.0" encoding="utf-8"?>
< clsPerson xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
instead of
<?xml version="1.0" encoding="IBM437"?>
<clsPerson xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3
.org/2001/XMLSchema">
or not include the encoding type at all?
If you pass the serializer an XmlWriter, you can control some parameters like encoding, whether to omit the declaration (eg for a fragment), etc.
This is not meant to be a definitive guide, but an alternative so you can see what's going on, and something that isn't just going to console first.
Note also, if you create your XmlWriter with a StringBuilder instead of a MemoryStream, your xml will ignore your Encoding and come out as utf-16 encoded. See the blog post writing xml with utf8 encoding for more information.
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = Encoding.UTF8
};
using (MemoryStream memoryStream = new MemoryStream() )
using (XmlWriter xmlWriter = XmlWriter.Create(memoryStream, xmlWriterSettings))
{
var x = new System.Xml.Serialization.XmlSerializer(p.GetType());
x.Serialize(xmlWriter, p);
// we just output back to the console for this demo.
memoryStream.Position = 0; // rewind the stream before reading back.
using( StreamReader sr = new StreamReader(memoryStream))
{
Console.WriteLine(sr.ReadToEnd());
} // note memory stream disposed by StreamReaders Dispose()
}
1) The GetType() function returns a Type object representing the type of your object, in this case the class clsPerson. You could also use typeof(clsPerson) and get the same result. That line creates an XmlSerializer object for your particular class.
2) If you want to change the encoding, I believe there is an override of the Serialize() function that lets you specify that. See MSDN for details. You may have to create an XmlWriter object to use it though, details for that are also on MSDN:
XmlWriter writer = XmlWriter.Create(Console.Out, settings);
You can also set the encoding in the XmlWriter, the XmlWriterSettings object has an Encoding property.
I took the solution offered by #robert-paulson here for a similar thing I was trying to do and get the string of an XmlSchema. By default it would return as utf-16. However as mentioned the solution here suffers from a Stream Closed Read error. So I tool the liberty of posting the refactor as an extension method with the tweek mentioned by #Liam to move the using block.
public static string ToXmlString(this XmlSchema xsd)
{
var xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = Encoding.UTF8
};
using (var memoryStream = new MemoryStream())
{
using (var xmlWriter = XmlWriter.Create(memoryStream, xmlWriterSettings))
{
xsd.Write(xmlWriter);
}
memoryStream.Position = 0;
using (var sr = new StreamReader(memoryStream))
{
return sr.ReadToEnd();
}
}
}
1) This creates a XmlSerializer for the class clsPerson.
2) encoding is IBM437 because that is the form for the Console.Out stream.
PS: Hungarian notation is not preferred in C#; just name you class Person.