Writing XML files using XmlTextWriter with ISO-8859-1 encoding


I'm having a problem writing Norwegian characters into an XML file using C#. I have a string variable containing some Norwegian text (with letters like æøå).
I'm writing the XML using an XmlTextWriter, writing the contents to a MemoryStream like this:
MemoryStream stream = new MemoryStream();
XmlTextWriter xmlTextWriter = new XmlTextWriter(stream, Encoding.GetEncoding("ISO-8859-1"));
xmlTextWriter.Formatting = Formatting.Indented;
xmlTextWriter.WriteStartDocument(); //Start doc
Then I add my Norwegian text like this:
xmlTextWriter.WriteCData(myNorwegianText);
Then I write the file to disk like this:
FileStream myFile = new FileStream(myPath, FileMode.Create);
StreamWriter sw = new StreamWriter(myFile);
stream.Position = 0;
StreamReader sr = new StreamReader(stream);
string content = sr.ReadToEnd();
sw.Write(content);
sw.Flush();
myFile.Flush();
myFile.Close();
Now the problem is that in the file on disk, all the Norwegian characters look funny.
I'm probably doing the above in some stupid way. Any suggestions on how to fix it?

Why are you writing the XML first to a MemoryStream and then writing that to the actual file stream? That's pretty inefficient. If you write directly to the FileStream it should work.
If you still want to do the double write, for whatever reason, do one of two things. Either
Make sure that the StreamReader and StreamWriter objects you use all use the same encoding as the one you used with the XmlWriter (not just the StreamWriter, like someone else suggested), or
Don't use StreamReader/StreamWriter. Instead just copy the stream at the byte level using a simple byte[] and Stream.Read/Write. This is going to be, btw, a lot more efficient anyway.

Both your StreamWriter and your StreamReader are using UTF-8, because you're not specifying the encoding. That's why things are getting corrupted.
As tomasr said, using a FileStream to start with would be simpler - but also MemoryStream has the handy "WriteTo" method which lets you copy it to a FileStream very easily.
I hope you've got a using statement in your real code, by the way - you don't want to leave your file handle open if something goes wrong while you're writing to it.
Jon
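A minimal sketch of the WriteTo approach, assuming the XML is still built in a MemoryStream first. The class name, method name and the "root" element are hypothetical and not from the question. The point is that the bytes are copied as-is, so no StreamReader or StreamWriter gets a chance to re-decode them:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class Latin1XmlExample
{
    // Builds the XML in memory as ISO-8859-1, then copies the raw bytes to disk.
    public static void Save(string path, string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        using (var buffer = new MemoryStream())
        {
            var xml = new XmlTextWriter(buffer, latin1);
            xml.Formatting = Formatting.Indented;
            xml.WriteStartDocument();
            xml.WriteStartElement("root");   // hypothetical element name
            xml.WriteCData(text);
            xml.WriteEndElement();
            xml.WriteEndDocument();
            xml.Flush();                     // push buffered output into the MemoryStream

            using (var file = new FileStream(path, FileMode.Create))
            {
                buffer.WriteTo(file);        // byte-for-byte copy, no decoding step
            }
        }
    }
}
```

Calling Latin1XmlExample.Save(myPath, myNorwegianText) should leave each of æ, ø and å as a single byte in the file, and the using block closes the file handle even if writing fails.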

You need to set the encoding every time you write a string or read binary data as a string.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
FileStream myFile = new FileStream(myPath, FileMode.Create);
StreamWriter sw = new StreamWriter(myFile, encoding);
stream.Position = 0;
StreamReader sr = new StreamReader(stream, encoding);
string content = sr.ReadToEnd();
sw.Write(content);
sw.Flush();
myFile.Flush();
myFile.Close();

As mentioned in the answers above, the biggest issue here is the encoding, which falls back to a default because it is not specified.
When you do not specify an Encoding for a StreamReader or StreamWriter, the default of UTF-8 is used, which does not match your ISO-8859-1 data. You are also converting the data needlessly by pushing it into a MemoryStream and then out into a FileStream.
The XmlTextWriter puts ISO-8859-1 bytes into the MemoryStream. The StreamReader then decodes those bytes as UTF-8, which corrupts every non-ASCII character. When you write the resulting string out through the StreamWriter, which also defaults to UTF-8, you simply persist that corruption into the file.
To fix the issue, specify the encoding on your StreamReader and StreamWriter.
You can also skip the MemoryStream entirely, which will be faster and more efficient. Your updated code might look something like this:
using (FileStream fs = new FileStream(myPath, FileMode.Create))
using (XmlTextWriter xmlTextWriter =
    new XmlTextWriter(fs, Encoding.GetEncoding("ISO-8859-1")))
{
    xmlTextWriter.Formatting = Formatting.Indented;
    xmlTextWriter.WriteStartDocument(); // Start doc
    xmlTextWriter.WriteCData(myNorwegianText);
    // No StreamReader/StreamWriter round trip is needed: the writer
    // encodes straight into the file and is flushed when disposed.
}

Which encoding do you use for displaying the result file? If it is not in ISO-8859-1, it will not display correctly.
Is there a reason to use this specific encoding, instead of for example UTF8?

After investigating, this is what worked best for me:
var doc = new XDocument(new XDeclaration("1.0", "ISO-8859-1", ""));
using (XmlWriter writer = doc.CreateWriter()){
writer.WriteStartDocument();
writer.WriteStartElement("Root");
writer.WriteElementString("Foo", "value");
writer.WriteEndElement();
writer.WriteEndDocument();
}
doc.Save("dte.xml");

Related

XmlSerializer.Serialize BOM missing

I am using this code to store my class:
FileStream stream = new FileStream(myPath, FileMode.Create);
XmlSerializer serializer = new XmlSerializer(typeof(myClass));
serializer.Serialize(stream, myClass);
stream.Close();
This writes a file that I can read alright with XmlSerializer.Deserialize. The generated file, however, is not a proper text file. XmlSerializer.Serialize doesn't store a BOM, but still inserts multibyte characters. Thus it is implicitly declared an ANSI file (because we expect an XML file to be a text file, and a text file without a BOM is considered ANSI by Windows), showing ö as Ã¶ in some editors.
Is this a known bug? Or some setting that I'm missing?
Here is what the generated file starts with:
<?xml version="1.0"?>
<SvnProjects xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
The first byte in the file is hex 3C, i.e. the <.

Having or not having a BOM is not a definition of a "proper text file". In fact, I'd say that the most typical format these days is UTF-8 without BOM; I don't think I've ever seen anyone actually use the UTF-8 BOM in real systems! But: if you want a BOM, that's fine: just pass the correct Encoding in; if you want UTF-8 with BOM:
using (var writer = XmlWriter.Create(myPath, s_settings))
{
XmlSerializer serializer = new XmlSerializer(typeof(MyClass));
serializer.Serialize(writer, obj);
}
with:
static readonly XmlWriterSettings s_settings =
new XmlWriterSettings { Encoding = new UTF8Encoding(true) };
The result of this is a file that starts EF-BB-BF, the UTF-8 BOM.
If you want a different encoding, then just replace new UTF8Encoding with whatever you did want, remembering to enable the BOM.
(note: the static Encoding.UTF8 instance has the BOM enabled, but IMO it is better to be very explicit here if you specifically intend to use a BOM, just like you should be very explicit about what Encoding you intended to use)
Edit: the key difference here is that Serialize(Stream, object) ends up using:
XmlTextWriter xmlWriter = new XmlTextWriter(stream, encoding: null) {
Formatting = Formatting.Indented,
Indentation = 2
};
which then ends up using:
public StreamWriter(Stream stream) : this(stream,
encoding: UTF8NoBOM, // <==== THIS IS THE PROBLEM
bufferSize: 1024, leaveOpen: false)
{
}
so: UTF-8 without BOM is the default if you use that API.
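The difference between the two defaults can be checked in memory. This is a small sketch, assuming a plain string is serialized; the class and method names are hypothetical. It serializes through XmlWriter.Create with an explicit UTF8Encoding and returns the raw bytes, so the first bytes can be inspected:

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class BomExample
{
    // Serializes a value as UTF-8 and returns the bytes, with or without a BOM.
    public static byte[] Serialize<T>(T value, bool withBom)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(withBom) };
        using (var ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return ms.ToArray();
        }
    }
}
```

With withBom set to true the result should begin with EF-BB-BF; with false it should begin with 3C, the < of the XML declaration.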

You must serialize an instance, not a class definition.
To get Unicode output, you must pass an XmlWriter or TextWriter that uses that encoding:
FileStream stream = new FileStream(myPath, FileMode.Create);
XmlSerializer serializer = new XmlSerializer(typeof(myClass));
XmlWriter writer = new XmlTextWriter(stream, Encoding.Unicode);
serializer.Serialize(writer, myObject); // myObject: an instance of myClass
writer.Close(); // flushes the writer and closes the underlying stream

XML Serialization - Why can't I change the encoding?

This should hopefully be a simple one.
I am serializing a List<> of C# objects to an XML document. Everything is going great; however, my XML document has ASCII encoding (spaces are represented as X0020, for example) and the client is complaining, so I want to change the encoding to UTF8 like so:
private void SerializeToXML(List<ResponseData> finalXML)
{
XmlSerializer serializer = new XmlSerializer(typeof(List<ResponseData>));
TextWriter textWriter = new StreamWriter(txtFileLocation.Text, Encoding.UTF8);
serializer.Serialize(textWriter, finalXML);
textWriter.Close();
}
Intellisense is telling me this should work...
...but is complaining when I try it...
What am I doing wrong?
Thanks

There is no (string, Encoding) method signature for the StreamWriter constructor.
There is a (Stream, Encoding) signature for the constructor.
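If a file path is more convenient than a Stream, the (string, bool, Encoding) overload of the StreamWriter constructor can be used instead. The sketch below assumes a List<string> as a stand-in for the asker's List<ResponseData>; the class and method names are hypothetical:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml.Serialization;

public static class Utf8Serialization
{
    // Serializes the list to the given path as UTF-8.
    public static void Save(string path, List<string> items)
    {
        var serializer = new XmlSerializer(typeof(List<string>));
        // Second argument is "append": false overwrites any existing file.
        using (TextWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            serializer.Serialize(writer, items);
        }
    }
}
```

The serializer takes the encoding name for the XML declaration from the TextWriter, so the declaration and the bytes on disk stay in agreement.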

here is a snippet that is working like a charm:
using (Stream stream = File.Open(SerializeXmlFileName, FileMode.Create))
{
using (TextWriter writer = new StreamWriter(stream, Encoding.UTF8))
{
XmlSerializer xmlFormatter = new XmlSerializer(this.Member.GetType());
xmlFormatter.Serialize(writer, this.Member);
writer.Close();
}
stream.Close();
}

how to read special character like é, â and others in C#

I can't read those special characters
I tried like this
1st way:
string xmlFile = File.ReadAllText(fileName);
2nd way:
FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
StreamReader r = new StreamReader(fs);
string s = r.ReadToEnd();
But both statements don't understand those special characters.
How should I read?
UPDATE:
I also try all encoding with
string xmlFile = File.ReadAllText(fileName, Encoding. );
but still don't understand those special characters.

There is no such thing as "special character". What those likely are is extended ascii characters from the latin1 set (iso-8859-1).
You can read those by supplying encoding explicitly to the stream reader (otherwise it will assume UTF8)
using (StreamReader r = new StreamReader(fileName, Encoding.GetEncoding("iso-8859-1")))
r.ReadToEnd();
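To see why the choice of encoding matters, the same bytes can be decoded two ways. This is an illustrative sketch with a hypothetical helper, not part of the answer above:

```csharp
using System.Text;

public static class DecodeExample
{
    // Decodes raw bytes using the named encoding.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

For the bytes 63 61 66 E9, decoding as "iso-8859-1" gives "café". Decoding the same bytes as "utf-8" cannot interpret the lone E9 byte, so the result contains the replacement character U+FFFD instead of é.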

StreamReader sr = new StreamReader(stream, Encoding.UTF8);

This worked for me :
var json = System.IO.File.ReadAllText(@"././response/response.json" , System.Text.Encoding.GetEncoding("iso-8859-1"));

You have to tell the StreamReader that you are reading Unicode like so
StreamReader sr = new StreamReader(stream, Encoding.Unicode);
If your file is of some other encoding, specify it as the second parameter

I had to "find" the encoding of the file first
//try to "find" the encoding, if not found, use UTF8
var enc = GetEncoding(filePath)??Encoding.UTF8;
var text = File.ReadAllText(filePath, enc );
(please refer to this answer to get the GetEncoding function)

If you can modify the file in question, you can save it with encoding.
I had a json file that I had created (normally) in VS, and I was having the same problem. Rather than specify the encoding when reading the file (I was using System.IO.File.ReadAllText which defaults to UTF8), I resaved the file (File->Save As) and on the Save button, I clicked the arrow and chose "Save with Encoding", then chose "Unicode (UTF-8 with signature) - Codepage 65001".
Problem solved, no need to specify the encoding when reading the file.

Investigating XMLWriter object

How can I see the XML contents of fully populated XmlWriter object while debugging. My silverlight application doesn't permit to actually write to a file and check the contents.

Have it write to a MemoryStream or StringBuilder instead of a file. That will allow you to check the output.

You can create the XmlWriter based on a MemoryStream, then decode the bytes from the memory stream and display the result in a text box, for example.
MemoryStream ms = new MemoryStream();
XmlWriterSettings ws = new XmlWriterSettings();
ws.Encoding = Encoding.UTF8;
XmlWriter w = XmlWriter.Create(ms, ws);
// populate the writer
w.Flush();
textBox1.Text = Encoding.UTF8.GetString(ms.GetBuffer(), 0, (int)ms.Position);
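The StringBuilder variant avoids the decoding step altogether. A minimal sketch, with hypothetical element names standing in for whatever the application writes:

```csharp
using System.Text;
using System.Xml;

public static class XmlDebugExample
{
    // Writes a small document into a StringBuilder and returns it as text.
    public static string Build()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter w = XmlWriter.Create(sb, settings))
        {
            w.WriteStartElement("Root");          // hypothetical content
            w.WriteElementString("Foo", "value");
            w.WriteEndElement();
        }                                         // disposing flushes into sb
        return sb.ToString();
    }
}
```

The returned string can be inspected in the debugger or assigned to a text box. Because the target is a string, the XML declaration will name utf-16 rather than the encoding of the eventual file.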

An XmlReader is not "populated". It represents the state of an XML parsing operation, as that operation is in progress. This state will change as the XML is read.

Saving unicode characters as "&#8220;" into a file; C#

I have an xml file (converted from xfdl) which contains something like:
<custom:popUp xfdl:compute="toggle(activated,'off','on') == '1' ? viewer.messageBox('o Once you click ..... page.
o When you use the &#8220;Create &#8221; function in.......Portal.','Information'):''">
I load it and save it using...
XmlDocument xmlOut = new XmlDocument(); //note: not read only
FileStream outfs = new FileStream(tempOutXmlFileName, FileMode.Open, FileAccess.Read,
FileShare.ReadWrite);
xmlOut.Load(outfs);
xmlOut.Save(tempOutXmlFileName);
outfs.Close();
This process converts some of the unicode instructions into actual characters which completely messes up the xml/xfdl parsing as there are now quotation marks where quotation marks shouldn't be.
Does anybody know a way I can save the file with all the lovely &#8220; characters intact?
Thank you.
Well, after fiddling around for a bit and getting the xml->xfdl conversion working better, I ran into a new problem.
The solution below seems to work and all the parsing of the xml is correct, but the program to read the xfdl file doesn't seem to like when I encode it using UTF-8 and wants the encoding to be ISO-8859-1.
Any ideas?

Using StreamReader and StreamWriter should help. To be clear, are you trying to read from and write to the same file? I added some nice using statements as well.
XmlDocument xmlOut = new XmlDocument();
//note: not read only
using (FileStream outfs = new FileStream(tempOutXmlFileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (StreamReader reader = new StreamReader(outfs, Encoding.UTF8))
{
xmlOut.Load(reader);
}
using (StreamWriter writer = new StreamWriter(tempOutXmlFileName, false, Encoding.UTF8))
{
xmlOut.Save(writer);
}
I set append to false in the StreamWriter, seems to make sense.

Turns out that reading and writing the files byte by byte solved the problem since the writer never got the opportunity to do any interpretation on the content.
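If the document does have to pass through XmlDocument, one option worth trying is to save it through an XmlWriter configured for ISO-8859-1. This is a sketch under assumptions: the class and method names are hypothetical, and whether the XFDL reader accepts the result is untested. Characters that ISO-8859-1 cannot represent, such as curly quotes, are written by the XmlWriter as numeric character references rather than as literal characters:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class Latin1SaveExample
{
    // Saves the document as ISO-8859-1 bytes; unencodable characters
    // in text and attribute values come out as character references.
    public static byte[] Save(XmlDocument doc)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.GetEncoding("ISO-8859-1")
        };
        using (var ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            }
            return ms.ToArray();
        }
    }
}
```

The returned bytes can be written with File.WriteAllBytes. The references may appear in hexadecimal form rather than the decimal form used in the original file; both denote the same characters to an XML parser.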
