XML Deserialize with UTF-8 encoding

XML Deserialize with UTF-8 encoding - c#

I already searched a lot today about this and I can't find how to Deserialize with UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
<AvailabilityRequestV2 xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema- instance"
siteid="0000"
apikey="0000"
async="false" waittime="0">
<Type>4</Type>
<Id>159266</Id>
<Radius>0</Radius>
<Latitude>0</Latitude>
<Longitude>0</Longitude>
</AvailabilityRequestV2>
If I try this
string xmlString = File above;
XmlSerializer serializer = new XmlSerializer(typeof(AvailabilityRequestV2));
AvailabilityRequestV2 request = (AvailabilityRequestV2)serializer.Deserialize(
new MemoryStream(Encoding.UTF8.GetBytes(xmlString)));
If I put in debugging mode the mouse over request I get this:
{<?xml version="1.0" encoding="utf-16"?><AvailabilityRequestV2
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
..................
How can I force to be UTF-8 ?
I only saw to Serialize, but Deserialize I didn't.

You can use a StreamReader and specify UTF-8, you can also tell it to use the BOM if present:
using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {
XmlSerializer serializer = new XmlSerializer(typeof(SomeType));
object result = serializer.Deserialize(reader);
}
I'm unsure what happens when the XML reader however encounters the encoding="utf-16" directive within the XML, it may switch over.

Once you have slurped the contents of a file into a .Net/CLR string, it is UTF-16 encoded: it has been transformed from its original source encoding. The CLR uses UTF-16 internally—hence the reason for a char being 16 bits.
As a result, the encoding specified in the document's [original] XML Declaration is now at odds with the actual encoding of the document.
Best to pass a StreamReader as recommended by #Lloyd above.

I think the example from #Lloyd needs the new keyword:
using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {

Related

C# Parsing XML in ISO-8859-1

I'm working on a tool for validating XML files grabbed from a mainframe. For reasons beyond my control every XML file is encoded in ISO 8859-1.
<?xml version="1.0" encoding="ISO 8859-1"?>
My C# application utilizes the System.XML library to parse the XML and eventually a string of a message contained within one of the child nodes.
If I manually remove the XML encoding line it works just fine. But i'd like to find a solution that doesn't require manual intervention. Are there any elegant approaches to solving this? Thanks in advance.
The exception that is thrown reads as:
System.Xml.XmlException' occurred in System.Xml.dll. System does not support 'ISO 8859-1' encoding. Line 1, position 31
My code is
XMLDocument xmlDoc = new XMLDocument();
xmlDoc.Load(//fileLocation);

As Jeroen pointed out in a comment, the encoding should be:
<?xml version="1.0" encoding="ISO-8859-1"?>
not:
<?xml version="1.0" encoding="ISO 8859-1"?>
(missing dash -).
You can use a StreamReader with an explicit encoding to read the file anyway:
using (var reader = new StreamReader("//fileLocation", Encoding.GetEncoding("ISO-8859-1")))
{
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
// ...
}
(from answer by competent_tech in other thread I linked in an earlier comment).
If you do not want the using statement, I guess you can do:
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(File.ReadAllText("//fileLocation", Encoding.GetEncoding("ISO-8859-1")));
Instead of XmlDocument, you can use the XDocument class in the namespace System.Xml.Linq if you refer the assembly System.Xml.Linq.dll (since .NET 3.5). It has static methods like Load(Stream) and Parse(string) which you can use as above.

How to Specify XML Encoding when Serializing an Object in C#

I am serializing a C# object into an XML document and sending the XML document to a third party vendor. The vendor is telling me that the encoding specification in the document is UTF-16, but the XML document contains UTF-8 content and they can't use it. Here is the code I am using to create the XML file, which runs without error and creates an XML document.
// Instantiate xmlSerializer with my object type.
XmlSerializer xmlSerializer = new XmlSerializer(typeof(MyObject));
// Instantiate a new stream and pass file location and mode.
Stream stream = new FileStream(#"C:\doc.xml", FileMode.Create);
// Instantiate xmlWriter and pass stream and encoding.
XmlWriter xmlWriter = new XmlTextWriter(stream, Encoding.Unicode);
// Call serialize method and pass xmlWriter and my object.
xmlSerializer.Serialize(xmlWriter, myObject);
// Close writer and stream.
xmlWriter.Close();
stream.Close();
When I run this, the XML Doc shows this on the first line:
<?xml version="1.0" encoding="UTF-16"?>
I've tried changing the Encoding from Encoding.Unicode to Encoding.UTF8 in the XmlTextWriter, but that doesn't change the first line of the XML Doc and it still shows UTF-16.
I also tried using the Serialize method signature that takes 4 parameters (writer, object, namespaces, encoding) and specified UTF8 as the encoding and that didn't change the XML Doc specification either.
I believe all I need to do is change the encoding that shows in the XML Doc to UTF-8 and the third party vendor will be happy. I can't figure out what I am doing wrong.

If I change from Encoding.Unicode to Encoding.UTF8, the file is generated properly. Perhaps you're looking at an old version of your file?
In an unrelated bit, you should use using for deterministic disposal of objects which implement IDisposable:
XmlSerializer xmlSerializer = new XmlSerializer(typeof(MyObject));
using (Stream stream = new FileStream(#".\doc.xml", FileMode.Create))
using (XmlWriter xmlWriter = new XmlTextWriter(stream, Encoding.UTF8))
{
xmlSerializer.Serialize(xmlWriter, myObject);
}

Reading contents of XML file without having to remove the XML declaration

I want to read all XML contents from a file. The code below only works when the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) is removed. What is the best way to read the file without removing the XML declaration?
XmlTextReader reader = new XmlTextReader(#"c:\my path\a.xml");
reader.Read();
string rs = reader.ReadOuterXml();
Without removing the XML declaration, reader.ReadOuterXml() returns an empty string.
<?xml version="1.0" encoding="UTF-8"?>
<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:a="http://www.w3.org/2005/08/addressing">
<s:Header>
<a:Action s:mustUnderstand="1">http://www.as.com/ver/ver.IClaimver/Car</a:Action>
<a:MessageID>urn:uuid:b22149b6-2e70-46aa-8b01-c2841c70c1c7</a:MessageID>
<ActivityId CorrelationId="16b385f3-34bd-45ff-ad13-8652baeaeb8a" xmlns="http://schemas.microsoft.com/2004/09/ServiceModel/Diagnostics">04eb5b59-cd42-47c6-a946-d840a6cde42b</ActivityId>
<a:ReplyTo>
<a:Address>http://www.w3.org/2005/08/addressing/anonymous</a:Address>
</a:ReplyTo>
<a:To s:mustUnderstand="1">http://localhost/ver.Web/ver2011.svc</a:To>
</s:Header>
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Car xmlns="http://www.as.com/ver">
<carApplication>
<HB_Base xsi:type="HB" xmlns="urn:core">
<Header>
<Advisor>
<AdvisorLocalAuthorityCode>11</AdvisorLocalAuthorityCode>
<AdvisorType>1</AdvisorType>
</Advisor>
</Header>
<General>
<ApplyForHB>yes</ApplyForHB>
<ApplyForCTB>yes</ApplyForCTB>
<ApplyForFSL>yes</ApplyForFSL>
<ConsentSupplied>no</ConsentSupplied>
<SupportingDocumentsSupplied>no</SupportingDocumentsSupplied>
</General>
</HB_Base>
</carApplication>
</Car>
</s:Body>
</s:Envelope>
Update
I know other methods that use NON-xml reader (e.g. by using File.ReadAllText()). But I need to know a way that uses an xml method.

There can be no text or whitespace before the <?xml ?> encoding declaration other than a BOM, and no text between the declaration and the root element other than line break.
Anything else is an invalid document.
UPDATE:
I think your expectation of XmlTextReader.read() is incorrect.
Each call to XmlTextReader.Read() steps through the next "token" in the XML document, one token at a time. "Token" means XML elements, whitespace, text, and XML encoding declaration.
Your call to reader.ReadOuterXML() is returning an empty string because the first token in your XML file is an XML declaration, and an XML declaration does not have an OuterXML.
Consider this code:
XmlTextReader reader = new XmlTextReader("test.xml");
reader.Read();
Console.WriteLine(reader.NodeType); // XMLDeclaration
reader.Read();
Console.WriteLine(reader.NodeType); // Whitespace
reader.Read();
Console.WriteLine(reader.NodeType); // Element
string rs = reader.ReadOuterXml();
The code above produces this output:
XmlDeclaration
Whitespace
Element
The first "token" is the XML declaration.
The second "token" encountered is the line break after the XML declaration.
The third "token" encountered is the <s:Envelope> element. From here a call to reader.ReadOuterXML() will return what I think you're expecting to see - the text of <s:Envelope> element, which is the entire soap packet.
If what you really want is to load the XML file into memory as objects, just call
var doc = XDocument.Load("test.xml")
and be done with the parsing in one fell swoop.
Unless you're working with an XML doc that is so monstrously huge that it won't fit in system memory, there's really not a lot of reason to go poking through the XML document one token at a time.

What about
XmlDocument doc=new XmlDocument;
doc.Load(#"c:\my path\a.xml");
//Now we have the XML document - convert it to a String
//There are many ways to do this, one should be:
StringWriter sw=new StringWriter();
doc.Save(sw);
String finalresult=sw.ToString();

EDIT: I'm assuming you mean you actually have text between the document declaration and the root element. If that's not the case, please clarify.
Without removing the extra text, it's simply an invalid XML file. I wouldn't expect it to work. You don't have an XML file - you have something a bit like an XML file, but with extraneous stuff before the root element.

IMHO you can't read this file. It's because there's a plain text before the root element <s:Envelope> which makes whole document invalid.

You're parsing an XML document as XML just to obtain the source text? Why?
If you really want to do that then:
string rs;
using(var rdr = new StreamReader(#"c:\my path\a.xml"))
rs = rdr.ReadToEnd();
Will work, but I'm really not sure that is what you actually want. This pretty much ignores that it's XML and just reads the text. Useful for some things, but not a lot.

validate xml string content including encoding using C#

I need to validate a string that contains XML Data, there is no schema validation required. All I need to do is make sure that the XML is well formed and properly encoded. For example, I want my code to identify this snippet of XML as invalid:
<?xml version="1.0" encoding="utf-8"?>
<parentNode> Positions1 ’</parentNode>
Using the LoadXML method in XMLDocument does not work, there are no errors thrown when I load the snippet above.
I am aware of how to do this if the content were in an XML file, the following snippet of code shows that:
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.ConformanceLevel = ConformanceLevel.Document;
readerSettings.CheckCharacters = true;
readerSettings.ValidationType = ValidationType.None;
xmlReader = XmlReader.Create(xmlFileName, readerSettings);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(xmlReader);
So short of creating a temporary file to write out my xml string content and then creating an XmlReader instance to read it, is there any alternative? Appreciate much if someone could guide me in the right direction with this problem.

You have not fully understand what encoding means. If you have a .Net string in memory, it's no more "raw data" and has no encoding for that reason. And so LoadXML ingores for a good reason. So what you want to do makes not much sense at all. But if you really want to do it:
You can convert your string into a in memory stream, so you don't have to write a temporary file. Then you can use that stream instead of the xmlFileName in your call to XmlReader.Create.

Achim,
Thanks for your detailed replies, I was able to finally come up with a solution that fits my needs. It involves grabbing the bytes out of the 'unicode' string and then transforming the bytes to utf8 encoding.
try
{
byte[] xmlContentInBytes = new System.Text.UnicodeEncoding().GetBytes(xmlContent);
System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false, true);
utf8.GetChars(xmlContentInBytes);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
return false;
}

Correcting Encoding in a large Xml File

I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
doc.Load(fullFilePath);
}
When I execute this code with the data contained on top I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?
Update: I do not have any encoding declaration or <?xml in this document.
I've seen some links say to add it dynamically? Is this UTF-16 encoding?

It appears that:
The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a Ü.
The fix is simple: just specify this encoding when opening your XML file:
XmlDocument doc = new XmlDocument();
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
doc.Load(reader);
}

From here:
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
using (var xmlreader = new XmlTextReader(stream))
{
xmlreader.MoveToContent();
encoding = xmlreader.Encoding;
}
}
You might want to take a look at this: How to best detect encoding in XML file?
For actual reading you can use StreamReader to take care of BOM(Byte order mark):
string xml;
using (var reader = new StreamReader("FilePath", true))
{ // ↑
xml= reader.ReadToEnd(); // detectEncodingFromByteOrderMarks
}
Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not it will default to UTF8.
Edit 2: Detecting Text Encoding for StreamReader

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an xml processing instruction at the top like <?xml version="1.0" encoding="UTF-8" ?>?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

XML Deserialize with UTF-8 encoding - c#

I think the example from #Lloyd needs the new keyword: using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {

Related

C# Parsing XML in ISO-8859-1

How to Specify XML Encoding when Serializing an Object in C#

Reading contents of XML file without having to remove the XML declaration

validate xml string content including encoding using C#

Correcting Encoding in a large Xml File

Categories

Resources