Correcting Encoding in a large Xml File

I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
doc.Load(fullFilePath);
}
When I execute this code with the data shown above, I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?
Update: I do not have any encoding declaration or <?xml in this document.
I've seen some links say to add it dynamically? Is this UTF-16 encoding?

It appears that:
The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a Ü.
The fix is simple: just specify this encoding when opening your XML file:
XmlDocument doc = new XmlDocument();
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
doc.Load(reader);
}
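To see why code page 437 is the right guess, here is a minimal sketch that reverses the wrong decoding on a string instead of a file. The class and method names are made up for illustration. It assumes the text was decoded as windows-1252 and that the original bytes were code page 437; the RegisterProvider call is only needed on .NET Core / .NET 5+, where these code pages are not built in.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Turns text that was wrongly decoded as windows-1252 back into its
    // raw bytes, then decodes those bytes as DOS code page 437.
    public static string Repair(string garbled)
    {
        // Required on .NET Core / .NET 5+; harmless to call more than once.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
        return Encoding.GetEncoding(437).GetString(raw);
    }
}
```

Calling MojibakeDemo.Repair("™MšR") maps ™ (0x99) to Ö and š (0x9A) to Ü. For the real file, passing the encoding to StreamReader as shown above is simpler than repairing strings afterwards.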

From here:
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
using (var xmlreader = new XmlTextReader(stream))
{
xmlreader.MoveToContent();
encoding = xmlreader.Encoding;
}
}
You might want to take a look at this: How to best detect encoding in XML file?
For actual reading you can use StreamReader to take care of BOM(Byte order mark):
string xml;
// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
xml = reader.ReadToEnd();
}
Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.
Edit 2: Detecting Text Encoding for StreamReader
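As a rough illustration of the BOM detection described above, the sketch below builds a small UTF-16 "file" in memory and reads it back with detectEncodingFromByteOrderMarks set to true. The class and method names are hypothetical; only the StreamReader and Encoding calls come from the framework.

```csharp
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Reads the bytes as text, letting StreamReader pick the encoding from a BOM.
    // Without a BOM, StreamReader falls back to UTF-8.
    public static string ReadWithBomDetection(byte[] fileBytes)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Builds sample content: the UTF-16 little-endian BOM followed by "<a/>".
    public static byte[] SampleUtf16File()
    {
        byte[] bom = Encoding.Unicode.GetPreamble();
        byte[] body = Encoding.Unicode.GetBytes("<a/>");
        byte[] all = new byte[bom.Length + body.Length];
        bom.CopyTo(all, 0);
        body.CopyTo(all, bom.Length);
        return all;
    }
}
```

Note that this only helps when a BOM is present. A file in an OEM or ANSI code page has no BOM, so its encoding still has to be supplied explicitly.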

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an xml processing instruction at the top like <?xml version="1.0" encoding="UTF-8" ?>?

Related

Parsing and removing BOM/Preamble from XML via filesystem

I am processing XBRL files, and ran into a bunch of them that have a Byte-Order-Mark (BOM) at the start. If I manually remove it, I can process the file without any issue.
I've had several failed attempts to remove the BOM from the start of the XML files that I am reading from.
This is the error message I am receiving:
Data at the root level is invalid. Line 1, position 1.
Originally I was using XDocument.Load(filename), but this failed with the same error, so I modified the code following the advice in "Parsing xml string to an xml document fails if the string begins with <?xml... ?> section", without success.
void Main()
{
XDocument doc;
var filename = @"C:\accounts\toprocess\2008\Prod224_8998_00741575_20080630.xml";
byte[] file = File.ReadAllBytes(filename);
using (MemoryStream memory = new MemoryStream(file))
{
using (XmlTextReader oReader = new XmlTextReader(memory))
{
doc = XDocument.Load(oReader);
}
}
}
The XML file can be found here: http://s000.tinyupload.com/download.php?file_id=92333278767554773703&t=9233327876755477370347742

C3 AF C2 BB C2 BF looks to be a double UTF-8 encoded BOM. UTF-8 encoding of the BOM is EF BB BF. If you were to treat each of those as a separate character and UTF-8 encode, you'd end up with the sequence that you're seeing.
So the document you have is broken. Something is taking a document containing a UTF-8 BOM and treating it as extended ASCII. If you can't get the documents fixed at source, I'd be inclined to look for that specific sequence at the start of the file and strip it if present.
If the documents in question use other extended ASCII characters, there's a good chance they'll be broken too.
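The "look for that specific sequence and strip it if present" approach could be sketched roughly as follows. The class and method names are invented for illustration, and the sketch assumes the broken prefix is exactly the six bytes quoted above and that the files are small enough to read into memory.

```csharp
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class BrokenBomStripper
{
    // A UTF-8 BOM (EF BB BF) whose bytes were read as single-byte
    // characters and then encoded as UTF-8 a second time.
    private static readonly byte[] DoubleEncodedBom = { 0xC3, 0xAF, 0xC2, 0xBB, 0xC2, 0xBF };

    // Returns 6 if the data starts with the broken prefix, otherwise 0.
    public static int PrefixLength(byte[] data)
    {
        bool hasPrefix = data.Length >= DoubleEncodedBom.Length
            && data.Take(DoubleEncodedBom.Length).SequenceEqual(DoubleEncodedBom);
        return hasPrefix ? DoubleEncodedBom.Length : 0;
    }

    // Loads the document, skipping the broken prefix only when it is present.
    public static XDocument Load(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        int skip = PrefixLength(data);
        using (var stream = new MemoryStream(data, skip, data.Length - skip))
        {
            return XDocument.Load(stream);
        }
    }
}
```

Checking for the prefix first means files without the broken BOM load unchanged. It does not repair any other double-encoded characters further into the document.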

The sequence C3 AF C2 BB C2 BF does not look like any BOM.
You probably should investigate what it is, if it is consistent (in length) etc.
As it is, you can simply skip the first 6 bytes:
using (var stream = File.Open(fileName, FileMode.Open))
{
stream.Seek(6, SeekOrigin.Begin);
var doc = XDocument.Load(stream);
// ...use it
}

XmlException: Text node cannot appear in this state. Line 1, position 1

Before I get into the issue, I'm aware there is another question that sounds exactly the same as mine. However, I've tried that solution (using Notepad++ to encode the xml file as UTF-8 (without BOM) ) and it doesn't work.
XmlDocument namesDoc = new XmlDocument();
XmlDocument factionsDoc = new XmlDocument();
namesDoc.LoadXml(Application.persistentDataPath + "/names.xml");
factionsDoc.LoadXml(Application.persistentDataPath + "/factions.xml");
Above is the code I have problems with. I'm not sure what the problem is.
<?xml version="1.0" encoding="UTF-8"?>
<factions>
<major id="0">
...
Above is a section of the XML file (the start of it - names.xml is also the same except it has no 'id' attribute). The file(s) are both encoded in UTF-8 - in the latest notepad++ version, there is no option of "encode in UTF-8 without BOM" afaik UTF-8 is the same as UTF-8 without BOM.
Does anyone have any idea what the cause may be? Or am I doing something wrong/forgetting something? :/

You are receiving an error because the .LoadXml() method expects a string argument that contains the XML data, not the location of an XML file. If you want to load an XML file then you need to use the .Load() method, not the .LoadXml() method.
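A small sketch of the difference, using hypothetical helper names: LoadXml parses a string that already contains XML, while Load reads XML from a file path (or a stream or reader).

```csharp
using System.Xml;

public static class LoadVsLoadXml
{
    // LoadXml: the argument is the XML text itself.
    public static string RootNameFromText(string xmlText)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xmlText);
        return doc.DocumentElement.Name;
    }

    // Load: the argument is the location of an XML file.
    public static string RootNameFromFile(string path)
    {
        var doc = new XmlDocument();
        doc.Load(path);
        return doc.DocumentElement.Name;
    }
}
```

Passing a path to LoadXml makes the parser treat the path itself as XML text, which is why it fails at line 1, position 1.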

XML Deserialize with UTF-8 encoding

I already searched a lot today about this and I can't find how to Deserialize with UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
<AvailabilityRequestV2 xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
siteid="0000"
apikey="0000"
async="false" waittime="0">
<Type>4</Type>
<Id>159266</Id>
<Radius>0</Radius>
<Latitude>0</Latitude>
<Longitude>0</Longitude>
</AvailabilityRequestV2>
If I try this
string xmlString = File above;
XmlSerializer serializer = new XmlSerializer(typeof(AvailabilityRequestV2));
AvailabilityRequestV2 request = (AvailabilityRequestV2)serializer.Deserialize(
new MemoryStream(Encoding.UTF8.GetBytes(xmlString)));
If I hover the mouse over request in debug mode, I get this:
{<?xml version="1.0" encoding="utf-16"?><AvailabilityRequestV2
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
..................
How can I force it to be UTF-8?
I only found how to do this for Serialize, not for Deserialize.

You can use a StreamReader and specify UTF-8, you can also tell it to use the BOM if present:
using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {
XmlSerializer serializer = new XmlSerializer(typeof(SomeType));
object result = serializer.Deserialize(reader);
}
However, I'm unsure what happens when the XML reader encounters the encoding="utf-16" directive within the XML; it may switch over.

Once you have slurped the contents of a file into a .Net/CLR string, it is UTF-16 encoded: it has been transformed from its original source encoding. The CLR uses UTF-16 internally—hence the reason for a char being 16 bits.
As a result, the encoding specified in the document's [original] XML Declaration is now at odds with the actual encoding of the document.
Best to pass a StreamReader as recommended by @Lloyd above.
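If the XML is already in a string, one option is to deserialize from a StringReader, so no byte-level encoding is involved at all. The sketch below uses a made-up Request type as a stand-in for the real AvailabilityRequestV2 class, which is not shown in the question.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for the real request type.
public class Request
{
    public int Id { get; set; }
}

public static class StringDeserialization
{
    // Deserializes from text that is already a .NET string. The reader
    // works on characters, so no bytes need to be decoded here.
    public static Request FromXml(string xml)
    {
        var serializer = new XmlSerializer(typeof(Request));
        using (var reader = new StringReader(xml))
        {
            return (Request)serializer.Deserialize(reader);
        }
    }
}
```

When the XML lives in a file, passing a StreamReader with the right encoding, as in the answer above, remains the more direct route.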

I think the example from @Lloyd needs the new keyword:
using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {

Getting "ï»¿" at the beginning of my XML File after save() [duplicate]

This question already has answers here:
How can I remove the BOM from XmlTextWriter using C#?
(2 answers)
Closed 7 years ago.
I'm opening an existing XML file with C#, and I replace some nodes in there. All works fine. Just after I save it, I get the following characters at the beginning of the file:
ï»¿ (EF BB BF in HEX)
The whole first line:
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The rest of the file looks like a normal XML file.
The simplified code is here:
XmlDocument doc = new XmlDocument();
doc.Load(xmlSourceFile);
XmlNode translation = doc.SelectSingleNode("//trans-unit[@id='127']");
translation.InnerText = "testing";
doc.Save(xmlTranslatedFile);
I'm using a C# Windows Forms application with .NET 4.0.
Any ideas? Why would it do that? Can we disable that somehow? It's for Adobe InCopy, and it does not open it like this.
UPDATE:
Alternative Solution:
Saving it with the XmlTextWriter works too:
XmlTextWriter writer = new XmlTextWriter(inCopyFilename, null);
doc.Save(writer);

It is the UTF-8 BOM, which is actually discouraged by the Unicode standard:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."
You may disable it using:
var sw = new System.IO.StreamWriter(path, false, new System.Text.UTF8Encoding(false));
doc.Save(sw);
sw.Close();
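To check that UTF8Encoding(false) really suppresses the BOM, a sketch like the following saves to a MemoryStream instead of a file so the first bytes can be inspected. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class NoBomSave
{
    // Saves the document as UTF-8 without a byte order mark and
    // returns the bytes that would have been written to disk.
    public static byte[] SaveWithoutBom(XmlDocument doc)
    {
        using (var ms = new MemoryStream())
        {
            // Passing false to UTF8Encoding disables the EF BB BF preamble.
            using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
            {
                doc.Save(sw);
            }
            // ToArray still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }
}
```

The first byte of the result should be 0x3C ("<") rather than 0xEF.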

It's a UTF-8 Byte Order Mark (BOM) and is to be expected.

You can try to change the encoding of the XmlDocument. Below is the example copied from MSDN
using System; using System.IO; using System.Xml;
public class Sample {
public static void Main() {
// Create and load the XML document.
XmlDocument doc = new XmlDocument();
string xmlString = "<book><title>Oberon's Legacy</title></book>";
doc.Load(new StringReader(xmlString));
// Create an XML declaration.
XmlDeclaration xmldecl;
xmldecl = doc.CreateXmlDeclaration("1.0",null,null);
xmldecl.Encoding="UTF-16";
xmldecl.Standalone="yes";
// Add the new node to the document.
XmlElement root = doc.DocumentElement;
doc.InsertBefore(xmldecl, root);
// Display the modified XML document
Console.WriteLine(doc.OuterXml);
}
}

As everybody else mentioned, it's a Unicode issue.
I advise you to try LINQ To XML. Although not really related, I mention it as it's super easy compared to old ways and, more importantly, I assume it might have automatic resolutions to issues like these without extra coding from you.

C# Help reading foreign characters using StreamReader

I'm using the code below to read a text file that contains foreign characters. The file is encoded as ANSI and looks fine in Notepad. The code below doesn't work: when the file values are read and shown in the datagrid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding. and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to unicode and used System.Text.Encoding.Unicode and it worked just fine. So why did notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?

You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you try using the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding notepad guessed is used by the file.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.

I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too if your OS default encoding matches file encoding:
Encoding.Default;

Yes, the problem could be the actual encoding of the file, which is probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, then standard ASCII encoding should work.

Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.

Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.

For Swedish Å Ä Ö, the only solution from those above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.

File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like this:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
// ...
}
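As a small illustration of why the explicit encoding matters, the sketch below decodes the same bytes through a StreamReader with a chosen encoding. The helper name is made up, and the sample bytes are the ISO-8859-1 codes for Å, Ä and Ö; a MemoryStream stands in for the file.

```csharp
using System.IO;
using System.Text;

public static class ExplicitEncodingDemo
{
    // Reads the bytes as text using the given encoding instead of
    // the UTF-8 default that File.OpenText() would apply.
    public static string Decode(byte[] fileBytes, Encoding encoding)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, Decode(new byte[] { 0xC5, 0xC4, 0xD6 }, Encoding.GetEncoding("iso-8859-1")) gives "ÅÄÖ". The same bytes are not valid UTF-8, so reading them with the default would produce replacement characters instead.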

I solved my problem of reading Portuguese characters by changing the source file in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8,true))
{
s = sr.ReadToEnd();
}

I'm also reading an exported file that contains French and German text. I used Encoding.GetEncoding("iso-8859-1"), true, which worked without any problems.

For Arabic, I used Encoding.GetEncoding(1256). It works well.

I had a similar problem with ProcessStartInfo and the property StandardOutputEncoding. I set it for German language console output to code page 850. This way I could read the output like ausführen instead of ausf�hren.
