Strange UTF-8 encoding issues when reading XML, writing results in C#

I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, in the end all of my generated HTML files fail to correctly transcribe non-Western characters (e.g. "Wingdings"-like output when parsing Japanese).
Since the XML files are really large, the program works by having an XmlReader yield matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?

The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: what character encoding does your browser think the HTML is in? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by (see this):
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier.
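For instance, in the program from the question, the <meta> tag option can be handled right where the HTML is written. Here is a minimal sketch, assuming the output is built with a StreamWriter (the file name and the surrounding markup are illustrative):

using System.IO;
using System.Text;

// Write the HTML as UTF-8 *and* declare that encoding in a <meta> tag,
// so the browser doesn't have to guess.
using (var sw = new StreamWriter("output.html", false, Encoding.UTF8))
{
    sw.WriteLine("<html><head>");
    sw.WriteLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
    sw.WriteLine("</head><body>");
    // ... table rows extracted from the XML go here ...
    sw.WriteLine("</body></html>");
}

Opening the StreamWriter with Encoding.UTF8 gets the bytes right; the <meta> tag makes sure the browser interprets them that way even when there is no HTTP header to rely on (for example, when the file is opened straight from disk).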
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile, true, System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both cover ASCII plus some common Latin characters such as é, è, etc. More importantly, the accented characters are encoded the same way in both, so for those characters the two encodings decode the same bytes to the same text. Thus, when you switched to "default", the browser happened to decode your Latin characters correctly, even though it was assuming the wrong encoding. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, and under either of them Japanese just appears as random characters ("Mojibake").
The fact that you noted that switching to "default" fixes some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text using UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.

Related

C# replaces special characters with question marks

I'm having a problem with encoding in C#.
I'm downloading an XML file encoded in Windows-1250, and when it is saved to a file, special characters like Š and Đ are replaced with ?, even though the file is saved correctly using the Windows-1250 encoding.
This is an example of my code (simplified):
var res = Encoding.GetEncoding("Windows-1250").GetBytes(client.DownloadString("http://link/file.xml"));
var result = Encoding.GetEncoding("Windows-1250").GetString(res);
File.AppendAllText("file.xml", result);
The XML file is in fact encoded using Windows-1250, and it reads just fine when I download it using the browser.
Does anyone know what's going on here?
The problem could result from two different sources, one at the beginning and one at the end of your snippet.
And as has been pointed out, the Encoding and Decoding you are doing in your code is actually useless, because the origin (what DownloadString returns) and target (the variable result) are both C# Unicode strings.
Source 1: DownloadString
DownloadString could not properly decode the Windows-1250 encoded string, because either the server did not send the correct charset in the Content-Type header, or DownloadString doesn't even support this (unlikely, but I'm not familiar with DownloadString).
Source 2: File.AppendAllText
The string was downloaded correctly, then encoded in memory to Windows-1250, then decoded to a Unicode string again and everything worked well.
But then it was written by File.AppendAllText in another, default encoding. AppendAllText has an overload with a third parameter that you can use to specify the encoding. Pass Windows-1250 there to actually write the file in the Windows-1250 encoding.
Also, make sure that whatever editor you use to open the file uses the same encoding - this is often not very easy to guarantee, so I'd suggest you open it in a "developer-friendly" editor that lets you specify the encoding when opening a text file. (Vim, Emacs, Notepad++, Visual Studio, ...).
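Putting both fixes together, here is a minimal sketch, assuming the server really does serve Windows-1250 but does not declare it in its Content-Type header (the URL is the one from the question):

using System.IO;
using System.Net;
using System.Text;

var windows1250 = Encoding.GetEncoding("Windows-1250");

using (var client = new WebClient())
{
    // Decode the downloaded bytes as Windows-1250 instead of the default.
    client.Encoding = windows1250;
    string xml = client.DownloadString("http://link/file.xml");

    // Write the file in Windows-1250 as well, via the third parameter.
    File.AppendAllText("file.xml", xml, windows1250);
}

Whatever encoding you choose for writing, the important part is that the editor you later open the file with reads it using that same encoding.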

Reading accented characters (á) C# UTF-8 / Windows 1252

I am trying to read a file that has some letters that aren't showing up correctly when I am trying to convert the file to XML. The letters come up as blocks when I open the file in Notepad++, but in the original document they are correct. An example letter is á.
I am using UTF-8 to encode the file, so the character should be covered by that, but for some reason it isn't. If I change the encoding to Windows-1252, then it shows the character correctly.
Why is the character not available in UTF-8 encoding but is in Windows-1252?
If you need any more information, just ask. Thanks in advance.

Chinese Character Encoding (UTF-8, GBK)

I have a web crawler that is run on different websites (Chinese in this case).
Now when I retrieve the data and display it on my website, the Chinese characters all end up as garbage. I read about character encoding and found out that UTF-8 is generally the best encoding.
Now the problem is that when I use UTF-8, the data crawled from WEBSITE-1 is shown correctly, but not the data from WEBSITE-2.
For WEBSITE-2, the character encoding GB18030 works correctly.
My question is: is there a way to know the character encoding of a website so that I can build a generic solution? That way I could render a page on my local website knowing what character encoding to use, handle it all in the backend, and not have to worry on the front end about what encoding is required to open a page.
Right now I have two pages: one for UTF-8 Chinese characters, and one for GB18030 Chinese characters.
Use the HTML meta tag with http-equiv="Content-Type" for HTML < 5, or the meta "charset" attribute for HTML5.
W3schools charset
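For the crawler side, one generic approach is to download the raw bytes first, look for a declared charset (in the Content-Type response header, then in the markup itself), and only then decode. A minimal sketch of that idea, assuming WebClient and a simple regex-based sniff (the helper name is made up for illustration):

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

static string DownloadWithDeclaredCharset(string url)
{
    using (var client = new WebClient())
    {
        byte[] raw = client.DownloadData(url);

        // 1. Prefer the charset declared in the Content-Type response header.
        string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType] ?? "";
        Match m = Regex.Match(contentType, @"charset=([\w-]+)", RegexOptions.IgnoreCase);

        // 2. Fall back to a charset declaration inside the page itself.
        if (!m.Success)
        {
            string sniff = Encoding.ASCII.GetString(raw);
            m = Regex.Match(sniff, @"charset=[""']?([\w-]+)", RegexOptions.IgnoreCase);
        }

        Encoding enc = Encoding.UTF8;   // reasonable default when nothing is declared
        if (m.Success)
        {
            try { enc = Encoding.GetEncoding(m.Groups[1].Value); }
            catch (ArgumentException) { /* unknown charset name, keep UTF-8 */ }
        }

        return enc.GetString(raw);
    }
}

Note that on .NET Core / .NET 5+, code-page encodings such as GB18030 are only available after registering the System.Text.Encoding.CodePages provider (Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)).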

Microsoft IDEs, source file encodings, BOMs and the Unicode character \uFEFF?

We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier, in whitespace, but is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of normalize-identifiers to handle things like composite characters, that allows different identifier strings to be treated as identical?
This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it actually is a character denoting a zero-width non-breaking space (ZWNBSP). In some languages/writing systems, all words in a line are joined (= written together), and in such cases this character could be used as a separator, just like a regular space in English, except that it does not cause a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode Consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or to just ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character could just be removed when found in the middle of text.
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of certain leading characters, and with protocols that explicitly mandate the encoding or an encoding identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature related to the byte stream of the file).
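In practice that means a stray U+FEFF can simply be stripped. A minimal sketch in C#, assuming the file is UTF-8 (the file name is illustrative):

using System.IO;
using System.Text;

string text;
// StreamReader consumes a leading BOM itself when asked to detect it.
using (var reader = new StreamReader("source.cs", Encoding.UTF8, true))
{
    text = reader.ReadToEnd();
}

// Remove any U+FEFF left in the body of the text
// (or replace it with U+2060, the word joiner, per the Unicode FAQ).
string cleaned = text.Replace("\uFEFF", string.Empty);

// Write the file back as UTF-8 without a BOM.
File.WriteAllText("source.cs", cleaned, new UTF8Encoding(false));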
On "how MS will write a souce code file" : VS can save files with and without BOM, as well in whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself by going File -> Save ... as -> click triangle on "Save" button and chose "save with encoding".
On usage of FEFF in actual code - never seen one using it in the code... wikipedia suggests that it should be treated as zero-width space if happened anywhere but first position ( http://en.wikipedia.org/wiki/Byte_order_mark ).
For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save a file as Shift-JIS on a 1252 system and it will look OK. But close the project and reopen it, and the text looks like junk. So the encoding information is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor and compiler do.
Also: this applies to VS 2008 and VS 2010; older editors were not as Unicode-friendly.
And C++ has different rules than C# (for C++, files are ANSI by default; for C#, they are UTF-8).

How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But some of this output is becoming mangled: specifically, the symbol '£' is output as the Unicode U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows-1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write it to a file that you've opened with that encoding.
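For example, here is a minimal sketch of that conversion, assuming the uploaded content really is Windows-1252 (the file names are illustrative):

using System.IO;
using System.Text;

// Decode as Windows-1252 (a BOM, if present, still wins because of the 'true'
// flag), then re-save as UTF-8 so characters like '£' survive.
using (var reader = new StreamReader("upload.txt", Encoding.GetEncoding("Windows-1252"), true))
using (var writer = new StreamWriter("upload-utf8.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}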
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat them as a stream of bytes, which Encoding.Default handles well.
