Reading accented characters (á) C# UTF-8 / Windows 1252 - c#

I am trying to read a file that has some letters which aren't showing up correctly when I convert the file to XML. The letters come up as blocks when I open the output in Notepad++, but in the original document they are correct. An example letter is á.
I am using UTF-8 to encode the file, so the character should be covered, but for some reason it isn't. If I change the encoding to Windows-1252 then it shows the character correctly.
Why is it not available in UTF-8 encoding but is in Windows 1252?
If you need any more information, just ask. Thanks in advance.

Related

Unicode to UTF-8 to Unicode?

I'm reading some data, including CDATA strings, from an XML file. The XML is generated by a Linux machine and encoded in UTF-8. The text in the XML is in turn created by a person on a Windows machine and may contain Windows Unicode symbols like „ and “. Now these symbols somehow get corrupted along the way. When I look at the XML in my browser, the symbols are invisible; when I paste the text into the Windows editor, they are displayed as rectangles (invalid chars). When I paste them into an ASCII decoder (http://www.asciivalue.com/index.php) they get untangled into their correct HTML representation (&#132; &#147;). When I save them with Unicode formatting in the editor, they come out as 84 00 93 00.
How can I convert the XML string in C# so that these Unicode symbols are restored?
Your terminology is confusing. Unicode is a set of characters, UTF-8 is an encoding of Unicode; you can't therefore convert Unicode to UTF-8, you can only convert between UTF-8 and some other encoding of Unicode. Similarly, "Windows Unicode" is nonsense.
I think that when the „ and “ characters were inserted into your XML file, they were incorrectly represented using their Windows-1252 codes rather than their UTF-8 codes. So your file is a mixture of UTF-8 and Windows-1252, which makes it impossible to decode. You need to prevent this happening.
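That said, if you are stuck with a decoded string that already contains those stray code points (which would explain the 84 00 93 00 you saw when saving as Unicode), one possible salvage in C# is to round-trip the text through Latin-1 and re-decode it as Windows-1252. This is only a sketch under that assumption, and it also assumes the rest of the string is plain ASCII/Latin-1; the real fix is still to stop the mixed encodings at the source.

using System.Text;

// Illustrative salvage: assumes the string contains C1 control code points such as
// U+0084 and U+0093 because Windows-1252 bytes were mis-decoded upstream, and that
// every other character in the string is <= U+00FF.
static string RepairMisdecoded1252(string s)
{
    // ISO-8859-1 maps U+0000..U+00FF straight back to bytes 0x00..0xFF,
    // so U+0084 becomes the byte 0x84 again.
    byte[] originalBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);

    // Re-decoding those bytes as Windows-1252 turns 0x84 into „ and 0x93 into “.
    return Encoding.GetEncoding(1252).GetString(originalBytes);
}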

Forcing the codepage of a file so that it can show characters encoded with Windows-1256 correctly?

I need to save a plain text file with Arabic characters in it, and
the Arabic characters must appear as Arabic when the file is opened.
I can insert the names without problem using Encoding.GetEncoding(1256) and save the file - again using 1256 as the StreamWriter's codepage.
However, when viewing the resulting file in Notepad++ the characters do not appear correctly and I have to deliberately switch the codepage back to 1256 for them to appear in Arabic.
I am then transmitting the file to a third party, but they cannot change the codepage (I have no idea why!) and therefore cannot read the Arabic.
Is there any way I can save the file so that the codepage to be used is "embedded" in the file?
Save the file as UTF-8. That should automatically include a magic BOM (Byte Order Mark) in the beginning of the file so applications opening the file will know it is encoded in UTF-8.
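For example, a minimal C# sketch (the file name and text are illustrative):

using System.IO;
using System.Text;

// new UTF8Encoding(true) explicitly writes the UTF-8 BOM at the start of the file,
// so editors such as Notepad++ detect the encoding without any manual switching.
using (var writer = new StreamWriter("names.txt", false, new UTF8Encoding(true)))
{
    writer.WriteLine("مرحبا");
}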

Handling special characters in c#

In a C# console app, I am using a StringBuilder to write data to a local file. It seems to be mishandling special characters.
Muñoz
outputs to the file as
Muñoz
I'm at a bit of a loss as to how to manage that correctly.
Your C# code is correctly writing a UTF-8 file, in which ñ is encoded as two bytes (0xC3 0xB1).
You're incorrectly reading the file back with a different encoding, which shows those bytes as two unwanted characters.
You need to read the file as UTF-8.
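For example, a minimal sketch (the file name is illustrative; a StreamReader constructed with Encoding.UTF8 works just as well):

using System.IO;
using System.Text;

// Reading the bytes back with the same encoding they were written with
// turns 0xC3 0xB1 back into a single ñ instead of two stray characters.
string text = File.ReadAllText("output.txt", Encoding.UTF8);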

Strange UTF-8 encoding issues when reading XML, writing results in C#

I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, all of my generated HTML files fail to correctly transcribe non-Western characters (e.g. "Wingdings"-like output when parsing Japanese).
Since the XML files are really large, the program works by having an XmlReader yielding matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?
The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: What character encoding does your browser think the HTML is? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by (see this):
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier. A rough C# sketch of the <meta> approach is shown below.
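Something along these lines, where the file name and the markup are purely illustrative:

using System.IO;
using System.Text;

// Write the HTML as UTF-8 *and* declare the charset in a <meta> tag so the browser
// doesn't fall back to guessing ISO-8859-1 / Windows-1252.
using (var sw = new StreamWriter("report.html", false, Encoding.UTF8))
{
    sw.WriteLine("<!DOCTYPE html>");
    sw.WriteLine("<html><head>");
    sw.WriteLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
    sw.WriteLine("</head><body>");
    // ... the table rows pulled from the XmlReader would be written here ...
    sw.WriteLine("</body></html>");
}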
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile,true,System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both cover ASCII plus some common Latin characters such as é, è, etc. More importantly, those accented characters are encoded the same way in both, so for those characters the two encodings decode the same data identically. Thus, when you switched to "default", the browser was decoding your Latin characters correctly, even though it was using the wrong encoding. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, so with either of them Japanese just appears as random characters ("mojibake").
The fact that switching to "default" fixes some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text as UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.
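To see the mechanism concretely, here is a tiny illustrative sketch of what happens when UTF-8 bytes are decoded with the wrong single-byte encoding:

using System;
using System.Text;

// "é" is two bytes in UTF-8: 0xC3 0xA9.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("é");

// Decoding those same bytes as ISO-8859-1 yields two characters instead of one.
string mojibake = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

Console.WriteLine(mojibake); // prints "Ã©"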

How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But some of this output is becoming mangled; specifically, the symbol '£' is output as the Unicode U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows-1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
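A minimal sketch of that read-then-rewrite step (file names are illustrative):

using System.IO;
using System.Text;

// Read the upload as Windows-1252 (a BOM, if present, overrides this)...
using (var reader = new StreamReader("upload.txt", Encoding.GetEncoding(1252), true))
// ...and rewrite the same text as UTF-8, so £ comes out as the two bytes 0xC2 0xA3.
using (var writer = new StreamWriter("upload-utf8.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}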
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.
