I am trying to read a file that has some letters that aren't showing up correctly when I convert the file to XML. The letters come up as blocks when I open the file in Notepad++, but in the original document they are correct. An example letter is á.
I am using UTF-8 to encode the file, so the character should be covered, but for some reason it isn't. If I change the encoding to Windows-1252, the character shows up correctly.
Why is the character not handled correctly with UTF-8 but is with Windows-1252?
If you need any more information, just ask. Thanks in advance.
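For reference, a minimal C# sketch of the two readings (the file name is a placeholder, and this assumes the file was actually saved as Windows-1252, which is the usual cause of this symptom):

    using System;
    using System.IO;
    using System.Text;

    class ReadDemo
    {
        static void Main()
        {
            string path = "input.txt"; // placeholder path

            // Bytes like 0xE1 (á in Windows-1252) are not valid UTF-8 sequences,
            // so decoding them as UTF-8 yields the replacement character, often
            // rendered as a block.
            string asUtf8 = File.ReadAllText(path, Encoding.UTF8);

            // Decoding with the code page the file was actually saved in gives á.
            // (On .NET Core/.NET 5+, code page 1252 requires registering
            // CodePagesEncodingProvider from System.Text.Encoding.CodePages.)
            string asAnsi = File.ReadAllText(path, Encoding.GetEncoding(1252));

            Console.WriteLine(asUtf8);
            Console.WriteLine(asAnsi);
        }
    }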
I have an input file in my ASP.NET application.
The user submits a CSV file to update the database.
This CSV file is created by exporting a .xlsx file.
The .xlsx file contains non-ASCII characters, such as França, Rússia, etc.
The user sometimes incorrectly saves it via "CSV (MS-DOS)" (which writes ASCII format) instead of "CSV (comma separated file)" (which preserves the .xlsx encoding).
So, to validate the file encoding before writing its contents to the database...
How can I safely detect file encoding of a file submitted in .net?
P.S.: BOM verification is not enough. A file can be UTF-8 without a BOM.
How can I safely detect file encoding of a file submitted in .net?
You can't.
Excel's "CSV" saving comes out in the machine's ANSI code page, and "CSV (MS-DOS)" comes out in the OEM code page. Both these encodings vary from machine to machine and they're never anything useful like UTF-8 or UTF-16. (Indeed, on some East Asian machines, they may not even be fully ASCII-compatible.)
You might be able to take a guess based on heuristics. For example, if França is a common value in the documents you handle, you could detect its common encodings; a sketch of that check follows the byte table below:
F r a n ç a
Code page 1252 (ANSI on Western European machines): 46 72 61 6e e7 61
Code page 850 (OEM on Western European machines): 46 72 61 6e 87 61
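A sketch of that heuristic in C# (the class and method names are made up; it just scans the uploaded bytes for either pattern):

    using System.Linq;

    static class EncodingGuesser
    {
        // Byte patterns for "França" in the two candidate code pages.
        static readonly byte[] Cp1252Pattern = { 0x46, 0x72, 0x61, 0x6E, 0xE7, 0x61 };
        static readonly byte[] Cp850Pattern  = { 0x46, 0x72, 0x61, 0x6E, 0x87, 0x61 };

        // Returns a guessed code page (1252 or 850), or null if neither pattern occurs.
        public static int? GuessCodePage(byte[] data)
        {
            if (Contains(data, Cp1252Pattern)) return 1252;
            if (Contains(data, Cp850Pattern)) return 850;
            return null;
        }

        static bool Contains(byte[] data, byte[] pattern)
        {
            for (int i = 0; i <= data.Length - pattern.Length; i++)
            {
                if (data.Skip(i).Take(pattern.Length).SequenceEqual(pattern))
                    return true;
            }
            return false;
        }
    }

Calling it on the uploaded bytes, e.g. EncodingGuesser.GuessCodePage(File.ReadAllBytes(path)), only ever gives you a guess, for the reasons above.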
If you don't have any constant patterns like that, the best you can do is arbitrary guessing (see this question). Either way, it hardly qualifies as 'safe'.
CSV as a format does not have a mechanism for declaring encoding, and there isn't a de-facto standard of just using UTF-8. So it can't really be used as a mechanism for transferring non-ASCII text with any degree of reliability.
An alternative you could look at would be to encourage your users to save from Excel as "Unicode text". This will give you a .txt file in the UTF-16LE encoding (Encoding.Unicode in .NET terms), which you can easily detect from the BOM. The content is TSV, so same quoting rules as CSV but with tab separators.
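A sketch of the receiving side under that approach (the path is a placeholder and the TSV split is deliberately naive):

    using System;
    using System.IO;
    using System.Text;

    class UnicodeTextImport
    {
        static void Main()
        {
            string path = "upload.txt"; // placeholder file name

            // "Unicode text" from Excel is UTF-16LE with a byte order mark: 0xFF 0xFE.
            byte[] raw = File.ReadAllBytes(path);
            bool isUtf16Le = raw.Length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE;
            if (!isUtf16Le)
            {
                Console.WriteLine("Rejected: expected a UTF-16LE \"Unicode text\" export.");
                return;
            }

            // Encoding.Unicode is UTF-16LE in .NET; StreamReader skips the BOM for us.
            using (var reader = new StreamReader(path, Encoding.Unicode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Naive TSV split; does not handle quoted fields containing tabs.
                    string[] fields = line.Split('\t');
                    Console.WriteLine(string.Join(" | ", fields));
                }
            }
        }
    }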
I need to save a plain text file with Arabic characters in it, and the Arabic characters must appear as Arabic when the file is opened.
I can insert the names without problem using Encoding.GetEncoding(1256) and save the file - again using 1256 as the StreamWriter's codepage.
However, when viewing the resulting file in Notepad++ the characters do not appear correctly and I have to deliberately switch the codepage back to 1256 for them to appear in Arabic.
I am then transmitting the file to a third party, but they cannot change the codepage (I have no idea why!) and therefore cannot read the Arabic.
Is there any way I can save the file so that the codepage to be used is "embedded" in the file?
Save the file as UTF-8. That should automatically include a magic BOM (Byte Order Mark) in the beginning of the file so applications opening the file will know it is encoded in UTF-8.
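For example, in C# (the path and text are placeholders):

    using System.IO;
    using System.Text;

    class SaveArabicUtf8
    {
        static void Main()
        {
            string path = "names.txt"; // placeholder

            // Encoding.UTF8 carries a preamble (EF BB BF), so the writer emits a
            // BOM at the start of the file. Note that new StreamWriter(path) with
            // no encoding argument defaults to UTF-8 *without* a BOM, so pass the
            // encoding explicitly.
            using (var writer = new StreamWriter(path, false, Encoding.UTF8))
            {
                writer.WriteLine("محمد"); // sample Arabic text (placeholder)
            }
        }
    }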
In a C# console app, I am using StringBuilder to write data to a local file. It seems to be mishandling special characters:
Muñoz
outputs to the file as
MuÃ±oz
I'm at a bit of a loss as to how to manage that correctly.
Your C# code is correctly writing a UTF-8 file, in which ñ is encoded as two bytes (0xC3 0xB1).
You're incorrectly reading the file back with a different encoding, which shows those two bytes as the unwanted characters Ã±.
You need to read the file as UTF-8.
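For example, if you are reading it back in C# (placeholder path):

    using System;
    using System.IO;
    using System.Text;

    class ReadBack
    {
        static void Main()
        {
            string path = "output.txt"; // placeholder

            // Decode the bytes with the same encoding they were written in.
            // (File.ReadAllText(path) with no encoding argument also defaults to
            // UTF-8 and honours a BOM if one is present.)
            string text = File.ReadAllText(path, Encoding.UTF8);
            Console.WriteLine(text); // "Muñoz"
        }
    }

If you are only inspecting the file in an editor, the equivalent step is telling the editor to interpret it as UTF-8.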
I'm having an issue with a simple C# program that is meant to read an XML document from the web, pull out some elements, and then write the contents of those elements to an HTML file (in a simple table). Though the XML documents are correctly encoded as UTF-8, in the end, all of my generated HTML files are failing to correctly transcribe non-English characters (e.g. "Wingdings"-like output when parsing Japanese).
Since the XML files are really large, the program works by having an XmlReader yielding matching elements as it encounters them, which are then written to the HTML file using a StreamWriter.
Does anyone have a sense of where in a program like this the UTF-8 encoding might have to be explicitly forced?
The short explanation
I'm going to guess here: Your browser is displaying the page using the wrong character encoding.
You need to answer: What character encoding does your browser think the HTML is? (I bet it's not UTF-8.)
Try to adjust your browser: for example, in Firefox, this is View → Character Encoding, then select the character encoding to match your document.
Since you seem to have a very multilingual document, have your C# output in UTF-8 - which supports every character known to man, including Japanese, Chinese, Latin, etc. Then try to tell Firefox, IE, whatever, to use UTF-8. Your document should display.
If this is the problem, then you need to inform the browser of the encoding of your document. Do so by one or more of the following (see this); a sketch of the <meta> tag option follows the list:
Having your web server return the character encoding in the HTTP headers.
Specifying a character encoding in a <meta> tag.
Specifying a character encoding in the XML preamble for XHTML.
The more of those you do, the merrier.
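Since your HTML is produced by a StreamWriter and may well be opened straight from disk (where there is no HTTP header at all), the <meta> tag is the easiest place to start. A sketch, with placeholder file name and markup:

    using System.IO;
    using System.Text;

    class HtmlWithCharset
    {
        static void Main()
        {
            string outputFile = "report.html"; // placeholder

            using (var sw = new StreamWriter(outputFile, false, Encoding.UTF8))
            {
                sw.WriteLine("<!DOCTYPE html>");
                sw.WriteLine("<html>");
                sw.WriteLine("<head>");
                // Tells the browser how the bytes of this file are encoded.
                sw.WriteLine("  <meta charset=\"utf-8\">");
                // Older equivalent: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                sw.WriteLine("  <title>Report</title>");
                sw.WriteLine("</head>");
                sw.WriteLine("<body>");
                sw.WriteLine("  <table><!-- generated rows go here --></table>");
                sw.WriteLine("</body>");
                sw.WriteLine("</html>");
            }
        }
    }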
The long explanation
Let's have a look at a few things you mentioned:
using (StreamWriter sw = new StreamWriter(outputFile,true,System.Text.Encoding.UTF8))
and
found that using Text.Encoding.Default made other Western character sets with accents work (Spanish accents, German umlauts), although Japanese still exhibits problems.
I'm going to go out on a limb and say that you're an American computer user. Thus, for you, the "default" encoding on Windows is probably Windows-1252. The default encoding that a web browser will use, if it can't detect the encoding of an HTML document, is ISO-8859-1. ISO-8859-1 and Windows-1252 are very similar, and they both cover ASCII plus some common Latin characters such as é, è, etc. More importantly, the accented characters are encoded the same, so, for those characters, the two encodings decode the same data identically. Thus, when you switched to "default", the browser was decoding your Latin characters correctly, albeit under the wrong encoding label. Japanese doesn't exist in either ISO-8859-1 or Windows-1252, so with both of those the Japanese just appears as random characters ("mojibake").
The fact that you noted that switching to "default" fixes some of the accented Latin characters tells me that your browser is using ISO-8859-1, which isn't what we want: we want to encode the text using UTF-8, and we need the browser to read it back as such. See the short explanation for how to do that.
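To make the mechanism concrete, here is a small sketch that encodes a mixed Latin/Japanese string as UTF-8 and then decodes the same bytes as ISO-8859-1, reproducing the original symptom (UTF-8 bytes misread under the browser's fallback encoding):

    using System;
    using System.Text;

    class MojibakeDemo
    {
        static void Main()
        {
            string original = "café 日本語"; // accented Latin plus Japanese

            // What the program writes (UTF-8 bytes)...
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

            // ...decoded as ISO-8859-1, the browser's fallback when no charset is
            // declared. ISO-8859-1 is built into .NET, unlike Windows-1252, which
            // needs the code-pages provider on .NET Core/.NET 5+.
            string misread = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

            Console.WriteLine(original); // café 日本語
            Console.WriteLine(misread);  // "cafÃ©" plus garbled bytes for the Japanese
        }
    }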