Read txt files (in unicode and utf8) by means of C#

Read txt files (in unicode and utf8) by means of C# - c#

I created two txt files (windows notepad) with the same content "thank you - спасибо" and saved them in utf8 and unicode. In notepad they look fine. Then I tried to read them using .Net:
...File.ReadAllText(utf8FileFullName, Encoding.UTF8);
and
...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);
But in both cases I got this "thank you - ???????". What's wrong?
Upd:
code for utf8
static void Main(string[] args)
{
var encoding = Encoding.UTF8;
var file = new FileInfo(#"D:\encodes\enc.txt");
Console.OutputEncoding = encoding;
var content = File.ReadAllText(file.FullName, encoding);
Console.WriteLine("encoding: " + encoding);
Console.WriteLine("content: " + content);
Console.ReadLine();
}
Result:
thanks ÑÐ¿Ð°ÑÐ¸Ð±Ð¾

Edited as UTF8 should support the characters. It seems that you're outputting to a console or a location which hasn't had its encoding set. If so, you need to set the encoding. For the console you can do this
string allText = File.ReadAllText(unicodeFileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);

Use the Encoding type Default
File.ReadAllText(unicodeFileFullName, Encoding.Default);
It will fix the ???? Chracters.

When outputting Unicode or UTF-8 encoded multi-byte characters to the console you will need to set the encoding as well as ensure that the console has a font set that supports the multi-byte character in order to display the corresponding glyph. With your existing code a MessageBox.Show(content) or display on a Windows or Web Form would appear correctly.
Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation on setting fonts within the console window.
"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."
As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark indicator to automatically set the encoding when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.

Related

C# Set encoding to read and write File with special characters

I have a csv file with special characters from French and German language, for example:
é, ü, ö...
this file needs to be rearranged in program into different csv format.
Problem is special characters in result file are replaced either by � or just ?
I checked and reading source file is not good:
Encoding encoding = Encoding.GetEncoding("utf-8");
string fileName = #"D:\temp\sourceFile.csv";
var lines = File.ReadAllLines(fileName, encoding);
when I checked what is in lines list, already there special letters are replaced with ?, so the reading of file is not good in the first place. I tried various encoding but none of them works. Original csv file is ANSI, I don't know why that doesn't work at all.
in the end result is written into result file:
TextWriter tw = new StreamWriter(#"d:\temp\Result.csv", false, encoding);
foreach (var item in result)
{
tw.WriteLine(item.Key + ";" + item.Value);
}
tw.Close();
This encoding seems to be a problem from the start. I tried various including Encoding.Default, nothing works. Please, help!

i have same problem:
you could try some encodings if Default not functional:
"Windows 1252" -> Encoding.GetEncoding(1252); #Europe Ouest (Windows)
"ISO-8859-1" -> Encoding.GetEncoding(28591); #Europe occidentale (ISO)

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
more sophisticated one (tried probably a dozen or more encoding combinations:)
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;

This worked perfect for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)

try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.

This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).

"ANSI as UTF8"(WTF?)
NotePad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but only contains ANSI data (i.e., é is not encoded in correct UTF8 way, which would mean two bytes).
Or: it is the other way around. It is ANSI (no file header BOM), but the encoding of the individual characters is, or looks like, UTF8. This would explain the ü and other characters expanding in more than one other character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).

Trevor Germain's helped me to save in the correct encoded format
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}

I'd suggest you open up the text file in a hex editor, and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use - it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.

For my scenario using StreamWriter I found explicitly passing UTF8 encoding to the StreamWriter enabled excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

How to set the BOM for a file being read

I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);
No matter what value I tried instead of "Encoding.UTF8" I got garbled characters displayed in place of the Arabic. The western text was displayed fine.
I thought it might have been an issue with the way the TextBox was defined, but on start up I write some mixed Western/Arabic text to the textbox and this displays fine:
tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";
Then I opened Notepad and copied the above text into it, then saved the file, at which point Notepad save dialogue asked for which encoding to use.
I then presented the saved file to my code and it displayed all the content correctly.
I examined the file and found 3 binary bytes at the beginning (not visible in Notepad):
The 3 bytes, I subsequently found through research represent the BOM, and this enables the C# "File.ReadAllText(fileName.Text, Encoding.UTF8);" to read/display the data as desired.
What puzzles me is specifying the " Encoding.UTF8" value should take care of this.
The only way I can think is to code up a step to add this data to a copy of teh file, then process that file. But this seems rather long-winded. Just wondering if there is a better way to do, or why the Encoding.UTF8 is not yielding the desired result.
Edit:
Still no luck despite trying the suggestion in the answer.
I cut the test data down to containing just Arabic as follows:
Code as follows:
FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();
Tried with both "true" and "false" on the 2nd line, but both give the same result.
If I open the file in Notepad++, and specify the Arabic ISO-8859-6 Character set it displays fine.
Here is what is looks like in Notepad++ (and what I would liek the textbox to display):
Not sure if the issue is in the reading from file, or the writing to the textbox.
I will try inspecting the data post read to see. But at the moment, I'm puzzled.

The StreamReader class has a constructor that will take care of testing for the BOM for you:
using (var stream = new FileStream(fileName.Text, FileAccess.Read))
{
using (var sr = new StreamReader(stream, Encoding.UTF8, true))
{
var text = sr.ReadToEnd();
}
}
The final true parameter is detectEncodingFromByteOrderMark:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes:
UTF-8
little-endian Unicode
and big-endian Unicode text
if the file
starts with the appropriate byte order marks. Otherwise, the
user-provided encoding is used. See the Encoding.GetPreamble method
for more information.

C#: bytes to UTF-8 string conversion. Why doesn't it work?

There is a Chinese character 𤭢 which is presented in UTF-8 as F0 A4 AD A2. This character is described here: http://en.wikipedia.org/wiki/UTF-8
𤭢 U+24B62 F0 A4 AD A2
When I run this code in C# ...
byte[] data = { 0xF0, 0xA4, 0xAD, 0xA2 };
string abc = Encoding.UTF8.GetString(data);
Console.WriteLine("Test: description = {0}", abc);
... I redirect the output to the text file and then open it with notepad.exe choosing UTF-8 encoding. I expect to get 𤭢 in the output, but do get two question marks (??).
The byte sequence is right. It works in Perl:
print "\xF0\xA4\xAD\xA2";
In the output, I get 𤭢
So my question is: why do I get "??" instead of "𤭢" in C#?
P.S. Nothing special with this character: I got the same thing for any character (2, 3 or 4 byte long).

Console can't display Unicode characters by default. It displays only ASCII. To enable it display Unicode, use:
Console.OutputEncoding = System.Text.Encoding.Unicode
before writing to it.
But anyway it will fail on most OS, because Windows Command line doesn't support Unicode itself.
So, for testing purpose it would be better to write output to file

You need to write to a file using UTF8. The code below shows how you may do it. When opening the resulting file in Notepad, the character 𤭢 is shown correctly:
string c = "𤭢";
var bytes = Encoding.UTF8.GetBytes(c);
var cBack = Encoding.UTF8.GetString(bytes);
using (var writer = new StreamWriter(#"c:\temp\char.txt", false, Encoding.UTF8))
{
writer.WriteLine(cBack);
}

c#, Excel + csv: how to get the correct encoding?

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
more sophisticated one (tried probably a dozen or more encoding combinations:)
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;

This worked perfect for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)

try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.

This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).

"ANSI as UTF8"(WTF?)
NotePad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but only contains ANSI data (i.e., é is not encoded in correct UTF8 way, which would mean two bytes).
Or: it is the other way around. It is ANSI (no file header BOM), but the encoding of the individual characters is, or looks like, UTF8. This would explain the ü and other characters expanding in more than one other character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).

Trevor Germain's helped me to save in the correct encoded format
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}

I'd suggest you open up the text file in a hex editor, and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use - it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.

For my scenario using StreamWriter I found explicitly passing UTF8 encoding to the StreamWriter enabled excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.