I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(µm / m)
Dümme Mässöng ==> Dümme Mässöng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
more sophisticated one (tried probably a dozen or more encoding combinations:)
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfect for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.
This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).
"ANSI as UTF8"(WTF?)
NotePad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but only contains ANSI data (i.e., é is not encoded in correct UTF8 way, which would mean two bytes).
Or: it is the other way around. It is ANSI (no file header BOM), but the encoding of the individual characters is, or looks like, UTF8. This would explain the ü and other characters expanding in more than one other character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).
Trevor Germain's helped me to save in the correct encoded format
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
I'd suggest you open up the text file in a hex editor, and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use - it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.
For my scenario using StreamWriter I found explicitly passing UTF8 encoding to the StreamWriter enabled excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048
Related
I'm trying to save two text files in ANSI encoding for later processing by a legacy system. However when I save it in the correct encoding, it still saves as a UTF-8 file.
I've tried the following:
File.WriteAllLines(filePath, lines, Encoding.GetEncoding(1252));
File.WriteAllLines(filePath, lines, Encoding.GetEncoding("windows-1252"));
using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.GetEncoding(1252)))
{
foreach (string line in lines)
{
writer.WriteLine(line);
}
}
I've also tried converting an existing utf-8 file to ansi
File.WriteAllBytes(fileName, Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1252), File.ReadAllBytes(fileName)));
None of the above solutions have worked; they're still UTF-8. The only way I managed to make it save as ANSI was by inserting swedish characters like åäö, which is a hack I cannot use for one of the files.
I'm at a loss. Has anyone got a solution to this issue?
We're on .NET Framework 4.5, C# 7.3
I did a thorough investigation and found that it works, just not in the way I expected. As #jdweng said: Nothing in the data contains the encoding, you're just saving bytes. For the most part you're saving regular ASCII characters, so when you for instance open Notepad++ to read it, it will default to whatever encoding it prefers, unless you have a special character that hints to the program which encoding to use.
I encoded a file in four encodings (default (UTF-8), ANSI, ASCII and UTF-8-BOM) and opened up all files in a hex editor and found that in most cases the ä in these files determined which decoder to use in Notepad++.
So if the legacy system uses an ANSI decoder, it should be able to open an "ANSI" encoded file without special characters. Despite it showing up as UTF-8 in Notepad++.
It definitely works. Try the following program:
using System.IO;
using System.Text;
namespace Demo
{
static class Program
{
static void Main()
{
string filePath = #"E:\tmp\test"; // Put your path here.
string[] lines = { "ÿ" };
File.WriteAllLines(filePath + ".1.bin", lines, Encoding.GetEncoding(1252));
File.WriteAllLines(filePath + ".2.bin", lines);
}
}
}
Run the program and then inspect the contents of the files in a binary editor.
You will see the following:
test.1.bin contains: FF 0D 0A
test.2.bin contains: 0C BF 0D 0A
(Note: If you drag and drop a ".bin" file into Visual Studio, it will open it in binary mode.)
I'm wrestling with the following problem.
I am working with V.S.10 and using the .NET framework 2.0. Coding in C#.
I'm making a simple editor which hands over its text to a webservice. I know that .NET uses UTF-16 (I believe the default is LE? And I want Big Endian). I want to make it able to work in any editor and therefore attatch a BOM. The problem is that going through httml it gets changed I believe to UTF-8? Or at least that is what it seems from the following error:
Client found response content type of 'text/html;
charset=UTF-8', but expected 'text/xml'.
The request failed with an empty response.
EDIT: the documentation warns that the encoding of all the properties are UTF-8 withOUT a BOM marker. editorTextString is one of the properties. BUT the file content to upload must be in UTF-16BE WITH a BOM. I've checked to see if .net automatically translates the encoding and it does not. Or at least the chinese letters become ?'s. So I need to re-encode or convert better said, the text to UTF-16BE WITH BOM instead of the UTF-8 without BOM that it is in now.
I've looked through a ton of examples and can't see what I'm doing wrong here. Can someone offer advice or correct the code? (Yes I've also read Jon's really cool article about unicode :)) The theory is clear, the actual practice is lacking.
// Convert to UTF-16 Big Endian
Encoding leUnicode = Encoding.Unicode;
Encoding beUnicode = Encoding.BigEndianUnicode;
byte[] editorTextBytesLE = leUnicode.GetBytes(editorTextString);
Console.WriteLine("Little Endian - Encoded bytes:");
foreach (Byte b in editorTextBytesLE)
{
Console.Write("[{0}]", b);
}
Console.WriteLine();
byte[] editorTextBytesBE = Encoding.Convert(leUnicode, beUnicode, editorTextBytesLE);
Console.WriteLine("BIG ENDIAN - Encoded bytes:");
foreach (Byte b in editorTextBytesBE)
{
Console.Write("[{0}]", b);
}
Console.WriteLine();
String decodedString = UnicodeEncoding.BigEndianUnicode.GetString(editorTextBytesBE);
Console.WriteLine();
Console.WriteLine("Decoded bytes:");
Console.WriteLine(decodedString);
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);
File.WriteAllText(fileName, decodedString);
Console.WriteLine("Uploading {0} to {1} ...", fileName, myURL);
// Upload the file to the URL
editorTextBytesBE = myWebClient.UploadFile(myURL, "PUT", fileName);
I haven't been able to find anything to switch to big endian, but I've seen some examples (which I couldn't get working alas) to switch TO UTF-8. Would much appreciate any help, examples, or links to get the code to UTF-16BE.
Partial answer:
The following code does not look like it is inserting anything. Instead it overwrites 2 bytes at positions 2 and 3 with your BOM. It skips the first 2.
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);
To have a file with BOM in either of UTF-X encoding simply create TextWriter with correct encoding:
using(var writer =
new StreamWriter(fileName, new Encoding.UnicodeEncoding(true,true,true))
{
writer.Write(editorTextString);
}
Use UnicodeEncoding constructor that give BOM.
Side note: there is a good chance that your problem is not related to use of this rare encoding, but it should fix what your code tries to do now.
I managed to work with the following code:
byte[] BOMTextBytesToSend = {0xfe, 0xff };
byte[] editorTextBytesToSend = System.Text.Encoding.BigEndianUnicode.GetBytes(editorTextString);
BOMTextBytesToSend.CopyTo(editorTextBytesToSend, 0);
I have c# program we use to replace some Values with others, to be used after as parameters. Like 'NAME1' replaced with &1, 'NAME2' with &2, and so on.
The problem is that the data to modify is on a text file encoded on UNIX, and special characters like í, which even on memory, gets read as a square(Invalid char). Due specifications that are out of my control, the file can't be changed and have no other choice than read it like that.
I have tryed to read with most of the 130 Encodings c# offers me with:
EncodingInfo[] info = System.Text.Encoding.GetEncodings();
string text;
for (int a = 0; a < info.Length; ++a)
{
text = File.ReadAllText(fn, info[a].GetEncoding());
File.WriteAllText(fn + a, text, info[a].GetEncoding());
}
fn is the file path to read. Have checked all the made files(like 130), no one of them writes properly the í so im out of ideas and im unable to find anything on internet.
SOLUTION:
Looks like finally this code made the work to get the text properly, also, had to fix the same encoder for the Writing part:
System.Text.Encoding encoding = System.Text.Encoding.GetEncodings()[41].GetEncoding();
String text = File.ReadAllText(fn, encoding); // get file text
// DO ALL THE STUFF I HAD TO
File.WriteAllText(fn, text, encoding) System.Text.Encoding.GetEncodings()[115].GetEncoding(); //Latin 9 (ISO)
/* ALL THIS ENCODINGS WORKED APARENTLY FOR ME WITH ALL WEIRD CHARS I WAS ABLE TO WRITE :P
System.Text.Encoding.GetEncodings()[108].GetEncoding(); //Baltic (ISO)
System.Text.Encoding.GetEncodings()[107].GetEncoding(); //Latin 3 (ISO)
System.Text.Encoding.GetEncodings()[106].GetEncoding(); //Central European (ISO)
System.Text.Encoding.GetEncodings()[105].GetEncoding(); //Western European (ISO)
System.Text.Encoding.GetEncodings()[49].GetEncoding(); //Vietnamese (Windows)
System.Text.Encoding.GetEncodings()[45].GetEncoding(); //Turkish (Windows)
System.Text.Encoding.GetEncodings()[41].GetEncoding(); //Central European (Windows) <-- Used this one
*/
Thank you very much for your help
Noman(1)
you have to get the proper encoding format. try
use file -i. That will output MIME-type information for the file,
which will also include the character-set encoding. I found a
man-page for it, too :)
Or try enca
It can guess and even convert between encodings. Just look at
the man page.
If you have the proper encoding format, look for a way to apply it to your file reading.
Quotes: How to find encoding of a file in Unix via script(s)
I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(µm / m)
Dümme Mässöng ==> Dümme Mässöng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
more sophisticated one (tried probably a dozen or more encoding combinations:)
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfect for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.
This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).
"ANSI as UTF8"(WTF?)
NotePad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but only contains ANSI data (i.e., é is not encoded in correct UTF8 way, which would mean two bytes).
Or: it is the other way around. It is ANSI (no file header BOM), but the encoding of the individual characters is, or looks like, UTF8. This would explain the ü and other characters expanding in more than one other character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).
Trevor Germain's helped me to save in the correct encoded format
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
I'd suggest you open up the text file in a hex editor, and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use - it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.
For my scenario using StreamWriter I found explicitly passing UTF8 encoding to the StreamWriter enabled excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048
I'm using the code below to read a text file that contains foreign characters, the file is encoded ANSI and looks fine in notepad. The code below doesn't work, when the file values are read and shown in the datagrid the characters appear as squares, could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding. and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to unicode and used System.Text.Encoding.Unicode and it worked just fine. So why did notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true)
When you try using the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding notepad guessed is used by the file.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This maybe more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too if your OS default encoding matches file encoding:
Encoding.Default;
Yes, it could be with the actual encoding of the file, probably unicode. Try UTF-8 as that is the most common form of unicode encoding. Otherwise if the file ASCII then standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true)
Edit: Just saw your update. Try letting StreamReader do the guessing.
For swedish Å Ä Ö the only solution form the ones above working was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses an UTF-8 StreamReader implicitly. Create your own StreamReader
instance instead and specify the desired encoding.
like
using (StreamReader reader = new StreamReader(#"C:\test.txt", Encoding.Default)
{
// ...
}
I solved my problem of reading portuguese characters, changing the source file on notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(#"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8,true))
{
s = sr.ReadToEnd();
}
I'm also reading an exported file which contains french and German languages. I used Encoding.GetEncoding("iso-8859-1"), true which worked out without any challenges.
for Arabic, I used Encoding.GetEncoding(1256). it is working good.
I had a similar problem with ProcessStartInfo and the property StandardOutputEncoding. I set it for German language console output to code page 850. This way I could read the output like ausführen instead of ausf�hren.