"É" not getting converted to two bytes correctly - c#

Further to this question I've got a supplementary problem.
I've found a track with an "É" in the title.
My code:
var playList = new StreamWriter(playlist, false, Encoding.UTF8);

private static void WriteUTF8(StreamWriter playList, string output)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(output);
    foreach (byte b in byteArray)
    {
        playList.Write(Convert.ToChar(b));
    }
}
converts this to the following bytes:
195
137
which is being output as Ã followed by a square (a character that can't be printed in the current font).
I've exported the same file to a playlist in Media Monkey and it writes the "É" as "Ã‰" - which I'm assuming is correct (as KennyTM pointed out).
My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one?
UPDATE
People seem to be missing the point.
I can get the "É" written to the file using
playList.WriteLine("É");
that's not the problem.
The problem is that Media Monkey requires the file to be in the following format:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
Where all the "high-ascii" characters (for want of a better term) are written out as a pair of characters.
UPDATE 2
I should be getting c9 replaced by c3 89.
I was going to put what I'm actually getting, but in doing the tests for this I've managed to get a test program to output the text in the right format "as is". So I need to do some more investigation.

Using Convert.ToChar like that is almost certainly a bad idea. You're basically encoding things twice.
You should either be performing the conversion yourself and then writing directly to a stream, or you should be letting the StreamWriter do the conversion. Why are you using a StreamWriter at all if you're trying to perform the conversions yourself?
Are you trying to write to a binary file, or a simple text file? If it's a simple text file, just use a StreamWriter and let that do the conversion. If it's a binary file, use a Stream instead of a StreamWriter, and perform text encoding directly where you need to, writing the bytes straight to the stream afterwards.
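For example, here's a minimal sketch of both options (assuming the playlist path string from the question):

// Text file: let StreamWriter do the UTF-8 conversion.
using (var writer = new StreamWriter(playlist, false, Encoding.UTF8))
{
    writer.WriteLine("É"); // lands on disk as 0xC3 0x89, exactly once
}

// Binary file: encode yourself, then write the raw bytes to a Stream.
using (var stream = new FileStream(playlist, FileMode.Create))
{
    byte[] bytes = Encoding.UTF8.GetBytes("É");
    stream.Write(bytes, 0, bytes.Length);
}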
EDIT: Here's what's happening with your original code:
Encoding.UTF8.GetBytes(text) => byte[] { 0xc3, 0x89 };
Convert.ToChar(0xc3) => char U+00C3
StreamWriter writes U+00C3 as byte[] { 0xc3, 0x83 };
Convert.ToChar(0x89) => char U+0089
StreamWriter writes U+0089 as byte[] { 0xc2, 0x89 };
So that's why you're getting c3 83 c2 89 written to the file.

StreamWriter already converts the characters you send it to UTF-8 — that's its entire purpose. Throw WriteUTF8 away; it's broken and useless.
(WriteUTF8 is taking characters, converting them to UTF-8 bytes, converting each single byte to the character it maps to in the current code page, then encoding each of those characters in UTF-8. So in the best case you have a doubly-UTF-8-encoded string; in the worst, you've completely lost bytes that weren't mapped in the system code page repertoire; especially bad for DBCS code pages.)
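To illustrate that chain, a sketch (the exact mangling depends on the machine's default code page; a Western one like Windows-1252 is assumed here):

byte[] utf8Bytes = Encoding.UTF8.GetBytes("αβγ");        // 6 bytes: CE B1 CE B2 CE B3
string mangled = Encoding.Default.GetString(utf8Bytes);  // 6 code-page characters: "Î±Î²Î³"
byte[] doubled = Encoding.UTF8.GetBytes(mangled);        // 12 bytes, doubly UTF-8-encoded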
The problem you're having with Media Monkey may be just that it doesn't support UTF-8 or Unicode filenames at all. Try asking it to play (and export a playlist for) files with characters that don't fit in your system codepage, for example by renaming a file to αβγ.mp3.
Edit:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
OK, what you've got there is a mixture of encodings in the same file: it's no wonder text editors are going to have trouble opening it. The uncommented and #EXTINF lines are in the system default code page, and are present to support media players that can't read Unicode filenames. Any filename characters not present in the system code page (eg. Greek as above, on a Western Windows install) will be mangled and unplayable for anything that doesn't know about the #UTF8 (and #EXTINFUTF8 for the description) lines.
So if this is your target format, you'll need to grab two encodings and use each in turn, something like:
private static void writePlaylistEntry(Stream playlist, string filename, int length)
{
    Encoding utf8 = new UTF8Encoding(false); // UTF-8 without a BOM
    Encoding ansi = Encoding.Default;        // the system default code page

    writeBytes(playlist, utf8.GetBytes("#EXTINFUTF8:" + length + "," + filename + "\n"));
    writeBytes(playlist, ansi.GetBytes("#EXTINF:" + length + "," + filename + "\n"));
    writeBytes(playlist, utf8.GetBytes("#UTF8:" + filename + "\n"));
    writeBytes(playlist, ansi.GetBytes(filename + "\n"));
}

private static void writeBytes(Stream stream, byte[] bytes)
{
    stream.Write(bytes, 0, bytes.Length); // Stream.Write needs offset and count
}
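Hypothetical usage, with the playlist opened as a raw stream (the file name is just an example):

using (var playlist = File.Create(@"C:\Temp\playlist.m3u"))
{
    writePlaylistEntry(playlist, "04-Comptine D'Un Autre Été- L'Après Midi.mp3", 140);
}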

I don't do C#, but the symptoms tell me that you are indeed writing UTF-8, and that whatever output/console/application you're viewing the written output with is not using UTF-8 but ISO-8859-1 to display it, while MediaMonkey is using CP1252 to display it.
If you're viewing them in the IDE console, then you need to configure the IDE to use UTF-8 as console and text file encoding.
Update: you apparently want to write the UTF-8 data out as CP-1252. Now the question/problem is clearer. Again, I don't do C#, but the Java equivalent would be:
Writer writer = new OutputStreamWriter(new FileOutputStream("file.ext"), "CP-1252");
writer.write(someUTF8String); // Will be written as CP-1252. "Ã‰" would become the bytes 0xC3 0x89, i.e. "É" when read as UTF-8
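For the C# side, a rough equivalent of that Java sketch (a sketch only, assuming a file.ext path and a someUTF8String variable as above) would be:

using (var writer = new StreamWriter("file.ext", false, Encoding.GetEncoding(1252)))
{
    writer.Write(someUTF8String); // .NET strings are UTF-16 internally; written out here as CP-1252
}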
Hopefully this gives some insights.

The more fundamental problem is in the name of the method:
private static void WriteUTF8(...)
.M3U files aren't UTF-8. They're Latin-1 (or Windows-1252).
Instead of Encoding.UTF8, you should be using Encoding.GetEncoding(1252). Then you can write directly to the stream; you won't need any of this conversion weirdness.
Update:
I just tried the following C# code and the resulting .M3U opens just fine in both Winamp and WMP:
static void Main(string[] args)
{
    string fileName = @"C:\Temp\Test.m3u";
    using (StreamWriter writer = new StreamWriter(fileName, false,
        Encoding.GetEncoding(1252)))
    {
        writer.WriteLine("#EXTM3U");
        writer.WriteLine("#EXTINF:140,Yann Tiersen " +
            "- Comptine D'Un Autre Été: L'Après Midi");
        writer.WriteLine("04-Comptine D'Un Autre Été- L'Après Midi.mp3");
    }
}
So, as I said - just use the right encoding to begin with. You don't need all those extra #EXTINFUTF8 and #UTF8 lines, unless it's some bizarre requirement for Media Monkey (it's definitely not part of the basic M3U spec).

Right, first off thanks to everyone for their help and patience.
I've finally got it working correctly. I've implemented a version of bobince's solution which is why he gets the acceptance (up-votes to everyone else). Here's my code:
var playList = new StreamWriter(playlist, false, Encoding.Default);
playList.WriteLine("#EXTM3U");
foreach (string track in tracks)
{
    // Read ID3 tags from file
    var info = new FileProperties(track);

    // Write extended info (#EXTINF:<time>,<artist> - <title>)
    if (Encoding.UTF8.GetBytes(info.Artist).Length != info.Artist.Length ||
        Encoding.UTF8.GetBytes(info.Title).Length != info.Title.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#EXTINFUTF8:{0},{1} - {2}",
            info.Duration, info.Artist, info.Title));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(string.Format("#EXTINF:{0},{1} - {2}",
        info.Duration, info.Artist, info.Title));

    // Write the name of the file (removing the drive letter)
    string file = Path.GetFileName(track);
    if (Encoding.UTF8.GetBytes(file).Length != file.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#UTF8:{0}", file));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(file);
}
playList.Close();
As you can see, I assume I'm not going to have to write UTF-8, but when I do, I close the stream and reopen it with UTF-8 encoding. Then, after writing the offending line, I close and reopen it with the default encoding.
Now, I don't know why my previous code gave inconsistent results. Given what everyone (particularly Jon) said, it should have failed all the time, or possibly worked all the time.

Related

Unable to force ANSI encoding (windows-1252)

I'm trying to save two text files in ANSI encoding for later processing by a legacy system. However, when I save them with the correct encoding, they still come out as UTF-8 files.
I've tried the following:
File.WriteAllLines(filePath, lines, Encoding.GetEncoding(1252));
File.WriteAllLines(filePath, lines, Encoding.GetEncoding("windows-1252"));
using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.GetEncoding(1252)))
{
    foreach (string line in lines)
    {
        writer.WriteLine(line);
    }
}
I've also tried converting an existing UTF-8 file to ANSI:
File.WriteAllBytes(fileName, Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1252), File.ReadAllBytes(fileName)));
None of the above solutions has worked; the files are still UTF-8. The only way I managed to make a file save as ANSI was by inserting Swedish characters like åäö, which is a hack I cannot use for one of the files.
I'm at a loss. Has anyone got a solution to this issue?
We're on .NET Framework 4.5, C# 7.3
I did a thorough investigation and found that it works, just not in the way I expected. As @jdweng said: nothing in the data records the encoding, you're just saving bytes. For the most part you're saving plain ASCII characters, so when you open the file in, for instance, Notepad++, it will default to whatever encoding it prefers, unless a special character hints at which encoding to use.
I saved a file in four encodings (default (UTF-8), ANSI, ASCII and UTF-8-BOM), opened each one in a hex editor, and found that in most cases it was the ä in these files that determined which decoder Notepad++ chose.
So if the legacy system uses an ANSI decoder, it should be able to open an "ANSI" encoded file without special characters, despite it showing up as UTF-8 in Notepad++.
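A small demonstration of that point (a sketch: pure-ASCII text produces byte-identical files in Windows-1252 and in UTF-8 without a BOM, so a detector has nothing to go on):

string asciiOnly = "plain old ASCII";
byte[] ansi = Encoding.GetEncoding(1252).GetBytes(asciiOnly);
byte[] utf8 = Encoding.UTF8.GetBytes(asciiOnly);
Console.WriteLine(BitConverter.ToString(ansi)); // same byte dump...
Console.WriteLine(BitConverter.ToString(utf8)); // ...for both encodings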
It definitely works. Try the following program:
using System.IO;
using System.Text;

namespace Demo
{
    static class Program
    {
        static void Main()
        {
            string filePath = @"E:\tmp\test"; // Put your path here.
            string[] lines = { "ÿ" };
            File.WriteAllLines(filePath + ".1.bin", lines, Encoding.GetEncoding(1252));
            File.WriteAllLines(filePath + ".2.bin", lines);
        }
    }
}
Run the program and then inspect the contents of the files in a binary editor.
You will see the following:
test.1.bin contains: FF 0D 0A
test.2.bin contains: C3 BF 0D 0A
(Note: If you drag and drop a ".bin" file into Visual Studio, it will open it in binary mode.)
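If you don't have a hex editor handy, a quick sketch to dump the bytes from inside the same program (add using System; for Console):

foreach (byte b in File.ReadAllBytes(filePath + ".1.bin"))
    Console.Write("{0:X2} ", b); // prints: FF 0D 0A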

Decode Stream to CSV in Python by Byte (Translate from C# code)

I am trying to consume a streamed response in Python from a soap API, and output a CSV file. The response outputs a string coded in base 64, which I do not know what to do with. Also the api documentation says that the response must be read to a destination buffer-by-buffer.
Here is the C# code was provided by the api's documentation:
byte[] buffer = new byte[4000];
bool endOfStream = false;
int bytesRead = 0;
int totalBytes = 0;

using (FileStream localFileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
{
    using (Stream remoteStream = client.DownloadFile(jobId))
    {
        while (!endOfStream)
        {
            bytesRead = remoteStream.Read(buffer, 0, buffer.Length);
            if (bytesRead > 0)
            {
                localFileStream.Write(buffer, 0, bytesRead);
                totalBytes += bytesRead;
            }
            else
            {
                endOfStream = true;
            }
        }
    }
}
I have tried many different things to get this stream into a readable CSV file, but none have worked.
with open('test.csv', 'w') as f: f.write(FileString)
Returns a csv with the base64 string spread over multiple lines
Here is my latest attempt:
with open('csvfile13.csv', 'wb') as csvfile:
    FileString = client.service.DownloadFile(yyy.JobId, False)
    stream = io.BytesIO(str(FileString))
    with open(stream, "rt", 4000) as readstream:
        csvfile.write(readstream)
This produces the error:
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO
Any help would be greatly appreciated, even if it is just to point me in the right direction. I will be ensure to award the points to whoever is the most helpful, even if I do not completely solve the issue!
I have asked several questions similar to this one, but I have yet to find an answer that works completely:
What is the Python equivalent to FileStream in C#?
Write Streamed Response(file-like object) to CSV file Byte by Byte in Python
How to replicate C# 'byte' and 'Write' in Python
Let me know if you need further clarification!
Update:
I have tried print(base64.b64decode(str(FileString)))
This gives me a page full of webdings like
]�P�O�J��Y��KW �
I have also tried
for data in client.service.DownloadFile(yyy.JobId, False):
    print data
But this just loops through the output character by character, like any other string.
I have also managed to get a long string of bytes like \xbc\x97_D\xfb(not actual bytes, just similar format) by decoding the entire string, but I do not know how to make this readable.
Edit: Corrected the output of the sample python, added more example code, formatting
It sounds like you need to use the base64 module to decode the downloaded data.
It might be as simple as:
with open(destinationPath, 'w') as localFile:
    remoteFile = client.service.DownloadFile(yyy.JobId, False)
    remoteData = str(remoteFile).decode('base64')
    localFile.write(remoteData)
I suggest you break the problem down and determine what data you have at each stage. For example what exactly are you getting back from client.service.DownloadFile?
Decoding your sample downloaded data (given in the comments):
'UEsYAItH7brgsgPutAG\AoAYYAYa='.decode('base64')
gives
'PK\x18\x00\x8bG\xed\xba\xe0\xb2\x03\xee\xb4\x01\x80\xa0\x06\x18\x01\x86'
This looks suspiciously like a ZIP file header. I suggest you rename the file .zip and open it as such to investigate.
If remoteData is a ZIP something like the following should extract and write your CSV.
import io
import zipfile

remoteFile = client.service.DownloadFile(yyy.JobId, False)
remoteData = str(remoteFile).decode('base64')

zipStream = io.BytesIO(remoteData)
z = zipfile.ZipFile(zipStream, 'r')
csvData = z.read(z.infolist()[0])
with open(destinationPath, 'w') as localFile:
    localFile.write(csvData)
Note: BASE64 can have some variations regarding padding and alternate character mapping but once you can see the data it should be reasonably clear what you need. Of course carefully read the documentation on your SOAP interface.
Are you sure FileString is a Base64 string? Based on the source code here, suds.sax.text.Text is a subclass of Unicode. You can write this to a file as you would a normal string but whatever you use to read the data from the file may corrupt it unless it's UTF-8-encoded.
You can try writing your Text object to a UTF-8-encoded file using io.open:
import io
with io.open('/path/to/my/file.txt', 'w', encoding='utf_8') as f:
f.write(FileString)
Bear in mind, your console or text editor may have trouble displaying non-ASCII characters but that doesn't mean they're not encoded properly. Another way to inspect them is to open the file back up in the Python interactive shell:
import io

with io.open('/path/to/my/file.txt', 'r', encoding='utf_8') as f:
    next(f)  # displays the representation of the first line of the file as a Unicode object
In Python 3, you can even use the built-in csv to parse the file, however in Python 2, you'll need to pip install backports.csv because the built-in module doesn't work with Unicode objects:
from backports import csv
import io

with io.open('/path/to/my/file.txt', 'r', encoding='utf_8') as f:
    r = csv.reader(f)
    next(r)  # displays the representation of the first line as a list of Unicode objects (one per value)

How to set the BOM for a file being read

I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);
No matter what value I tried instead of "Encoding.UTF8" I got garbled characters displayed in place of the Arabic. The western text was displayed fine.
I thought it might have been an issue with the way the TextBox was defined, but on start up I write some mixed Western/Arabic text to the textbox and this displays fine:
tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";
Then I opened Notepad and copied the above text into it, then saved the file, at which point Notepad's save dialogue asked which encoding to use.
I then presented the saved file to my code and it displayed all the content correctly.
I examined the file and found 3 binary bytes at the beginning (not visible in Notepad).
These 3 bytes, I subsequently found through research, represent the BOM (EF BB BF), and their presence enables the C# File.ReadAllText(fileName.Text, Encoding.UTF8) call to read/display the data as desired.
What puzzles me is that specifying Encoding.UTF8 should take care of this.
The only way I can think of is to code up a step that adds these bytes to a copy of the file, then process that file. But this seems rather long-winded. I'm just wondering if there is a better way to do it, or why Encoding.UTF8 is not yielding the desired result.
Edit:
Still no luck despite trying the suggestion in the answer.
I cut the test data down to just the Arabic text, as follows:
Code as follows:
FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();
Tried with both "true" and "false" on the 2nd line, but both give the same result.
If I open the file in Notepad++, and specify the Arabic ISO-8859-6 Character set it displays fine.
Here is what it looks like in Notepad++ (and what I would like the textbox to display):
Not sure if the issue is in the reading from file, or the writing to the textbox.
I will try inspecting the data post read to see. But at the moment, I'm puzzled.
The StreamReader class has a constructor that will take care of testing for the BOM for you:
using (var stream = new FileStream(fileName.Text, FileMode.Open, FileAccess.Read))
{
    using (var sr = new StreamReader(stream, Encoding.UTF8, true))
    {
        var text = sr.ReadToEnd();
    }
}
The final true parameter is detectEncodingFromByteOrderMark:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
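If you want to confirm which encoding the reader actually settled on, you can inspect it after the first read; a small sketch along those lines:

using (var sr = new StreamReader(fileName.Text, Encoding.UTF8, true))
{
    var text = sr.ReadToEnd();
    Console.WriteLine(sr.CurrentEncoding.EncodingName); // the encoding that was actually detected
}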

Converting UTF-8 to UTF-16BE

I'm wrestling with the following problem.
I am working with Visual Studio 2010 and the .NET Framework 2.0, coding in C#.
I'm making a simple editor which hands its text over to a web service. I know that .NET uses UTF-16 internally (I believe the default is LE, and I want big-endian). I want the result to work in any editor and therefore attach a BOM. The problem is that somewhere along the way it gets changed, I believe to UTF-8? Or at least that is what it seems from the following error:
Client found response content type of 'text/html;
charset=UTF-8', but expected 'text/xml'.
The request failed with an empty response.
EDIT: the documentation warns that the encoding of all the properties is UTF-8 without a BOM; editorTextString is one of the properties. BUT the file content to upload must be UTF-16BE WITH a BOM. I've checked whether .NET automatically translates the encoding, and it does not; or at least the Chinese characters become ?'s. So I need to re-encode (convert, better said) the text to UTF-16BE with a BOM instead of the UTF-8 without a BOM that it is in now.
I've looked through a ton of examples and can't see what I'm doing wrong here. Can someone offer advice or correct the code? (Yes I've also read Jon's really cool article about unicode :)) The theory is clear, the actual practice is lacking.
// Convert to UTF-16 Big Endian
Encoding leUnicode = Encoding.Unicode;
Encoding beUnicode = Encoding.BigEndianUnicode;
byte[] editorTextBytesLE = leUnicode.GetBytes(editorTextString);
Console.WriteLine("Little Endian - Encoded bytes:");
foreach (Byte b in editorTextBytesLE)
{
    Console.Write("[{0}]", b);
}
Console.WriteLine();
byte[] editorTextBytesBE = Encoding.Convert(leUnicode, beUnicode, editorTextBytesLE);
Console.WriteLine("BIG ENDIAN - Encoded bytes:");
foreach (Byte b in editorTextBytesBE)
{
    Console.Write("[{0}]", b);
}
Console.WriteLine();
String decodedString = UnicodeEncoding.BigEndianUnicode.GetString(editorTextBytesBE);
Console.WriteLine();
Console.WriteLine("Decoded bytes:");
Console.WriteLine(decodedString);
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);
File.WriteAllText(fileName, decodedString);
Console.WriteLine("Uploading {0} to {1} ...", fileName, myURL);
// Upload the file to the URL
editorTextBytesBE = myWebClient.UploadFile(myURL, "PUT", fileName);
I haven't been able to find anything on switching to big-endian, but I've seen some examples (which, alas, I couldn't get working) for switching to UTF-8. Would much appreciate any help, examples, or links to get the code to UTF-16BE.
Partial answer:
The following code does not look like it is inserting anything. Instead it overwrites 2 bytes at positions 2 and 3 with your BOM. It skips the first 2.
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);
To have a file with BOM in either of UTF-X encoding simply create TextWriter with correct encoding:
using (var writer = new StreamWriter(fileName, false, new UnicodeEncoding(true, true, true)))
{
    writer.Write(editorTextString);
}
Use the UnicodeEncoding constructor overload whose parameters select big-endian and ask for a BOM.
Side note: there is a good chance that your problem is not related to use of this rare encoding, but it should fix what your code tries to do now.
I managed to get it working with the following code (prepending the BOM rather than copying it over the first two bytes of the text):
byte[] bom = { 0xfe, 0xff };
byte[] textBytes = System.Text.Encoding.BigEndianUnicode.GetBytes(editorTextString);
byte[] editorTextBytesToSend = new byte[bom.Length + textBytes.Length];
bom.CopyTo(editorTextBytesToSend, 0);                // BOM first...
textBytes.CopyTo(editorTextBytesToSend, bom.Length); // ...then the encoded text

c#, Excel + csv: how to get the correct encoding?

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, Excel result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
a more sophisticated one (I tried probably a dozen or more encoding combinations):
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
    tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
    tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.
This solution is written up as a fix for a Java application however you should be able to do something similar in C#. You may also want to look at the documentation on the StreamWriter class, in the remarks it refers to the Byte Order Mark (BOM).
"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The file has a correct UTF-8 header (BOM) but only contains ANSI data (i.e., an é is not encoded the correct UTF-8 way, which would take two bytes).
Or it is the other way around: the file is ANSI (no BOM at the start), but the encoding of the individual characters is, or looks like, UTF-8. This would explain the ü and other characters expanding into more than one character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
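For what it's worth, here's a sketch of undoing one round of that kind of mis-decoding (this assumes the file's UTF-8 bytes were wrongly decoded as Windows-1252, which is what turns "ü" into "Ã¼"):

string mojibake = "DÃ¼mme MÃ¤ssÃ¶ng";
byte[] raw = Encoding.GetEncoding(1252).GetBytes(mojibake); // recover the original UTF-8 bytes
string repaired = Encoding.UTF8.GetString(raw);             // "Dümme Mässöng"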
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).
Trevor Germain's answer helped me save in the correctly encoded format:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
I'd suggest you open the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream - but the rest of the writing doesn't specify an encoding to use, so it would use the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix-up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
In the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.
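As a rough heuristic in code (a hypothetical helper, not a general-purpose detector):

static bool LooksLikeUtf16(byte[] fileBytes)
{
    int zeroCount = 0;
    foreach (byte b in fileBytes)
        if (b == 0x00) zeroCount++;
    return zeroCount > fileBytes.Length / 4; // many NUL bytes => probably UTF-16
}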
For my scenario using StreamWriter I found explicitly passing UTF8 encoding to the StreamWriter enabled excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048
