Converting UTF-8 to UTF-16BE

I'm wrestling with the following problem.
I am working with Visual Studio 2010 and the .NET Framework 2.0, coding in C#.
I'm making a simple editor which hands its text over to a web service. I know that .NET uses UTF-16 (I believe the default is little endian, and I want big endian). I want the output to open in any editor, so I want to attach a BOM. The problem is that somewhere on its way through HTML the text seems to get changed to UTF-8. At least that is what the following error suggests:
Client found response content type of 'text/html;
charset=UTF-8', but expected 'text/xml'.
The request failed with an empty response.
EDIT: The documentation warns that all the properties are encoded as UTF-8 without a BOM. editorTextString is one of those properties. But the file content to upload must be UTF-16BE with a BOM. I've checked whether .NET translates the encoding automatically, and it does not; at least, the Chinese characters become ?'s. So I need to convert the text from the UTF-8 without BOM that it is in now to UTF-16BE with a BOM.
I've looked through a ton of examples and can't see what I'm doing wrong here. Can someone offer advice or correct the code? (Yes, I've also read Jon's really cool article about Unicode :)) The theory is clear; the actual practice is lacking.
// Convert to UTF-16 Big Endian
Encoding leUnicode = Encoding.Unicode;
Encoding beUnicode = Encoding.BigEndianUnicode;
byte[] editorTextBytesLE = leUnicode.GetBytes(editorTextString);
Console.WriteLine("Little Endian - Encoded bytes:");
foreach (Byte b in editorTextBytesLE)
{
Console.Write("[{0}]", b);
}
Console.WriteLine();
byte[] editorTextBytesBE = Encoding.Convert(leUnicode, beUnicode, editorTextBytesLE);
Console.WriteLine("BIG ENDIAN - Encoded bytes:");
foreach (Byte b in editorTextBytesBE)
{
Console.Write("[{0}]", b);
}
Console.WriteLine();
String decodedString = UnicodeEncoding.BigEndianUnicode.GetString(editorTextBytesBE);
Console.WriteLine();
Console.WriteLine("Decoded bytes:");
Console.WriteLine(decodedString);
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);
File.WriteAllText(fileName, decodedString);
Console.WriteLine("Uploading {0} to {1} ...", fileName, myURL);
// Upload the file to the URL
editorTextBytesBE = myWebClient.UploadFile(myURL, "PUT", fileName);
I haven't been able to find anything to switch to big endian, but I've seen some examples (which I couldn't get working alas) to switch TO UTF-8. Would much appreciate any help, examples, or links to get the code to UTF-16BE.

Partial answer:
The following code does not look like it is inserting anything. Instead it overwrites 2 bytes at positions 2 and 3 with your BOM. It skips the first 2.
// inserting UTF-16BE BOM marker, which eases recognition for any editor
byte[] editorTextBytesToSend = { 0xfe, 0xff };
editorTextBytesToSend.CopyTo(editorTextBytesBE, 2);

To write a file with a BOM in any of the UTF encodings, simply create a TextWriter (here a StreamWriter) with the right encoding:
using (var writer =
    new StreamWriter(fileName, false, new UnicodeEncoding(true, true, true)))
{
    writer.Write(editorTextString);
}
Use the UnicodeEncoding constructor overload that emits a BOM. Its arguments are bigEndian, byteOrderMark and throwOnInvalidBytes.
Side note: there is a good chance that your problem is not related to use of this rare encoding, but it should fix what your code tries to do now.

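I got it working with the following code. Note that GetBytes does not emit a BOM, so the BOM has to go into a larger buffer in front of the text rather than be copied over the first two bytes of it:
byte[] bomBytes = { 0xfe, 0xff };
byte[] textBytes = System.Text.Encoding.BigEndianUnicode.GetBytes(editorTextString);
byte[] editorTextBytesToSend = new byte[bomBytes.Length + textBytes.Length];
bomBytes.CopyTo(editorTextBytesToSend, 0);
textBytes.CopyTo(editorTextBytesToSend, bomBytes.Length);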
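The conversion this thread is about can also be checked byte by byte outside .NET. The following is only a sketch in Python, not the poster's code; the function name utf8_to_utf16be_with_bom is made up for illustration. It decodes UTF-8 bytes to text, re-encodes the text as UTF-16BE, and puts the BOM in front of the result instead of overwriting any existing bytes.

```python
import codecs


def utf8_to_utf16be_with_bom(utf8_bytes: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16BE and prepend the FE FF byte order mark."""
    text = utf8_bytes.decode("utf-8")      # bytes -> text
    body = text.encode("utf-16-be")        # text -> big-endian UTF-16, no BOM
    return codecs.BOM_UTF16_BE + body      # BOM goes in front, nothing is overwritten


sample = "A中".encode("utf-8")
print(utf8_to_utf16be_with_bom(sample).hex())
```

In C#, Encoding.BigEndianUnicode.GetPreamble() returns the same two BOM bytes, so they do not have to be typed by hand.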

Related

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
more sophisticated one (tried probably a dozen or more encoding combinations:)
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;

This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)

Try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
It writes the proper UTF-8 preamble to the file before writing the UTF-8 encoded CSV.
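The same idea, preamble first and then the UTF-8 encoded rows, can be tried quickly outside C#. This is a sketch in Python under stated assumptions: the function name write_csv_with_bom and the sample rows are invented for illustration, and the "utf-8-sig" codec is what emits the EF BB BF preamble.

```python
import os
import tempfile


def write_csv_with_bom(path, rows):
    """Write semicolon-separated rows as UTF-8 with a leading EF BB BF preamble."""
    # "utf-8-sig" makes Python emit the UTF-8 BOM once, at the start of the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        for row in rows:
            f.write(";".join(row) + "\r\n")


path = os.path.join(tempfile.mkdtemp(), "UTF8.csv")
write_csv_with_bom(path, [["Zeit in Minuten", "Messwert(µm / m)"], ["0", "-703"]])
with open(path, "rb") as f:
    print(f.read()[:3].hex())
```

The first three bytes printed are the preamble that Excel looks for when deciding whether to read the file as UTF-8.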

This solution is written up as a fix for a Java application; however, you should be able to do something similar in C#. You may also want to look at the documentation for the StreamWriter class; its remarks refer to the byte order mark (BOM).

"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The encoding is UTF-8 (i.e., correct Unicode header), but the file only contains ANSI data (i.e., é is not encoded the correct UTF-8 way, which would take two bytes).
Or it is the other way around: the file is ANSI (no BOM at the start), but the encoding of the individual characters is, or looks like, UTF-8. This would explain the ü and other characters each expanding into more than one character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fix it at the source.
Edit
Now that we've seen your code: can you construct the StreamWriter with an explicit encoding? (StreamWriter is the concrete TextWriter for streams, and its constructors accept an Encoding.) Also, remove the hand-written BOM; it is not necessary. Don't use ASCII; try UTF8.

Trevor Germain's answer helped me save the file in the correct encoding:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}

I'd suggest you open up the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream, but the rest of the writing doesn't specify an encoding to use, so it uses the default encoding of the StreamWriter, which is UTF-8. There appears to be a mix-up of encodings.
When you open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
In the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.
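The hex-editor check can be approximated in a few lines. This is a rough sketch in Python, not a reliable detector: guess_encoding and its thresholds are assumptions made for illustration. It also shows two facts worth knowing here: the character U+FEFF written through a UTF-8 writer comes out as EF BB BF, and UTF-8 bytes misread as Windows-1252 produce exactly the garbling shown in the question.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Rough guess from the BOM, falling back to the 0x00 heuristic."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8 (with BOM)"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    # No BOM: many zero bytes suggest UTF-16, none suggest a byte-oriented encoding.
    return "utf-16 (no BOM)" if data.count(0) > len(data) // 4 else "utf-8 or ANSI"


# U+FEFF sent through a UTF-8 encoder becomes EF BB BF, not FE FF.
print("\ufeff".encode("utf-8").hex())
# UTF-8 bytes misread as Windows-1252 reproduce the garbled text.
print("Dümme Mässöng".encode("utf-8").decode("cp1252"))
print(guess_encoding("\ufeffDümme".encode("utf-8")))
```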

For my scenario using StreamWriter, I found that explicitly passing the UTF8 encoding to the StreamWriter enabled Excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

Decode Stream to CSV in Python by Byte (Translate from C# code)

I am trying to consume a streamed response from a SOAP API in Python and output a CSV file. The response is a base64-encoded string, which I do not know what to do with. The API documentation also says that the response must be read to a destination buffer by buffer.
Here is the C# code provided by the API's documentation:
byte[] buffer = new byte[4000];
bool endOfStream = false;
int bytesRead = 0;
using (FileStream localFileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
{
    using (Stream remoteStream = client.DownloadFile(jobId))
    {
        while (!endOfStream)
        {
            bytesRead = remoteStream.Read(buffer, 0, buffer.Length);
            if (bytesRead > 0)
            {
                localFileStream.Write(buffer, 0, bytesRead);
                totalBytes += bytesRead;
            }
            else
            {
                endOfStream = true;
            }
        }
    }
}
I have tried many different things to get this stream into a readable CSV file, but none have worked.
with open('test.csv', 'w') as f: f.write(FileString)
This returns a CSV with the base64 string spread over multiple lines.
Here is my latest attempt:
with open('csvfile13.csv', 'wb') as csvfile:
    FileString = client.service.DownloadFile(yyy.JobId, False)
    stream = io.BytesIO(str(FileString))
    with open(stream,"rt",4000) as readstream:
        csvfile.write(readstream)
This produces the error:
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO
Any help would be greatly appreciated, even if it is just to point me in the right direction. I will be sure to award the points to whoever is the most helpful, even if I do not completely solve the issue!
I have asked several questions similar to this one, but I have yet to find an answer that works completely:
What is the Python equivalent to FileStream in C#?
Write Streamed Response(file-like object) to CSV file Byte by Byte in Python
How to replicate C# 'byte' and 'Write' in Python
Let me know if you need further clarification!
Update:
I have tried print(base64.b64decode(str(FileString)))
This gives me a page full of webdings like
]�P�O�J��Y��KW �
I have also tried
for data in client.service.DownloadFile(yyy.JobId, False):
    print data
But this just loops through the output character by character, like any other string.
I have also managed to get a long string of bytes like \xbc\x97_D\xfb (not the actual bytes, just a similar format) by decoding the entire string, but I do not know how to make this readable.
Edit: Corrected the output of the sample python, added more example code, formatting

It sounds like you need to use the base64 module to decode the downloaded data.
It might be as simple as:
with open(destinationPath, 'wb') as localFile:  # binary mode: decoded data is bytes
    remoteFile = client.service.DownloadFile(yyy.JobId, False)
    remoteData = str(remoteFile).decode('base64')
    localFile.write(remoteData)
I suggest you break the problem down and determine what data you have at each stage. For example what exactly are you getting back from client.service.DownloadFile?
Decoding your sample downloaded data (given in the comments):
'UEsYAItH7brgsgPutAG\AoAYYAYa='.decode('base64')
gives
'PK\x18\x00\x8bG\xed\xba\xe0\xb2\x03\xee\xb4\x01\x80\xa0\x06\x18\x01\x86'
This looks suspiciously like a ZIP file header. I suggest you rename the file .zip and open it as such to investigate.
If remoteData is a ZIP something like the following should extract and write your CSV.
import io
import zipfile
remoteFile = client.service.DownloadFile(yyy.JobId, False)
remoteData = str(remoteFile).decode('base64')
zipStream = io.BytesIO(remoteData)
z = zipfile.ZipFile(zipStream, 'r')
csvData = z.read(z.infolist()[0])  # contents of the first archive member
with open(destinationPath, 'wb') as localFile:  # binary mode keeps the bytes unchanged
    localFile.write(csvData)
Note: BASE64 can have some variations regarding padding and alternate character mapping but once you can see the data it should be reasonably clear what you need. Of course carefully read the documentation on your SOAP interface.
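For readers on Python 3, where str.decode('base64') no longer exists, the same steps can be sketched with the base64 module. This is an illustration under assumptions, not the API's documented usage: decode_download is a made-up name, and the payload is a small ZIP built in memory to stand in for whatever client.service.DownloadFile returns.

```python
import base64
import io
import zipfile


def decode_download(b64_text: str) -> bytes:
    """Decode a base64 payload; if it is a ZIP archive, return its first member."""
    raw = base64.b64decode(b64_text)
    if zipfile.is_zipfile(io.BytesIO(raw)):
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            return archive.read(archive.infolist()[0])
    return raw


# Build a tiny in-memory ZIP so the sketch runs without the SOAP service.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.csv", "id;value\r\n1;42\r\n")
payload = base64.b64encode(buffer.getvalue()).decode("ascii")

print(decode_download(payload).decode("utf-8"))
```

When saving the result, open the destination with mode 'wb' so the bytes are written exactly as decoded.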

Are you sure FileString is a Base64 string? Based on the source code here, suds.sax.text.Text is a subclass of Unicode. You can write this to a file as you would a normal string but whatever you use to read the data from the file may corrupt it unless it's UTF-8-encoded.
You can try writing your Text object to a UTF-8-encoded file using io.open:
import io
with io.open('/path/to/my/file.txt', 'w', encoding='utf_8') as f:
    f.write(FileString)
Bear in mind, your console or text editor may have trouble displaying non-ASCII characters but that doesn't mean they're not encoded properly. Another way to inspect them is to open the file back up in the Python interactive shell:
import io
with io.open('/path/to/my/file.txt', 'r', encoding='utf_8') as f:
    next(f) # displays the representation of the first line of the file as a Unicode object
In Python 3, you can even use the built-in csv module to parse the file. In Python 2, however, you'll need to pip install backports.csv because the built-in module doesn't work with Unicode objects:
from backports import csv
import io
with io.open('/path/to/my/file.txt', 'r', encoding='utf_8') as f:
    r = csv.reader(f)
    next(r) # displays the representation of the first line of the file as a list of Unicode objects (each value separated)

get the video file after convert from base64 and UTF8 in c#

What I need to do is send a file from Java to C#. Java acts as the client, while C# acts as the server.
The file is loaded in Java through a FileInputStream, converted to UTF-8 and then to base64. See the code.
FileInputStream fin=new FileInputStream(fileName);
byte[] content = new byte[fin.available()];
fin.read(content, 0, content.length);
String asString = new String(content, "UTF8");
byte[] newBytes = asString.getBytes("UTF8");
String base64 = Base64.encodeToString(newBytes, Base64.DEFAULT);
The server (written in C#) reads the data sent and should convert it back into a file. I decode base64, then UTF-8, and I am not sure what to do last. What I'm trying to send is video.mp4, 144 KB or less in size. So far, the output shows the catch message "Wrong format". See the code.
try
{
    for (int i = 0; i <= _server.Q.NoOfItem - 1; i++)
    {
        words = _server.Q.ElementAtBuffer(i).ToString();
        //textBox1.Text = words;
        byte[] encodedDataAsBytes = System.Convert.FromBase64String(words);
        string returnValue = System.Text.Encoding.UTF8.GetString(encodedDataAsBytes);
        textBox1.Text = returnValue;
    }
}
catch (ArgumentNullException argNull)
{
    textBox1.Text = "Received null value";
}
catch (FormatException FrmtEx)
{
    textBox1.Text = "Wrong format";
}
You can ignore the for (int i = 0; i <= _server.Q.NoOfItem - 1; i++) loop, as this is the way I want to capture/retrieve the data sent.
P.S.: It works when I just pass a string without loading the file (string >> UTF-8 >> base64) and receive it (base64 >> UTF-8 >> string).

The file is loaded in Java through a FileInputStream, converted to UTF-8
Then you've lost data. Video data is not text data, so don't load it as text data. Treat it as binary data - by all means encode it to base64 if you need to represent it as a string somewhere but don't perform any text decoding on it, as that's only meant for encoded text data, which this isn't.
It's really important to understand what's wrong here. The only thing the two lines below can do is lose data. If they don't lose data, they serve no purpose - and if they do lose data, they're clearly a bad idea:
String asString = new String(content, "UTF8");
byte[] newBytes = asString.getBytes("UTF8");
You should analyze how you ended up with this code in the first place... why did you feel the need to convert the byte array to a string and back?
jowierun's answer is also correct - you shouldn't be using available() at all. You might want to use utility methods from Guava, such as Files.toByteArray if you definitely need to read the whole file into memory in one go.
P.S.: It works when I just pass a string without loading the file (string >> UTF-8 >> base64) and receive it (base64 >> UTF-8 >> string).
Well yes - if you start with text data, then that's fine - UTF-8 can represent every valid string, and base64 is lossless, so you're fine. (Admittedly you could break it by presenting an invalid string with half of a surrogate pair, but...) The problem is at the point where you treat non-text data as text in the first place.
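The data loss is easy to demonstrate on a handful of bytes. This is a small sketch in Python rather than Java; the byte values are arbitrary sample data, and errors="replace" is used to mirror how Java's String constructor substitutes U+FFFD for malformed input.

```python
import base64

# Arbitrary binary data; 0xFF and a trailing 0xC3 are not valid UTF-8.
binary = bytes([0x00, 0x00, 0x00, 0x18, 0x66, 0x74, 0x79, 0x70, 0xFF, 0xC3])

# Decoding arbitrary bytes as UTF-8 replaces invalid sequences with U+FFFD,
# so encoding the string again does not give back the original bytes.
as_text = binary.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")
print(round_tripped == binary)

# Base64 applied directly to the bytes is lossless in both directions.
encoded = base64.b64encode(binary).decode("ascii")
print(base64.b64decode(encoded) == binary)
```

The first comparison prints False and the second prints True, which is the whole argument in two lines: skip the text step and base64 the raw bytes.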

You shouldn't use fin.available() to assume you can read the file in one go. That is likely to work only for small files. Instead, you need to read in a loop and collect all the contents before you encode them.
It would make sense (on the Java side at least) to have a decode routine that you can use to test that your encode is working (a unit test, perhaps?). You will probably find that the test fails consistently with the problem you are getting.
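Both suggestions, reading in a loop and testing the encode with a matching decode, fit in a short sketch. It is written in Python for brevity, and the names read_all and encode_file, the chunk size, and the temporary file are all assumptions made for illustration.

```python
import base64
import os
import tempfile


def read_all(path, chunk_size=4096):
    """Read a whole file by looping until read() reports end of file."""
    collected = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty result means end of file
                break
            collected.extend(chunk)
    return bytes(collected)


def encode_file(path):
    """Base64-encode the file's raw bytes with no text decoding in between."""
    return base64.b64encode(read_all(path)).decode("ascii")


# Round-trip check: decoding must give back exactly what was written.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
original = os.urandom(10000)
with open(path, "wb") as f:
    f.write(original)
print(base64.b64decode(encode_file(path)) == original)
```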

"É" not getting converted to two bytes correctly

Further to this question I've got a supplementary problem.
I've found a track with an "É" in the title.
My code:
var playList = new StreamWriter(playlist, false, Encoding.UTF8);
-
private static void WriteUTF8(StreamWriter playList, string output)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(output);
    foreach (byte b in byteArray)
    {
        playList.Write(Convert.ToChar(b));
    }
}
converts this to the following bytes:
195
137
which is being output as Ã followed by a square (a character that can't be printed in the current font).
I've exported the same file to a playlist in Media Monkey and it writes the "É" as "Ã‰" - which I'm assuming is correct (as KennyTM pointed out).
My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one?
UPDATE
People seem to be missing the point.
I can get the "É" written to the file using
playList.WriteLine("É");
that's not the problem.
The problem is that Media Monkey requires the file to be in the following format:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Ã‰tÃ©: L'AprÃ¨s Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Ã‰tÃ©- L'AprÃ¨s Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
Where all the "high-ascii" (for want of a better term) are written out as a pair of characters.
UPDATE 2
I should be getting c9 replaced by c3 89.
I was going to put what I'm actually getting, but in doing the tests for this I've managed to get a test program to output the text in the right format "as is". So I need to do some more investigation.

Using Convert.ToChar like that is almost certainly a bad idea. You're basically encoding things twice.
You should either be performing the conversion yourself and then writing directly to a stream, or you should be letting the StreamWriter do the conversion. Why are you using a StreamWriter at all if you're trying to perform the conversions yourself?
Are you trying to write to a binary file, or a simple text file? If it's a simple text file, just use a StreamWriter and let that do the conversion. If it's a binary file, use a Stream instead of a StreamWriter, and perform text encoding directly where you need to, writing the bytes straight to the stream afterwards.
EDIT: Here's what's happening with your original code:
Encoding.UTF8.GetBytes(text) => byte[] { 0xc3, 0x89 };
Convert.ToChar(0xc3) => char U+00C3
StreamWriter writes U+00C3 as byte[] { 0xc3, 0x83 };
Convert.ToChar(0x89) => char U+0089
StreamWriter writes U+0089 as byte[] { 0xc2, 0x89 };
So that's why you're getting c3 83 c2 89 written to the file.
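The double encoding can be reproduced in a few lines. This is a sketch in Python, not the original C#; double_encode is a made-up name, and decoding as Latin-1 stands in for Convert.ToChar(byte), since both map a byte value to the code point with the same number.

```python
def double_encode(text: str) -> bytes:
    """Reproduce the bug: treat each UTF-8 byte as a character, then encode again."""
    utf8_bytes = text.encode("utf-8")
    # Each byte becomes the code point with the same numeric value.
    as_chars = utf8_bytes.decode("latin-1")
    return as_chars.encode("utf-8")


print("É".encode("utf-8").hex())   # c389
print(double_encode("É").hex())    # c383c289
```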

StreamWriter already converts the characters you send it to UTF-8 — that's its entire purpose. Throw WriteUTF8 away; it's broken and useless.
(WriteUTF8 is taking characters, converting them to UTF-8 bytes, converting each single byte to the character it maps to in the current code page, then encoding each of those characters in UTF-8. So in the best case you have a doubly-UTF-8-encoded string; in the worst, you've completely lost bytes that weren't mapped in the system code page repertoire; especially bad for DBCS code pages.)
The problem you're having with Media Monkey may be just that it doesn't support UTF-8 or Unicode filenames at all. Try asking it to play (and export a playlist for) files with characters that don't fit in your system codepage, for example by renaming a file to αβγ.mp3.
Edit:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Ã‰tÃ©: L'AprÃ¨s Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Ã‰tÃ©- L'AprÃ¨s Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
OK, what you've got there is a mixture of encodings in the same file: it's no wonder text editors are going to have trouble opening it. The uncommented and #EXTINF lines are in the system default code page, and are present to support media players that can't read Unicode filenames. Any filename characters not present in the system code page (eg. Greek as above, on a Western Windows install) will be mangled and unplayable for anything that doesn't know about the #UTF8 (and #EXTINFUTF8 for the description) lines.
So if this is your target format, you'll need to grab two encodings and use each in turn, something like:
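private static void writePlaylistEntry(Stream playlist, string filename, int length) {
    Encoding utf8 = new UTF8Encoding(false);
    Encoding ansi = Encoding.Default;
    writeLine(playlist, utf8, "#EXTINFUTF8:" + length + "," + filename);
    writeLine(playlist, ansi, "#EXTINF:" + length + "," + filename);
    writeLine(playlist, utf8, "#UTF8:" + filename);
    writeLine(playlist, ansi, filename);
}
// Helper, because Stream.Write takes (buffer, offset, count) rather than a bare array.
private static void writeLine(Stream stream, Encoding encoding, string line) {
    byte[] bytes = encoding.GetBytes(line + "\n");
    stream.Write(bytes, 0, bytes.Length);
}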
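To see what such a mixed-encoding entry looks like on disk, here is a sketch of the same idea in Python. It rests on assumptions: write_playlist_entry is an illustrative name, "cp1252" stands in for the system default code page, and the file name is sample data.

```python
import io


def write_playlist_entry(stream, filename, length):
    """Write one entry with UTF-8 lines and code-page lines interleaved."""
    ansi = "cp1252"   # assumption: stands in for the system default code page
    lines = [
        ("utf-8", "#EXTINFUTF8:%d,%s\n" % (length, filename)),
        (ansi, "#EXTINF:%d,%s\n" % (length, filename)),
        ("utf-8", "#UTF8:%s\n" % filename),
        (ansi, "%s\n" % filename),
    ]
    for encoding, line in lines:
        # errors="replace" turns characters missing from the code page into "?".
        stream.write(line.encode(encoding, errors="replace"))


out = io.BytesIO()
write_playlist_entry(out, "Été.mp3", 140)
print(out.getvalue())
```

Reading the whole result back as Windows-1252 shows the UTF-8 lines as "Ã‰tÃ©.mp3" and the code-page lines as "Été.mp3", which matches the format Media Monkey exports.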

I don't do C# but the symptoms tell me that you're indeed writing it as UTF-8, but that the output/console/application/whatever with which you're viewing the written output is not using UTF-8, but ISO-8859-1 to display them and that MediaMonkey is using CP1252 to display them.
If you're viewing them in the IDE console, then you need to configure the IDE to use UTF-8 as console and text file encoding.
Update: you apparently want to write UTF-8 data as CP-1252. Now the question/problem is more clear. Again, I don't do C#, but the Java equivalent would be:
Writer writer = new OutputStreamWriter(new FileOutputStream("file.ext"), "windows-1252");
writer.write(someString); // Will be written as CP-1252: "É" becomes the single byte 0xC9
Hopefully this gives some insights.

The more fundamental problem is in the name of the method:
private static void WriteUTF8(...)
.M3U files aren't UTF-8. They're Latin-1 (or Windows-1252).
Instead of Encoding.UTF8, you should be using Encoding.GetEncoding(1252). Then you can just write directly to the stream, you won't need any of this conversion weirdness.
Update:
I just tried the following C# code and the resulting .M3U opens just fine in both Winamp and WMP:
static void Main(string[] args)
{
    string fileName = @"C:\Temp\Test.m3u";
    using (StreamWriter writer = new StreamWriter(fileName, false,
        Encoding.GetEncoding(1252)))
    {
        writer.WriteLine("#EXTM3U");
        writer.WriteLine("#EXTINF:140,Yann Tiersen " +
            "- Comptine D'Un Autre Été: L'Après Midi");
        writer.WriteLine("04-Comptine D'Un Autre Été- L'Après Midi.mp3");
    }
}
So, as I said - just use the right encoding to begin with. You don't need all those extra #EXTINFUTF8 and #UTF8 lines, unless it's some bizarre requirement for Media Monkey (it's definitely not part of the basic M3U spec).

Right, first off thanks to everyone for their help and patience.
I've finally got it working correctly. I've implemented a version of bobince's solution which is why he gets the acceptance (up-votes to everyone else). Here's my code:
var playList = new StreamWriter(playlist, false, Encoding.Default);
playList.WriteLine("#EXTM3U");
foreach (string track in tracks)
{
    // Read ID3 tags from file
    var info = new FileProperties(track);
    // Write extended info (#EXTINF:<time>,<artist> - <title>)
    if (Encoding.UTF8.GetBytes(info.Artist).Length != info.Artist.Length ||
        Encoding.UTF8.GetBytes(info.Title).Length != info.Title.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#EXTINFUTF8:{0},{1} - {2}",
            info.Duration, info.Artist, info.Title));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(string.Format("#EXTINF:{0},{1} - {2}",
        info.Duration, info.Artist, info.Title));
    // Write the name of the file (removing the drive letter)
    string file = Path.GetFileName(track);
    if (Encoding.UTF8.GetBytes(file).Length != file.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#UTF8:{0}", file));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(file);
}
playList.Close();
As you can see I assume I'm not going to have to write UTF8, but when I do I close the stream and reopen it with UTF8 encoding. I then, after writing the offending line, close and reopen it with the default encoding.
Now I don't know why my previous code gave inconsistent results. Given what everyone (particularly Jon) said it should have failed all the time, or possibly worked all of the time.
