Unable to force ANSI encoding (windows-1252) in C#

I'm trying to save two text files in ANSI encoding for later processing by a legacy system. However, when I save them with the correct encoding specified, they still come out as UTF-8 files.
I've tried the following:
File.WriteAllLines(filePath, lines, Encoding.GetEncoding(1252));
File.WriteAllLines(filePath, lines, Encoding.GetEncoding("windows-1252"));
using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.GetEncoding(1252)))
{
    foreach (string line in lines)
    {
        writer.WriteLine(line);
    }
}
I've also tried converting an existing UTF-8 file to ANSI:
File.WriteAllBytes(fileName, Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1252), File.ReadAllBytes(fileName)));
None of the above solutions have worked; the files are still UTF-8. The only way I managed to make a file save as ANSI was by inserting Swedish characters like åäö, which is a hack I cannot use for one of the files.
I'm at a loss. Has anyone got a solution to this issue?
We're on .NET Framework 4.5, C# 7.3

I did a thorough investigation and found that it works, just not in the way I expected. As @jdweng said: nothing in the data identifies the encoding; you're just saving bytes. For the most part you're saving plain ASCII characters, so when you open the file in, for instance, Notepad++, it defaults to whatever encoding it prefers, unless a special character hints at which encoding to use.
I encoded a file in four encodings (default (UTF-8), ANSI, ASCII and UTF-8-BOM), opened all the files in a hex editor, and found that in most cases it was the ä in these files that determined which decoder Notepad++ picked.
So if the legacy system uses an ANSI decoder, it should be able to open an "ANSI"-encoded file without special characters, even though it shows up as UTF-8 in Notepad++.
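To see this in code, here is a minimal sketch (the class name and sample string are mine) showing that ASCII-only text produces byte-for-byte identical output under windows-1252 and BOM-less UTF-8:
using System;
using System.Linq;
using System.Text;

class EncodingBytesDemo
{
    static void Main()
    {
        string line = "plain ASCII text"; // nothing above U+007F

        byte[] ansi = Encoding.GetEncoding(1252).GetBytes(line);
        byte[] utf8 = new UTF8Encoding(false).GetBytes(line); // UTF-8, no BOM

        // Prints True: for pure ASCII the two encodings are byte-identical,
        // so an editor has nothing left to distinguish them by.
        Console.WriteLine(ansi.SequenceEqual(utf8));
    }
}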

It definitely works. Try the following program:
using System.IO;
using System.Text;

namespace Demo
{
    static class Program
    {
        static void Main()
        {
            string filePath = @"E:\tmp\test"; // Put your path here.
            string[] lines = { "ÿ" };
            File.WriteAllLines(filePath + ".1.bin", lines, Encoding.GetEncoding(1252));
            File.WriteAllLines(filePath + ".2.bin", lines);
        }
    }
}
Run the program and then inspect the contents of the files in a binary editor.
You will see the following:
test.1.bin contains: FF 0D 0A
test.2.bin contains: C3 BF 0D 0A
(Note: If you drag and drop a ".bin" file into Visual Studio, it will open it in binary mode.)

Related

C# - .NET Core: Write an ANSI-encoded text file

I have to create an ANSI-encoded txt file, because the system into which I have to load it only reads ANSI and does not read UTF-8. I tried to follow various threads but nothing worked; my code is now this:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
StreamWriter sw = new StreamWriter(fileName, false, Encoding.GetEncoding(1252));
sw.Write(sb.ToString());
sw.Close();
sb.Clear();
When I open the generated file in Notepad++, I see that it is in UTF-8. Can someone help me?
Thanks
Notepad++ guesses the encoding based on the file's contents, since text files don't record which encoding was used (unless they start with a BOM).
If you write using code page 1252, then that's what's used.
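One way to convince yourself of this is to dump the written bytes; a small sketch (the file name and sample text are mine):
using System;
using System.IO;
using System.Text;

class Verify1252
{
    static void Main()
    {
        // Required on .NET Core before code page 1252 becomes available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        File.WriteAllText("ansi.txt", "é", Encoding.GetEncoding(1252));

        // Prints "E9": é is a single byte in windows-1252.
        // UTF-8 would have produced the two bytes C3 A9.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("ansi.txt")));
    }
}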

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

(Duplicate of "c#, Excel + csv: how to get the correct encoding?" - see that question below.)

Read UNIX encoded file with C#

I have a C# program we use to replace some values with others, to be used afterwards as parameters: 'NAME1' is replaced with &1, 'NAME2' with &2, and so on.
The problem is that the data to modify is in a text file with a UNIX encoding, and special characters like í get read as a square (invalid char), even in memory. Due to specifications that are out of my control, the file can't be changed, and I have no choice but to read it as it is.
I have tried reading it with most of the ~130 encodings C# offers, using:
EncodingInfo[] info = System.Text.Encoding.GetEncodings();
string text;
for (int a = 0; a < info.Length; ++a)
{
    text = File.ReadAllText(fn, info[a].GetEncoding());
    File.WriteAllText(fn + a, text, info[a].GetEncoding());
}
fn is the file path to read. I have checked all the generated files (about 130); none of them writes the í properly, so I'm out of ideas and unable to find anything on the internet.
SOLUTION:
Looks like the following code finally got the text read properly; I also had to use the same encoding for the writing part:
System.Text.Encoding encoding = System.Text.Encoding.GetEncodings()[41].GetEncoding(); // Central European (Windows)
String text = File.ReadAllText(fn, encoding); // get file text
// DO ALL THE STUFF I HAD TO
File.WriteAllText(fn, text, encoding);
/* ALL THESE ENCODINGS APPARENTLY WORKED FOR ME WITH ALL THE WEIRD CHARS I WAS ABLE TO WRITE :P
System.Text.Encoding.GetEncodings()[115].GetEncoding(); // Latin 9 (ISO)
System.Text.Encoding.GetEncodings()[108].GetEncoding(); // Baltic (ISO)
System.Text.Encoding.GetEncodings()[107].GetEncoding(); // Latin 3 (ISO)
System.Text.Encoding.GetEncodings()[106].GetEncoding(); // Central European (ISO)
System.Text.Encoding.GetEncodings()[105].GetEncoding(); // Western European (ISO)
System.Text.Encoding.GetEncodings()[49].GetEncoding();  // Vietnamese (Windows)
System.Text.Encoding.GetEncodings()[45].GetEncoding();  // Turkish (Windows)
System.Text.Encoding.GetEncodings()[41].GetEncoding();  // Central European (Windows) <-- Used this one
*/
Thank you very much for your help
Noman(1)
You have to determine the file's proper encoding first. Try "file -i": it will output MIME-type information for the file, which also includes the character-set encoding. I found a man page for it, too :)
Or try enca: it can guess and even convert between encodings. Just look at the man page.
Once you have the proper encoding, look for a way to apply it to your file reading; one way is sketched below.
Quoted from: How to find encoding of a file in Unix via script(s)
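Once the charset is known, it can be applied directly when reading; a sketch (the path and encoding name are placeholders, use whatever file -i reported). Requesting the encoding by name or code page is also more robust than indexing into GetEncodings(), whose order isn't guaranteed across machines:
using System;
using System.IO;
using System.Text;

class ReadWithKnownEncoding
{
    static void Main()
    {
        string path = "data.txt"; // placeholder

        // Ask for the encoding by name rather than by array index.
        Encoding enc = Encoding.GetEncoding("iso-8859-1");

        // Optional sanity check: a strict UTF-8 decoder throws on byte
        // sequences that aren't valid UTF-8, which quickly rules UTF-8 out.
        var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
        byte[] raw = File.ReadAllBytes(path);
        try { strictUtf8.GetString(raw); Console.WriteLine("valid UTF-8"); }
        catch (DecoderFallbackException) { Console.WriteLine("not UTF-8"); }

        string text = File.ReadAllText(path, enc);
        Console.WriteLine(text);
    }
}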

c#, Excel + csv: how to get the correct encoding?

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
a more sophisticated one (I tried probably a dozen or more encoding combinations):
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
    tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
    tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
Try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
It writes the proper UTF-8 preamble to the file before writing the UTF-8-encoded CSV.
This solution was written up as a fix for a Java application, but you should be able to do something similar in C#. You may also want to look at the documentation for the StreamWriter class; the remarks refer to the byte order mark (BOM).
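As those remarks suggest, a StreamWriter handed a UTF8Encoding that emits its identifier will write the BOM for you, so the manual preamble isn't needed; a sketch (the file name and content are mine):
using System.IO;
using System.Text;

class BomViaStreamWriter
{
    static void Main()
    {
        // UTF8Encoding(true) requests the identifier; the StreamWriter
        // emits EF BB BF automatically at the start of the new file.
        using (var writer = new StreamWriter("UTF8.csv", false, new UTF8Encoding(true)))
        {
            writer.WriteLine("懘荧,\"Hello\",text");
        }
    }
}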
"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The encoding is UTF-8 (i.e., there is a correct Unicode header), but the file contains only ANSI data (i.e., é is not encoded the correct UTF-8 way, which would take two bytes).
Or it is the other way around: the file is ANSI (no BOM file header), but the encoding of the individual characters is, or looks like, UTF-8. That would explain ü and other characters expanding into more than one character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fix it at the source.
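If the damage has already happened and you're holding the garbled string in memory, the second case can often be reversed; a hedged sketch with a made-up sample:
using System;
using System.Text;

class FixMojibake
{
    static void Main()
    {
        string garbled = "DÃ¼mme MÃ¤ssÃ¶ng"; // UTF-8 bytes mis-read as windows-1252

        // Turn the characters back into the raw bytes they came from,
        // then decode those bytes as the UTF-8 they really are.
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
        string repaired = Encoding.UTF8.GetString(raw);

        Console.WriteLine(repaired); // Dümme Mässöng
    }
}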
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM; it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII; try UTF8).
Trevor Germain's answer helped me to save in the correct encoded format:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
I'd suggest you open the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream, but the rest of the writing doesn't specify an encoding to use, so it falls back to the StreamWriter's default, which is UTF-8. There appears to be a mix-up of encodings.
When you pop the file open in hex view: if you see lots of 0x00 bytes between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 bytes between characters, the encoding is probably UTF-8.
In the latter case, just fix up the BOM to be EF BB BF rather than FE FF and read normally with the UTF-8 encoding.
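If that is indeed what the hex view shows, the BOM swap can be done in a few lines; a sketch (the path is a placeholder):
using System;
using System.IO;

class SwapBom
{
    static void Main()
    {
        string path = "export.csv"; // placeholder
        byte[] data = File.ReadAllBytes(path);

        // FE FF (a UTF-16 BOM) in front of otherwise UTF-8 content:
        // replace it with the UTF-8 BOM EF BB BF and keep the rest.
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        {
            byte[] utf8Bom = { 0xEF, 0xBB, 0xBF };
            byte[] patched = new byte[utf8Bom.Length + data.Length - 2];
            Buffer.BlockCopy(utf8Bom, 0, patched, 0, utf8Bom.Length);
            Buffer.BlockCopy(data, 2, patched, utf8Bom.Length, data.Length - 2);
            File.WriteAllBytes(path, patched);
        }
    }
}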
For my scenario using StreamWriter, I found that explicitly passing the UTF-8 encoding to the StreamWriter enabled Excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

"É" not getting converted to two bytes correctly

Further to this question I've got a supplementary problem.
I've found a track with an "É" in the title.
My code:
var playList = new StreamWriter(playlist, false, Encoding.UTF8);
-
private static void WriteUTF8(StreamWriter playList, string output)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(output);
    foreach (byte b in byteArray)
    {
        playList.Write(Convert.ToChar(b));
    }
}
converts this to the following bytes:
195
137
which is being output as Ã followed by a square (which is a character that can't be printed in the current font).
I've exported the same file to a playlist in Media Monkey and it writes the "É" as "É" - which I'm assuming is correct (as KennyTM pointed out).
My question is, how do I get the "‰" symbol output? Do I need to select a different font and if so which one?
UPDATE
People seem to be missing the point.
I can get the "É" written to the file using
playList.WriteLine("É");
that's not the problem.
The problem is that Media Monkey requires the file to be in the following format:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
Where all the "high-ascii" (for want of a better term) are written out as a pair of characters.
UPDATE 2
I should be getting c9 replaced by c3 89.
I was going to put what I'm actually getting, but in doing the tests for this I've managed to get a test program to output the text in the right format "as is". So I need to do some more investigation.
Using Convert.ToChar like that is almost certainly a bad idea. You're basically encoding things twice.
You should either be performing the conversion yourself and then writing directly to a stream, or you should be letting the StreamWriter do the conversion. Why are you using a StreamWriter at all if you're trying to perform the conversions yourself?
Are you trying to write to a binary file, or a simple text file? If it's a simple text file, just use a StreamWriter and let that do the conversion. If it's a binary file, use a Stream instead of a StreamWriter, and perform text encoding directly where you need to, writing the bytes straight to the stream afterwards.
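A minimal sketch of that second option (the file name and content are mine): encode explicitly, then write the bytes to the stream yourself, so no second conversion can sneak in:
using System.IO;
using System.Text;

class RawStreamWrite
{
    static void Main()
    {
        using (Stream stream = File.Create("playlist.m3u"))
        {
            byte[] payload = Encoding.UTF8.GetBytes("É\r\n");
            stream.Write(payload, 0, payload.Length); // writes C3 89 0D 0A
        }
    }
}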
EDIT: Here's what's happening with your original code:
Encoding.UTF8.GetBytes(text) => byte[] { 0xc3, 0x89 };
Convert.ToChar(0xc3) => char U+00C3
StreamWriter writes U+00C3 as byte[] { 0xc3, 0x83 };
Convert.ToChar(0x89) => char U+0089
StreamWriter writes U+0089 as byte[] { 0xc2, 0x89 };
So that's why you're getting c3 83 c2 89 written to the file.
StreamWriter already converts the characters you send it to UTF-8 — that's its entire purpose. Throw WriteUTF8 away; it's broken and useless.
(WriteUTF8 is taking characters, converting them to UTF-8 bytes, converting each single byte to the character it maps to in the current code page, then encoding each of those characters in UTF-8. So in the best case you have a doubly-UTF-8-encoded string; in the worst, you've completely lost bytes that weren't mapped in the system code page repertoire; especially bad for DBCS code pages.)
The problem you're having with Media Monkey may simply be that it doesn't support UTF-8 or Unicode filenames at all. Try asking it to play (and export a playlist for) files with characters that don't fit in your system code page, for example by renaming a file to αβγ.mp3.
Edit:
#EXTINFUTF8:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#EXTINF:140,Yann Tiersen - Comptine D'Un Autre Été: L'Après Midi
#UTF8:04-Comptine D'Un Autre Été- L'Après Midi.mp3
04-Comptine D'Un Autre Été- L'Après Midi.mp3
OK, what you've got there is a mixture of encodings in the same file: it's no wonder text editors have trouble opening it. The uncommented and #EXTINF lines are in the system default code page, and are present to support media players that can't read Unicode filenames. Any filename characters not present in the system code page (e.g. the Greek above, on a Western Windows install) will be mangled and unplayable by anything that doesn't know about the #UTF8 (and #EXTINFUTF8, for the description) lines.
So if this is your target format, you'll need to grab two encodings and use each in turn, something like:
private static void writePlaylistEntry(Stream playlist, string filename, int length)
{
    Encoding utf8 = new UTF8Encoding(false); // UTF-8 without a BOM
    Encoding ansi = Encoding.Default;        // the system ANSI code page
    // Stream.Write needs (buffer, offset, count) on .NET Framework.
    void Write(byte[] bytes) => playlist.Write(bytes, 0, bytes.Length);
    Write(utf8.GetBytes("#EXTINFUTF8:" + length + "," + filename + "\n"));
    Write(ansi.GetBytes("#EXTINF:" + length + "," + filename + "\n"));
    Write(utf8.GetBytes("#UTF8:" + filename + "\n"));
    Write(ansi.GetBytes(filename + "\n"));
}
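A hypothetical call site for the method above (the file name and track values are mine):
using (var playlist = new FileStream("playlist.m3u", FileMode.Create))
{
    writePlaylistEntry(playlist, "04-Comptine D'Un Autre Été- L'Après Midi.mp3", 140);
}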
I don't do C#, but the symptoms tell me that you're indeed writing it as UTF-8, but that the output/console/application/whatever with which you're viewing the written output is not using UTF-8 but ISO-8859-1 to display it, and that MediaMonkey is using CP1252 to display it.
If you're viewing them in the IDE console, then you need to configure the IDE to use UTF-8 as console and text file encoding.
Update: you apparently want to write UTF-8 data as CP-1252. Now the question/problem is clearer. Again, I don't do C#, but the Java equivalent would be:
Writer writer = new OutputStreamWriter(new FileOutputStream("file.ext"), "CP-1252");
writer.write(someUTF8String); // Will be written as CP-1252. "É" would become "É"
Hopefully this gives some insights.
The more fundamental problem is in the name of the method:
private static void WriteUTF8(...)
.M3U files aren't UTF-8. They're Latin-1 (or Windows-1252).
Instead of Encoding.UTF8, you should be using Encoding.GetEncoding(1252). Then you can just write directly to the stream, you won't need any of this conversion weirdness.
Update:
I just tried the following C# code and the resulting .M3U opens just fine in both Winamp and WMP:
static void Main(string[] args)
{
    string fileName = @"C:\Temp\Test.m3u";
    using (StreamWriter writer = new StreamWriter(fileName, false,
        Encoding.GetEncoding(1252)))
    {
        writer.WriteLine("#EXTM3U");
        writer.WriteLine("#EXTINF:140,Yann Tiersen " +
            "- Comptine D'Un Autre Été: L'Après Midi");
        writer.WriteLine("04-Comptine D'Un Autre Été- L'Après Midi.mp3");
    }
}
So, as I said - just use the right encoding to begin with. You don't need all those extra #EXTINFUTF8 and #UTF8 lines, unless it's some bizarre requirement for Media Monkey (it's definitely not part of the basic M3U spec).
Right, first off thanks to everyone for their help and patience.
I've finally got it working correctly. I've implemented a version of bobince's solution, which is why he gets the acceptance (up-votes to everyone else). Here's my code:
var playList = new StreamWriter(playlist, false, Encoding.Default);
playList.WriteLine("#EXTM3U");

foreach (string track in tracks)
{
    // Read ID3 tags from file
    var info = new FileProperties(track);

    // Write extended info (#EXTINF:<time>,<artist> - <title>)
    if (Encoding.UTF8.GetBytes(info.Artist).Length != info.Artist.Length ||
        Encoding.UTF8.GetBytes(info.Title).Length != info.Title.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#EXTINFUTF8:{0},{1} - {2}",
            info.Duration, info.Artist, info.Title));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(string.Format("#EXTINF:{0},{1} - {2}",
        info.Duration, info.Artist, info.Title));

    // Write the name of the file (removing the drive letter)
    string file = Path.GetFileName(track);
    if (Encoding.UTF8.GetBytes(file).Length != file.Length)
    {
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.UTF8);
        playList.WriteLine(string.Format("#UTF8:{0}", file));
        playList.Close();
        playList = new StreamWriter(playlist, true, Encoding.Default);
    }
    playList.WriteLine(file);
}
playList.Close();
As you can see, I assume I'm not going to have to write UTF-8, but when I do, I close the stream and reopen it with UTF-8 encoding. Then, after writing the offending line, I close and reopen it with the default encoding.
Now, I don't know why my previous code gave inconsistent results. Given what everyone (particularly Jon) said, it should have failed all the time, or possibly worked all the time.
