Encode a CP1252 string to a UTF-8 string in C#

How can I convert a CP1252 string to a UTF-8 string in C#?
I tried this code, but it doesn't work:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.GetEncoding(1251); // note: 1251 is Windows-1251 (Cyrillic), not UTF-8
byte[] wind1252Bytes = ReadFile(myString1252);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string myStringUtf8 = Encoding.UTF8.GetString(utf8Bytes);

var myGoodString = System.IO.File.ReadAllText(
    @"C:\path\to\file.txt",
    Encoding.GetEncoding("Windows-1252"));
A .NET/CLR string in memory cannot be UTF-8. It is just Unicode, or UTF-16 if you like.
The above code will properly read a text file in CP1252 into a .NET string.
If you insist on going through a byte[] wind1252Bytes, it is simply:
var myGoodString = Encoding.GetEncoding("Windows-1252").GetString(wind1252Bytes);
Since this answer was written, newer versions of .NET have appeared that no longer recognize all the old (legacy) Windows-specific code pages by default. If Encoding.GetEncoding("Windows-1252") throws an exception on your runtime, try registering an additional provider with
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
(this may require a reference to the System.Text.Encoding.CodePages assembly) before you use Encoding.GetEncoding("Windows-1252").
See CodePagesEncodingProvider class documentation.
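Putting the pieces together on a newer runtime, a minimal sketch (assuming the System.Text.Encoding.CodePages package is referenced; the path is the same placeholder as above):
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // register legacy code pages once, at startup
var myGoodString = System.IO.File.ReadAllText(
    @"C:\path\to\file.txt",
    Encoding.GetEncoding("Windows-1252"));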

Related

OpenXml SDK excel accented French Chars (éèçà) [duplicate]

(This question's body and answers are identical to "c#, Excel + csv: how to get the correct encoding?" below.)

Convert text file from known (1252) encoding to UTF8 File

I have a file containing text in Windows-1252 encoding. How do I convert it to UTF-8?
The Encoding class supports conversions.
using System.IO;
using System.Text;

// Read the raw Windows-1252 bytes, convert them to UTF-8, and write the result back out.
byte[] win1252Bytes = File.ReadAllBytes("C:\\ascii.txt");
Encoding win1252 = Encoding.GetEncoding("windows-1252");
byte[] utf8Bytes = Encoding.Convert(win1252, Encoding.UTF8, win1252Bytes);
File.WriteAllBytes("C:\\utf8.txt", utf8Bytes);
Note that GetEncoding() relies on the underlying platform to support most code pages, as explained here: http://msdn.microsoft.com/en-us/library/t9a3kf7c(v=vs.100).aspx
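For large files, a streaming variant avoids loading everything into memory; a sketch under the same assumptions (same placeholder paths as above):
using System.IO;
using System.Text;

using (var reader = new StreamReader("C:\\ascii.txt", Encoding.GetEncoding("windows-1252")))
using (var writer = new StreamWriter("C:\\utf8.txt", false, new UTF8Encoding(false))) // no BOM, matching the Convert-based output
{
    string line;
    while ((line = reader.ReadLine()) != null)
        writer.WriteLine(line);
}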

How to "iso-8859-1" encoding a string in jQuery?

I'm looking for a jQuery (or jQuery plugin) equivalent of this C# code block. What it does is encode a string as a Base64 string in the ISO-8859-1 character set.
string authInfo = "encrypted secret";
Encoding encoding = Encoding.GetEncoding("iso-8859-1");
byte[] authBytes = encoding.GetBytes(authInfo);
string encryptedMsg = Convert.ToBase64String(authBytes);
Is there a plugin out there that can do this?
Found a jQuery plugin that's close enough to what I need: Base64 encode and decode
It doesn't have an option to specify the character set, but I can live with it for now. So the jQuery code becomes:
authInfo = $.base64.encode(authInfo);
I believe you must set the character encoding of the page (or wherever authInfo is defined) to ISO-8859-1. You may also specify the character encoding on the script tag for referenced JavaScript files if authInfo is defined in one of those.
As for base64 encoding, here's a page that has a code snippet that does just that: http://www.webtoolkit.info/javascript-base64.html

c#, Excel + csv: how to get the correct encoding?

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, Excel result on the right):
Messwert(µm / m) ==> Messwert(µm / m)
Dümme Mässöng ==> Dümme Mässöng
Notepad++ tells me that the file is encoded "ANSI as UTF8" (WTF?)
So here are different ways I tried to get a valid result:
obvious implementation:
tWriter.Write(";Messwert(µm /m)");
a more sophisticated one (I tried probably a dozen or more encoding combinations):
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF"); // hand-written BOM character
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
    tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
    tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
Try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8-encoded CSV.
This solution is written up as a fix for a Java application; however, you should be able to do something similar in C#. You may also want to look at the documentation for the StreamWriter class; the remarks section discusses the byte order mark (BOM).
"ANSI as UTF8"(WTF?)
NotePad++ is probably correct. The encoding is UTF8 (i.e., correct Unicode header), but only contains ANSI data (i.e., é is not encoded in correct UTF8 way, which would mean two bytes).
Or: it is the other way around. It is ANSI (no file header BOM), but the encoding of the individual characters is, or looks like, UTF8. This would explain the ü and other characters expanding in more than one other character. You can fix this by forcing the file to be read as Unicode.
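To illustrate that second case, a minimal sketch (not from the original answer): UTF-8 bytes re-read as Windows-1252 reproduce exactly the garbage shown in the question.
var utf8Bytes = Encoding.UTF8.GetBytes("ü");                    // two bytes: C3 BC
var mojibake = Encoding.GetEncoding(1252).GetString(utf8Bytes); // "ü"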
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoded BOM; it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII; try UTF8).
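Applied to the question's code, that advice might look like this sketch (StreamWriter is itself a TextWriter, so it is enough to pass the encoding):
using (var tStream = new MemoryStream())
using (var tWriter = new StreamWriter(tStream, Encoding.UTF8)) // emits the UTF-8 BOM automatically
{
    tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
    tWriter.Flush();
    // tStream now holds EF BB BF followed by the UTF-8 encoded text
}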
Trevor Germain's answer helped me save in the correctly encoded format:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
    var preamble = Encoding.UTF8.GetPreamble();
    sw.Write(preamble, 0, preamble.Length);
    var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
    sw.Write(data, 0, data.Length);
}
I'd suggest you open the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the code above apparently writes to the stream, but the rest of the output doesn't specify an encoding, so it uses the StreamWriter's default, which is UTF-8. There appears to be a mix-up of encodings.
When you open the file in hex view, if you see lots of 0x00 bytes between the characters, you're looking at UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 bytes between characters, the encoding is probably UTF-8.
In the latter case, just fix the BOM to be EF BB BF rather than FE FF and read the file normally with UTF-8 encoding.
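If you don't have a hex editor handy, a short sketch can dump the leading bytes (using the UTF8.csv name from the snippet above):
using System;
using System.IO;
using System.Linq;

byte[] head = File.ReadAllBytes("UTF8.csv").Take(4).ToArray();
Console.WriteLine(BitConverter.ToString(head)); // EF-BB-BF-... => UTF-8 BOM; FF-FE-... => UTF-16 LE BOM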
For my scenario using StreamWriter, I found that explicitly passing a UTF8 encoding to the StreamWriter enabled Excel to read the file with the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

Difference in SHA hashes between ruby and C#

I'm developing an application that makes use of some REST web services.
Its technical documentation says that I should pass a SHA256 hash of some string in the request.
An example request in the documentation uses this string:
hn-Rw2ZHYwllUYkklL5Zo_7lWJVkrbShZPb5CD1expires=1893013926label[0]=any/somestatistics=1d,2d,7d,28d,30d,31d,lifetimestatus=upl,livetitle=a
After executing:
digest = Digest::SHA256.digest(string_to_sign)
signature = Base64::encode64(digest).chomp.gsub(/=+$/, '')
results in a hash:
YRYuN2zO+VvxISNp/vKQM5Cl6Dpzoin7mNES0IZJ06U
This example is in Ruby, as the documentation is aimed at Ruby developers.
I'm developing my application in C#, and for exactly the same string, when I execute:
byte[] rawHash = sha256.ComputeHash(rawRequest, 0, rawRequest.Length);
string friendlyHash = Convert.ToBase64String(rawHash);
and remove the trailing "=" signs, I get:
Vw8pl/KxnjcEbyHtfNiMikXZdIunysFF2Ujsow8hyiw
and therefore the application fails, resulting in a signature mismatch error.
I've tried changing the encoding when converting the string to a byte array before hashing, and nothing changed.
Any ideas?
Based on the document here, you are missing a - (a dash) in your string. It seems that Acrobat helpfully removes it when you copy-paste from the document...
Here is some code that I splatted together that gets the same value as the example (well, it would if you trimmed the final =):
string s = "hn-Rw2ZH-YwllUYkklL5Zo_7lWJVkrbShZPb5CD1expires=1893013926label[0]=any/somestatistics=1d,2d,7d,28d,30d,31d,lifetimestatus=upl,livetitle=a";
SHA256Managed sh = new SHA256Managed();
byte[] request = System.Text.UTF8Encoding.UTF8.GetBytes(s);
sh.Initialize();
byte[] b4bbuff = sh.ComputeHash(request, 0, request.Length);
string b64 = Convert.ToBase64String(b4bbuff);
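To match the documented signature exactly, trim the trailing Base64 padding:
string signature = b64.TrimEnd('='); // "YRYuN2zO+VvxISNp/vKQM5Cl6Dpzoin7mNES0IZJ06U" per the documentation example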
