How to convert from ISO-8859-1 to UTF-8? - c#

I need to convert my HTML file, which is in charset=iso-8859-1, to UTF-8. Could you help me?
This is my code:
string converHtml = File.ReadAllText(html);
Encoding iso = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] isoBytes = iso.GetBytes(converHtml);
byte[] utf8Bytes = Encoding.Convert(utf8, iso, isoBytes);
string msg = utf8.GetString(utf8Bytes);
msg = HttpUtility.HtmlDecode(msg);
return msg;

Thanks, Klaus Gütter and Alexei Levenkov, it worked for me.
This is my code:
StreamReader sr = new StreamReader(html, Encoding.GetEncoding(28591));
var ags = sr.ReadToEnd();
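For the conversion itself, here is a minimal sketch, assuming the file really is ISO-8859-1 and is small enough to read into memory: read it with the source encoding, then write the text back out as UTF-8. The class name, method name, and temp-file demo are hypothetical and not part of the original answer.

```csharp
using System;
using System.IO;
using System.Text;

public static class Latin1ToUtf8
{
    // Reads a file as ISO-8859-1 and writes its content as UTF-8 (without a BOM).
    public static void ConvertFile(string sourcePath, string targetPath)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string text = File.ReadAllText(sourcePath, latin1);
        File.WriteAllText(targetPath, text, new UTF8Encoding(false));
    }

    public static void Main()
    {
        // Hypothetical demo: create a small Latin-1 file, then convert it.
        string src = Path.GetTempFileName();
        string dst = Path.GetTempFileName();
        // "Gütter" encoded as ISO-8859-1: ü is the single byte 0xFC.
        File.WriteAllBytes(src, new byte[] { 0x47, 0xFC, 0x74, 0x74, 0x65, 0x72 });
        ConvertFile(src, dst);
        // In UTF-8, ü becomes the two bytes C3 BC.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(dst))); // 47-C3-BC-74-74-65-72
    }
}
```

If the HTML declares its charset in a meta tag, that declaration probably needs updating as well, since the bytes on disk change.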

Related

How to remove unknown chars on string in windows-1251 charset

I have a text which cannot be converted to windows-1251 charset. For example:
中华全国工商业联合会-HelloWorld
I have a method for converting from UTF8 to windows-1251:
static string ChangeEncoding(string text)
{
if (text == null || text == "")
return "";
Encoding win1251 = Encoding.GetEncoding("windows-1251");
Encoding ascii = Encoding.UTF8;
byte[] utfBytes = ascii.GetBytes(text);
byte[] isoBytes = Encoding.Convert(ascii, win1251, utfBytes);
return win1251.GetString(isoBytes);
}
Now it is returning this:
??????????-HelloWorld
I don't want to show characters that could not be converted to the windows-1251 charset. In this case I want just:
-HelloWorld
How can I do this?
Following @JeroenMostert's suggestion, this method helped me:
public static string ChangeEncoding(string text)
{
Encoding win1251 = Encoding.GetEncoding("windows-1251", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
return win1251.GetString(Encoding.Convert(Encoding.UTF8, win1251, Encoding.UTF8.GetBytes(text)));
}
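The same fallback idea can be tried without any extra setup. This is a sketch, not the answer's exact code: it swaps in ISO-8859-1 as the target (windows-1251 typically needs CodePagesEncodingProvider registered on .NET Core and later) and skips Encoding.Convert by encoding and decoding with the same encoding. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class DropUnmappable
{
    // Returns only the characters of text that the named encoding can represent.
    public static string KeepOnly(string text, string encodingName)
    {
        if (string.IsNullOrEmpty(text))
            return "";

        // An empty replacement string makes the encoder drop unmappable characters
        // instead of substituting '?'.
        Encoding target = Encoding.GetEncoding(
            encodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());

        return target.GetString(target.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(KeepOnly("中华全国工商业联合会-HelloWorld", "ISO-8859-1")); // -HelloWorld
    }
}
```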

Can't fully correct encoding issue from website [duplicate]

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Spanish:
Acción
whereas it should look like this:
Acción
According to the answer on this question:
How to know string encoding in C#, the string I am receiving should already be UTF-8, but it is being read with Encoding.Default (probably ANSI?).
I am trying to transform this string into real UTF-8, but one of the problems is that I can only see a subset of the Encoding class (UTF8 and Unicode properties only), probably because I'm limited to the Windows Surface API.
I have tried some snippets I've found on the internet, but none of them have proved successful so far for Eastern languages (e.g. Korean). One example is as follows:
var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(myString);
myString= utf8.GetString(utfBytes, 0, utfBytes.Length);
I also tried extracting the string into a byte array and then using UTF8.GetString:
byte[] myByteArray = new byte[myString.Length];
for (int ix = 0; ix < myString.Length; ++ix)
{
char ch = myString[ix];
myByteArray[ix] = (byte) ch;
}
myString = Encoding.UTF8.GetString(myByteArray, 0, myString.Length);
Do you guys have any other ideas that I could try?
Since you know the string is coming in as Encoding.Default, you could simply use:
byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);
Another thing to remember: if you are using Console.WriteLine to output strings, you should also set Console.OutputEncoding = System.Text.Encoding.UTF8; otherwise UTF-8 strings are rendered in the console's default code page (for example GBK on a Chinese-locale system).
string utf8String = "Acción";
string propEncodeString = string.Empty;
byte[] utf8_Bytes = new byte[utf8String.Length];
for (int i = 0; i < utf8String.Length; ++i)
{
utf8_Bytes[i] = (byte)utf8String[i];
}
propEncodeString = Encoding.UTF8.GetString(utf8_Bytes, 0, utf8_Bytes.Length);
The output should look like:
Acción
Similarly, day’s should display as
day's
Call DecodeFromUtf8():
private static string DecodeFromUtf8()
{
string utf8_String = "day’s";
// Recover the original bytes, then decode them as UTF-8.
byte[] bytes = Encoding.Default.GetBytes(utf8_String);
return Encoding.UTF8.GetString(bytes);
}
Your code is reading a sequence of UTF8-encoded bytes, and decoding them using an 8-bit encoding.
You need to fix that code to decode the bytes as UTF8.
Alternatively (not ideal), you could convert the bad string back to the original byte array—by encoding it using the incorrect encoding—then re-decode the bytes as UTF8.
string fixedString = Encoding.UTF8.GetString(Encoding.Default.GetBytes(mystring));
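The round trip described above can be sketched with an explicit encoding. This assumes the wrong decoder was ISO-8859-1; naming it explicitly matters because Encoding.Default is UTF-8 on .NET Core and later, where encoding with it would not recover the original bytes. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Re-encodes the mangled text with the encoding that was wrongly used to decode it,
    // which recovers the original bytes, then decodes those bytes as UTF-8.
    public static string Repair(string mangled, Encoding wrongEncoding)
    {
        byte[] originalBytes = wrongEncoding.GetBytes(mangled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string original = "Acci\u00F3n"; // "Acción"

        // Simulate the bug: UTF-8 bytes decoded with an 8-bit encoding.
        string mangled = latin1.GetString(Encoding.UTF8.GetBytes(original));

        string repaired = Repair(mangled, latin1);
        Console.WriteLine(repaired == original); // True
    }
}
```

This only works while the wrong decoding was lossless; fixing the code that reads the bytes remains the better option.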
@anothershrubery's answer worked for me. I've made an enhancement using a StringExtensions class so I can easily convert any string in my program.
Method:
public static class StringExtensions
{
public static string ToUTF8(this string text)
{
return Encoding.UTF8.GetString(Encoding.Default.GetBytes(text));
}
}
Usage:
string myString = "Acción";
string strConverted = myString.ToUTF8();
Or simply:
string strConverted = "Acción".ToUTF8();
If you want to save any string to a MySQL database, do this:
1) The collation of your database field, set in phpMyAdmin (or any other control panel), should be utf8_general_ci.
2) Convert your string (e.g. textBox1.Text) to UTF-8 bytes:
byte[] st2 = System.Text.Encoding.UTF8.GetBytes(textBox1.Text);
3) Execute this SQL command before any query:
string mysql_query2 = "SET NAMES 'utf8'";
cmd.CommandText = mysql_query2;
cmd.ExecuteNonQuery();
3-2) Now insert the value into, for example, the name field:
cmd.CommandText = "INSERT INTO customer (`name`) values (@name)";
4) The main step, which many solutions overlook, is to use AddWithValue instead of Add for the command parameter:
cmd.Parameters.AddWithValue("@name",ut);
Enjoy real data in your database server instead of ????
Use the code snippet below to get the bytes from a CSV file:
protected byte[] GetCSVFileContent(string fileName)
{
StringBuilder sb = new StringBuilder();
using (StreamReader sr = new StreamReader(fileName, Encoding.Default, true))
{
String line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
sb.AppendLine(line);
}
}
string allines = sb.ToString();
UTF8Encoding utf8 = new UTF8Encoding();
var preamble = utf8.GetPreamble();
var data = utf8.GetBytes(allines);
return data;
}
Call it as shown below and send the result as an attachment:
Encoding csvEncoding = Encoding.UTF8;
//byte[] csvFile = GetCSVFileContent(FileUpload1.PostedFile.FileName);
byte[] csvFile = GetCSVFileContent("Your_CSV_File_NAme");
string attachment = String.Format("attachment; filename={0}.csv", "uomEncoded");
Response.Clear();
Response.ClearHeaders();
Response.ClearContent();
Response.ContentType = "text/csv";
Response.ContentEncoding = csvEncoding;
Response.AppendHeader("Content-Disposition", attachment);
//Response.BinaryWrite(csvEncoding.GetPreamble());
Response.BinaryWrite(csvFile);
Response.Flush();
Response.End();
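One detail about the preamble lines in the snippets above: an instance created with the parameterless UTF8Encoding constructor returns an empty preamble, while Encoding.UTF8 returns the three-byte BOM. The sketch below shows the difference and one way to prepend the BOM, which some spreadsheet applications use to detect UTF-8. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8Preamble
{
    // Returns the UTF-8 bytes of text, prefixed with the UTF-8 byte order mark.
    public static byte[] WithBom(string text)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] data = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[bom.Length + data.Length];
        Buffer.BlockCopy(bom, 0, result, 0, bom.Length);
        Buffer.BlockCopy(data, 0, result, bom.Length, data.Length);
        return result;
    }

    public static void Main()
    {
        // The parameterless constructor creates an encoding that emits no BOM.
        Console.WriteLine(new UTF8Encoding().GetPreamble().Length); // 0
        Console.WriteLine(BitConverter.ToString(WithBom("a;b")));   // EF-BB-BF-61-3B-62
    }
}
```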

How to convert a string to UTF8?

I have a string that contains some Unicode characters. How do I convert it to UTF-8 encoding?
This snippet makes an array of bytes with your string encoded in UTF-8:
UTF8Encoding utf8 = new UTF8Encoding();
string unicodeString = "Quick brown fox";
byte[] encodedBytes = utf8.GetBytes(unicodeString);
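To see what GetBytes produces, here is a small sketch that prints the encoded bytes as hex. The sample text and the class and method names are made up for illustration; the point is that non-ASCII characters take more than one byte in UTF-8.

```csharp
using System;
using System.Text;

public static class Utf8Bytes
{
    // Encodes text as UTF-8 and formats the bytes as a hex string such as "41-C3-B1".
    public static string ToUtf8Hex(string text)
    {
        byte[] encodedBytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(encodedBytes);
    }

    public static void Main()
    {
        // 'A' takes 1 byte, 'ñ' (U+00F1) takes 2, and '€' (U+20AC) takes 3.
        string sample = "A\u00F1\u20AC";
        Console.WriteLine(ToUtf8Hex(sample));                  // 41-C3-B1-E2-82-AC
        Console.WriteLine(Encoding.UTF8.GetByteCount(sample)); // 6
    }
}
```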
Try this function; it should work out of the box. You may need to adjust the naming conventions though.
private string UnicodeToUTF8(string strFrom)
{
// Encode the string as UTF-16, then transcode those bytes to UTF-8.
byte[] bytSrc = Encoding.Unicode.GetBytes(strFrom);
byte[] bytDestination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, bytSrc);
// Decode with the same encoding used as the conversion target (UTF-8, not ASCII),
// otherwise every non-ASCII character is lost.
return Encoding.UTF8.GetString(bytDestination);
}
This does it with minimal code:
byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);
Try this code:
string unicodeString = "Quick brown fox";
// Pre-size the list; a string cannot be passed directly to the List<byte> constructor.
var bytes = new List<byte>(unicodeString.Length);
foreach (var c in unicodeString)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());

C# Convert string from UTF-8 to ISO-8859-1 (Latin1) H

I have googled on this topic and I have looked at every answer, but I still don't get it.
Basically I need to convert UTF-8 string to ISO-8859-1 and I do it using following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(Message));
My source string is
Message = "ÄäÖöÕõÜü"
But unfortunately my result string becomes
msg = "�ä�ö�õ�ü"
What am I doing wrong here?
Use Encoding.Convert to adjust the byte array before attempting to decode it into your destination encoding.
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
I think your problem is that you assume the bytes that represent the UTF-8 string will result in the same string when interpreted as something else (ISO-8859-1). That is simply not the case. I recommend that you read this excellent article by Joel Spolsky.
Try this:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8,iso,utfBytes);
string msg = iso.GetString(isoBytes);
You need to fix the source of the string in the first place.
A string in .NET is just a sequence of 16-bit UTF-16 code units (chars), so a string isn't in any particular byte encoding.
It's when you take that string and convert it to a set of bytes that encoding comes into play.
In any case, what you did, encoding a string to a byte array with one character set and then decoding it with another, will not work, as you can see.
Can you tell us more about where that original string comes from, and why you think it has been encoded wrong?
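The point about encodings only mattering at the byte level can be illustrated with a small sketch. The class and method names are hypothetical; it shows one character producing different bytes under UTF-8 and ISO-8859-1, and what happens when the UTF-8 bytes are decoded with the wrong encoding.

```csharp
using System;
using System.Text;

public static class SameTextDifferentBytes
{
    // Formats the bytes that an encoding produces for the given text.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string text = "\u00C4"; // "Ä"

        Console.WriteLine(Hex(text, Encoding.UTF8)); // C3-84
        Console.WriteLine(Hex(text, latin1));        // C4

        // Decoding the two UTF-8 bytes as ISO-8859-1 yields two characters instead of one.
        string wrong = latin1.GetString(Encoding.UTF8.GetBytes(text));
        Console.WriteLine(wrong.Length); // 2
    }
}
```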
That code seems a bit strange. To get a string from a UTF-8 byte stream, all you need to do is:
string str = Encoding.UTF8.GetString(utf8ByteArray);
If you need to save an ISO-8859-1 byte stream somewhere, just add one more line after the previous one:
byte[] iso88591data = Encoding.GetEncoding("ISO-8859-1").GetBytes(str);
Maybe it can help
Convert one codepage to another:
public static string fnStringConverterCodepage(string sText, string sCodepageIn = "ISO-8859-8", string sCodepageOut="ISO-8859-8")
{
string sResultado = string.Empty;
try
{
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding(sCodepageIn).GetBytes(sText);
sResultado = System.Text.Encoding.GetEncoding(sCodepageOut).GetString(tempBytes);
}
catch (Exception)
{
sResultado = "";
}
return sResultado;
}
Usage:
string sMsg = "ERRO: Não foi possivel acessar o servico de Autenticação";
var sOut = fnStringConverterCodepage(sMsg, "ISO-8859-1", "UTF-8");
Output:
"ERRO: Não foi possivel acessar o servico de Autenticação"
Encoding targetEncoding = Encoding.GetEncoding(1252);
// Encode a string into an array of bytes.
Byte[] encodedBytes = targetEncoding.GetBytes(utfString);
// Show the encoded byte values.
Console.WriteLine("Encoded bytes: " + BitConverter.ToString(encodedBytes));
// Decode the byte array back to a string.
String decodedString = Encoding.Default.GetString(encodedBytes);
I just used Nathan's solution and it works fine. I needed to convert ISO-8859-1 to Unicode:
string isocontent = Encoding.GetEncoding("ISO-8859-1").GetString(fileContent, 0, fileContent.Length);
byte[] isobytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(isocontent);
byte[] ubytes = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, isobytes);
return Encoding.Unicode.GetString(ubytes, 0, ubytes.Length);
Here is a sample for ISO-8859-9:
protected void btnKaydet_Click(object sender, EventArgs e)
{
Response.Clear();
Response.Buffer = true;
Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
Response.AddHeader("Content-Disposition", "attachment; filename=XXXX.doc");
Response.ContentEncoding = Encoding.GetEncoding("ISO-8859-9");
Response.Charset = "ISO-8859-9";
EnableViewState = false;
StringWriter writer = new StringWriter();
HtmlTextWriter html = new HtmlTextWriter(writer);
form1.RenderControl(html);
byte[] bytesInStream = Encoding.GetEncoding("iso-8859-9").GetBytes(writer.ToString());
MemoryStream memoryStream = new MemoryStream(bytesInStream);
string msgBody = "";
string Email = "mail@xxxxxx.org";
SmtpClient client = new SmtpClient("mail.xxxxx.org");
MailMessage message = new MailMessage(Email, "mail@someone.com", "ONLINE APP FORM WITH WORD DOC", msgBody);
Attachment att = new Attachment(memoryStream, "XXXX.doc", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
message.Attachments.Add(att);
message.BodyEncoding = System.Text.Encoding.UTF8;
message.IsBodyHtml = true;
client.Send(message);
}
