How to remove unknown chars on string in windows-1251 charset

I have a text which cannot be converted to windows-1251 charset. For example:
中华全国工商业联合会-HelloWorld
I have a method for converting from UTF8 to windows-1251:
static string ChangeEncoding(string text)
{
if (text == null || text == "")
return "";
Encoding win1251 = Encoding.GetEncoding("windows-1251");
Encoding ascii = Encoding.UTF8;
byte[] utfBytes = ascii.GetBytes(text);
byte[] isoBytes = Encoding.Convert(ascii, win1251, utfBytes);
return win1251.GetString(isoBytes);
}
Now it is returning this:
??????????-HelloWorld
I don't want to show characters that could not be converted to the windows-1251 charset. In this case I want just:
-HelloWorld
How can I do this?

Following @JeroenMostert's suggestion, this method worked for me:
public static string ChangeEncoding(string text)
{
Encoding win1251 = Encoding.GetEncoding("windows-1251", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
return win1251.GetString(Encoding.Convert(Encoding.UTF8, win1251, Encoding.UTF8.GetBytes(text)));
}
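A minimal sketch of the same idea, assuming .NET Core or .NET 5+, where windows-1251 only becomes available after registering CodePagesEncodingProvider (on .NET Framework the code page is built in, so that line may not be needed). Because a .NET string is already UTF-16, the sketch encodes it to windows-1251 directly instead of going through UTF-8 bytes first. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Win1251Filter
{
    static Win1251Filter()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Drops every character that has no windows-1251 representation.
    public static string StripUnencodable(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // An empty replacement string makes the encoder skip unmappable
        // characters instead of emitting '?'.
        Encoding win1251 = Encoding.GetEncoding(
            "windows-1251",
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());

        byte[] bytes = win1251.GetBytes(text);
        return win1251.GetString(bytes);
    }
}
```

Calling Win1251Filter.StripUnencodable on the example string from the question should leave only the part that windows-1251 can represent.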

Related

Base 64 encoding in wp7?

I have a string that I would like to encode in Base64. I would also like the encoded result to be saved in a string.
In iOS it would be:
- (NSString *)encodeCredentials
{
// string to be encoded
NSString *deviceUUID = @"34543647yrgsav635Chbvcew4f56v";
NSData *plainTextData = [deviceUUID dataUsingEncoding:NSUTF8StringEncoding];
NSString *base64String = [plainTextData base64EncodedString];
// return the encoded string
return base64String;
}
How would that be done in WP7?

For encoding:
public string Encode(string str)
{
return Convert.ToBase64String(System.Text.Encoding.UTF8.GetBytes(str));
}
For decoding:
public string Decode(string str)
{
return System.Text.Encoding.UTF8.GetString(Convert.FromBase64String(str));
}

I found this method on the net; I hope it helps:
static public string EncodeTo64(string toEncode)
{
byte[] toEncodeAsBytes
= System.Text.ASCIIEncoding.ASCII.GetBytes(toEncode);
string returnValue
= System.Convert.ToBase64String(toEncodeAsBytes);
return returnValue;
}
On WP7 you will probably have to replace ASCIIEncoding with UTF8Encoding or Encoding. I do not remember exactly which, but IntelliSense will show you; either way it is in System.Text.
Here is the documentation for System.Convert.ToBase64String: http://msdn.microsoft.com/en-us/library/dhx0d524.aspx
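The choice of text encoding matters as soon as the input contains non-ASCII characters. The sketch below (hypothetical class and method names) runs the same Base64 round trip with a caller-supplied encoding, so the two can be compared: the ASCII encoder substitutes '?' for characters it cannot represent, while UTF-8 preserves them.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    // Text -> bytes (in the given encoding) -> Base64 string.
    public static string EncodeWith(Encoding encoding, string text)
    {
        return Convert.ToBase64String(encoding.GetBytes(text));
    }

    // Base64 string -> bytes -> text (in the given encoding).
    public static string DecodeWith(Encoding encoding, string base64)
    {
        return encoding.GetString(Convert.FromBase64String(base64));
    }
}
```

For example, round-tripping "Acción" through Encoding.UTF8 returns the original text, whereas Encoding.ASCII returns "Acci?n".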

How can I transform string to UTF-8 in C#?

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Spanish:
AcciÃ³n
whereas it should look like this:
Acción
According to the answer to this question:
How to know string encoding in C#, the string I am receiving should already be UTF-8, but it is being read with Encoding.Default (probably ANSI?).
I am trying to transform this string into real UTF-8, but one of the problems is that I can only see a subset of the Encoding class (the UTF8 and Unicode properties only), probably because I'm limited to the Windows Surface API.
I have tried some snippets I found on the internet, but none of them has worked so far for Eastern languages (e.g. Korean). One example is as follows:
var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(myString);
myString= utf8.GetString(utfBytes, 0, utfBytes.Length);
I also tried extracting the string into a byte array and then using UTF8.GetString:
byte[] myByteArray = new byte[myString.Length];
for (int ix = 0; ix < myString.Length; ++ix)
{
char ch = myString[ix];
myByteArray[ix] = (byte) ch;
}
myString = Encoding.UTF8.GetString(myByteArray, 0, myString.Length);
Do you guys have any other ideas that I could try?

Since you know the string is coming in as Encoding.Default, you could simply use:
byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);
One more thing to remember: if you use Console.WriteLine to output strings, you should also set Console.OutputEncoding = System.Text.Encoding.UTF8; otherwise UTF-8 strings may be written in the console's default code page (GBK in my case).

string utf8String = "AcciÃ³n";
string propEncodeString = string.Empty;
byte[] utf8_Bytes = new byte[utf8String.Length];
for (int i = 0; i < utf8String.Length; ++i)
{
utf8_Bytes[i] = (byte)utf8String[i];
}
propEncodeString = Encoding.UTF8.GetString(utf8_Bytes, 0, utf8_Bytes.Length);
Output should look like
Acción
Likewise, dayâ€™s should display as
day’s
Call DecodeFromUtf8():
private static void DecodeFromUtf8()
{
string utf8_String = "dayâ€™s";
byte[] bytes = Encoding.Default.GetBytes(utf8_String);
utf8_String = Encoding.UTF8.GetString(bytes);
}

Your code is reading a sequence of UTF8-encoded bytes, and decoding them using an 8-bit encoding.
You need to fix that code to decode the bytes as UTF8.
Alternatively (not ideal), you could convert the bad string back to the original byte array—by encoding it using the incorrect encoding—then re-decode the bytes as UTF8.
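The workaround described above can be sketched as follows. This assumes the bytes were wrongly decoded as ISO-8859-1, which maps every byte value to the character with the same code point, so the original bytes can be recovered exactly. If the wrong decoder was windows-1252 instead (as the characters in "dayâ€™s" suggest), that encoding would have to be used for the first step. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Re-encodes the garbled text with the encoding that was wrongly used
    // to decode it, then decodes the recovered bytes as UTF-8.
    public static string Repair(string garbled)
    {
        // Assumption: the wrong decoder was ISO-8859-1 (Latin-1).
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

Note that encoding a string with UTF-8 and immediately decoding it with UTF-8 again returns the same string, so that approach cannot repair anything.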

Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(mystring));

@anothershrubery's answer worked for me. I've made an enhancement using a StringExtensions class so I can easily convert any string in my program.
Method:
public static class StringExtensions
{
public static string ToUTF8(this string text)
{
return Encoding.UTF8.GetString(Encoding.Default.GetBytes(text));
}
}
Usage:
string myString = "AcciÃ³n";
string strConverted = myString.ToUTF8();
Or simply:
string strConverted = "AcciÃ³n".ToUTF8();

If you want to save any string to a MySQL database, do this:
1) In phpMyAdmin (or any other control panel), set the collation of your database field to utf8_general_ci.
2) Convert your string (e.g. textBox1.Text) to bytes:
2-1) Declare byte[] st2;
2-2) Convert your string (textBox1.Text) to a UTF-8 (multibyte) byte array with:
st2 = System.Text.Encoding.UTF8.GetBytes(textBox1.Text);
3) Execute this SQL command before any query:
string mysql_query2 = "SET NAMES 'utf8'";
cmd.CommandText = mysql_query2;
cmd.ExecuteNonQuery();
3-2) Now insert the value into, for example, the name field with:
cmd.CommandText = "INSERT INTO customer (`name`) values (@name)";
4) The key step that many solutions overlook: use AddWithValue instead of Add for the command parameter, as below:
cmd.Parameters.AddWithValue("@name", ut);
Enjoy real data in your database server instead of ????

Use the below code snippet to get bytes from csv file
protected byte[] GetCSVFileContent(string fileName)
{
StringBuilder sb = new StringBuilder();
using (StreamReader sr = new StreamReader(fileName, Encoding.Default, true))
{
String line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
sb.AppendLine(line);
}
}
string allines = sb.ToString();
UTF8Encoding utf8 = new UTF8Encoding();
var preamble = utf8.GetPreamble();
var data = utf8.GetBytes(allines);
return data;
}
Call the below and save it as an attachment
Encoding csvEncoding = Encoding.UTF8;
//byte[] csvFile = GetCSVFileContent(FileUpload1.PostedFile.FileName);
byte[] csvFile = GetCSVFileContent("Your_CSV_File_NAme");
string attachment = String.Format("attachment; filename={0}.csv", "uomEncoded");
Response.Clear();
Response.ClearHeaders();
Response.ClearContent();
Response.ContentType = "text/csv";
Response.ContentEncoding = csvEncoding;
Response.AppendHeader("Content-Disposition", attachment);
//Response.BinaryWrite(csvEncoding.GetPreamble());
Response.BinaryWrite(csvFile);
Response.Flush();
Response.End();
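The snippet above computes a preamble but never writes it, and the BinaryWrite of the preamble is commented out. In addition, a UTF8Encoding created with the parameterless constructor returns an empty preamble, whereas Encoding.UTF8 returns the three-byte UTF-8 byte order mark. If the consumer of the CSV relies on that mark to detect UTF-8 (some spreadsheet programs reportedly do), a sketch like the following could prepend it. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CsvBytes
{
    // Returns the UTF-8 bytes of csvText prefixed with the UTF-8 byte order mark.
    public static byte[] WithUtf8Bom(string csvText)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        byte[] body = Encoding.UTF8.GetBytes(csvText);

        byte[] result = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, result, preamble.Length, body.Length);
        return result;
    }
}
```

The returned array could then be passed to Response.BinaryWrite in place of csvFile.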

How to convert a string to UTF8?

I have a string that contains some unicode, how do I convert it to UTF-8 encoding?

This snippet makes an array of bytes with your string encoded in UTF-8:
UTF8Encoding utf8 = new UTF8Encoding();
string unicodeString = "Quick brown fox";
byte[] encodedBytes = utf8.GetBytes(unicodeString);
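It may help to look at the resulting bytes. A .NET string is always UTF-16 in memory, so "converting to UTF-8" means producing a byte array, not another string. The sketch below (hypothetical names) prints the UTF-8 bytes as hex, which shows that a non-ASCII character such as ó takes two bytes.

```csharp
using System;
using System.Text;

public static class Utf8Dump
{
    // Returns the UTF-8 bytes of the text as dash-separated hex, e.g. "41-62".
    public static string ToHex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

For "Acción" this yields 41-63-63-69-C3-B3-6E: seven bytes for six characters.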

Try this function, this should fix it out-of-box. You may need to fix naming conventions though.
private string UnicodeToUTF8(string strFrom)
{
byte[] bytSrc;
byte[] bytDestination;
string strTo = String.Empty;
// Get the UTF-16 bytes of the input string.
bytSrc = Encoding.Unicode.GetBytes(strFrom);
// Convert to UTF-8 bytes (Encoding.ASCII here would replace every non-ASCII character with '?').
bytDestination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, bytSrc);
strTo = Encoding.UTF8.GetString(bytDestination);
return strTo;
}

This does it with minimal code:
byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);

Try this code:
string unicodeString = "Quick brown fox";
var bytes = new List<byte>(unicodeString.Length);
foreach (var c in unicodeString)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());

Unicode-to-string conversion in C#

How can I convert a Unicode value to its equivalent string?
For example, I have "రమెశ్", and I need a function that accepts this Unicode value and returns a string.
I was looking at the System.Text.Encoding.Convert() function, but that does not take in a Unicode value; it takes two encodings and a byte array.
I basically have a byte array that I need to save in a string field and then come back later and convert the string back to a byte array.
So I use ByteConverter.GetString(byteArray) to save the byte array to a string, but I can't get it back to a byte array.

Use .ToString():
this.Text = ((char)0x00D7).ToString();
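The cast to char works for code points up to 0xFFFF. For a sequence of numeric code points, including ones above that range, char.ConvertFromUtf32 is a safer choice. A minimal sketch, with hypothetical class and method names:

```csharp
using System;
using System.Text;

public static class CodePoints
{
    // Builds a string from a list of Unicode code points.
    public static string ToText(params int[] codePoints)
    {
        var sb = new StringBuilder();
        foreach (int codePoint in codePoints)
        {
            // Returns one char for BMP code points and a surrogate pair otherwise.
            sb.Append(char.ConvertFromUtf32(codePoint));
        }
        return sb.ToString();
    }
}
```

For example, CodePoints.ToText(0x00D7) returns the multiplication sign used above.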

Try the following:
byte[] bytes = ...;
string convertedUtf8 = Encoding.UTF8.GetString(bytes);
string convertedUtf16 = Encoding.Unicode.GetString(bytes); // For UTF-16
The other way around is using GetBytes():
byte[] bytesUtf8 = Encoding.UTF8.GetBytes(convertedUtf8);
byte[] bytesUtf16 = Encoding.Unicode.GetBytes(convertedUtf16);
In the Encoding class, there are more variants if you need them.
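One caveat for the case in the question, where an arbitrary byte array has to be stored in a string field and recovered later: decoding arbitrary bytes as UTF-8 is not reversible, because invalid sequences are replaced during decoding. A different technique, Base64, round-trips any byte array. A minimal sketch with hypothetical names:

```csharp
using System;

public static class BinaryInString
{
    // Stores arbitrary bytes in a string without loss.
    public static string Pack(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recovers the exact original bytes.
    public static byte[] Unpack(string packed)
    {
        return Convert.FromBase64String(packed);
    }
}
```

The trade-off is that the stored string is about a third longer than the raw data and is not human-readable.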

To convert a string to a Unicode string, do it like this. It is very simple. Note that the BytesToString function decodes through a StreamReader (UTF-8 by default) rather than calling an Encoding method directly. Fast, too.
private string BytesToString(byte[] Bytes)
{
MemoryStream MS = new MemoryStream(Bytes);
StreamReader SR = new StreamReader(MS);
string S = SR.ReadToEnd();
SR.Close();
return S;
}
private string ToUnicode(string S)
{
return BytesToString(new UnicodeEncoding().GetBytes(S));
}

UTF8Encoding Class
UTF8Encoding uni = new UTF8Encoding();
Console.WriteLine( uni.GetString(new byte[] { 1, 2 }));

There are different types of encoding. You can try some of them to see if your byte stream gets converted correctly:
System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
System.Text.UTF8Encoding encodingUTF8 = new System.Text.UTF8Encoding();
System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();
var ascii = string.Format("{0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(textBytesASCII));
var utf = string.Format("{0}: {1}", encodingUTF8.ToString(), encodingUTF8.GetString(textBytesUTF8));
var unicode = string.Format("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(textBytesCyrillic));
Have a look here as well: http://george2giga.com/2010/10/08/c-text-encoding-and-transcoding/.

var ascii = $"{new ASCIIEncoding().ToString()}: {((ASCIIEncoding)new ASCIIEncoding()).GetString(textBytesASCII)}";
var utf = $"{new UTF8Encoding().ToString()}: {((UTF8Encoding)new UTF8Encoding()).GetString(textBytesUTF8)}";
var unicode = $"{new UnicodeEncoding().ToString()}: {((UnicodeEncoding)new UnicodeEncoding()).GetString(textBytesCyrillic)}";

I wrote a loop for converting Unicode escape sequences in a string to the actual characters:
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
var splitted = Regex.Split(stringWithUnicodeSymbols, @"\\u([a-fA-F\d]{4})");
string outString = "";
foreach (var s in splitted)
{
try
{
if (s.Length == 4)
{
var decoded = ((char) Convert.ToUInt16(s, 16)).ToString();
outString += decoded;
}
else
{
outString += s;
}
}
catch (Exception e)
{
outString += s;
}
}
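The loop above treats every four-character fragment produced by the split as an escape, so ordinary text that happens to be four characters long between two escapes could be misread. A possible alternative is to let Regex.Replace handle only the matched escapes. This is a sketch with hypothetical names; for real JSON input a JSON parser is likely the more robust choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Matches a backslash, 'u', and exactly four hex digits.
    private static readonly Regex Escape = new Regex(@"\\u([0-9a-fA-F]{4})");

    // Replaces each \uXXXX escape with the character it denotes;
    // everything else is left untouched.
    public static string Decode(string text)
    {
        return Escape.Replace(
            text,
            m => ((char)Convert.ToUInt16(m.Groups[1].Value, 16)).ToString());
    }
}
```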

C# Email subject parsing

I'm building a system for reading emails in C#. I've got a problem parsing the subject, a problem which I think is related to encoding.
The subject I'm reading is as follows: =?ISO-8859-1?Q?=E6=F8sd=E5f=F8sdf_sdfsdf?=, the original subject sent is æøsdåføsdf sdfsdf (Norwegian characters in there).
Any ideas how I can change encoding or parse this correctly? So far I've tried to use the C# encoding conversion techniques to encode the subject to utf8, but without any luck.
Here is one of the solutions I tried:
Encoding iso = Encoding.GetEncoding("iso-8859-1");
Encoding utf = Encoding.UTF8;
string decodedSubject =
utf.GetString(Encoding.Convert(utf, iso,
iso.GetBytes(m.Subject.Split('?')[3])));

The encoding is called quoted printable.
See the answers to this question.
Adapted from the accepted answer:
public string DecodeQuotedPrintable(string value)
{
Attachment attachment = Attachment.CreateAttachmentFromString("", value);
return attachment.Name;
}
When passed the string =?ISO-8859-1?Q?=E6=F8sd=E5f=F8sdf_sdfsdf?= this returns "æøsdåføsdf_sdfsdf".

public static string DecodeEncodedWordValue(string mimeString)
{
var regex = new Regex(@"=\?(?<charset>.*?)\?(?<encoding>[qQbB])\?(?<value>.*?)\?=");
var encodedString = mimeString;
var decodedString = string.Empty;
while (encodedString.Length > 0)
{
var match = regex.Match(encodedString);
if (match.Success)
{
// If the match isn't at the start of the string, copy the initial few chars to the output
decodedString += encodedString.Substring(0, match.Index);
var charset = match.Groups["charset"].Value;
var encoding = match.Groups["encoding"].Value.ToUpper();
var value = match.Groups["value"].Value;
if (encoding.Equals("B"))
{
// Encoded value is Base-64
var bytes = Convert.FromBase64String(value);
decodedString += Encoding.GetEncoding(charset).GetString(bytes);
}
else if (encoding.Equals("Q"))
{
// Encoded value is Quoted-Printable
// Parse looking for =XX where XX is hexadecimal
var regx = new Regex("(\\=([0-9A-F][0-9A-F]))", RegexOptions.IgnoreCase);
decodedString += regx.Replace(value, new MatchEvaluator(delegate(Match m)
{
var hex = m.Groups[2].Value;
var iHex = Convert.ToInt32(hex, 16);
// Return the string in the charset defined
var bytes = new byte[1];
bytes[0] = Convert.ToByte(iHex);
return Encoding.GetEncoding(charset).GetString(bytes);
}));
decodedString = decodedString.Replace('_', ' ');
}
else
{
// Encoded value not known, return original string
// (Match should not be successful in this case, so this code may never get hit)
decodedString += encodedString;
break;
}
// Trim off up to and including the match, then we'll loop and try matching again.
encodedString = encodedString.Substring(match.Index + match.Length);
}
else
{
// No match, not encoded, return original string
decodedString += encodedString;
break;
}
}
return decodedString;
}
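The method above decodes each =XX escape on its own, which works for single-byte charsets such as ISO-8859-1 but would likely garble multi-byte charsets such as UTF-8, where one character spans several escapes. A sketch of a variant that first collects all bytes of an encoded word and decodes them in one step is shown below. The class and method names are hypothetical, and it does not implement every rule of the encoded-word format (for instance, whitespace between adjacent encoded words is kept as is).

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodedWord
{
    private static readonly Regex Pattern = new Regex(
        @"=\?(?<charset>[^?]+)\?(?<enc>[QqBb])\?(?<value>[^?]*)\?=");

    // Decodes every =?charset?Q|B?value?= token; other text is left unchanged.
    public static string Decode(string input)
    {
        return Pattern.Replace(input, m =>
        {
            Encoding charset = Encoding.GetEncoding(m.Groups["charset"].Value);
            string value = m.Groups["value"].Value;
            bool isBase64 = m.Groups["enc"].Value.ToUpperInvariant() == "B";
            byte[] bytes = isBase64 ? Convert.FromBase64String(value) : DecodeQ(value);
            return charset.GetString(bytes);
        });
    }

    // Turns Q-encoded text into raw bytes: '_' is a space, =XX is a hex byte.
    private static byte[] DecodeQ(string value)
    {
        var bytes = new List<byte>(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (c == '_')
            {
                bytes.Add(0x20);
            }
            else if (c == '=' && i + 2 < value.Length
                     && Uri.IsHexDigit(value[i + 1]) && Uri.IsHexDigit(value[i + 2]))
            {
                bytes.Add(Convert.ToByte(value.Substring(i + 1, 2), 16));
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                bytes.Add((byte)c); // literal ASCII character
            }
        }
        return bytes.ToArray();
    }
}
```

Passing the subject from the question to EncodedWord.Decode should return the original Norwegian text with a space in place of the underscore.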
