Convert UTF-8 to Chinese Simplified (GB2312) - c#

Is there a way to convert a UTF-8 string to Chinese Simplified (GB2312) in C#? Any help is greatly appreciated.
Regards
Jyothish George

The first thing to be aware of is that there's no such thing as a "UTF-8 string" in .NET. All strings in .NET are effectively UTF-16. However, .NET provides the Encoding class to allow you to decode binary data into strings, and re-encode it later.
Encoding.Convert can convert a byte array representing text encoded with one encoding into a byte array with the same text encoded with a different encoding. Is that what you want?
Alternatively, if you already have a string, you can use:
byte[] bytes = Encoding.GetEncoding("gb2312").GetBytes(text);
If you can provide more information, that would be helpful.
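As a rough sketch of the byte-to-byte route described above, assuming the input really is a UTF-8 encoded byte array (for example, read from a file): the class and method names here are made up for illustration. On .NET Core and .NET 5+, code-page encodings such as GB2312 are only available after registering CodePagesEncodingProvider.Instance; on .NET Framework that registration is not needed.

```csharp
using System;
using System.Text;

public static class Gb2312Sketch
{
    // Hypothetical helper: re-encodes UTF-8 bytes as GB2312 bytes.
    public static byte[] Utf8BytesToGb2312(byte[] utf8Bytes)
    {
        // Needed on .NET Core / .NET 5+ so that "gb2312" can be resolved.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding gb2312 = Encoding.GetEncoding("gb2312");

        // Decodes the input as UTF-8 and re-encodes the text as GB2312.
        return Encoding.Convert(Encoding.UTF8, gb2312, utf8Bytes);
    }
}
```

For example, the UTF-8 bytes of "中文" (E4-B8-AD-E6-96-87) come back as the four GB2312 bytes D6-D0-CE-C4. No string in the target encoding is ever produced; only the byte arrays differ.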

Try this:
public string GB2312ToUtf8(string gb2312String)
{
    Encoding fromEncoding = Encoding.GetEncoding("gb2312");
    Encoding toEncoding = Encoding.UTF8;
    return EncodingConvert(gb2312String, fromEncoding, toEncoding);
}
public string Utf8ToGB2312(string utf8String)
{
    Encoding fromEncoding = Encoding.UTF8;
    Encoding toEncoding = Encoding.GetEncoding("gb2312");
    return EncodingConvert(utf8String, fromEncoding, toEncoding);
}
public string EncodingConvert(string fromString, Encoding fromEncoding, Encoding toEncoding)
{
    // Encode the string, convert the bytes, then decode them again.
    byte[] fromBytes = fromEncoding.GetBytes(fromString);
    byte[] toBytes = Encoding.Convert(fromEncoding, toEncoding, fromBytes);
    // Note: the returned value is a .NET (UTF-16) string again;
    // only the intermediate toBytes array is in the target encoding.
    string toString = toEncoding.GetString(toBytes);
    return toString;
}
source here

Related

How do I create/encode a string with a specific encoding in C#?

How do I create/encode a string with a specific encoding in C#/.Net framework? For example, I would like to make a string which uses the Western European ISO 8859-1 encoding.
Strings in C#/.NET (both .NET Framework and .NET Core) are always UTF-16, so any string you create uses that encoding. It is exposed as Encoding.Unicode (little-endian UTF-16).
Thus you need to convert your string to the desired encoding. Note that this approach only applies to a string you created yourself; if the text came from somewhere else, such as a file, you have to take a different approach and decode its bytes with the encoding they were written in.
Encoding westernEuropeanIso8859 = Encoding.GetEncoding("ISO-8859-1");
Encoding utf16CSharpDefault = Encoding.Unicode;
// vExp is the string to convert.
byte[] utfBytes = utf16CSharpDefault.GetBytes(vExp);
byte[] isoBytes = Encoding.Convert(utf16CSharpDefault, westernEuropeanIso8859, utfBytes);
// isoBytes holds the ISO-8859-1 data; decoding it yields a regular .NET string again.
string stringWithDesiredEncoding = westernEuropeanIso8859.GetString(isoBytes);
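A shorter sketch of the same idea, assuming what is ultimately needed is the ISO-8859-1 bytes (to write to a file or send over the wire) rather than another string. The class and method names are illustrative only.

```csharp
using System;
using System.Text;

public static class Latin1Sketch
{
    // Hypothetical helper: encodes a .NET string directly as ISO-8859-1 bytes.
    public static byte[] ToLatin1Bytes(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetBytes(text);
    }
}
```

For instance, "é" becomes the single byte 0xE9 in ISO-8859-1, whereas UTF-8 would use the two bytes C3-A9. The encoding only exists in the byte array; once decoded back into a string, the text is UTF-16 again.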

C# Encoding from utf-16 to ascii

I get question marks in output of my program: ?????? ??????
string str = "Привет медвед";
Encoding srcEncodingFormat = Encoding.GetEncoding("utf-16");
Encoding dstEncodingFormat = Encoding.ASCII;
byte [] originalByteString = srcEncodingFormat.GetBytes(str);
byte [] convertedByteString = Encoding.Convert(srcEncodingFormat,
dstEncodingFormat, originalByteString);
string finalString = dstEncodingFormat.GetString(convertedByteString);
Console.WriteLine (finalString);
There is no such thing as text without an encoding. .NET's char and string use Unicode/UTF-16, as you know, so you can simplify your code by passing the string straight to the destination encoding's GetBytes instead of encoding it twice as your code does.
As for your question, you have a choice of a lossy conversion or no conversion at all. Below is code that prevents a lossy conversion.
Now, how to see the result? As with all text, it is a sequence of bytes. Your best bet is to write the bytes to a file and open it in an editor that lets you specify the encoding and that can use a font supporting the characters you want to see.
string str = "Привет медвед";
Encoding dstEncodingFormat = Encoding.GetEncoding("US-ASCII",
new EncoderExceptionFallback(),
new DecoderReplacementFallback());
byte[] output = dstEncodingFormat.GetBytes(str);
File.WriteAllBytes("Test Привет медвед.txt", output);
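To make the "lossy or not at all" choice concrete, here is a small sketch contrasting the two behaviours. The helper names are invented for this example; the fallback classes are the standard ones from System.Text.

```csharp
using System;
using System.Text;

public static class AsciiFallbackSketch
{
    // Lossy: Encoding.ASCII replaces each character it cannot represent with '?'.
    public static string LossyRoundTrip(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    // Strict: reports failure instead of silently losing characters.
    public static bool TryEncodeAscii(string text, out byte[] bytes)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            bytes = strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no ASCII representation.
            bytes = Array.Empty<byte>();
            return false;
        }
    }
}
```

LossyRoundTrip("Привет медвед") returns "?????? ??????", which is exactly the output from the question, while TryEncodeAscii returns false for the same input.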

Convert.FromBase64String returns unicode sometimes, or UTF-8

Sometimes the byte array b64 is UTF-8, and other times is UTF-16. I keep reading online that C# strings are always UTF-16, but that is not the case for me here. Why is this happening, and how do I fix it? I have a simple method for converting a base64 string to a normal string:
public static string FromBase64(this string input)
{
String corrected = new string(input.ToCharArray());
byte[] b64 = Convert.FromBase64String(corrected);
if (b64[1] == 0)
{
return System.Text.Encoding.Unicode.GetString(b64);
}
else
{
return System.Text.Encoding.UTF8.GetString(b64);
}
}
The same thing is happening to my base 64 encoder:
public static string ToBase64(this string input)
{
String b64 = Convert.ToBase64String(input.GetBytes());
return b64;
}
public static byte[] GetBytes(this string str)
{
byte[] bytes = new byte[str.Length * sizeof(char)];
System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
return bytes;
}
Example:
On my computer, "cABhAHMAcwB3AG8AcgBkADEA" decodes to:
'p','\0','a','\0','s','\0','s','\0','w','\0','o','\0','r','\0','d','\0','1','\0'
But on my coworkers computer it is:
'p','a','s','s','w','o','r','d','1'
Edit:
I know that the string I create comes from a textbox, and that the file where I am saving it to is always going to be UTF-8, so everything is pointing to the Convert method causing my encoding switch.
Update:
After digging in further, it appears that my coworker had a very important line commented in his version of the code, the one that saves the value read from file to the hashtable. The default value I was using is a UTF-8 base64 value, so I am going to correct the default, to a utf-16 value, then I can clean up the code removing any UTF8 references.
Also, I had been naively using the UTF-8 base64 encoding I had retrieved from a website, not realizing what I was getting myself into. The funny part is I would never have found that fact if my coworker hadn't commented the line that saves the values from the file.
Final version of the code:
public static string FromBase64(this string input)
{
byte[] b64 = Convert.FromBase64String(input);
return System.Text.Encoding.Unicode.GetString(b64);
}
public static string ToBase64(this string input)
{
String b64 = Convert.ToBase64String(input.GetBytes());
return b64;
}
public static byte[] GetBytes(this string str)
{
return System.Text.Encoding.Unicode.GetBytes(str);
}
First of all, I want to debunk the title of the question:
Convert.FromBase64String() returns Unicode sometimes, or UTF-8
That is not the case. Given the same input, valid base64-encoded text, Convert.FromBase64String() always returns the same output.
Moving on, you cannot determine definitively, just by examining the payload, the encoding used for a string. You attempt to do this with
if (b64[1] == 0)
// encoding must be UTF-16
This is not the case. The overwhelming majority of UTF-16 code units fail that test. No matter how you try to write this test, it is doomed to fail, because there exist byte arrays that are well-defined strings when interpreted in different encodings. In other words, it is possible, for instance, to construct byte arrays that are valid when considered as either UTF-8 or UTF-16.
So, you have to know a priori whether the payload is encoded as UTF-16, UTF-8 or indeed some other encoding.
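A small sketch of why the heuristic cannot work. LooksLikeUtf16 simply restates the test from the question, with a length guard added; the class name is made up.

```csharp
using System;
using System.Text;

public static class AmbiguousBytesSketch
{
    // The heuristic from the question: "second byte is zero, so it must be UTF-16".
    public static bool LooksLikeUtf16(byte[] bytes)
    {
        return bytes.Length > 1 && bytes[1] == 0;
    }
}
```

Encoding.Unicode.GetBytes("Привет") starts with the bytes 1F-04, so LooksLikeUtf16 returns false for genuine UTF-16 data. Conversely, the two bytes 61-62 decode to "ab" as UTF-8 and to the single character U+6261 as UTF-16, and both results are perfectly valid strings.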
The solution will be to keep track of the original encoding, before the base64 encoding. Pass that information along with the base64 encoded payload. Then when you decode, you can determine which Encoding to use to decode back to a string.
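One possible way to carry the encoding along with the payload is sketched below. The "name:base64" format, the class name, and the method names are all assumptions made up for this illustration, not an established convention.

```csharp
using System;
using System.Text;

public static class TaggedBase64Sketch
{
    // Hypothetical format: "<encoding name>:<base64 payload>".
    // ':' never occurs in base64 output, so it is a safe separator.
    public static string Pack(string text, Encoding encoding)
    {
        return encoding.WebName + ":" + Convert.ToBase64String(encoding.GetBytes(text));
    }

    public static string Unpack(string packed)
    {
        int separator = packed.IndexOf(':');
        if (separator < 0)
        {
            throw new FormatException("Missing encoding tag.");
        }

        // Look the encoding up by the name stored in front of the payload.
        Encoding encoding = Encoding.GetEncoding(packed.Substring(0, separator));
        byte[] bytes = Convert.FromBase64String(packed.Substring(separator + 1));
        return encoding.GetString(bytes);
    }
}
```

Pack("password1", Encoding.UTF8) yields "utf-8:cGFzc3dvcmQx", and Unpack recovers "password1" regardless of which encoding the sender chose, because the decoder no longer has to guess.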
It looks very much to me as though your strings all come from UTF-16 .NET strings. In that case you will never have UTF-8 payloads and should always decode as UTF-16, that is, with Encoding.Unicode.GetString().
Also, the GetBytes method in your code is poor. It should be:
public static byte[] GetBytes(this string str)
{
    return Encoding.Unicode.GetBytes(str);
}
Another oddity:
String corrected = new string(input.ToCharArray());
This is a no-op.
Finally, it is quite likely that your text will be more compact when encoded as UTF-8. So perhaps you should consider doing that before applying the base64 encoding.
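A quick way to check the size claim for your own data is sketched here; the helper name is invented for the example.

```csharp
using System;
using System.Text;

public static class Base64SizeSketch
{
    // Length of the base64 text produced for a string under a given encoding.
    public static int Base64Length(string text, Encoding encoding)
    {
        return Convert.ToBase64String(encoding.GetBytes(text)).Length;
    }
}
```

For "password1" this gives 12 characters with Encoding.UTF8 and 24 with Encoding.Unicode. The advantage depends on the text: characters such as CJK ideographs take three bytes in UTF-8 but only two in UTF-16, so it is worth measuring with representative input.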
Regarding your update, what you state is incorrect. This code:
string str = Encoding.Unicode.GetString(
Convert.FromBase64String("cABhAHMAcwB3AG8AcgBkADEA"));
assigns password1 to str wherever it is run.
Try revising the code to make it a little more readable/accurate. As mentioned in my comment and David Hefferman's answer you're trying to do things that either:
A) do nothing
or
B) demonstrate flawed logic
The following code based upon yours works fine:
class Program
{
static void Main(string[] args)
{
string original = "password1";
string encoded = original.ToBase64();
string decoded = encoded.FromBase64();
Console.WriteLine("Original: {0}", original);
Console.WriteLine("Encoded: {0}", encoded);
Console.WriteLine("Decoded: {0}", decoded);
}
}
public static class Extensions
{
public static string FromBase64(this string input)
{
return System.Text.Encoding.Unicode.GetString(Convert.FromBase64String(input));
}
public static string ToBase64(this string input)
{
return Convert.ToBase64String(input.GetBytes());
}
public static byte[] GetBytes(this string str)
{
return System.Text.Encoding.Unicode.GetBytes(str);
}
}
What you are doing is no different than encoding data in either EBCDIC or ASCII, then trying to figure out which was used during the decode. As you have already discovered, this is not going to work very well.
The only way to get this to work correctly is to have a single encoding format used by all participants. This is a fundamental concept of communications.
Pick an encoding - let's say UTF-8 - and use it for all transformations between String and byte[]. This will ensure that you have accurate knowledge of the format of the payload and how to deal with it, as David Tanner has been telling you.
Here's the basic form:
public static string ToBase64(this string self)
{
    byte[] bytes = Encoding.UTF8.GetBytes(self);
    return Convert.ToBase64String(bytes);
}
public static string FromBase64(this string self)
{
    byte[] bytes = Convert.FromBase64String(self);
    return Encoding.UTF8.GetString(bytes);
}
Regardless of whatever weirdness might be happening between your computers, this code will produce the same encoded strings.

Sql binary to c# - How to get SQL binary equivalent of binary in c#

It might seem a dumb question to you guys. I have one SQL table with one binary column. It has some data in binary format.
e.g. 0x9A8B9D9A002020202020202020202020
Its equivalent English representation is "test".
Is it possible to convert this string into the equivalent binary form in C#?
string s = "test"; // C# code to convert s to its equivalent SQL binary form.
public static byte[] ConvertToBinary(string str)
{
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
return encoding.GetBytes(str);
}
or
Convert.ToByte(string);
It really depends on which encoding was used when you originally converted from string to binary:
byte[] binaryString = (byte[])reader[1];
// if the original encoding was ASCII
string x = Encoding.ASCII.GetString(binaryString);
// if the original encoding was UTF-8
string y = Encoding.UTF8.GetString(binaryString);
// if the original encoding was UTF-16
string z = Encoding.Unicode.GetString(binaryString);
// etc
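To compare candidate encodings against what is stored in the column, it can help to print the bytes in the same 0x... form that SQL tools display. This is only a sketch; the class and method names are made up.

```csharp
using System;
using System.Text;

public static class SqlBinarySketch
{
    // Formats a byte array as a hex literal such as 0x74657374.
    public static string ToHexLiteral(byte[] bytes)
    {
        // BitConverter.ToString yields "74-65-73-74"; drop the dashes.
        return "0x" + BitConverter.ToString(bytes).Replace("-", "");
    }
}
```

ToHexLiteral(Encoding.ASCII.GetBytes("test")) gives 0x74657374. That does not match the leading bytes 0x9A8B9D9A from the question, which suggests the column was not written as plain ASCII, so the original encoding or transformation still has to be identified.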

C# and utf8_decode

Is there a C# utf8_decode equivalent?
Use the Encoding class.
For example:
byte[] bytes = something;
string str = Encoding.UTF8.GetString(bytes);
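A slightly fuller sketch of decoding UTF-8 bytes, including a strict variant for input that may not be valid UTF-8. The class and method names are illustrative; UTF8Encoding and its throwOnInvalidBytes constructor argument are part of System.Text.

```csharp
using System;
using System.Text;

public static class Utf8DecodeSketch
{
    // Default behaviour: invalid sequences are replaced with U+FFFD.
    public static string Decode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // Strict variant: reports invalid UTF-8 instead of replacing it.
    public static bool TryDecodeStrict(byte[] utf8Bytes, out string text)
    {
        // Arguments: no byte order mark, throw on invalid bytes.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            text = strictUtf8.GetString(utf8Bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

Decode(new byte[] { 0x44, 0xC3, 0xA9 }) returns "Dé", while TryDecodeStrict returns false for a truncated sequence such as the lone byte 0xC3.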
Yes. You can use the System.Text.Encoding class to convert the encoding.
string source = "Déjà vu";
Encoding unicode = Encoding.Unicode;
// iso-8859-1 <- codepage 28591
Encoding latin1 = Encoding.GetEncoding(28591);
Byte[] result = Encoding.Convert(unicode, latin1, unicode.GetBytes(source));
// result contains the byte sequence for the latin1 encoded string
edit: or simply
string source = "Déjà vu";
Byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(source);
string (System.String) is always Unicode encoded, i.e. if you convert the byte sequence back to a string (Encoding.GetString()), your data will again be stored as UTF-16 code units.
If your input is a string, here is a method that would probably work (assuming you're from Western Europe :)
public string Utf8Decode(string inputData)
{
    return Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(inputData));
}
Of course, if the current encoding of the inputData is not latin1, change the "iso-8859-1" to the correct encoding.
I tried to make this implementation on Xamarin C#.
The code below worked for me:
public static string Utf8Encode(string inputData)
{
    // Encode as UTF-8, then reinterpret each byte as an ISO-8859-1 character.
    byte[] bytes = Encoding.UTF8.GetBytes(inputData);
    return Encoding.GetEncoding("iso-8859-1").GetString(bytes, 0, bytes.Length);
}
public static string Utf8Decode(string inputData)
{
    // Reverse of Utf8Encode: recover the original bytes, then decode them as UTF-8.
    byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(inputData);
    return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
}
