C# and utf8_decode - c#

Is there a C# utf8_decode equivalent?

Use the Encoding class.
For example:
byte[] bytes = something;
string str = Encoding.UTF8.GetString(bytes);

Yes. You can use the System.Text.Encoding class to convert the encoding.
string source = "Déjà vu";
Encoding unicode = Encoding.Unicode;
// iso-8859-1 <- codepage 28591
Encoding latin1 = Encoding.GetEncoding(28591);
Byte[] result = Encoding.Convert(unicode, latin1, unicode.GetBytes(s));
// result contains the byte sequence for the latin1 encoded string
edit: or simply
string source = "Déjà vu";
Byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(source);
string (System.String) is always unicode encoded, i.e. if you convert the byte sequence back to string (Encoding.GetString()) your data will again be stored as utf-16 codepoints again.

If your input is a string here is a method that would probably work (assuming your from wester europe :)
public string Utf8Decode(string inputDate)
{
return Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(inputDate));
}
Of course, if the current encoding of the inputData is not latin1, change the "iso-8859-1" to the correct encoding.

I tried to make this implementation on Xamarin C#.
The code below worked for me:
public static string Utf8Encode(string inputDate)
{
byte[] bytes = Encoding.UTF8.GetBytes(inputDate);
return Encoding.GetEncoding("iso-8859-1").GetString(bytes,0, bytes.Length);
}
public static string Utf8Decode(string inputDate)
{
byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(inputDate);
return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
}

Related

c# - How to convert a converted UTF8 string to UTF16?

I'm trying to convert a converted UTF-8 string to UTF-16, because I'm going to read a file and it comes like the var strUTF8 below.
For example, the entry would be the string "Não é possível equipar" and the return I needed is "Não é possível equipar".
static void Main(string[] args)
{
test3();
Console.ReadKey();
}
static void test3()
{
string str = "Não é possível equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8String);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
return Encoding.Unicode.GetString(unicodeBytes);
}
I really don't know how to solve this. Any tips?
If you want to read a file then you should read a file. When you read the file, specify the encoding of that file. If I'm not mistaken UTF8 is the default, so reading files encoded with UTF8 doesn't require the encoding to be specified. If you want to save that text to a file with a specific encoding, specify that encoding when saving the file.
var text = File.ReadAllText(filePath, Encoding.UTF8);
File.WriteAllText(filePath, text, Encoding.Unicode);
That will effectively convert a file from UTF8 encoding to UTF16. A more verbose version would be:
var data = File.ReadAllBytes(filePath);
var text = Encoding.UTF8.GetString(data);
data = Encoding.Unicode.GetBytes(text);
File.WriteAllBytes(filePath, data);
Your Utf8ToUtf16() function is effectively a no-op. You are taking an arbitrary UTF-16 string as input, encoding it into UTF-8 bytes, then decoding those bytes as UTF-8 back into UTF-16. So, you effectively end up with the same string value you started with. You may as well have just written the following, the result would be the same:
static string Utf8ToUtf16(string utf8String)
{
return utf8String;
}
That being said, Não é possível equipar is what you get when the UTF-8 encoded form of Não é possível equipar is mis-interpreted as Latin (probably ISO-8859-1) or Windows-125x etc, instead of being properly interpreted as UTF-8 to begin with.
If you have a C# string that contains such UTF-8 bytes which were up-scaled as-is to UTF-16 (why???), then you need to down-scale those characters as-is back into 8-bit bytes, and then you can decode those bytes as UTF-8, eg:
static void test3()
{
string str = "Não é possível equipar";
string strUTF16 = Utf8ToUtf16(str);
Console.WriteLine(str);
Console.WriteLine(strUTF16);
}
static string Utf8ToUtf16(string utf8String)
{
byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(utf8String); // or: GetEncoding(28591)
return Encoding.UTF8.GetString(utf8Bytes);
}

Convert.FromBase64String returns unicode sometimes, or UTF-8

Sometimes the byte array b64 is UTF-8, and other times is UTF-16. I keep reading online that C# strings are always UTF-16, but that is not the case for me here. Why is this happening, and how do I fix it? I have a simple method for converting a base64 string to a normal string:
public static string FromBase64(this string input)
{
String corrected = new string(input.ToCharArray());
byte[] b64 = Convert.FromBase64String(corrected);
if (b64[1] == 0)
{
return System.Text.Encoding.Unicode.GetString(b64);
}
else
{
return System.Text.Encoding.UTF8.GetString(b64);
}
}
The same thing is happening to my base 64 encoder:
public static string ToBase64(this string input)
{
String b64 = Convert.ToBase64String(input.GetBytes());
return b64;
}
public static byte[] GetBytes(this string str)
{
byte[] bytes = new byte[str.Length * sizeof(char)];
System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
return bytes;
}
Example:
On my computer, "cABhAHMAcwB3AG8AcgBkADEA" decodes to:
'p','\0','a','\0','s','\0','s','\0','w','\0','o','\0','r','\0','d','\0','1','\0'
But on my coworkers computer it is:
'p','a','s','s','w','o','r','d','1'
Edit:
I know that the string I create comes from a textbox, and that the file where I am saving it to is always going to be UTF-8, so everything is pointing to the Convert method causing my encoding switch.
Update:
After digging in further, it appears that my coworker had a very important line commented in his version of the code, the one that saves the value read from file to the hashtable. The default value I was using is a UTF-8 base64 value, so I am going to correct the default, to a utf-16 value, then I can clean up the code removing any UTF8 references.
Also, I had been naively using the UTF-8 base64 encoding I had retrieved from a website, not realizing what I was getting myself into. The funny part is I would never have found that fact if my coworker hadn't commented the line that saves the values from the file.
Final version of the code:
public static string FromBase64(this string input)
{
byte[] b64 = Convert.FromBase64String(input);
return System.Text.Encoding.Unicode.GetString(b64);
}
public static string ToBase64(this string input)
{
String b64 = Convert.ToBase64String(input.GetBytes());
return b64;
}
public static byte[] GetBytes(this string str)
{
return System.Text.Encoding.Unicode.GetBytes(str);
}
First of all, I want to debunk the title of the question:
Convert.FromBase64String() returns Unicode sometimes, or UTF-8
That is not the case. Give then same input, valid base64 encoded text, Convert.FromBase64String() always returns the same output.
Moving on, you cannot determine definitively, just by examining the payload, the encoding used for a string. You attempt to do this with
if (b64[1] == 0)
// encoding must be UTF-16
This is not the case. The overwhelming majority of UTF-16 character elements fail that test. It does not matter how you try to write this test it is doomed to fail. And that is because there exist byte arrays that are well-defined strings when interpreted as different encodings. In other words it is possible, for instance, to construct byte arrays that are valid when considered as either UTF-8 or UTF-16.
So, you have to know a priori whether the payload is encoded as UTF-16, UTF-8 or indeed some other encoding.
The solution will be to keep track of the original encoding, before the base64 encoding. Pass that information along with the base64 encoded payload. Then when you decode, you can determine which Encoding to use to decode back to a string.
It looks to me very much that your strings are all coming from UTF-16 .net strings. In which case you won't have UTF-8 strings ever, and should always decode with UTF-16. That is you use Encoding.Unicode.GetString().
Also, the GetBytes method in your code is poor. It should be:
public static byte[] GetBytes(this string str)
{
return Encoding.Unicode.GetBytes(str);
}
Another oddity:
String corrected = new string(input.ToCharArray());
This is a no-op.
Finally, it is quite likely that your text will be more compact when encoded as UTF-8. So perhaps you should consider doing that before applying the base64 encoding.
Regarding your update, what you state is incorrect. This code:
string str = Encoding.Unicode.GetString(
Convert.FromBase64String("cABhAHMAcwB3AG8AcgBkADEA"));
assigns password1 to str wherever it is run.
Try revising the code to make it a little more readable/accurate. As mentioned in my comment and David Hefferman's answer you're trying to do things that either:
A) do nothing
or
B) demonstrate flawed logic
The following code based upon yours works fine:
class Program
{
static void Main(string[] args)
{
string original = "password1";
string encoded = original.ToBase64();
string decoded = encoded.FromBase64();
Console.WriteLine("Original: {0}", original);
Console.WriteLine("Encoded: {0}", encoded);
Console.WriteLine("Decoded: {0}", decoded);
}
}
public static class Extensions
{
public static string FromBase64(this string input)
{
return System.Text.Encoding.Unicode.GetString(Convert.FromBase64String(input));
}
public static string ToBase64(this string input)
{
return Convert.ToBase64String(input.GetBytes());
}
public static byte[] GetBytes(this string str)
{
return System.Text.Encoding.Unicode.GetBytes(str);
}
}
What you are doing is no different than encoding data in either EBCDIC or ASCII, then trying to figure out which was used during the decode. As you have already discovered, this is not going to work very well.
The only way to get this to work correctly is to have a single encoding format used by all participants. This is a fundamental concept of communications.
Pick an encoding - let's say UTF-8 - and use it for all transformations between String and byte[]. This will ensure that you have accurate knowledge of the format of the payload and how to deal with it, as David Tanner has been telling you.
Here's the basic form:
public static string ToBase64(this string self)
{
byte[] bytes = Encoding.UTF8.GetBytes(self);
return Convert.ToBase64String(bytes);
}
public static string FromBase64(this string self)
{
byte[] bytes = Convert.FromBase64String(self);
return Encoding.UTF8.GetString(bytes);
}
Regardless of whatever weirdness might be happening between your computers, this code will produce the same encoded strings.

German special characters put into Byte array

I'm doing encrypt algotythm right now and I need to encrypt german words also. So I have to encrypt for example characters like: ü,ä or ö.
Inside I've got a function:
private static byte[] getBytesArray(string data)
{
byte[] array;
System.Text.ASCIIEncoding asciiEncoding = new System.Text.ASCIIEncoding();
array = asciiEncoding.GetBytes(data);
return array;
}
But when data is "ü", byte returned in array is 63 (so "?"). How can I return ü byte?
I also tried:
private static byte[] MyGetBytesArray(string data)
{
byte[] array;
System.Text.ASCIIEncoding asciiEncoding = new System.Text.ASCIIEncoding();
Encoding enc = new UTF8Encoding(true, true);
array = enc.GetBytes(data);
return array;
}
but in this case I get 2 bytes in array: 195 and 188.
Please replace System.Text.ASCIIEncoding with System.Text.UTF8Encoding and rename the encoding object accordingly in your first example. ASCII basically does not support german characters, so this is why you'll have to use some other encoding (UTF-8 seems to be the best idea here).
Please take a look here: ASCII Encoding and here: UTF-8 Encoding
You can use this
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
// This is our Unicode string:
string s_unicode = "abcéabc";
// Convert a string to utf-8 bytes.
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
// Convert utf-8 bytes to a string.
string s_unicode2 = System.Text.Encoding.UTF8.GetString(utf8Bytes);

Convert UTF-8 to Chinese Simplified (GB2312)

Is there a way to convert UTF-8 string to Chinese Simplified (GB2312) in C#. Any help is greatly appreciated.
Regards
Jyothish George
The first thing to be aware of is that there's no such thing as a "UTF-8 string" in .NET. All strings in .NET are effectively UTF-16. However, .NET provides the Encoding class to allow you to decode binary data into strings, and re-encode it later.
Encoding.Convert can convert a byte array representing text encoded with one encoding into a byte array with the same text encoded with a different encoding. Is that what you want?
Alternatively, if you already have a string, you can use:
byte[] bytes = Encoding.GetEncoding("gb2312").GetBytes(text);
If you can provide more information, that would be helpful.
Try this;
public string GB2312ToUtf8(string gb2312String)
{
Encoding fromEncoding = Encoding.GetEncoding("gb2312");
Encoding toEncoding = Encoding.UTF8;
return EncodingConvert(gb2312String, fromEncoding, toEncoding);
}
public string Utf8ToGB2312(string utf8String)
{
Encoding fromEncoding = Encoding.UTF8;
Encoding toEncoding = Encoding.GetEncoding("gb2312");
return EncodingConvert(utf8String, fromEncoding, toEncoding);
}
public string EncodingConvert(string fromString, Encoding fromEncoding, Encoding toEncoding)
{
byte[] fromBytes = fromEncoding.GetBytes(fromString);
byte[] toBytes = Encoding.Convert(fromEncoding, toEncoding, fromBytes);
string toString = toEncoding.GetString(toBytes);
return toString;
}
source here

Conversion of text to unicode strings

I have to process JSON files that looks like this:
\u0432\u043b\u0430\u0434\u043e\u043c <b>\u043f\u0443\u0442\u0438\u043c<\/b> \u043d\u0430\u0447
Unfortunately, I'm not sure how this encoding is called.
I would like to convert it to .NET Unicode strings. What's the easies way to do it?
This is Unicode characters for Russian alphabet.
try simply put this line in VisualStudio and it will parse it.
string unicodeString = "\u0432\u043b\u0430\u0434\u043e\u043c";
Or if you want to convert this string to another encoding, for example utf8, try this code:
static void Main()
{
string unicodeString = "\u0432\u043b\u0430\u0434\u043e\u043c <b>\u043f\u0443\u0442\u0438\u043c<\b> \u043d\u0430\u0447";
// Create two different encodings.
Encoding utf8 = Encoding.UTF8;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte[].
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] utf8Bytes = Encoding.Convert(unicode, utf8, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
// This is a slightly different approach to converting to illustrate
// the use of GetCharCount/GetChars.
char[] asciiChars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
Console.ReadKey();
}
code taken from Convert

Categories

Resources