How to decode UTF8 bytes? - c#

How can I decode UTF8 bytes in a string in C#?
Example: Decode this input:
"Poluci%C3%B3n"
To output this:
"Polución"

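This is URL encoding (percent-encoding) rather than raw UTF-8: each %XX escape stands for one byte, and here those bytes happen to be the UTF-8 encoding of ó. You can decode it with several methods in .NET (HttpUtility lives in the System.Web namespace):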
HttpUtility.UrlDecode("Poluci%C3%B3n"); // returns "Polución"
Uri.UnescapeDataString("Poluci%C3%B3n"); // returns "Polución"

Try this:
Uri.UnescapeDataString("Poluci%C3%B3n")
The problem isn't really about decoding UTF-8, though. The string is just URL-encoded.
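To make the relationship between the percent-escapes and UTF-8 concrete, here is a minimal sketch. The class name UrlDecodeSketch and the Decode wrapper are made up for illustration; only Uri.UnescapeDataString and Encoding.UTF8 come from the framework.

```csharp
using System;
using System.Text;

public static class UrlDecodeSketch
{
    // Hypothetical wrapper: resolves %XX escapes and interprets the bytes as UTF-8.
    public static string Decode(string input)
    {
        return Uri.UnescapeDataString(input);
    }

    public static void Main()
    {
        Console.WriteLine(Decode("Poluci%C3%B3n"));

        // %C3%B3 stands for the two bytes 0xC3 0xB3, the UTF-8 encoding of U+00F3.
        byte[] escaped = { 0xC3, 0xB3 };
        Console.WriteLine(Encoding.UTF8.GetString(escaped) == "\u00F3"); // True
    }
}
```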

Related

Correctly handle utf8 string received from json.net

I'm using json.net to read data sent in json format from a server. The server encodes all string-type data it sends in json as utf-8.
Now to read the data in c# I do something like this: string s = json.Value<string>("data");
I assume the string s is now in utf-8 format, whereas the default encoding for strings in c# is utf-16 (unicode).
To convert the string to unicode, would this be correct?
byte[] bytes = Encoding.Unicode.GetBytes(s);
string unicode = Encoding.UTF8.GetString(bytes);
What I want (I think) is the raw bytes from s and then pass that to the utf-8 decoder to get unicode, but I'm not sure what exactly Encoding.Unicode.GetBytes gives me, or what I should use instead.
There is no need to convert anything. Json.NET has already decoded the incoming data, and the string it hands you is an ordinary .NET string, which is UTF-16 internally.
If the text does come out wrong, fix it at the point where the raw bytes are turned into text, that is, where Json.NET deserializes them. You can't decode a string a second time: once the incoming bytes have been interpreted with a particular encoding, you can't go back without the original bytes.
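A small sketch of why the conversion proposed in the question does harm rather than good. It uses only System.Text (no Json.NET), and the names DoubleDecodeSketch and MisguidedConvert are invented for this example.

```csharp
using System;
using System.Text;

public static class DoubleDecodeSketch
{
    // The conversion from the question: UTF-16 bytes misread as UTF-8.
    public static string MisguidedConvert(string s)
    {
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(s);
        return Encoding.UTF8.GetString(utf16Bytes);
    }

    public static void Main()
    {
        string s = "abc";
        string converted = MisguidedConvert(s);

        // Each UTF-16 code unit is two bytes, so every letter gains a NUL character.
        Console.WriteLine(s.Length);         // 3
        Console.WriteLine(converted.Length); // 6
        Console.WriteLine(converted == s);   // False
    }
}
```

The string returned by json.Value<string>("data") can be used as it is.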

Convert UTF-8 to base64 string

I'm trying to convert UTF-8 to base64 string.
Example: I have "abcdef==" in UTF-8. It's in fact a "representation" of a base64 string.
How can I retrieve a "abcdef==" base64 string (note that I don't want a "abcdef==" "translation" from UTF-8, I want to get a string encoded in base64 which is "abcdef==").
EDIT
As my question seems to be unclear, here is a reformulation:
My byte array (let's say I name it A) is represented by a base64 string. Converting A to base64 gives me "abcdef==".
This string representation is sent through a socket in UTF-8 (note that the string representation is exactly the same in UTF-8 and base64). So I receive a UTF-8 message which contains "whatever/abcdef==/whatever" in UTF-8.
So I need to retrieve the base64 "abcdef==" string from this socket message in order to get A.
I hope this is more clear!
It's a little difficult to tell what you're trying to achieve, but assuming you're trying to get a Base64 string that when decoded is abcdef==, the following should work:
byte[] bytes = Encoding.UTF8.GetBytes("abcdef==");
string base64 = Convert.ToBase64String(bytes);
Console.WriteLine(base64);
This will output: YWJjZGVmPT0= which is abcdef== encoded in Base64.
Edit:
To decode a Base64 string, simply use Convert.FromBase64String(). E.g.
string base64 = "YWJjZGVmPT0=";
byte[] bytes = Convert.FromBase64String(base64);
At this point, bytes will be a byte[] (not a string). If we know that the byte array represents a string in UTF8, then it can be converted back to the string form using:
string str = Encoding.UTF8.GetString(bytes);
Console.WriteLine(str);
This will output the original input string, abcdef== in this case.
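For the scenario in the edited question (a byte array sent as Base64 text over a UTF-8 channel), the round trip could look roughly like this. It is a sketch only: the class and method names are hypothetical, and the "whatever/…/whatever" framing around the payload is left out because the question does not specify it.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Base64OverUtf8Sketch
{
    // Sender: raw bytes -> Base64 text -> UTF-8 bytes for the wire.
    public static byte[] ToWire(byte[] payload)
    {
        string base64 = Convert.ToBase64String(payload);
        return Encoding.UTF8.GetBytes(base64);
    }

    // Receiver: UTF-8 bytes -> Base64 text -> original raw bytes.
    public static byte[] FromWire(byte[] wireBytes)
    {
        string base64 = Encoding.UTF8.GetString(wireBytes);
        return Convert.FromBase64String(base64);
    }

    public static void Main()
    {
        byte[] a = { 1, 2, 3, 250 };
        byte[] roundTripped = FromWire(ToWire(a));
        Console.WriteLine(a.SequenceEqual(roundTripped)); // True
    }
}
```

Because Base64 text is pure ASCII, its UTF-8 bytes are identical to its ASCII bytes, which is why the string looks the same on both sides.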

C# - converting UTF-8 to Ukrainian encoding

I was trying to convert the encoding of this string from UTF-8 to Ukrainian: "ÐÑайвеÑ-длÑ-пÑинÑеÑа-Pixma-ip-2000-длÑ-Windows-7-64-биÑ".
Whenever I convert it from UTF-8 to Ukrainian I get a corrupted string.
The correct string should look like "Драйвер-для-принтера-Pixma-ip-2000-для-Windows-7-64-бит".
Please advise. Thanks.
EDIT: here is how I convert it..
private string EncodeUTF8toOther(string inputString, string to)
{
    try
    {
        // Create two different encodings.
        byte[] myBytes = Encoding.Unicode.GetBytes(inputString);
        // Perform the conversion from one encoding to the other.
        byte[] convertedBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(to), myBytes);
        return Encoding.GetEncoding("ISO-8859-1").GetString(convertedBytes);
    }
    catch
    {
        return inputString;
    }
}
The Ukrainian character set is "KOI8-U".
More info: I have a similar problem to this question:
c# HttpWebResponse Header encoding
The Location header is giving me this corrupted string. I need to decode it correctly in order to perform the redirection.
Encoding.Unicode is UTF-16, not UTF-8. If you're sure your source string is encoded in UTF-8, use Encoding.UTF8 instead.
And returning a string "in another encoding" doesn't make sense. Strings are always encoded in UTF-16. You should worry about the encoding only when reading and writing your string.
When reading, use Encoding.UTF8.GetString to create a UTF-16 string from the binary data.
When writing, either use Encoding.GetEncoding(destinationEncoding).GetBytes to get the binary data and write it directly, or use the overload of your StreamWriter constructor (or whatever object you're using) to specify the encoding.
You need to decode the string properly on input, like so:
StreamReader rdr = new StreamReader( args[0], Encoding.UTF8 );
string str = rdr.ReadToEnd();
rdr.Close();
The stream is physical, and you must know what encoding it is in. The string, on the other hand, is logical. The encoding used for strings internally is of no concern to you, other than which characters it can represent, and it can represent all of them because the internal encoding is Unicode. (If the internal encoding were KOI-8, German or French characters couldn't be represented.)
It is on output that you have to worry about the encoding again. If you don't specify the encoding on input and output, the platform default is assumed, which might not be what you want. It's good practice to know and specify the encoding on both input and output.
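As a rough illustration of specifying the encoding on both sides, the sketch below writes a temporary file and reads it back with the same explicit encoding. The names ExplicitEncodingSketch and WriteThenRead are invented, and UTF-8 is used only as an example choice.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExplicitEncodingSketch
{
    // Writes text with an explicit encoding, then reads it back with the same one.
    public static string WriteThenRead(string path, string text, Encoding encoding)
    {
        using (var writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            string result = WriteThenRead(path, "Драйвер", Encoding.UTF8);
            Console.WriteLine(result == "Драйвер"); // True
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```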
"ÐÑайвеÑ-длÑ-пÑинÑеÑа-Pixma-ip-2000-длÑ-Windows-7-64-биÑ".
It's already UTF-8! You don't have to do any conversion. You just need to tell the component that reads the data to treat it as UTF-8. Something like this will do the job:
wb.Encoding = Encoding.UTF8;
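When the text has already arrived as a garbled string (as with the Location header in the question), one common repair is to recover the original bytes and decode them as UTF-8. The sketch below assumes the bytes were mistakenly decoded as ISO-8859-1, which the thread does not confirm; if a different encoding was applied, that one has to be substituted. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeSketch
{
    // Assumption: the garbled string came from UTF-8 bytes decoded as ISO-8859-1.
    public static string ReinterpretLatin1AsUtf8(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled); // recover the raw bytes
        return Encoding.UTF8.GetString(originalBytes);   // decode them correctly
    }

    public static void Main()
    {
        string correct = "Драйвер";

        // Simulate the mis-decoding, then undo it.
        string garbled = Encoding.GetEncoding("ISO-8859-1")
            .GetString(Encoding.UTF8.GetBytes(correct));
        Console.WriteLine(ReinterpretLatin1AsUtf8(garbled) == correct); // True
    }
}
```

This only works while the garbled string still carries every original byte. Once characters have been replaced or dropped along the way, the original text cannot be recovered.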

Convert UTF8 string to UTF-16 in .net

I have a string from UTF8 and want to convert that to Unicode (UTF16). Please help.
If you have a file and you know that encoding of the file is UTF8 you can use StreamReader to read the file as if it is encoded in UTF8.
Regarding conversion from UTF-8 to "Unicode": these are two different kinds of thing. Unicode is the character set, while UTF-8 and UTF-16 are encodings of it.
System.Text.UTF8Encoding is UTF-8 and System.Text.UnicodeEncoding is UTF-16. To convert between them at the byte level, you would use Encoding.Convert().
Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text, as a byte array. The function returns a standard .net string which is encoded in UTF-16.
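Both approaches mentioned in the answers can be sketched as follows. The byte array is the UTF-8 encoding of "Polución" from the first question; the class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class Utf8ToUtf16Sketch
{
    // UTF-8 bytes -> .NET string (UTF-16 internally).
    public static string DecodeUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // UTF-8 bytes -> UTF-16 little-endian bytes, without building a string yourself.
    public static byte[] Utf8ToUtf16Bytes(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }

    public static void Main()
    {
        byte[] utf8 = { 0x50, 0x6F, 0x6C, 0x75, 0x63, 0x69, 0xC3, 0xB3, 0x6E };

        Console.WriteLine(DecodeUtf8(utf8) == "Poluci\u00F3n"); // True
        Console.WriteLine(utf8.Length);                         // 9
        Console.WriteLine(Utf8ToUtf16Bytes(utf8).Length);       // 16
    }
}
```

GetString is usually what you want. Encoding.Convert is only needed when the result must stay a byte array, for example to write it to a UTF-16 file.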

How to convert string to base64 byte array, would this be valid?

I'm trying to write a function that converts a string to a base64 byte array. I've tried with this approach:
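public byte[] stringToBase64ByteArray(String input)
{
    // Encode the text as UTF-16 bytes.
    byte[] ret = System.Text.Encoding.Unicode.GetBytes(input);
    // Convert.ToBase64String takes a byte array, not a string.
    string s = Convert.ToBase64String(ret);
    // Encode the Base64 text itself as UTF-16 bytes.
    ret = System.Text.Encoding.Unicode.GetBytes(s);
    return ret;
}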
Would this function produce a valid result (provided that the string is in unicode)?
Thanks!
You can use:
From byte[] to string:
byte[] array = somebytearray;
string result = Convert.ToBase64String(array);
From string to byte[]:
array = Convert.FromBase64String(result);
Looks okay, although the approach is strange. But use Encoding.ASCII.GetBytes() to convert the base64 string to byte[]. Base64 encoding only contains ASCII characters. Using Unicode gets you an extra 0 byte for each character.
Representing a string as a blob represented as a string is odd... any reason you can't just use the string directly?
The string is always unicode; it is the encoded bytes that change. Since base-64 is always <128, using unicode in the last part seems overkill (unless that is what the wire-format demands). Personally, I'd use UTF8 or ASCII for the last GetBytes so that each base-64 character only takes one byte.
All strings in .NET are Unicode. This code will produce a valid result, but the consumer of the Base64 string must also expect Unicode (UTF-16) input.
Yes, it would output a base64-encoded string of the UTF-16 little-endian representation of your source string. Keep in mind that, AFAIK, it's not really common to use UTF-16 in base64, ASCII or UTF-8 is normally used. However, the important thing here is that the sender and the receiver agree on which encoding must be used.
I don't understand why you convert the base64 string back into an array of bytes: base64 is used to avoid encoding incompatibilities when transmitting, so you should keep it as a string and output it in the format required by the protocol you use to transmit the data. And, as Marc said, it's definitely overkill to use UTF-16 for that purpose, since base64 includes only 64 characters, all under 128.
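Putting the suggestions from the answers together, a variant that uses UTF-8 for the text and ASCII for the Base64 output might look like this. It is a sketch, not the asker's final code: the names are hypothetical, and the choice of UTF-8 for the input text is an assumption that sender and receiver would have to agree on.

```csharp
using System;
using System.Text;

public static class Base64BytesSketch
{
    // Text -> UTF-8 bytes -> Base64 text -> ASCII bytes (one byte per Base64 character).
    public static byte[] StringToBase64Bytes(string input)
    {
        byte[] textBytes = Encoding.UTF8.GetBytes(input);
        string base64 = Convert.ToBase64String(textBytes);
        return Encoding.ASCII.GetBytes(base64);
    }

    public static void Main()
    {
        byte[] result = StringToBase64Bytes("hi");
        Console.WriteLine(Encoding.ASCII.GetString(result)); // aGk=
        Console.WriteLine(result.Length);                    // 4
    }
}
```

With Encoding.Unicode in the last step, the same input would take 8 bytes instead of 4, since every Base64 character would be followed by a zero byte.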
