Convert UTF-8 string to UTF-16 in .NET

I have a UTF-8 encoded string and want to convert it to Unicode (UTF-16). Please help.

If you have a file and you know that it is encoded in UTF-8, you can pass that encoding to StreamReader so the file is read as UTF-8.
Regarding conversion from UTF-8 to Unicode: you are comparing two different things. Unicode is a character set, while UTF-8 and UTF-16 are encodings of it.
System.Text.UTF8Encoding is UTF-8; System.Text.UnicodeEncoding is UTF-16. To convert bytes from one to the other, you would use Encoding.Convert().
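A minimal sketch of the Encoding.Convert() route mentioned above, assuming the input is already available as a byte array of UTF-8 data. The class and method names are hypothetical and only for illustration:

```csharp
using System;
using System.Text;

public static class Utf8ToUtf16Sketch
{
    // Re-encodes UTF-8 bytes as UTF-16 little-endian bytes.
    // Encoding.Unicode is .NET's name for UTF-16 LE; no byte order mark is added.
    public static byte[] ConvertUtf8ToUtf16(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }
}
```

This works on bytes only. If the goal is a .NET string rather than a UTF-16 byte array, decoding with Encoding.UTF8.GetString() is usually the more direct route.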

Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text as a byte array. The function returns a standard .NET string, which is encoded in UTF-16.
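As a small illustration of this answer, the sketch below decodes UTF-8 bytes into a .NET string. The wrapper class and method name are hypothetical; Encoding.UTF8.GetString is the call the answer refers to:

```csharp
using System;
using System.Text;

public static class Utf8DecodeSketch
{
    // Decodes a UTF-8 byte array into a .NET string (UTF-16 in memory).
    public static string DecodeUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

For example, the nine bytes 50 6F 6C 75 63 69 C3 B3 6E decode to the eight-character string "Polución", because C3 B3 is the two-byte UTF-8 form of "ó".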

Related

Best encode/decode for a binary file in Java and C#

I know there are many ways to encode and decode, and from what I have read, base64 is a great choice for encoding a binary file (image, mp3, video).
Now, when it comes to decoding, I need to convert from base64 and then get the string value. To get the string after decoding, I do this (in C#):
System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Here I noticed that I have several choices for getting the string, such as ASCII, Unicode, and Default.
The real question in this post is: if I use Java to encode and C# to decode the binary file, which choice should I use? I have tried several methods, and some characters could not be read and came out as question marks (?).
The closest I have come to a readable result is with this in Java:
String encoded = Base64.encodeToString(fileData, Base64.CRLF);
and this in C#:
byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, several characters cannot be read. Does anyone have a solution for this problem? Any feedback is much appreciated. Thanks in advance.

The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Converting binary data to a string with Encoding.GetString(byte[]) and then converting that string to BASE64 doesn't make sense, because you lose information when converting the binary data to a string. You need to convert the byte[] directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work:
1. Get binary data as byte[]
2. Create BASE64 string from byte[]
3. Transfer BASE64 string
4. Create byte[] from BASE64 string
5. Continue working with byte[]
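The steps above can be sketched in C# as follows. The class and method names are hypothetical; Convert.ToBase64String and Convert.FromBase64String are the standard .NET calls. Note that no text encoding (ASCII, UTF-8, ...) appears anywhere, since the payload is never treated as text:

```csharp
using System;

public static class Base64BinarySketch
{
    // Sender side: binary data -> base64 string that is safe to transfer as text.
    public static string ToTransferString(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Receiver side: base64 string -> the original binary data.
    public static byte[] FromTransferString(string base64)
    {
        return Convert.FromBase64String(base64);
    }
}
```

The byte[] returned by FromTransferString is the file content itself and can be written to disk or processed directly.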

You state that the input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
"Now, when it comes to decode, i will need to convert from the base64 and then get the string value."
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.

If you want to get a string after you decode your data, that implies your data is somehow in text format. If this is the case, you should know the file's original encoding, such as UTF-8. Then you can properly decode the strings. If your program only transfers files from one place to another without doing anything with their content, you had better leave them as bytes after decoding.

1. Convert string object (Java or C#) to byte array using UTF-8 (or some other, if you have a reason for that) encoding.
2. You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere, which does not support raw binary data or UTF-8 text or if you don't want to worry about some characters having special meaning like in XML, convert it to ASCII string using base64 encoding.
3. Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc) to get it to decoder.
4. Convert ASCII string back to byte array with base64 decode.
5. Convert byte array back to string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is ok, you can skip steps 2 and 4. But 1 and 5 are needed, because in languages like C# and Java, string is "logical characters", it is not bytes you can store or transfer (of course it's bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. UTF-x are the only ones which don't lose any characters, and UTF-8 is most space-efficient if most characters are from "western" alphabets.
One special thing about base64 is, that while it is actually 7-bit ASCII characters, you can put base64 encoded text to C#/Java string object and back to base64 encoded byte array using any string encoding, since all string encodings in use are superset of 7-bit ASCII. So you can take image data, base64 encode it, and put resulting text to String object without worries about encodings and corruption.
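A rough C# sketch of the text steps described above (string to UTF-8 bytes, bytes to base64, and back). The class and method names are made up for illustration, and UTF-8 is assumed as the agreed encoding on both sides:

```csharp
using System;
using System.Text;

public static class TextBase64Sketch
{
    // String -> UTF-8 bytes -> base64 text.
    public static string Encode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Convert.ToBase64String(utf8Bytes);
    }

    // Base64 text -> bytes -> string, decoded as UTF-8.
    public static string Decode(string base64)
    {
        byte[] utf8Bytes = Convert.FromBase64String(base64);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

The final decode is only valid because the bytes are known to be UTF-8 text. For a PNG or MP3 the last step is skipped and the byte array is used as is.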
Steps for binary files:
1. Get contents of binary file like PNG image file to byte array.
2. Same as step 2 above, except data is not UTF-8.
3. Same as step 3 above.
4. Same as step 4 above.
5. You now have byte array containing the PNG file contents from step 1.

How to decode UTF8 bytes?

How can I decode UTF8 bytes in a string in C#?
Example: Decode this input:
"Poluci%C3%B3n"
To output this:
"Polución"

This encoding appears to be URL encoding (not UTF-8 encoding). You can unencode it with a number of different methods in .NET:
HttpUtility.UrlDecode("Poluci%C3%B3n"); // returns "Polución"
Uri.UnescapeDataString("Poluci%C3%B3n"); // returns "Polución"

Try this:
Uri.UnescapeDataString("Poluci%C3%B3n")
The problem has nothing to do with UTF-8 decoding as such, though. The string is just URL-encoded.
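To make the relationship concrete, here is a small sketch (hypothetical class and method names) around Uri.UnescapeDataString and its counterpart Uri.EscapeDataString. The escapes %C3%B3 are the two UTF-8 bytes of "ó" written as percent-encoded hex:

```csharp
using System;

public static class UrlDecodeSketch
{
    // Turns percent-escapes such as %C3%B3 back into characters.
    public static string Decode(string escaped)
    {
        return Uri.UnescapeDataString(escaped);
    }

    // The reverse direction: non-ASCII characters become percent-escaped UTF-8 bytes.
    public static string Encode(string text)
    {
        return Uri.EscapeDataString(text);
    }
}
```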

C# - converting UTF-8 to Ukrainian encoding

I was trying to convert the encoding of this string from UTF-8 to Ukrainian: "ÐÑÐ°Ð¹Ð²ÐµÑ-Ð´Ð»Ñ-Ð¿ÑÐ¸Ð½ÑÐµÑÐ°-Pixma-ip-2000-Ð´Ð»Ñ-Windows-7-64-Ð±Ð¸Ñ".
Whenever I convert it from UTF-8 to Ukrainian I get a corrupted string.
The correct string should look like "Драйвер-для-принтера-Pixma-ip-2000-для-Windows-7-64-бит".
Please advise. Thanks.
EDIT: here is how I convert it:
private string EncodeUTF8toOther(string inputString, string to)
{
    try
    {
        // Create two different encodings.
        byte[] myBytes = Encoding.Unicode.GetBytes(inputString);
        // Perform the conversion from one encoding to the other.
        byte[] convertedBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(to), myBytes);
        return Encoding.GetEncoding("ISO-8859-1").GetString(convertedBytes);
    }
    catch
    {
        return inputString;
    }
}
The Ukrainian character set is "KOI8-U".
More info: I have a similar problem to this question:
c# HttpWebResponse Header encoding
The Location header is giving me this corrupted string. I need to decode it correctly in order to perform the redirection.

Encoding.Unicode is UTF-16, not UTF-8. If you're sure your source string is encoded in UTF-8, use Encoding.UTF8 instead.
And returning a string doesn't make sense here: strings are always encoded in UTF-16. You should worry about the encoding only when reading and writing your string.
When reading, use Encoding.UTF8.GetString to create a UTF-16 string from the binary data.
When writing, either use Encoding.GetEncoding(destinationEncoding).GetBytes to get the binary data and write it directly, or use the overload of your StreamWriter constructor (or whatever object you're using) to specify the encoding.
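A minimal sketch of the read/write split described in this answer. The helper name is hypothetical and the target encoding is passed in as a parameter. For KOI8-U that would be Encoding.GetEncoding("koi8-u"); as far as I know, on .NET Core and later this also needs the code pages provider (CodePagesEncodingProvider) to be registered first, so treat that part as an assumption to verify:

```csharp
using System;
using System.Text;

public static class TranscodeSketch
{
    // Reading: UTF-8 bytes become a .NET string (UTF-16 in memory).
    // Writing: the string is encoded into whatever the destination expects.
    public static byte[] Transcode(byte[] utf8Input, Encoding target)
    {
        string text = Encoding.UTF8.GetString(utf8Input);
        return target.GetBytes(text);
    }
}
```

The string in the middle never needs "converting"; only the bytes on either side have an encoding.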

You need to decode the string properly on input, like so:
StreamReader rdr = new StreamReader( args[0], Encoding.UTF8 );
string str = rdr.ReadToEnd();
rdr.Close();
The stream is physical and you must know what encoding it is in.
The string, on the other hand, is logical.
The encoding used for strings internally is of no concern to you, other than which characters it can represent.
It can represent all characters, because the internal encoding is a Unicode encoding.
(If the internal encoding were KOI-8, German or French characters couldn't be represented.)
It is on output that you have to worry again about the encoding.
If you don't specify the encoding on input and output the platform default is assumed.
This might not be what you want.
It's good practice to know and specify the encoding on input and output.
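The advice above (name the encoding on both input and output) might look like this in practice. This is only a sketch with made-up helper names, and it uses in-memory streams so it stays self-contained; with files you would pass a path or FileStream instead:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamEncodingSketch
{
    // Input: the stream is physical, so the encoding must be stated.
    public static string ReadAll(Stream input, Encoding encoding)
    {
        using (var reader = new StreamReader(input, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Output: the encoding is stated again when the string becomes bytes.
    public static byte[] WriteAll(string text, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            using (var writer = new StreamWriter(buffer, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return buffer.ToArray();
        }
    }
}
```

One detail to be aware of: StreamWriter emits the encoding's byte order mark at the start of the stream, so Encoding.UTF8 produces a leading EF BB BF, while new UTF8Encoding(false) does not.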

"ÐÑÐ°Ð¹Ð²ÐµÑ-Ð´Ð»Ñ-Ð¿ÑÐ¸Ð½ÑÐµÑÐ°-Pixma-ip-2000-Ð´Ð»Ñ-Windows-7-64-Ð±Ð¸Ñ".
It's already UTF-8! You don't have to do any conversion. Just let Windows know it's UTF-8. Something like this will do the job:
wb.Encoding = Encoding.UTF8;
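If the string has already been decoded with the wrong encoding (as with a header value you cannot re-read), a repair along these lines may work. This is a sketch under an assumption: the UTF-8 bytes were decoded as ISO-8859-1, which the garbled sample suggests but the question does not confirm. The class and method names are hypothetical:

```csharp
using System;
using System.Text;

public static class MojibakeSketch
{
    // Assumption: the garbled text came from decoding UTF-8 bytes as ISO-8859-1.
    // ISO-8859-1 maps each byte to one char, so the original bytes can be recovered.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    // Reproduces the corruption, for demonstration only.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }
}
```

This only works if no bytes were lost along the way. A string copied from a web page or console may already have had its invisible control characters stripped, in which case the repair will not be exact.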

What is the encoding of the string returned by StreamReader.ReadLine()?

First, let's see the code:
// The encoding of utf8.txt is UTF-8
StreamReader reader = new StreamReader(@"C:\utf8.txt", Encoding.UTF8, true);
while (reader.Peek() >= 0)
{
    // What is the encoding of lineFromTxtFile?
    string lineFromTxtFile = reader.ReadLine();
}
As Joel said in his famous article:
"If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
So here comes my question: what is the encoding of the string lineFromTxtFile? UTF-8 (because it comes from a text file encoded in UTF-8)? Or UTF-16 (because a string in .NET is "Unicode", i.e. UTF-16)?
Thanks.

All .NET string variables are encoded with Encoding.Unicode (UTF-16, little endian). Even better, because you know your text file is UTF-8 and gave the StreamReader the correct encoding in the constructor, any special characters will be handled correctly.
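A small sketch that makes this observable, using a MemoryStream in place of the file from the question (the class and method names are hypothetical). The euro sign is three bytes in UTF-8 but a single UTF-16 code unit once it is a string:

```csharp
using System;
using System.IO;
using System.Text;

public static class InMemoryEncodingSketch
{
    // Reads the first line from UTF-8 encoded content, as the question's reader does.
    public static string ReadFirstLine(byte[] utf8FileContents)
    {
        using (var reader = new StreamReader(new MemoryStream(utf8FileContents), Encoding.UTF8, true))
        {
            return reader.ReadLine();
        }
    }
}
```

Given the bytes E2 82 AC 0A, the returned string has Length 1 and its only char is U+20AC. The three-byte UTF-8 form exists only in the stream; asking Encoding.UTF8.GetByteCount for the string gives 3 again, while Encoding.Unicode.GetByteCount gives 2.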

.NET strings are Unicode. Encoding doesn't play a part again until you next need to turn the string into bytes. If you go to write it out to a file, for example, then you will specify the output encoding. But since .NET handles everything you do with the string via library calls, it doesn't matter how it's represented in memory.

It would be Unicode, because all .NET strings are. Real question: why does it matter?

How to convert string to base64 byte array, would this be valid?

I'm trying to write a function that converts a string to a base64 byte array. I've tried with this approach:
public byte[] stringToBase64ByteArray(String input)
{
    // UTF-16 bytes of the input string
    byte[] ret = System.Text.Encoding.Unicode.GetBytes(input);
    // Convert.ToBase64String takes the byte array, not the string
    string s = Convert.ToBase64String(ret);
    ret = System.Text.Encoding.Unicode.GetBytes(s);
    return ret;
}
Would this function produce a valid result (provided that the string is in unicode)?
Thanks!

You can use:
From byte[] to string:
byte[] array = somebytearray;
string result = Convert.ToBase64String(array);
From string to byte[]:
array = Convert.FromBase64String(result);

Looks okay, although the approach is strange. But use Encoding.ASCII.GetBytes() to convert the base64 string to byte[]. Base64 encoding only contains ASCII characters. Using Unicode gets you an extra 0 byte for each character.

Representing a string as a blob represented as a string is odd... any reason you can't just use the string directly?
The string is always unicode; it is the encoded bytes that change. Since base-64 is always <128, using unicode in the last part seems overkill (unless that is what the wire-format demands). Personally, I'd use UTF8 or ASCII for the last GetBytes so that each base-64 character only takes one byte.

All strings in .NET are unicode. This code will produce valid result but the consumer of the BASE64 string should also be unicode enabled.

Yes, it would output a base64-encoded string of the UTF-16 little-endian representation of your source string. Keep in mind that, AFAIK, it's not really common to use UTF-16 in base64, ASCII or UTF-8 is normally used. However, the important thing here is that the sender and the receiver agree on which encoding must be used.
I don't understand why you convert the base64 string back into an array of bytes: base64 is used to avoid encoding incompatibilities when transmitting, so you should keep it as a string and output it in the format required by the protocol you use to transmit the data. And, as Marc said, it's definitely overkill to use UTF-16 for that purpose, since base64 includes only 64 characters, all under 128.
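Putting the suggestions from these answers together, one possible variant looks like the sketch below. It assumes both sides agree on UTF-8 for the text and ASCII for the base64 characters; the names are hypothetical and this is not the asker's original function:

```csharp
using System;
using System.Text;

public static class StringToBase64BytesSketch
{
    // Text -> UTF-8 bytes -> base64 string -> one ASCII byte per base64 character.
    public static byte[] StringToBase64ByteArray(string input)
    {
        byte[] raw = Encoding.UTF8.GetBytes(input);
        string base64 = Convert.ToBase64String(raw);
        return Encoding.ASCII.GetBytes(base64);
    }

    // The receiver reverses each step with the same two encodings.
    public static string Base64ByteArrayToString(byte[] base64Bytes)
    {
        string base64 = Encoding.ASCII.GetString(base64Bytes);
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

With ASCII for the last step, the base64 text "aGk=" for the input "hi" occupies 4 bytes; with Encoding.Unicode it would occupy 8.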
