Convert a string to a utf8 string

Convert a string to a utf8 string - c#

I'm trying to convert a string like
<Root>á</Root>
To it's UTF string representation, like this
<Root>Ã¡</Root>
(Taken from this page: http://www.cafewebmaster.com/online_tools/utf8_encode)
But when I issue Encoding.UTF8.GetBytes(str) I get an array of utf bytes.
How can I convert those bytes to the string representation I'm after?
--
Thanks for pointing that there is no string representation of an utf8 string.
Just to clarify my needs, I have to execute something like this in sql 2008:
xmlAuditoria_Alta
'
<Out>utf8 char: Ã¡</Out>
'
This is the only way I found so far to have this stored precedure correctly save the value
utf8 char: á
That's why I'm trying to convert from á to Ã¡
Perhaps there's a more correct way to do it

Your question is based on an erroneous premise.
<Root>Ã¡</Root>
is not the UTF-8 representation of your string. In fact that string is the UTF-8 bytes re-interpreted in some other single-byte 8 bit character set.
If you want to convert a C# string to UTF-8 then you do indeed write:
Encoding.UTF8.GetBytes(str)

Related

when i decrypt base64 string in c# it shows me ????? question mark and some special character sign

i have a base64 string and i want to decrypt it thru c# code. but when i apply below code, it shows me question mark and differnt special characters.
i dont know why it is showing this.
byte[] data = Convert.FromBase64String(encpStr);
string decodedString = Encoding.UTF8.GetString(data);
i created console project in visual studio, and doing this practice. see my whole code below.
string encStr = "oJe6iooq+PbvArD+C7P7B/cHAAL9Dr2/vvIBFRcVCAYfxxEcygzMFB0eFNEWFC3VKibYLCknMiLeLzU7PC8pOPT19g==";
Console.WriteLine("******************\nEnc String:\n" + encStr + "\n\n\n**********************");
byte[] data = Convert.FromBase64String(encStr);
string decStr = Encoding.UTF8.GetString(data);
Console.WriteLine("\n\n\nDecr String: \n" + decStr + "\n\n");
Console.ReadKey();

By default many systems/languages display byte values that can not be represented as a ?, � or similar glyph.
Base64 is encoding not encryption.
Base64 encoding is generally used when data needs to be handled as a character string but the data values are not representable as characters (or at least not all).
Data is generally a collection of 8-bit bytes such as in an array. Eamples include images, compiled computer code, encrypted data and etc.
Not all, even most, byte values do not have a character representation and do not have a displayable representation. See ASCII, unicode and in particular UTF-8 as well as Base64.

Base64 is used to encode binary data as text. If you decode Base64 data, you therefore get binary data. Most binary data cannot be represented as a string (or any other form of text) because it simply is not text.
Your Base64 is such an example. The result is not text, neither in UTF-8, nor in ASCII nor in any other encoding.
That's why you get a bunch on funny characters. They are used to represent invalid characters.

Strange results when converting from byte array to string

I get strange results when converting byte array to string and then converting the string back to byte array.
Try this:
byte[] b = new byte[1];
b[0] = 172;
string s = Encoding.ASCII.GetString(b);
byte[] b2 = Encoding.ASCII.GetBytes(s);
MessageBox.Show(b2[0].ToString());
And the result for me is not 172 as I'd expect but... 63.
Why does it happen?

Why does it happen?
Because ASCII only contains values up to 127.
When faced with binary data which is invalid for the given encoding, Encoding.GetString can provide a replacement character, or throw an exception. Here, it's using a replacement character of ?.
It's not clear exactly what you're trying to achieve, but:
If you're converting arbitrary binary data to text, use Convert.ToBase64String instead; do not try to use an encoding, as you're not really representing text. You can use Convert.FromBase64String to then decode.
Encoding.ASCII is usually a bad choice, and certainly binary data including a byte of 172 is not ASCII text
You need to work out which encoding you're actually using. Personally I dislike using Encoding.Default unless you really know the data is in the default encoding for the platform you're working on. If you get the choice, using UTF-8 is a good one.

ASCII encoding is a 7-bit encoding. If you take a look into generated string it contains "?" - unrecognized character. You might choose Encoding.Default instead.

ASCII is a seven bit character encoding, so 172 falls out of that range, so when converting to a string, it converts to "?" which is used for characters that cannot be represented.

best encode decode for binary file in java and c#

i know there is many types of encode and decode and from what i have read, base64 is a great choice when it comes to encode binary file (image, mp3, video).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value. the process to get the string after decode, i will require to do like this (in c#): System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
here i noticed that i have several choices on what to use to get the string, such as ASCII, UNICODE, DEFAULT.
the real question in this post is if im using java to encode and c# to decode the binary file, what is the best solution/choice should i use? i have tried several method and some of the character could not be read thus gives out question mark symbol (?).
however, the most closer encode decode that could be read the byte is when im using this in Java: String encoded = Base64.encodeToString(fileData, Base64.CRLF); meanwhile in c# im using like this: byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, there are several character that cannot be read. Does anyone have solution for this problem statement? any feedback is much appreciated. thanks for advance.

The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Converting binary data to string using Encoding.GetString(byte[]) to convert it to BASE64 doesn't make sense at all as you lose information when converting the binary information to string - you'd need to convert it directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work is:
Get binary data as byte[]
Create BASE64 string from byte[]
Transfer BASE64 string
Create byte[] from BASE64 string
Continue working with byte[]

You state that that input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value.
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.

If you want to get string after you decode your data, it implies that your data in somehow in text format.If this is the case you should have the knowledge of the file's initial encoding, such as UTF-8. Then you can properly decode the strings. If your program only transfer files from one place to another without doing anything with its content, you better leave them as you decode.

Convert string object (Java or C#) to byte array using UTF-8 (or some other, if you have a reason for that) encoding.
You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere, which does not support raw binary data or UTF-8 text or if you don't want to worry about some characters having special meaning like in XML, convert it to ASCII string using base64 encoding.
Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc) to get it to decoder.
Convert ASCII string back to byte array with base64 decode.
Convert byte array back to string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is ok, you can skip steps 2 and 4. But 1 and 5 are needed, because in languages like C# and Java, string is "logical characters", it is not bytes you can store or transfer (of course it's bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. UTF-x are the only ones which don't lose any characters, and UTF-8 is most space-efficient if most characters are from "western" alphabets.
One special thing about base64 is, that while it is actually 7-bit ASCII characters, you can put base64 encoded text to C#/Java string object and back to base64 encoded byte array using any string encoding, since all string encodings in use are superset of 7-bit ASCII. So you can take image data, base64 encode it, and put resulting text to String object without worries about encodings and corruption.
Steps for binary files:
Get contents of binary file like PNG image file to byte array.
Same as step 2 above, except data is not UTF-8.
Same as step 3 above
Same as step 4 above
You now have byte array containing the PNG file contents from step 1.

How to convert string to base64 byte array, would this be valid?

I'm trying to write a function that converts a string to a base64 byte array. I've tried with this approach:
public byte[] stringToBase64ByteArray(String input)
{
byte[] ret = System.Text.Encoding.Unicode.GetBytes(input);
string s = Convert.ToBase64String(input);
ret = System.Text.Encoding.Unicode.GetBytes(s);
return ret;
}
Would this function produce a valid result (provided that the string is in unicode)?
Thanks!

You can use:
From byte[] to string:
byte[] array = somebytearray;
string result = Convert.ToBase64String(array);
From string to byte[]:
array = Convert.FromBase64String(result);

Looks okay, although the approach is strange. But use Encoding.ASCII.GetBytes() to convert the base64 string to byte[]. Base64 encoding only contains ASCII characters. Using Unicode gets you an extra 0 byte for each character.

Representing a string as a blob represented as a string is odd... any reason you can't just use the string directly?
The string is always unicode; it is the encoded bytes that change. Since base-64 is always <128, using unicode in the last part seems overkill (unless that is what the wire-format demands). Personally, I'd use UTF8 or ASCII for the last GetBytes so that each base-64 character only takes one byte.

All strings in .NET are unicode. This code will produce valid result but the consumer of the BASE64 string should also be unicode enabled.

Yes, it would output a base64-encoded string of the UTF-16 little-endian representation of your source string. Keep in mind that, AFAIK, it's not really common to use UTF-16 in base64, ASCII or UTF-8 is normally used. However, the important thing here is that the sender and the receiver agree on which encoding must be used.
I don't understand why you reconvert the base64 string in array of bytes: base64 is used to avoid encoding incompatibilities when transmitting, so you should keep is as a string and output it in the format required by the protocol you use to transmit the data. And, as Marc said, it's definitely overkill to use UTF-16 for that purpose, since base64 includes only 64 characters, all under 128.

Convert Latin 1 encoded UTF8 to Unicode

I came upon trying to convert a database that is encoded in UTF8 from what it looks like, into a windows 1251 encoding (dont ask, but I need to do this). All of the Russian, encoded characters in the db show up as Ð°Ð±Ð²Ð³Ð´Ð. When I pull them out of the db into my C# app, into strings, I still see Ð°Ð±Ð²Ð³Ð´Ð. No matter what I try to do to interpret this string as UTF8 encoded string, it seems to be interpreted as latin1 single byte string, and I do not see my text show up as russian. What I basically need to do is convert this latin1 looking-utf8 encoded string into Unicode, so that I can convert it later to 1251, but I have not been able to do this successfully. Anyone got any ideas?

Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))
Now you have a normal Unicode string containing Cyrillic.
Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example as it doesn't use any of the characters that are different between the two encodings. If this is the case use GetEncoding(1252) instead.
Also this is assuming that it's the contents of the database at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252 due to that being the system codepage) then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, better to start using NVARCHAR.

I am using sql server, and all columns are nvarchar. The data was imported with mysql dump from a db that was latin1, not utf8. So all the unicode strings are simply latin1 encoded. In any case, I figured it out, and its very similar to what you suggested. here's what I did to convert the latin1 encoded utf8 into 1251.
//re interpret latin1 in proper utf8 encoding
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));
//convert from utf8 to 1251
str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.