Correctly handle utf8 string received from json.net

Correctly handle utf8 string received from json.net - c#

I'm using json.net to read data sent in json format from a server. The server encodes all string-type data it sends in json as utf-8.
Now to read the data in c# I do something like this: string s = json.Value<string>("data");
I assume the string s is now in utf-8 format, whereas the default encoding for strings in c# is utf-16 (unicode).
To convert the string to unicode, would this be correct?
byte[] bytes = Encoding.Unicode.GetBytes(s);
string unicode = Encoding.UTF8.GetString(bytes);
What I want (I think) is the raw bytes from s and then pass that to the utf-8 decoder to get unicode, but I'm not sure what exactly Encoding.Unicode.GetBytes gives me, or what I should use instead.

There is no need to convert anything, since string objects in .NET are encoded in UTF-16.
If there is anything to change, you should change something where JSON.NET deserializes the string: you can't double parse it. The incoming JSON string is already interpreted for a specific encoding. You can't go back from there without the original bytes.

Related

Converting a string rappresentation of a file (byte array) back to a file in C#

As in the title I'm trying to convert back a string rappresentation of a bytearray to the original file where the bytes where taken.
What I've done:
I've a web service that gets a whole file and sends it:
answer.FileByte = File.ReadAllBytes(#"C:\QRY.txt");
After the serialization in the transmitted result xml I've this line:
<a:FileByte>TVNIfGF8MjAxMzAxMDF8YQ1QSUR8YXxhfGF8YXxhfGF8YXwyMDEzMDEwMXxhfGF8YXxhfGF8YXxhfGF8YXxhfGF8YXxhDVBWMXxhfGF8YXxhfGF8YXxhfGF8YXxhfDIwMTMwMTAxfDIwMTMwMTAxfDB8MHxhDQo=</a:FileByte>
I've tried to convert it back with this line in another simple application:
//filepath is the path of the file created
//bytearray is the string from the xml (copypasted)
File.WriteAllBytes(filepath, Encoding.UTF8.GetBytes(bytearray));
I've used UTF8 as enconding since the xml declares to use this charset. Keeping the datatype is not an option since I'm writing a simple utility to check the file conversion.
Maybe I'm missing something very basic but I'm not able to come up with a working solution.

This certainly isn't UTF8, the serializer probably converted it to Base64.
Use Convert.FromBase64String() to get your bytes back
Assuming that bytearray is the "TVNIfGF8M..." string, try:
string bytearray = ...;
File.WriteAllBytes(filepath, Convert.FromBase64String(bytearray));

UTF8 is a way to convert arbitrary text into bytes.
It was used by ReadAllText() to turn the bytes on disk back into XML.
You're seeing a mechanism to convert arbitrary bytes into text that can fit into XML. (that text is then convert to different bytes using UTF8).
It's probably Base64; use Convert.FromBase64String().

best encode decode for binary file in java and c#

i know there is many types of encode and decode and from what i have read, base64 is a great choice when it comes to encode binary file (image, mp3, video).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value. the process to get the string after decode, i will require to do like this (in c#): System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
here i noticed that i have several choices on what to use to get the string, such as ASCII, UNICODE, DEFAULT.
the real question in this post is if im using java to encode and c# to decode the binary file, what is the best solution/choice should i use? i have tried several method and some of the character could not be read thus gives out question mark symbol (?).
however, the most closer encode decode that could be read the byte is when im using this in Java: String encoded = Base64.encodeToString(fileData, Base64.CRLF); meanwhile in c# im using like this: byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, there are several character that cannot be read. Does anyone have solution for this problem statement? any feedback is much appreciated. thanks for advance.

The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Converting binary data to string using Encoding.GetString(byte[]) to convert it to BASE64 doesn't make sense at all as you lose information when converting the binary information to string - you'd need to convert it directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work is:
Get binary data as byte[]
Create BASE64 string from byte[]
Transfer BASE64 string
Create byte[] from BASE64 string
Continue working with byte[]

You state that that input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value.
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.

If you want to get string after you decode your data, it implies that your data in somehow in text format.If this is the case you should have the knowledge of the file's initial encoding, such as UTF-8. Then you can properly decode the strings. If your program only transfer files from one place to another without doing anything with its content, you better leave them as you decode.

Convert string object (Java or C#) to byte array using UTF-8 (or some other, if you have a reason for that) encoding.
You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere, which does not support raw binary data or UTF-8 text or if you don't want to worry about some characters having special meaning like in XML, convert it to ASCII string using base64 encoding.
Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc) to get it to decoder.
Convert ASCII string back to byte array with base64 decode.
Convert byte array back to string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is ok, you can skip steps 2 and 4. But 1 and 5 are needed, because in languages like C# and Java, string is "logical characters", it is not bytes you can store or transfer (of course it's bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. UTF-x are the only ones which don't lose any characters, and UTF-8 is most space-efficient if most characters are from "western" alphabets.
One special thing about base64 is, that while it is actually 7-bit ASCII characters, you can put base64 encoded text to C#/Java string object and back to base64 encoded byte array using any string encoding, since all string encodings in use are superset of 7-bit ASCII. So you can take image data, base64 encode it, and put resulting text to String object without worries about encodings and corruption.
Steps for binary files:
Get contents of binary file like PNG image file to byte array.
Same as step 2 above, except data is not UTF-8.
Same as step 3 above
Same as step 4 above
You now have byte array containing the PNG file contents from step 1.

.NET 3.5 C# StreamReader Reading ISO-8859-1 Characters Incorrectly

In summary I retrieve a HTTP Web Response containing JSON formatted data with unicode characters such as "\u00c3\u00b1" which should translate to "ñ". Instead these characters are converted to "Ã±" by the JSON parser I am using. The behavior I'm looking for is for those characters to be translated to "ñ".
Taking the following code and executing it...
string nWithAccent = "\u00c3\u00b1";
Encoding iso = Encoding.GetEncoding("iso8859-1");
byte[] isoBytes = iso.GetBytes(nWithAccent);
nWithAccent = Encoding.UTF8.GetString(isoBytes);
nWithAccent outputs "ñ". This is the result I am looking for. I took the above code and used it on the "response_body" variable below which contained the same characters as above (from what I could see using the Visual Studio 2008 Text Analyzer) and did not get the same result... it does nothing with the characters "\u00c3\u00b1".
In my application I execute the following code against an external system retrieving data in JSON format. Upon examining the "response_body" variable using the text analyzer in Visual Studio I see "\u00c3\u00b1" instead of ñ. E.g. the word "niño" would be seen in the Text Analyzer as "ni\u00c3\u00b1o".
using (HttpWResponse = (HttpWebResponse)this.HttpWRequest.GetResponse())
{
using (StreamReader reader = new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
{
// token will expire 60 min from now.
this.TimeTillTokenExpiration = DateTime.Now.AddMinutes(60);
// read response data
response_body = reader.ReadToEnd();
}
}
I then use an open source JSON parser which replaces "\u00c3" with "Ã" and "\u00b1" with "±" with an end result of "Ã±" instead of "ñ". Is something wrong with the JSON parser or am I applying the wrong encoding to the response stream? The headers in the response indicate the charset as being UTF-8. Thanks for any replies!

The JSON response you are receiving is invalid. "\u00c3\u00b1" isn't the correct encoding for ñ.
Instead it's a sort of double encoding. It has first been encoded as an UTF-8 byte sequence and then the bytes above 128 have been escaped with the \u sequence.
Since a JSON response is usally UTF-8 anyway, there's no need to escape the two byte sequence for ñ. If escaping is used, it must not be applied to the two byte sequence but rather to the single Unicode character itself. It would then result in "\u00f1".
You can test it with an online JSON validator (such as JSONLint or JSON Format) by pasting the following JSON data:
{
"unescaped": "ñ",
"escaped": "\u00f1",
"wrong": "\u00c3\u00b1"
}

Replace
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
with
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.GetEncoding("iso8859-1")))

What happens if you pass this string to the JSON parser?
string s = "\\u00c3\\u00b1";
I suspect you'll get "Ã±".
Is there a way you can tell your JSON parser to interpret characters in the string as though they're UTF-8 bytes?
You're probably better off reading raw bytes from the response stream and passing that to the JSON parser.
I think the problem is that you're converting the raw bytes to a string, which contains the encoded characters. The JSON parser doesn't know if you want that "\u00c3\u00b1" converted to a single UTF-8 character, or two characters.

C# - converting UTF-8 to Ukranian encoding

I was trying to convert the encoding of this string from utf-8 to ukranian "ÐÑÐ°Ð¹Ð²ÐµÑ-Ð´Ð»Ñ-Ð¿ÑÐ¸Ð½ÑÐµÑÐ°-Pixma-ip-2000-Ð´Ð»Ñ-Windows-7-64-Ð±Ð¸Ñ".
whenever I convert it from utf8 to ukranian I get a corrupted string...
the correct string should look like "Драйвер-для-принтера-Pixma-ip-2000-для-Windows-7-64-бит"..
please advice.. thanks
EDIT: here is how I convert it..
private string EncodeUTF8toOther(string inputString, string to)
{
try
{
// Create two different encodings.
byte[] myBytes = Encoding.Unicode.GetBytes(inputString);
// Perform the conversion from one encoding to the other.
byte[] convertedBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(to), myBytes);
return Encoding.GetEncoding("ISO-8859-1").GetString(convertedBytes);
}
catch
{
return inputString;
}
}
ukrainian character set is "KOI8-U"
More Info: I have similar problem to this question:
c# HttpWebResponse Header encoding
the location header is giving me this corrupted string. I need to encode it correctly in order to perform the redirection..

Encoding.Unicode is UTF-16, not UTF-8. If you're sure your source string is encoded in UTF-8, use Encoding.UTF8 instead.
And returning a string doesn't have any sense. string are always encoded in UTF-16. You should worry about the encoding only when reading and writing your string.
When reading, use Encoding.UTF8.GetString to create a UTF-16 string from the binary data.
When writing, either use Encoding.GetEncoding(destinationEncoding).GetBytes to get the binary data and write it directly, or use the overload of your StreamWriter constructor (or whatever object you're using) to specify the encoding.

You need to decode the string properly on input, like so:
StreamReader rdr = new StreamReader( args[0], Encoding.UTF8 );
string str = rdr.ReadToEnd();
rdr.Close();
The stream is physical and you must know what encoding it is in.
The string, on the other hand, is logical.
The encoding used for strings internally is of no concern to you;
other than that what characters it can represent;
and it can represent all characters as the internal encoding is for Unicode.
(If the internal encoding were KOI-8 German or French characters couldn't be represented.)
It is on output that you have to worry again about the encoding.
If you don't specify the encoding on input and output the platform default is assumed.
This might not be what you want.
It's good practice to know and specify the encoding on input and output.

"ÐÑÐ°Ð¹Ð²ÐµÑ-Ð´Ð»Ñ-Ð¿ÑÐ¸Ð½ÑÐµÑÐ°-Pixma-ip-2000-Ð´Ð»Ñ-Windows-7-64-Ð±Ð¸Ñ".
Its already UTF-8! You don't have to make any conversion. Just make Windows know its UTF-8. Something like this will do the job:
wb.Encoding = Encoding.UTF8;

Convert Latin 1 encoded UTF8 to Unicode

I came upon trying to convert a database that is encoded in UTF8 from what it looks like, into a windows 1251 encoding (dont ask, but I need to do this). All of the Russian, encoded characters in the db show up as Ð°Ð±Ð²Ð³Ð´Ð. When I pull them out of the db into my C# app, into strings, I still see Ð°Ð±Ð²Ð³Ð´Ð. No matter what I try to do to interpret this string as UTF8 encoded string, it seems to be interpreted as latin1 single byte string, and I do not see my text show up as russian. What I basically need to do is convert this latin1 looking-utf8 encoded string into Unicode, so that I can convert it later to 1251, but I have not been able to do this successfully. Anyone got any ideas?

Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))
Now you have a normal Unicode string containing Cyrillic.
Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example as it doesn't use any of the characters that are different between the two encodings. If this is the case use GetEncoding(1252) instead.
Also this is assuming that it's the contents of the database at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252 due to that being the system codepage) then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, better to start using NVARCHAR.

I am using sql server, and all columns are nvarchar. The data was imported with mysql dump from a db that was latin1, not utf8. So all the unicode strings are simply latin1 encoded. In any case, I figured it out, and its very similar to what you suggested. here's what I did to convert the latin1 encoded utf8 into 1251.
//re interpret latin1 in proper utf8 encoding
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));
//convert from utf8 to 1251
str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.