I am using C# to implement a JSON serialization tool. The standard I am following is RFC 7159, but I don't understand Section 8 (String and Character Issues) of that document.
8. String and Character Issues
8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).
Implementations MUST NOT add a byte order mark to the beginning of a
JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
8.2. Unicode Characters
When all the strings represented in a JSON text are composed entirely
of Unicode characters [UNICODE] (however escaped), then that JSON
text is interoperable in the sense that all software implementations
that parse it will agree on the contents of names and of string
values in objects and arrays.
However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.
8.3. String Comparison
Software implementations are typically required to test names of
object members for equality. Implementations that transform the
textual representation into sequences of Unicode code units and then
perform the comparison numerically, code unit by code unit, are
interoperable in the sense that implementations will agree in all
cases on equality or inequality of two strings. For example,
implementations that compare strings with escaped characters
unconverted may incorrectly find that "a\\b" and "a\u005Cb" are not
equal.
The JSON serialization tool I've built is very simple. It can turn objects into strings or strings into objects.
I expose only two APIs:
public string SerializeToJson(object obj);
public object DeserializeToObj(string json);
For serialization, I am only responsible for producing a string (UTF-16); I don't care what bytes you later encode it into.
For deserialization, at the code level I accept only a string (I don't care where the string was read from); I read every char of the string and throw an error if the format is incorrect.
So I don't understand sections 8.1 and 8.2 of RFC 7159 very well. Why do they appear in the JSON standard?
For a C# JSON serialization tool, should I take them into account? If I've got it wrong, can you tell me what they mean, or give me a concrete scenario?
For 8.3: when serializing, I escape each backslash, so \ becomes \\ (and the literal text \u005c becomes \\u005c). In deserialization, I decode the escapes and compare code unit by code unit, so \\ and \u005c certainly decode to the same character. So what does 8.3 mean? Is it about serialization or deserialization? Is my understanding correct?
Or does it require that, when serializing, both \\ and \u005c eventually produce the same output (such as \\u005c)?
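For instance, here is how I read it (a quick sketch against my own two APIs above; the assertions are my assumptions, not anything stated in the RFC):

// 8.3: both JSON texts below should decode to the code units 'a', '\', 'b',
// so after deserialization they compare equal code unit by code unit.
object a = DeserializeToObj("\"a\\\\b\"");     // the JSON text "a\\b"
object b = DeserializeToObj("\"a\\u005Cb\"");  // the JSON text "a\u005Cb"

// 8.2: a scenario for an ill-formed string. A C# string is a sequence of
// UTF-16 code units, so truncation can leave an unpaired surrogate behind:
string pile = "\U0001F4A9";            // one code point, two UTF-16 code units
string broken = pile.Substring(0, 1);  // the high surrogate alone
string json = SerializeToJson(broken); // would emit the "\uDEAD"-style text 8.2 warns about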
You could do the serialization in two steps: first to a String, then to UTF-8 bytes. UTF-8 is required for interchangeable JSON documents (RFC 8259, which obsoletes RFC 7159).
The only escapes you need are inside JSON strings.
The characters that MUST be escaped:
quotation mark,
reverse solidus,
and the control characters (U+0000 through U+001F)
You can choose the manner of escaping quotation mark, reverse solidus, solidus, backspace, form feed, newline, carriage return, and horizontal tab upon serialization.
www.json.org has a very appealing "railroad diagram" illustrating this.
(ASCII has nothing to do with anything here.)
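A minimal sketch of that (my own helper, not any library's API): escape only what the list above requires, leave everything else literal, then encode the finished text as UTF-8 without a BOM:

using System.Text;

static string EscapeJsonString(string s)
{
    var sb = new StringBuilder("\"");
    foreach (char c in s)
    {
        switch (c)
        {
            case '"':  sb.Append("\\\""); break;  // MUST be escaped
            case '\\': sb.Append("\\\\"); break;  // MUST be escaped
            case '\b': sb.Append("\\b");  break;  // short forms are optional;
            case '\f': sb.Append("\\f");  break;  // \u00XX would also be valid
            case '\n': sb.Append("\\n");  break;
            case '\r': sb.Append("\\r");  break;
            case '\t': sb.Append("\\t");  break;
            default:
                if (c < 0x20)                     // remaining control characters
                    sb.Append("\\u").Append(((int)c).ToString("X4"));
                else
                    sb.Append(c);                 // everything else may stay literal
                break;
        }
    }
    return sb.Append('"').ToString();
}

// Step 2: the interchange form is UTF-8, and 8.1 forbids adding a BOM.
static byte[] ToUtf8(string jsonText) => new UTF8Encoding(false).GetBytes(jsonText);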
Related
I'm struggling to find an effective way to serialize a string that could contain both Unicode and non-Unicode characters into a binary array, which I then serialize to a file that I have to deserialize using C++.
I have already implemented a serializer/deserializer in C++ which I use for most of my serialization, and it can handle both Unicode and non-Unicode characters (basically I convert non-Unicode characters into their Unicode equivalents and serialize everything as a Unicode string; not the most efficient approach, since every string now takes 2 bytes per character, but it works).
What I'm trying to achieve is to transform an arbitrary string into a 2-byte-per-character string that I can then deserialize from C++.
What would be the most effective way to achieve what I'm looking for?
Also, any suggestion regarding the way I'm serializing strings is of course welcome.
Encoding.Unicode.GetBytes("my string") encodes the string as UTF-16, which uses 2 bytes per char (UTF-16 code unit). So if you are still looking for an alternative, consider that encoding.
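For example, a sketch of the full round trip (the file name is just an illustration):

using System.IO;
using System.Text;

string s = "héllo \u5f90";
byte[] utf16 = Encoding.Unicode.GetBytes(s);   // little-endian UTF-16, 2 bytes per code unit
File.WriteAllBytes("payload.bin", utf16);      // the file the C++ side will read

// Reading it back (the C++ reader would interpret the same bytes as UTF-16 LE):
string back = Encoding.Unicode.GetString(File.ReadAllBytes("payload.bin"));
// back == s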
I have a COM server app (App_A) that only supports native data types. I send the parameters over the COM server to a C# app (App_B) that then sends on the data as a web request.
My problem is that the String data read by App_A is Unicode, but App_A does not support non-UTF-8 encoding for its COM String values, so the data can be sent as a byte array or char array.
If I use the byte array, the generic App_B is now broken as I now have to handle this single data update differently to all the others (and I fear there will be more), so I would like to keep the App_B handling of values generic (obj.ToString).
If I hard code an App_B C# String as a literal, e.g. "\u5f90", the String contains a Unicode character and the HttpUtility.UrlEncode call in App_B works exactly as expected. If the String is passed in as a value (obj.ToString() = "\u5f90") the '\' is escaped and the UrlEncode does not UTF-8-encode a Unicode character as the '\u' escape sequence is lost.
So far I have manipulated the byte array in App_A to replace the Unicode values (xxxx) with '\uxxxx'. I guess my question comes down to this: is there any way I can use a String variable as a format string in the C# App_B?
Alternatively, if I'm going about this the wrong way, what would anyone suggest?
Please bear in mind that I have approx 300 data value updates that all use a generic o.ToString for part of the UrlEncode argument and I would like to keep this if possible.
Is it an option for you to support different encodings in your deserialization of the byte arrays in App_B? I'd suggest modifying App_A so that each sent string has an additional first byte which defines the encoding, which then has to be respected by App_B. That way it doesn't matter which encoding you use, as long as both apps support it.
I'd strongly suggest not modifying the strings as you've described by preceding them with \u; that's just going to be a mess of code later on, which needs to be documented well and understood again if you come back to it later, etc.
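A sketch of the tagged-encoding idea (the tag values here are arbitrary choices for illustration, not any standard):

using System;
using System.Text;

// Hypothetical tag values that App_A and App_B would have to agree on.
static Encoding FromTag(byte tag) => tag switch
{
    0 => Encoding.UTF8,
    1 => Encoding.Unicode,             // UTF-16 LE
    2 => Encoding.GetEncoding(28591),  // ISO 8859-1
    _ => throw new ArgumentException("unknown encoding tag")
};

// App_B: the first byte names the encoding, the rest is the string's bytes.
static string DecodeTagged(byte[] payload) =>
    FromTag(payload[0]).GetString(payload, 1, payload.Length - 1);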
Maybe I don't need 32-bit strings, but I do need to represent 32-bit characters:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now I grabbed the Symbola font and can see the character when I paste it (in the URL bar or any text area), so I know I have font support for it.
But how do I support it in my C#/.NET app?
-edit- I'll add something: when I paste the said character into my .NET WinForms app, I DO NOT see the character correctly. When pasting it into Firefox, I do see it correctly. How do I see such characters correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
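For instance, a small sketch with the U+1F4A9 character from the link above:

using System;
using System.Text;

// Build the character from its code point; in a .NET string it is a surrogate pair.
string s = char.ConvertFromUtf32(0x1F4A9);
Console.WriteLine(s.Length);                    // 2 (UTF-16 code units)

// Its UTF-32 representation is a 4-byte array.
byte[] utf32 = new UTF32Encoding().GetBytes(s);
Console.WriteLine(utf32.Length);                // 4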
You didn't say what exactly you mean by “support”. But there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
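If you do need to manipulate such strings, one sketch of a safe approach is to walk the string in whole text elements instead of chars:

using System;
using System.Globalization;

string s = "a\U0001F4A9b";   // 'a', a surrogate pair, 'b': four chars, three characters
// Indexing by char could split the pair; text elements keep it intact.
TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
while (e.MoveNext())
    Console.WriteLine((string)e.Current);   // "a", then the full U+1F4A9, then "b"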
Is the data stored in a String object always encoded as UTF-16?
I am asking because my database stores non-English text in a non-Unicode encoding, and I assumed that the data would not be readable because it is read in the wrong encoding.
Thanks
Internally .NET strings are in UTF-16, yes... but what's important is how the data is transferred between .NET and your database.
So long as the characters can be represented in Unicode, and the driver performs the appropriate conversion, you should be fine. If you're trying to represent text which can't be represented in Unicode, you may well run into some interesting behaviour.
Yes, .NET strings are always encoded in UTF-16; that means 2 bytes per character, except for characters represented as surrogate pairs.
.NET Strings are ALWAYS Unicode. If your database is Unicode you are fine; otherwise you will need to convert the text from whatever format it is in to Unicode.
The internal storage of characters (and therefore strings) in .NET is done in UTF-16.
You will need to re-encode the string to the encoding used by your database.
See the Encoding class - this is what you can use to convert a string from one encoding to another.
If you are using ADO.NET with SqlDataCommands (or other types of DataCommands), any required conversion should be handled for you, and you won't need to worry about it.
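As a sketch of that conversion (assuming, purely for illustration, that the database column uses ISO 8859-1):

using System.Text;

string s = "café";                            // UTF-16 while in .NET memory
Encoding db = Encoding.GetEncoding(28591);    // the hypothetical database encoding
byte[] dbBytes = db.GetBytes(s);              // bytes in the database's encoding
string roundTripped = db.GetString(dbBytes);  // decoded back into a UTF-16 string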
I have Unicode strings stored in a database. Some of the character encodings are wrong and instead of displaying actual characters for the language, it's now displaying characters that make no sense. How do I fix this issue? Is there a way to detect if strings have a wrong encoding?
The problem with mojibake (the Japanese slang "mojibake" gets used in English because Japan's history as a non-Western country with heavy early computer use meant the issue was encountered there a lot) is that the characters will generally be valid in themselves, but nonsense, which is much harder to detect with 100% accuracy.
The first thing you need to do is identify the encoding that the data was really in, the encoding the data was read as being in, and write a converter to undo that.
For example, if UTF-8 had been mis-interpreted as ISO 8859-1, then you would want to read through the text, encode it back into the ISO 8859-1 byte stream, and then decode that byte stream as UTF-8, as should have been done in the first place.
Now for the hard part, finding the incorrect streams. If you can do this by some means that isn't heuristic, then this is the way to go (e.g. if you knew that every record added within a particular range of id numbers was invalid, just use that).
Failing that, your best bet is to do some heuristics as follows:
If a character in the text is not a graphical character, then it's probably caused by this mojibake issue.
Certain sequences will be common in the given case of mojibake. For example, é in UTF-8 mis-interpreted as ISO 8859-1 will become Ã©. Since Ã© is an extremely rare combination in real data (about the only time you'll see it deliberately is in a case like this, when someone is talking about how it can appear by mistake), any text containing it is almost certainly one that needs to be fixed. If you have some of the original data, you can find the sequences you need to look for by identifying those characters in the original data that differ in the two encodings and producing the sequence necessary (e.g. if we find that ç appears in the data, and we find that it would become the sequence Ã§, then we know Ã§ is a sequence to look for).
Note that we can compute such sequences if we have System.Text.Encoding objects that correspond to the mojibake. If, for example, you had read the data using your system's default encoding when you should have read it as UTF-8, then you could use:
Encoding.Default.GetString(Encoding.UTF8.GetBytes(testString))
For example:
Encoding.Default.GetString(Encoding.UTF8.GetBytes("ç"))
returns "ç".