Why is my array serialising into a string [duplicate] - c#

This question already has an answer here:
Newtonsoft JSON serialization for byte[] property [duplicate]
(1 answer)
Closed 2 years ago.
I have created the following test vector class:
public class TestVector
{
    public UInt16 MaxBlockSize { get; }
    public byte[] Payload { get; set; }

    public TestVector(ushort maxBlockSize, byte[] payload)
    {
        MaxBlockSize = maxBlockSize;
        Payload = payload;
    }
}
In my test, I am populating a list of vectors defined as per:
private static HashSet<TestVector> myVectors = new HashSet<TestVector>();
I then serialise myVectors using JsonConvert and write the result to a file:
var jsonOutput = JsonConvert.SerializeObject(myVectors, new JsonSerializerSettings { ObjectCreationHandling = ObjectCreationHandling.Replace });
File.WriteAllText(@"e:\MyJson.json", jsonOutput);
Here is typical JSON output (for a HashSet containing two vectors):
[
  {
    "MaxBlockSize": 256,
    "Payload": "bjQSAAAAAABvNBIAAAAA..."
  },
  {
    "MaxBlockSize": 256,
    "Payload": "VjQSVzQSWDQS...."
  }
]
Now what I do not get is why "Payload" is serialised as a string and not as an array.
My questions are:
What is this string format (ASCII code maybe?) and why is it used instead of a byte[] type of representation?
Is there a way to get the "Payload" byte[] to be printed in a more readable way?

What is this string format (ASCII code maybe?) and why is it used instead of a byte[] type of representation?
See the Json.NET documentation for primitive types:
Byte[]: String (base 64 encoded)
So the format is base64. It is probably used because it is a reasonably compact text encoding of binary data: each character carries 6 bits, so every 3 bytes become 4 characters. Writing the values as a JSON array of decimal numbers would take much more space, up to 4 characters per byte (for example "255,").
It is somewhat common to encode images or similar chunks of data as byte arrays. Since these can be large it is useful to keep the size down as much as possible.
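To see what such a string actually holds, it can be decoded with the framework's own Base64 helpers. The sketch below is only an illustration: it uses the visible prefix of the first payload from the question (the part before the "..."), and the class and method names are made up.

```csharp
using System;

public static class PayloadInspector
{
    // Decodes a Base64 string and renders the bytes as hyphen-separated hex.
    public static string ToHex(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // Visible prefix of the first "Payload" value: 20 characters, 15 bytes.
        Console.WriteLine(ToHex("bjQSAAAAAABvNBIAAAAA"));
        // prints 6E-34-12-00-00-00-00-00-6F-34-12-00-00-00-00
    }
}
```

The 20 characters decode to 15 bytes, which matches the 4-characters-per-3-bytes ratio of base64.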
Is there a way to get the "Payload" byte[] to be printed in a more readable way?
There are various base64 converters online that can convert it to hex, oct, string, or whatever format you prefer to view your binary data in. But for many applications it is not very useful since the binary data often represents something that is already serialized in some way.
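If the goal is for the file itself to show the bytes as numbers, one option is a custom converter that overrides the default handling of byte[]. This is a minimal sketch, not the library's recommended approach: the converter name is hypothetical, and it assumes a Json.NET version that provides the generic JsonConverter<T> base class (11.0 or later, as far as I know).

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical converter: writes byte[] as a JSON array of numbers
// instead of a Base64 string, and reads that array back.
public class ByteArrayAsNumbersConverter : JsonConverter<byte[]>
{
    public override void WriteJson(JsonWriter writer, byte[] value, JsonSerializer serializer)
    {
        if (value == null)
        {
            writer.WriteNull();
            return;
        }
        writer.WriteStartArray();
        foreach (byte b in value)
        {
            writer.WriteValue(b);
        }
        writer.WriteEndArray();
    }

    public override byte[] ReadJson(JsonReader reader, Type objectType, byte[] existingValue,
        bool hasExistingValue, JsonSerializer serializer)
    {
        if (reader.TokenType == JsonToken.Null)
        {
            return null;
        }
        // The reader is positioned on the StartArray token; consume up to EndArray.
        var bytes = new List<byte>();
        while (reader.Read() && reader.TokenType != JsonToken.EndArray)
        {
            bytes.Add(Convert.ToByte(reader.Value)); // throws if a value is outside 0..255
        }
        return bytes.ToArray();
    }
}

public static class ReadablePayloadDemo
{
    public static void Main()
    {
        var sample = new { MaxBlockSize = 256, Payload = new byte[] { 0x6E, 0x34, 0x12 } };
        string json = JsonConvert.SerializeObject(
            sample, Formatting.Indented, new ByteArrayAsNumbersConverter());
        Console.WriteLine(json);
    }
}
```

The trade-off is the one described above: the file becomes easier to read but noticeably larger, and any consumer of the file has to use the same converter (or expect an array rather than a string).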

Related

Fixing a mis-encoded string after the fact

Main problem and question:
Given a garbled string for which the actual text is known, is it possible to consistently repair the garbled string?
According to Nyerguds comment on this answer:
If the string is an incorrect decoding done with a simply 8-bit
Encoding and you have the Encoding used to decode it, you can
usually get the bytes back without any corruption, though.
(emphases mine)
Which suggests that there are cases when it is not possible to derive the original bytes back. This leads me to the following question: are there cases when (mis)encoding an array of bytes is a lossy and irreversible operation?
Background:
I am calling an external C++ library that calls a web API somewhere. Sometimes this library gives me slightly garbled text. In my C# project, I am trying to find a way to consistently reverse the miscoding, but I only seem to be able to do so part of the time.
What I've tried:
It seems clear that the C++ library is wrongly encoding the original bytes, which it later passes to me as a string. My approach has been to guess at the encoding that the C++ library used to interpret the original source bytes. Then, I iterate through all possible encodings, reinterpreting the hopefully "original" bytes with another encoding.
class TestCase
{
    public string Original { get; set; }
    public string Actual { get; set; }
    public List<string> Matches { get; } = new List<string>();
}

void Main()
{
    var testCases = new List<TestCase>()
    {
        new TestCase {Original = "窶弑-shaped", Actual = "“U-shaped"},
        new TestCase {Original = "窶廡窶・Type", Actual = "“F” Type"},
        new TestCase {Original = "Ko窶冩lau", Actual = "Ko’olau"},
        new TestCase {Original = "窶彗s is", Actual = "“as is"},
        new TestCase {Original = "窶從ew", Actual = "“new"},
        new TestCase {Original = "faテァade", Actual = "façade"}
    };
    var encodings = Encoding.GetEncodings().Select(x => x.GetEncoding()).ToList();
    foreach (var testCase in testCases)
    {
        foreach (var from in encodings)
        {
            foreach (var to in encodings)
            {
                // Guess the original bytes of the string
                var guessedSourceBytes = from.GetBytes(testCase.Original);
                // Guess what the bytes should have been interpreted as
                var guessedActualString = to.GetString(guessedSourceBytes);
                if (guessedActualString == testCase.Actual)
                {
                    testCase.Matches.Add($"Reversed using \"{from.CodePage} {from.EncodingName}\", reinterpreted as: \"{to.CodePage} {to.EncodingName}\"");
                }
            }
        }
    }
}
Of the six test cases, all but one (窶廡窶・) were reversed successfully. In the successful cases, Shift-JIS (code page 932) yielded the correct "original" byte sequence, which could then be read as UTF-8.
Getting the Shift-JIS bytes for 窶廡窶・ yields: E2 80 9C 46 E2 80 81 45.
E2 80 9C coincides with the UTF8 bytes for left double quotation mark, which is correct. However, E2 80 81 is em quad in UTF8, not the right double quotation mark I am expecting. Reinterpreting the whole byte sequence in UTF8 results in “F EType
No matter which encoding I use to derive the "original" bytes, and no matter what encoding I use to reinterpret said bytes, no combination seems to be able to successfully convert 窶廡窶・ to “F”.
Interestingly if I derive the UTF8 bytes for “F” Type, and purposely misinterpret those bytes as Shift-JIS, I get back 窶廡窶・Type
Encoding.GetEncoding(932).GetString(Encoding.UTF8.GetBytes("“F” Type"))
This leads me to believe that encoding can actually lead to data loss. I'm not well versed on encoding though, so could someone confirm whether my conclusion is correct, and if so, why this data loss occurs?
Yes, there are encodings that don't support all characters. The most common example is ASCIIEncoding, which replaces every character outside the standard ASCII range with ?.
...Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. … characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
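The same kind of loss occurs in the decoding direction: when a byte sequence is not valid in the encoding used to read it, the decoder substitutes a replacement character and the original bytes cannot be recovered from the resulting string. A quick way to test whether a given encoding preserves a given byte sequence is to decode and re-encode it and compare. The sketch below is an illustration with made-up names; it uses only encodings that ship with every .NET runtime, and the same check can be pointed at any other encoding (code page 932, for instance) where that encoding is available.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripCheck
{
    // True when decoding the bytes and encoding the result gives the same bytes back.
    public static bool RoundTrips(Encoding encoding, byte[] bytes)
    {
        string decoded = encoding.GetString(bytes);
        byte[] reencoded = encoding.GetBytes(decoded);
        return reencoded.SequenceEqual(bytes);
    }

    public static void Main()
    {
        byte[] valid = Encoding.UTF8.GetBytes("fa\u00E7ade");
        byte[] invalid = { 0x66, 0xFF, 0x61 }; // 0xFF never appears in well-formed UTF-8

        Console.WriteLine(RoundTrips(Encoding.UTF8, valid));   // True
        Console.WriteLine(RoundTrips(Encoding.UTF8, invalid)); // False: 0xFF became U+FFFD

        // ISO-8859-1 assigns a character to every byte value, so it is lossless here.
        Console.WriteLine(RoundTrips(Encoding.GetEncoding("ISO-8859-1"), invalid)); // True

        // Encoding direction: ASCII has no slot for U+00E7, so it is replaced.
        Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("fa\u00E7ade"))); // fa?ade
    }
}
```

A false result for a particular byte sequence means that sequence cannot be repaired after the fact from the decoded string alone.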

Parse byte as byte, not string

I receive some JSON from a Java third-party system that contains Avro schemas in JSON format. An example looks like this:
{"type":"record", "name":"AvroRecord", "namespace":"Parent.Namespace", "fields": [{"name":"AvroField", "type":"bytes", "default":"\u00FF"}]}
I parse this JSON to do some C# code generation. The result would look like this:
public partial class AvroRecord
{
    [AvroField(Name = "AvroField", Type = "bytes", DefaultValueText = "ÿ")]
    public byte[] AvroField { get; set; }

    public AvroRecord() { this.AvroField = new byte[] { 255 }; }
}
Eventually, from the C# representation of the schema, I need to infer back the original schema. Once I get that inferred schema, it will be sent over to the original system for comparison. That is why I want to keep the original string value for the default value, since I don't know if:
{"type":"record", "name":"AvroRecord", "namespace":"Parent.Namespace", "fields": [{"name":"AvroField", "type":"bytes", "default":"\u00FF"}]}
and
{"type":"record", "name":"AvroRecord", "namespace":"Parent.Namespace", "fields": [{"name":"AvroField", "type":"bytes", "default":"ÿ"}]}
will result in an exact match or it will have a problem.
I use JSON.NET to convert from the raw schema as a string to something more useful that I can work with:
JToken token = JToken.Parse(schema);
Is there a way in JSON.NET or any other JSON parsing library to control the parsing and copy a value without it being parsed? Basically, is there a way to stop "\u00FF" from becoming "ÿ"?
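On the comparison concern: in JSON, "\u00FF" and a literal "ÿ" are two spellings of the same one-character string, so a conforming parser yields the same value for both, and the escape is already gone by the time JToken.Parse returns. If the escaped spelling has to be reproduced when the schema is written back, one option is to re-escape on output. The sketch below uses only the base class library and rests on assumptions: the helper names are hypothetical, the bytes default is treated as one character per byte (code points 0 to 255, which is what the generated constructor above implies), and upper-case hex digits are chosen to match the example.

```csharp
using System;
using System.Text;

public static class AvroBytesDefault
{
    // Maps each character of the default text (expected range 0..255) to one byte.
    public static byte[] ToBytes(string defaultText)
    {
        var bytes = new byte[defaultText.Length];
        for (int i = 0; i < defaultText.Length; i++)
        {
            if (defaultText[i] > 0xFF)
                throw new ArgumentException("Character at index " + i + " is outside 0..255.");
            bytes[i] = (byte)defaultText[i];
        }
        return bytes;
    }

    // Writes the bytes as a quoted JSON string in which every byte is a \u00XX escape.
    public static string ToEscapedJsonString(byte[] bytes)
    {
        var sb = new StringBuilder("\"");
        foreach (byte b in bytes)
        {
            sb.Append("\\u").Append(b.ToString("X4"));
        }
        return sb.Append('"').ToString();
    }

    public static void Main()
    {
        byte[] bytes = ToBytes("\u00FF");
        Console.WriteLine(bytes[0]);                  // 255
        Console.WriteLine(ToEscapedJsonString(bytes)); // "\u00FF"
    }
}
```

Escaping every byte always produces valid JSON, but it will not reproduce a default that was originally written with plain ASCII letters; whether the receiving system compares schemas textually or by parsed value is something only that system can answer.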

Easiest way to use JSON.Net for both BSON and JSON?

I have some pieces of data that are byte arrays byte[], and I need to render them as base64 in JSON, but as raw byte arrays in BSON.
How can I easily do this in JSON.Net?
So far I have something like this:
class Data
{
    public byte[] Bytes { get; set; }
}
Converting to BSON is fine, but when converting to JSON, it is of course not base64 encoded and treated as a string
Hmm, using the following code with Json.Net 6.0.1, it appears to work just as you want with no special treatment: byte arrays are converted to base-64 strings and vice versa. Are you serializing your objects in a different way, or using an old version? If not, can you provide some code that demonstrates the problem?
string s = "Foo Bar Baz Quux";
Data data = new Data
{
Bytes = Encoding.UTF8.GetBytes(s)
};
string json = JsonConvert.SerializeObject(data);
Console.WriteLine(json);
data = JsonConvert.DeserializeObject<Data>(json);
Console.WriteLine(Encoding.UTF8.GetString(data.Bytes));
Output:
{"Bytes":"Rm9vIEJhciBCYXogUXV1eA=="}
Foo Bar Baz Quux
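For the BSON half of the question, the same Data class can be passed through the BSON writer and reader without any attributes or converters. This is a sketch under assumptions: it uses the BsonWriter and BsonReader classes from the Newtonsoft.Json.Bson namespace as found in Json.NET versions of that era; as far as I know, newer releases mark them obsolete in favour of a separate Newtonsoft.Json.Bson package. The helper names are made up.

```csharp
using System;
using System.IO;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson;

public class Data
{
    public byte[] Bytes { get; set; }
}

public static class BsonAndJson
{
    // Serialises to BSON; the byte[] is stored as binary data, not as Base64 text.
    public static byte[] ToBson(Data data)
    {
        var stream = new MemoryStream();
        using (var writer = new BsonWriter(stream))
        {
            new JsonSerializer().Serialize(writer, data);
        }
        return stream.ToArray(); // ToArray still works after the stream is closed
    }

    public static Data FromBson(byte[] bson)
    {
        using (var reader = new BsonReader(new MemoryStream(bson)))
        {
            return new JsonSerializer().Deserialize<Data>(reader);
        }
    }

    public static void Main()
    {
        var data = new Data { Bytes = Encoding.UTF8.GetBytes("Foo Bar Baz Quux") };

        byte[] bson = ToBson(data);
        Console.WriteLine(Encoding.UTF8.GetString(FromBson(bson).Bytes));

        // The same object through the JSON path gets the Base64 string shown above.
        Console.WriteLine(JsonConvert.SerializeObject(data));
    }
}
```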

How to convert binary string to bytes[] array?

Since Mozilla's btoa and atob aren't compatible with IE, I'm using Nick Galbreath's solution, which works across the board.
In my JS, I have this snippet:
reader.onload = function (e)
{
var base64str = e.target.result.split(';')[1].split(',')[1];
var binaryData = base64.decode(base64str);
// binaryData looks like: 3!1AQa"q2¡±B#$RÁb34rÑC%Sðáñcs5¢²&DTdE£t
// 6ÒUâeò³ÃÓuãóF'¤´ÄÔäô¥µÅÕåõVfv¦¶ÆÖæö7GWgw§·Ç×ç÷5!1AQaq"2¡±B#ÁRÑð
// 3$bárCScs4ñ%¢²&5ÂÒDT£dEU6teâò³ÃÓuãóF¤´ÄÔäô¥µÅÕåõVfv¦¶ÆÖæö'7GWgw
// §·ÇÿÚ?õTI%)$IJI$RIrÿ[múÙxÝ^«ÝKØrþk²ïÑûíGóß÷¿ÑþÄY«ÍÓ±×úN //...
// Is this even binary data?
Ajax.SendToHandler(binaryData);
}
How do I convert binaryData, which is sent to my ashx-derived IHttpHandler as a string, into a byte[] array?
Ask me to clarify where needed!
Your data string seems to contain only extended ASCII characters (probably either Windows-1252 characters or ISO 8859-1 characters). You should try using a System.Text.Encoding to convert it to bytes.
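On the server side this could look like the sketch below. It assumes the string arrives with every character still in the range 0 to 255 (one character per original byte); if the transport re-encodes the text along the way, that assumption may not hold, and the check in the loop will say so. The class and method names are hypothetical. The second helper shows an alternative that avoids the issue: send the Base64 text unchanged and decode it in the handler.

```csharp
using System;

public static class BinaryStringDecoder
{
    // Maps each character (expected range 0..255) to a single byte.
    public static byte[] FromBinaryString(string binary)
    {
        var bytes = new byte[binary.Length];
        for (int i = 0; i < binary.Length; i++)
        {
            char c = binary[i];
            if (c > 0xFF)
                throw new ArgumentException("Character at index " + i + " is outside 0..255.");
            bytes[i] = (byte)c;
        }
        return bytes;
    }

    // Alternative: skip the client-side decode and pass the Base64 text itself.
    public static byte[] FromBase64(string base64)
    {
        return Convert.FromBase64String(base64);
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(FromBinaryString("\u00FF\u00D8\u00FF"))); // FF-D8-FF
        Console.WriteLine(BitConverter.ToString(FromBase64("/9j/")));                     // FF-D8-FF
    }
}
```

Sending Base64 costs about a third more characters on the wire, but the payload is plain ASCII and survives any text transport unchanged.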

Bit Array to String and back to Bit Array

Possible Duplicate Converting byte array to string and back again in C#
I am using Huffman Coding for compression and decompression of some text from here
The code in there builds a huffman tree to use it for encoding and decoding. Everything works fine when I use the code directly.
In my situation, I need to get the compressed content, store it, and decompress it whenever needed.
The output from the encoder and the input to the decoder are BitArray.
When I tried to convert this BitArray to a string and back to a BitArray and decode it using the following code, I got a weird answer.
string input = Console.ReadLine();
Tree huffmanTree = new Tree();
huffmanTree.Build(input);
BitArray encoded = huffmanTree.Encode(input);
// Print the bits
Console.Write("Encoded Bits: ");
foreach (bool bit in encoded)
{
    Console.Write((bit ? 1 : 0) + "");
}
Console.WriteLine();
// Convert the bit array to bytes
Byte[] e = new Byte[(encoded.Length / 8 + (encoded.Length % 8 == 0 ? 0 : 1))];
encoded.CopyTo(e, 0);
// Convert the bytes to string
string output = Encoding.UTF8.GetString(e);
// Convert string back to bytes
e = Encoding.UTF8.GetBytes(output);
// Convert bytes back to bit array
BitArray todecode = new BitArray(e);
string decoded = huffmanTree.Decode(todecode);
Console.WriteLine("Decoded: " + decoded);
Console.ReadLine();
The Output of Original code from the tutorial is:
The Output of My Code is:
Where am I wrong friends? Help me, Thanks in advance.
You cannot stuff arbitrary bytes into a string. That concept is just undefined. Conversions happen using Encoding.
string output = Encoding.UTF8.GetString(e);
e is just binary garbage at this point, it is not a UTF8 string. So calling UTF8 methods on it does not make sense.
Solution: Don't convert and back-convert to/from string. This does not round-trip. Why are you doing that in the first place? If you need a string use a round-trippable format like base-64 or base-85.
I'm pretty sure Encoding doesn't roundtrip - that is you can't encode an arbitrary sequence of bytes to a string, and then use the same Encoding to get bytes back and always expect them to be the same.
If you want to be able to roundtrip from your raw bytes to string and back to the same raw bytes, you'd need to use base64 encoding e.g.
http://blogs.microsoft.co.il/blogs/mneiter/archive/2009/03/22/how-to-encoding-and-decoding-base64-strings-in-c.aspx
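A minimal sketch of the base-64 route for a BitArray follows. The names and the "bitCount:base64" text layout are my own choices for illustration, not part of the tutorial code. The bit count is stored alongside the data because a BitArray rebuilt from bytes is always a multiple of 8 bits long, and leftover padding bits could otherwise be decoded as extra symbols.

```csharp
using System;
using System.Collections;
using System.Globalization;

public static class BitArrayText
{
    // Packs the bits into bytes and returns "<bitCount>:<base64>".
    public static string ToText(BitArray bits)
    {
        var bytes = new byte[(bits.Length + 7) / 8];
        bits.CopyTo(bytes, 0);
        return bits.Length.ToString(CultureInfo.InvariantCulture) + ":" + Convert.ToBase64String(bytes);
    }

    // Reverses ToText and trims the padding bits added by the byte packing.
    public static BitArray FromText(string text)
    {
        int separator = text.IndexOf(':');
        int bitCount = int.Parse(text.Substring(0, separator), CultureInfo.InvariantCulture);
        byte[] bytes = Convert.FromBase64String(text.Substring(separator + 1));
        var bits = new BitArray(bytes);
        bits.Length = bitCount;
        return bits;
    }

    public static void Main()
    {
        var original = new BitArray(new[] { true, false, true, true, false, false, false, false, true, true });
        BitArray restored = FromText(ToText(original));

        bool same = original.Length == restored.Length;
        for (int i = 0; same && i < original.Length; i++)
        {
            same = original[i] == restored[i];
        }
        Console.WriteLine(same); // True
    }
}
```

In the question's code, the string returned by ToText would take the place of the UTF-8 string, and FromText would produce the BitArray handed to Decode.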
