I'm using binary serialization (BinaryFormatter) as a temporary mechanism to store state information in a file for a relatively complex (game) object structure. The files are coming out much larger than I expect, and my data structure includes recursive references, so I'm wondering whether the BinaryFormatter is actually storing multiple copies of the same objects, whether my basic "number of objects and values I should have" arithmetic is way off-base, or where else the excessive size is coming from.
Searching on stack overflow I was able to find the specification for Microsoft's binary remoting format:
http://msdn.microsoft.com/en-us/library/cc236844(PROT.10).aspx
What I can't find is any existing viewer that lets you "peek" into the contents of a BinaryFormatter output file: get object counts and total bytes for different object types in the file, etc.
I feel like this must be my "google-fu" failing me (what little I have) - can anyone help? This must have been done before, right??
UPDATE: I could not find it and got no answers so I put something relatively quick together (link to downloadable project below); I can confirm the BinaryFormatter does not store multiple copies of the same object but it does print quite a lot of metadata to the stream. If you need efficient storage, build your own custom serialization methods.
Because it may be of interest to someone, I decided to write this post about the question: what does the binary format of serialized .NET objects look like, and how can we interpret it correctly?
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I created a simple class called A which contains two properties, one string and one integer value, called SomeString and SomeValue.
Class A looks like this:
[Serializable()]
public class A
{
    public string SomeString { get; set; }
    public int SomeValue { get; set; }
}
For the serialization I used the BinaryFormatter of course:
BinaryFormatter bf = new BinaryFormatter();
// Write straight to a FileStream; the output is binary, so no text writer is needed.
using (FileStream fs = File.Create("test.txt"))
{
    bf.Serialize(fs, new A() { SomeString = "abc", SomeValue = 123 });
}
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
If we look at the serialized result in a hex editor, we get something like this:
Let us interpret the example result data:
According to the above mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf) every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration a value of 0 identifies the SerializationHeaderRecord which is specified in 2.6.1 SerializationHeaderRecord:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
RecordTypeEnum (1 byte)
RootId (4 bytes)
HeaderId (4 bytes)
MajorVersion (4 bytes)
MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId
01 00 00 00 represents the MajorVersion
00 00 00 00 represents the MinorVersion
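The 17-byte layout above maps directly onto a handful of BinaryReader calls. This is only a sketch: HeaderDemo and SerializationHeader are names I made up for illustration, and it assumes the buffer starts exactly at the first byte of the stream.

```csharp
using System;
using System.IO;

public static class HeaderDemo
{
    // Hypothetical container for the five fields listed above.
    public sealed class SerializationHeader
    {
        public byte RecordType;
        public int RootId, HeaderId, MajorVersion, MinorVersion;
    }

    public static SerializationHeader Read(byte[] data)
    {
        // BinaryReader reads multi-byte integers as little-endian,
        // which matches the byte order seen in the dump.
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            var header = new SerializationHeader();
            header.RecordType = reader.ReadByte();    // 00
            header.RootId = reader.ReadInt32();       // 01 00 00 00
            header.HeaderId = reader.ReadInt32();     // FF FF FF FF
            header.MajorVersion = reader.ReadInt32(); // 01 00 00 00
            header.MinorVersion = reader.ReadInt32(); // 00 00 00 00
            return header;
        }
    }
}
```

Fed the 17 bytes from the example, this should give RootId 1, HeaderId -1, MajorVersion 1 and MinorVersion 0.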
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
RecordTypeEnum (1 byte)
LibraryId (4 bytes)
LibraryName (variable number of bytes (which is a LengthPrefixedString))
As stated in 2.1.1.6 LengthPrefixedString...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:
0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 42 (hexadecimal; decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
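For anyone who wants to decode a LengthPrefixedString by hand, here is a minimal sketch. It assumes the usual 7-bits-per-byte scheme, where the high bit of each length byte says whether another length byte follows; the class name LengthPrefixedStringDemo is made up for this post.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefixedStringDemo
{
    public static string Read(Stream stream)
    {
        int length = 0;
        // At most 5 length bytes, each contributing its low 7 bits.
        for (int shift = 0; shift < 35; shift += 7)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            length |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) continue; // high bit set: another length byte follows

            // Length is complete: read exactly that many bytes and decode as UTF-8.
            byte[] buffer = new byte[length];
            int read = 0;
            while (read < length)
            {
                int n = stream.Read(buffer, read, length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }
            return Encoding.UTF8.GetString(buffer);
        }
        throw new InvalidDataException("Length prefix longer than 5 bytes.");
    }
}
```

In practice BinaryReader.ReadString uses the same kind of length prefix with UTF-8 by default, so in your own parser you can probably just call that instead.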
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
RecordTypeEnum (1 byte)
ClassInfo (variable number of bytes)
MemberTypeInfo (variable number of bytes)
LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo the record consists of:
ObjectId (4 bytes)
Name (variable number of bytes (which is again a LengthPrefixedString))
MemberCount(4 bytes)
MemberNames (which is a sequence of LengthPrefixedString's where the number of items MUST be equal to the value specified in the MemberCount field.)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class which is represented by using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A - so obviously I used StackOverFlow as name of the namespace.
02 00 00 00 represents the MemberCount; it tells us that 2 members, both represented as LengthPrefixedStrings, will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, which is 27 bytes long and results in something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
MemberTypeInfo:
After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 - MemberTypeInfo states, that the structure contains:
BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
AdditionalInfos (variable in length); depending on the BinaryTypeEnum, additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 represents the BinaryTypeEnumeration of the first member, according to 2.1.2.2 BinaryTypeEnumeration we can expect a String and it is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member, and again, according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it against the table in 2.1.2.3 PrimitiveTypeEnumeration, and notice that we can expect an Int32, which is represented by 4 bytes, as stated in a separate document about basic data types.
LibraryId:
After the MemberTypeInfo the LibraryId follows; it is represented by 4 bytes:
02 00 00 00 represents the LibraryId which is 2.
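Putting ClassInfo, MemberTypeInfo and LibraryId together, the body of a ClassWithMembersAndTypes record can be read roughly like this. It is a sketch under assumptions: ClassRecordDemo and ClassRecord are hypothetical names, only the Primitive (0) and String (1) member types from the table above are handled, and BinaryReader.ReadString is used for every LengthPrefixedString.

```csharp
using System;
using System.IO;

public static class ClassRecordDemo
{
    public sealed class ClassRecord
    {
        public int ObjectId;
        public string Name;
        public string[] MemberNames;
        public byte[] BinaryTypes;     // one BinaryTypeEnumeration value per member
        public byte?[] PrimitiveTypes; // PrimitiveTypeEnumeration, set only for Primitive members
        public int LibraryId;
    }

    // Expects the reader to be positioned just after the record type byte 05.
    public static ClassRecord Read(BinaryReader reader)
    {
        var record = new ClassRecord();

        // ClassInfo
        record.ObjectId = reader.ReadInt32();
        record.Name = reader.ReadString();
        int memberCount = reader.ReadInt32();
        record.MemberNames = new string[memberCount];
        for (int i = 0; i < memberCount; i++)
            record.MemberNames[i] = reader.ReadString();

        // MemberTypeInfo: all type bytes first, then the additional infos.
        record.BinaryTypes = reader.ReadBytes(memberCount);
        record.PrimitiveTypes = new byte?[memberCount];
        for (int i = 0; i < memberCount; i++)
        {
            switch (record.BinaryTypes[i])
            {
                case 0: record.PrimitiveTypes[i] = reader.ReadByte(); break; // Primitive
                case 1: break;                                               // String: no additional info
                default: throw new NotSupportedException("Member type not covered by this sketch.");
            }
        }

        record.LibraryId = reader.ReadInt32();
        return record;
    }
}
```

For the example stream this should yield ObjectId 1, the name StackOverFlow.A, the two backing field names, the types 01 and 00 with 08 as additional info, and LibraryId 2.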
The values:
As specified in 2.3 Class Records:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).
According to 2.5.7 BinaryObjectString it contains:
RecordTypeEnum (1 byte)
ObjectId (4 bytes)
Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you can remember that there was a second member, an Int32. Knowing that the Int32 is represented by using 4 bytes, we can conclude that
7B 00 00 00 must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
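The member values at the end of the stream can be read the same way. A small sketch, assuming the buffer starts at the 06 byte and that the Int32 is written without a record type byte of its own (which is how the bytes in this example look); MemberValuesDemo is a made-up name.

```csharp
using System;
using System.IO;

public static class MemberValuesDemo
{
    public static (int ObjectId, string Text, int Number, byte NextRecordType) Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            byte recordType = reader.ReadByte();
            if (recordType != 0x06)
                throw new InvalidDataException("Expected a BinaryObjectString record.");

            int objectId = reader.ReadInt32(); // 03 00 00 00
            string text = reader.ReadString(); // 03 61 62 63 -> "abc"
            int number = reader.ReadInt32();   // 7B 00 00 00 -> 123
            byte next = reader.ReadByte();     // record type of whatever follows
            return (objectId, text, number, next);
        }
    }
}
```

With the bytes 06 03 00 00 00 03 61 62 63 7B 00 00 00 0B this returns object ID 3, the string abc, the value 123, and 0B as the next record type.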
So here is the complete ClassWithMembersAndTypes record:
MessageEnd:
Finally the last byte 0B represents the MessageEnd record.
Vasiliy is right in that I will ultimately need to implement my own formatter/serialization process to better handle versioning and to output a much more compact stream (before compression).
I did want to understand what was happening in the stream, however, so I wrote up a (relatively) quick class that does what I wanted:
parses its way through the stream, building a collection of object names, counts and sizes
once done, outputs a quick summary of what it found - classes, counts and total sizes in the stream
It's not useful enough for me to put it somewhere visible like codeproject, so I just dumped the project in a zip file on my website: http://www.architectshack.com/BinarySerializationAnalysis.ashx
In my specific case it turns out that the problem was twofold:
The BinaryFormatter is VERY verbose (this is known, I just didn't realize the extent)
I did have issues in my class, it turned out I was storing objects that I didn't want
Hope this helps someone at some point!
Update: Ian Wright contacted me with a problem with the original code, where it crashed when the source object(s) contained "decimal" values. This is now corrected, and I've used the occasion to move the code to GitHub and give it a (permissive, BSD) license.
Our application works with massive amounts of data. It can take up 1-2 GB of RAM, like your game. We ran into the same "storing multiple copies of the same objects" problem. Binary serialization also stores too much metadata. When it was first implemented, the serialized file took about 1-2 GB. I have since managed to bring that down to 50-100 MB. Here is what we did.
The short answer: do not use the .NET binary serialization; create your own binary serialization mechanism. We have our own BinaryFormatter class and ISerializable interface (with two methods, Serialize and Deserialize).
The same object should not be serialized more than once. We save its unique ID and restore the object from a cache.
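To give an idea of the "serialize once, then write only the ID" approach, here is a minimal sketch. It is not our production code: ReferenceTrackingWriter, the marker byte and the ID scheme are all made up for illustration, and objects are tracked by reference identity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;

public sealed class ReferenceTrackingWriter
{
    // Compares by reference identity, ignoring any Equals/GetHashCode overrides.
    private sealed class ByReference : IEqualityComparer<object>
    {
        bool IEqualityComparer<object>.Equals(object x, object y) { return ReferenceEquals(x, y); }
        int IEqualityComparer<object>.GetHashCode(object obj) { return RuntimeHelpers.GetHashCode(obj); }
    }

    private readonly Dictionary<object, int> ids = new Dictionary<object, int>(new ByReference());
    private readonly BinaryWriter output;

    public ReferenceTrackingWriter(Stream stream)
    {
        output = new BinaryWriter(stream);
    }

    // Writes the payload the first time an instance is seen, and only its ID afterwards.
    public void Write(object instance, Action<BinaryWriter> writePayload)
    {
        int id;
        if (ids.TryGetValue(instance, out id))
        {
            output.Write(false); // marker: back-reference to an already written object
            output.Write(id);
        }
        else
        {
            id = ids.Count + 1;
            ids.Add(instance, id);
            output.Write(true);  // marker: full object follows
            output.Write(id);
            writePayload(output);
        }
        output.Flush();
    }
}
```

On the reading side you would keep the mirror image: a dictionary from ID to the already restored object, consulted whenever the back-reference marker is read.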
I can share some code if you ask.
EDIT: It seems you are correct. See the following code - it proves I was wrong.
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Item
{
    public string Data { get; set; }
}

[Serializable]
public class ItemHolder
{
    public Item Item1 { get; set; }
    public Item Item2 { get; set; }
}

public class Program
{
    public static void Main(params string[] args)
    {
        {
            // Both properties reference the same Item instance.
            Item item0 = new Item() { Data = "0000000000" };
            ItemHolder holderOneInstance = new ItemHolder() { Item1 = item0, Item2 = item0 };
            var fs0 = File.Create("temp-file0.txt");
            var formatter0 = new BinaryFormatter();
            formatter0.Serialize(fs0, holderOneInstance);
            fs0.Close();
            Console.WriteLine("One instance: " + new FileInfo(fs0.Name).Length); // 335
            //File.Delete(fs0.Name);
        }
        {
            // Each property references its own Item instance.
            Item item1 = new Item() { Data = "1111111111" };
            Item item2 = new Item() { Data = "2222222222" };
            ItemHolder holderTwoInstances = new ItemHolder() { Item1 = item1, Item2 = item2 };
            var fs1 = File.Create("temp-file1.txt");
            var formatter1 = new BinaryFormatter();
            formatter1.Serialize(fs1, holderTwoInstances);
            fs1.Close();
            Console.WriteLine("Two instances: " + new FileInfo(fs1.Name).Length); // 360
            //File.Delete(fs1.Name);
        }
    }
}
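Looks like BinaryFormatter recognizes repeated references to the same instance and writes that object only once (it tracks objects by reference identity, not by object.Equals).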
Have you ever looked inside the generated files? If you open "temp-file0.txt" and "temp-file1.txt" from the code example you'll see it has lots of meta data. That's why I recommended you to create your own serialization mechanism.
Sorry for being confusing.
Maybe you could run your program in debug mode and try adding a breakpoint.
If that is impossible due to the size of the game or other dependencies, you can always code a simple/small app that includes the deserialization code and peek from debug mode there.
Related
I'm communicating to a device that returns uuencoded data:
ASCII: EZQAEgETAhMQIBwIAUkAAABj
HEX: 45-5A-51-41-45-67-45-54-41-68-4D-51-49-42-77-49-41-55-6B-41-41-41-42-6A
The documentation for this device states the above is uuencoded but I can't figure out how to decode it. The final result won't be a human readable string but the first byte reveals the number of bytes for the following product data. (Which would be 23 or 24?)
I've tried using Crypt2 to decode it; it doesn't seem to match 644, 666, 744 modes.
I've tried to hand write it out following the Wiki: https://en.wikipedia.org/wiki/Uuencoding#Formatting_mechanism
Doesn't make sense! How do I decode this uuencoded data?
I agree with @canton7 that it looks like it's base64 encoded. You can decode it like this
byte[] decoded = Convert.FromBase64String("EZQAEgETAhMQIBwIAUkAAABj");
and if you want, you can print the hex values like this
Console.WriteLine(BitConverter.ToString(decoded));
which prints
11-94-00-12-01-13-02-13-10-20-1C-08-01-49-00-00-00-63
As @HansKilian says in the comments, this is not uuencoded.
If you base64-decode it you get (in hex):
11 94 00 12 01 13 02 13 10 20 1c 08 01 49 00 00 00 63
The first number, 17 in decimal, is the same as the number of bytes following it, which matches:
The final result won't be a human readable string but the first byte reveals the number of bytes for the following product data.
(@HansKilian made the original call that it was base64-encoded. This answer provides confirmation of that by looking at the first decoded byte, but please accept his answer)
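If you want the check on the first byte in code, something like this would do. It is a sketch that assumes the first decoded byte is always the count of the bytes after it, as the device documentation quoted above suggests; DevicePayloadDemo is a made-up name.

```csharp
using System;

public static class DevicePayloadDemo
{
    // Decodes the base64 text and returns the product data without the length byte.
    public static byte[] Decode(string text)
    {
        byte[] decoded = Convert.FromBase64String(text);

        // The first byte should equal the number of bytes that follow it.
        if (decoded.Length == 0 || decoded[0] != decoded.Length - 1)
            throw new FormatException("Length prefix does not match payload size.");

        byte[] payload = new byte[decoded.Length - 1];
        Array.Copy(decoded, 1, payload, 0, payload.Length);
        return payload;
    }
}
```

For the string in the question, Decode returns the 17 bytes from 94 through 63.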
Good day. I am trying to find a way to read the surrogate key state file to get what is its current value and how to change it. The problem being that the database is being constantly refreshed and I am needing a mechanism where I can get the max value from the table and then set the surrogate key state file.
From what I have been reading, it's not like the dataset (.ds) files, where you can use the DataStage Designer tool to read it. I tried making a small C# application that reads it as a binary file. Various articles explain that it is an unsigned 64-bit integer. Still, when I try to read it, it gives a random set of numbers. It starts with one, then numbers ending in 999, and then it repeats. I tried reading it with the BitConverter class but had no luck either.
So far the only solution I have seen is to create a parallel or sequential job that gets the max number from the database and then creates the surrogate key with it as explained in http://it.toolbox.com/blogs/infosphere/datastage-8-tutorial-surrogate-key-state-files-17403.
I am not the first one to try changing it through code and was curious if there was some way to do it.
using DataStage 8.7
Tried with C# BinaryReader.ReadUInt64, BinaryReader.ReadInt64 and BitConverter.ToUInt64
Update 2016-10-19:
The partial answer is that it can be read as a binary file. It is divided into 4 sets of 8 bytes, something like this (you can see it with a hex editor):
01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 05
00 00 00 00 00 00 00 08
00 00 00 00 00 00 00 08
The first set I think is the incremental number (+1, +5, etc)
Second Set is the initial value
Third set is the next number to assign
Fourth set I think is the end of the batch to assign. If you are doing 10 by 10 batches then third is 10 and fourth is 20 or that is how I think it works.
So for reading it by code you need to read it with a binary reader and get sets of 8 bytes to convert to UINT64.
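Here is a sketch of that reading step. Since I am not sure about the byte order (the first group in the dump looks little-endian while the others look big-endian), the method takes the order as a parameter so both interpretations can be compared. StateFileDemo and ReadGroups are names made up for this example.

```csharp
using System;
using System.IO;

public static class StateFileDemo
{
    // Splits the file content into 8-byte groups and converts each to a UInt64.
    public static ulong[] ReadGroups(byte[] content, bool bigEndian)
    {
        if (content.Length % 8 != 0)
            throw new InvalidDataException("Expected a multiple of 8 bytes.");

        var values = new ulong[content.Length / 8];
        for (int i = 0; i < values.Length; i++)
        {
            var chunk = new byte[8];
            Array.Copy(content, i * 8, chunk, 0, 8);

            // BitConverter uses the machine's byte order, so reverse the chunk
            // whenever that differs from the order requested by the caller.
            if (BitConverter.IsLittleEndian == bigEndian)
                Array.Reverse(chunk);

            values[i] = BitConverter.ToUInt64(chunk, 0);
        }
        return values;
    }
}
```

You would call it with the result of File.ReadAllBytes on the state file, once with bigEndian set to true and once with false, and see which interpretation gives sensible numbers for each group.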
The question still stands because I am not sure what they mean.
Why do you think you need the Surrogate Key Generator stage?
An alternative and much simpler solution would be a Transformer stage to do the numbering and a Sequential File (or parameter) to initialize it. The new number, after processing, could then be written back to the database sequence. So you just need to handle flat files with no programming.
To generate unique numbers in a transformer (parallel) you have to consider the partitions - this formula would do
(@NUMPARTITIONS * (@INROWNUM - 1)) + @PARTITIONNUM + Max_Field1
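To see why the formula above gives unique numbers across partitions, here is the same arithmetic as a small C# sketch. The assumptions are mine: the row number is taken to start at 1 within each partition and the partition number at 0, and PartitionKeyDemo is a made-up name.

```csharp
using System;

public static class PartitionKeyDemo
{
    // Same arithmetic as the transformer formula, with the DataStage variables
    // for partition count, row number and partition number passed in as parameters.
    public static long Key(int numPartitions, long inRowNum, int partitionNum, long maxField1)
    {
        return (long)numPartitions * (inRowNum - 1) + partitionNum + maxField1;
    }
}
```

With 4 partitions and Max_Field1 = 100, partition 0 produces 100, 104, 108, ... and partition 1 produces 101, 105, 109, ..., so the partitions interleave without overlapping. Under these assumptions the very first key equals Max_Field1 itself, so you may want to pass in the maximum plus one.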
The reason for all this is that I am looking for a bug in the Surrogate Key State File.
So when you go to the complex transformer (sorry there are no pictures), you will go to the properties and see it has a Surrogate Key Tab. You have three settings. One is for the file, one is for the initial value and another is for the block size.
The file is where current surrogate key is stored. I will explain soon how it is formatted. The Initial value is from what number you wish to start in, and the block size is to reserve a group of numbers for your transformer.
The file is formatted in 16-byte increments. The first value is the current number (the number to assign is this number + 1), and the second is the end of the block. It will only be 16 bytes when you do not define the initial value or the block size; if you define these, it will be 32 bytes, where the last two values are the current number and the block end.
So when you have two or more transformers using the same file, it will assign a block that has numbers available before getting a new block, increasing the file size by 16 more bytes if needed.
So what was the error? When you do not define the block size but do define an initial value, the system block size will be around 1000 or so. Let's say you do a small example where all you have is a Row Generator connected to a transformer that ends in a sequential file. All you need is one row. Execute it many times, and let's say your initial value is 200. It will be 200,201,203,204,(1),205. For some reason it bugs in DataStage 8.7, and when you do not define the block size it goes back to one.
I hope this research on the subject will help someone because I looked and looked and there was not much on how to best use Surrogate Keys.
If you wish for the error to happen faster, just delete the file and create a new one with C#, writing 4 UINT64 values saved as bytes: the first two values are 1 and 1, followed by 200 and 300. Eventually it will do what I described.
I am getting some serialized .NET class string data from a source and I just need to turn it into something readable in PHP. Doesn't necessarily have to be turned into an "object" or JSON but I need to read it somehow. I think the .NET string is just a class with some set properties but it is binary and not portable obviously. I'm not looking to convert .NET code to PHP code. Here is an example of the data:
U:?�S�#��-��v�Y��?������An�#AMAUI������
I realize this is actually binary and not printable text. I'm just using this as an example of what I see when catting the file.
Short answer:
I would really suggest NOT implementing the interpretation of the binary representation yourself. I would use another format instead (JSON, XML, etc.).
Long answer:
However, if this is not possible, there is of course a way.
The actual question is: what does the binary format of serialized .NET objects look like, and how can we interpret it correctly? The record-by-record walkthrough of the serialized example class A given earlier on this page, based on the .NET Remoting: Binary Format Data Structure specification, answers exactly that and applies here unchanged.
I have a task to take millions of floats and store them in the database in batches of 5,000, as binary. This is forcing me to learn interesting things about serialization performance.
One of the things that surprises me is the size of the serialized data, which is a factor of ten above what I expected. This test shows me that a four-byte float is serialized to 55 bytes and an eight-byte double to 59 bytes.
What is happening here? I expected it to simply split the float value into its four bytes. What are the other 51 bytes?
private void SerializeFloat()
{
    Random rnd = new Random();
    IFormatter iFormatter = new BinaryFormatter();
    using (MemoryStream memoryStream = new MemoryStream(10000000))
    {
        memoryStream.Capacity = 0;
        iFormatter.Serialize(memoryStream, (Single)rnd.NextDouble());
        iFormatter.Serialize(memoryStream, rnd.NextDouble());
    }
}
Serialization is more than simply blitting bits and bytes to a stream. Serialization is structured output. This structure accounts for your actual differences. The Framework encodes additional information which lets it know the type and number of objects in the serialized data, among many other possibilities. It is an implementation detail best left alone.
If you need unstructured output, you could use BinaryWriter instead.
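As a rough illustration of the unstructured approach, here is a minimal sketch that writes a batch of floats with BinaryWriter and reads them back. The names RawFloatDemo, Pack and Unpack are made up for this example; only BinaryWriter, BinaryReader and MemoryStream come from the framework. It assumes you are happy to track the layout yourself, since nothing in the output describes it.

```csharp
using System;
using System.IO;

static class RawFloatDemo
{
    // Writes each float as its raw 4-byte representation, with no type metadata.
    public static byte[] Pack(float[] values)
    {
        using (var stream = new MemoryStream(values.Length * sizeof(float)))
        using (var writer = new BinaryWriter(stream))
        {
            foreach (float value in values)
            {
                writer.Write(value); // BinaryWriter.Write(float) emits exactly 4 bytes
            }
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads the floats back. The caller has to know the data is a plain run of floats.
    public static float[] Unpack(byte[] data)
    {
        var values = new float[data.Length / sizeof(float)];
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            for (int i = 0; i < values.Length; i++)
            {
                values[i] = reader.ReadSingle();
            }
        }
        return values;
    }

    static void Main()
    {
        var batch = new float[5000];
        for (int i = 0; i < batch.Length; i++)
        {
            batch[i] = i * 0.5f;
        }

        byte[] packed = Pack(batch);
        Console.WriteLine(packed.Length); // 20000, i.e. 4 bytes per float
    }
}
```

With this layout a batch of 5,000 floats takes 20,000 bytes, at the cost of losing the self-describing structure that BinaryFormatter adds.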
Because this may be of interest to someone, I decided to write this post about what the binary format of serialized .NET objects looks like and how we can interpret it correctly.
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I created a simple class called A which contains two properties, one string and one integer, called SomeString and SomeValue.
Class A looks like this:
[Serializable()]
public class A
{
    public string SomeString
    {
        get;
        set;
    }
    public int SomeValue
    {
        get;
        set;
    }
}
For the serialization I used the BinaryFormatter of course:
BinaryFormatter bf = new BinaryFormatter();
// BinaryFormatter writes raw bytes, so a plain FileStream is enough (no text writer needed).
using (FileStream fs = File.Create("test.txt"))
{
    bf.Serialize(fs, new A() { SomeString = "abc", SomeValue = 123 });
}
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
If we look at the serialized result in a hex editor, we get something like this:
Let us interpret the example result data:
According to the above-mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf), every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration a value of 0 identifies the SerializationHeaderRecord which is specified in 2.6.1 SerializationHeaderRecord:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
RecordTypeEnum (1 byte)
RootId (4 bytes)
HeaderId (4 bytes)
MajorVersion (4 bytes)
MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId
01 00 00 00 represents the MajorVersion
00 00 00 00 represents the MinorVersion
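The 17 bytes above can be read back with a BinaryReader, which reads integers in little-endian order just like the stream. This is only a sketch: HeaderRecordDemo and ReadHeader are hypothetical names, and the byte array is typed in from the values listed above.

```csharp
using System;
using System.IO;

static class HeaderRecordDemo
{
    // Returns { RootId, HeaderId, MajorVersion, MinorVersion } from a 17-byte header record.
    public static int[] ReadHeader(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            byte recordType = reader.ReadByte();
            if (recordType != 0x00)
            {
                throw new InvalidDataException("Not a SerializationHeaderRecord.");
            }

            int rootId = reader.ReadInt32();
            int headerId = reader.ReadInt32();
            int majorVersion = reader.ReadInt32();
            int minorVersion = reader.ReadInt32();
            return new[] { rootId, headerId, majorVersion, minorVersion };
        }
    }

    static void Main()
    {
        byte[] header =
        {
            0x00,                   // RecordTypeEnum
            0x01, 0x00, 0x00, 0x00, // RootId
            0xFF, 0xFF, 0xFF, 0xFF, // HeaderId
            0x01, 0x00, 0x00, 0x00, // MajorVersion
            0x00, 0x00, 0x00, 0x00  // MinorVersion
        };

        int[] fields = ReadHeader(header);
        Console.WriteLine("RootId={0}, HeaderId={1}, MajorVersion={2}, MinorVersion={3}",
            fields[0], fields[1], fields[2], fields[3]);
    }
}
```

Read as a signed 32-bit integer, FF FF FF FF comes out as -1.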
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
RecordTypeEnum (1 byte)
LibraryId (4 bytes)
LibraryName (variable number of bytes (which is a LengthPrefixedString))
As stated in 2.1.1.6 LengthPrefixedString...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:
0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
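To make the variable-length prefix more concrete, here is a small sketch of how such a length can be encoded and decoded 7 bits at a time, with the high bit of each byte signalling that another byte follows. The helper names (LengthPrefixDemo, EncodeLength, DecodeLength) are invented for this example, and the code assumes non-negative lengths only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class LengthPrefixDemo
{
    // Encodes a length 7 bits at a time, lowest group first; the high bit means "more bytes follow".
    public static byte[] EncodeLength(int length)
    {
        var bytes = new List<byte>();
        uint remaining = (uint)length;
        while (remaining >= 0x80)
        {
            bytes.Add((byte)((remaining & 0x7F) | 0x80));
            remaining >>= 7;
        }
        bytes.Add((byte)remaining);
        return bytes.ToArray();
    }

    // Decodes a length prefix from the start of data and reports how many bytes it occupied.
    public static int DecodeLength(byte[] data, out int bytesUsed)
    {
        int length = 0;
        int shift = 0;
        bytesUsed = 0;
        byte current;
        do
        {
            current = data[bytesUsed++];
            length |= (current & 0x7F) << shift;
            shift += 7;
        } while ((current & 0x80) != 0);
        return length;
    }

    static void Main()
    {
        string libraryName = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null";
        int byteCount = Encoding.UTF8.GetByteCount(libraryName);

        Console.WriteLine(byteCount);                                      // 66
        Console.WriteLine(BitConverter.ToString(EncodeLength(byteCount))); // 42
        Console.WriteLine(BitConverter.ToString(EncodeLength(200)));       // C8-01
    }
}
```

Any length below 128 fits into a single byte, which is why every string in this example has a one-byte prefix.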
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
RecordTypeEnum (1 byte)
ClassInfo (variable number of bytes)
MemberTypeInfo (variable number of bytes)
LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo the record consists of:
ObjectId (4 bytes)
Name (variable number of bytes (which is again a LengthPrefixedString))
MemberCount(4 bytes)
MemberNames (which is a sequence of LengthPrefixedString's where the number of items MUST be equal to the value specified in the MemberCount field.)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class which is represented by using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A - so obviously I used StackOverFlow as name of the namespace.
02 00 00 00 represents the MemberCount. It tells us that 2 members, both represented as LengthPrefixedStrings, will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, which is 27 bytes long and results in something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
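The ClassInfo bytes walked through above can be decoded with a few lines of code. This sketch relies on BinaryReader.ReadString, which reads a 7-bit-encoded length prefix followed by the string bytes in the reader's encoding (UTF-8 here), so it lines up with LengthPrefixedString. ClassInfoDemo, FromHex and Parse are hypothetical names, and the hex text is copied from the bytes quoted in this section.

```csharp
using System;
using System.IO;
using System.Text;

static class ClassInfoDemo
{
    public sealed class ClassInfo
    {
        public int ObjectId;
        public string Name;
        public string[] MemberNames;
    }

    // Turns "01 00 0F ..." into a byte array.
    public static byte[] FromHex(string hex)
    {
        string[] tokens = hex.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var bytes = new byte[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
        {
            bytes[i] = Convert.ToByte(tokens[i], 16);
        }
        return bytes;
    }

    // Reads ObjectId, Name, MemberCount and MemberNames in that order.
    public static ClassInfo Parse(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.UTF8))
        {
            var info = new ClassInfo();
            info.ObjectId = reader.ReadInt32();
            info.Name = reader.ReadString();
            int memberCount = reader.ReadInt32();
            info.MemberNames = new string[memberCount];
            for (int i = 0; i < memberCount; i++)
            {
                info.MemberNames[i] = reader.ReadString();
            }
            return info;
        }
    }

    static void Main()
    {
        byte[] data = FromHex(
            "01 00 00 00 " +
            "0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 " +
            "02 00 00 00 " +
            "1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 " +
            "1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64");

        ClassInfo info = Parse(data);
        Console.WriteLine(info.ObjectId);                      // 1
        Console.WriteLine(info.Name);                          // StackOverFlow.A
        Console.WriteLine(string.Join(", ", info.MemberNames)); // <SomeString>k__BackingField, <SomeValue>k__BackingField
    }
}
```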
MemberTypeInfo:
After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 MemberTypeInfo states that the structure contains:
BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
AdditionalInfos (variable in length); depending on the BinaryTypeEnum, additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 represents the BinaryTypeEnumeration of the first member. According to 2.1.2.2 BinaryTypeEnumeration, this means we can expect a String, which is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member and, again according to the specification, identifies a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, and match it with the table in 2.1.2.3 PrimitiveTypeEnumeration. Unsurprisingly, we find that we can expect an Int32, which is represented by 4 bytes, as stated in another document about basic data types.
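The two-pass layout of MemberTypeInfo (all BinaryTypeEnumeration bytes first, then the AdditionalInfos) can be sketched as follows. This is deliberately incomplete: it only handles the two rows of the table above (Primitive and String) and throws for everything else. MemberTypeInfoDemo and ReadMemberTypes are made-up names.

```csharp
using System;
using System.IO;

static class MemberTypeInfoDemo
{
    // Describes each member's type, given the raw MemberTypeInfo bytes and the MemberCount from ClassInfo.
    public static string[] ReadMemberTypes(byte[] data, int memberCount)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            // First pass: one BinaryTypeEnumeration byte per member.
            byte[] binaryTypes = reader.ReadBytes(memberCount);

            // Second pass: AdditionalInfos, present only for some member types.
            var descriptions = new string[memberCount];
            for (int i = 0; i < memberCount; i++)
            {
                switch (binaryTypes[i])
                {
                    case 0: // Primitive: one PrimitiveTypeEnumeration byte follows
                        byte primitiveType = reader.ReadByte();
                        descriptions[i] = primitiveType == 8
                            ? "Primitive (Int32)"
                            : "Primitive (type " + primitiveType + ")";
                        break;
                    case 1: // String: no additional info
                        descriptions[i] = "String";
                        break;
                    default:
                        throw new NotSupportedException("BinaryTypeEnumeration " + binaryTypes[i] + " is not handled in this sketch.");
                }
            }
            return descriptions;
        }
    }

    static void Main()
    {
        byte[] memberTypeInfo = { 0x01, 0x00, 0x08 };
        foreach (string description in ReadMemberTypes(memberTypeInfo, 2))
        {
            Console.WriteLine(description);
        }
        // Prints "String" and then "Primitive (Int32)".
    }
}
```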
LibraryId:
After the MemberTypeInfo comes the LibraryId, which is represented by 4 bytes:
02 00 00 00 represents the LibraryId which is 2.
The values:
As specified in 2.3 Class Records:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).
According to 2.5.7 BinaryObjectString it contains:
RecordTypeEnum (1 byte)
ObjectId (4 bytes)
Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you remember that there was a second member, an Int32. Knowing that an Int32 is represented using 4 bytes, we can conclude that the four bytes before the final record, 7B 00 00 00 in little-endian order, must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
So here is the complete ClassWithMembersAndTypes record:
MessageEnd:
Finally the last byte 0B represents the MessageEnd record.
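Putting the last records together, the tail of the stream can be read like this. It is a sketch under two assumptions: the member order is already known from ClassInfo (a string, then an Int32), and the Int32 bytes are 7B 00 00 00, which is inferred here from the value 123 and the little-endian layout. ValuesDemo and ReadValues are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class ValuesDemo
{
    // Reads a BinaryObjectString, an untyped Int32 member value and the MessageEnd record.
    public static void ReadValues(byte[] data, out int stringObjectId, out string someString, out int someValue)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.UTF8))
        {
            if (reader.ReadByte() != 0x06)
            {
                throw new InvalidDataException("Expected a BinaryObjectString record.");
            }
            stringObjectId = reader.ReadInt32();
            someString = reader.ReadString(); // length prefix, then UTF-8 bytes

            // The Int32 value has no record type byte of its own.
            someValue = reader.ReadInt32();

            if (reader.ReadByte() != 0x0B)
            {
                throw new InvalidDataException("Expected the MessageEnd record.");
            }
        }
    }

    static void Main()
    {
        byte[] tail =
        {
            0x06,                   // BinaryObjectString
            0x03, 0x00, 0x00, 0x00, // ObjectId
            0x03, 0x61, 0x62, 0x63, // "abc" with its length prefix
            0x7B, 0x00, 0x00, 0x00, // 123 (assumed byte layout, see above)
            0x0B                    // MessageEnd
        };

        int objectId;
        string text;
        int number;
        ReadValues(tail, out objectId, out text, out number);
        Console.WriteLine("{0} {1} {2}", objectId, text, number); // 3 abc 123
    }
}
```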
Binary serialization is type safe. It makes sure that when you deserialize the data, you'll get the exact same object back.
To make that work, BinaryFormatter adds additional data about the types of the objects that you serialize. That is the extra overhead you are seeing. You can inspect it by serializing to a FileStream and looking at the generated file with a hex viewer. You'll see strings in there, like "System.Single", the type name, and "m_value", the name of the field where the value is stored. A good way to cut down on the overhead is to, say, serialize an array instead, so the type information is written once rather than once per value.
BinaryWriter is the exact opposite, very compact but not type-safe. Plenty of alternatives are available in between.
.NET serialization throws in a bunch of information other than the actual 8 bytes of your double (type information, etc.). You could use a file Stream and write the bytes returned by BitConverter.GetBytes(double), or use the BinaryWriter class.
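As a minimal sketch of the BitConverter route: a single double becomes exactly 8 bytes, and a whole array can be copied in one go with Buffer.BlockCopy. RawDoubleDemo, ToBytes and FromBytes are names made up for this example, and the byte order is simply whatever the machine uses.

```csharp
using System;

static class RawDoubleDemo
{
    // Copies the doubles into a byte array, 8 bytes each, in the machine's native byte order.
    public static byte[] ToBytes(double[] values)
    {
        var raw = new byte[values.Length * sizeof(double)];
        Buffer.BlockCopy(values, 0, raw, 0, raw.Length);
        return raw;
    }

    // Reverses ToBytes; the length must be a multiple of 8.
    public static double[] FromBytes(byte[] raw)
    {
        var values = new double[raw.Length / sizeof(double)];
        Buffer.BlockCopy(raw, 0, values, 0, values.Length * sizeof(double));
        return values;
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.GetBytes(3.14).Length); // 8
        Console.WriteLine(ToBytes(new double[5000]).Length);   // 40000
    }
}
```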
There are many alternatives to .NET serialization:
Text formats
XML
JSON
Binary formats
Google Protocol Buffers
MessagePack
These all have their pros and cons. I especially like MessagePack and encourage you to take a look at it. For example, it will use 9 bytes to store a self-describing double.
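To show where the 9 bytes come from, here is a hand-rolled sketch of how a double looks in the MessagePack format: one type byte (0xCB, the float 64 marker) followed by the 8 value bytes in big-endian order. This does not use any MessagePack library; MsgPackDoubleDemo and EncodeDouble are invented names for illustration only.

```csharp
using System;

static class MsgPackDoubleDemo
{
    // Encodes a double as a MessagePack float 64: 0xCB followed by 8 big-endian bytes.
    public static byte[] EncodeDouble(double value)
    {
        byte[] raw = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(raw); // MessagePack stores numbers big-endian
        }

        var encoded = new byte[9];
        encoded[0] = 0xCB;
        Array.Copy(raw, 0, encoded, 1, raw.Length);
        return encoded;
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(EncodeDouble(1.0))); // CB-3F-F0-00-00-00-00-00-00
    }
}
```

The single leading byte is all the type information that gets stored, compared to the type and field names that BinaryFormatter writes.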
How to convert a simple string to a null-terminated one?
Example:
Example string: "Test message"
Here are the bytes:
54 65 73 74 20 6D 65 73 73 61 67 65
I need a string whose bytes look like this:
54 00 65 00 73 00 74 00 20 00 6D 00 65 00 73 00 73 00 61 00 67 00 65 00 00
I could use loops, but the code would be too ugly. How can I do this conversion with built-in methods?
It looks like you want a null-terminated Unicode string. If the string is stored in a variable str, this should work:
var bytes = System.Text.Encoding.Unicode.GetBytes(str + "\0");
Note that the resulting array will have three zero bytes at the end. This is because Encoding.Unicode (UTF-16, little-endian) represents each of these characters using two bytes. The first zero is the second half of the last character in the original string, and the next two are how UTF-16 encodes the null character '\0'. (In other words, my code produces one more zero byte at the end than what you originally specified, but this is probably what you actually want.)
A little background on C# strings is a good place to start.
The internal structure of a C# string is different from a C string.
a) It is Unicode (UTF-16), as is a 'char'
b) It is not null terminated
c) It includes many utility functions that in C/C++ you would need a library for.
How does it get away with no null termination? Simple! Internally a C# String manages a sequence of chars and keeps track of its own length, rather than being a bare pointer (as in C/C++). The null termination in C/C++ is required so that string utility functions like strcmp() are able to detect the end of the string in memory.
The null character does exist in C#.
string content = "This is a message!" + '\0';
This will give you a string that ends with a null terminator. Importantly, the null character is invisible and will not show up in any output. It will show in the debug windows. It will also be present when you convert the string to a byte array (for saving to disk and other IO operations), but if you do Console.WriteLine(content) it will not be visible.
You should understand why you want that null terminator, and why you want to avoid using a loop construct to get what you are after. A null-terminated string is fairly useless in C# unless you end up converting it to a byte array. Generally you will only do that if you want to send your string to a native method, over a network or to a USB device.
It is also important to be aware of how you are getting your bytes. In C/C++, a char is stored as 1 byte (8 bits) and the encoding is typically ANSI. In C#, a char is a 2-byte (16-bit) UTF-16 code unit. Jon Skeet's answer shows you how to get the bytes in Unicode.
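The difference between the encodings is easy to see by counting bytes. The following sketch appends the null character and encodes the result both ways; NullTerminatedDemo and ToNullTerminatedBytes are hypothetical names, and Encoding.ASCII stands in for the single-byte case.

```csharp
using System;
using System.Text;

static class NullTerminatedDemo
{
    // Appends '\0' and converts the result to bytes using the given encoding.
    public static byte[] ToNullTerminatedBytes(string text, Encoding encoding)
    {
        return encoding.GetBytes(text + '\0');
    }

    static void Main()
    {
        string text = "Test message";

        byte[] utf16 = ToNullTerminatedBytes(text, Encoding.Unicode);
        byte[] ascii = ToNullTerminatedBytes(text, Encoding.ASCII);

        Console.WriteLine(utf16.Length);                 // 26: two bytes per char, including the terminator
        Console.WriteLine(ascii.Length);                 // 13: one byte per char, including the terminator
        Console.WriteLine(BitConverter.ToString(ascii)); // 54-65-73-74-20-6D-65-73-73-61-67-65-00
    }
}
```

Which encoding is right depends on what the receiving side expects, so it is worth checking before picking one.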
Tongue in cheek but potentially useful answer.
If you are after output on your screen in hex as you have shown, you want to follow these steps:
Convert the string (with the null character '\0' on the end) to a byte array
Convert each byte to its hexadecimal string representation
Interleave with spaces
Print to screen
Try this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace stringlulz
{
    class Program
    {
        static void Main(string[] args)
        {
            string original = "Test message";
            byte[] bytes = System.Text.Encoding.Unicode.GetBytes(original + '\0');
            var output = bytes.Aggregate(new StringBuilder(), (s, p) => s.Append(p.ToString("x2") + ' '), s => { s.Length--; return s; });
            Console.WriteLine(output.ToString().ToUpper());
            Console.ReadLine();
        }
    }
}
The output is:
54 00 65 00 73 00 74 00 20 00 6D 00 65 00 73 00 73 00 61 00 67 00 65 00 00 00
Here's a tested C# sample that null-terminates an XML command; it works great.
strCmd = @"<?xml version=""1.0"" encoding=""utf-8""?><Command name=""SerialNumber"" />";
sendB = System.Text.Encoding.UTF8.GetBytes(strCmd + "\0");
sportin.Send = sendB;