I'm analyzing a memory dump of a .NET process using WinDbg and I've noticed it reports the size of all System.Int32 variables on the heap as 24 bytes. Here's an example of a relevant DumpObj call on one of the variables:
0:000> !DumpObj /d 00000061c81c0e80
Name: System.Int32
MethodTable: 00007fff433f37c8
EEClass: 00007fff42e30130
Size: 24(0x18) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007fff433f37c8 4000456 8 System.Int32 1 instance 141 m_value
As far as I know, the size of System.Int32 is supposed to be 4 bytes. What is the source of this discrepancy?
There's an overhead to any object on the heap. On the 32-bit Microsoft .NET runtime this is 8 bytes, and on 64-bit it is 16 bytes (disclaimer: this isn't strictly contractual and may change in the future or in a conforming implementation of a .NET runtime).
Since your int is boxed, it carries the 16-byte overhead, so you might expect 20 bytes to be used in total. However, on 64-bit systems objects (and structures) are padded to 8-byte boundaries, so you actually get 24 bytes per int.
In contrast, an object holding 16 integers (a boxed struct, say) uses only 16 + 4 * 16 = 80 bytes of memory, which works out to 5 bytes per integer.
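The arithmetic above can be sketched as a small calculator. It is written in Python rather than C# only to keep it short; `heap_size` is a made-up name, and the default overhead and alignment are the 64-bit implementation details described above, not guarantees. The x86 call assumes 4-byte alignment, which is an assumption on my part.

```python
def heap_size(payload_bytes, overhead=16, alignment=8):
    """Estimated heap footprint: per-object overhead plus payload,
    rounded up to the next multiple of the alignment."""
    raw = overhead + payload_bytes
    return (raw + alignment - 1) // alignment * alignment

# Boxed Int32 on x64: 16 + 4 = 20, padded up to 24.
boxed_int = heap_size(4)

# One object holding 16 ints on x64: 16 + 64 = 80, already aligned.
sixteen_ints = heap_size(4 * 16)

# x86 (assumed: 8-byte overhead, 4-byte alignment).
boxed_int_x86 = heap_size(4, overhead=8, alignment=4)

print(boxed_int, sixteen_ints, sixteen_ints / 16, boxed_int_x86)
```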
And again, most of this is an implementation detail, so not something you can rely on; it's perfectly possible for a valid .NET runtime to store a single int in 1 MiB of memory if it desires to do so, and it could also store it in some compact representation or with interning, as long as it conforms to all the contractual behaviour of the type. It's also quite simplified even compared to actual MS runtime implementations - for example, if your object gets large enough, it will need more overhead.
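That is not the size of the Int32 itself. Run dd or dq on the address and you will see your Int32 value sitting in the second DWORD (x86) or second QWORD (x64), right after the method table pointer.
Once the per-object overhead is included, a boxed Int32 occupies 12 bytes on x86 and 24 bytes on x64. The x86 case looks like this: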
0:004> .shell -ci "!DumpObj /d 01c72360" grep -i size
Size: 12(0xc) bytes
.shell: Process exited
0:004> dd 01c72360 l4
01c72360 5890c770 000001b5 80000000 5890afb0
0:004> .shell -ci "!DumpObj /d 01c72360" grep -i method
MethodTable: 5890c770
.shell: Process exited
0:004> .shell -ci "!DumpObj /d 01c72360" grep -i value
MT Field Offset Type VT Attr Value Name
5890c770 400044f 4 System.Int32 1 instance 437 m_value
.shell: Process exited
0:004> ? 1b5
Evaluate expression: 437 = 000001b5
Leaving Int32 aside, let's dissect the wide string "stream" on x86:
actualsizereqdfor(L"stream\0") = 7 * sizeof(wchar_t) == 7 * 2 == 0n14;
sizeof(method table pointer) == 0n04;
sizeof(m_stringLength) == 0n04;
sizeof(remaining 4 bytes, most likely the object header that precedes the method table pointer) == 0n04;
so total size == 0n26
Result of DumpObj:
0:004> !DumpObj /d 01c73ad0
Name: System.String
MethodTable: 5890afb0
Size: 26(0x1a) bytes
String: stream
Fields:
MT Field Offset Type VT Attr Value Name
5890c770 40000aa 4 System.Int32 1 instance 6 m_stringLength
5890b9a8 40000ab 8 System.Char 1 instance 73 m_firstChar
5890afb0 40000ac c System.String 0 shared static Empty
raw display
0:004> db 01c73ad0 l1a
01c73ad0 b0 af 90 58 06 00 00 00-73 00 74 00 72 00 65 00 ...X....s.t.r.e.
01c73ae0 61 00 6d 00 00 00 00 00-00 00 a.m.......
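To make the layout in the raw display concrete, here is a small Python sketch that decodes the 26 bytes shown by `db`. The field offsets (method table at 0, length at 4, characters from 8) are taken from the DumpObj output above; the variable names are mine.

```python
import struct

# Bytes copied from the `db 01c73ad0 l1a` output above.
raw = bytes.fromhex(
    "b0 af 90 58 06 00 00 00 73 00 74 00 72 00 65 00 "
    "61 00 6d 00 00 00 00 00 00 00"
)

# Offset 0: method table pointer, offset 4: m_stringLength (both little-endian DWORDs).
method_table, length = struct.unpack_from("<II", raw, 0)

# Offset 8: m_firstChar, followed by the rest of the UTF-16 characters.
text = raw[8:8 + 2 * length].decode("utf-16-le")

print(hex(method_table), length, text, len(raw))
```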
If I execute the below code and do not press any key on the console
using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        List<Test> list = new List<Test>(1) { new Test() };
        Console.ReadKey();
        GC.KeepAlive(list);
        var x = list[0];
        Console.WriteLine(x.ToString());
    }
}

class Test
{
    public override string ToString()
    {
        return "Empty object";
    }
}
Then, upon analyzing the array in WinDbg, I see that the list does not contain the Test object I added.
The element at position 0 is something else that I cannot identify.
However, if I add a string field to my Test class like so:
class Test
{
    public string Name = "Rohit";

    public override string ToString()
    {
        return "Empty object";
    }
}
Then this time WinDbg reveals the object.
Can someone please explain what is going on? I tested the above using Visual Studio 2015 (.NET 4 compatibility mode) on x64 Windows 7.
Aside: even though I requested a list capacity of one (for my test), I see it going with the default size of 128. So is the initial capacity based on some heuristic?
From List to the object
From the comments I see some confusion whether a List<T> stores its items in an Object[] or T[].
I compiled the program in VS 2013.4 for .NET 4.0 and I'm debugging in WinDbg 6.2.9600:
0:007> !dumpheap -stat
Statistics:
MT Count TotalSize Class Name
000007feed598130 1 24 System.Security.HostSecurityManager
000007feed597158 1 24 System.Collections.Generic.ObjectEqualityComparer`1[[System.Type, mscorlib]]
000007fe91bd40c0 1 24 ConsoleWriteLine.Test
000007feed592090 1 28 System.Char[]
000007feed5980b8 1 32 System.Security.Policy.Evidence+EvidenceLockHolder
000007feed5975e8 1 32 System.Security.Policy.AssemblyEvidenceFactory
000007feed5974a0 1 32 Microsoft.Win32.SafeHandles.SafePEFileHandle
000007feed594810 1 32 System.Text.DecoderReplacementFallback
000007feed594780 1 32 System.Text.EncoderReplacementFallback
000007feed536fd8 1 40 Microsoft.Win32.Win32Native+InputRecord
000007fe91bd4150 1 40 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000007feed591480 1 48 System.SharedStatics
000007feed5945a8 1 56 System.Text.UnicodeEncoding
000007feed5943e0 1 56 System.Reflection.RuntimeAssembly
000007feed598038 1 64 System.Threading.ReaderWriterLock
000007feed597548 1 64 System.Security.Policy.PEFileEvidenceFactory
000007feed592610 1 64 System.Security.PermissionSet
000007feed593af0 1 72 System.RuntimeFieldInfoStub
000007feed592478 1 72 System.Security.Policy.Evidence
000007feed5913e8 3 72 System.Object
000007feeced7d10 1 80 System.Collections.Generic.Dictionary`2[[System.Type, mscorlib],[System.Security.Policy.EvidenceTypeDescriptor, mscorlib]]
000007feed591e00 1 128 System.AppDomainSetup
000007feed591310 1 160 System.ExecutionEngineException
000007feed591298 1 160 System.StackOverflowException
000007feed591220 1 160 System.OutOfMemoryException
000007feed591038 1 160 System.Exception
000007feed591540 1 216 System.AppDomain
00000000002e9e60 8 216 Free
000007feed591388 2 320 System.Threading.ThreadAbortException
000007feed593920 4 492 System.Int32[]
000007feed597fd8 3 720 System.Collections.Generic.Dictionary`2+Entry[[System.Type, mscorlib],[System.Security.Policy.EvidenceTypeDescriptor, mscorlib]][]
000007feed592eb8 21 1176 System.RuntimeType
000007feed590e08 37 2786 System.String
000007feed524918 8 34808 System.Object[]
Total 112 objects
Compared to your output, I have no String[], no Type[] and no Test[]. Instead I have 8 Object[]. Like you, I have only 1 List<T>. Let's find out which array it uses. The steps should be the same on your machine, but you might get Test[] as the result.
Step 1: dump all List<T>:
0:007> !dumpheap -mt 000007fe91bd4150
Address MT Size
00000000021a2de8 000007fe91bd4150 40
Statistics:
MT Count TotalSize Class Name
000007fe91bd4150 1 40 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
Total 1 objects
Step 2: dump the only List<T> which is there:
0:007> !do 00000000021a2de8
Name: System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
MethodTable: 000007fe91bd4150
EEClass: 000007feecf7ea08
Size: 40(0x28) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
000007feed524918 4000cd1 8 System.Object[] 0 instance 00000000021a2e10 _items
000007feed593980 4000cd2 18 System.Int32 1 instance 1 _size
000007feed593980 4000cd3 1c System.Int32 1 instance 1 _version
000007feed5913e8 4000cd4 10 System.Object 0 instance 0000000000000000 _syncRoot
000007feed524918 4000cd5 8 System.Object[] 0 static <no information>
Note the field _items, which is reported as type Object[] here.
Step 3: dump the backing Object[]:
0:007> !da 00000000021a2e10
Name: ConsoleWriteLine.Test[]
MethodTable: 000007feed524918
EEClass: 000007feecf77f58
Size: 40(0x28) bytes
Array: Rank 1, Number of elements 1, Type CLASS
Element Methodtable: 000007fe91bd40c0
[0] 00000000021a2e38
There is an interesting finding: when dumping the array, it figured out it is actually a Test[]. It might be that your version of SOS is just smarter than mine and displays types correctly in !dumpheap -stat. My version is 4.0.30319.34209.
Step 4: dump the only object of the array
0:007> !do 00000000021a2e38
Name: ConsoleWriteLine.Test
MethodTable: 000007fe91bd40c0
EEClass: 000007fe91ce23d8
Size: 24(0x18) bytes
File: E:\...\bin\Debug\ConsoleWriteLine.exe
Fields:
None
This is the Test object with no fields, just as expected.
What else could be helpful?
You have picked the second Object[] that was listed by !dumpheap -mt 7fee816f150. I don't know why you have chosen that one, but probably it's because you detected the difference in that array. You have then dumped the object in that array, which was an empty string. The second time, this list contained one more string. I could reproduce this.
To find out more about this, use !gcroot to see where an object is used. Since .NET objects are like pointers, two arrays can point to the same object (string), therefore use -all as parameter.
0:007> !gcroot -all 00000000021a1420
Thread 109c:
000000000013e8c0 000007feedbca151 System.Console.ReadKey(Boolean)
rdi: (interior)
-> 00000000121a1038 System.Object[]
-> 00000000021a1420 System.String
HandleTable:
00000000008517e8 (pinned handle)
-> 00000000121a32e8 System.Object[]
-> 00000000021a1420 System.String
00000000008517f8 (pinned handle)
-> 00000000121a1038 System.Object[]
-> 00000000021a1420 System.String
Found 3 roots.
Another helpful command might be !dso to display objects referenced by the stack (e.g. local variables):
0:007> ~0s
0:000> !dso
OS Thread Id: 0x109c (0)
RSP/REG Object Name
000000000013E950 00000000021a2f10 System.Object
000000000013E9F0 00000000021a2e38 ConsoleWriteLine.Test
000000000013EA00 00000000021a2de8 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000000000013EA10 00000000021a2de8 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000000000013EA28 00000000021a2de8 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000000000013EA30 00000000021a2de8 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000000000013EA38 00000000021a2de8 System.Collections.Generic.List`1[[ConsoleWriteLine.Test, ConsoleWriteLine]]
000000000013EA40 00000000021a2e38 ConsoleWriteLine.Test
000000000013EA48 00000000021a2e38 ConsoleWriteLine.Test
000000000013EA80 00000000021a2dc8 System.Object[] (System.String[])
000000000013EBB8 00000000021a2dc8 System.Object[] (System.String[])
000000000013ECD8 00000000021a2dc8 System.Object[] (System.String[])
000000000013EEA8 00000000021a2dc8 System.Object[] (System.String[])
000000000013F478 00000000021a1440 System.SharedStatics
The alternative way to look at this would be from the other end: decompile the code of List<T> (or look at it on GitHub if you wanted the .NET 5 implementation).
Using JetBrains dotPeek against 64-bit mscorlib I see this (with omissions for readability):
public class List<T> : IList<T>, ICollection<T>, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T>, IEnumerable<T>, IEnumerable
{
private T[] _items;
Going by intuition, I'd suggest that the difference between your two code samples arises because the compiler is smart: your first Test class has no state, so it's essentially equivalent to a static class, and there can never be any difference between multiple instances of it. I'd imagine the compiler optimises it away entirely for this reason. Once you add the field, you have state to store in the object, and so the optimisation can't be applied.
As Henk Holterman pointed out in comments, the fact that your string literal might appear in an object array is incidental: a string literal in code will be interned by the compiler and will appear in data structures other than the one you've declared as an optimisation. That an object[] exists in memory with your string in it doesn't mean that it's being used as backing store by List<T>.
Regarding the default size of your list: you specify size in items, not bytes. Your single instance of array ConsoleApplication1.Test[] has a size of 32 bits, which would be a single 32-bit pointer: the one item you specified.
I am getting some serialized .NET class string data from a source and I just need to turn it into something readable in PHP. Doesn't necessarily have to be turned into an "object" or JSON but I need to read it somehow. I think the .NET string is just a class with some set properties but it is binary and not portable obviously. I'm not looking to convert .NET code to PHP code. Here is an example of the data:
U:?�S�#��-��v�Y��?������An�#AMAUI������
I realize this is actually binary and not printable text. I'm just using this as an example of what I see when catting the file.
Short answer:
I would really suggest NOT implementing the interpretation of the binary representation yourself. I would use another format instead (JSON, XML, etc.).
Long answer:
However, if this is not possible there is of course a way...
The actual question is: What does the binary format of serialized .NET objects look like and how can we interpret it correctly?
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I have created a simple class called A which contains 2 properties, one string and one integer value, they are called SomeString and SomeValue.
Class A looks like this:
[Serializable()]
public class A
{
    public string SomeString
    {
        get;
        set;
    }

    public int SomeValue
    {
        get;
        set;
    }
}
For the serialization I used the BinaryFormatter of course:
BinaryFormatter bf = new BinaryFormatter();
// Write straight to a FileStream; the using block closes the file afterwards.
using (FileStream fs = File.Create("test.txt"))
{
    bf.Serialize(fs, new A() { SomeString = "abc", SomeValue = 123 });
}
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
If we look at the serialized result in a hex editor, we get something like this:
Let us interpret the example result data:
According to the above-mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf), every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration a value of 0 identifies the SerializationHeaderRecord which is specified in 2.6.1 SerializationHeaderRecord:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
RecordTypeEnum (1 byte)
RootId (4 bytes)
HeaderId (4 bytes)
MajorVersion (4 bytes)
MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId
01 00 00 00 represents the MajorVersion
00 00 00 00 represents the MinorVersion
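As a quick check of the interpretation above, the 17 header bytes can be unpacked in one call. This is a Python sketch (not part of the original C# example); the byte values are the ones listed above and the variable names are mine.

```python
import struct

# The 17 bytes of the SerializationHeaderRecord as interpreted above.
header = bytes.fromhex("00 01 00 00 00 FF FF FF FF 01 00 00 00 00 00 00 00")

# "<" = little-endian without padding, B = 1-byte record type, i = signed 32-bit integer.
record_type, root_id, header_id, major, minor = struct.unpack("<Biiii", header)

print(record_type, root_id, header_id, major, minor)
```

FF FF FF FF read as a signed 32-bit integer is -1, which is how the HeaderId shows up here.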
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
RecordTypeEnum (1 byte)
LibraryId (4 bytes)
LibraryName (variable number of bytes (which is a LengthPrefixedString))
As stated in 2.1.1.6 LengthPrefixedString...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:
0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 0x42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
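A LengthPrefixedString reader might look like the following Python sketch. The 7-bits-per-byte length encoding (high bit set means another length byte follows) is how I understand section 2.1.1.6; treat that detail as an assumption and check it against the specification. The function name is made up for illustration.

```python
def read_length_prefixed_string(data, offset=0):
    """Return (text, next_offset) for a LengthPrefixedString starting at offset."""
    length = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        length |= (byte & 0x7F) << shift   # low 7 bits carry length data
        if (byte & 0x80) == 0:             # high bit clear: last length byte
            break
        shift += 7
    text = data[offset:offset + length].decode("utf-8")
    return text, offset + length

# The LibraryName from the example: one length byte (0x42) plus 66 bytes of UTF-8.
library_name = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"
encoded = bytes([0x42]) + library_name.encode("utf-8")

text, end = read_length_prefixed_string(encoded)
print(len(text), end)
```

For strings of 128 bytes or more the prefix grows: a 200-byte string would be prefixed with C8 01 under this scheme.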
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
RecordTypeEnum (1 byte)
ClassInfo (variable number of bytes)
MemberTypeInfo (variable number of bytes)
LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo the record consists of:
ObjectId (4 bytes)
Name (variable number of bytes (which is again a LengthPrefixedString))
MemberCount(4 bytes)
MemberNames (which is a sequence of LengthPrefixedString's where the number of items MUST be equal to the value specified in the MemberCount field.)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class which is represented by using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A - so obviously I used StackOverFlow as name of the namespace.
02 00 00 00 represents the MemberCount; it tells us that 2 members, both represented as LengthPrefixedStrings, will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, which is 27 bytes, and it results in something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
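The ClassInfo walkthrough above can be replayed with a short Python sketch. The bytes are copied from the text; the helper names are hypothetical, and `read_short_string` deliberately handles only the single-byte length prefix that occurs in this example.

```python
import struct

# ClassInfo bytes as listed above: ObjectId, Name, MemberCount, MemberNames.
class_info = bytes.fromhex(
    "01 00 00 00 "
    "0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 "
    "02 00 00 00 "
    "1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 "
    "1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64"
)

def read_short_string(data, offset):
    """Read a string with a one-byte length prefix (strings under 128 bytes only)."""
    length = data[offset]
    start = offset + 1
    return data[start:start + length].decode("utf-8"), start + length

def parse_class_info(data):
    (object_id,) = struct.unpack_from("<i", data, 0)
    name, pos = read_short_string(data, 4)
    (member_count,) = struct.unpack_from("<i", data, pos)
    pos += 4
    members = []
    for _ in range(member_count):
        member, pos = read_short_string(data, pos)
        members.append(member)
    return object_id, name, members, pos

object_id, name, members, consumed = parse_class_info(class_info)
print(object_id, name, members, consumed)
```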
MemberTypeInfo:
After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 - MemberTypeInfo states, that the structure contains:
BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
AdditionalInfos (variable in length); depending on the BinaryTypeEnum, additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 represents the BinaryTypeEnumeration of the first member, according to 2.1.2.2 BinaryTypeEnumeration we can expect a String and it is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member, and again, according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it against the table in 2.1.2.3 PrimitiveTypeEnumeration, and find that we can expect an Int32, which is represented by 4 bytes, as stated in some other document about basic datatypes.
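The two-pass shape of MemberTypeInfo (first all BinaryTypeEnum bytes, then the additional info for those that need it) is easy to get wrong, so here is a minimal Python sketch. It only knows the two cases from the table above plus the Int32 code 08; the lookup tables and function name are illustrative, not a complete mapping of the specification.

```python
# Partial lookup tables, limited to the values that occur in this example.
BINARY_TYPES = {0: "Primitive", 1: "String"}
PRIMITIVE_TYPES = {8: "Int32"}

def parse_member_type_info(data, member_count):
    """Return ([(binary_type, additional_info), ...], bytes_consumed)."""
    type_codes = list(data[:member_count])   # pass 1: one type byte per member
    pos = member_count
    result = []
    for code in type_codes:                   # pass 2: additional info, in the same order
        kind = BINARY_TYPES[code]
        if kind == "Primitive":
            result.append((kind, PRIMITIVE_TYPES[data[pos]]))
            pos += 1
        else:
            result.append((kind, None))       # String carries no additional info
    return result, pos

member_types, consumed = parse_member_type_info(bytes.fromhex("01 00 08"), 2)
print(member_types, consumed)
```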
LibraryId:
After the MemberTypeInfo the LibraryId follows; it is represented by 4 bytes:
02 00 00 00 represents the LibraryId which is 2.
The values:
As specified in 2.3 Class Records:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).
According to 2.5.7 BinaryObjectString it contains:
RecordTypeEnum (1 byte)
ObjectId (4 bytes)
Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you can remember that there was a second member, an Int32. Knowing that the Int32 is represented by 4 bytes, we can conclude that 7B 00 00 00 must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
So here is the complete ClassWithMembersAndTypes record:
MessageEnd:
Finally the last byte 0B represents the MessageEnd record.
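To tie the member values together, the final bytes of the stream can be decoded like this. It is a Python sketch over the byte values discussed above (06, the ObjectId, the string, the Int32 and the trailing 0B), assuming the Int32 is stored as 7B 00 00 00 in little-endian order; the variable names are mine.

```python
import struct

# BinaryObjectString record, the raw Int32 value, then MessageEnd.
tail = bytes.fromhex("06 03 00 00 00 03 61 62 63 7B 00 00 00 0B")

pos = 0
record_type = tail[pos]                               # 06 = BinaryObjectString
pos += 1
(object_id,) = struct.unpack_from("<i", tail, pos)    # 03 00 00 00
pos += 4
length = tail[pos]                                    # one-byte length prefix
pos += 1
some_string = tail[pos:pos + length].decode("utf-8")
pos += length
(some_value,) = struct.unpack_from("<i", tail, pos)   # primitive value, no record type byte
pos += 4
message_end = tail[pos]                               # 0B = MessageEnd

print(record_type, object_id, some_string, some_value, hex(message_end))
```

Note that the Int32 has no record type byte of its own: its type was already declared in the MemberTypeInfo, so only the 4 value bytes are written.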
I have a task to take millions of floats and store them in the database in batches of 5,000, as binary. This is forcing me to learn interesting things about serialization performance.
One of the things that surprises me is the size of the serialized data, which is a factor of ten above what I expected. This test shows me that a four-byte float is serialized to 55 bytes and an eight-byte double to 59 bytes.
What is happening here? I expected it to simply split the float value into its four bytes. What are the other 51 bytes?
private void SerializeFloat()
{
    Random rnd = new Random();
    IFormatter iFormatter = new BinaryFormatter();
    using (MemoryStream memoryStream = new MemoryStream(10000000))
    {
        memoryStream.Capacity = 0;
        iFormatter.Serialize(memoryStream, (Single)rnd.NextDouble());
        iFormatter.Serialize(memoryStream, rnd.NextDouble());
    }
}
Serialization is more than simply blitting bits and bytes to a stream. Serialization is structured output. This structure accounts for your actual differences. The Framework encodes additional information which lets it know the type and number of objects in the serialized data, among many other possibilities. It is an implementation detail best left alone.
If you need unstructured output, you could use BinaryWriter instead.
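To illustrate what "unstructured" means for the batch size, here is a Python sketch of raw packing: each 32-bit float becomes exactly 4 little-endian bytes, which is also what BinaryWriter.Write(float) produces per value. The batch size of 5,000 comes from the question; everything else (names, random test data) is illustrative.

```python
import random
import struct

# A batch of 5,000 values, as in the question.
values = [random.random() for _ in range(5000)]

# Pack as little-endian 32-bit floats: 4 bytes per value, no type metadata.
blob = struct.pack("<%df" % len(values), *values)

# Reading back only needs the byte count to know how many floats there are.
restored = struct.unpack("<%df" % (len(blob) // 4), blob)

print(len(blob))
```

The trade-off is that the reader must already know the element type and byte order, because nothing in the blob records them.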
Because it is maybe of interest for someone I decided to do this post about What does the binary format of serialized .NET objects look like and how can we interpret it correctly?
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I have created a simple class called A which contains 2 properties, one string and one integer value, they are called SomeString and SomeValue.
Class A looks like this:
[Serializable()]
public class A
{
public string SomeString
{
get;
set;
}
public int SomeValue
{
get;
set;
}
}
For the serialization I used the BinaryFormatter of course:
BinaryFormatter bf = new BinaryFormatter();
StreamWriter sw = new StreamWriter("test.txt");
bf.Serialize(sw.BaseStream, new A() { SomeString = "abc", SomeValue = 123 });
sw.Close();
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
If we look at the serialized result in an hex editor, we get something like this:
Let us interpret the example result data:
According to the above mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf) every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeNumeration states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration a value of 0 identifies the SerializationHeaderRecord which is specified in 2.6.1 SerializationHeaderRecord:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
RecordTypeEnum (1 byte)
RootId (4 bytes)
HeaderId (4 bytes)
MajorVersion (4 bytes)
MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId
01 00 00 00 represents the MajorVersion
00 00 00 00 represents the MinorVersion
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord it is followed by the BinaryLibrary record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
RecordTypeEnum (1 byte)
LibraryId (4 bytes)
LibraryName (variable number of bytes (which is a LengthPrefixedString))
As stated in 2.1.1.6 LengthPrefixedString...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:
0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 42 (decimal 66) tell's us, that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
RecordTypeEnum (1 byte)
ClassInfo (variable number of bytes)
MemberTypeInfo (variable number of bytes)
LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo the record consists of:
ObjectId (4 bytes)
Name (variable number of bytes (which is again a LengthPrefixedString))
MemberCount(4 bytes)
MemberNames (which is a sequence of LengthPrefixedString's where the number of items MUST be equal to the value specified in the MemberCount field.)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class which is represented by using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A - so obviously I used StackOverFlow as name of the namespace.
02 00 00 00 represents the MemberCount, it tell's us that 2 members, both represented with LengthPrefixedString's will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName, 1B is again the length of the string which is 27 bytes in length an results in something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
MemberTypeInfo:
After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 - MemberTypeInfo states, that the structure contains:
BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
AdditionalInfos (variable in length), depending on the BinaryTpeEnum additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 represents the BinaryTypeEnumeration of the first member, according to 2.1.2.2 BinaryTypeEnumeration we can expect a String and it is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member, and again, according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it with the table in 2.1.2.3 PrimitiveTypeEnumeration and be surprised to notice that we can expect an Int32, which is represented by 4 bytes, as stated in some other document about basic datatypes.
LibraryId:
After the MemberTypeInfo the LibraryId follows. It is represented by 4 bytes:
02 00 00 00 represents the LibraryId which is 2.
The values:
As specified in 2.3 Class Records:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).
According to 2.5.7 BinaryObjectString it contains:
RecordTypeEnum (1 byte)
ObjectId (4 bytes)
Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you remember that there was a second member, an Int32. Knowing that the Int32 is represented using 4 bytes, we can conclude that
7B 00 00 00
must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
So here is the complete ClassWithMembersAndTypes record:
MessageEnd:
Finally the last byte 0B represents the MessageEnd record.
Binary serialization is type safe. It makes sure that when you deserialize the data, you'll get the exact same object back.
To make that work, BinaryFormatter adds additional data about the types of the objects that you serialize. That is the extra overhead you are seeing. You can inspect it by serializing to a FileStream and looking at the generated file with a hex viewer. You'll see strings in it such as "System.Single", the type name, and "m_value", the name of the field where the value is stored. A good way to cut down on the overhead is to, say, serialize an array instead, so the type information is written once rather than once per value.
BinaryWriter is the exact opposite, very compact but not type-safe. Plenty of alternatives are available in between.
.NET serialization writes a good deal of information besides the actual 8 bytes of your double (type information, etc.). You could instead open a file Stream and write the bytes returned by BitConverter.GetBytes(double), or use the BinaryWriter class.
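As a rough illustration of the BinaryWriter route, the sketch below writes a double to a MemoryStream and reads it back. The class and method names (RawDoubleSketch, Write, Read) are made up for this example; only BinaryWriter, BinaryReader and MemoryStream come from the framework.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library: shows that BinaryWriter
// emits just the raw bytes of a double, with no type metadata.
public static class RawDoubleSketch
{
    public static byte[] Write(double value)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(value);   // writes the 8 bytes of the double
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static double Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            // The reader must know a double is stored here; nothing in the data says so.
            return reader.ReadDouble();
        }
    }
}
```

Calling RawDoubleSketch.Write(3.14) should give an 8-byte array, which is the whole point: the compactness comes from leaving out everything that makes the data self-describing.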
There are many alternatives to .NET serialization:
Text formats
XML
JSON
Binary formats
Google Protocol Buffers
MessagePack
These all have their pros and cons. I especially like MessagePack and encourage you to take a look at it. For example, it will use 9 bytes to store a self-describing double.
I'm using binary serialization (BinaryFormatter) as a temporary mechanism to store state information in a file for a relatively complex (game) object structure. The files are coming out much larger than I expect, and my data structure includes recursive references, so I'm wondering whether the BinaryFormatter is actually storing multiple copies of the same objects, whether my basic "number of objects and values I should have" arithmetic is way off-base, or where else the excessive size is coming from.
Searching on stack overflow I was able to find the specification for Microsoft's binary remoting format:
http://msdn.microsoft.com/en-us/library/cc236844(PROT.10).aspx
What I can't find is any existing viewer that lets you "peek" into the contents of a BinaryFormatter output file: get object counts and total bytes for different object types in the file, etc.
I feel like this must be my "google-fu" failing me (what little I have) - can anyone help? This must have been done before, right??
UPDATE: I could not find it and got no answers so I put something relatively quick together (link to downloadable project below); I can confirm the BinaryFormatter does not store multiple copies of the same object but it does print quite a lot of metadata to the stream. If you need efficient storage, build your own custom serialization methods.
Because this may be of interest to someone, I decided to write this post about what the binary format of serialized .NET objects looks like and how it can be interpreted correctly.
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I have created a simple class called A which contains 2 properties, one string and one integer value, they are called SomeString and SomeValue.
Class A looks like this:
[Serializable()]
public class A
{
    public string SomeString
    {
        get;
        set;
    }
    public int SomeValue
    {
        get;
        set;
    }
}
For the serialization I used the BinaryFormatter of course:
// Write to the file stream directly; the using block closes the file.
using (FileStream fs = File.Create("test.txt"))
{
    BinaryFormatter bf = new BinaryFormatter();
    bf.Serialize(fs, new A() { SomeString = "abc", SomeValue = 123 });
}
As can be seen, I passed a new instance of class A containing abc and 123 as values.
Example result data:
If we look at the serialized result in a hex editor, we get something like this:
Let us interpret the example result data:
According to the above-mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf) every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration a value of 0 identifies the SerializationHeaderRecord which is specified in 2.6.1 SerializationHeaderRecord:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
RecordTypeEnum (1 byte)
RootId (4 bytes)
HeaderId (4 bytes)
MajorVersion (4 bytes)
MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.
01 00 00 00 represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId with the value 1 (because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF represents the HeaderId
01 00 00 00 represents the MajorVersion
00 00 00 00 represents the MinorVersion
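The five fields above can be read with a plain BinaryReader, which is little-endian like the stream. This is only a sketch built from the 17 bytes listed in this section; the class and member names (HeaderSketch, Sample, Read) are invented for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads the 17-byte SerializationHeaderRecord walked through above.
public static class HeaderSketch
{
    // Record type, RootId, HeaderId, MajorVersion, MinorVersion, as listed in the text.
    public static readonly byte[] Sample =
    {
        0x00,
        0x01, 0x00, 0x00, 0x00,
        0xFF, 0xFF, 0xFF, 0xFF,
        0x01, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00
    };

    public static (byte RecordType, int RootId, int HeaderId, int Major, int Minor) Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            byte recordType = reader.ReadByte();   // 0 = SerializationHeaderRecord
            int rootId = reader.ReadInt32();       // BinaryReader reads little-endian
            int headerId = reader.ReadInt32();     // FF FF FF FF is -1 as a signed Int32
            int major = reader.ReadInt32();
            int minor = reader.ReadInt32();
            return (recordType, rootId, headerId, major, minor);
        }
    }
}
```

HeaderSketch.Read(HeaderSketch.Sample) should yield RootId 1, HeaderId -1, MajorVersion 1 and MinorVersion 0, matching the interpretation above.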
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
RecordTypeEnum (1 byte)
LibraryId (4 bytes)
LibraryName (variable number of bytes (which is a LengthPrefixedString))
As stated in 2.1.1.6 LengthPrefixedString...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:
0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.
02 00 00 00 represents the LibraryId which is 2 in our case.
Now the LengthPrefixedString follows:
42 represents the length information of the LengthPrefixedString which contains the LibraryName.
In our case the length information of 42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.
As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
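The variable-length prefix of a LengthPrefixedString can be decoded by hand. The sketch below assumes the usual 7-bits-per-byte layout (low 7 bits carry data, the high bit means "another length byte follows", least significant group first), which is also what BinaryReader.ReadString expects. The class and method names are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: decodes a LengthPrefixedString as described in the quoted section.
public static class LengthPrefixedStringSketch
{
    // Reads the 1 to 5 byte length prefix.
    public static int ReadLength(Stream stream)
    {
        int length = 0;
        for (int shift = 0; shift < 35; shift += 7)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            length |= (b & 0x7F) << shift;          // low 7 bits are payload
            if ((b & 0x80) == 0) return length;     // high bit clear: last length byte
        }
        throw new FormatException("Length prefix is longer than 5 bytes.");
    }

    // Reads the prefix, then that many bytes, and decodes them as UTF-8.
    public static string ReadString(Stream stream)
    {
        int length = ReadLength(stream);
        byte[] buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = stream.Read(buffer, read, length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return Encoding.UTF8.GetString(buffer);
    }
}
```

For strings shorter than 128 bytes, as everywhere in this example, the prefix is a single byte and the loop exits on its first iteration.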
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:
05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
RecordTypeEnum (1 byte)
ClassInfo (variable number of bytes)
MemberTypeInfo (variable number of bytes)
LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo the record consists of:
ObjectId (4 bytes)
Name (variable number of bytes (which is again a LengthPrefixedString))
MemberCount(4 bytes)
MemberNames (which is a sequence of LengthPrefixedString's where the number of items MUST be equal to the value specified in the MemberCount field.)
Back to the raw data, step by step:
01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class which is represented by using a LengthPrefixedString. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A - so obviously I used StackOverFlow as name of the namespace.
02 00 00 00 represents the MemberCount. It tells us that 2 members, both represented as LengthPrefixedStrings, will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, 27 bytes, and the result is something like this: <SomeString>k__BackingField.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
MemberTypeInfo:
After the ClassInfo the MemberTypeInfo follows.
Section 2.3.1.2 - MemberTypeInfo states that the structure contains:
BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
AdditionalInfos (variable in length). Depending on the BinaryTypeEnum, additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).
Again, back to the raw data of the complete MemberTypeInfo record:
01 represents the BinaryTypeEnumeration of the first member, according to 2.1.2.2 BinaryTypeEnumeration we can expect a String and it is represented using a LengthPrefixedString.
00 represents the BinaryTypeEnumeration of the second member, and again, according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it with the table in 2.1.2.3 PrimitiveTypeEnumeration and be surprised to notice that we can expect an Int32, which is represented by 4 bytes, as stated in some other document about basic datatypes.
LibraryId:
After the MemberTypeInfo the LibraryId follows. It is represented by 4 bytes:
02 00 00 00 represents the LibraryId which is 2.
The values:
As specified in 2.3 Class Records:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).
According to 2.5.7 BinaryObjectString it contains:
RecordTypeEnum (1 byte)
ObjectId (4 bytes)
Value (variable length, represented as a LengthPrefixedString)
So knowing that, we can clearly identify that
03 00 00 00 represents the ObjectId.
03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.
Hopefully you remember that there was a second member, an Int32. Knowing that the Int32 is represented using 4 bytes, we can conclude that
7B 00 00 00
must be the Value of our second member. 7B hexadecimal equals 123 decimal, which fits our example code.
So here is the complete ClassWithMembersAndTypes record:
MessageEnd:
Finally the last byte 0B represents the MessageEnd record.
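To tie the last records together, here is a sketch that reads the member values and the MessageEnd byte with BinaryReader. It assumes the trailing bytes are 06 03 00 00 00 03 61 62 63 7B 00 00 00 0B, which is reconstructed from the walkthrough (the Int32 123 taken as little-endian 7B 00 00 00). The names ValueRecordsSketch, Sample and Read are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads the BinaryObjectString, the untyped Int32 value
// and the MessageEnd record that close the example stream.
public static class ValueRecordsSketch
{
    public static readonly byte[] Sample =
    {
        0x06,                     // RecordTypeEnum: BinaryObjectString
        0x03, 0x00, 0x00, 0x00,   // ObjectId = 3
        0x03, 0x61, 0x62, 0x63,   // LengthPrefixedString "abc"
        0x7B, 0x00, 0x00, 0x00,   // Int32 value, no record type byte in front
        0x0B                      // MessageEnd
    };

    public static (int ObjectId, string Text, int Number) Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            if (reader.ReadByte() != 0x06)
                throw new FormatException("Expected a BinaryObjectString record.");
            int objectId = reader.ReadInt32();
            // BinaryReader.ReadString reads a 7-bit-encoded length followed by UTF-8 bytes.
            string text = reader.ReadString();
            // The primitive member follows directly; its type is known from MemberTypeInfo.
            int number = reader.ReadInt32();
            if (reader.ReadByte() != 0x0B)
                throw new FormatException("Expected the MessageEnd record.");
            return (objectId, text, number);
        }
    }
}
```

Note that the Int32 carries no marker of its own. A reader can only find it because the MemberTypeInfo announced a Primitive of type Int32 as the second member.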
Vasiliy is right in that I will ultimately need to implement my own formatter/serialization process to better handle versioning and to output a much more compact stream (before compression).
I did want to understand what was happening in the stream, however, so I wrote up a (relatively) quick class that does what I wanted:
parses its way through the stream, building collections of object names, counts and sizes
once done, outputs a quick summary of what it found - classes, counts and total sizes in the stream
It's not useful enough for me to put it somewhere visible like codeproject, so I just dumped the project in a zip file on my website: http://www.architectshack.com/BinarySerializationAnalysis.ashx
In my specific case it turns out that the problem was twofold:
The BinaryFormatter is VERY verbose (this is known, I just didn't realize the extent)
I did have issues in my class, it turned out I was storing objects that I didn't want
Hope this helps someone at some point!
Update: Ian Wright contacted me with a problem with the original code, where it crashed when the source object(s) contained "decimal" values. This is now corrected, and I've used the occasion to move the code to GitHub and give it a (permissive, BSD) license.
Our application works with massive amounts of data. It can take up to 1-2 GB of RAM, like your game. We ran into the same "storing multiple copies of the same objects" problem. Binary serialization also stores too much metadata. When it was first implemented, the serialized file took about 1-2 GB. I have since managed to bring that down to 50-100 MB. Here is what we did.
The short answer: do not use the .NET binary serialization, create your own binary serialization mechanism. We have our own BinaryFormatter class and ISerializable interface (with two methods, Serialize and Deserialize).
The same object should not be serialized more than once. We save its unique ID and restore the object from a cache.
I can share some code if you ask.
EDIT: It seems you are correct. See the following code - it proves I was wrong.
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Item
{
    public string Data { get; set; }
}
[Serializable]
public class ItemHolder
{
    public Item Item1 { get; set; }
    public Item Item2 { get; set; }
}
public class Program
{
    public static void Main(params string[] args)
    {
        {
            // One Item instance referenced from both properties.
            Item item0 = new Item() { Data = "0000000000" };
            ItemHolder holderOneInstance = new ItemHolder() { Item1 = item0, Item2 = item0 };
            var fs0 = File.Create("temp-file0.txt");
            var formatter0 = new BinaryFormatter();
            formatter0.Serialize(fs0, holderOneInstance);
            fs0.Close();
            Console.WriteLine("One instance: " + new FileInfo(fs0.Name).Length); // 335
            //File.Delete(fs0.Name);
        }
        {
            // Two distinct Item instances.
            Item item1 = new Item() { Data = "1111111111" };
            Item item2 = new Item() { Data = "2222222222" };
            ItemHolder holderTwoInstances = new ItemHolder() { Item1 = item1, Item2 = item2 };
            var fs1 = File.Create("temp-file1.txt");
            var formatter1 = new BinaryFormatter();
            formatter1.Serialize(fs1, holderTwoInstances);
            fs1.Close();
            Console.WriteLine("Two instances: " + new FileInfo(fs1.Name).Length); // 360
            //File.Delete(fs1.Name);
        }
    }
}
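Looks like BinaryFormatter keeps track of objects it has already written and stores a repeated object only once. As far as I know it does this by reference identity rather than by calling object.Equals.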
Have you ever looked inside the generated files? If you open "temp-file0.txt" and "temp-file1.txt" from the code example you'll see it has lots of meta data. That's why I recommended you to create your own serialization mechanism.
Sorry for being confusing.
Maybe you could run your program in debug mode and try adding a breakpoint.
If that is impossible due to the size of the game or other dependencies, you can always code a simple/small app that includes the deserialization code and peek at it in the debugger there.
I want to create a very simple piece of software in C# .NET that I can pass a folder's path to and detect all files with a frequency below a given threshold. Any pointers on how I would do this?
You have to read the mp3 files. To do that you have to find the specifications for them.
An mp3 file generally starts with an ID3 tag, so you have to read it, find its length and skip it. Let's take ID3v2.3 as an example:
ID3v2/file identifier "ID3"
ID3v2 version $03 00
ID3v2 flags %abc00000
ID3v2 size 4 * %0xxxxxxx
So bytes 6, 7, 8 and 9 store the tag size in big-endian form, with only 7 bits used per byte (the top bit of each byte is always 0). Here is a sample from a file:
0 1 2 3 4 5 6 7 8 9 A B C D E F
49 44 33 03 00 00 00 00 07 76 54 43 4f 4e 00 00
07 76 is the size. Because each byte contributes only 7 bits, the actual size is (0x07 << 7) | 0x76 = 0x3F6. Then add 10 (0xA) for the tag header itself to get the offset 0x400. This is the address where the mp3 frame header starts.
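The size calculation above can be sketched in a few lines. The class and method names (Id3Sketch, TagSize, AudioOffset) are invented for this example, and it only covers the plain 10-byte ID3v2 header shown here, not footers or other tag variants.

```csharp
using System;

// Hypothetical helper: computes where the audio data starts after an ID3v2 tag.
public static class Id3Sketch
{
    // Combines bytes 6..9 of the header, 7 bits each, most significant byte first.
    public static int TagSize(byte[] header)
    {
        if (header.Length < 10 || header[0] != 'I' || header[1] != 'D' || header[2] != '3')
            throw new FormatException("No ID3v2 header found.");
        return ((header[6] & 0x7F) << 21)
             | ((header[7] & 0x7F) << 14)
             | ((header[8] & 0x7F) << 7)
             |  (header[9] & 0x7F);
    }

    // The size excludes the 10-byte header itself, so add it back.
    public static int AudioOffset(byte[] header)
    {
        return TagSize(header) + 10;
    }
}
```

Fed the first ten bytes of the sample (49 44 33 03 00 00 00 00 07 76), TagSize should return 0x3F6 and AudioOffset 0x400.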
Then you take description of mp3 header:
The bits are: AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM. We need FF, the sampling frequency index, and convert it to the actual frequency:
bits MPEG1 MPEG2 MPEG2.5
00 44100 22050 11025
01 48000 24000 12000
10 32000 16000 8000
11 reserv. reserv. reserv.
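A lookup based on the table above might look like the following sketch. The mapping of the BB version bits (11 = MPEG1, 10 = MPEG2, 00 = MPEG2.5, 01 reserved) is not given in this answer; it is taken here as an assumption from the common MP3 frame header description, and the names Mp3HeaderSketch and SampleRate are made up.

```csharp
using System;

// Hypothetical helper: extracts the sampling frequency from a 4-byte MP3 frame header.
public static class Mp3HeaderSketch
{
    // MPEG1 column of the table; MPEG2 is half and MPEG2.5 a quarter of these.
    private static readonly int[] Mpeg1Rates = { 44100, 48000, 32000 };

    public static int SampleRate(byte[] h)
    {
        // AAAAAAAA AAA: eleven sync bits, all set.
        if (h.Length < 4 || h[0] != 0xFF || (h[1] & 0xE0) != 0xE0)
            throw new FormatException("Not an MP3 frame header.");

        int version = (h[1] >> 3) & 0x03;   // BB (assumed: 3 = MPEG1, 2 = MPEG2, 0 = MPEG2.5)
        int index = (h[2] >> 2) & 0x03;     // FF, the sampling frequency index
        if (version == 1 || index == 3)
            throw new FormatException("Reserved version or frequency value.");

        int rate = Mpeg1Rates[index];
        switch (version)
        {
            case 3: return rate;        // MPEG1
            case 2: return rate / 2;    // MPEG2
            default: return rate / 4;   // MPEG2.5
        }
    }
}
```

To filter a folder, one would read the first frame header of each file at the offset found after the ID3 tag and compare the result against the threshold.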
You can use UltraID3Lib to get mp3 metadata (bitrate, frequency)
Check value of frequency bits in a file. There is some info about mp3 format.