I have a big chunk of data containing ~1.5 million entries. Each entry is an instance of a class like this:
public class Element
{
public Guid ID { get; set; }
public string name { get; set; }
public property p... p1... p2...
}
I have a list of Guids (~4 million) for which I need to look up the names from the list of Element instances.
I'm storing the Element objects in a Dictionary, but it takes ~90 seconds to populate the data. Is there any way to improve performance when adding items to the dictionary? The data doesn't have duplicates but I know that the dictionary checks for duplicates when adding a new item.
The structure doesn't need to be a dictionary if there's a better one. I tried putting the Element objects in a List, which performed much better (~9 s) when adding, but then finding the items for all 4 million Guids took more than 10 minutes, both with List.Find() and by iterating the list manually.
Also, if instead of using System.Guid I convert them all to String and store their string representation in the data structures, both operations (populating the dictionary and filling in the names in the other list) take only ~10 s, but then my application consumes 1.2 GB of RAM instead of 600 MB when storing them as System.Guid.
Any ideas on how to make this perform better?
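For reference, the lookup I'm describing is roughly this (a simplified sketch; the collection names are just placeholders):
var byId = new Dictionary<Guid, Element>(elements.Count); // pre-sizing with a capacity avoids rehashing while adding
foreach (var e in elements)
    byId.Add(e.ID, e);
var names = new List<string>(guidsToResolve.Count);
foreach (var id in guidsToResolve)
    names.Add(byId[id].name);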
Your problem is perhaps connected to "sequential" Guids, like:
c482fbe1-9f16-4ae9-a05c-383478ec9d11
c482fbe1-9f16-4ae9-a05c-383478ec9d12
c482fbe1-9f16-4ae9-a05c-383478ec9d13
c482fbe1-9f16-4ae9-a05c-383478ec9d14
c482fbe1-9f16-4ae9-a05c-383478ec9d15
The Dictionary<,> has a problem with those, because they often end up with the same GetHashCode(), so lookups degrade from O(1) toward O(n)... You can solve it by using a custom equality comparer that calculates the hash in a different way, like:
public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();
#region IEqualityComparer<Guid> Members
public bool Equals(Guid x, Guid y)
{
return x.Equals(y);
}
public int GetHashCode(Guid obj)
{
var bytes = obj.ToByteArray();
uint hash1 = (uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24);
uint hash2 = (uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24);
uint hash3 = (uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24);
uint hash4 = (uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24);
int hash = 37;
unchecked
{
hash = hash * 23 + (int)hash1;
hash = hash * 23 + (int)hash2;
hash = hash * 23 + (int)hash3;
hash = hash * 23 + (int)hash4;
}
return hash;
}
#endregion
}
Then you simply declare the dictionary like this:
var dict = new Dictionary<Guid, Element>(ReverseGuidEqualityComparer.Default);
A little test to see the difference:
private static void Increment(byte[] x)
{
for (int i = x.Length - 1; i >= 0; i--)
{
if (x[i] != 0xFF)
{
x[i]++;
return;
}
x[i] = 0;
}
}
and
// You can try timing this program with the default GetHashCode:
//var dict = new Dictionary<Guid, object>();
var dict = new Dictionary<Guid, object>(ReverseGuidEqualityComparer.Default);
var hs1 = new HashSet<int>();
var hs2 = new HashSet<int>();
{
var guid = Guid.NewGuid();
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1500000; i++)
{
hs1.Add(ReverseGuidEqualityComparer.Default.GetHashCode(guid));
hs2.Add(guid.GetHashCode());
dict.Add(guid, new object());
var bytes = guid.ToByteArray();
Increment(bytes);
guid = new Guid(bytes);
}
sw.Stop();
Console.WriteLine("Milliseconds: {0}", sw.ElapsedMilliseconds);
}
Console.WriteLine("ReverseGuidEqualityComparer distinct hashes: {0}", hs1.Count);
Console.WriteLine("Guid.GetHashCode() distinct hashes: {0}", hs2.Count);
With sequential Guids the difference in the number of distinct hash codes is staggering:
ReverseGuidEqualityComparer distinct hashes: 1500000
Guid.GetHashCode() distinct hashes: 256
Now... If you don't want to use ToByteArray() (because it allocates memory needlessly), there is a solution that uses reflection and expression trees... It should work correctly on Mono, because Mono aligned its Guid implementation with Microsoft's back in 2004, which is ancient history :-)
public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();
public static readonly Func<Guid, int> GetHashCodeFunc;
static ReverseGuidEqualityComparer()
{
var par = Expression.Parameter(typeof(Guid));
var hash = Expression.Variable(typeof(int));
var const23 = Expression.Constant(23);
var const8 = Expression.Constant(8);
var const16 = Expression.Constant(16);
var const24 = Expression.Constant(24);
var b = Expression.Convert(Expression.Convert(Expression.Field(par, "_b"), typeof(ushort)), typeof(uint));
var c = Expression.Convert(Expression.Convert(Expression.Field(par, "_c"), typeof(ushort)), typeof(uint));
var d = Expression.Convert(Expression.Field(par, "_d"), typeof(uint));
var e = Expression.Convert(Expression.Field(par, "_e"), typeof(uint));
var f = Expression.Convert(Expression.Field(par, "_f"), typeof(uint));
var g = Expression.Convert(Expression.Field(par, "_g"), typeof(uint));
var h = Expression.Convert(Expression.Field(par, "_h"), typeof(uint));
var i = Expression.Convert(Expression.Field(par, "_i"), typeof(uint));
var j = Expression.Convert(Expression.Field(par, "_j"), typeof(uint));
var k = Expression.Convert(Expression.Field(par, "_k"), typeof(uint));
var sc = Expression.LeftShift(c, const16);
var se = Expression.LeftShift(e, const8);
var sf = Expression.LeftShift(f, const16);
var sg = Expression.LeftShift(g, const24);
var si = Expression.LeftShift(i, const8);
var sj = Expression.LeftShift(j, const16);
var sk = Expression.LeftShift(k, const24);
var body = Expression.Block(new[]
{
hash
},
new Expression[]
{
Expression.Assign(hash, Expression.Constant(37)),
Expression.MultiplyAssign(hash, const23),
Expression.AddAssign(hash, Expression.Field(par, "_a")),
Expression.MultiplyAssign(hash, const23),
Expression.AddAssign(hash, Expression.Convert(Expression.Or(b, sc), typeof(int))),
Expression.MultiplyAssign(hash, const23),
Expression.AddAssign(hash, Expression.Convert(Expression.Or(d, Expression.Or(se, Expression.Or(sf, sg))), typeof(int))),
Expression.MultiplyAssign(hash, const23),
Expression.AddAssign(hash, Expression.Convert(Expression.Or(h, Expression.Or(si, Expression.Or(sj, sk))), typeof(int))),
hash
});
GetHashCodeFunc = Expression.Lambda<Func<Guid, int>>(body, par).Compile();
}
#region IEqualityComparer<Guid> Members
public bool Equals(Guid x, Guid y)
{
return x.Equals(y);
}
public int GetHashCode(Guid obj)
{
return GetHashCodeFunc(obj);
}
#endregion
// For comparison purpose, not used
public int GetHashCodeSimple(Guid obj)
{
var bytes = obj.ToByteArray();
unchecked
{
int hash = 37;
hash = hash * 23 + (int)((uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24));
hash = hash * 23 + (int)((uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24));
hash = hash * 23 + (int)((uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24));
hash = hash * 23 + (int)((uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24));
return hash;
}
}
}
Other solution, based on "undocumented but working" programming (tested on .NET and Mono):
public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();
#region IEqualityComparer<Guid> Members
public bool Equals(Guid x, Guid y)
{
return x.Equals(y);
}
public int GetHashCode(Guid obj)
{
GuidToInt32 gtoi = new GuidToInt32 { Guid = obj };
unchecked
{
int hash = 37;
hash = hash * 23 + gtoi.Int32A;
hash = hash * 23 + gtoi.Int32B;
hash = hash * 23 + gtoi.Int32C;
hash = hash * 23 + gtoi.Int32D;
return hash;
}
}
#endregion
[StructLayout(LayoutKind.Explicit)]
private struct GuidToInt32
{
[FieldOffset(0)]
public Guid Guid;
[FieldOffset(0)]
public int Int32A;
[FieldOffset(4)]
public int Int32B;
[FieldOffset(8)]
public int Int32C;
[FieldOffset(12)]
public int Int32D;
}
}
It uses the StructLayout "trick" to superimpose a Guid to a bunch of int, write to the Guid and read the int.
Why does Guid.GetHashCode() have problems with sequential ids?
Very easy to explain: from the reference source, the GetHashCode() is:
return _a ^ (((int)_b << 16) | (int)(ushort)_c) ^ (((int)_f << 24) | _k);
so the _d, _e, _g, _h, _i, _j bytes aren't part of the hash code. When a Guid is incremented, the _k field changes first (256 values), then on overflow the _j field (256 * 256 = 65,536 values), then the _i field (16,777,216 values). Because _h, _i and _j aren't hashed, a sequential range of Guids produces only 256 distinct hash values (or at most 512 if the _f field happens to be incremented once, e.g. if you start with a Guid like 12345678-1234-1234-1234-aaffffffff00, where aa (our _f) becomes ab after 256 increments of the Guid).
I'm not; the dictionary key is the ID property of the Element class, not the Element class itself. That property is of type System.Guid.
The problem with Guid is that it's a very specialized construct. For one thing, it's a struct, not a class: moving it around isn't as simple as copying a reference; the whole 16-byte value gets copied.
Also, looking at the source code, it stores all the parts as separate fields, 11 of them! That's a lot of comparisons across 1.5 million entries.
What I'd do is create a sort of alternate Guid implementation (class, not struct!) tailor-made for efficient comparisons. All the fancy parsing isn't needed; just focus on speed. Guids are 16 bytes long, which means two long fields. You need to implement Equals as usual (compare the two fields) and GetHashCode as something like XORing the fields. I'm certain that's good enough.
Edit: Note that I'm not saying the framework-provided implementation is bad, it's just not made for what you're trying to do with it. In fact it's terrible for your purpose.
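A minimal sketch of such a key class, assuming the exact names and hash mix are just illustrative:
public sealed class FastGuidKey : IEquatable<FastGuidKey>
{
    private readonly long _lo;
    private readonly long _hi;

    public FastGuidKey(Guid guid)
    {
        // Split the 16 Guid bytes into two 64-bit halves.
        var bytes = guid.ToByteArray();
        _lo = BitConverter.ToInt64(bytes, 0);
        _hi = BitConverter.ToInt64(bytes, 8);
    }

    public bool Equals(FastGuidKey other) =>
        other != null && _lo == other._lo && _hi == other._hi;

    public override bool Equals(object obj) => Equals(obj as FastGuidKey);

    public override int GetHashCode()
    {
        // XOR the halves, then fold to 32 bits so every byte contributes.
        long x = _lo ^ _hi;
        return (int)x ^ (int)(x >> 32);
    }
}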
If your data is pre-sorted, you can use List<T>.BinarySearch to search the list quickly. You will need to create a comparer class and use it for the lookups.
class ElementComparer : IComparer<Element>
{
public int Compare(Element x, Element y)
{
// assume x and y is not null
return x.ID.CompareTo(y.ID);
}
}
Then use it
var comparer = new ElementComparer();
var elements = new List<Element>(1500000); // specify capacity might help a bit
//... (load your list here. Sort it with elements.Sort(comparer) if needed)
Guid guid = elements[20].ID; // for the sake of testing
int idx = elements.BinarySearch(new Element { ID = guid }, comparer);
You can wrap this whole thing in an IReadOnlyDictionary<Guid, Element> if you want, but you may not need that in this case.
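For illustration, a hedged sketch of the lookup loop over the sorted list (guidsToResolve is a placeholder for the ~4 million Guids):
var names = new List<string>(guidsToResolve.Count);
var probe = new Element();
foreach (var id in guidsToResolve)
{
    probe.ID = id;
    int i = elements.BinarySearch(probe, comparer);
    if (i >= 0) // BinarySearch returns a negative value when the ID isn't found
        names.Add(elements[i].name);
}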
If I run this:
Console.WriteLine("Foo".GetHashCode());
Console.WriteLine("Foo".GetHashCode());
it will print the same number twice but if I run the program again it will print a different number.
According to Microsoft and other places on the internet, we cannot rely on GetHashCode to return the same value across runs. But if I plan on using it on strings only, how can I make use of it and expect it to always return the same value for the same string? I love how fast it is. It would be great if I could get its source code and use it in my application.
Reason why I need it (you may skip this part)
I have a lot of complex objects that I need to serialize and send between processes. As you know, BinaryFormatter is now obsolete, so I tried System.Text.Json to serialize my objects. That was very fast, but because I have a lot of complex objects the deserialization did not work well, since I make heavy use of polymorphism. Then I tried Newtonsoft (Json.NET), which worked great with this example: https://stackoverflow.com/a/71398251/637142, but it was very slow.
I then decided to use the best option, protobuf. I was using protobuf-net and that worked great, but some of my objects are very complex and it was a pain to place thousands of attributes. For example, I have a base class that is used by 70 other classes; I would have had to place an inheritance attribute for every single one, which was not practical.
So lastly I decided to implement my own algorithm; it was not that complicated. I just traverse the properties of each object, and if a property is not a value type I traverse it again recursively. For this custom serialization to be fast I need to keep all the reflection objects in memory, so I have a dictionary of types and PropertyInfos. The first serialization is slow, but after that it is even faster than protobuf! The tradeoffs are that every process must have exactly the same object definitions, and that the output is larger than protobuf's, because every time I serialize a property I prefix it with the property's full name. As a result I want to hash the full name of the property into an integer (4 bytes), and the GetHashCode() function does exactly that!
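Roughly, the reflection cache looks like this (simplified sketch; the names are illustrative):
static readonly Dictionary<Type, PropertyInfo[]> PropertyCache = new();

static PropertyInfo[] GetCachedProperties(Type type)
{
    if (!PropertyCache.TryGetValue(type, out var props))
    {
        props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance);
        PropertyCache[type] = props; // reflection cost is paid once per type
    }
    return props;
}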
A lot of people may suggest that I should use MD5 or a different alternative but take a look at the performance difference:
// generate 1 million random GUIDS
List<string> randomGuids = new List<string>();
for (int i = 0; i < 1_000_000; i++)
randomGuids.Add(Guid.NewGuid().ToString());
// needed to measure time
var sw = new Stopwatch();
sw.Start();
// using md5 (takes approx. 260 ms)
using (var md5 = MD5.Create())
{
sw.Restart();
foreach (var guid in randomGuids)
{
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(guid);
byte[] hashBytes = md5.ComputeHash(inputBytes);
// make use of hashBytes to make sure code is compiled
if (hashBytes.Length == 44)
throw new Exception();
}
var elapsed = sw.Elapsed.TotalMilliseconds;
Console.WriteLine($"md5: {elapsed}");
}
// using .NET Framework 4.7 source code (takes approx. 65 ms)
{
[System.Security.SecuritySafeCritical] // auto-generated
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
static int GetHashCodeDotNetFramework4_7(string str)
{
#if FEATURE_RANDOMIZED_STRING_HASHING
if(HashHelpers.s_UseRandomizedStringHashing)
{
return InternalMarvin32HashString(this, this.Length, 0);
}
#endif // FEATURE_RANDOMIZED_STRING_HASHING
unsafe
{
fixed (char* src = str)
{
#if WIN32
int hash1 = (5381<<16) + 5381;
#else
int hash1 = 5381;
#endif
int hash2 = hash1;
#if WIN32
// 32 bit machines.
int* pint = (int *)src;
int len = this.Length;
while (len > 2)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
pint += 2;
len -= 4;
}
if (len > 0)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
}
#else
int c;
char* s = src;
while ((c = s[0]) != 0)
{
hash1 = ((hash1 << 5) + hash1) ^ c;
c = s[1];
if (c == 0)
break;
hash2 = ((hash2 << 5) + hash2) ^ c;
s += 2;
}
#endif
#if DEBUG
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= -484733382;
#endif
return hash1 + (hash2 * 1566083941);
}
}
}
sw.Restart();
foreach (var guid in randomGuids)
if (GetHashCodeDotNetFramework4_7(guid) == 1234567)
throw new Exception("this will probably never happen");
var elapsed = sw.Elapsed.TotalMilliseconds;
Console.WriteLine($".NetFramework4.7SourceCode: {elapsed}");
}
// using the .NET 6 built-in GetHashCode function (takes approx. 22 ms)
{
sw.Restart();
foreach (var guid in randomGuids)
if (guid.GetHashCode() == 1234567)
throw new Exception("this will probably never happen");
var elapsed = sw.Elapsed.TotalMilliseconds;
Console.WriteLine($".net6: {elapsed}");
}
Running this in release mode, these were my results:
md5: 254.7139
.NetFramework4.7SourceCode: 74.2588
.net6: 23.274
I got the source code from .NET Framework 4.8 from this link: https://referencesource.microsoft.com/#mscorlib/system/string.cs,8281103e6f23cb5c
Anyway, searching on the internet I found this helpful article:
https://andrewlock.net/why-is-string-gethashcode-different-each-time-i-run-my-program-in-net-core/
and I have done exactly what it tells you to do and I have added:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<runtime>
<UseRandomizedStringHashAlgorithm enabled="1" />
</runtime>
</configuration>
to my app.config file and still I get different values for "foo".GetHashCode() every time I run my application.
How can I make the GetHashCode() method always return the same value for the string "foo" in .NET 6?
Edit
I will just use the .NET Framework 4.8 source-code solution that took ~73 ms to execute and move on. I was just curious to understand why the built-in hash code is so much faster.
At least I now understand why the hash is different every time. Looking at the source code of .NET 6, the hash differs on every run because of this:
namespace System
{
internal static partial class Marvin
{
... .net source code
....
public static ulong DefaultSeed { get; } = GenerateSeed();
private static unsafe ulong GenerateSeed()
{
ulong seed;
Interop.GetRandomBytes((byte*)&seed, sizeof(ulong));
return seed;
}
}
}
So, just for fun, I tried this, and it still did not work:
var ass = typeof(string).Assembly;
var marvin = ass.GetType("System.Marvin");
var defaultSeed = marvin.GetProperty("DefaultSeed");
var value = defaultSeed.GetValue(null); // returns 3644491462759144438
var field = marvin.GetField("<DefaultSeed>k__BackingField", BindingFlags.NonPublic | BindingFlags.Static);
ulong v = 3644491462759144438;
field.SetValue(null, v);
but on the last line I get the exception: System.FieldAccessException: 'Cannot set initonly static field '<DefaultSeed>k__BackingField' after type 'System.Marvin' is initialized.'
But even if this worked, it would be very unsafe. I'd rather have something that executes three times slower and move on.
Why not use the implementation suggested in the article you shared?
I'm copying it for reference:
static int GetDeterministicHashCode(this string str)
{
unchecked
{
int hash1 = (5381 << 16) + 5381;
int hash2 = hash1;
for (int i = 0; i < str.Length; i += 2)
{
hash1 = ((hash1 << 5) + hash1) ^ str[i];
if (i == str.Length - 1)
break;
hash2 = ((hash2 << 5) + hash2) ^ str[i + 1];
}
return hash1 + (hash2 * 1566083941);
}
}
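Assuming the method above is placed in a static class (so the extension syntax compiles), usage looks like this; the same string maps to the same 32-bit value on every run and in every process, unlike string.GetHashCode() in .NET 6:
int key = "MyProperty".GetDeterministicHashCode();
Console.WriteLine(key); // identical on every run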
Currently I'm writing client-server code and deal a lot with C++ structs passed over the network.
I know about the approaches described in Reading a C/C++ data structure in C# from a byte array, but they are all about making a copy.
I want to have something like this:
struct/*or class*/ SomeStruct
{
public uint F1;
public uint F2;
public uint F3;
}
Later in my code I want to have something like that:
byte[] Data; //16 bytes that I got from network
SomeStruct PartOfDataAsSomeStruct { get { return /*make SomeStruct instance based on this.Data starting from index 4, without copying it. So when I do PartOfDataAsSomeStruct.F1 = 132465; it also changes bytes 4, 5, 6 and 7 in this.Data.*/; } }
If this is possible, please, tell how?
Like so?
byte[] data = new byte[16];
// 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
Console.WriteLine(BitConverter.ToString(data));
ref SomeStruct typed = ref Unsafe.As<byte, SomeStruct>(ref data[4]);
typed.F1 = 42;
typed.F2 = 3;
typed.F3 = 9;
// 00-00-00-00-2A-00-00-00-03-00-00-00-09-00-00-00
Console.WriteLine(BitConverter.ToString(data));
This coerces the data from the middle of the byte-array using a ref-local that is an "interior managed pointer" to the data. Zero copies.
If you need multiple items (like how a vector would work), you can do the same thing with spans and MemoryMarshal.Cast.
Note that it uses CPU-endian rules for the elements (little-endian in my case).
For spans:
byte[] data = new byte[256];
// create a span of some of it
var span = new Span<byte>(data, 4, 128);
// now coerce the span
var typed = MemoryMarshal.Cast<byte, SomeStruct>(span);
Console.WriteLine(typed.Length); // 10 of them fit
typed[3].F1 = 3; // etc
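If the wire format has to be a fixed endianness regardless of the CPU, one hedged alternative is to read and write the fields explicitly with System.Buffers.Binary.BinaryPrimitives (sketch; the field offsets assume the layout above):
// read F1 (bytes 4-7) as little-endian regardless of the machine's endianness
uint f1 = BinaryPrimitives.ReadUInt32LittleEndian(data.AsSpan(4, 4));
// write it back the same way
BinaryPrimitives.WriteUInt32LittleEndian(data.AsSpan(4, 4), 42);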
Thank you for the correction, Marc Gravell. And thank you for the example.
Here is a way to do the same thing using a class and bitwise operators, without pointers:
class SomeClass
{
public byte[] Data;
public SomeClass()
{
Data = new byte[16];
}
public uint F1
{
get
{
uint ret = (uint)(Data[4] << 24 | Data[5] << 16 | Data[6] << 8 | Data[7]);
return ret;
}
set
{
Data[4] = (byte)(value >> 24);
Data[5] = (byte)(value >> 16);
Data[6] = (byte)(value >> 8);
Data[7] = (byte)value;
}
}
}
Testing:
SomeClass sc = new SomeClass();
sc.F1 = 0b_00000001_00000010_00000011_00000100;
Console.WriteLine(sc.Data[4].ToString() + " " + sc.Data[5].ToString() + " " + sc.Data[6].ToString() + " " + sc.Data[7].ToString());
Console.WriteLine(sc.F1.ToString());
//Output:
//1 2 3 4
//16909060
I have an object that has the following variables:
bool firstBool;
float firstFloat; (0.0 to 1.0)
float secondFloat; (0.0 to 1.0)
int firstInt; (0 to 10,000)
I was using a ToString method to get a string that I can send over the network. Scaling up I have encountered issues with the amount of data this is taking up.
The string looks like this at the moment:
"false:1.0:1.0:10000" - that's 19 characters at 2 bytes each, so 38 bytes.
I know that I can save on this size by manually storing the data in 4 bytes like this:
A|B|B|B|B|B|B|B
C|C|C|C|C|C|C|D
D|D|D|D|D|D|D|D
D|D|D|D|D|X|X|X
A = bool(0 or 1), B = int(0 to 128), C = int(0 to 128), D = int(0 to 16384), X = Leftover bits
I convert the floats (0.0 to 1.0) to ints (0 to 128) since I can rebuild them on the other end and the accuracy isn't super important.
I have been experimenting with BitArray and byte[] to convert the data into and out of the binary structure.
After some experiments I ended up with this serialization process (I know it needs to be cleaned up and optimized):
public byte[] Serialize() {
byte[] firstFloatBytes = BitConverter.GetBytes(Mathf.FloorToInt(firstFloat * 128)); //Convert the float to int from (0 to 128)
byte[] secondFloatBytes = BitConverter.GetBytes(Mathf.FloorToInt(secondFloat * 128)); //Convert the float to int from (0 to 128)
byte[] firstIntData = BitConverter.GetBytes(Mathf.FloorToInt(firstInt)); // Get the bytes for the int
BitArray data = new BitArray(32); // create the size 32 bitarray to hold all the data
int i = 0; // create the index value
data[i] = firstBool; // set the 0 bit
BitArray ffBits = new BitArray(firstFloatBytes);
for(i = 1; i < 8; i++) {
data[i] = ffBits[i-1]; // Set bits 1 to 7
}
BitArray sfBits = new BitArray(secondFloatBytes);
for(i = 8; i < 15; i++) {
data[i] = sfBits[i-8]; // Set bits 8 to 14
}
BitArray fiBits = new BitArray(firstIntData);
for(i = 15; i < 29; i++) {
data[i] = fiBits[i-15]; // Set bits 15 to 28
}
byte[] output = new byte[4]; // create a byte[] to hold the output
data.CopyTo(output,0); // Copy the bits to the byte[]
return output;
}
Getting the information back out of this structure is much more complicated than getting it into this form. I figure I can probably work out something using the bitwise operators and bitmasks.
This is proving to be more complicated than I was expecting. I thought it would be very easy to access the bits of a byte[] to manipulate the data directly, extract ranges of bits, then convert back to the values required to rebuild the object. Are there best practices for this type of data serialization? Does anyone know of a tutorial or example reference I could read?
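For what it's worth, here is a hedged sketch of the reverse step using masks and shifts, assuming the exact bit layout produced by the Serialize method above (bit 0 = bool, bits 1-7 and 8-14 = the two scaled floats, bits 15-28 = the int):
public void Deserialize(byte[] data) {
    // Rebuild the 32 bits in the same LSB-first order that BitArray.CopyTo produced them.
    uint bits = (uint)(data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24));
    firstBool = (bits & 1u) != 0;
    firstFloat = ((bits >> 1) & 0x7Fu) / 128f;   // 7 bits back to roughly 0.0-1.0
    secondFloat = ((bits >> 8) & 0x7Fu) / 128f;  // 7 bits back to roughly 0.0-1.0
    firstInt = (int)((bits >> 15) & 0x3FFFu);    // 14 bits, 0-16383
}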
Standard and efficient serialization methods are:
Using BinaryWriter / BinaryReader:
public byte[] Serialize()
{
using(var s = new MemoryStream())
using(var w = new BinaryWriter(s))
{
w.Write(firstBool);
w.Write(firstFloat);
...
return s.ToArray();
}
}
public void Deserialize(byte[] bytes)
{
using(var s = new MemoryStream(bytes))
using(var r = new BinaryReader(s))
{
firstBool = r.ReadBoolean();
firstFloat = r.ReadSingle();
...
}
}
Using protobuf-net
BinaryWriter / BinaryReader is much faster (around 7 times). Protobuf is more flexible, easy to use, very popular, and serializes into around 33% fewer bytes. (Of course these numbers are rough ballpark figures and depend on what you serialize and how.)
Now, basically, BinaryWriter will write 1 + 4 + 4 + 4 = 13 bytes. You can shrink that to 5 bytes by first converting the values to bool, byte, byte, short, rounding them the way you want. Finally, it's easy to merge the bool into one of your bytes to get down to 4 bytes if you really want to.
I don't really discourage manual serialization, but it has to be worth the price: the code gets quite unreadable. Use bit masks and shifts on bytes directly, and keep it as simple as possible. Don't use BitArray; it's slow and no more readable.
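A hedged sketch of the 4-byte variant described above (the bool rides in the top bit of the first byte, the floats become 0-127, and the int fits in a short):
public byte[] Serialize()
{
    byte b0 = (byte)((firstBool ? 0x80 : 0x00) | (int)(firstFloat * 127));
    byte b1 = (byte)(secondFloat * 127);
    ushort i = (ushort)firstInt; // 0-10,000 fits easily in 16 bits
    return new byte[] { b0, b1, (byte)(i >> 8), (byte)i };
}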
Here is a simple method to pack/unpack. But you lose accuracy converting a float to only 7/8 bits:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
foreach (Data data in Data.input)
{
Data.Print(data);
Data results = Data.Unpack(Data.Pack(data));
Data.Print(results);
}
Console.ReadLine();
}
}
public class Data
{
public static List<Data> input = new List<Data>() {
new Data() { firstBool = true, firstFloat = 0.2345F, secondFloat = 0.432F, firstInt = 12},
new Data() { firstBool = true, firstFloat = 0.3445F, secondFloat = 0.432F, firstInt = 11},
new Data() { firstBool = false, firstFloat = 0.2365F, secondFloat = 0.432F, firstInt = 9},
new Data() { firstBool = false, firstFloat = 0.545F, secondFloat = 0.432F, firstInt = 8},
new Data() { firstBool = true, firstFloat = 0.2367F, secondFloat = 0.432F, firstInt = 7}
};
public bool firstBool { get; set; }
public float firstFloat {get; set; } //(0.0 to 1.0)
public float secondFloat {get; set; } //(0.0 to 1.0)
public int firstInt { get; set; } //(0 to 10,000)
public static byte[] Pack(Data data)
{
byte[] results = new byte[4];
results[0] = (byte)((data.firstBool ? 0x80 : 0x00) | (byte)(data.firstFloat * 128));
results[1] = (byte)(data.secondFloat * 256);
results[2] = (byte)((data.firstInt >> 8) & 0xFF);
results[3] = (byte)(data.firstInt & 0xFF);
return results;
}
public static Data Unpack(byte[] data)
{
Data results = new Data();
results.firstBool = ((data[0] & 0x80) == 0) ? false : true;
results.firstFloat = ((float)(data[0] & 0x7F)) / 128.0F;
results.secondFloat = (float)data[1] / 256.0F;
results.firstInt = (data[2] << 8) | data[3];
return results;
}
public static void Print(Data data)
{
Console.WriteLine("Bool : '{0}', 1st Float : '{1}', 2nd Float : '{2}', Int : '{3}'",
data.firstBool,
data.firstFloat,
data.secondFloat,
data.firstInt
);
}
}
}
I have a struct that gets used all over the place, that I store as a byte array on disk, and that I also send to other platforms.
I used to do this by getting a string version of the struct and using getBytes(utf-8) and getString(utf-8) during serialization. With that I guess I avoided the little/big-endian problems?
However that was quite a bit of overhead and I am now using this:
public static explicit operator byte[] (Int3 self)
{
byte[] int3ByteArr = new byte[12];//4*3
int x = self.x;
int3ByteArr[0] = (byte)x;
int3ByteArr[1] = (byte)(x >> 8);
int3ByteArr[2] = (byte)(x >> 0x10);
int3ByteArr[3] = (byte)(x >> 0x18);
int y = self.y;
int3ByteArr[4] = (byte)y;
int3ByteArr[5] = (byte)(y >> 8);
int3ByteArr[6] = (byte)(y >> 0x10);
int3ByteArr[7] = (byte)(y >> 0x18);
int z = self.z;
int3ByteArr[8] = (byte)z;
int3ByteArr[9] = (byte)(z >> 8);
int3ByteArr[10] = (byte)(z >> 0x10);
int3ByteArr[11] = (byte)(z >> 0x18);
return int3ByteArr;
}
public static explicit operator Int3(byte[] self)
{
int x = self[0] + (self[1] << 8) + (self[2] << 0x10) + (self[3] << 0x18);
int y = self[4] + (self[5] << 8) + (self[6] << 0x10) + (self[7] << 0x18);
int z = self[8] + (self[9] << 8) + (self[10] << 0x10) + (self[11] << 0x18);
return new Int3(x, y, z);
}
It works quite well for me, but I am not quite sure how little/big endian works. Do I still have to take care of something here to be safe when another machine receives an int I sent as a byte array?
Your current approach will not work when your application runs on a big-endian system. (On such a system no reordering would be needed at all, since network byte order is already big-endian.)
You don't need to reverse byte arrays yourself, and you don't need to check the endianness of the system yourself:
The static method IPAddress.HostToNetworkOrder converts an integer to big-endian (network) order.
The static method IPAddress.NetworkToHostOrder converts an integer back to the order your system uses.
Those methods check the endianness of the system and reorder the integers only when needed.
For getting bytes from an integer and back, use BitConverter:
public struct ThreeIntegers
{
public int One;
public int Two;
public int Three;
}
public static byte[] ToBytes(this ThreeIntegers value )
{
byte[] bytes = new byte[12];
byte[] bytesOne = IntegerToBytes(value.One);
Buffer.BlockCopy(bytesOne, 0, bytes, 0, 4);
byte[] bytesTwo = IntegerToBytes(value.Two);
Buffer.BlockCopy(bytesTwo , 0, bytes, 4, 4);
byte[] bytesThree = IntegerToBytes(value.Three);
Buffer.BlockCopy(bytesThree , 0, bytes, 8, 4);
return bytes;
}
public static byte[] IntegerToBytes(int value)
{
int reordered = IPAddress.HostToNetworkOrder(value);
return BitConverter.GetBytes(reordered);
}
And converting from bytes to struct
public static ThreeIntegers GetThreeIntegers(byte[] bytes)
{
int rawValueOne = BitConverter.ToInt32(bytes, 0);
int valueOne = IPAddress.NetworkToHostOrder(rawValueOne);
int rawValueTwo = BitConverter.ToInt32(bytes, 4);
int valueTwo = IPAddress.NetworkToHostOrder(rawValueTwo);
int rawValueThree = BitConverter.ToInt32(bytes, 8);
int valueThree = IPAddress.NetworkToHostOrder(rawValueThree);
return new ThreeIntegers { One = valueOne, Two = valueTwo, Three = valueThree };
}
If you use BinaryReader and BinaryWriter for saving and sending to other platforms, then BitConverter and the byte-array manipulation can be dropped.
// BinaryWriter.Write have overload for Int32
public static void SaveThreeIntegers(ThreeIntegers value)
{
using(var stream = CreateYourStream())
using (var writer = new BinaryWriter(stream))
{
int reorderedOne = IPAddress.HostToNetworkOrder(value.One);
writer.Write(reorderedOne);
int reorderedTwo = IPAddress.HostToNetworkOrder(value.Two);
writer.Write(reorderedTwo);
int reorderedThree = IPAddress.HostToNetworkOrder(value.Three);
writer.Write(reorderedThree);
}
}
For reading value
public static ThreeIntegers LoadThreeIntegers()
{
using(var stream = CreateYourStream())
using (var reader = new BinaryReader(stream))
{
int rawValueOne = reader.ReadInt32();
int valueOne = IPAddress.NetworkToHostOrder(rawValueOne);
int rawValueTwo = reader.ReadInt32();
int valueTwo = IPAddress.NetworkToHostOrder(rawValueTwo);
int rawValueThree = reader.ReadInt32();
int valueThree = IPAddress.NetworkToHostOrder(rawValueThree);
return new ThreeIntegers { One = valueOne, Two = valueTwo, Three = valueThree };
}
}
Of course you can refactor the methods above into a cleaner solution, or add them as extension methods for BinaryWriter and BinaryReader.
Yes, you do. If endianness changes, a serialization that just preserves the in-memory byte ordering will run into trouble.
Take the int value 385.
On a big-endian system its four bytes are stored as
00 00 01 81 (hex), i.e. 00000000 00000000 00000001 10000001
A little-endian reader treats the first byte as the least significant, so the same bytes are read back as
0x81010000, i.e. 10000001 00000001 00000000 00000000
which is 2164326400 as an unsigned value (a large negative number if read as a signed int).
If you use the BitConverter class, there is a bool property (IsLittleEndian) describing the endianness of the system. BitConverter can also produce the byte arrays for you.
You will have to pick one endianness for the wire format and reverse the byte arrays according to the serializing or deserializing system's endianness.
The description on MSDN is actually quite detailed. Here they use Array.Reverse for simplicity. I am not certain that your casting to/from byte in order to do the bit manipulation is in fact the fastest way of converting, but that is easily benchmarked.
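A hedged sketch of that approach, writing a fixed (big-endian) wire order regardless of the local CPU:
static byte[] ToBigEndianBytes(int value)
{
    byte[] bytes = BitConverter.GetBytes(value); // native order
    if (BitConverter.IsLittleEndian)
        Array.Reverse(bytes); // flip to big-endian on little-endian CPUs
    return bytes;
}

static int FromBigEndianBytes(byte[] bytes)
{
    if (BitConverter.IsLittleEndian)
        Array.Reverse(bytes); // note: reverses the caller's array in place
    return BitConverter.ToInt32(bytes, 0);
}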
I'm making a tile-based 2D platformer and every byte of memory is precious. I have one byte field that can hold values from 0 to 255, but what I need is two properties with values 0-15. How can I turn one byte field into two properties like that?
do you mean just use the lower 4 bits for one value and the upper 4 bits for the other?
To get two values from one byte (calling the backing byte packed, since byte is a reserved word), use:
a = packed & 15;
b = packed / 16;
Setting is just the reverse:
packed = (byte)(a | b * 16);
Using the shift operator is better, but compiler optimizers usually do this for you nowadays:
packed = (byte)(a | (b << 4));
To piggyback on sradforth's answer, and to answer your question about properties:
private byte _myByte;
public byte LowerHalf
{
get
{
return (byte)(_myByte & 15);
}
set
{
_myByte = (byte)(value | UpperHalf * 16);
}
}
public byte UpperHalf
{
get
{
return (byte)(_myByte / 16);
}
set
{
_myByte = (byte)(LowerHalf | value * 16);
}
}
Below are some properties and a backing store; I've tried to write them in a way that makes the logic easy to follow.
private byte hiAndLo = 0;
private const byte LoMask = 15; // 00001111
private const byte HiMask = 240; // 11110000
public byte Lo
{
get
{
// ----&&&&
return (byte)(this.hiAndLo & LoMask);
}
set
{
if (value > LoMask)
{
// Values over 15 are too high.
throw new OverflowException();
}
// &&&&0000
// 0000----
// ||||||||
this.hiAndLo = (byte)((this.hiAndLo & HiMask) | value);
}
}
public byte Hi
{
get
{
// &&&&XXXX >> 0000&&&&
return (byte)((this.hiAndLo & HiMask) >> 4);
}
set
{
if (value > LoMask)
{
// Values over 15 are too high.
throw new OverflowException();
}
// -------- << ----0000
// XXXX&&&&
// ||||||||
this.hiAndLo = (byte)((hiAndLo & LoMask) | (value << 4 ));
}
}
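Assuming those members live in some containing class (called Tile here purely for illustration), usage looks like this:
var tile = new Tile();
tile.Hi = 9; // stored in the upper four bits
tile.Lo = 5; // stored in the lower four bits
// hiAndLo is now 0x95 (binary 1001 0101); tile.Hi reads back 9, tile.Lo reads back 5.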