I am designing a hashtable. Current design:
public static Dictionary<UInt64, bool> d1 = new();
Keys are of UInt64 type, values are boolean.
The actual value range of the keys is much smaller than the full UInt64 min/max range,
and I know the exact value range beforehand.
The number of keys is known beforehand as well.
To the best of my knowledge:
hashtable idx = Key.GetHashCode() % hashtable_records_count
As far as I understand, it is possible to lower RAM usage by
overriding the GetHashCode() function of the key type. Is that so?
Under the hood UInt64.GetHashCode is used, and this is not optimal.
My question is:
Can overriding GetHashCode achieve a better distribution of keys, and thus lower RAM usage?
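Something like this is what I have in mind; it is only a sketch, and since UInt64.GetHashCode itself can't be overridden, I assume the mechanism would be a custom IEqualityComparer<ulong> passed to the Dictionary constructor (the mixing constant is just a placeholder):

using System;
using System.Collections.Generic;

// Sketch only: Dictionary<TKey,TValue> accepts a custom IEqualityComparer<TKey>,
// so the hash used for bucketing can be replaced without boxing the keys.
sealed class ClusteredUInt64Comparer : IEqualityComparer<ulong>
{
    public bool Equals(ulong x, ulong y) => x == y;

    public int GetHashCode(ulong key)
    {
        // Spread clustered keys over the full Int32 range before the
        // dictionary reduces the hash modulo its bucket count.
        ulong mixed = key * 0x9E3779B97F4A7C15UL;   // illustrative mixing constant
        return (int)(mixed >> 32);
    }
}

class Demo
{
    static void Main()
    {
        // Pre-sizing with the known key count avoids intermediate resizes.
        var d1 = new Dictionary<ulong, bool>(1_000_000, new ClusteredUInt64Comparer());
        d1[42UL] = true;
        Console.WriteLine(d1[42UL]);
    }
}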
Just found a SO topic stating:
Use a Dictionary<,> instead of HashTable. In a HashTable both the key
and value are objects, so if they are value types they will be boxed.
A Dictionary can have value types as key and/or value, which uses less
memory. If you for example use an int as key, each will use 28 bytes
in a HashTable while they only use 4 bytes in a Dictionary.
dictionary vs hashtable memory usage
I guess I'll stick with Dictionary :-(.
Still, is there any way to lower the RAM usage, e.g. by manipulating the load factor or overriding the default GetHashCode function?
Updated:
@Iridium, thanks for the hint about the type constructor!
Let me try to explain my idea.
Current design:
my_key (UInt64) -> Dictionary (under the hood) .GetHashCode() -> % dic.count = underlying hashtable idx
I may be wrong, but as far as I understand:
if the key count is low, everything works more or less OK,
as long as the default GetHashCode() produces an even distribution over the UInt32 max space.
In my case, with an item count around the UInt32 max value (~2 billion items), this is what happens:
(remember, my input UInt64 keys are clustered in a specific range) GetHashCode will produce
an "uneven" distribution of the resulting "GetHashCode -> % dic.count" index pointers, which may and will result
in a hashtable rehash, as it will store more items in specific buckets and hit the max items per bucket.
I think storing UInt32 takes less space than UInt64, so I'm looking for a more efficient way
to map UInt64 to UInt32 and avoid double hashing.
--
My thoughts may be wrong, so I appeal to the SO wisdom!
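To make the remapping idea concrete, here is a minimal sketch, assuming the known key range spans no more than uint.MaxValue values (the class and member names are just placeholders):

using System;
using System.Collections.Generic;

class CompactKeyTable
{
    private readonly ulong rangeMin;
    private readonly Dictionary<uint, bool> table;

    public CompactKeyTable(ulong rangeMin, ulong rangeMax, int expectedCount)
    {
        if (rangeMax - rangeMin > uint.MaxValue)
            throw new ArgumentException("Key range does not fit in 32 bits.");
        this.rangeMin = rangeMin;
        // Pre-size with the known key count to avoid rehashing on growth.
        table = new Dictionary<uint, bool>(expectedCount);
    }

    // The 32-bit key is simply the offset from the start of the known range.
    public void Add(ulong key, bool value) => table[(uint)(key - rangeMin)] = value;

    public bool TryGet(ulong key, out bool value) =>
        table.TryGetValue((uint)(key - rangeMin), out value);
}

Each stored key then takes 4 bytes instead of 8 inside the dictionary's entries.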
Related
In C++ I was able to use std::map<double, T> which is an ordered dictionary for its keys, but is a Red-Black tree which gives me O(lg n) for both insert and search. I was able to look up whether a value existed within some epsilon by using std::lower_bound and std::upper_bound together.
I have not been able to find the same thing while using C# 7+/.NET Core. Does such a thing exist?
In pseudocode, I'd like to do something like this
Map<float, T> map = ...
// key epsilon newValue
map.Insert(0.5f, 0.1f, someObj); // No values in the map, inserts fine
map.Get( 0.45f, 0.1f); // 0.45 +/- 0.1 contains 0.5, would return someObj
map.Get( 0.3f, 0.1f); // 0.3 +/- 0.1 does not include 0.5, it is not found
map.Insert(0.55f, 0.1f, anotherObj); // 0.55 +/- 0.1 includes 0.5, replace someObj
map.Insert(0.35f, 0.1f, anObj); // 0.35 +/- 0.1 doesn't overlap, insert new value
The way I'd have to do it would be to roll my own self-balancing binary search tree, but I'd rather not reinvent the wheel if such a thing exists.
I've been looking at SortedDictionary, however its Keys field is a collection so I can't jump around in it. Same issue for OrderedDictionary, unless I missed something.
I may not be able to use a SortedList since there will be more insertions than lookups, and due to the random order I'm worried that I'll end up with a lot of O(n) shifts when inserting. I'm assuming a uniform distribution in my input (which is very likely the case because of the data I'm working with), which means insertions towards the middle and the front would cause a lot of shifting if it implements it the way I think it does, giving me an average cost of n/2 per insertion and leaving me at O(n). At least with a binary search tree, I'm getting O(lg n). Therefore the good solution here may not be applicable.
Most importantly, this is an algorithm that is used in a very hot section of the code. Performance is extremely important, choosing something that is not fast will likely drastically damage the performance of the application. I really need O(lg n) or some novel way of doing this that I didn't think of before.
My idea is to combine two data structures, SortedSet and a regular map.
SortedSet has the GetViewBetween method, which has the expected O(log N) performance.
https://github.com/dotnet/corefx/pull/30921
Note: the expected performance of this method is met only in .NET core, it was much slower in the past: Why SortedSet<T>.GetViewBetween isn't O(log N)?
In this set you keep only the float keys.
Additionally, you have a Map from float to your desired type. You perform operations on the map only after checking your SortedSet.
I realize there are some rough edges (when an interval matches several entries in the SortedSet), but I believe this is equivalent to the C++ implementation.
Hope you find this helpful, good luck with the implementation.
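A rough sketch of what I mean, with simplified epsilon handling (it just takes the smallest key in the view; a real implementation has to decide what to do when the view holds several keys):

using System.Collections.Generic;

class EpsilonMap<T>
{
    private readonly SortedSet<float> keys = new SortedSet<float>();
    private readonly Dictionary<float, T> values = new Dictionary<float, T>();

    public void Insert(float key, float epsilon, T value)
    {
        // GetViewBetween exposes the keys already stored inside [key - eps, key + eps].
        var view = keys.GetViewBetween(key - epsilon, key + epsilon);
        if (view.Count > 0)
        {
            // An existing key falls inside the interval: replace its value.
            values[view.Min] = value;
        }
        else
        {
            keys.Add(key);
            values[key] = value;
        }
    }

    public bool TryGet(float key, float epsilon, out T value)
    {
        var view = keys.GetViewBetween(key - epsilon, key + epsilon);
        if (view.Count > 0)
            return values.TryGetValue(view.Min, out value);
        value = default(T);
        return false;
    }
}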
Now, while the answer I'm about to give is profiled in C++ and not C#, it solves the problem in a much better and faster way.
The better way to solve this is to multiply the floating-point value by the inverse of the epsilon. For example, if your epsilon is 0.25, you'd multiply all your floats/doubles by 4 and then cast to an integer (or floor/ceil it if you care about values clustering around zero). The following uses int as the key, but it would work for longs as well. My data fits in the +/- 2^31 range after quantizing (on computers with sizeof(int) of at least 4 bytes), so this is sufficient for me.
#include <unordered_map>

// Consider using std::is_floating_point_v for K
template <typename K, typename V>
class QuantizedGrid {
    int quantizer;
    std::unordered_map<int, V> map;

public:
    explicit QuantizedGrid(const double epsilon) {
        quantizer = 1.0 / epsilon;
    }

    // Scale by the inverse of epsilon and truncate to get the grid cell.
    V& operator[](const K k) {
        return map[static_cast<int>(quantizer * k)];
    }

    bool contains(const K k) const {
        int key = static_cast<int>(quantizer * k);
        return map.count(key) > 0;
    }
};
Compared to using upper/lower bound checks, converting to an integer and inserting into a dictionary that supports O(1) amortized insertion/lookup/delete turned out to be about 650% faster.
It is also way less code than implementing a custom upper/lower bound.
My guess is that the O(lg n) BST lookup time is much worse than the O(1) dictionary time, and the cost of casting a float to an int is small enough to leave this bound by data structure lookups/cache issues.
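For completeness, the same quantization idea translated to C# might look roughly like this (an illustrative sketch, not the profiled code):

using System;
using System.Collections.Generic;

// Keys within the same epsilon-sized cell map to the same integer bucket.
class QuantizedGrid<TValue>
{
    private readonly double quantizer;
    private readonly Dictionary<int, TValue> map = new Dictionary<int, TValue>();

    public QuantizedGrid(double epsilon) => quantizer = 1.0 / epsilon;

    // Floor keeps values just below zero from colliding with the cell at zero.
    private int Quantize(double key) => (int)Math.Floor(quantizer * key);

    public TValue this[double key]
    {
        get => map[Quantize(key)];
        set => map[Quantize(key)] = value;
    }

    public bool Contains(double key) => map.ContainsKey(Quantize(key));
}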
The wiki says (as I understand it) that every time I add an item to a dictionary, the system calculates a hash code (by calling GetHashCode) and then uses that hash code to find the bucket where my value will be stored.
Please explain the logic a Dictionary uses to relate a hash code to an index in the bucket array where my value ends up.
Imagine a situation where I create a Dictionary and add an object whose GetHashCode returns the value 1000000.
Does that mean the Dictionary will internally create an array with 1000000 elements and store my object at index 999999?
If that assumption is correct, what's the point of having an array of such a big size to store only one value?
Your assumption isn't correct, luckily. If it were, they wouldn't actually be buckets, just an index-accessible array of objects. That might be fine for O(1) lookup if your hashcodes are guaranteed to be unique, but that's not the case - in fact, hashcodes are guaranteed not to be unique. You can't map every possible value of an Int64 into a unique Int32 hashcode. That's not what hashcodes are for.
Instead, the dictionary initializes a smaller array of buckets, and then uses a simple Modulo operation to find the bucket. (From the .NET Reference Source)
int targetBucket = hashCode % buckets.Length;
That means that if there are 10 buckets, for instance, it will take the remainder of dividing the hash code by 10. If your hash algorithm does its job well, the hash codes follow a roughly uniform distribution, meaning that any n items, for a big enough n, will probably be divided evenly between the buckets.
As for how many buckets are initialized: the number will be the first prime number that's higher than the capacity passed in the constructor, which defaults to 0 (see here). If this causes too many hash collisions, the table is automatically expanded, jumping to a larger prime size each time.
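A tiny illustration of that reduction (the bucket count of 11 is just an example of a small prime the dictionary might pick):

using System;

class BucketDemo
{
    static void Main()
    {
        // The hash code is reduced modulo the bucket count, so a huge
        // hash code never forces a correspondingly huge backing array.
        int hashCode = 1000000;
        int bucketCount = 11;                        // example prime bucket count
        Console.WriteLine(hashCode % bucketCount);   // prints 1
    }
}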
We have the following code:
int i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 1
This makes sense, and the same happens with all integral types in C# except sbyte and short.
That is:
sbyte i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 257
Why is this?
Because the source of that method (SByte.GetHashCode) is
public override int GetHashCode()
{
    return (int)this ^ ((int)this << 8);
}
As for why, well someone at Microsoft knows that..
Yes, it's all about value distribution. Since the GetHashCode return type is int, for the sbyte type the values are distributed in intervals of 257. For the same reason, the long type will have collisions.
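A quick way to see the interval of 257 is to evaluate the quoted formula by hand for a few small values:

using System;

class SByteHashDemo
{
    static void Main()
    {
        // For small positive values the XOR has no overlapping bits,
        // so each step of the quoted formula adds 257 (= 1 + 256).
        for (int v = 1; v <= 4; v++)
            Console.WriteLine($"{v} -> {v ^ (v << 8)}");
        // 1 -> 257, 2 -> 514, 3 -> 771, 4 -> 1028
    }
}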
The reason is probably to avoid clustering of hash values.
As GetHashCode documentation says:
For the best performance, a hash function must generate a random
distribution for all input.
Providing a good hash function on a class can significantly affect the
performance of adding those objects to a hash table. In a hash table with
a good implementation of a hash function, searching for an element takes
constant time (for example, an O(1) operation).
Also, as this excellent article explains:
Guideline: the distribution of hash codes must be "random"
By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be "clustered"; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.
I'm implementing data serialization and I've encountered a problem.
I've got:
4 byte fields:
Values range 0-255
Values range 0- 4
Values range 0-255
Values range 0- 100
and 1 int field (only positive values).
My idea is to convert everything to a byte array (length 8) or an int array (length 2) and use the C# GetHashCode method.
Is GetHashCode strong enough to use as an identifier for this data?
Or does someone have a better idea, maybe?
EOG
GetHashCode isn't meant to create a unique identifier - its primary use is for assigning values to buckets in hashed data structures (like HashTable) - see http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/. When I need a unique identifier for an object, and for some reason the object itself doesn't provide one, I usually just fall back on GUIDs. They are trivial to generate in C# and guaranteed to be unique within the scope of whatever you're doing.
GetHashCode is purely for hashing in dictionaries. You should not use it as an identifier anywhere because of possible hash collisions. It returns an Int32, and for String, for example, there are clearly more than 2,147,483,647 possible unique strings, so two different strings can have the same hash code. Having said that, you have two options:
1) If you need your identifier to be derived from the actual values. For example, if you need to quickly tell whether a new object is already persisted without deserializing all objects and comparing them to the object in question, you can use ComputeHash on SHA1, for example.
2) If you don't need the identifier to be derived from the actual values, you can simply generate a Guid as bbogovich suggested.
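A minimal sketch of option 1, hashing the packed field bytes with SHA1 (the field values and packing order are only illustrative):

using System;
using System.Security.Cryptography;

class ContentHashDemo
{
    // Illustrative packing: four byte-sized fields plus a positive int into 8 bytes.
    static byte[] Pack(byte a, byte b, byte c, byte d, int e)
    {
        var buffer = new byte[8];
        buffer[0] = a;
        buffer[1] = b;
        buffer[2] = c;
        buffer[3] = d;
        BitConverter.GetBytes(e).CopyTo(buffer, 4);
        return buffer;
    }

    static void Main()
    {
        byte[] data = Pack(200, 3, 17, 99, 123456);
        using (var sha1 = SHA1.Create())
        {
            byte[] id = sha1.ComputeHash(data);   // 20-byte identifier derived from the values
            Console.WriteLine(BitConverter.ToString(id));
        }
    }
}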
The GetHashCode() value for ints and longs (< int.MaxValue) is the same as the value, but for arrays the hash code is based on the reference, not the contents. So don't use it.
Why not convert the entire structure to a long and use that?
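A sketch of that packing, assuming the four byte-sized fields and the positive int are laid out in disjoint bits of a single 64-bit value (field names are placeholders):

using System;

class PackDemo
{
    // Packs four byte-sized fields and a non-negative int into one 64-bit key.
    static ulong Pack(byte a, byte b, byte c, byte d, int e)
    {
        return ((ulong)a << 56) | ((ulong)b << 48) | ((ulong)c << 40)
             | ((ulong)d << 32) | (uint)e;
    }

    static void Main()
    {
        ulong key = Pack(200, 3, 17, 99, 123456);
        Console.WriteLine(key);   // the packed value itself acts as a unique identifier
    }
}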
My initial problem is that I need to implement a very fast, sparse array in C#. Original idea was to use a normal Dictionary<uint, TValue> and wrap it in my own class to only expose the TValue type parameter. Turns out this is pretty slow.
So my next idea was to map each integer in the needed range (UInt32.MinValue to UInt32.MaxValue) to a bucket, of some size and use that. So I'm looking for a good way to map an unsigned integer X to a bucket Y, for example:
Mapping the numbers 0-1023 to 8 different buckets holding 128 numbers each: 0-127, 128-255, and so on.
But if someone has a better way of implementing a fast sparse array in C#, that would be most appreciated also.
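For the mapping itself, a plain integer division (or a shift when the bucket size is a power of two) is enough; a minimal sketch of the 0-1023 example:

using System;

class BucketMapping
{
    static void Main()
    {
        // Each bucket holds 128 numbers, so the bucket index is the
        // value divided by the bucket size (a shift for powers of two).
        const uint bucketSize = 128;
        uint[] samples = { 0, 127, 128, 255, 1023 };
        foreach (uint x in samples)
            Console.WriteLine($"{x} -> bucket {x / bucketSize} (shift: {x >> 7})");
        // 0 -> 0, 127 -> 0, 128 -> 1, 255 -> 1, 1023 -> 7
    }
}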
I, too, noticed that Dictionary<K,V> is slow when the key is an integer. I don’t know exactly why this is the case, but I wrote a faster hash-table implementation for uint and ulong keys:
Efficient32bitHashTable and Efficient64bitHashTable
Caveats/downsides:
The 64-bit one (key is ulong) is generic, but the other one (key is uint) assumes int values because that’s all I needed at the time; I’m sure you can make this generic easily.
Currently the capacity determines the size of the hashtable forever (i.e. it doesn’t grow).
There are 101 different ways to implement sparse arrays, depending on factors like:
How many items will be in the array
How are the items clustered together
Space/speed trade-off
etc.
Most textbooks have a section on sparse arrays, and a quick Google search turns up lots of hits. You will then have to translate the code into C#, or just use code someone else has written; I found two without much effort (I don't know how good they are):
Use Specialty Arrays to Accelerate Your Code
SparseArray for C#