How does double hashing work in the case of the .NET Dictionary? - c#

The other day I was reading that article on CodeProject, and I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation here, without all the optimizations in .NET Core):
Note: If you add more items than the maximum number in the table (i.e. 7199369), the resize method will manually search for the next prime number that is larger than twice the old size.
Note: The reason that the sizes are doubled while resizing the array is to make the inner hash-table operations have the intended asymptotic complexity. The prime numbers are used to support double hashing.
So I tried to remember my old CS classes from a decade ago with the help of my good friend Wikipedia:
Open Addressing
Separate Chaining
Double Hashing
But I still don't really see how this relates to double hashing (which is a collision resolution technique for open-addressed hash tables), except for the fact that the Resize() method doubles the number of entries and then picks the smallest prime number larger than that (based on the current/old size). And, to be honest, I don't really see the benefit of "doubling" the size, nor what "asymptotic complexity" is supposed to mean here (I guess the article meant the O(n) cost incurred when the underlying array of entries is full and has to be resized).
First, if you double the size, with or without using a prime, isn't it essentially the same?
Second, as far as I can tell, the .NET hash table uses a separate chaining technique when it comes to collision resolution.
I guess I must have missed a few things, and I would like someone to shed some light on these two points.

I got my answer on Reddit, so I am going to try to summarize it here:
Collision Resolution Technique
First off, it seems that collision resolution uses the separate chaining technique, not open addressing, and therefore there is no double hashing strategy:
The code goes as follows:
private struct Entry
{
    public int hashCode;  // Lower 31 bits of hash code, -1 if unused
    public int next;      // Index of next entry, -1 if last
    public TKey key;      // Key of entry
    public TValue value;  // Value of entry
}
It's just that instead of having dedicated storage (such as a list) for all the entries sharing the same hash code / bucket index, everything is stored in the same entries array.
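To make that concrete, here is a minimal, self-contained sketch (my own illustration, not the actual BCL source) of chaining inside a single entries array: buckets[b] holds the index of the first entry for bucket b, and each entry's next field points to the next entry in the same bucket, with -1 ending the chain. Resizing, the free list for removed entries, and custom comparers are all omitted.

using System.Collections.Generic;

class ChainedMap<TKey, TValue>
{
    private struct Entry
    {
        public int hashCode; // lower 31 bits of the hash code
        public int next;     // index of the next entry in the same bucket, -1 if last
        public TKey key;
        public TValue value;
    }

    private readonly int[] buckets;   // index of the first entry per bucket, -1 if empty
    private readonly Entry[] entries; // all entries, regardless of bucket, live here
    private int count;

    public ChainedMap(int capacity)
    {
        buckets = new int[capacity];
        for (int i = 0; i < buckets.Length; i++) buckets[i] = -1;
        entries = new Entry[capacity];
    }

    public void Add(TKey key, TValue value)
    {
        // No resizing in this sketch: assumes count never exceeds capacity.
        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        int bucket = hashCode % buckets.Length;
        entries[count] = new Entry
        {
            hashCode = hashCode,
            next = buckets[bucket], // prepend to the existing chain for this bucket
            key = key,
            value = value
        };
        buckets[bucket] = count++;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        // Walk the chain for this bucket; the "links" are just array indexes.
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
        {
            if (entries[i].hashCode == hashCode &&
                EqualityComparer<TKey>.Default.Equals(entries[i].key, key))
            {
                value = entries[i].value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}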
Prime Number
About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about common factors:
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this
be achieved? By choosing m to be a number that has very few factors: a
prime number.
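A tiny experiment (my own example, not taken from that answer) makes the effect visible: keys that are all multiples of 4 pile up on a quarter of the buckets when the table size shares that factor, but spread over every bucket when the size is prime.

using System;
using System.Linq;

class PrimeBucketsDemo
{
    static void Main()
    {
        // 100 keys that all share the factor 4.
        int[] keys = Enumerable.Range(0, 100).Select(i => i * 4).ToArray();

        foreach (int m in new[] { 12, 13 })
        {
            int used = keys.Select(k => k % m).Distinct().Count();
            Console.WriteLine($"m = {m}: keys land in {used} of {m} buckets");
        }
        // Prints: m = 12: keys land in 3 of 12 buckets
        //         m = 13: keys land in 13 of 13 buckets
    }
}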
Doubling the underlying entries array size
Doubling helps avoid calling too many resize operations (i.e. copies) by increasing the size of the array by a large enough number of slots each time.
See that answer: https://stackoverflow.com/a/2369504/4636721
Hash-tables could not claim "amortized constant time insertion" if,
for instance, the resizing was by a constant increment. In that case
the cost of resizing (which grows with the size of the hash-table)
would make the cost of one insertion linear in the total number of
elements to insert. Because resizing becomes more and more expensive
with the size of the table, it has to happen "less and less often" to
keep the amortized cost of insertion constant.
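A quick back-of-the-envelope sketch (my own, not from that answer) shows the difference for a million insertions: growing by a fixed chunk copies hundreds of millions of elements overall, while doubling copies only about as many elements as you insert.

using System;

class ResizeCostDemo
{
    static void Main()
    {
        const int n = 1_000_000;
        long fixedStep = 0, doubling = 0;

        // Grow by a fixed 1000 slots every time the array fills up:
        // each resize copies all existing elements.
        for (int size = 1000; size < n; size += 1000)
            fixedStep += size;

        // Double the capacity every time the array fills up.
        for (int size = 1; size < n; size *= 2)
            doubling += size;

        Console.WriteLine($"copies with +1000 steps: {fixedStep:N0}"); // ~499,500,000
        Console.WriteLine($"copies with doubling:    {doubling:N0}");  // 1,048,575
    }
}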

Related

Lookup time of Dictionary.ContainsKey() [duplicate]

This question already has an answer here:
What is performance of ContainsKey and TryGetValue?
I have read on Wikipedia that hash tables have O(1) search time on average.
So let's say I have a very large dictionary that contains maybe tens of millions of records.
If I use Dictionary.ContainsKey to look up a given key, will its lookup time really be constant, or will it be something like log n, or something else entirely, due to the internal implementation in .NET?
Big Oh notation doesn't tell you how long something takes. It tells you how it scales.
The easiest one to envision is searching for an item in a List<>; it has O(n) complexity. If it takes, on average, 2 milliseconds to find an item in a list with a million elements, then you can expect it to take 4 milliseconds if the list has two million elements. It scales linearly with the size of the list.
O(1) predicts constant time for finding an element in a dictionary. In other words, it doesn't depend on the size of the dictionary. If the dictionary is twice as big, it doesn't take twice as long to find the element; it takes (roughly) the same amount of time. The "roughly" means that it actually does take a bit longer; it is amortized O(1).
It would still be close to O(1), because it would still not depend on the number of entries, but on the number of collisions you have. Indexing an array is still O(1), no matter how many items you have.
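If you want to convince yourself, a quick-and-dirty timing sketch (not a rigorous benchmark; the absolute numbers will vary by machine) could look like this:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupScalingDemo
{
    static void Main()
    {
        foreach (int size in new[] { 1_000_000, 10_000_000 })
        {
            var dict = new Dictionary<int, int>();
            for (int i = 0; i < size; i++) dict[i] = i;

            var rng = new Random(42);
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000_000; i++)
                dict.ContainsKey(rng.Next(size)); // look up random existing keys
            sw.Stop();

            Console.WriteLine($"{size:N0} entries: {sw.ElapsedMilliseconds} ms for 1M lookups");
        }
    }
}

The expectation is that both lines report roughly the same time, even though the second dictionary is ten times larger.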
Also, there seems to be an upper limit on the size of a Dictionary caused by the implementation: How is the c#/.net 3.5 dictionary implemented?
Once we pass this size, the next step falls outside the internal array, and it will manually search for larger primes. This will be quite slow. You could initialize with 7199369 (the largest value in the array), or consider if having more than about 5 million entries in a Dictionary might mean that you should reconsider your design.
What is the key? If the key is Int32 then yes, it will be close to order 1.
You only get worse than order 1 if there are hash collisions.
Int32 as a key will have zero hash collisions, but that does not guarantee zero hash bucket collisions.
Be careful of keys that produce hash collisions.
KVP and Tuple can create a lot of hash collisions and are not good candidates for a key.

Compact data structure for storing a large set of integral values

I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:
Add a new value
Iterate over all of the values.
There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom Filter would be great, if only I could tolerate the false negatives.
I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.
EDIT: To answer your questions:
No, the items don't need to be sorted
By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
There could be a decent number of these sets in memory at once (~100).
Use the BitArray. It uses only some 6MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the entire range. Locality of reference is good though and you can allocate the entire structure in one operation.
As for wasting space: you waste 6MB in the worst case.
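A minimal sketch of that suggestion (the IdSet name is mine), assuming the ids really are confined to 0-49,999,999:

using System.Collections;
using System.Collections.Generic;

class IdSet
{
    // One bit per possible id: 50,000,000 bits is a bit over 6 MB.
    private readonly BitArray bits = new BitArray(50_000_000);

    public void Add(int id) => bits[id] = true;

    public IEnumerable<int> Values()
    {
        // Theta(N): walks the full range no matter how many ids are actually set.
        for (int i = 0; i < bits.Length; i++)
            if (bits[i]) yield return i;
    }
}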
EDIT: OK, you have lots of sets and you're serializing. For serializing to disk, I suggest 6MB files :)
For sending over the wire, just iterate and consider sending ranges instead of individual elements. That does require a sorting structure.
You need lots of these sets. Consider if you have 600MB to spare. Otherwise, check out:
Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
A custom hash table, perhaps Google sparsehash through C++/CLI
BSTs storing ranges/intervals
Supernode BSTs
It would depend on the distribution of the sizes of your sets. Unless you expect most of the sets to be (close to) the minimum you've specified, I'd probably use a bitset. To cover a range up to 50,000,000, a bitset ends up ~6 megabytes.
Compared to storing the numbers directly, this is marginally larger for the minimum size set you've specified (~6 megabytes instead of ~4), but considerably smaller for the maximum size set (1/32nd the size).
The second possibility would be to use delta encoding. For example, instead of storing each number directly, store the difference between that number and the previous number that was included. Given a maximum magnitude of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50. This means you can theoretically store the difference in <6 bits on average. I'd probably use the 7 least significant bits directly, and if you need to encode a larger gap, set the MSB and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. That can't happen very often, so in most cases you're using only one byte per number, for about 4:1 compression compared to storing the numbers directly. In the best case this would use ~1 megabyte for a set, and in the worst case about 50 megabytes, still 4:1 compression compared to storing the numbers directly.
If you don't mind a little bit of extra code, you could use an adaptive scheme -- delta encoding for small sets (up to 6,000,000 numbers), and a bitmap for larger sets.
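As a rough sketch of the delta idea, here is a simpler variant of the scheme described above that uses a standard variable-length byte encoding (one byte per small gap, high bit meaning "more bytes follow"); it assumes the ids are kept sorted:

using System.Collections.Generic;
using System.Linq;

static class DeltaCodec
{
    // Encodes the gaps between consecutive sorted ids as variable-length bytes.
    public static byte[] Encode(IEnumerable<int> sortedIds)
    {
        var bytes = new List<byte>();
        int previous = 0;
        foreach (int id in sortedIds)
        {
            uint gap = (uint)(id - previous);
            previous = id;
            while (gap >= 0x80)
            {
                bytes.Add((byte)(gap | 0x80)); // set the continuation bit
                gap >>= 7;
            }
            bytes.Add((byte)gap);
        }
        return bytes.ToArray();
    }

    // Reverses Encode by summing the gaps back into absolute ids.
    public static IEnumerable<int> Decode(byte[] data)
    {
        int previous = 0, pos = 0;
        while (pos < data.Length)
        {
            uint gap = 0;
            int shift = 0;
            byte b;
            do
            {
                b = data[pos++];
                gap |= (uint)(b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += (int)gap;
            yield return previous;
        }
    }
}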
I think the answer depends on what you mean by "passing around" and what you're trying to accomplish. You say you are only adding to the list: how often do you add? How fast will the list grow? What is an acceptable overhead for memory use, versus the time to reallocate memory?
In your worst case, 50,000,000 32-bit numbers = 200 megabytes using the most efficient possible data storage mechanism. Assuming you may end up with this much use in your worst case scenario, is it OK to use this much memory all the time? Is that better than having to reallocate memory frequently? What's the distribution of typical usage patterns? You could always just use an int[] that's pre-allocated to the whole 50 million.
As far as access speed for your operations, nothing is faster than iterating and adding to a pre-allocated chunk of memory.
From OP edit: There could be a decent number of these sets in memory at once (~100).
Hey now. You need to store 100 sets of 1 to 50 million numbers in memory at once? I think the bitset method is the only possible way this could work.
That would be 600 megabytes. Not insignificant, but unless they are (typically) mostly empty, it seems very unlikely that you would find a more efficient storage mechanism.
Now, if you don't use bitsets, but rather use dynamically sized constructs, and they could somehow take up less space to begin with, you're talking about a real ugly memory allocation/deallocation/garbage collection scenario.
Let's assume you really need to do this, though I can only imagine why. So your server's got a ton of memory, just allocate as many of these 6 megabyte bitsets as you need and recycle them. Allocation and garbage collection are no longer a problem. Yeah, you're using a ton of memory, but that seems inevitable.

Random.Next() - finding the Nth .Next()

Given a consistently seeded Random:
Random r = new Random(0);
Calling r.Next() consistently produces the same series; so is there a way to quickly discover the N-th value in that series, without calling r.Next() N times?
My scenario is a huge array of values created via r.Next(). The app occasionally reads a value from the array at arbitrary indexes. I'd like to optimize memory usage by eliminating the array and instead, generating the values on demand. But brute-forcing r.Next() 5 million times to simulate the 5 millionth index of the array is more expensive than storing the array. Is it possible to short-cut your way to the Nth .Next() value, without / with less looping?
I don't know the details of the PRNG used in the BCL, but my guess is that you will find it extremely difficult / impossible to find a nice, closed-form solution for N-th value of the series.
How about this workaround:
Make the desired index the seed to the random-number generator, and then pick the first generated number. This is equally 'deterministic', and gives you a wide range to play with in O(1) space.
static int GetRandomNumber(int index)
{
    return new Random(index).Next();
}
In theory, if you knew the exact algorithm and the initial state, you'd be able to duplicate the series, but the end result would just be identical to calling r.Next().
Depending on how 'good' you need your random numbers to be, you might consider creating your own PRNG based on a linear congruential generator, which is relatively easy/fast to generate numbers for. If you can live with a "bad" PRNG there are likely other algorithms that may be better for your purpose. Whether this would be faster/better than just storing a large array of numbers from r.Next() is another question.
No, I don't believe there is. For some RNG algorithms (such as linear congruential generators) it's possible in principle to get the n'th value without iterating through n steps, but the Random class doesn't provide a way of doing that.
I'm not sure whether the algorithm it uses makes it possible in principle -- it's a variant (details not disclosed in documentation) of Knuth's subtractive RNG, and it seems like the original Knuth RNG should be equivalent to some sort of polynomial-arithmetic thing that would allow access to the n'th value, but (1) I haven't actually checked that and (2) whatever tweaks Microsoft have made might break that.
If you have a good enough "scrambling" function f then you can use f(0), f(1), f(2), ... as your sequence of random numbers, instead of f(0), f(f(0)), f(f(f(0))), etc. (the latter being roughly what most RNGs do) and then of course it's trivial to start the sequence at any point you please. But you'll need to choose a good f, and it'll probably be slower than a standard RNG.
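For illustration, here is a sketch of that f(0), f(1), f(2), ... idea using an integer mixing function. The mixer is the 32-bit finalizer from MurmurHash3; the NthValue name and the seed handling are my own, and no statistical guarantees are implied beyond "looks scrambled".

static class CounterRandom
{
    // Derives the n-th value directly from n in O(1) by scrambling n,
    // instead of stepping an RNG n times.
    public static uint NthValue(uint n, uint seed)
    {
        uint h = n ^ seed;
        h ^= h >> 16;
        h *= 0x85EBCA6B;
        h ^= h >> 13;
        h *= 0xC2B2AE35;
        h ^= h >> 16;
        return h;
    }
}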
You could build your own on-demand dictionary of 'indexes' & 'random values'. This assumes that you will always 'demand' indexes in the same order each time the program runs or that you don't care if the results are the same each time the program runs.
Random rnd = new Random(0);
Dictionary<int, int> randomNumbers = new Dictionary<int, int>();

int GetRandomNumber(int index)
{
    // Generate and cache the value the first time this index is requested;
    // later calls with the same index return the cached number.
    if (!randomNumbers.ContainsKey(index))
        randomNumbers[index] = rnd.Next();
    return randomNumbers[index];
}

First n positions of true values from a bit pattern

I have a bit pattern of 100 bits. The program will change the bits in the pattern to true or false. At any given time I need to find the positions of the first "n" true values. For example, if the pattern is as follows
10011001000
The first 3 indexes where bits are true are 0, 3, 4
The first 4 indexes where bits are true are 0, 3, 4, 7
I can have a List, but the complexity of firstntrue(int) will be O(n). Is there any way to improve the performance?
I'm assuming the list isn't changing while you are searching, but that it changes up until you decide to search, and then you do your thing.
For each byte there are 2^8 = 256 combinations of 0 and 1. Here you have 100/8 = 13 bytes to examine.
So you can build a lookup table of 256 entries. The key is the current real value of the byte you're examining in the bit stream, and the value is the data you seek (a tuple containing the positions of the 1 bits). So, if you gave it 5 it would return {0,2}. The cost of this lookup is constant and the memory usage is very small.
Now as you go through the bit stream you can process the data a byte at a time (instead of a bit at a time) and just keep track of the current byte number (starting at 0, of course) and add 8*current-byte-number to the values in the tuple returned. So now you've essentially reduced the problem to O(n/8) by using the precomputed lookup table.
You can build a larger look-up table to get more speed but that will use more memory.
Though I can't imagine that an O(n) algorithm where n=100 is really the source of some performance issue for you. Unless you're calling it a lot inside some inner loop?
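For reference, a rough sketch of that byte-at-a-time approach (the FirstNTrue name and the byte-array representation, with pattern index 0 stored as the least significant bit of pattern[0], are my assumptions):

using System.Collections.Generic;
using System.Linq;

static class FirstNTrue
{
    // table[b] = the bit positions (0..7) that are set in the byte value b.
    private static readonly int[][] table =
        Enumerable.Range(0, 256)
                  .Select(b => Enumerable.Range(0, 8)
                                         .Where(i => (b & (1 << i)) != 0)
                                         .ToArray())
                  .ToArray();

    public static List<int> FirstN(byte[] pattern, int n)
    {
        var result = new List<int>(n);
        for (int byteIndex = 0; byteIndex < pattern.Length && result.Count < n; byteIndex++)
        {
            // One table lookup per byte instead of eight individual bit tests.
            foreach (int bit in table[pattern[byteIndex]])
            {
                result.Add(byteIndex * 8 + bit);
                if (result.Count == n) break;
            }
        }
        return result;
    }
}

For the example pattern 10011001000, stored as new byte[] { 0x99, 0x00 }, FirstN(pattern, 3) returns 0, 3, 4.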
No, there is no way to improve on O(n). That can be proven mathematically.
No.
Well, not unless you intercept the changes as they occur, and maintain a "first 100" list.
The complexity cannot be reduced without additional data structures, because in the worst case you need to scan the whole list.
For "n" Items you have to check at most "n" times I.E O(n)!
How can you expect to reduce that without any interception and any knowledge of how they've changed?!
No, you can not improve the complexity if you just have a plain array.
If you have few 1s relative to 0s, you can improve the performance by a constant factor, but it will still be O(n).
If you can treat your bit array as a byte array (or even an Int32 array), you can check whether each byte is greater than 0 before checking each individual bit.
If fewer than 1 bit in 8 is set, you could instead implement it as a sparse structure, e.g. a List<byte> that stores the indexes of all the 1 bits.
As others have said, to find the n lowest set bits in the absence of further structures is an O(n) operation.
If you're looking to improve performance, have you looked at the implementation side of the problem?
Off the top of my head, q & ~(q-1) will leave only the lowest bit set in the number q, since subtracting 1 from any binary number fills in 1s to the right up to the first digit that was set, changes that digit into a 0 and leaves the rest alone. In a number with 1 bit set, shifting to the right and testing against zero gives a simple test to distinguish whether a potential answer is less than the real answer or is greater than or equal. So you can binary search from there.
To find the next one, remove the lowest digit and use a smaller initial binary search window. There are probably better solutions, but this should be better than going through every bit from least to most and checking if it's set.
That is implementation-level stuff that doesn't affect the complexity, but it may help with performance.
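For what it's worth, a sketch of that lowest-bit trick on a single 64-bit word, using BitOperations.TrailingZeroCount (System.Numerics, .NET Core 3.0+) in place of the manual binary search; a 100-bit pattern would need two words plus an offset:

using System.Collections.Generic;
using System.Numerics;

static class LowestBits
{
    public static IEnumerable<int> SetBitPositions(ulong q)
    {
        while (q != 0)
        {
            ulong lowest = q & ~(q - 1);          // keep only the lowest set bit
            yield return BitOperations.TrailingZeroCount(lowest);
            q &= q - 1;                            // clear that bit and continue
        }
    }
}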

C# random number generator

I'm looking for a random number that always generates the same "random" number for a given seed. The seed is defined by x + (y << 16), where x and y are positions on a heightmap.
I could create a new instance of System.Random every time with my seed, but that's a lot of GC pressure. Especially since this will be called a lot of times.
EDIT:
"A lot" means half a million times.
Thanks to everyone that answered! I know I was unclear, but I learned here that a hash function is exactly what I want.
Since a hash function is apparently closer to what you want, consider a variation of the following:
int Hash(int n) {
    const int prime = 1031;
    return ((n & 0xFFFF) * prime % 0xFFFF) ^ (n >> 16);
}
This XORs the least significant two bytes with the most significant two bytes of a four-byte number, after shuffling the least significant two bytes around a little by multiplying with a prime number. The result is thus in the range [0, 0x10000), i.e. it fits in 16 bits.
This should “shuffle” the input number a bit, reliably produces the same value for the same input and looks “random”. Now, I haven’t done a stochastic analysis of the distribution and if ever a statistician was to look at it, he would probably go straight into anaphylactic shock. (In fact, I have really written this implementation off the top of my head.)
If you require something less half-baked, consider using an established checksum (such as CRC32).
I could create a new instance of System.Random every time with my seed
Do that.
but that's a lot of GC pressure. Especially since this will be called a lot of times.
How many times do you call it? Does it verifiably perform badly? Notice, the GC is optimized to deal with lots of small objects with short life time. It should deal with this easily.
And, what would be the alternative that takes a seed but doesn’t create a new instance of some object? That sounds rather like a badly designed class, in fact.
See Simple Random Number Generation for C# source code. The state is just two unsigned integers, so it's easy to keep up with between calls. And the generator passes standard tests for quality.
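The linked article is built around George Marsaglia's multiply-with-carry generator; a sketch of that general scheme (my own rendering, not the article's exact code) looks roughly like this:

class SimpleRng
{
    // The entire state is these two uints, so it is cheap to seed and keep around.
    private uint z, w;

    public SimpleRng(uint seed1, uint seed2)
    {
        z = seed1 != 0 ? seed1 : 362436069;
        w = seed2 != 0 ? seed2 : 521288629;
    }

    public uint NextUint()
    {
        z = 36969 * (z & 65535) + (z >> 16);
        w = 18000 * (w & 65535) + (w >> 16);
        return (z << 16) + w;
    }
}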
What about storing a Dictionary<int, int> that provides the first value returned by a new Random object for a given seed?
class RandomSource
{
    Dictionary<int, int> _dictionary = new Dictionary<int, int>();

    public int GetValue(int seed)
    {
        int value;
        if (!_dictionary.TryGetValue(seed, out value))
        {
            value = _dictionary[seed] = new Random(seed).Next();
        }
        return value;
    }
}
This incurs the GC pressure of constructing a new Random instance the first time you want a value for a particular seed, but every subsequent call with the same seed will retrieve a cached value instead.
I don't think a "random number generator" is actually what you're looking for. Simply create another map and pre-populate it with random values. If your current heightmap is W x H, the simplest solution would be to create a W x H 2D array and just fill each element with a random value using System.Random. You can then look up the pre-populated random value for a particular (x, y) coordinate whenever you need it.
Alternatively, if your current heightmap actually stores some kind of data structure, you could modify that to store the random value in addition to the height value.
A side benefit that this has is that later, if you need to, you can perform operations over the entire "random" map to ensure that it has certain properties. For example, depending on the context (is this for a game?) you may find later that you want to smooth the randomness out across the map. This is trivial if you precompute and store the values as I've described.
CSharpCity provides source to several random number generators. You'll have to experiment to see whether these have less impact on performance than System.Random.
ExtremeOptimization offers a library with several generators. They also discuss quality and speed of the generators and compare against System.Random.
Finally, what do you mean by GC pressure? Do you really mean memory pressure, which is the only context I've seen it used in? The job of the GC is to handle the creation and destruction of gobs of objects very efficiently. I'm concerned that you're falling for the premature optimization temptation. Perhaps you can create a test app that gives some cold, hard numbers.
