This question already has an answer here: What is performance of ContainsKey and TryGetValue? (1 answer). Closed 9 years ago.
I have read on Wikipedia that hash tables have O(1) search time on average.
So let's say I have a very large dictionary containing maybe tens of millions of records.
If I use Dictionary.ContainsKey to extract the value for a given key, will the lookup really be O(1), or will it be more like O(log n) or something else, due to some different internal implementation by .NET?
Big Oh notation doesn't tell you how long something takes. It tells you how it scales.
Easiest one to envision is searching for an item in a List<>, it has O(n) complexity. If it takes, on average, 2 milliseconds to find an item in a list with a million elements then you can expect it to take 4 milliseconds if the list has two million elements. It scales linearly with the size of the list.
O(1) predicts constant time for finding an element in a dictionary. In other words, it doesn't depend on the size of the dictionary. If the dictionary is twice as big, it doesn't take twice as long to find the element; it takes roughly the same amount of time. The "roughly" means that it actually does take a bit longer: the lookup is amortized O(1).
It would still be close to O(1), because the cost does not depend on the number of entries but on the number of collisions you have. Indexing an array is O(1) no matter how many items it holds.
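The point that lookup cost tracks collisions rather than entry count can be sketched with a tiny separate-chaining table that counts how many entries a lookup actually inspects. This is a Python sketch for brevity; ChainedHashTable and the probe counter are invented for illustration, not .NET's implementation:

```python
# Toy separate-chaining hash table that counts probes per lookup.
# Illustration only; not how .NET's Dictionary is laid out internally.

class ChainedHashTable:
    def __init__(self, bucket_count):
        self.buckets = [[] for _ in range(bucket_count)]

    def add(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        """Return (value, probes): probes = entries inspected in one chain."""
        chain = self.buckets[hash(key) % len(self.buckets)]
        for probes, (k, v) in enumerate(chain, start=1):
            if k == key:
                return v, probes
        return None, len(chain)

table = ChainedHashTable(bucket_count=1024)
for i in range(10_000):
    table.add(i, i * 2)

# Cost is the chain length (about 10 here), not the 10,000 total entries.
value, probes = table.lookup(1234)
```

Even with ten thousand entries, a lookup only walks the handful of entries that share one bucket.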
Also, there seems to be an upper limit on the size of a Dictionary caused by the implementation: How is the c#/.net 3.5 dictionary implemented?
Once we pass this size, the next resize falls outside the internal array of precomputed primes, and the implementation will manually search for larger primes. This will be quite slow. You could initialize with 7199369 (the largest value in the array), or consider whether having more than about 5 million entries in a Dictionary means you should reconsider your design.
What is the key? If the key is Int32 then yes, it will be close to O(1).
You only get worse than O(1) if there are hash collisions.
Int32 as a key has zero hash collisions, but that does not guarantee zero hash bucket collisions.
Be careful with keys that produce hash collisions.
KeyValuePair and Tuple can create a lot of hash collisions and are not good candidates for keys.
The other day I was reading this article on CodeProject, and I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation here, without all the optimizations in .NET Core):
Note: If you add more items than the maximum number in the table
(i.e. 7199369), the resize method will manually search for the next prime
number that is larger than twice the old size.
Note: The reason that the sizes are doubled while resizing the
array is to make the inner hash-table operations have amortized asymptotic
complexity. The prime numbers are used to support
double-hashing.
So I tried to remember my old CS classes from a decade ago with my good friend Wikipedia:
Open Addressing
Separate Chaining
Double Hashing
But I still don't really see how this relates to double hashing (which is a collision resolution technique for open-addressed hash tables), except that the Resize() method doubles the number of entries based on the next prime number (taken from the current/old size). And to be honest, I don't really see the benefit of "doubling" the size for "asymptotic complexity" (I guess the article meant the O(n) cost incurred when the underlying array (entries) is full and subject to resize).
First, if you double the size with or without using a prime, isn't it really the same?
Second, it seems to me that the .NET hash table uses a separate chaining technique for collision resolution.
I guess I must have missed a few things, and I would like someone to shed some light on those two points.
I got my answer on Reddit, so I am going to summarize it here:
Collision Resolution Technique
First off, it turns out that collision resolution uses the separate chaining technique, not open addressing, and therefore there is no double hashing strategy:
The code goes as follows:
private struct Entry
{
    public int hashCode;    // Lower 31 bits of hash code, -1 if unused
    public int next;        // Index of next entry, -1 if last
    public TKey key;        // Key of entry
    public TValue value;    // Value of entry
}
It's just that instead of having one dedicated storage (a list or whatnot) per bucket for all the entries sharing the same hash code / index, everything is stored in the same entries array.
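That single-array layout can be mimicked in a few lines. Here is a hedged Python sketch (FlatDict is an invented name, not .NET code) where buckets hold an index into one shared entries list and each entry's next field links the chain, mirroring the Entry struct:

```python
# Sketch of the .NET-style layout: all entries live in one flat list;
# buckets[b] holds the index of the chain's head entry, and each entry's
# `next` field is the index of the next entry in the chain (-1 = end).

class FlatDict:
    def __init__(self, bucket_count=7):
        self.buckets = [-1] * bucket_count   # -1 means "empty bucket"
        self.entries = []                    # tuples: (hashCode, next, key, value)

    def add(self, key, value):
        h = hash(key) & 0x7FFFFFFF           # keep lower 31 bits, like hashCode above
        b = h % len(self.buckets)
        # Prepend: the new entry's `next` points at the old chain head.
        self.entries.append((h, self.buckets[b], key, value))
        self.buckets[b] = len(self.entries) - 1

    def get(self, key):
        h = hash(key) & 0x7FFFFFFF
        i = self.buckets[h % len(self.buckets)]
        while i != -1:                       # walk the chain via `next` indexes
            eh, nxt, k, v = self.entries[i]
            if eh == h and k == key:
                return v
            i = nxt
        raise KeyError(key)

d = FlatDict()
for n in range(20):                          # more items than buckets forces chains
    d.add(n, n * n)
```

With 20 keys in 7 buckets, several chains form, but each lives entirely inside the one entries list.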
Prime Number
About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about common factors:
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this
be achieved? By choosing m to be a number that has very few factors: a
prime number.
Doubling the underlying entries array size
Doubling helps avoid too many resize operations (i.e. copies) by growing the array by a sufficiently large number of slots each time.
See that answer: https://stackoverflow.com/a/2369504/4636721
Hash-tables could not claim "amortized constant time insertion" if,
for instance, the resizing was by a constant increment. In that case
the cost of resizing (which grows with the size of the hash-table)
would make the cost of one insertion linear in the total number of
elements to insert. Because resizing becomes more and more expensive
with the size of the table, it has to happen "less and less often" to
keep the amortized cost of insertion constant.
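The quoted argument can be checked numerically: count how many element copies resizing performs under doubling versus a fixed increment. A sketch, with arbitrary starting capacity and step values:

```python
# Count element copies caused by resizing while inserting n items,
# for two growth policies. Starting capacity of 4 is arbitrary.

def copies_when_growing(n, grow):
    capacity, size, copies = 4, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size               # a resize copies every existing element
            capacity = grow(capacity)
        size += 1
    return copies

n = 100_000
doubling = copies_when_growing(n, lambda c: c * 2)      # stays below 2n: O(1) amortized
fixed_step = copies_when_growing(n, lambda c: c + 100)  # grows like n^2: O(n) amortized
```

Doubling keeps total copy work proportional to n, while a constant increment makes it quadratic, which is exactly why resizes must happen "less and less often".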
Wikipedia says (as I understand it) that every time I add an item to the dictionary, the system calculates a hash code (by calling GetHashCode). The system then uses the hash code to find the bucket where my value will be saved.
Please explain the logic of how a Dictionary maps a hash code to an index in the bucket array where my value will be stored.
Imagine a situation where I create a Dictionary and add an object whose GetHashCode returns the value 1000000.
Does that mean the Dictionary will create an array with 1000000 elements and store my object at index 999999?
If that assumption is correct, what's the point of having an array of such a big size to store only one value?
Your assumption isn't correct, luckily. If it were, they wouldn't actually be buckets, just an index-accessible array of objects. That might be fine for O(1) lookup if your hashcodes are guaranteed to be unique, but that's not the case - in fact, hashcodes are guaranteed not to be unique. You can't map every possible value of an Int64 into a unique Int32 hashcode. That's not what hashcodes are for.
Instead, the dictionary initializes a smaller array of buckets, and then uses a simple Modulo operation to find the bucket. (From the .NET Reference Source)
int targetBucket = hashCode % buckets.Length;
That means that if there are 10 buckets, for instance, it will get the remainder of dividing the hashcode by 10. If your hash algorithm does its job well, the hashcodes follow a roughly uniform distribution, meaning that any n items, for a big enough n, will probably be divided evenly between the buckets.
As for how many buckets are initialized, the number will be the first prime number that's higher than the capacity passed in the ctor, or 0 (See here). If this causes too many hash collisions, it will be automatically expanded, jumping to the next prime number each time, until stable.
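The modulo reduction is the whole trick: a huge hash code still lands inside a small bucket array, and a decent hash spreads keys evenly across it. A quick Python sketch of both points (the bucket count 11 is just an example prime, not a fixed .NET value):

```python
from collections import Counter

# A hash code far larger than the bucket array still maps into it.
bucket_count = 11                          # small prime, like Dictionary's early sizes
hash_code = 1_000_000
target_bucket = hash_code % bucket_count   # a small index, not a million-slot array

# With a well-distributed hash, the buckets fill almost perfectly evenly.
counts = Counter(hash(x) % bucket_count for x in range(10_000))
spread = max(counts.values()) - min(counts.values())
```

Ten thousand keys land in 11 buckets with at most one item of difference between the fullest and emptiest bucket.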
I have some code that I added a nested dictionary to, of the following format
Dictionary<string, Dictionary<string, Dictionary<string, float>>>
After doing so I noticed the memory usage of my application shot up SIGNIFICANTLY. These dictionaries are keyed on strings that are often repeated, and there are many of these dictionaries, on the order of tens of thousands.
In order to address this problem I hypothesized that the repeated strings were eating up a significant amount of memory. My solution was to hash the strings and use an integer instead (I would keep one reverse-lookup table so I could invert the hash when necessary):
Dictionary<int, Dictionary<int, Dictionary<int, float>>>
So I went to a memory profiler to see what kind of size reduction I could get. To my shock, I found that the string version was actually smaller (both normal and inclusive).
This doesn't make intuitive sense to me. Even if the compiler were smart enough to store only one copy of the string and use references, I would think that a reference would be a pointer, which is double the size of an int. I also didn't use any String.Intern calls, so I don't know how this would have been accomplished (also, is String.Intern the right method here?).
I'm very confused as to what's happening under the hood, any help would be appreciated
If your keys and values are objects, there's approximately 20 bytes of overhead for each element of a dictionary, plus several more bytes per dictionary. This is in addition to the space consumed by the keys and values themselves. If you have value types as keys and values, then it's 12 bytes plus the space consumed by the key and value for each item in the dictionary. This is if the number of elements equals the internal dictionary capacity. But typically there is more capacity than elements, so there is wasted space.
The wasted space will generally be a higher relative percentage if you have lots of dictionaries with a small number of elements than if you had one dictionary with many elements. If I go by your comment, your dictionaries with 8 elements will have a capacity of 11, those with 2 elements will have a capacity of 3, and those with 10 will have a capacity of 11.
If I understand your nesting counts, then a single top-level dictionary will represent 184 dictionary elements. But if we count unused capacity, it's closer to 200 as far as space consumption. 200 * 20 = 4000 bytes for each top-level dictionary. How many of those do you have? You say tens of thousands of them across thousands of objects. Every 10,000 of them is going to consume about 38 MB of dictionary overhead. Add to that the objects stored in the dictionaries.
A possible explanation of why your attempt to make it smaller by managing the hash codes yourself didn't help is that there were not a lot of duplicated references to your keys in the first place. Replacing an object reference key with an int key doesn't change the per-element dictionary overhead, and you're adding the storage of your new collection of hash codes.
I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:
Add a new value
Iterate over all of the values.
There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom filter would be great, if only I could tolerate the false positives.
I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.
EDIT: To answer your questions:
No, the items don't need to be sorted
By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
There could be a decent number of these sets in memory at once (~100).
Use the BitArray. It uses only some 6MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the entire range. Locality of reference is good though and you can allocate the entire structure in one operation.
As for wasting space: you waste 6MB in the worst case.
EDIT: OK, you have lots of sets and you're serializing. For serializing to disk, I suggest 6 MB files :)
For sending over the wire, just iterate and consider sending ranges instead of individual elements. That does require a sorting structure.
You need lots of these sets. Consider if you have 600MB to spare. Otherwise, check out:
Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
A custom hash table, perhaps Google sparsehash through C++/CLI
BSTs storing ranges/intervals
Supernode BSTs
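For reference, the bitset approach suggested above is easy to hand-roll. A Python sketch (the BitSet class is illustrative, not System.Collections.BitArray):

```python
# One bit per possible id in 0..50,000,000: a fixed ~6.25 MB of storage
# no matter how many ids the set actually contains.

class BitSet:
    def __init__(self, max_value):
        self.bits = bytearray(max_value // 8 + 1)

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

    def __iter__(self):                   # Theta(range) walk, as noted above
        for i, byte in enumerate(self.bits):
            if byte:                       # skip all-zero bytes cheaply
                for bit in range(8):
                    if byte & (1 << bit):
                        yield (i << 3) | bit

s = BitSet(50_000_000)
for n in (0, 42, 49_999_999):
    s.add(n)
```

Adds and membership tests are O(1); iteration walks the whole range, but the byte-level skip keeps sparse sets cheap in practice.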
It would depend on the distribution of the sizes of your sets. Unless you expect most of the sets to be (close to) the minimum you've specified, I'd probably use a bitset. To cover a range up to 50,000,000, a bitset ends up ~6 megabytes.
Compared to storing the numbers directly, this is marginally larger for the minimum size set you've specified (~6 megabytes instead of ~4), but considerably smaller for the maximum size set (1/32nd the size).
The second possibility would be to use a delta encoding. For example, instead of storing each number directly, store the difference between that number and the previous number that was included. Given a maximum magnitude of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50. This means you can theoretically store the difference in <6 bits on average. I'd probably use the 7 least significant bits directly, and if you need to encode a larger gap, set the msb and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. That can't happen very often, so in most cases you're using only one byte per number, for about 4:1 compression compared to storing numbers directly. In the best case this would use ~1 megabyte for a set, and in the worst about 50 megabytes -- 4:1 compression compared to storing numbers directly.
If you don't mind a little bit of extra code, you could use an adaptive scheme -- delta encoding for small sets (up to 6,000,000 numbers), and a bitmap for larger sets.
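One concrete way to realize the delta idea is a standard varint encoding of the gaps: 7 payload bits per byte, msb set on continuation bytes. This is a sketch under that assumption; the exact escape scheme described above differs slightly:

```python
# Delta + varint coding for a sorted id set: each gap is stored in 7-bit
# groups, least significant first, msb set on all but the final byte.

def encode(sorted_ids):
    out, prev = bytearray(), 0
    for n in sorted_ids:
        gap, prev = n - prev, n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)   # continuation byte
            gap >>= 7
        out.append(gap)                        # final byte, msb clear
    return bytes(out)

def decode(data):
    ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                         # more bytes of this gap follow
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids

# Gaps under 128 cost one byte; the big jump to 50,000,000 costs four.
ids = [5, 17, 130, 50_000_000]
blob = encode(ids)
```

With the expected average gap of ~50, most ids cost a single byte, which is where the roughly 4:1 saving over raw 4-byte ints comes from.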
I think the answer depends on what you mean by "passing around" and what you're trying to accomplish. You say you are only adding to the list: how often do you add? How fast will the list grow? What is an acceptable overhead for memory use, versus the time to reallocate memory?
In your worst case, 50,000,000 32-bit numbers = 200 megabytes using the most efficient possible data storage mechanism. Assuming you may end up with this much use in your worst case scenario, is it OK to use this much memory all the time? Is that better than having to reallocate memory frequently? What's the distribution of typical usage patterns? You could always just use an int[] that's pre-allocated to the whole 50 million.
As far as access speed for your operations, nothing is faster than iterating and adding to a pre-allocated chunk of memory.
From OP edit: There could be a decent number of these sets in memory at once (~100).
Hey now. You need to store 100 sets of 1 to 50 million numbers in memory at once? I think the bitset method is the only possible way this could work.
That would be 600 megabytes. Not insignificant, but unless they are (typically) mostly empty, it seems very unlikely that you would find a more efficient storage mechanism.
Now, if you don't use bitsets, but rather use dynamically sized constructs, and they could somehow take up less space to begin with, you're talking about a real ugly memory allocation/deallocation/garbage collection scenario.
Let's assume you really need to do this, though I can only imagine why. So your server's got a ton of memory, just allocate as many of these 6 megabyte bitsets as you need and recycle them. Allocation and garbage collection are no longer a problem. Yeah, you're using a ton of memory, but that seems inevitable.
I have a bit pattern of 100 bits. The program will change the bits in the pattern to true or false. At any given time I need to find the positions of the first "n" true values. For example, if the pattern is as follows
10011001000
The first 3 indexes where bits are true are 0, 3, 4
The first 4 indexes where bits are true are 0, 3, 4, 7
I can have a List, but the complexity of firstntrue(int) will be O(n). Is there any way to improve the performance?
I'm assuming the list isn't changing while you are searching, but that it changes up until you decide to search, and then you do your thing.
For each byte there are 2^8 = 256 combinations of 0 and 1. Here you have 100/8 = 12.5, so 13 bytes to examine.
So you can build a lookup table of 256 entries. The key is the current real value of the byte you're examining in the bit stream, and the value is the data you seek (a tuple containing the positions of the 1 bits). So, if you gave it 5 it would return {0,2}. The cost of this lookup is constant and the memory usage is very small.
Now as you go through the bit stream you can process the data a byte at a time (instead of a bit at a time) and just keep track of the current byte number (starting at 0, of course) and add 8*current-byte-number to the values in the tuple returned. So you've essentially cut the work down by a factor of 8 (still O(n), but with a much smaller constant) by using the precomputed lookup table.
You can build a larger look-up table to get more speed but that will use more memory.
Though I can't imagine that an O(n) algorithm where n=100 is really the source of some performance issue for you. Unless you're calling it a lot inside some inner loop?
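The lookup-table approach can be sketched as follows (Python for brevity; BIT_POSITIONS and first_n_true are invented names, and bit 0 of each byte is taken as the lowest index):

```python
# For every byte value 0..255, precompute the positions of its set bits,
# then scan the bit stream a byte at a time instead of a bit at a time.

BIT_POSITIONS = [[b for b in range(8) if (value >> b) & 1]
                 for value in range(256)]

def first_n_true(bit_bytes, n):
    """Indexes of the first n set bits; bit 0 of byte 0 is index 0."""
    result = []
    for byte_number, byte in enumerate(bit_bytes):
        for pos in BIT_POSITIONS[byte]:
            result.append(8 * byte_number + pos)
            if len(result) == n:
                return result
    return result

# The pattern 10011001000 from the question, packed least-significant-bit
# first: bits set at indexes 0, 3, 4 and 7.
pattern = bytes([0b10011001, 0b00000000])
```

As in the answer's example, BIT_POSITIONS[5] is [0, 2], and all-zero bytes contribute nothing to the scan.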
No, there is no way to improve on O(n); that can be proven mathematically.
No.
Well, not unless you intercept the changes as they occur, and maintain a "first 100" list.
The complexity cannot be reduced without additional data structures, because in the worst case you need to scan the whole list.
For "n" items you have to check at most "n" bits, i.e. O(n)!
How can you expect to reduce that without intercepting the changes or any knowledge of how they've changed?
No, you cannot improve the complexity if you just have a plain array.
If you have few 1s relative to 0s you can improve the performance by a constant factor, but it will still be O(n).
If you can treat your bit array as a byte array (or even an Int32 array) you can check whether each byte > 0 before checking each individual bit.
If fewer than 1 in 8 bits are set, you could implement it as a sparse array instead: a List<byte> storing the indexes of all the 1s.
As others have said, to find the n lowest set bits in the absence of further structures is an O(n) operation.
If you're looking to improve performance, have you looked at the implementation side of the problem?
Off the top of my head, q & ~(q-1) will leave only the lowest bit set in the number q, since subtracting 1 from any binary number fills in 1s to the right up to the first digit that was set, changes that digit into a 0 and leaves the rest alone. In a number with 1 bit set, shifting to the right and testing against zero gives a simple test to distinguish whether a potential answer is less than the real answer or is greater than or equal. So you can binary search from there.
To find the next one, remove the lowest digit and use a smaller initial binary search window. There are probably better solutions, but this should be better than going through every bit from least to most and checking if it's set.
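That bit-twiddling idea looks like this in practice. A Python sketch: arbitrary-precision ints make `q & -q` equivalent to `q & ~(q-1)`, and `int.bit_length` is used here in place of the binary search described above:

```python
# q & -q (same as q & ~(q - 1) in two's complement) isolates the lowest
# set bit; q &= q - 1 then clears it, stepping to the next set bit.

def first_n_true(q, n):
    """Indexes of the first n set bits of integer q (bit 0 = index 0)."""
    result = []
    while q and len(result) < n:
        lowest = q & -q                    # a power of two: only the lowest set bit
        result.append(lowest.bit_length() - 1)
        q &= q - 1                         # clear that bit and continue
    return result

# Bits 0, 3, 4 and 7 set, matching the question's 10011001000 pattern.
q = 0b10011001
```

Each step costs only a couple of word-sized operations per set bit found, instead of one test per bit position.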
That's an implementation detail that doesn't affect the complexity, but it may help with performance.