I am trying to get a better understanding of how the internals of hashed sets, e.g. HashSet<T>, work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.
As far as I understand the article (and I also thought that way before), the bucket list groups a certain number of elements in each bucket. A bucket is identified by the hashcode, namely by GetHashCode called on the element. I thought the better performance is based on the fact that there are fewer buckets than elements.
Now I have written the following naive test code:
public class CustomHashCode
{
public int Id { get; set; }
public override int GetHashCode()
{
//return Id.GetHashCode(); // Way better performance
return Id % 40; // Bad performance! But why?
}
public override bool Equals(object obj)
{
return ((CustomHashCode) obj).Id == Id;
}
}
And here is the profiling code:
public static void TestCustomHashCode(int iterations)
{
    var hashSet = new HashSet<CustomHashCode>();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Add(new CustomHashCode() { Id = j });
    }
    var chc = hashSet.First();
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Contains(chc);
    }
    stopwatch.Stop();
    Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
}
My naive thought was: let's reduce the number of buckets (with a simple modulo); that should increase performance. But it is terrible: on my system it takes about 4 seconds with 50000 iterations. I also thought that if I simply returned the Id as the hashcode, performance should be poor since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists work?
A Contains check basically:
Gets the hashcode of the item.
Finds the corresponding bucket - this is a direct array lookup based on the hashcode of the item.
If the bucket exists, tries to find the item in the bucket - this iterates over all the items in the bucket.
By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hashset must iterate through, checking for equality, in order to see if an item exists or not. Thus it takes longer to see if a given item exists.
You've probably decreased the memory footprint of the hashset; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.
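To make that concrete with the question's numbers (a rough illustration of the collision counts only, not of HashSet<T>'s actual internals):

using System;
using System.Linq;

class BucketDemo
{
    static void Main()
    {
        const int n = 50000;

        // hash = Id % 40: at most 40 distinct hash codes, so the ids pile up in ~40 buckets
        int largestGroupModulo = Enumerable.Range(0, n).GroupBy(id => id % 40).Max(g => g.Count());

        // hash = Id: 50000 distinct hash codes, so every id sits alone
        int largestGroupIdentity = Enumerable.Range(0, n).GroupBy(id => id).Max(g => g.Count());

        Console.WriteLine("{0} vs {1}", largestGroupModulo, largestGroupIdentity); // 1250 vs 1
    }
}

With Id % 40, a Contains call may have to run Equals against up to 1250 candidates; with the identity hash it compares against one.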
Reducing the number of buckets will not increase the performance. Actually, the GetHashCode method of Int32 returns the integer value itself, which is ideal for performance because it produces as many distinct hash codes as possible.
The thing that gives a hash table its performance is the conversion from the key to the hash code, which means that it can quickly eliminate most of the items in the collection. The only items it has to consider are the ones in the same bucket. If you have few buckets, it can eliminate far fewer items.
The worst possible implementation of GetHashCode will cause all items to go in the same bucket:
public override int GetHashCode() {
return 0;
}
This is still a valid implementation, but it means that the hash table gets the same performance as a regular list, i.e. it has to loop through all items in the collection to find a match.
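A small sketch of that degenerate case (the ZeroHash type here is hypothetical, not from the question): every instance lands in the same bucket, so each lookup has to call Equals against the items already stored there.

using System;
using System.Collections.Generic;

class ZeroHash
{
    public int Id { get; set; }
    public override int GetHashCode() { return 0; }                      // everything collides
    public override bool Equals(object obj) { return ((ZeroHash)obj).Id == Id; }
}

class Program
{
    static void Main()
    {
        var set = new HashSet<ZeroHash>();
        for (int i = 0; i < 20000; i++)
            set.Add(new ZeroHash { Id = i });                            // each Add already scans the single bucket

        // Worst case: Equals runs against every stored item before a match (or a miss) is decided.
        Console.WriteLine(set.Contains(new ZeroHash { Id = 19999 }));
    }
}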
A simple HashSet<T> could be implemented like this (just a sketch, it doesn't compile):
class HashSet<T>
{
    struct Element
    {
        int Hash;
        int Next;
        T Item;
    }

    int[] buckets = new int[Capacity];
    Element[] data = new Element[Capacity];

    bool Contains(T item)
    {
        int hash = item.GetHashCode();
        // Bucket lookup is a simple array lookup => cheap
        int index = buckets[(uint)hash % Capacity];
        // The search for the actual item is linear in the number of items in the bucket
        while (index >= 0)
        {
            if ((data[index].Hash == hash) && Equals(data[index].Item, item))
                return true;
            index = data[index].Next;
        }
        return false;
    }
}
If you look at this, the cost of searching in Contains is proportional to the number of items in the bucket. So having more buckets makes the search cheaper, but once the number of buckets exceeds the number of items, the gain of additional buckets quickly diminishes.
Having diverse hashcodes also serves as early out for comparing objects within a bucket, avoiding potentially costly Equals calls.
In short, GetHashCode should be as diverse as possible. It's the job of HashSet<T> to reduce that large space to an appropriate number of buckets, which is approximately the number of items in the collection (typically within a factor of two).
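For a type with more than one significant field, a common way to get that diversity (one pattern of many; the linked article discusses the guidelines in detail) is to combine the hash codes of the immutable fields:

public class Point
{
    public int X { get; private set; }
    public int Y { get; private set; }

    public Point(int x, int y) { X = x; Y = y; }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + X;   // an int hashes to itself, so the values are used directly
            hash = hash * 31 + Y;
            return hash;
        }
    }

    public override bool Equals(object obj)
    {
        var other = obj as Point;
        return other != null && other.X == X && other.Y == Y;
    }
}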
I was given an algorithm problem that requires me to implement quick sort for both a linked list and an array.
I have done both parts and the algorithms are working, but it seems there is some bug in my quick sort linked list implementation.
Here is my quick sort linked list implementation:
public static void SortLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
if( low != null && low !=high)
{
DataList.Node p = _PartitionLinkedList(items, low, high);
SortLinkedList(items, low, p);
SortLinkedList(items, p.Next(), null);
}
}
private static DataList.Node _PartitionLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
DataList.Node pivot = low;
DataList.Node i = low;
for (DataList.Node j = i.Next(); j != high; j=j.Next())
{
if (j.Value().CompareTo(pivot.Value()) <= 0)
{
items.Swap(i.Next(),j);
i = i.Next();
}
}
items.Swap(pivot, i);
return i;
}
Here is the quick sort array implementation:
public static void SortData(DataArray items, int low, int high)
{
if (low < high)
{
int pi = _PartitionData(items, low, high);
SortData(items, low, pi - 1);
SortData(items, pi + 1, high);
}
}
static int _PartitionData(DataArray arr, int low, int high)
{
double pivot = arr[high];
int i = (low - 1);
for (int j = low; j <= high - 1; j++)
{
if (arr[j].CompareTo(pivot)<=0)
{
i++;
arr.Swap(i,j);
}
}
arr.Swap(i + 1, high);
return i + 1;
}
Here is the quick sort performance for the array and the linked list (left column: n, right column: time):
Picture
As you can see, the quick sort on the linked list took 10 minutes to sort 6400 elements. I don't think that's normal.
I also don't think it's because of the data structure, because I was using the same structure for selection sort, and the performance for both the linked list and the array was similar.
GitHub repo in case I forgot to provide some code: Repo
10 minutes is a very long time for 6400 elements. It would normally require 2 or 3 horrible mistakes, not just one.
Unfortunately, I only see one horrible mistake: Your second recursive call to SortLinkedList(items, p.Next(), null); goes all the way to the end of the list. You meant for it to stop at high.
That might account for the 10 minutes, but it seems a little unlikely.
It also looks to me like your sort is incorrect, even after you fix the above bug -- be sure to test the output!
I would look at your linked list, particularly the Swap method. Without seeing the implementation of the linked list, I suspect the problem is there.
Is there a reason why you're using linked lists? They have O(n) access, which turns quicksort into an O(n² log n) sort.
A different way to do it is to add all the items in your linked list to a List<T>, sort that list, and recreate your linked list. List<T>.Sort() uses an introspective sort (a quicksort variant).
public static void SortLinkedList(DataList items)
{
    // Note: First(), Clear() and Add() are assumed members of the asker's DataList class.
    var actualList = new List<object>();
    for (DataList.Node j = items.First(); j != null; j = j.Next())
    {
        actualList.Add(j.Value());
    }
    actualList.Sort();
    items.Clear();
    for (int i = 0; i < actualList.Count; i++)
    {
        items.Add(actualList[i]);
    }
}
Quick sort for a linked list is normally slightly different than quick sort for arrays. Use the first node's data value as the pivot value. The code then creates 3 lists: one for values < pivot, one for values == pivot, and one for values > pivot. It then makes recursive calls for the < pivot and > pivot lists. When the recursive calls return, those 3 lists are sorted, so the code only needs to concatenate the 3 lists.
To speed up concatenation of lists, keep track of a pointer to the last node. To simplify this, use circular lists, and use a pointer to the last node as the main way to access a list. This makes appending (joining) lists simpler (no scanning). Once inside a function, use last->next to get a pointer to the first node of a list.
Two of the worst-case data patterns are already sorted data and already reverse-sorted data. If the circular list with a pointer to the last node is used, then the average of the last and first nodes could be used as a median of 2, which could help (note the list for nodes == pivot could end up empty).
Worst-case time complexity is O(n^2). Worst-case stack usage is O(n). The stack usage could be reduced by recursing on the smaller of the list < pivot and the list > pivot. After that call returns, the now sorted smaller list would be concatenated with the list == pivot and saved in a 4th list. The sort process would then iterate on the remaining unsorted list, and finally merge (or perhaps join) it with the saved list.
Sorting a linked list, using any method, including bottom-up merge sort, will be slower than moving the linked list to an array, sorting the array, and then creating a linked list from the sorted array. However, the quick sort method described here (see the sketch below) will be much faster than using an array-oriented algorithm with a linked list.
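A minimal sketch of that three-list idea, using the framework's LinkedList<int> rather than the asker's DataList, and plain concatenation instead of the circular-list splicing described above:

using System.Collections.Generic;

static class LinkedListQuickSort
{
    public static LinkedList<int> Sort(LinkedList<int> list)
    {
        if (list.Count <= 1)
            return list;

        int pivot = list.First.Value;
        var less = new LinkedList<int>();
        var equal = new LinkedList<int>();
        var greater = new LinkedList<int>();

        // Partition into three lists: < pivot, == pivot, > pivot.
        foreach (int value in list)
        {
            if (value < pivot) less.AddLast(value);
            else if (value > pivot) greater.AddLast(value);
            else equal.AddLast(value);
        }

        // Recurse on the strictly smaller/larger parts, then concatenate the three results.
        var result = Sort(less);
        foreach (int value in equal) result.AddLast(value);
        foreach (int value in Sort(greater)) result.AddLast(value);
        return result;
    }
}

The circular-list variant described above avoids these O(n) concatenation loops by splicing whole lists together through their last nodes.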
What's the most performant way to remove n elements from a collection and add those removed n elements to an already existing, different, collection?
Currently I've got this:
var entries = collection.Take(5).ToList();
foreach(var entry in entries)
collection.Remove(entry);
otherCollection.AddRange(entries);
However, this doesn't look performant at all to me (multiple linear algorithms instead of only one).
A possible solution may of course change the collection implementation - as long as the following requirements are met:
otherCollection must implement IEnumerable<T>, it is currently of type List<T>
collection must implement ICollection<T>, it is currently of type LinkedList<T>
Hint: entries do not necessarily implement Equals() or GetHashCode().
What's the most performant way to reach my goal?
As my performance considerations have obviously been too hard to understand, here is my code example once more:
var entries = collection.Take(1000).ToList(); // 1000 steps
foreach(var entry in entries) // 1000 * 1 steps (as Remove finds the element always immediately at the beginning)
collection.Remove(entry);
otherCollection.AddRange(entries); // another 1000 steps
That makes 3000 steps in total => I want to reduce it to a single pass of 1000 steps.
The previous function only returns half the results. You should use:
public static IEnumerable<T> TakeAndRemove<T>(Queue<T> queue, int count)
{
for (int i = 0; i < count && queue.Count > 0; i++)
yield return queue.Dequeue();
}
With your use case, the best data structure seems to be a queue. When using a queue, your method can look like this:
public static IEnumerable<T> TakeAndRemove<T>(Queue<T> queue, int count)
{
count = Math.Min(queue.Count, count);
for (int i = 0; i < count; i++)
yield return queue.Dequeue();
}
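A usage sketch under the assumption that collection is switched from LinkedList<T> to Queue<T> (Entry and LoadEntries() are placeholders, not from the question):

var collection = new Queue<Entry>(LoadEntries());
var otherCollection = new List<Entry>();

// The iterator is lazy: items are only dequeued while it is enumerated,
// and AddRange enumerates it completely in a single pass.
otherCollection.AddRange(TakeAndRemove(collection, 1000));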
I found that dictionary lookup can be very slow compared to flat array access. Any idea why? I'm using ANTS Profiler for performance testing. Here's a sample function that reproduces the problem:
private static void NodeDisplace()
{
var nodeDisplacement = new Dictionary<double, double[]>();
var times = new List<double>();
for (int i = 0; i < 6000; i++)
{
times.Add(i * 0.02);
}
foreach (var time in times)
{
nodeDisplacement.Add(time, new double[6]);
}
var five = 5;
var six = 6;
int modes = 10;
var arrayList = new double[times.Count*6];
for (int i = 0; i < modes; i++)
{
int k=0;
foreach (var time in times)
{
for (int j = 0; j < 6; j++)
{
var simpelCompute = five * six; // 0.027 sec
nodeDisplacement[time][j] = simpelCompute; //0.403 sec
arrayList[6*k+j] = simpelCompute; //0.0278 sec
}
k++;
}
}
}
Notice the relative magnitude between flat array access and dictionary access? Flat array access is roughly 15 times faster than dictionary access (0.403 / 0.0278), even after taking into account the array index manipulation (6*k+j).
As weird as it sounds, dictionary lookup is taking a major portion of my time, and I have to optimize it.
Yes, I'm not surprised. The point of dictionaries is that they're used to look up arbitrary keys. Consider what has to happen for a single array dereference:
Check bounds
Multiply index by element size
Add index to pointer
Very, very fast. Now for a dictionary lookup (very rough; depends on implementation):
Potentially check key for nullity
Take hash code of key
Find the right slot for that hash code (probably a "mod prime" operation)
Probably dereference an array element to find the information for that slot
Compare hash codes
If the hash codes match, compare for equality (and potentially go on to the next hash code match)
If you've got "keys" which can very easily be used as array indexes instead (e.g. contiguous integers, or something which can easily be mapped to contiguous integers) then that will be very, very fast. That's not the primary use case for hash tables. They're good for situations which can't easily be mapped that way - for example looking up by string, or by arbitrary double value (rather than doubles which are evenly spaced, and can thus be mapped to integers easily).
I would say that your title is misleading - it's not that dictionary lookup is slow, it's that when arrays are a more suitable approach, they're ludicrously fast.
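For instance (a sketch based on the question's data, where the keys are the evenly spaced times i * 0.02), such keys can be turned into an array index directly, skipping the hashing machinery entirely:

const double step = 0.02;
const int timeCount = 6000;
var displacement = new double[timeCount, 6];   // rows indexed by time step, 6 values per row

double time = 1.38;                            // a key we want to look up
int row = (int)Math.Round(time / step);        // 1.38 / 0.02 == 69, a direct index
double value = displacement[row, 3];           // no hashing, no equality checks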
In addition to Jon's answer, I would like to add that your inner loop does not do very much; normally you do at least some more work in the inner loop, and then the relative performance loss of the dictionary is somewhat lower.
If you look at the code for Double.GetHashCode() in Reflector, you'll find that it executes 4 lines of code (assuming your double is not 0); that alone is more than the body of your inner loop. Dictionary<TKey, TValue>.Insert() (called by the set indexer) is even more code, almost a screen full.
The thing with Dictionary compared to a flat array is that you don't waste too much memory when your keys are not dense (as they are in your case), and that reads and writes are ~O(1) like arrays (but with a higher constant).
As a side note, you can use a multi-dimensional array instead of the 6*k+j trick.
Declare it this way
var arrayList = new double[times.Count, 6];
and use it this way
arrayList[k, j] = simpelCompute;
It won't be faster, but it is easier to read.
Is the Lookup Time for a HashTable or Dictionary Always O(1) as long as it has a Unique Hash Code?
If a HashTable has 100 Million Rows would it take the same amount of time to look up as something that has 1 Row?
No. It is technically possible but it would be extremely rare to get the exact same amount of overhead. A hash table is organized into buckets. Dictionary<> (and Hashtable) calculate a bucket number for the object with an expression like this:
int bucket = key.GetHashCode() % totalNumberOfBuckets;
So two objects with different hash codes can end up in the same bucket. Conceptually, a bucket is a list of entries; the indexer then searches that list for the key, which is O(n) where n is the number of items in the bucket.
Dictionary<> dynamically increases the value of totalNumberOfBuckets to keep the bucket search efficient. When you pump a hundred million items into the dictionary, there will be a very large number of buckets (on the order of the number of items). The odds that the bucket is empty when you add an item will be quite small. But if it is, by chance, then yes, it will take just as long to retrieve the item.
The amount of overhead increases very slowly as the number of items grows. This is called amortized O(1).
Might be helpful: .NET HashTable Vs Dictionary - Can the Dictionary be as fast?
As long as there are no collisions with the hashes, yes.
var dict = new Dictionary<string, string>();
for (int i = 0; i < 100; i++) {
dict.Add("" + i, "" + i);
}
long start = DateTime.Now.Ticks;
string s = dict["10"];
Console.WriteLine(DateTime.Now.Ticks - start);
for (int i = 100; i < 100000; i++) {
dict.Add("" + i, "" + i);
}
start = DateTime.Now.Ticks;
s = dict["10000"];
Console.WriteLine(DateTime.Now.Ticks - start);
This prints 0 in both cases. So it seems the answer would be yes.
[Got modded down so I'll explain better]
It seems that it is constant. But it depends on the hash function giving a different result for all keys. As there is no hash function that can guarantee that, it all boils down to the data that you feed to the Dictionary. So you will have to test with your data to see if the lookup time is constant.
I have a list of objects and I would like to access the objects in a random order continuously.
I was wondering if there was a way of ensuring that the random values were not always similar.
Example.
My list is a list of Queues, and I am trying to interleave the values to produce a real-world scenario for testing.
I don't particularly want all of the items in Queues 1 and 2 before any other item.
Is there a guaranteed way to do this?
Thanks
EDIT:
The list of Queues I have is basically a list of files that I am transmitting to a web service. The files need to be in a certain order, hence the Queues.
So I have
Queue1 = "set1_1.xml", "set1_2.xml", ... "set1_n.xml"
Queue2 ...
...
QueueN
While each file needs to be transmitted in order relative to the other files in its queue, I would like to simulate a real-world scenario where files would be received from different sources at different times, and so have them interleaved.
At the moment I am just using a simple random number between 0 and the number of Queues to determine which file to dequeue next. This works, but I was asking if there might be a way to get some more uniformity, rather than having 50 files from Queues 1 and 2 and then 5 files from Queue 3.
I do realise though that altering the randomness no longer makes it random.
Thank you for all your answers.
Well, it isn't entirely clear what the scenario is, but the thing with random is that you never can tell ;-p. Anything you try to do to "guarantee" things will probably reduce the randomness.
How are you doing it? Personally I'd do something like:
static IEnumerable<T> GetItems<T>(IEnumerable<Queue<T>> queues)
{
int remaining = queues.Sum(q => q.Count);
Random rand = new Random();
while (remaining > 0)
{
int index = rand.Next(remaining);
foreach (Queue<T> q in queues)
{
if (index < q.Count)
{
yield return q.Dequeue();
remaining--;
break;
}
else
{
index -= q.Count;
}
}
}
}
This should be fairly uniform over the entire set. The trick here is that by treating the queues as a single large queue, the tendency is that the queues with lots of items will get dequeued more quickly (since there is more chance of getting an index in their range). This means that it should automatically balance consumption between the queues so that they all run dry at (roughly) the same time. If you don't have LINQ, just change the first line:
int remaining = 0;
foreach(Queue<T> q in queues) {remaining += q.Count;}
Example usage:
static void Main()
{
List<Queue<int>> queues = new List<Queue<int>> {
Build(1,2,3,4,5), Build(6,7,8), Build(9,10,11,12,13)
};
foreach (int i in GetItems(queues))
{
Console.WriteLine(i);
}
}
static Queue<T> Build<T>(params T[] items)
{
Queue<T> queue = new Queue<T>();
foreach (T item in items)
{
queue.Enqueue(item);
}
return queue;
}
It depends on what you really want...
If the "random" values are truly random then you will get uniform distribution with enough iterations.
If you're talking about controlling or manipulating the distribution then the values will no longer be truly random!
So, you can either have:
Truly random values with uniform distribution, or
Controlled distribution, but no longer truly random
Are you trying to shuffle your list?
If so you can do it by sorting it on a random value.
Try something like this:
private Random random = new Random();
public int RandomSort(Queue q1, Queue q2)
{
if (q1 == q2) { return 0; }
return random.Next().CompareTo(random.Next());
}
And then use RandomSort as the argument in a call to List.Sort().
If the items to be queued have a GetHashCode() implementation which distributes values evenly across all integers, you can take the hash value modulo the number of queues to decide which queue to add the item to. This is basically the same principle that hash tables use to ensure an even distribution of values.
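A minimal sketch of that idea, assuming queues is a List<Queue<string>> of file names (the names and the queue count here are placeholders):

var queues = new List<Queue<string>> { new Queue<string>(), new Queue<string>(), new Queue<string>() };

string item = "set1_1.xml";
int index = (item.GetHashCode() & 0x7FFFFFFF) % queues.Count;   // strip the sign bit, then take the modulus
queues[index].Enqueue(item);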