Good resource for explaining how hash codes are used in collections - c#

Can anyone give a good explanation and / or links to a good resource of how hash codes are used in storing and retrieving objects in hashtables, dictionaries etc, specifically in C# / .NET.
I'm interested to see how Equals and GetHashCode are used collectively when storing and retrieving items.

It depends on the collection, but for a dictionary the hash code is used to determine which bucket the object is added to, and Equals is used to find the item within the bucket, amongst other items which may have the same hash.
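To make the two steps concrete, here is a minimal sketch of a bucketed map. TinyBucketMap is a hypothetical teaching class, not the actual implementation of Dictionary<TKey, TValue>; it only shows GetHashCode choosing the bucket and Equals picking the entry inside it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical teaching class; the real Dictionary<TKey, TValue> is more elaborate.
public class TinyBucketMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public TinyBucketMap(int bucketCount = 16)
    {
        buckets = new List<KeyValuePair<TKey, TValue>>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            buckets[i] = new List<KeyValuePair<TKey, TValue>>();
    }

    // Step 1: the hash code decides which bucket the key belongs to.
    private List<KeyValuePair<TKey, TValue>> BucketFor(TKey key)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF; // clear the sign bit
        return buckets[hash % buckets.Length];
    }

    public void Add(TKey key, TValue value)
    {
        var bucket = BucketFor(key);
        foreach (var pair in bucket)
        {
            if (pair.Key.Equals(key))
                throw new ArgumentException("Duplicate key.");
        }
        bucket.Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    // Step 2: Equals finds the matching entry among keys that share the bucket.
    public bool TryGetValue(TKey key, out TValue value)
    {
        foreach (var pair in BucketFor(key))
        {
            if (pair.Key.Equals(key))
            {
                value = pair.Value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}
```

Two keys with different hash codes can still share a bucket, and two unequal keys can share a hash code, which is why the Equals check is always needed.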

This is a pretty good demo: http://research.cs.vt.edu/AVresearch/hashing/buckethash.php

Try the documentation for object.GetHashCode:
"A hash code is a numeric value that is used to identify an object during equality testing. It can also serve as an index for an object in a collection.
The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table."

Related

Are hash codes of System.Type objects of types from the same assembly guaranteed to be unique?

Clarifying edit: The keys in the dictionary are actual instances of System.Type. More specifically every value is stored with its type as the key.
In a specific part of my program the usage of Dictionary<System.Type, SomeThing> takes a large chunk of CPU time, as per Visual Studio 2017 performance profiler.
Changing the dictionary to Dictionary<int, SomeThing> and passing type.GetHashCode() instead of the type object itself seems to be about 20%-25% faster.
The above optimization will result in a nasty bug if two types have the same hash code, but it seems plausible to me that types can have unique hash codes, at least when it comes to types from the same assembly - which all the types used in this dictionary are.
Possibly relevant information - As per this answer the number of possible types in an assembly is far smaller than the number of values represented by System.Int32.
No. The documentation on object.GetHashCode() makes no guarantees, and states:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
...
Do not use the hash code as the key to retrieve an object from a keyed collection.
Because equal hash codes are necessary, but not sufficient, for two objects to be equal.
If you're wondering if Type.GetHashCode() follows a more restrictive definition, its documentation makes no mention of such a change, so it still does not guarantee uniqueness. The reference source does not show any attempt to make this guarantee, either.
A hash code is never guaranteed to be unique for different values, so you should not use it the way you are doing.
The same value should, however, always generate the same hash code.
This is also stated in MSDN:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.
and somewhat further:
Do not use the hash code as the key to retrieve an object from a keyed collection.
Therefore, I would not rely on GetHashCode being unique for different types either, but at least you can check it:
using System;
using System.Collections.Generic;

// Replace typeof(int) with any type from the assembly you want to inspect.
var types = typeof(int).Assembly.GetTypes();
var seen = new Dictionary<int, string>(); // hash code -> name of the first type that produced it
Console.WriteLine($"Inspecting {types.Length} types...");
foreach (var t in types)
{
    int hash = t.GetHashCode();
    if (seen.TryGetValue(hash, out var other))
    {
        Console.WriteLine($"{t.Name} has the same hash code as {other}");
    }
    else
    {
        seen.Add(hash, t.Name);
    }
}
Console.WriteLine("done!");
But even if the test above found no collisions, I wouldn't do it: the implementation of GetHashCode can change over time, so collisions might appear in the future.
A hash code isn't meant to be unique. Instead it is used in hash-based collections such as Dictionary to limit the number of possible ambiguities. A hash code is nothing but an index, so instead of searching the entire collection for a match, only the few items that share a common value (the hash code) have to be considered.
In fact, you could even have a hash implementation that always returns the same number for every item. However, that makes every key lookup in your dictionary O(n), because every key has to be compared.
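The degenerate case can be observed directly. The sketch below uses a hypothetical key type, ConstantHashKey, whose GetHashCode always returns the same number, and counts how many times Equals is called during a single lookup. The names and the counter are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type whose hash code is deliberately useless.
public sealed class ConstantHashKey
{
    public static int EqualsCalls; // counts comparisons, for demonstration only

    public int Id { get; }

    public ConstantHashKey(int id) { Id = id; }

    // Every key reports the same hash, so every key collides with every other key.
    public override int GetHashCode() => 42;

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        var other = obj as ConstantHashKey;
        return other != null && other.Id == Id;
    }
}

public static class ConstantHashDemo
{
    // Fills a dictionary with itemCount keys, then looks up a key that is absent
    // and reports how many Equals calls that single lookup needed.
    public static int CountEqualsCallsForMissingKey(int itemCount)
    {
        var dict = new Dictionary<ConstantHashKey, string>();
        for (int i = 0; i < itemCount; i++)
            dict[new ConstantHashKey(i)] = "item " + i;

        ConstantHashKey.EqualsCalls = 0;
        dict.ContainsKey(new ConstantHashKey(-1));
        return ConstantHashKey.EqualsCalls;
    }
}
```

Because all keys collide, the lookup has to compare against every stored key before it can conclude the key is missing, so the number of Equals calls grows with the size of the dictionary. The dictionary still returns correct results; it is only slow.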
Anyway, you shouldn't strive for micro-optimizations that gain you a few nanoseconds at the expense of maintainability and understandability. Instead, use a data structure that gets the job done and is easy to understand.

C# Hashcode Return value

For a given string "5", if I use the built in GetHashCode() function what is the value that is returned? Am I confused in that it returns the integer value of 5?
It's implementation specific, and you should not rely on anything you happen to observe in one particular implementation. Only rely on what is guaranteed: two equal strings will return the same value, in the same process. The same string value can return a different hash next time you run your program, or on a different machine. This means you should never persist the result of GetHashCode - it's not useful for future comparisons.
If two strings return the same hash code they may be equal - but they may not be.
For string.GetHashCode(), the MSDN docs say:
If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.
The hash code itself is not guaranteed to be stable. Hash codes for identical strings can differ across versions of the .NET Framework and across platforms (such as 32-bit and 64-bit) for a single version of the .NET Framework. In some cases, they can even differ by application domain.
As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.
Finally, do not use the hash code instead of a value returned by a cryptographic hashing function if you need a cryptographically strong hash. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
So it's a kind of "quick compare check" feature for strings, but you should not rely on the hash alone. It's important to know that these hash codes are not stable, meaning you must never store them in files, databases etc. Don't persist them.
In general, GetHashCode() is specific to the class implementation, as Jon wrote. If we look at the MSDN docs for object.GetHashCode(), we see that hash codes serve as an index for hash-based collections, so that entries are spread evenly across the collection. See this article for more information on hashing algorithms.
So if you query one and the same object using GetHashCode(), it should return the same hash code. That code may be different the next time your application runs.

Write-once read-many string-to-object map

I'm looking for a data structure that can possibly outperform Dictionary<string, object>. I have a map that has N items - the map is constructed once and then read many, many times. The map doesn't change during the lifetime of the program (no new items are added, no items are deleted and items are not reordered). Because the map doesn't change, it doesn't need to be thread-safe, even though the application using it is heavily multi-threaded. I expect that ~50% of lookups will happen for items not in the map.
Dictionary<TKey, TItem> is quite fast and I may end up using it but I wonder if there's another data structure that's faster for this scenario. While the rest of the program is obviously more expensive than this map, it is used in performance-critical parts and I'd like to speed it up as much as possible.
What you're looking for is a perfect hash function. You can create one based on your list of strings, and then use it for the dictionary.
The non-generic Hashtable has a constructor that accepts an IHashCodeProvider, which lets you specify your own hash function. Dictionary<TKey, TValue> offers the same hook through its constructor that takes an IEqualityComparer<TKey>, so either collection can be used.
You can use it internally in your PerfectStringHash class, which will do any type casting for you.
Note that you may need to be able to specify the number of buckets in the hash. I think Hashtable only lets you specify an initial capacity and the load factor. You may find that you need to roll your own hash table entirely. A generic perfect hash would be a good class for everyone to use, I guess.
EDIT: Apparently someone has already implemented some perfect hash algorithms in C#.
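As a rough illustration of where a custom hash function plugs in, the sketch below passes an IEqualityComparer<string> to the Dictionary constructor. Fnv1aStringComparer and ReadOnlyMapDemo are hypothetical names, and the FNV-1a-style hash is only a stand-in: it is not a perfect hash, and a real perfect hash would be generated from the known key set.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that supplies a hand-written hash function to Dictionary.
// The FNV-1a-style hash below is a placeholder, NOT a perfect hash.
public sealed class Fnv1aStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;          // mix in one UTF-16 code unit
                hash *= 16777619;   // multiply by the FNV prime
            }
            return (int)hash;
        }
    }
}

public static class ReadOnlyMapDemo
{
    // Builds the map once; afterwards it is only read.
    public static Dictionary<string, object> Build(IEnumerable<KeyValuePair<string, object>> items)
    {
        var map = new Dictionary<string, object>(new Fnv1aStringComparer());
        foreach (var item in items)
            map.Add(item.Key, item.Value);
        return map;
    }
}
```

Whether any custom hash beats the built-in string hash for a given key set is something only a measurement on real data can tell.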
The read performance of the generic dictionary is "close to O(1)" according to the remarks on MSDN for most TKey (and you should get pretty good performance with just string keys). And you get this out of the box, free, from the framework, without implementing your own collection.
http://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.90).aspx
If you need to stick with string keys, Dictionary is at least a very good choice, if not the best one.
One more thing to note when you start measuring: consider whether computing the hash itself has a measurable impact. The hash of a long string takes longer to compute than that of a short one. See if the items you want to look up can be represented as other objects whose hash can be obtained in constant time.
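One possible way to get a constant-time hash is to wrap the string in a key object that computes the hash once and caches it. CachedHashKey below is a hypothetical type written for this illustration, and it only pays off when the same key objects are reused across many lookups.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key wrapper: the string's hash is computed once, in the constructor.
public sealed class CachedHashKey : IEquatable<CachedHashKey>
{
    private readonly int hash;

    public string Text { get; }

    public CachedHashKey(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        Text = text;
        hash = text.GetHashCode(); // the only place the string is hashed
    }

    // Constant time, regardless of how long Text is.
    public override int GetHashCode() => hash;

    public bool Equals(CachedHashKey other)
    {
        // Compare the cached hashes first; fall back to the full string comparison.
        return other != null
            && hash == other.hash
            && string.Equals(Text, other.Text, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) => Equals(obj as CachedHashKey);
}
```

A Dictionary<CachedHashKey, object> built from such keys avoids rehashing the string on each lookup, at the cost of an extra allocation per key.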

Referencing an object using its hashcode?

I have created an object, say details.
I then assign: int x = details.GetHashCode();
Later in the program, I would like to access this object using the integer x. Is there a way to do this in C#?
Many thanks
Paul
No:
It may have been garbage collected, unless you've got something in place to stop that.
Hash codes aren't unique - what if there are two objects with the same hash code? (See Eric Lippert's post about hash codes for more information.)
You could create (say) a Dictionary<int, Details> and use the hash code as the key - but I'd strongly recommend that you didn't do that.
Any reason you don't want to just keep a reference to the object instead of the hash code?
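If an integer handle really is needed, one alternative is to hand out identifiers that are unique by construction rather than reusing hash codes. The sketch below assumes a hypothetical Details class and a hypothetical DetailsRegistry; neither comes from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's "details" object.
public sealed class Details
{
    public string Name { get; set; }
}

// Hypothetical registry that issues its own integer handles.
public sealed class DetailsRegistry
{
    private readonly Dictionary<int, Details> byId = new Dictionary<int, Details>();
    private int nextId = 1;

    // Returns a handle that is unique within this registry.
    public int Register(Details details)
    {
        if (details == null) throw new ArgumentNullException(nameof(details));
        int id = nextId++;
        byId.Add(id, details);
        return id;
    }

    public bool TryGet(int id, out Details details)
    {
        return byId.TryGetValue(id, out details);
    }
}
```

The registry holds a reference to every registered object, so the garbage collector will not reclaim them while the registry is alive. That addresses both objections above, but simply keeping the reference is usually simpler still.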
No.
A Hashcode represents some hashing function's value on the object.
You can't recreate the original object's reference from this, and more importantly, there is no guarantee that the object still exists.
If there is a way it would go against object orientation. You should expose the details reference to the consuming code.
Yes, you could create a Dictionary<int,Details> and store the object in this dictionary using the hashcode from details.GetHashCode( ) and then later on pull the object out of the dictionary using x.
But it's not something I would suggest doing! What is it you're trying to achieve?
Store the integer and just check the Hashcode again.
Note that in C# the hash code is not guaranteed to be unique. If you are dealing with a few million objects, you can and will run across duplicate hashes with the default implementation very easily.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx
"The default implementation of the GetHashCode method does not guarantee unique return values for different objects."
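public static class DetailExtensions
{
    // Returns the first item whose hash code equals x, or null if none matches.
    public static Detail GetDetailsFromHash(this List<Detail> detailsList, int x)
    {
        foreach (var details in detailsList)
        {
            if (details.GetHashCode() == x)
            {
                return details;
            }
        }
        return null;
    }
}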
However, hash codes are not guaranteed to be unique, so this can return the wrong object.
A hash code isn't really meant to be consumed directly. Its main purpose is to let the item be used as a key in a collection; it is then the collection's responsibility to map the hash back to the object. Basically, it lets you do something like Dictionary<details, myClass2>, which would be much harder if GetHashCode wasn't implemented. The function isn't of much use unless you are implementing your own collection or equality operator.

Adding keys to a list or collection - is there any value in hashing the key before adding it

I have stumbled across some code that is adding strings to a List but hashing the value before adding it.
It's using an MD5 hash (MD5CryptoServiceProvider) on the string value, then adding this to the list.
Is there any value in doing this in terms of speed to find the key in the list or is this just unnecessary?
I am not going to assume to know what the authors of the code you were viewing were doing with their list. But I will say that if you have a large list and performance of searching is critical, then there's a class for that. HashSet<T> will suit your needs nicely.
First of all, a list (List<T>) does not have “keys”. However, a Dictionary<TKey, TValue> does.
Secondly, to answer your performance question: no, there is actually a performance penalty in computing that hash. However, before you jump to the conclusion that it is unnecessary, examine the surrounding code and consider whether the author actually needed the MD5 hash and not the string itself.
Thirdly, if you need to look something up efficiently, you can use a HashSet<T> if you just need to check its existence, or Dictionary<TKey, TValue> if you need to associate the keys that you look up with a value.
If you place strings in a dictionary or hashset, C# will already generate a hash value from any string you put in. This generally uses a hash algorithm that is much faster than MD5.
I don't think it's necessary to do this for a List if the aim is to improve performance. Strings are strings and are looked up the same way whether they are hashed or not.
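For comparison, a membership check that does not need MD5 at all can be sketched with HashSet<string>. KeyIndex is a hypothetical wrapper name; the set relies on the string hashing that .NET already performs internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper around HashSet<string> for fast existence checks.
public sealed class KeyIndex
{
    private readonly HashSet<string> keys;

    public KeyIndex(IEnumerable<string> source)
    {
        // The set hashes each string itself; no manual MD5 step is involved.
        keys = new HashSet<string>(source, StringComparer.Ordinal);
    }

    public int Count
    {
        get { return keys.Count; }
    }

    public bool Contains(string key)
    {
        return keys.Contains(key);
    }
}
```

A List<string> would answer the same question with List<T>.Contains, but by scanning the items one by one, whether or not the strings were hashed with MD5 first.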
