GetHashCode for Dictionary items - c#

I override the Equals method for one of my classes. In the method, I check that each key/value pair of a dictionary matches a pair in the dictionary of another instance, as in the following:
public override bool Equals (object obj)
{
...
// compare to make sure all <key, value> pair of this.dict have
// the match in obj.dict
...
}
Now I need to override the GetHashCode method as well, as is recommended.
Do I need to do that for all the keys of the dictionary, or keys plus values?
Basically, would the following be good or overkill?
public override int GetHashCode ()
{
int iHash = 0;
foreach (KeyValuePair<string, T> pair in this.dict)
{
iHash ^= pair.Key.GetHashCode();
iHash ^= pair.Value.GetHashCode();
}
return iHash;
}

Going along with what @Mitch Wheat linked to, that's not the best way to implement GetHashCode() if you use this class with a Dictionary or HashSet.
Imagine your internal Dictionary had only one entry. Your hash is now the value of that single KeyValuePair. You stick the whole class in a HashSet. You add another item to your internal Dictionary. Now the hashcode of your class has changed, because you're iterating over two items in your class.
When you call HashSet.Contains(obj), it calls obj.GetHashCode(), which has now changed even though it's the same class instance. HashSet.Contains() will find that it doesn't contain this new hash and return false, never calling Equals (which will return true if the references are the same).
Suddenly it's as if your object has disappeared from the HashSet, even though the instance is still in there, stored under an outdated hash.
You really don't want your hash to change. It's okay to have collisions in your GetHashCode, because when two hashes collide the collection falls back to the (slower) Equals() method. The hash is a handy optimization that, if improperly implemented, can cause some headaches along the way.
As a side note, as pointed out in the link above, it's a good idea to multiply your hash by a prime number before XORing it with another value. That helps keep the hashes well distributed.
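The disappearing-object scenario described above can be reproduced with a small sketch. The Bag class is hypothetical, and int keys and values are used (instead of the question's string keys and T values) only so that the hash codes are deterministic. The per-entry hash multiplies the key hash by a prime (397, an arbitrary choice) before XORing in the value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class: equality and hash code both depend on a mutable dictionary.
public sealed class Bag
{
    public Dictionary<int, int> Items { get; } = new Dictionary<int, int>();

    public override bool Equals(object obj)
    {
        var other = obj as Bag;
        if (other == null || other.Items.Count != Items.Count) return false;
        foreach (var pair in Items)
        {
            int value;
            if (!other.Items.TryGetValue(pair.Key, out value) || value != pair.Value)
                return false;
        }
        return true;
    }

    public override int GetHashCode()
    {
        int hash = 0;
        foreach (var pair in Items)
        {
            // Multiply the key hash by a prime before mixing in the value, then XOR
            // the per-entry results so enumeration order does not matter.
            hash ^= unchecked(pair.Key.GetHashCode() * 397) ^ pair.Value.GetHashCode();
        }
        return hash;
    }
}

public static class MutationDemo
{
    public static void Main()
    {
        var bag = new Bag();
        bag.Items[1] = 10;
        var set = new HashSet<Bag> { bag };
        Console.WriteLine(set.Contains(bag)); // True: the hash matches the one stored at insertion

        bag.Items[2] = 20;                    // mutating the object changes its hash code
        Console.WriteLine(set.Contains(bag)); // False: the set searches under the new hash
    }
}
```

The instance is still physically inside the set after the mutation; it simply can no longer be found through Contains or Remove.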

Do you plan to use the object in a HashSet? You really only need to implement GetHashCode if the object is going to be used in such a way that it requires it be uniquely identifiable by its hash. It is good practice to always implement GetHashCode taking into account the same fields that are used in equality, but not always necessary.
If it is necessary in your case, I believe you have the right idea.

Related

How can the method Dictionary.Add be O(1) amortized?

I am trying to get a better understanding of how hashtables and dictionaries work in C# from a complexity perspective (but I guess the language is not an important factor, that's probably just a theoretical question).
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
However, let's look at that code:
public class Foo {
public Foo() { }
public override int GetHashCode() {
return 5; //arbitrary value, purposely a constant
}
}
static void Main(string[] args) {
Dictionary<Foo, int> test = new Dictionary<Foo,int>();
Foo a = new Foo();
Foo b = new Foo();
test.Add(a, 5);
test.Add(b, 6); //1. no exception raised, even though GetHashCode() returns the same hash
test.Add(a, 10); //2. exception raised
}
I understand that behind the scenes there is a hash collision at 1., and that separate chaining is probably used to handle it.
However, at 2. an ArgumentException is raised. That means that internally the Dictionary keeps track of each key inserted after having determined its hash. It also means that each time we add an entry to our dictionary, it checks whether the key has already been inserted, using the Equals method.
My question is, why is it considered to be O(1) complexity when it seems like it should be O(n) if it checks the already inserted keys?
But it doesn't have to check all of the keys. It only has to check keys that hash to the same value. And, as you say, a good hash code will minimize the number of hash collisions, so on average it hardly has to make any key comparisons at all.
Remember, the rules for GetHashCode say that if a.GetHashCode() != b.GetHashCode(), then a != b. But if a.GetHashCode() == b.GetHashCode(), a might be equal to b.
Also, you say:
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
That's not entirely true. That's the ideal, assuming a perfect hash function that will give a unique number for every key. But the perfect hash function doesn't exist, in the general case, so typically you'll see O(1) (or very close to it) performance until Count exceeds some fairly large percentage of the capacity: say 85% or 90%.
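To see that the number of key comparisons depends on the hash codes rather than on Count, here is a minimal sketch. CountingKey and CollisionDemo are hypothetical names, and the counts reflect how the current Dictionary implementation behaves, which is an implementation detail rather than a documented guarantee.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that counts how often Equals is invoked.
public sealed class CountingKey
{
    public static int EqualsCalls;
    private readonly int id;
    private readonly bool constantHash;

    public CountingKey(int id, bool constantHash)
    {
        this.id = id;
        this.constantHash = constantHash;
    }

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        var other = obj as CountingKey;
        return other != null && other.id == id;
    }

    // Either a constant (every key collides) or the id itself (no two keys share a hash).
    public override int GetHashCode() => constantHash ? 5 : id;
}

public static class CollisionDemo
{
    // Adds n distinct keys and returns how many Equals calls the dictionary made.
    public static int CountEqualsCalls(int n, bool constantHash)
    {
        CountingKey.EqualsCalls = 0;
        var dict = new Dictionary<CountingKey, int>();
        for (int i = 0; i < n; i++)
            dict.Add(new CountingKey(i, constantHash), i);
        return CountingKey.EqualsCalls;
    }

    public static void Main()
    {
        Console.WriteLine(CountEqualsCalls(1000, constantHash: true));  // grows roughly with n squared
        Console.WriteLine(CountEqualsCalls(1000, constantHash: false)); // stays small
    }
}
```

With the constant hash, every Add has to walk the whole chain of earlier keys, which is the O(n) behavior the question worries about. With distinct hashes, Add rarely needs Equals at all.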
The answer is simple and difficult.
Simple part: It is because (you can check it by yourself)
a.Equals(b) == false
If you want an exception when adding "b", just override the Equals method as well.
Now the difficult part:
The default object implementation of Equals calls RuntimeHelpers.Equals. The source of RuntimeHelpers is here. Unfortunately, the method is extern:
[System.Security.SecuritySafeCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
public new static extern bool Equals(Object o1, Object o2);
I do not know the exact implementation of this method, but I think it is based on pointers (that is, addresses in memory).

Making HashSet<MyType> differ between objects with the same values

I have overridden the Equals(Object comparee) method, but when I add objects to my HashSet I still get duplicates. What did I miss? The MyType type contains two int fields (let's say that). Is HashSet the wrong collection type, perhaps?
I wish to add some MyType thingies but so that the collection only stores the unique ones, where unique is defined by me (using Equals method or any other way).
You should always override GetHashCode() when you override Equals(). I typically return some sort of primary key, if available, for that method. Otherwise, you can check out this thread for ideas for implementing it.
The key to understanding the relationship between those two methods is:
If two entries have different hash codes, they are definitely not equal.
If two entries have the same hash code, they might be equal, so call Equals() to find out for sure.
You need to override GetHashCode() as well; otherwise, your objects will have different hashcodes and will therefore automatically be assumed to be different. Take some unique-ish value from your object and use that if available, or just generate your own.
And don't be lazy and use the same hash code for all of them, either; that will defeat the purpose of a HashSet.
So for your example with two int fields, you might do something like:
public override int GetHashCode() {
return field1 ^ field2;
}
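Putting the two overrides together, a minimal sketch might look like the following. MyType here is a hypothetical stand-in for the question's class, and the prime 397 is an arbitrary choice. The plain field1 ^ field2 above is valid, but it gives (1, 2) and (2, 1) the same hash and maps every pair of equal fields to 0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's MyType with two int fields.
public sealed class MyType
{
    private readonly int field1;
    private readonly int field2;

    public MyType(int field1, int field2)
    {
        this.field1 = field1;
        this.field2 = field2;
    }

    public override bool Equals(object obj)
    {
        var other = obj as MyType;
        return other != null && field1 == other.field1 && field2 == other.field2;
    }

    public override int GetHashCode()
    {
        // Multiplying one field by a prime makes the combination asymmetric,
        // so swapped field values usually land on different hash codes.
        return unchecked(field1 * 397) ^ field2;
    }
}

public static class MyTypeDemo
{
    public static void Main()
    {
        var set = new HashSet<MyType>
        {
            new MyType(1, 2),
            new MyType(1, 2), // equal to the first entry, so it is not stored again
            new MyType(2, 1)
        };
        Console.WriteLine(set.Count); // 2
    }
}
```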

implement GetHashCode() for objects that contain collections

Consider the following objects:
class Route
{
public int Origin { get; set; }
public int Destination { get; set; }
}
Route implements equality operators.
class Routing
{
public List<Route> Paths { get; set; }
}
I used the code below to implement GetHashCode method for the Routing object and it seems to work but I wonder if that's the right way to do it? I rely on equality checks and as I'm uncertain I thought I'll ask you guys. Can I just sum the hash codes or do I need to do more magic in order to guarantee the desired effect?
public override int GetHashCode()
{
return (Paths != null
? (Paths.Select(p => p.GetHashCode())
.Sum())
: 0);
}
I checked several GetHashCode() questions here as well as MSDN and Eric Lippert's article on this topic but couldn't find what I'm looking for.
I think your solution is fine. (Much later remark: LINQ's Sum method will act in checked context, so you can very easily get an OverflowException which means it is not so fine, after all.) But it is more usual to do XOR (addition without carry). So it could be something like
public override int GetHashCode()
{
int hc = 0;
if (Paths != null)
foreach (var p in Paths)
hc ^= p.GetHashCode();
return hc;
}
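The overflow remark about Sum can be checked with a short sketch. SumDemo and its helper names are hypothetical; the point is only that Enumerable.Sum over int values throws on overflow, whereas an explicitly unchecked addition wraps around, which is harmless for a hash code.

```csharp
using System;
using System.Linq;

public static class SumDemo
{
    // Adds hash codes with explicit wraparound instead of Enumerable.Sum.
    public static int UncheckedSum(int[] hashes) =>
        hashes.Aggregate(0, (acc, h) => unchecked(acc + h));

    // Reports whether Enumerable.Sum overflows for the given values.
    public static bool LinqSumThrows(int[] hashes)
    {
        try
        {
            hashes.Sum();
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var hashes = new[] { int.MaxValue, 1 };
        Console.WriteLine(LinqSumThrows(hashes));                // True
        Console.WriteLine(UncheckedSum(hashes) == int.MinValue); // True: the addition wrapped around
    }
}
```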
Addendum (after answer was accepted):
Remember that if you ever use this type Routing in a Dictionary<Routing, Whatever>, a HashSet<Routing> or another situation where a hash table is used, then your instance will be lost if someone alters (mutates) the Routing after it has been added to the collection.
If you're sure that will never happen, use my code above. Dictionary<,> and so on will still work if you make sure no-one alters the Routing that is referenced.
Another choice is to just write
public override int GetHashCode()
{
return 0;
}
if you believe the hash code will never be used. If every instance returns 0 for its hash code, you will get very bad performance with hash tables, but your object will not be lost. A third option is to throw a NotSupportedException.
The code from Jeppe Stig Nielsen's answer works, but it could lead to a lot of repeated hash code values. Let's say you are hashing a list of ints in the range 0-100; then your hash code would be guaranteed to be between 0 and 127, because XOR never sets a bit that is not set in at least one input. This makes for a lot of collisions when used in a Dictionary. Here is an improved version:
public override int GetHashCode()
{
int hc = 0;
if (Paths != null)
foreach (var p in Paths) {
hc ^= p.GetHashCode();
hc = (int)(((uint)hc << 7) | ((uint)hc >> (32 - 7))); // rotate hc to the left to sweep over all bits (via uint, so >> does not sign-extend)
}
return hc;
}
This code will at least involve all bits over time as more and more items are hashed in.
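The difference between the two combining strategies can be seen on plain ints. This is a sketch with hypothetical helper names (ListHash, RotateLeft, XorOnly, XorRotate); it hashes int values directly rather than Route objects. Note that rotating makes the result depend on element order, which only suits an Equals that is order-sensitive too.

```csharp
using System;
using System.Collections.Generic;

public static class ListHash
{
    // Rotates the 32 bits of value left by count (0 < count < 32). Going through uint
    // makes >> a logical shift, so no sign bits are smeared into the result.
    public static int RotateLeft(int value, int count)
    {
        uint u = unchecked((uint)value);
        return unchecked((int)((u << count) | (u >> (32 - count))));
    }

    // Plain XOR: order-insensitive, and never sets a bit that no input has set.
    public static int XorOnly(IEnumerable<int> hashes)
    {
        int hc = 0;
        foreach (int h in hashes) hc ^= h;
        return hc;
    }

    // XOR followed by a rotation: spreads small inputs over all 32 bits.
    public static int XorRotate(IEnumerable<int> hashes)
    {
        int hc = 0;
        foreach (int h in hashes) hc = RotateLeft(hc ^ h, 7);
        return hc;
    }

    public static void Main()
    {
        Console.WriteLine(XorOnly(new[] { 1, 2, 3 }));   // 0, since 1 ^ 2 ^ 3 == 0
        Console.WriteLine(XorRotate(new[] { 1, 2, 3 }));
        Console.WriteLine(XorRotate(new[] { 3, 2, 1 })); // differs from the line above: order matters
    }
}
```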
As a guideline, the hash of an object must be the same over the object's entire lifetime. I would leave the GetHashCode function alone and not override it. The hash code is only used if you want to put your objects in a hash table.
You should read Eric Lippert's great article about hash codes in .NET: Guidelines and rules for GetHashCode.
Quoted from that article:
Guideline: the integer returned by GetHashCode should never change
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
The GetHashCode function you implemented will not return the same hash code over the lifetime of the object. If you use this function, you will run into trouble if you add those objects to a hash table: the Contains method will not work.
I don't think that's the right way to do it, because the final hash code should be as distinctive as possible for the specified object. In your case you use Sum(), which can produce the same result from different hash codes in the collection (in the end, hash codes are just integers).
If your intention is to determine equality based on the content of the collection, then just compare the collections of the two objects. That could be a time-consuming operation, by the way.
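Comparing the collections directly could look like the sketch below. It rests on several assumptions: the source only says that Route "implements equality operators", so the Route equality shown here is a guess at a reasonable definition; equality of Routing is taken to be order-sensitive (SequenceEqual); and the constants 17, 31 and 397 are arbitrary. The hash code is built to be consistent with that Equals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed definition of Route equality: same Origin and same Destination.
public sealed class Route : IEquatable<Route>
{
    public int Origin { get; set; }
    public int Destination { get; set; }

    public bool Equals(Route other) =>
        other != null && Origin == other.Origin && Destination == other.Destination;

    public override bool Equals(object obj) => Equals(obj as Route);

    public override int GetHashCode() => unchecked(Origin * 397) ^ Destination;
}

public sealed class Routing
{
    public List<Route> Paths { get; set; }

    // Two routings are equal when their paths are equal element by element, in order.
    public override bool Equals(object obj)
    {
        var other = obj as Routing;
        if (other == null) return false;
        if (Paths == null || other.Paths == null) return Paths == other.Paths;
        return Paths.SequenceEqual(other.Paths);
    }

    // Order-sensitive combination, matching the order-sensitive Equals above.
    public override int GetHashCode()
    {
        if (Paths == null) return 0;
        int hc = 17;
        foreach (var p in Paths)
            hc = unchecked(hc * 31 + (p == null ? 0 : p.GetHashCode()));
        return hc;
    }
}
```

The earlier warning still applies: if Paths is modified while a Routing sits in a hash-based collection, the stored hash goes stale.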

Why don't I ever have to override GetHashCode when using Dictionaries on personal classes?

It always seems to just "work" without ever having to do anything.
The only thing I can think of is that each class has a hidden sort of static identifier that Object.GetHashCode uses. (also, does anyone know how Object.GetHashCode is implemented? I couldn't find it in the .NET Reflector)
I have never overridden GetHashCode but I was reading around and people say you only need to when overriding Equals and providing custom equality checking to your application so I guess I'm fine?
I'd still like to know how the magic works, though =P
It always seems to just "work" without ever having to do anything.
You didn't tell us if you're using value types or reference types for your keys.
If you're using value types, the default implementations of Equals and GetHashCode are okay (Equals checks whether the fields are equal, and GetHashCode is based on the fields (not necessarily all of them!)). If you're using reference types, the default implementations of Equals and GetHashCode use reference equality, which may or may not be okay; it depends on what you're doing.
The only thing I can think of is that each class has a hidden sort of static identifier that Object.GetHashCode uses.
No. The default is a hash code based on the fields for a value type, and the reference for a reference type.
(also, does anyone know how Object.GetHashCode is implemented? I couldn't find it in the .NET Reflector)
It's an implementation detail that you should never ever need to know, and never ever rely on it. It could change on you at any moment.
I have never overridden GetHashCode but I was reading around and people say you only need to when overriding Equals and providing custom equality checking to your application so I guess I'm fine?
Well, is default equality okay for you? If not, override Equals and GetHashCode or implement IEqualityComparer<T> for your T.
I'd still like to know how the magic works, though =P
Every object has Equals and GetHashCode. The default implementations are as follows:
For value types, Equals is value equality.
For reference types, Equals is reference equality.
For value types, GetHashCode is based on the fields (again, not necessarily all of them!).
For reference types, GetHashCode is based on the reference.
If you use an overload of the Dictionary constructor that doesn't take an IEqualityComparer<T> for your T, it will use EqualityComparer<T>.Default. This IEqualityComparer<T> just uses Equals and GetHashCode. So, if you haven't overridden them, you get the implementations as defined above. If you override Equals and GetHashCode, then those overrides are what EqualityComparer<T>.Default will use.
Otherwise, pass a custom implementation of IEqualityComparer<T> to the constructor for Dictionary.
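A custom comparer passed to the constructor might look like this minimal sketch. Person and PersonNameComparer are hypothetical names, and comparing by Name with ordinal string comparison is just one possible definition of equality.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that does not override Equals or GetHashCode.
public sealed class Person
{
    public string Name { get; set; }
}

// Treats two Person instances as the same key when their names match.
public sealed class PersonNameComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(x.Name, y.Name, StringComparison.Ordinal);
    }

    // Must agree with Equals: equal names produce equal hash codes.
    public int GetHashCode(Person obj)
    {
        if (obj == null || obj.Name == null) return 0;
        return StringComparer.Ordinal.GetHashCode(obj.Name);
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var byName = new Dictionary<Person, int>(new PersonNameComparer());
        byName[new Person { Name = "Joe" }] = 1;
        Console.WriteLine(byName.ContainsKey(new Person { Name = "Joe" }));      // True: matched by Name

        var byReference = new Dictionary<Person, int>();
        byReference[new Person { Name = "Joe" }] = 1;
        Console.WriteLine(byReference.ContainsKey(new Person { Name = "Joe" })); // False: reference equality
    }
}
```

This leaves the key class itself untouched, which is useful when different dictionaries need different notions of equality.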
Are you using your custom classes as keys or values? If you are using them only for values, then their GetHashCode doesn't matter.
If you are using them as keys, then the quality of the hash affects performance. The Dictionary stores a list of elements for each hash code, since the hash codes don't need to be unique. In the worst-case scenario, if all of your keys end up having the same hash code, then the lookup time for the dictionary will be like that of a list, O(n), instead of like that of a hash table, O(1).
The documentation for Object.GetHashCode is quite clear:
The default implementation of the GetHashCode method does not guarantee unique return values for different objects... Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.
Object's implementations of Equals() and GetHashCode() (which you're inheriting) compare by reference.
Object.GetHashCode is implemented in native code; you can see it in the SSCLI (Rotor).
Two different instances of a class will (usually) have different hashcodes, even if their properties are equal.
You only need to override them if you want to compare by value, that is, if you want different instances with the same properties to compare equal.
It really depends on your definition of Equality.
class Person
{
public string Name {get; set;}
}
void Test()
{
var joe1 = new Person {Name="Joe"};
var joe2 = new Person {Name="Joe"};
Assert.AreNotEqual(joe1, joe2);
}
If you have a different definition for equality, you should override Equals & GetHashCode to get the appropriate behavior.
Hash codes are for optimizing lookup performance in hash tables (dictionaries). While hash codes have a goal of colliding as little as possible between instances of objects they are not guaranteed to be unique. The goal should be equal distribution among the int range given a set of typical types of those objects.
The way hash tables work is that each object implements a function to compute a hash code, hopefully distributed as evenly as possible across the int range. Two different objects can produce the same hash code, but an instance of an object, given its data, should always produce the same hash code. Therefore, hash codes are not unique and should not be used for equality. The hash table allocates an array of size n (much smaller than the int range), and when an object is added to the hash table, it calls GetHashCode and takes the result modulo (%) the size of the allocated array. For collisions in the table, typically a list of objects is chained. Since computing hash codes should be very fast, a lookup is fast: jump to the array offset and walk the chain. The larger the array (more memory), the fewer the collisions and the faster the lookup.
Object's GetHashCode cannot possibly produce a good hash code, because by definition it knows nothing about the concrete object that inherits from it. That's why, if you have custom objects that need to be placed in dictionaries and you want to optimize the lookups (control creating an even distribution with minimal collisions), you should override GetHashCode.
If you need to compare two items, then override Equals. If you need the object to be sortable (which is needed for sorted lists), then implement IComparable.
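The mechanism described above (hash, modulo, walk the chain) can be sketched as a toy collection. TinyHashSet is a hypothetical, deliberately simplified class with a fixed number of buckets and no resizing; it is not how Dictionary or HashSet are actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Toy chained hash set with a fixed number of buckets.
public sealed class TinyHashSet<T>
{
    private readonly List<T>[] buckets;

    public TinyHashSet(int bucketCount)
    {
        buckets = new List<T>[bucketCount];
    }

    // Maps a hash code onto an array offset.
    private int BucketIndex(T item)
    {
        int hash = item == null ? 0 : item.GetHashCode();
        // Clear the sign bit so the remainder is never negative.
        return (hash & 0x7FFFFFFF) % buckets.Length;
    }

    // Jumps to the bucket, then walks its chain using Equals.
    public bool Contains(T item)
    {
        List<T> chain = buckets[BucketIndex(item)];
        if (chain == null) return false;
        foreach (T candidate in chain)
        {
            if (EqualityComparer<T>.Default.Equals(candidate, item)) return true;
        }
        return false;
    }

    // Returns false when an equal item is already stored.
    public bool Add(T item)
    {
        if (Contains(item)) return false;
        int index = BucketIndex(item);
        if (buckets[index] == null) buckets[index] = new List<T>();
        buckets[index].Add(item);
        return true;
    }
}

public static class TinyHashSetDemo
{
    public static void Main()
    {
        var set = new TinyHashSet<string>(4);
        Console.WriteLine(set.Add("a"));      // True
        Console.WriteLine(set.Add("a"));      // False: already present
        Console.WriteLine(set.Contains("b")); // False
    }
}
```

With only four buckets, chains grow quickly; real implementations enlarge the array as the item count rises to keep chains short.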
Hope that helps explain the difference.

When does Dictionary<TKey, TValue> call TKey.Equals()?

Just overriding Equals in TKey does not help.
public override bool Equals(object obj)
{ /* ... */ }
... Equals() will never be called ...
When you do a dictionary lookup, this is the order things happen:
The dictionary uses TKey.GetHashCode to compute a hash for the bucket.
It then checks all of the hashes using that bucket, and calls Equals on the individual objects, to determine a match.
If the buckets never match (because GetHashCode wasn't overridden), then you'll never call Equals. This is part of why you should always implement both if you implement either, and why you should override both functions (more meaningfully than just calling base.GetHashCode()) if you want to use your object in a hashed collection.
If you're implementing a class, you should implement a GetHashCode routine that returns the same hash code for items that are Equal. Ideally, you want to return a different hash code for items that are not equal whenever possible, as this will make your dictionary lookups much faster.
You should also implement Equals in a way that checks for equal instances correctly.
The default implementation for classes (reference types) just compares the reference itself. By default, two instances with exactly the same values will return false on Equals (since they have different references). Multiple instances will usually also return different hash codes by default.
Dictionary is a hash table. It only calls Equals(object obj) if two objects produce the same hash value. Provide a good hash function for your objects to avoid unnecessary calls to Equals().
Keep in mind that the hashing part has complexity O(1), while the search within a bucket is O(n) in the worst case, with about n/2 comparisons on average for n colliding entries. You should avoid objects that generate the same hash value; otherwise these objects are searched linearly.
Assuming you've defined a custom reference type as a key, you must either:
always pass the same object instance into a dictionary as a key, or
implement a GetHashCode() that returns the same value for different instances that are equal, and an Equals() method that can compare different instances.
The base.GetHashCode() method builds the hash based on the instance identity of the object, and so cannot be used when you pass in different instances of a type as a key.
The reason that returning 0 for your hash always works, is that the Dictionary class first uses the hash code to look for the bucket your key belongs to, and only then uses the Equals() method to distinguish instances. You should not return 0 as a hash code from a custom type if you intend to use it as a dictionary key because that will effectively degenerate the dictionary into a list with O(n) lookup performance instead of O(1).
You may also want to consider implementing IComparable and IEquatable.
Look at the following question for more details:
Using an object as a generic Dictionary key
