Consider the following objects:
class Route
{
public int Origin { get; set; }
public int Destination { get; set; }
}
Route implements equality operators.
class Routing
{
public List<Route> Paths { get; set; }
}
I used the code below to implement the GetHashCode method for the Routing object, and it seems to work, but I wonder whether that's the right way to do it. I rely on equality checks, and since I'm uncertain I thought I'd ask you. Can I just sum the hash codes, or do I need to do more magic to guarantee the desired effect?
public override int GetHashCode()
{
    return (Paths != null
        ? (Paths.Select(p => p.GetHashCode())
            .Sum())
        : 0);
}
I checked several GetHashCode() questions here as well as MSDN and Eric Lippert's article on this topic but couldn't find what I'm looking for.
I think your solution is fine. (Much later remark: LINQ's Sum method will act in checked context, so you can very easily get an OverflowException which means it is not so fine, after all.) But it is more usual to do XOR (addition without carry). So it could be something like
public override int GetHashCode()
{
int hc = 0;
if (Paths != null)
foreach (var p in Paths)
hc ^= p.GetHashCode();
return hc;
}
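To illustrate the overflow remark, here is a minimal sketch. The class name HashSumDemo and the sample values are made up for illustration. Enumerable.Sum over int adds in a checked context, so two large hash codes are enough to make it throw, whereas an explicitly unchecked Aggregate simply wraps around.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
static class HashSumDemo
{
    // Adds in an unchecked context, so the result wraps around instead of throwing.
    public static int UncheckedSum(int[] hashCodes) =>
        hashCodes.Aggregate(0, (acc, h) => unchecked(acc + h));

    // Returns true if LINQ's Sum throws OverflowException for the given values.
    public static bool SumOverflows(int[] hashCodes)
    {
        try { hashCodes.Sum(); return false; }
        catch (OverflowException) { return true; }
    }

    static void Main()
    {
        int[] hashes = { int.MaxValue, 1 };                      // made-up hash codes
        Console.WriteLine(SumOverflows(hashes));                 // True
        Console.WriteLine(UncheckedSum(hashes) == int.MinValue); // True: wrapped around
    }
}
```

Like XOR, an unchecked sum is independent of element order, so it only fits if equality of Routing ignores the order of Paths.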
Addendum (after answer was accepted):
Remember that if you ever use this type Routing in a Dictionary<Routing, Whatever>, a HashSet<Routing> or another situation where a hash table is used, then your instance will be lost if someone alters (mutates) the Routing after it has been added to the collection.
If you're sure that will never happen, use my code above. Dictionary<,> and so on will still work if you make sure no-one alters the Routing that is referenced.
Another choice is to just write
public override int GetHashCode()
{
return 0;
}
if you believe the hash code will never be used. If every instance returns 0 for its hash code, you will get very bad performance with hash tables, but your object will not be lost. A third option is to throw a NotSupportedException.
The code from Jeppe Stig Nielsen's answer works, but it could lead to a lot of repeating hash code values. Let's say you are hashing a list of ints in the range of 0-100; then your hash code would be guaranteed to be between 0 and 255. This makes for a lot of collisions when used in a Dictionary. Here is an improved version:
public override int GetHashCode()
{
int hc = 0;
if (Paths != null)
foreach (var p in Paths) {
hc ^= p.GetHashCode();
hc = (int)(((uint)hc << 7) | ((uint)hc >> (32 - 7))); // rotate hc left by 7 bits; the uint casts avoid sign extension on the right shift
}
return hc;
}
This code will at least involve all bits over time as more and more items are hashed in.
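The difference between the two variants can be checked with a small sketch. RotateDemo and its method names are made up for illustration, and it hashes plain ints rather than Route objects. The rotation is written with uint casts so that the right shift is logical rather than sign-extending.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
static class RotateDemo
{
    // Rotates the 32 bits of value left by count positions (0 < count < 32).
    public static int RotateLeft(int value, int count) =>
        (int)(((uint)value << count) | ((uint)value >> (32 - count)));

    // Plain XOR of all hash codes.
    public static int XorOnly(IEnumerable<int> items)
    {
        int hc = 0;
        foreach (int item in items) hc ^= item.GetHashCode();
        return hc;
    }

    // XOR followed by a 7-bit rotation after each item.
    public static int XorAndRotate(IEnumerable<int> items)
    {
        int hc = 0;
        foreach (int item in items)
        {
            hc ^= item.GetHashCode();
            hc = RotateLeft(hc, 7);
        }
        return hc;
    }

    static void Main()
    {
        Console.WriteLine(XorOnly(new[] { 1, 2 }) == XorOnly(new[] { 2, 1 }));           // True
        Console.WriteLine(XorAndRotate(new[] { 1, 2 }) == XorAndRotate(new[] { 2, 1 })); // False
    }
}
```

Note that the rotating version makes the hash depend on element order, so it is only appropriate when Equals also compares the lists in order.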
As a guideline, the hash of an object must be the same over the object's entire lifetime. I would leave the GetHashCode function alone and not override it. The hash code is only used if you want to put your objects in a hash table.
You should read Eric Lippert's great article about hash codes in .NET: Guidelines and rules for GetHashCode.
Quoted from that article:
Guideline: the integer returned by GetHashCode should never change
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
The GetHashCode function you implemented will not return the same hash code over the lifetime of the object. If you use this function, you will run into trouble if you add those objects to a hash table: the Contains method will not work.
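The failure can be reproduced in a few lines. This is a minimal sketch with a made-up MutableKey type standing in for Routing; its hash code is derived from mutable content, just like a hash computed from Paths.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable key whose hash code depends on its content.
sealed class MutableKey
{
    public int Value { get; set; }

    public override bool Equals(object obj) =>
        obj is MutableKey other && other.Value == Value;

    public override int GetHashCode() => Value;
}

static class MutationDemo
{
    // Adds a key to a set, mutates it, and reports whether the set still finds it.
    public static bool FoundAfterMutation()
    {
        var key = new MutableKey { Value = 1 };
        var set = new HashSet<MutableKey> { key };
        key.Value = 2;              // the hash code changes while the key is in the set
        return set.Contains(key);   // looked up under the new hash code
    }

    static void Main() => Console.WriteLine(FoundAfterMutation()); // False
}
```

The object is still stored in the set, but it was filed under its old hash code, so a lookup with the new one misses it.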
I don't think it's the right way to do it, because the final hash code should distinguish the specified object as well as possible. In your case you use Sum(), which can produce the same result for collections that contain different hash codes (in the end, hash codes are just integers).
If your intention is to determine equality based on the content of the collection, then just compare the collections of the two objects directly. That could be a time-consuming operation, by the way.
I am trying to get a better understanding of how hashtables and dictionaries work in C# from a complexity perspective (but I guess the language is not an important factor, that's probably just a theoretical question).
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
However, let's look at that code:
public class Foo {
public Foo() { }
public override int GetHashCode() {
return 5; //arbitrary value, purposely a constant
}
}
static void Main(string[] args) {
Dictionary<Foo, int> test = new Dictionary<Foo,int>();
Foo a = new Foo();
Foo b = new Foo();
test.Add(a, 5);
test.Add(b, 6); //1. no exception raised, even though GetHashCode() returns the same hash
test.Add(a, 10); //2. exception raised
}
I understand that behind the scenes there is a hash collision at 1., and there's probably separate chaining to handle it.
However, at 2. the argument exception is raised. That means that internally the Dictionary keeps a track of each key inserted after having determined its hash. That also means that each time we add an entry to our dictionary, it checks if the key has not already been inserted using the equals method.
My question is, why is it considered to be O(1) complexity when it seems like it should be O(n) if it checks the already inserted keys?
But it doesn't have to check all of the keys. It only has to check keys that hash to the same value. And, as you say, a good hash code will minimize the number of hash collisions, so on average it doesn't have to make any key comparisons at all.
Remember, the rules for GetHashCode say that if a.GetHashCode() != b.GetHashCode(), then a != b. But if a.GetHashCode() == b.GetHashCode(), a might be equal to b.
Also, you say:
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
That's not entirely true. That's the ideal, assuming a perfect hash function that will give a unique number for every key. But the perfect hash function doesn't exist, in the general case, so typically you'll see O(1) (or very close to it) performance until Count exceeds some fairly large percentage of the capacity: say 85% or 90%.
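To see how much work a bad hash code causes, one can count Equals calls. The sketch below is illustrative only: CountingKey and CollisionDemo are made-up names, and the exact counts depend on the Dictionary implementation. With a constant hash code, every Add has to compare the new key against the keys already stored; with distinct hash codes, keys with different hashes are skipped without calling Equals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key that counts Equals calls. ConstantHash switches between a
// constant hash code and one derived from Id.
sealed class CountingKey
{
    public static int EqualsCalls;
    public static bool ConstantHash;

    public int Id { get; }
    public CountingKey(int id) { Id = id; }

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        return obj is CountingKey other && other.Id == Id;
    }

    public override int GetHashCode() => ConstantHash ? 5 : Id;
}

static class CollisionDemo
{
    // Adds n distinct keys to a dictionary and returns how often Equals was called.
    public static int CountEqualsCalls(int n, bool constantHash)
    {
        CountingKey.ConstantHash = constantHash;
        CountingKey.EqualsCalls = 0;
        var dict = new Dictionary<CountingKey, int>();
        for (int i = 0; i < n; i++) dict.Add(new CountingKey(i), i);
        return CountingKey.EqualsCalls;
    }

    static void Main()
    {
        Console.WriteLine(CountEqualsCalls(1000, constantHash: false)); // few or no calls
        Console.WriteLine(CountEqualsCalls(1000, constantHash: true));  // grows roughly with n squared
    }
}
```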
The answer has a simple part and a difficult part.
Simple part: it is because (you can check this yourself)
a.Equals(b) == false
If you want an exception when adding "b", just implement the Equals method as well.
Now the difficult part:
The default object implementation of Equals calls RuntimeHelpers.Equals. The source of RuntimeHelpers is here. Unfortunately, the method is extern:
[System.Security.SecuritySafeCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
public new static extern bool Equals(Object o1, Object o2);
I do not know exactly how this method is implemented, but I think it is based on pointers (that is, addresses in memory).
While examining a project (my company's code), I came across this:
public override int GetHashCode()
{
unchecked
{
int result = 17;
result = result * 23 + ((connection != null) ? this.connection.GetHashCode() : 0);
return result;
}
}
Actually, this is the first time I have seen GetHashCode(). I read a little about it, but I can't understand why they used it in this line of code, and why for connection.
Is there a special reason? What is the logic behind using GetHashCode for connection?
Thanks.
GetHashCode should be overridden if you know a good way to generate fast, well-distributed hash keys for an object.
This is useful when you want to store the object in dictionaries and such. Better hashing means better efficiency in those data structures.
Otherwise the base GetHashCode() is used, which is suboptimal in a lot of cases. For some details on using the GetHashCode override combined with an Equals override, see:
http://msdn.microsoft.com/en-us/library/ms173147(v=vs.80).aspx
GetHashCode can be used to quickly identify whether two instances appear to be "the same" (different hash code = different instance; same hash code could mean same instance).
Apparently that "connection" field is important in deciding the identity of your object.
Firstly, I'm using the GetHashCode algorithm described here. Now, picture the following (contrived) example:
class Foo
{
public Foo(int intValue, double doubleValue)
{
this.IntValue = intValue;
this.DoubleValue = doubleValue;
}
public int IntValue { get; private set; }
public double DoubleValue { get; private set; }
public override int GetHashCode()
{
unchecked
{
int hash = 17;
hash = hash * 23 + IntValue.GetHashCode();
hash = hash * 23 + DoubleValue.GetHashCode();
return hash;
}
}
}
class DerivedFoo : Foo
{
public DerivedFoo(int intValue, double doubleValue)
: base(intValue, doubleValue)
{
}
}
If I have a Foo and a DerivedFoo with the same values for each of the properties, they're going to have the same hash code. That means I could have a HashSet<Foo> or use the Distinct method in LINQ, and the two instances would be treated as if they were the same.
I'm probably just misunderstanding the use of GetHashCode, but I would expect these two instances to have different hash codes. Is this an invalid expectation, or should GetHashCode use the type in the calculation? (Or should DerivedFoo also override GetHashCode?)
P.S. I realize there are many, many questions on SO relating to this topic, but I haven't spotted one that directly answers this question.
GetHashCode() is not supposed to guarantee uniqueness (though it helps for performance if as unique as possible).
The main rule with GetHashCode() is that equivalent objects must have the same hash code, but that doesn't mean non-equivalent objects can't have the same hash code.
If two objects have the same hash code, the Equals() method is then invoked to see whether they are the same. Since the types are different, they will not be equal (depending on how you coded your Equals override, of course), and thus it will be fine.
Even if you had a different hash code algorithm for each type, there's still always a chance of a collision, thus the need for the Equals() check as well.
Now, given your example above, you do not implement Equals(). This makes every object distinct regardless of the hash code, because the default implementation of Equals() from object is a reference equality check.
If you haven't, go ahead and override Equals() for each of your types as well (they can inherit your implementation of GetHashCode() if you like, or have new ones), and there you can make sure that the compared object has the same type before declaring the two equal. And make sure Equals() and GetHashCode() are always implemented so that:
Objects that are Equals() must have same GetHashCode() results.
Objects with different GetHashCode() must not be Equals().
The two instances do not need to have different hash codes. HashSet and other framework classes do not assume that the results of GetHashCode are unique, because there can be collisions even within a type. GetHashCode is simply used to determine the location within the hash table at which to store the item. If there is a collision within the HashSet, it falls back on the result of the Equals method to determine the unique match. This means that whenever you implement GetHashCode, you should also implement Equals (and check that the types match). Similarly, whenever you implement Equals, you should also implement GetHashCode. See a good explanation by Eric Lippert here.
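As a minimal sketch of the advice above, the question's Foo can be given an Equals override that requires an exact type match. The Equals body and the TypeCheckDemo class are additions for illustration; the hash algorithm is the one from the question. Both instances produce the same hash code, yet a HashSet<Foo> keeps them apart because Equals returns false.

```csharp
using System;
using System.Collections.Generic;

class Foo
{
    public Foo(int intValue, double doubleValue)
    {
        IntValue = intValue;
        DoubleValue = doubleValue;
    }

    public int IntValue { get; }
    public double DoubleValue { get; }

    public override bool Equals(object obj)
    {
        // Exact type match: a DerivedFoo never equals a plain Foo.
        if (obj == null || obj.GetType() != GetType()) return false;
        var other = (Foo)obj;
        return IntValue == other.IntValue && DoubleValue.Equals(other.DoubleValue);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + IntValue.GetHashCode();
            hash = hash * 23 + DoubleValue.GetHashCode();
            return hash;
        }
    }
}

class DerivedFoo : Foo
{
    public DerivedFoo(int intValue, double doubleValue) : base(intValue, doubleValue) { }
}

static class TypeCheckDemo
{
    static void Main()
    {
        var set = new HashSet<Foo> { new Foo(1, 2.0), new DerivedFoo(1, 2.0), new Foo(1, 2.0) };
        Console.WriteLine(set.Count); // 2: equal hash codes, but Equals tells the types apart
    }
}
```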
I override the Equals method for one of my classes. In the method, I check each pair of a dictionary for equality with the pairs in the dictionary of another instance, as in the following:
public override bool Equals (object obj)
{
...
// compare to make sure all <key, value> pair of this.dict have
// the match in obj.dict
...
}
Now I need to override the GetHashCode method as well, as is suggested.
Do I need to do that for all the keys of the dictionary, or keys plus values?
Basically, would the following be good or overkill?
public override int GetHashCode ()
{
int iHash = 0;
foreach (KeyValuePair<string, T> pair in this.dict)
{
iHash ^= pair.Key.GetHashCode();
iHash ^= pair.Value.GetHashCode();
}
return iHash;
}
Going along with what @Mitch Wheat linked to, that's not the best way to do a GetHashCode() if you use this class with a Dictionary or HashSet.
Imagine your internal Dictionary had only one entry. Your hash is now the value of that single KeyValuePair. You stick the whole class in a HashSet. You add another item to your internal Dictionary. Now the hashcode of your class has changed, because you're iterating over two items in your class.
When you call HashSet.Contains(obj), it calls obj.GetHashCode(), which has now changed, even though it's the same class instance. HashSet.Contains() will find that it doesn't contain this new hash and return false, never calling Equals (which would return true if the references are the same).
Suddenly it's as if your object has disappeared from the HashSet, even though the instance is in there, with an outdated hash.
You really don't want your hash to change. It's okay to have collisions in your GetHashCode, because if it collides, it will call the (slower) .Equals() method. It's a handy optimization, that, if improperly implemented, can cause some headaches along the way.
As a side note, as pointed out in the link above, it's a good idea to multiply your hash by a prime number before XORing it with another value. This helps keep the hashes well distributed.
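One way to apply the prime-multiplication idea to a dictionary is sketched below. DictionaryHash and its methods are made-up names, and the constants 17 and 23 are just the commonly used ones. Each key/value pair is combined with multiplication, and the pairs are then XORed together so that the result does not depend on enumeration order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
static class DictionaryHash
{
    // Combines key and value by multiplying with a prime, so that
    // (key, value) and (value, key) generally hash differently.
    static int PairHash<TKey, TValue>(KeyValuePair<TKey, TValue> pair)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + (pair.Key == null ? 0 : pair.Key.GetHashCode());
            hash = hash * 23 + (pair.Value == null ? 0 : pair.Value.GetHashCode());
            return hash;
        }
    }

    // XOR across pairs keeps the result independent of enumeration order.
    public static int Compute<TKey, TValue>(IDictionary<TKey, TValue> dict)
    {
        int hash = 0;
        foreach (var pair in dict) hash ^= PairHash(pair);
        return hash;
    }

    static void Main()
    {
        var a = new Dictionary<int, int> { { 1, 2 } };
        var b = new Dictionary<int, int> { { 2, 1 } };
        // With plain XOR of key and value both would hash to 1 ^ 2 = 3.
        Console.WriteLine(Compute(a) == Compute(b)); // False
    }
}
```

The hash still changes whenever the dictionary is modified, so the warning above about objects stored in a HashSet applies unchanged.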
Do you plan to use the object in a HashSet? You really only need to implement GetHashCode if the object is going to be used in such a way that it requires it be uniquely identifiable by its hash. It is good practice to always implement GetHashCode taking into account the same fields that are used in equality, but not always necessary.
If it is necessary in your case, I believe you have the right idea.
Is it ok to call GetHashCode as a method to test equality from inside the Equals override?
For example, is this code acceptable?
public class Class1
{
public string A
{
get;
set;
}
public string B
{
get;
set;
}
public override bool Equals(object obj)
{
Class1 other = obj as Class1;
return other != null && other.GetHashCode() == this.GetHashCode();
}
public override int GetHashCode()
{
int result = 0;
result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
return result;
}
}
The others are right; your equality operation is broken. To illustrate:
public static void Main()
{
var c1 = new Class1() { A = "apahaa", B = null };
var c2 = new Class1() { A = "abacaz", B = null };
Console.WriteLine(c1.Equals(c2));
}
I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.
Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.
In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". For my article on why you shouldn't rely upon hash codes for equality, click here.
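The square-root rule of thumb can be checked numerically. The sketch below uses the standard approximation 1 - exp(-k(k-1)/(2n)) and assumes hash codes are uniformly distributed over all 2^32 values, which real GetHashCode implementations only approximate. BirthdayBound is a made-up name.

```csharp
using System;

// Hypothetical helper, for illustration only.
static class BirthdayBound
{
    // Approximate probability of at least one collision among k values drawn
    // uniformly from n possibilities: 1 - exp(-k(k-1) / (2n)).
    public static double CollisionProbability(double k, double n) =>
        1.0 - Math.Exp(-k * (k - 1.0) / (2.0 * n));

    static void Main()
    {
        double n = 4294967296.0; // 2^32 possible hash codes
        foreach (double k in new[] { 1000.0, 65536.0, 77163.0, 200000.0 })
            Console.WriteLine($"{k}: {CollisionProbability(k, n):P1}");
    }
}
```

Under that assumption the collision probability is already about 39% at 65,536 items (the square root of 2^32) and passes 50% at roughly 77,000 items.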
No, it is not ok, because it's not
equality <=> hashcode equality.
It's just
equality => hashcode equality.
or in the other direction:
hashcode inequality => inequality.
Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
I would say, unless you want for Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different but share the same hash code. The probability may be small, but it isn't zero.
No, this is not an acceptable way to test for equality. It is very possible for two non-equal values to have the same hash code. This would cause your implementation of Equals to return true when it should return false.
You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.
If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.
For example:
public override bool Equals(object obj)
{
    Class1 other = obj as Class1;
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // The hash codes are the same, so a full comparison is still required.
    return A == other.A && B == other.B;
}
You cannot say that just because the hash codes are equal then the objects must be equal.
The only time you would call GetHashCode inside of Equals is when it is much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you could quickly verify that the objects were not equal.
So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval, it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals override, causing the C# compiler to warn me that I was overriding GetHashCode without overriding Equals.
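The cached-hash shortcut can be sketched as follows. Snapshot is a made-up immutable wrapper around a byte buffer, not the author's screenshot code, and unlike the author's application it still falls back to a full comparison when the hashes match.

```csharp
using System;
using System.Linq;

// Hypothetical immutable wrapper around a large buffer, e.g. screenshot pixels.
sealed class Snapshot
{
    private readonly byte[] _pixels;
    private readonly int _hash; // computed once in the constructor

    public Snapshot(byte[] pixels)
    {
        _pixels = (byte[])pixels.Clone(); // defensive copy keeps the cached hash valid
        unchecked
        {
            int hash = 17;
            foreach (byte b in _pixels) hash = hash * 31 + b;
            _hash = hash;
        }
    }

    public override int GetHashCode() => _hash;

    public override bool Equals(object obj)
    {
        var other = obj as Snapshot;
        if (other == null) return false;
        if (_hash != other._hash) return false;       // cheap rejection
        return _pixels.SequenceEqual(other._pixels);  // full comparison still required
    }
}

static class SnapshotDemo
{
    static void Main()
    {
        var a = new Snapshot(new byte[] { 1, 2, 3 });
        var b = new Snapshot(new byte[] { 1, 2, 3 });
        var c = new Snapshot(new byte[] { 3, 2, 1 });
        Console.WriteLine(a.Equals(b)); // True
        Console.WriteLine(a.Equals(c)); // False
    }
}
```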
There is one case where using hashcodes as a short-cut on equality comparisons makes sense.
Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).
There are various approaches one can take, but in all of them you have a small number of slots that the hashed values can be placed in, and we take either the open or the closed approach (just for fun, different people use these two terms with opposite meanings): if two different objects collide on the same slot, we can either store them in the same slot (with a linked list or similar holding the objects themselves) or re-probe to pick a different slot (there are various strategies for this).
Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).
Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).
Now, when we come to insert, our logic is going to be something like:
1. Get hash code for object.
2. Get slot for object.
3. If slot is empty, place object in it and return.
4. If slot contains equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
5. Try next slot according to collision strategy, and return to item 3 (perhaps resizing if we loop this too often).
So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:
private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
return ReferenceEquals(newItem, storedItem) // fast, false negatives, no false positives (only applicable to reference types)
||
(
newHash == oldHash // fast, false positives, no false negatives
&&
_cmp.Equals(newItem, storedItem) // slow for some types, but always correct result.
);
}
Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type were int then this would be a total waste. If our key type were string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length) then the saving could well be worth it.
Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.
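The insert steps above can be sketched as a toy set with linear probing. Everything here (TinySet, its fields, the 50% load factor) is a hypothetical illustration and not how HashSet<T> is actually implemented. It only shows the stored hash being used as a cheap filter before Equals and being reused on resize.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, illustration-only set that memoises each item's hash code.
sealed class TinySet<T>
{
    private T[] _items = new T[8];
    private int[] _hashes = new int[8];   // memoised hash code per slot
    private bool[] _used = new bool[8];
    private int _count;
    private readonly IEqualityComparer<T> _cmp = EqualityComparer<T>.Default;

    public int Count => _count;

    public bool Add(T item)
    {
        if ((_count + 1) * 2 > _items.Length) Resize();        // keep at least half the slots free
        int hash = item == null ? 0 : _cmp.GetHashCode(item);  // step 1
        return Insert(item, hash);
    }

    public bool Contains(T item)
    {
        int hash = item == null ? 0 : _cmp.GetHashCode(item);
        int slot = (hash & int.MaxValue) % _items.Length;
        while (_used[slot])
        {
            if (IsMatch(item, _items[slot], hash, _hashes[slot])) return true;
            slot = (slot + 1) % _items.Length;
        }
        return false;
    }

    // Compare the stored hash first; call Equals only when the hashes agree.
    private bool IsMatch(T newItem, T storedItem, int newHash, int oldHash) =>
        newHash == oldHash && _cmp.Equals(newItem, storedItem);

    private bool Insert(T item, int hash)
    {
        int slot = (hash & int.MaxValue) % _items.Length;      // step 2
        while (_used[slot])
        {
            if (IsMatch(item, _items[slot], hash, _hashes[slot])) return false; // step 4
            slot = (slot + 1) % _items.Length;                 // step 5: linear probing
        }
        _items[slot] = item;                                   // step 3: empty slot found
        _hashes[slot] = hash;
        _used[slot] = true;
        _count++;
        return true;
    }

    private void Resize()
    {
        T[] oldItems = _items; int[] oldHashes = _hashes; bool[] oldUsed = _used;
        int newSize = oldItems.Length * 2;
        _items = new T[newSize]; _hashes = new int[newSize]; _used = new bool[newSize];
        _count = 0;
        for (int i = 0; i < oldItems.Length; i++)
            if (oldUsed[i]) Insert(oldItems[i], oldHashes[i]); // reuse stored hash, no GetHashCode call
    }
}

static class TinySetDemo
{
    static void Main()
    {
        var set = new TinySet<string>();
        Console.WriteLine(set.Add("a"));      // True
        Console.WriteLine(set.Add("a"));      // False: already present
        Console.WriteLine(set.Contains("b")); // False
    }
}
```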
It's a wrong implementation, as others have explained.
You should short-circuit the equality check using GetHashCode, like:
if (other.GetHashCode() != this.GetHashCode())
    return false;
in the Equals method only if you're certain that the ensuing Equals implementation is much more expensive than GetHashCode, which is not so in the vast majority of cases.
In the implementation you have shown (which represents 99% of cases) it's not only broken, it's also much slower. The reason? Computing the hash of your properties will almost certainly be slower than comparing them, so you're not even gaining in performance terms. The advantage of implementing a proper GetHashCode shows when your class is the key type of a hash table, where the hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if the object is in a collection. Even though GetHashCode itself should be fast, it is usually not faster than the equivalent Equals.
To benchmark, run your Equals (a proper implementation, not the current hash-based one) and GetHashCode here:
var watch = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
action(); //Equals and GetHashCode called here to test for performance.
}
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);