Using GetHashCode to test equality in Equals override

Is it ok to call GetHashCode as a method to test equality from inside the Equals override?
For example, is this code acceptable?
public class Class1
{
    public string A
    {
        get;
        set;
    }
    public string B
    {
        get;
        set;
    }
    public override bool Equals(object obj)
    {
        Class1 other = obj as Class1;
        return other != null && other.GetHashCode() == this.GetHashCode();
    }
    public override int GetHashCode()
    {
        int result = 0;
        result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
        result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
        return result;
    }
}

The others are right; your equality operation is broken. To illustrate:
public static void Main()
{
    var c1 = new Class1() { A = "apahaa", B = null };
    var c2 = new Class1() { A = "abacaz", B = null };
    Console.WriteLine(c1.Equals(c2));
}
I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.
Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.
In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". I have also written an article on why you shouldn't rely upon hash codes for equality.
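The square-root rule can be checked numerically. The following is only a rough sketch, not part of the original answer: it uses the standard approximation p ≈ 1 - exp(-k(k-1)/(2n)) for the chance that k uniformly distributed hash codes, drawn from n possible values, contain at least one collision. The class and method names are made up for illustration, and real hash codes are not perfectly uniform.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class BirthdayEstimate
{
    // Approximate probability that 'items' uniformly random hash codes,
    // drawn from 'possibleHashes' values, contain at least one collision.
    public static double CollisionProbability(double items, double possibleHashes)
    {
        return 1.0 - Math.Exp(-items * (items - 1.0) / (2.0 * possibleHashes));
    }
}
```

With n = 2^32, BirthdayEstimate.CollisionProbability(65536, 4294967296.0) comes out at roughly 0.39, and the estimate passes 0.5 at around 77,000 items, which is tiny compared to four billion.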

No, it is not ok, because it's not
equality <=> hashcode equality.
It's just
equality => hashcode equality.
or in the other direction:
hashcode inequality => inequality.
Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

I would say, unless you want for Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different but share the same hash code. The probability may be small, but it isn't zero.

No, this is not an acceptable way to test for equality. It is very possible for two non-equal values to have the same hash code. This would cause your implementation of Equals to return true when it should return false.
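A conventional alternative, sketched here under the assumption that two instances should be equal exactly when both A and B are equal, compares the properties directly and leaves GetHashCode to the hash tables. The class name Class1Fixed is hypothetical, and the multiply-and-XOR mixing is one common choice rather than the only valid one.

```csharp
using System;

// Hypothetical corrected version of Class1, for illustration only.
public class Class1Fixed
{
    public string A { get; set; }
    public string B { get; set; }

    public override bool Equals(object obj)
    {
        Class1Fixed other = obj as Class1Fixed;
        // Compare the actual values; equal hash codes alone prove nothing.
        return other != null
            && string.Equals(A, other.A)
            && string.Equals(B, other.B);
    }

    public override int GetHashCode()
    {
        unchecked // overflow is expected and harmless when mixing hash codes
        {
            int result = A == null ? 0 : A.GetHashCode();
            result = (result * 397) ^ (B == null ? 0 : B.GetHashCode());
            return result;
        }
    }
}
```

Equal objects still produce equal hash codes, as the contract requires, but two objects that merely collide on the hash code are no longer reported as equal.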

You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.
If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.
For example:
public override bool Equals(object obj)
{
    Class1 other = obj as Class1;
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // The hash codes are the same, so you have to do a full object compare.
    return A == other.A && B == other.B;
}

You cannot say that just because the hash codes are equal then the objects must be equal.
The only time you would call GetHashCode inside of Equals is when it is much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you can quickly verify that the objects are not equal.
So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval, it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals override, causing the C# compiler to warn me that I was overriding GetHashCode without overriding Equals.

There is one case where using hashcodes as a short-cut on equality comparisons makes sense.
Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).
There are various approaches one can take, but in all of them you have a limited number of slots that the hashed values can be placed in, and you take either the open or the closed approach (confusingly, some people use those two terms the opposite way round to others). If two different objects collide on the same slot, we either store them both in that slot (with a linked list or similar holding the actual objects) or re-probe to pick a different slot (there are various strategies for this).
Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).
Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).
Now, when we come to insert, our logic is going to be something like:
1. Get hash code for object.
2. Get slot for object.
3. If slot is empty, place object in it and return.
4. If slot contains equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
5. Try next slot according to collision strategy, and return to item 3 (perhaps resizing if we loop this too often).
So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:
private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
    return ReferenceEquals(newItem, storedItem) // fast, false negatives, no false positives (only applicable to reference types)
        ||
        (
            newHash == oldHash // fast, false positives, no false negatives
            &&
            _cmp.Equals(newItem, storedItem) // slow for some types, but always correct result.
        );
}
Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type was int then this would be a total waste. If our key type were string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length) then the saving could well be worth it.
Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.
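To make the idea concrete, here is a minimal sketch of an open-addressing set that stores each item's hash code next to the item. Everything in it (the name TinyHashSet, linear probing, doubling at 50% load) is an assumption chosen for brevity; it is not how HashSet<T> is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal open-addressing set, for illustration only.
public class TinyHashSet<T>
{
    private T[] _items = new T[8];
    private int[] _hashes = new int[8];   // memoised hash code per slot
    private bool[] _used = new bool[8];
    private int _count;
    private readonly IEqualityComparer<T> _cmp;

    public TinyHashSet(IEqualityComparer<T> cmp = null)
    {
        _cmp = cmp ?? EqualityComparer<T>.Default;
    }

    public int Count { get { return _count; } }

    // Returns true if the item was added, false if an equal item was already present.
    public bool Add(T item)
    {
        if ((_count + 1) * 2 > _items.Length) Resize();   // keep load at or below 50%
        int hash = _cmp.GetHashCode(item);
        if (!Insert(_items, _hashes, _used, item, hash, true)) return false;
        _count++;
        return true;
    }

    private bool Insert(T[] items, int[] hashes, bool[] used, T item, int hash, bool checkEqual)
    {
        int mask = items.Length - 1;      // length is always a power of two
        int slot = hash & mask;
        while (used[slot])
        {
            // Cheap hash comparison first; the full Equals runs only on a hash match.
            if (checkEqual && hashes[slot] == hash && _cmp.Equals(items[slot], item))
                return false;
            slot = (slot + 1) & mask;     // linear probing
        }
        items[slot] = item;
        hashes[slot] = hash;
        used[slot] = true;
        return true;
    }

    private void Resize()
    {
        var items = new T[_items.Length * 2];
        var hashes = new int[items.Length];
        var used = new bool[items.Length];
        for (int i = 0; i < _items.Length; i++)
            if (_used[i])
                // Re-insert using the stored hash; GetHashCode is not called again.
                Insert(items, hashes, used, _items[i], _hashes[i], false);
        _items = items;
        _hashes = hashes;
        _used = used;
    }
}
```

The stored hash does double duty here: it lets Resize re-insert items without rehashing them, and it lets Add skip the potentially expensive Equals call for most occupied slots it probes.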

It's a wrong implementation, as others have explained.
You should short-circuit the equality check using GetHashCode like:
if (other.GetHashCode() != this.GetHashCode())
    return false;
in the Equals method only if you're certain the ensuing Equals implementation is much more expensive than GetHashCode, which is not so in the vast majority of cases.
In the implementation you have shown (which is representative of 99% of cases) it's not only broken, it's also much slower. And the reason? Computing the hash of your properties would almost certainly be slower than comparing them, so you're not even gaining in performance terms. The advantage of implementing a proper GetHashCode is that your class can be the key type for hash tables, where the hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if the object is in a collection. Even though GetHashCode itself should be fast, it is usually not faster than the equivalent Equals.
To benchmark, run your Equals (a proper implementation, not the current hash-based one) and GetHashCode in a loop such as this:
var watch = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    action(); //Equals and GetHashCode called here to test for performance.
}
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);

Related

Is there any negative consequence in having Equals based on GetHashCode?

Is the following code OK?
public override bool Equals(object obj)
{
    if (obj == null || !(obj is LicenseType))
        return false;
    return GetHashCode() == obj.GetHashCode();
}
public override int GetHashCode()
{
    return
        Vendor.GetHashCode() ^
        Version.GetHashCode() ^
        Modifiers.GetHashCode() ^
        Locale.GetHashCode();
}
All the properties are enums/numeric fields, and are the only properties that define the LicenseType objects.

No, the documentation states very clearly:
You should not assume that equal hash codes imply object equality.
Also:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality
And:
Caution:
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.

What happens when two different objects are returning the same HashCodes?
It is, after all, just a hash, and so may not be distinct over the full range of values the objects can have.
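For an XOR-based hash like the one in the question, a collision is easy to construct. The sketch below uses a hypothetical stand-in class, LicenseKey, with plain int fields (the real LicenseType and its field types are not shown in the question), together with a field-by-field Equals.

```csharp
using System;

// Hypothetical stand-in for LicenseType, with plain int fields.
public class LicenseKey
{
    public int Vendor, Version, Modifiers, Locale;

    public override int GetHashCode()
    {
        return Vendor.GetHashCode() ^ Version.GetHashCode()
             ^ Modifiers.GetHashCode() ^ Locale.GetHashCode();
    }

    // Field-by-field comparison: the conventional way to define equality.
    public override bool Equals(object obj)
    {
        LicenseKey other = obj as LicenseKey;
        return other != null
            && Vendor == other.Vendor && Version == other.Version
            && Modifiers == other.Modifiers && Locale == other.Locale;
    }
}
```

Because XOR is commutative, new LicenseKey { Vendor = 1, Version = 2 } and new LicenseKey { Vendor = 2, Version = 1 } both hash to 3, so a hash-based Equals would call them equal. The field-by-field version correctly reports that they differ.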

It is ok (no negative consequences) only if GetHashCode is unique for each possible value. To give an example, the GetHashCode of a short (a 16-bit value) is always unique (let's hope so :-) ), so basing Equals on GetHashCode is ok.
Another example: for int, GetHashCode() is the value of the integer, so we have that ((int)value).GetHashCode() == ((int)value). Note that this isn't true, for example, for short (but the hash codes of a short are still unique; they simply use a more complex formula).
Note that what Patrick wrote is wrong, because that is true for the "user" of an object/class. You are the "writer" of the object/class, so you define the concept of equality and the concept of hash code. If you define that two objects are always equal, whatever their value is, then it's ok.
public override int GetHashCode() { return 1; }
public override bool Equals(object obj) { return true; }
The only important rules for Equals are:
Implementations are required to ensure that if the Equals method returns true for two objects x and y, then the value returned by the GetHashCode method for x must equal the value returned for y.
The Equals method is reflexive, symmetric, and transitive...
Clearly your Equals() and GetHashCode() are ok with these rules, so they are ok.
Just out of curiosity, there is at least one exception for the equality operator (==) (normally you define the equality operator based on the Equals method):
bool v1 = double.NaN.Equals(double.NaN); // true
bool v2 = double.NaN == double.NaN; // false
This is because the NaN value is defined in the IEEE 754 standard as being different from all values, NaN included. For practical reasons, Equals returns true.

It must be noted that it is NOT a rule that if two objects have the same hash code, then they must be equal.
There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code, by the Pigeonhole Principle.
Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.
This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.
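If a hash value really does have to be persisted, one option is to compute it with an algorithm you control instead of GetHashCode. The sketch below uses 32-bit FNV-1a over the UTF-8 bytes of the string. This is a technique swapped in for illustration, not something the answer above prescribes, and the name StableHash is hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StableHash
{
    // 32-bit FNV-1a over the UTF-8 encoding of the text. The result depends
    // only on the bytes, not on the runtime version or the process.
    public static uint Fnv1a(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            unchecked { hash *= prime; }   // wrap-around is part of the algorithm
        }
        return hash;
    }
}
```

Like any 32-bit hash it still collides, so it identifies a bucket, not a value; equality must still be checked on the data itself.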

How can the method Dictionary.Add be O(1) amortized?

I am trying to get a better understanding of how hashtables and dictionaries work in C# from a complexity perspective (but I guess the language is not an important factor, that's probably just a theoretical question).
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
However, let's look at that code:
public class Foo {
    public Foo() { }
    public override int GetHashCode() {
        return 5; //arbitrary value, purposely a constant
    }
}
static void Main(string[] args) {
    Dictionary<Foo, int> test = new Dictionary<Foo, int>();
    Foo a = new Foo();
    Foo b = new Foo();
    test.Add(a, 5);
    test.Add(b, 6); //1. no exception raised, even though GetHashCode() returns the same hash
    test.Add(a, 10); //2. exception raised
}
I understand that behind the scenes there is a collision of hashes at 1. and there's probably separate chaining to handle it.
However, at 2. the argument exception is raised. That means that internally the Dictionary keeps track of each key inserted after having determined its hash. That also means that each time we add an entry to our dictionary, it checks whether the key has already been inserted, using the Equals method.
My question is, why is it considered to be O(1) complexity when it seems like it should be O(n) if it checks the already inserted keys?

But it doesn't have to check all of the keys. It only has to check keys that hash to the same value. And, as you say, a good hash code will minimize the number of hash collisions, so on average it doesn't have to make any key comparisons at all.
Remember, the rules for GetHashCode say that if a.GetHashCode() != b.GetHashCode(), then a != b. But if a.GetHashCode() == b.GetHashCode(), a might be equal to b.
Also, you say:
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
That's not entirely true. That's the ideal, assuming a perfect hash function that will give a unique number for every key. But the perfect hash function doesn't exist, in the general case, so typically you'll see O(1) (or very close to it) performance until Count exceeds some fairly large percentage of the capacity: say 85% or 90%.
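One way to see the difference is to count how often the dictionary actually calls Equals. The comparer below is a hypothetical instrumented IEqualityComparer<int>, written only for this illustration; it can either hash each key to itself or force every key onto the constant 5, as in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that counts Equals calls and can force every key
// onto the same hash code.
public class CountingComparer : IEqualityComparer<int>
{
    private readonly bool _constantHash;
    public int EqualsCalls;

    public CountingComparer(bool constantHash) { _constantHash = constantHash; }

    public bool Equals(int x, int y) { EqualsCalls++; return x == y; }
    public int GetHashCode(int x) { return _constantHash ? 5 : x; }

    // Adds 'n' distinct keys and reports how many key comparisons were made.
    public static int ComparisonsForAdds(int n, bool constantHash)
    {
        var cmp = new CountingComparer(constantHash);
        var dict = new Dictionary<int, int>(cmp);
        for (int i = 0; i < n; i++) dict.Add(i, i);
        return cmp.EqualsCalls;
    }
}
```

Comparing ComparisonsForAdds(200, false) with ComparisonsForAdds(200, true) should show few or no comparisons for the well-distributed hash and a count that grows roughly with the square of n for the constant one. The exact numbers depend on the Dictionary implementation.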

The answer is simple and difficult.
Simple part: It is because (you can check it by yourself)
a.Equals(b) == false
If you want an exception when adding "b", just implement the Equals method as well.
Now the difficult part:
The default Object implementation of Equals calls RuntimeHelpers.Equals. The source of RuntimeHelpers is available, but unfortunately the method is extern:
[System.Security.SecuritySafeCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
public new static extern bool Equals(Object o1, Object o2);
I do not know exactly how this method is implemented, but I think it is based on pointers (that is, addresses in memory).
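The effect of overriding Equals can be sketched with two hypothetical variants of Foo. Both return the same constant hash code, but only the second overrides Equals, here in the simplest possible way, treating every instance as equal to every other.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant 1: Equals is not overridden, so Object.Equals
// compares references.
public class FooByReference
{
    public override int GetHashCode() { return 5; }
}

// Hypothetical variant 2: every FooByValue is equal to every other one.
public class FooByValue
{
    public override int GetHashCode() { return 5; }
    public override bool Equals(object obj) { return obj is FooByValue; }
}

public static class DuplicateKeyDemo
{
    // Returns true if adding two separate instances as keys throws.
    public static bool SecondAddThrows<T>() where T : new()
    {
        var dict = new Dictionary<T, int>();
        dict.Add(new T(), 5);
        try { dict.Add(new T(), 6); return false; }
        catch (ArgumentException) { return true; }
    }
}
```

DuplicateKeyDemo.SecondAddThrows<FooByReference>() returns false because the two instances are different references, while SecondAddThrows<FooByValue>() returns true because the dictionary now sees the second key as a duplicate.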

Implement GetHashCode() for objects that contain collections

Consider the following objects:
class Route
{
    public int Origin { get; set; }
    public int Destination { get; set; }
}
Route implements equality operators.
class Routing
{
    public List<Route> Paths { get; set; }
}
I used the code below to implement GetHashCode method for the Routing object and it seems to work but I wonder if that's the right way to do it? I rely on equality checks and as I'm uncertain I thought I'll ask you guys. Can I just sum the hash codes or do I need to do more magic in order to guarantee the desired effect?
public override int GetHashCode()
{
    return (Paths != null
        ? (Paths.Select(p => p.GetHashCode())
            .Sum())
        : 0);
}
I checked several GetHashCode() questions here as well as MSDN and Eric Lippert's article on this topic but couldn't find what I'm looking for.

I think your solution is fine. (Much later remark: LINQ's Sum method will act in checked context, so you can very easily get an OverflowException which means it is not so fine, after all.) But it is more usual to do XOR (addition without carry). So it could be something like
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths)
            hc ^= p.GetHashCode();
    return hc;
}
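The overflow problem with Sum is easy to reproduce. The helper below is a hypothetical demonstration, not part of the answer: it sums and XORs a plain array of hash codes so the two behaviours can be compared side by side.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class SumOverflowDemo
{
    // Enumerable.Sum over int values uses checked arithmetic, so it throws
    // once the running total leaves the int range.
    public static bool SumThrows(int[] hashCodes)
    {
        try { hashCodes.Sum(); return false; }
        catch (OverflowException) { return true; }
    }

    // XOR can never overflow, so it is safe for any input.
    public static int XorAll(int[] hashCodes)
    {
        int hc = 0;
        foreach (int h in hashCodes) hc ^= h;
        return hc;
    }
}
```

SumOverflowDemo.SumThrows(new[] { int.MaxValue, 1 }) returns true, whereas XorAll handles the same input without complaint. Since hash codes are spread over the whole int range, large values like these are the norm rather than the exception.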
Addendum (after answer was accepted):
Remember that if you ever use this type Routing in a Dictionary<Routing, Whatever>, a HashSet<Routing> or another situation where a hash table is used, then your instance will be lost if someone alters (mutates) the Routing after it has been added to the collection.
If you're sure that will never happen, use my code above. Dictionary<,> and so on will still work if you make sure no-one alters the Routing that is referenced.
Another choice is to just write
public override int GetHashCode()
{
    return 0;
}
if you believe the hash code will never be used. If every instance returns 0 for hash code, you will get very bad performance with hash tables, but your object will not be lost. A third option is to throw a NotSupportedException.

The code from Jeppe Stig Nielsen's answer works but it could lead to a lot of repeating hash code values. Let's say you are hashing a list of ints in the range of 0-100, then your hash code would be guaranteed to be between 0 and 255. This makes for a lot of collisions when used in a Dictionary. Here is an improved version:
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths) {
            hc ^= p.GetHashCode();
            // Rotate hc left by 7 bits so that all bits get used; the uint casts
            // avoid the sign extension that >> performs on a negative int.
            hc = (int)(((uint)hc << 7) | ((uint)hc >> (32 - 7)));
        }
    return hc;
}
This code will at least involve all bits over time as more and more items are hashed in.

As a guideline, the hash of an object must be the same over the object's entire lifetime. I would leave the GetHashCode function alone, and not override it. The hash code is only used if you want to put your objects in a hash table.
You should read Eric Lippert's great article about hash codes in .NET: Guidelines and rules for GetHashCode.
Quoted from that article:
Guideline: the integer returned by GetHashCode should never change
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
The GetHashCode function you implemented will not return the same hash code over the lifetime of the object. If you use this function, you will run into trouble if you add those objects to a hash table: the Contains method will not work.
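The failure is simple to demonstrate. The sketch below uses a hypothetical MutableBox type, much smaller than Routing, whose hash code is simply its mutable Value field.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable type whose hash code depends on its state.
public class MutableBox
{
    public int Value;
    public override int GetHashCode() { return Value; }
    public override bool Equals(object obj)
    {
        MutableBox other = obj as MutableBox;
        return other != null && other.Value == Value;
    }
}

public static class LostInSetDemo
{
    // Adds a box to a set, mutates it, and asks the set whether it still contains it.
    public static bool FoundAfterMutation()
    {
        var box = new MutableBox { Value = 1 };
        var set = new HashSet<MutableBox> { box };
        box.Value = 2;               // hash code changes while the box is in the set
        return set.Contains(box);
    }
}
```

LostInSetDemo.FoundAfterMutation() returns false even though the very same instance is still stored in the set: the lookup uses the new hash code, while the set filed the entry under the old one.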

I don't think it's the right way to do it, because the final hash code should be unique for the specified object. In your case you do a Sum(), which can produce the same result from different hash codes in the collection (in the end, hash codes are just integers).
If your intention is to determine equality based on the content of the collection, then just compare the collections of the two objects. That could be a time-consuming operation, by the way.

How to write a hashcode generator for this class?

I have a class which holds a position in three floats. I have overridden Equals like so:
return Math.Abs(this.X - that.X) < TOLERANCE
&& Math.Abs(this.Y - that.Y) < TOLERANCE
&& Math.Abs(this.Z - that.Z) < TOLERANCE;
This is all very well, but now I need to write a GetHashCode implementation for these vertices, and I'm stuck. Simply taking the hash codes of the three values and XORing them together isn't good enough, because two objects with slightly different positions may be considered the same.
So, how can I build a GetHashCode implementation for this class which will always return the same value for instances which would be considered equal by the above method?

There's only one way to satisfy the requirements of GetHashCode with an Equals like this.
Say you have these objects (the arrows indicate the limits of the tolerance, and I'm simplifying this to 1-D):
     a                c
<----|---->      <----|---->
        <----|---->
             b
By your implementation of Equals, we have:
a.Equals(b) == true
b.Equals(c) == true
a.Equals(c) == false
(This is the loss of transitivity mentioned...)
However, the requirements of GetHashCode are that Equals being true implies that the hash codes are the same. Thus, we have:
hash(a) = hash(b)
hash(b) = hash(c)
∴ hash(a) = hash(c)
By extension, we can cover any part of the 1-D space with this (imagine d, e, f, ...), and all the hashes will have to be the same!
int GetHashCode()
{
    return some_constant_integer;
}
I would say don't bother with .NET's GetHashCode. It doesn't make sense for your application. ;)
If you needed some form of hash for quick lookup for your data type, you should start looking at some kind of spatial index.

I recommend that you rethink your implementation of Equals. It violates the transitive property, and that's going to give you headaches down the road. See How to: Define Value Equality for a Type, and specifically this line:
if (x.Equals(y) && y.Equals(z)) returns true, then x.Equals(z) returns true. This is called the transitive property.

This "Equals" implementation doesn't satisfy the transitive property of being equal (that if X equals Y, and Y equals Z, then X equals Z).
Given that you've already got a non-conforming implementation of Equals, I wouldn't worry too much about your hashing code.
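The loss of transitivity shows up with just three numbers. The helper below is a hypothetical one-dimensional version of the Equals from the question, with a tolerance of 0.1 picked purely for illustration.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class ToleranceDemo
{
    public const double Tolerance = 0.1;   // assumed value

    // One-dimensional version of the tolerance-based Equals from the question.
    public static bool NearlyEqual(double x, double y)
    {
        return Math.Abs(x - y) < Tolerance;
    }
}
```

With a = 0.0, b = 0.08 and c = 0.16, NearlyEqual(a, b) and NearlyEqual(b, c) are both true, but NearlyEqual(a, c) is false.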

Is this possible? In your equality implementation, there's effectively a sliding window within which equality is considered true, however if you have to "bucketize" (or quantize) for a hash, then it's likely that two items that are "equal" might lie on either side of the hash "boundary".
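The boundary problem can be shown with a simple grid. This is only a sketch of the quantizing idea, with a hypothetical GridHash helper and a cell width of 0.1 chosen for illustration.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GridHash
{
    // Index of the grid cell of width 'cellSize' that contains 'x'.
    public static int Cell(double x, double cellSize)
    {
        return (int)Math.Floor(x / cellSize);
    }
}
```

The values 0.99 and 1.01 are only 0.02 apart, well inside a tolerance of 0.1, yet GridHash.Cell(0.99, 0.1) is 9 and GridHash.Cell(1.01, 0.1) is 10. A hash code derived from the cell index would therefore differ for two objects that Equals considers the same.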

C#/Java: Proper Implementation of CompareTo when Equals tests reference identity

I believe this question applies equally well to C# as to Java, because both require that {c,C}ompareTo be consistent with {e,E}quals:
Suppose I want my equals() method to be the same as a reference check, i.e.:
public bool equals(Object o) {
    return this == o;
}
In that case, how do I implement compareTo(Object o) (or its generic equivalent)? Part of it is easy, but I'm not sure about the other part:
public int compareTo(Object o) {
    MyClass other = (MyClass)o;
    if (this == other) {
        return 0;
    } else {
        int c = foo.CompareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
I can't just blindly return 1 or -1, because the solution should adhere to the normal requirements of compareTo. I can check all the instance fields, but if they are all equal, I'd still like compareTo to return a value other than 0. It should be true that a.compareTo(b) == -(b.compareTo(a)), and the ordering should stay consistent as long as the objects' state doesn't change.
I don't care about ordering across invocations of the virtual machine, however. This makes me think that I could use something like memory address, if I could get at it. Then again, maybe that won't work, because the Garbage Collector could decide to move my objects around.
hashCode is another idea, but I'd like something that will be always unique, not just mostly unique.
Any ideas?

First of all, if you are using Java 5 or above, you should implement Comparable<MyClass> rather than the plain old Comparable, therefore your compareTo method should take a parameter of type MyClass, not Object:
public int compareTo(MyClass other) {
    if (this == other) {
        return 0;
    } else {
        int c = foo.CompareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
As for your question, Josh Bloch in Effective Java (Chapter 3, Item 12) says:
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y. (This implies that x.compareTo(y) must throw an exception if and only if y.compareTo(x) throws an exception.)
This means that if c == 0 in the above code, you must return 0.
That in turn means that you can have objects A and B, which are not equal, but their comparison returns 0. What does Mr. Bloch have to say about this?
It is strongly recommended, but not strictly required, that (x.compareTo(y) == 0) == (x.equals(y)). Generally speaking, any class that implements the Comparable interface and violates this condition should clearly indicate this fact. The recommended language is “Note: This class has a natural ordering that is inconsistent with equals.”
And
A class whose compareTo method imposes an order that is inconsistent with equals will still work, but sorted collections containing elements of the class may not obey the general contract of the appropriate collection interfaces (Collection, Set, or Map). This is because the general contracts for these interfaces are defined in terms of the equals method, but sorted collections use the equality test imposed by compareTo in place of equals. It is not a catastrophe if this happens, but it’s something to be aware of.
Update: So IMHO with your current class, you can not make compareTo consistent with equals. If you really need to have this, the only way I see would be to introduce a new member, which would give a strict natural ordering to your class. Then in case all the meaningful fields of the two objects compare to 0, you could still decide the order of the two based on their special order values.
This extra member may be an instance counter, or a creation timestamp. Or, you could try using a UUID.

In Java or C#, generally speaking, there is no fixed ordering of objects. Instances can be moved around by the garbage collector while executing your compareTo, or the sort operation that's using your compareTo.
As you stated, hash codes are generally not unique, so they're not usable (two different instances with the same hash code bring you back to the original question). And the Java Object.toString implementation, which many people believe to surface an object id (MyObject@33c0d9d), is nothing more than the object's class name followed by the hash code. As far as I know, neither the JVM nor the CLR has a notion of an instance id.
If you really want a consistent ordering of your classes, you could try using an incrementing number for each new instance you create. Mind you, incrementing this counter must be thread safe, so it's going to be relatively expensive (in C# you could use Interlocked.Increment).
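In C#, the counter idea could look roughly like the sketch below. The class name Ordered and its int Foo property are hypothetical, and it assumes that ordering by creation sequence is an acceptable tie-breaker.

```csharp
using System;
using System.Threading;

// Hypothetical class: orders by Foo first, then by creation order.
public class Ordered : IComparable<Ordered>
{
    private static long _nextId;      // shared across all instances
    private readonly long _id;        // unique per instance, fixed for its lifetime

    public int Foo { get; private set; }

    public Ordered(int foo)
    {
        Foo = foo;
        _id = Interlocked.Increment(ref _nextId);   // thread-safe increment
    }

    public int CompareTo(Ordered other)
    {
        if (other == null) return 1;
        if (ReferenceEquals(this, other)) return 0;
        int c = Foo.CompareTo(other.Foo);
        // The ids are unique, so distinct instances never compare as 0.
        return c != 0 ? c : _id.CompareTo(other._id);
    }
}
```

Because the id never changes and is unaffected by the garbage collector moving the object, the ordering stays stable for the lifetime of the process, and CompareTo returns 0 only for the same reference, which keeps it consistent with a reference-based Equals.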

Two objects don't need to be reference equal to be in the same equivalence class. In my opinion, it should be perfectly acceptable for two different objects to be the same for a comparison, but not reference equal. It seems perfectly natural to me, for example, that if you hydrated two different objects from the same row in the database, that they would be the same for comparison purposes, but not reference equal.
I'd actually be more inclined to modify the behavior of equals to reflect how they are compared rather than the other way around. For most purposes that I can think of this would be more natural.

The generic equivalent is easier to deal with in my opinion. It depends on what your external requirements are, but this is an IComparable<MyClass> example:
public int CompareTo(MyClass other) {
    if (other == null) return 1;
    if (this == other) {
        return 0;
    } else {
        return foo.CompareTo(other.foo);
    }
}
If the classes are equal or if foo is equal, that's the end of the comparison, unless there's something secondary to sort on; in that case, add it as the return value when foo.CompareTo(other.foo) == 0.
If your classes have an ID or something, then compare on that as secondary; otherwise don't worry about it. The collection they're stored in, and the order in which it arrives at these objects to compare them, is what's going to determine the final order in the case of equal objects or equal object.foo values.
