How to write a hashcode generator for this class? - c#

I have a class which holds a position in three floats. I have overridden Equals like so:
return Math.Abs(this.X - that.X) < TOLERANCE
&& Math.Abs(this.Y - that.Y) < TOLERANCE
&& Math.Abs(this.Z - that.Z) < TOLERANCE;
This is all very well, but now I need to write a GetHashCode implementation for these vertices, and I'm stuck. Simply taking the hash codes of the three values and XORing them together isn't good enough, because two objects with slightly different positions may be considered the same.
So, how can I build a GetHashCode implementation for this class which will always return the same value for instances which would be considered equal by the above method?

There's only one way to satisfy the requirements of GetHashCode with an Equals like this.
Say you have these objects (the arrows indicate the limits of the tolerance, and I'm simplifying this to 1-D):
     a           c
<----|----> <----|---->
      <----|---->
           b
By your implementation of Equals, we have:
a.Equals(b) == true
b.Equals(c) == true
a.Equals(c) == false
(This is the loss of transitivity mentioned...)
However, the requirements of GetHashCode are that Equals being true implies that the hash codes are the same. Thus, we have:
hash(a) = hash(b)
hash(b) = hash(c)
∴ hash(a) = hash(c)
By extension, we can cover any part of the 1-D space with this (imagine d, e, f, ...), and all the hashes will have to be the same!
public override int GetHashCode()
{
    return 0; // any constant works; every instance has to hash alike
}
I would say don't bother with .NET's GetHashCode. It doesn't make sense for your application. ;)
If you needed some form of hash for quick lookup for your data type, you should start looking at some kind of spatial index.
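As a rough illustration of the spatial-index idea, here is a minimal 1-D sketch. It assumes a uniform grid whose cell width equals the tolerance; the names GridIndex1D, Add and FindNear are made up for this example and are not .NET types. Two values closer than the tolerance can still land in neighbouring cells, so a lookup checks the cell and both of its neighbours.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a .NET type): buckets values into cells of width `tolerance`.
public sealed class GridIndex1D
{
    private readonly double _tolerance;
    private readonly Dictionary<long, List<double>> _cells = new Dictionary<long, List<double>>();

    public GridIndex1D(double tolerance) { _tolerance = tolerance; }

    private long CellOf(double x) => (long)Math.Floor(x / _tolerance);

    public void Add(double x)
    {
        long cell = CellOf(x);
        if (!_cells.TryGetValue(cell, out List<double> list))
        {
            list = new List<double>();
            _cells[cell] = list;
        }
        list.Add(x);
    }

    // Returns every stored value strictly within `tolerance` of x.
    public List<double> FindNear(double x)
    {
        var result = new List<double>();
        long cell = CellOf(x);
        for (long c = cell - 1; c <= cell + 1; c++)
        {
            if (!_cells.TryGetValue(c, out List<double> list)) continue;
            foreach (double y in list)
                if (Math.Abs(x - y) < _tolerance) result.Add(y);
        }
        return result;
    }

    public static void Main()
    {
        var index = new GridIndex1D(0.01);
        index.Add(0.0099); // lands in cell 0
        index.Add(0.5);
        // 0.0101 lands in cell 1, but the neighbouring cell is searched too.
        Console.WriteLine(index.FindNear(0.0101).Count); // prints 1
    }
}
```

The same idea extends to three dimensions by keying the dictionary on a triple of cell indices and scanning the 27 surrounding cells.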

I recommend that you rethink your implementation of Equals. It violates the transitive property, and that's going to give you headaches down the road. See How to: Define Value Equality for a Type, and specifically this line:
if (x.Equals(y) && y.Equals(z)) returns true, then x.Equals(z) returns true. This is called the transitive property.

This "Equals" implementation doesn't satisfy the transitive property of being equal (that if X equals Y, and Y equals Z, then X equals Z).
Given that you've already got a non-conforming implementation of Equals, I wouldn't worry too much about your hashing code.
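To see the broken transitivity concretely, here is a small 1-D sketch. The tolerance of 0.01 and the three sample values are assumptions chosen for the demonstration; NearlyEqual mirrors one coordinate of the Equals in the question.

```csharp
using System;

public static class TransitivityDemo
{
    private const double Tolerance = 0.01; // assumed value for the demo

    public static bool NearlyEqual(double x, double y) => Math.Abs(x - y) < Tolerance;

    public static void Main()
    {
        double a = 0.000, b = 0.006, c = 0.012;
        Console.WriteLine(NearlyEqual(a, b)); // True
        Console.WriteLine(NearlyEqual(b, c)); // True
        Console.WriteLine(NearlyEqual(a, c)); // False, so the relation is not transitive
    }
}
```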

Is this possible? Your equality implementation effectively uses a sliding window within which two values are considered equal. A hash, however, has to "bucketize" (or quantize) the values, so two items that are "equal" may well lie on either side of a hash "boundary".
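A short sketch of that boundary problem, assuming a tolerance of 0.01 and a made-up QuantizedHash that simply returns the index of the tolerance-wide cell a value falls into:

```csharp
using System;

public static class BucketBoundaryDemo
{
    private const double Tolerance = 0.01; // assumed value for the demo

    // Hypothetical quantizing hash: index of the tolerance-wide cell containing x.
    public static int QuantizedHash(double x) => (int)Math.Floor(x / Tolerance);

    public static bool NearlyEqual(double x, double y) => Math.Abs(x - y) < Tolerance;

    public static void Main()
    {
        double p = 0.0099, q = 0.0101;
        Console.WriteLine(NearlyEqual(p, q)); // True
        Console.WriteLine(QuantizedHash(p));  // 0
        Console.WriteLine(QuantizedHash(q));  // 1
    }
}
```

The two values are "equal" yet hash differently, which is exactly what the GetHashCode contract forbids. Making the cells larger only moves the boundaries; it does not remove them.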

Related

When would == be overridden in a different way to .equals?

I understand the difference between == and .equals. There are plenty of other questions on here that explain the difference in detail, e.g. "What is the difference between .Equals and ==" and "Bitwise equality", amongst many others.
My question is: why have them both (I realise there must be a very good reason) - they both appear to do the same thing (unless overridden differently).
When would == be overloaded in a different way to how .equals is overridden?
== is bound statically, at compile-time, because operators are always static. You overload operators - you can't override them. Equals(object) is executed polymorphically, because it's overridden.
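The difference in binding can be seen with two distinct string instances that hold the same characters. This is only an illustrative sketch; the class and method names are invented for the example.

```csharp
using System;

public static class DispatchDemo
{
    // Operands are typed as object, so the compiler picks reference comparison.
    public static bool OperatorOnObjects(object x, object y) => x == y;

    // Operands are typed as string, so the compiler picks string's overloaded ==.
    public static bool OperatorOnStrings(string x, string y) => x == y;

    // Virtual call: string.Equals(object) runs because x is a string at runtime.
    public static bool VirtualEquals(object x, object y) => x.Equals(y);

    public static void Main()
    {
        string a = "hello";
        string b = new string("hello".ToCharArray()); // separate instance, same characters

        Console.WriteLine(OperatorOnObjects(a, b)); // False
        Console.WriteLine(OperatorOnStrings(a, b)); // True
        Console.WriteLine(VirtualEquals(a, b));     // True
    }
}
```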
In terms of when you'd want them to be different...
Often reference types will override Equals but not overload == at all. It can be useful to easily tell the difference between "these two references refer to the same object" and "these two references refer to equal objects". (You can use ReferenceEquals if necessary, of course - and as Eric points out in comments, that's clearer.) You want to be really clear about when you do that, mind you.
double has this behavior for NaN values; ==(double, double) will always return false when either operand is NaN, even if they're the same NaN. Equals can't do that without invalidating its contract. (Admittedly GetHashCode is broken for different NaN values, but that's a different matter...)
I can't remember ever implementing them to give different results, personally.
My question is: why have them both (I realise there must be a very good reason)
If there's a good reason it has yet to be explained to me. Equality comparisons in C# are a godawful mess, and were #9 on my list of things I regret about the design of C#:
http://www.informit.com/articles/article.aspx?p=2425867
Mathematically, equality is the simplest equivalence relation and it should obey the rules: x == x should always be true, x == y should always be the same as y == x, x == y and x != y should always be opposite valued, if x == y and y == z are true then x == z must be true. C#'s == and Equals mechanisms guarantee none of these properties! (Though, thankfully, ReferenceEquals guarantees all of them.)
As Jon notes in his answer, == is dispatched based on the compile-time types of both operands, and .Equals(object) and .Equals(T) from IEquatable<T> are dispatched based on the runtime type of the left operand. Why are either of those dispatch mechanisms correct? Equality is not a predicate that favours its left hand side, so why should some but not all of the implementations do so?
Really what we want for user-defined equality is a multimethod, where the runtime types of both operands have equal weight, but that's not a concept that exists in C#.
Worse, it is incredibly common that Equals and == are given different semantics -- usually that one is reference equality and the other is value equality. There is no reason by which the naive developer would know which was which, or that they were different. This is a considerable source of bugs. And it only gets worse when you realize that GetHashCode and Equals must agree, but == need not.
Were I designing a new language from scratch, and I for some crazy reason wanted operator overloading -- which I don't -- then I would design a system that would be much, much more straightforward. Something like: if you implement IComparable<T> on a type then you automatically get <, <=, ==, !=, and so on, operators defined for you, and they are implemented so that they are consistent. That is x<=y must have the semantics of x<y || x==y and also the semantics of !(x>y), and that x == y is always the same as y == x, and so on.
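C# has no such automatic derivation, but the consistency described above can be approximated by hand. A minimal sketch, assuming a made-up Priority struct in which every operator and Equals is defined in terms of CompareTo, so they cannot disagree with each other:

```csharp
using System;

// Hypothetical value type: all comparisons are derived from CompareTo.
public readonly struct Priority : IComparable<Priority>, IEquatable<Priority>
{
    public int Level { get; }
    public Priority(int level) { Level = level; }

    public int CompareTo(Priority other) => Level.CompareTo(other.Level);

    public bool Equals(Priority other) => CompareTo(other) == 0;
    public override bool Equals(object obj) => obj is Priority p && Equals(p);
    public override int GetHashCode() => Level.GetHashCode();

    public static bool operator ==(Priority x, Priority y) => x.CompareTo(y) == 0;
    public static bool operator !=(Priority x, Priority y) => x.CompareTo(y) != 0;
    public static bool operator <(Priority x, Priority y) => x.CompareTo(y) < 0;
    public static bool operator >(Priority x, Priority y) => x.CompareTo(y) > 0;
    public static bool operator <=(Priority x, Priority y) => x.CompareTo(y) <= 0;
    public static bool operator >=(Priority x, Priority y) => x.CompareTo(y) >= 0;
}

public static class PriorityDemo
{
    public static void Main()
    {
        var low = new Priority(1);
        var high = new Priority(2);
        Console.WriteLine(low < high);                 // True
        Console.WriteLine(low == new Priority(1));     // True
        Console.WriteLine(low.Equals((object)high));   // False
    }
}
```

This is boilerplate that has to be repeated for every type, which is part of the complaint being made here.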
Now, if your question really is:
How on earth did we get into this godawful mess?
Then I wrote down some thoughts on that back in 2009:
https://blogs.msdn.microsoft.com/ericlippert/2009/04/09/double-your-dispatch-double-your-fun/
The TLDR is: framework designers and language designers have different goals and different constraints, and they sometimes do not take those factors into account in their designs in order to ensure a consistent, logical experience across the platform. It's a failure of the design process.
When would == be overloaded in a different way to how .equals is overridden?
I would never do so unless I had a very unusual, very good reason. When I implement arithmetic types I always implement all of the operators to be consistent with each other.
One case that can come up is when you have a previous codebase that depends on reference equality via ==, but you decide you want to add value equality checking. One way to do this is to implement IEquatable<T>, which is great, but now what about all that existing code that was assuming only references were equal? Should the inherited Object.Equals be different from how IEquatable<T>.Equals works? This doesn't have an easy answer, as ideally you want all of those functions/operators to act in a consistent way.
For a concrete case in the BCL where this happened, look at TimeZoneInfo. In that particular case, == and Object.Equals were kept the same, but it's not clear-cut that this was the best choice.
As an aside, one way you can mitigate the above problem is to make the class immutable. In this case, code is less likely to be broken by having previously relied on reference equality, since you can't mutate the instance via a reference and invalidate an equality that was previously checked.
Generally, you want them to do the same thing, particularly if your code is going to be used by anyone other than yourself and the person next to you. Ideally, for anyone who uses your code, you want to adhere to the principle of least surprise, which having randomly different behaviours violates. Having said this:
Overloading equality is generally a bad idea, unless a type is immutable, and sealed. If you're at the stage where you have to ask questions about it, then the odds of getting it right in any other case are slim. There are lots of reasons for this:
A. Equals and GetHashCode play together to enable dictionaries and hash sets to work - if you have an inconsistent implementation (or if the hash code changes over time) then one of the following can occur:
Dictionaries/sets start performing as effectively linear-time lookups.
Items get lost in dictionaries/sets
B. What were you really trying to do? Generally, the identity of an object in an object-oriented language IS its reference. So having two equal objects with different references is just a waste of memory. There was probably no need to create a duplicate in the first place.
C. What you often find when you start implementing equality for objects is that you're looking for a definition of equality that is "for a particular purpose". This makes it a really bad idea to burn your one-and-only Equals for this - much better to define different EqualityComparers for the uses.
D. As others have pointed out, you overload operators but override methods. This means that unless the operators call the methods, horribly amusing and inconsistent results occur when someone tries to use == and finds the wrong (unexpected) method gets called at the wrong level of the hierarchy.
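Point C above can be sketched as follows. The Person class and the EmailComparer are invented for illustration; the idea is that Person keeps its default (reference) Equals, while a purpose-specific IEqualityComparer<T> is handed to the collection that needs a different notion of equality.

```csharp
using System;
using System.Collections.Generic;

public sealed class Person
{
    public string Name { get; }
    public string Email { get; }
    public Person(string name, string email) { Name = name; Email = email; }
}

// Hypothetical comparer: "same person" here means "same e-mail address, ignoring case".
public sealed class EmailComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return StringComparer.OrdinalIgnoreCase.Equals(x.Email, y.Email);
    }

    // Must agree with Equals: hash the same data, with the same case rules.
    public int GetHashCode(Person p) =>
        p.Email == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(p.Email);
}

public static class ComparerDemo
{
    public static void Main()
    {
        var first = new Person("Ann", "ann@example.com");
        var second = new Person("Ann B.", "ANN@example.com");

        var byEmail = new HashSet<Person>(new EmailComparer()) { first, second };
        Console.WriteLine(byEmail.Count);        // 1
        Console.WriteLine(first.Equals(second)); // False: the type's own Equals is untouched
    }
}
```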

Is there any negative consequence in having Equals based on GetHashCode?

Is the following code OK?
public override bool Equals(object obj)
{
if (obj == null || !(obj is LicenseType))
return false;
return GetHashCode() == obj.GetHashCode();
}
public override int GetHashCode()
{
return
Vendor.GetHashCode() ^
Version.GetHashCode() ^
Modifiers.GetHashCode() ^
Locale.GetHashCode();
}
All the properties are enums/numeric fields, and are the only properties that define the LicenseType objects.
No, the documentation states very clearly:
You should not assume that equal hash codes imply object equality.
Also:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality
And:
Caution:
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.
What happens when two different objects are returning the same HashCodes?
It is, after all, just a hash, and so may not be distinct over the full range of values the objects can have.
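A concrete collision for the code in the question is easy to construct, because XOR does not care about operand order. The sketch below is a simplified stand-in for LicenseType, assuming two plain int fields; HashOnlyEquals is a made-up name for the question's hash-based comparison.

```csharp
using System;

// Simplified stand-in for the question's LicenseType, assuming plain int fields.
public sealed class LicenseType
{
    public int Vendor;
    public int Version;

    public override int GetHashCode() => Vendor.GetHashCode() ^ Version.GetHashCode();

    // The question's approach: equality decided by hash codes alone.
    public bool HashOnlyEquals(LicenseType other) =>
        other != null && GetHashCode() == other.GetHashCode();

    // Field-by-field comparison.
    public override bool Equals(object obj) =>
        obj is LicenseType other && Vendor == other.Vendor && Version == other.Version;

    public static void Main()
    {
        var a = new LicenseType { Vendor = 1, Version = 2 };
        var b = new LicenseType { Vendor = 2, Version = 1 };
        Console.WriteLine(a.HashOnlyEquals(b)); // True, although the objects differ
        Console.WriteLine(a.Equals(b));         // False
    }
}
```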
It is OK (no negative consequences) only if GetHashCode is unique for each possible value. For example, the GetHashCode of a short (a 16-bit value) is always unique (or so one hopes), so basing Equals on GetHashCode is fine for it.
Another example: for int, GetHashCode() is the value of the integer, so ((int)value).GetHashCode() == ((int)value). Note that this isn't true for short, for example (the hash codes of a short are still unique; they simply use a more complex formula).
Note that what Patrick wrote is wrong here, because it applies to the "user" of an object/class. You are the "writer" of the object/class, so you define the concept of equality and the concept of hash code. If you define that two objects are always equal, whatever their value is, then it's OK.
public override int GetHashCode() { return 1; }
public override bool Equals(object obj) { return true; }
The only important rules for Equals are:
Implementations are required to ensure that if the Equals method returns true for two objects x and y, then the value returned by the GetHashCode method for x must equal the value returned for y.
The Equals method is reflexive, symmetric, and transitive...
Clearly your Equals() and GetHashCode() comply with these rules, so they are OK.
Just out of curiosity, there is at least one exception for the equality operator (==), which you would normally define on top of the Equals method:
bool v1 = double.NaN.Equals(double.NaN); // true
bool v2 = double.NaN == double.NaN; // false
This is because the IEEE 754 standard defines NaN as being different from every value, including NaN itself. For practical reasons, Equals returns true.
It must be noted that it is NOT a rule that if two objects have the same hash code, then they must be equal.
There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code, by the Pigeonhole Principle.
Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.
This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

Implement GetHashCode on a class that has wildcard Equatability

Suppose I want to be able to compare 2 lists of ints and treat one particular value as a wild card.
e.g.
If -1 is a wild card, then
{1,2,3,4} == {1,2,-1,4} //returns true
And I'm writing a class to wrap all this logic, so it implements IEquatable and has the relevant logic in public override bool Equals()
But I have always thought that you more-or-less had to implement GetHashCode if you were overriding .Equals(). Granted it's not enforced by the compiler, but I've been under the impression that if you don't then you're doing it wrong.
Except I don't see how I can implement .GetHashCode() without either breaking its contract (objects that are Equal have different hashes), or just having the implementation be return 1.
Thoughts?
This implementation of Equals is already invalid, as it is not transitive. You should probably leave Equals with the default implementation, and write a new method like WildcardEquals (as suggested in the other answers here).
In general, whenever you have changed Equals, you must implement GetHashCode if you want to be able to store the objects in a hashtable (e.g. a Dictionary<TKey, TValue>) and have it work correctly. If you know for certain that the objects will never end up in a hashtable, then it is in theory optional (but it would be safer and clearer in that case to override it to throw a "NotSupportedException" or always return 0).
The general contract is to always implement GetHashCode if you override Equals, as you can't always be sure in advance that later users won't put your objects in hashtables.
In this case, I would create a new or extension method, WildcardEquals(other), instead of using the operators.
I wouldn't recommend hiding this kind of complexity.
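A minimal sketch of what such a separate method could look like, assuming lists of equal length are compared position by position and that -1 is the wildcard by default. WildcardEquals is the hypothetical name suggested above, not an existing API.

```csharp
using System;
using System.Collections.Generic;

public static class WildcardExtensions
{
    // Hypothetical helper: `wildcard` matches any value at the same position.
    public static bool WildcardEquals(this IReadOnlyList<int> left, IReadOnlyList<int> right, int wildcard = -1)
    {
        if (left == null || right == null) return object.ReferenceEquals(left, right);
        if (left.Count != right.Count) return false;
        for (int i = 0; i < left.Count; i++)
        {
            if (left[i] == wildcard || right[i] == wildcard) continue;
            if (left[i] != right[i]) return false;
        }
        return true;
    }
}

public static class WildcardDemo
{
    public static void Main()
    {
        var a = new[] { 1, 2, 3, 4 };
        var b = new[] { 1, 2, -1, 4 };
        var c = new[] { 1, 2, 9, 4 };
        Console.WriteLine(a.WildcardEquals(b)); // True
        Console.WriteLine(b.WildcardEquals(c)); // True
        Console.WriteLine(a.WildcardEquals(c)); // False: the match is not transitive
    }
}
```

Because this is an ordinary method, nothing in the framework assumes it is an equivalence relation, and Equals and GetHashCode can stay as they are.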
From a logical point of view, we break the concept of equality. It is not transitive any longer. So in case of wildcards, A==B and B==C does not mean that A==C.
From a technical point of view, returning the same value from GetHashCode() is not something unforgivable.
The only possible idea I see is to exploit at least the length, e.g.:
public override int GetHashCode()
{
return this.Length.GetHashCode();
}
It's recommended, but not mandatory at all. If you don't need that custom implementation of GetHashCode, just don't do it.
GetHashCode is generally only important if you're going to be storing elements of your class in some kind of collection, such as a set. If that's the case here then I don't think you're going to be able to achieve consistent semantics since, as @AlexD points out, equality is no longer transitive.
For example, (using string globs rather than integer lists) if you add the strings "A", "B", and "*" to a set, your set will end up with either one or two elements depending on the order you add them in.
If that's not what you want then I'd recommend putting the wildcard matching into a new method (e.g. EquivalentTo()) rather than overloading equality.
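The order dependence mentioned above can be reproduced with a HashSet<string> and a wildcard comparer. This is only a sketch: GlobComparer is a made-up comparer in which "*" matches any string, and it returns a constant hash so that the equal-objects-equal-hashes rule still holds.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: "*" matches any string; the constant hash keeps the hash contract.
public sealed class GlobComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => x == "*" || y == "*" || x == y;
    public int GetHashCode(string s) => 0;
}

public static class OrderDemo
{
    public static int CountAfterAdding(params string[] items)
    {
        var set = new HashSet<string>(new GlobComparer());
        foreach (string item in items) set.Add(item);
        return set.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountAfterAdding("A", "B", "*")); // 2
        Console.WriteLine(CountAfterAdding("*", "A", "B")); // 1
    }
}
```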
Having GetHashCode() always return a constant is the only 'legal' way of fulfilling the equals/hashcode constraint.
It'll potentially be inefficient if you put it in a hashmap, or similar, but that might be fine (non-equal hashcodes imply non-equality, but equal hashcodes imply nothing).
I think this is the only possible valid option there. Hashcodes essentially exist as keys to look things up by quickly, and since your wildcard must match every item, its key for lookup must equal every item's key, so they must all be the same.
As others have noted though, this isn't what equals is normally for, and breaks assumptions that many other things may use for equals (such as transitivity - EDIT: turns out this is actually contractual requirement, so no-go), so it's definitely worth at least considering comparing these manually, or with an explicitly separate equality comparer.
Since you've changed what "equals" means (adding in wildcards changes things dramatically) then you're already outside the scope of the normal use of Equals and GetHashCode. It's just a recommendation and in your case it seems like it doesn't fit. So don't worry about it.
That said, make sure you're not using your class in places that might use GetHashCode. That can get you in a load of trouble and be hard to debug if you're not watching for it.
It is generally expected that Equals(Object) and IEquatable<T>.Equals(T) should implement equivalence relations, such that if X is observed to be equal to Y, and Y is observed to be equal to Z, and none of the items have been modified, X may be assumed to be equal to Z; additionally, if X is equal to Y and Y does not equal Z, then X may be assumed not to equal Z either. Wild-card and fuzzy comparison methods do not implement equivalence relations, and thus Equals should generally not be implemented with such semantics.
Many collections will kinda-sorta work with objects that implement Equals in a way that doesn't implement an equivalence relation, provided that any two objects that might compare equal to each other always return the same hash code. Doing this will often require that many things that would compare unequal to return the same hash code, though depending upon what types of wildcard are supported it may be possible to separate items to some degree.
For example, if the only wildcard which a particular string supports represents "arbitrary string of one or more digits", one could hash the string by converting all sequences of consecutive digits and/or string-of-digit wildcard characters into a single "string of digits" wildcard character. If # represents any digit, then the strings abc123, abc#, abc456, and abc#93#22#7 would all be hashed to the same value as abc#, but abc#b, abc123b, etc. could hash to a different value. Depending upon the distribution of strings, such distinctions may or may not yield better performance than returning a constant value.
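The digit-collapsing hash described above could be sketched like this. It assumes only ASCII digits count as digits and that '#' is the wildcard character; Normalize and Hash are names invented for the example.

```csharp
using System;
using System.Text;

public static class DigitWildcardHash
{
    // Collapses every run of digits and/or '#' characters into a single '#'.
    public static string Normalize(string s)
    {
        var sb = new StringBuilder();
        bool inRun = false;
        foreach (char ch in s)
        {
            bool digitLike = ch == '#' || (ch >= '0' && ch <= '9');
            if (digitLike)
            {
                if (!inRun) sb.Append('#');
                inRun = true;
            }
            else
            {
                sb.Append(ch);
                inRun = false;
            }
        }
        return sb.ToString();
    }

    // Strings that could match each other normalize identically, so they hash identically.
    public static int Hash(string s) => Normalize(s).GetHashCode();

    public static void Main()
    {
        Console.WriteLine(Normalize("abc123"));      // abc#
        Console.WriteLine(Normalize("abc#93#22#7")); // abc#
        Console.WriteLine(Normalize("abc123b"));     // abc#b
    }
}
```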
Note that even if one implements GetHashCode in such a fashion that equal objects yield equal hashes, some collections may still behave oddly if the equality method doesn't implement an equivalence relation. For example, if a collection foo contains items with keys "abc1" and "abc2", attempts to access foo["abc#"] might arbitrarily return the first item or the second. Attempts to delete the key "abc#" may arbitrarily remove one or both items, or may fail after deleting one item (its expected post-condition wouldn't be met, since abc# would be in the collection even after deletion).
Rather than trying to jinx Equals to compare hash-code equality, an alternative approach is to have a dictionary which holds for each possible wildcard string that would match at least one main-collection string a list of the strings it might possibly match. Thus, if there are many strings which would match abc#, they could all have different hash codes; if a user enters "abc#" as a search request, the system would look up "abc#" in the wild-card dictionary and receive a list of all strings matching that pattern, which could then be looked up individually in the main dictionary.

Does the CompareTo() method use GetHashCode()?

Does the CompareTo() method use GetHashCode() to derive a comparable number (not the interface) for the object? If I do
MyObject.CompareTo(MyOtherObject.GetHashCode())
What will happen if I don't want to override the CompareTo() method ?
No, CompareTo does not / should not use GetHashCode to check for equality.
They might (emphasis on might) use it to determine inequality, if the hash code is cached and thus cheaper to look at than comparing all the internal data, but equal hash codes do not necessarily mean equal objects.
If you implement Equals and GetHashCode (you need to implement both or none), then here are the rules you should follow:
If the two objects are equal (Equals returns true), they should produce the same hash code from GetHashCode. You can turn this rule on its head and say that if the two GetHashCode methods return different values, Equals should return false.
Note that the opposite does not hold. If Equals returns false, it is perfectly valid, though usually very unlikely, that GetHashCode returns the same value. Likewise, if GetHashCode returns the same value, it is perfectly valid, though again usually very unlikely, that Equals returns false. This is because of the Pigeonhole Principle.
Always use the same fields to check for equality and calculate the hash code
Don't use mutable fields (if you can help it). If you do, make it very clear which fields will break the hash code and equality checks. Stuffing mutable objects in a hash set or dictionary and then modifying them will break everything.
If it's an object you're creating yourself then here are some rules:
Implement CompareTo and IComparable<T> if you need ordering support. Don't implement CompareTo only to get equality checks.
Implement Equals, GetHashCode, and IEquatable<T> if you need equality checks.
If it's an object you cannot modify, create:
IComparer<T> to support ordering
IEqualityComparer<T> to support equality checks
Most collections or methods that do ordering or equality checks allow you to specify an extra object that determines the rules for the ordering or equality checks, on the assumption that the implementation built into the object is either wrong (possibly in just this single scenario) or missing.
Links to all the types:
System.IComparable<T> and its external sibling System.IComparer<T>
System.IEquatable<T> and its external sibling System.IEqualityComparer<T>
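For the "object you cannot modify" case, an external ordering might look like the sketch below. LengthComparer is a made-up example that orders strings by length and breaks ties ordinally; it never touches GetHashCode.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical external ordering for a type we cannot change (string): shorter first.
public sealed class LengthComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;
        int byLength = x.Length.CompareTo(y.Length);
        return byLength != 0 ? byLength : string.CompareOrdinal(x, y);
    }
}

public static class ComparerUsage
{
    public static List<string> SortByLength(IEnumerable<string> words)
    {
        var sorted = new List<string>(words);
        sorted.Sort(new LengthComparer()); // the comparer is supplied from outside the type
        return sorted;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", SortByLength(new[] { "pear", "fig", "banana" })));
        // fig, pear, banana
    }
}
```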

Using GetHashCode to test equality in Equals override

Is it ok to call GetHashCode as a method to test equality from inside the Equals override?
For example, is this code acceptable?
public class Class1
{
public string A
{
get;
set;
}
public string B
{
get;
set;
}
public override bool Equals(object obj)
{
Class1 other = obj as Class1;
return other != null && other.GetHashCode() == this.GetHashCode();
}
public override int GetHashCode()
{
int result = 0;
result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
return result;
}
}
The others are right; your equality operation is broken. To illustrate:
public static void Main()
{
var c1 = new Class1() { A = "apahaa", B = null };
var c2 = new Class1() { A = "abacaz", B = null };
Console.WriteLine(c1.Equals(c2));
}
I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.
Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.
In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox".
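To put rough numbers on that, the standard approximation 1 - exp(-k(k-1)/(2n)) can be evaluated for n = 2^32. This sketch assumes hash codes are uniformly distributed, which real GetHashCode implementations only approximate; the class and method names are invented.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability that at least two of `items` uniformly random
    // 32-bit hash codes collide: 1 - exp(-k(k-1) / (2n)).
    public static double CollisionProbability(long items)
    {
        double n = 4294967296.0; // 2^32 possible hash codes
        double k = items;
        return 1.0 - Math.Exp(-k * (k - 1.0) / (2.0 * n));
    }

    public static void Main()
    {
        foreach (long k in new long[] { 1000, 10000, 77163, 1000000 })
            Console.WriteLine($"{k} items: {CollisionProbability(k):P2}");
    }
}
```

Under these assumptions the probability passes one half at roughly 77,000 items, which is in the same ballpark as the square root of 2^32 (65,536).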
No, it is not ok, because it's not
equality <=> hashcode equality.
It's just
equality => hashcode equality.
or in the other direction:
hashcode inequality => inequality.
Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
I would say, unless you want for Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different but share the same hash code. The probability may be small, but it isn't zero.
No, this is not an acceptable way to test for equality. It is very possible for two non-equal values to have the same hash code. This would cause your implementation of Equals to return true when it should return false.
You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.
If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.
For example:
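public override bool Equals(object obj)
{
    Class1 other = obj as Class1;
    // Different hash codes prove the objects differ, so bail out early.
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // The hash codes are the same, which proves nothing: do a full object compare.
    return A == other.A && B == other.B;
}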
You cannot say that just because the hash codes are equal then the objects must be equal.
The only time you would call GetHashCode inside of Equals is when it is much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you could quickly verify that the objects were not equal.
So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals overload, causing the C# compiler to warn me that I was overloading GetHashCode without overloading Equals.
There is one case where using hashcodes as a short-cut on equality comparisons makes sense.
Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).
There are various approaches one can take, but in all of them you have a small number of slots that the hashed values can be placed in, and you take either the open or the closed approach (confusingly, some people use these two terms with exactly the opposite meanings). If two different objects collide on the same slot, we can either store them both in that slot (with a linked list or similar holding the actual objects) or re-probe to pick a different slot (there are various strategies for this).
Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).
Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).
Now, when we come to insert our logic is going to be something like:
1. Get hash code for object.
2. Get slot for object.
3. If slot is empty, place object in it and return.
4. If slot contains equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
5. Try next slot according to collision strategy, and return to item 3 (perhaps resizing if we loop this too often).
So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:
private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
return ReferenceEquals(newItem, storedItem) // fast, false negatives, no false positives (only applicable to reference types)
||
(
newHash == oldHash // fast, false positives, no false negatives
&&
_cmp.Equals(newItem, storedItem) // slow for some types, but always correct result.
);
}
Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type were int then this would be a total waste. If our key type were string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length) then the saving could well be worth it.
Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.
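Putting the pieces together, here is a deliberately tiny linear-probing set that keeps each item's hash code next to it. It is a sketch under simplifying assumptions (no removal, no null items, load factor capped at one half), and TinyHashSet is an invented name rather than how the framework's HashSet<T> is implemented.

```csharp
using System;
using System.Collections.Generic;

// Minimal linear-probing set that stores each item's hash code alongside the item.
public sealed class TinyHashSet<T>
{
    private T[] _items = new T[8];
    private int[] _hashes = new int[8];
    private bool[] _used = new bool[8];
    private readonly IEqualityComparer<T> _cmp;

    public TinyHashSet(IEqualityComparer<T> cmp) { _cmp = cmp; }

    public int Count { get; private set; }

    public bool Add(T item)
    {
        if ((Count + 1) * 2 > _items.Length) Resize(); // keep at least half the slots empty
        return Insert(item, _cmp.GetHashCode(item) & 0x7FFFFFFF);
    }

    public bool Contains(T item)
    {
        int hash = _cmp.GetHashCode(item) & 0x7FFFFFFF;
        for (int slot = hash % _items.Length; _used[slot]; slot = (slot + 1) % _items.Length)
            if (_hashes[slot] == hash && _cmp.Equals(_items[slot], item)) return true;
        return false;
    }

    private bool Insert(T item, int hash)
    {
        int slot = hash % _items.Length;
        while (_used[slot])
        {
            // Cheap int comparison first; the comparer only runs when the hashes match.
            if (_hashes[slot] == hash && _cmp.Equals(_items[slot], item)) return false;
            slot = (slot + 1) % _items.Length;
        }
        _used[slot] = true;
        _hashes[slot] = hash;
        _items[slot] = item;
        Count++;
        return true;
    }

    private void Resize()
    {
        T[] oldItems = _items;
        int[] oldHashes = _hashes;
        bool[] oldUsed = _used;
        _items = new T[oldItems.Length * 2];
        _hashes = new int[oldItems.Length * 2];
        _used = new bool[oldItems.Length * 2];
        Count = 0;
        // The stored hashes are reused, so GetHashCode is not called again here.
        for (int i = 0; i < oldItems.Length; i++)
            if (oldUsed[i]) Insert(oldItems[i], oldHashes[i]);
    }
}

public static class TinyHashSetDemo
{
    public static void Main()
    {
        var set = new TinyHashSet<string>(StringComparer.OrdinalIgnoreCase);
        Console.WriteLine(set.Add("Apple")); // True
        Console.WriteLine(set.Add("APPLE")); // False: equal under the comparer
        Console.WriteLine(set.Count);        // 1
    }
}
```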
It's a wrong implementation; others have explained why.
You should short-circuit the equality check using GetHashCode, like:
if (other.GetHashCode() != this.GetHashCode())
return false;
in the Equals method only if you're certain the ensuing Equals implementation is much more expensive than GetHashCode, which is not so in the vast majority of cases.
In the implementation you have shown (as in 99% of cases) it's not only broken, it's also much slower. The reason? Computing the hash of your properties is almost certainly slower than comparing them, so you gain nothing in performance terms. The advantage of implementing a proper GetHashCode is that your class can be the key type of a hash table, where the hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if the object is in a collection. Even though GetHashCode itself should be fast, it's usually not faster than the equivalent Equals.
To benchmark, run your Equals (a proper implementation, taking out the current hash based implementation) and GetHashCode here
var watch = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
action(); //Equals and GetHashCode called here to test for performance.
}
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
