Implement GetHashCode on a class that has wildcard Equatability

Suppose I want to be able to compare 2 lists of ints and treat one particular value as a wild card.
e.g.
If -1 is a wild card, then
{1,2,3,4} == {1,2,-1,4} //returns true
And I'm writing a class to wrap all this logic, so it implements IEquatable and has the relevant logic in public override bool Equals()
But I have always thought that you more-or-less had to implement GetHashCode if you were overriding .Equals(). Granted it's not enforced by the compiler, but I've been under the impression that if you don't then you're doing it wrong.
Except I don't see how I can implement .GetHashCode() without either breaking its contract (objects that are Equal have different hashes), or just having the implementation be return 1.
Thoughts?

This implementation of Equals is already invalid, as it is not transitive. You should probably leave Equals with the default implementation, and write a new method like WildcardEquals (as suggested in the other answers here).
In general, whenever you have changed Equals, you must implement GetHashCode if you want to be able to store the objects in a hashtable (e.g. a Dictionary<TKey, TValue>) and have it work correctly. If you know for certain that the objects will never end up in a hashtable, then it is in theory optional (but it would be safer and clearer in that case to override it to throw a "NotSupportedException" or always return 0).
The general contract is to always implement GetHashCode if you override Equals, as you can't always be sure in advance that later users won't put your objects in hashtables.

In this case, I would create a new or extension method, WildcardEquals(other), instead of using the operators.
I wouldn't recommend hiding this kind of complexity.
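A minimal sketch of such a separate method, assuming the lists are exposed as IReadOnlyList<int> and that -1 is the wildcard, as in the question. The names WildcardListExtensions and Wildcard are made up for illustration:

```csharp
using System.Collections.Generic;

public static class WildcardListExtensions
{
    // Hypothetical wildcard marker, taken from the question's example.
    public const int Wildcard = -1;

    // True when both lists have the same length and every position either
    // holds equal values or holds the wildcard on at least one side.
    public static bool WildcardEquals(this IReadOnlyList<int> left, IReadOnlyList<int> right)
    {
        if (left == null || right == null) return false;
        if (left.Count != right.Count) return false;

        for (int i = 0; i < left.Count; i++)
        {
            bool eitherIsWildcard = left[i] == Wildcard || right[i] == Wildcard;
            if (!eitherIsWildcard && left[i] != right[i]) return false;
        }
        return true;
    }
}
```

With this in place, new[] { 1, 2, 3, 4 }.WildcardEquals(new[] { 1, 2, -1, 4 }) is true, while Equals and GetHashCode keep their default meaning. Note that {1,2,3,4} matches {1,2,-1,4} and {1,2,-1,4} matches {1,2,9,4}, yet {1,2,3,4} does not match {1,2,9,4}, which is exactly why this should not be called Equals.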

From a logical point of view, we break the concept of equality. It is not transitive any longer. So in case of wildcards, A==B and B==C does not mean that A==C.
From a technical point of view, returning the same value from GetHashCode() is not something unforgivable.

The only possible idea I see is to exploit at least the length, e.g.:
public override int GetHashCode()
{
    // Lists of different lengths can never match, so the length is safe to hash.
    return this.Length.GetHashCode();
}

It's recommended, but not mandatory at all. If you don't need that custom implementation of GetHashCode, just don't do it.

GetHashCode is generally only important if you're going to be storing elements of your class in some kind of collection, such as a set. If that's the case here then I don't think you're going to be able to achieve consistent semantics since, as @AlexD points out, equality is no longer transitive.
For example, (using string globs rather than integer lists) if you add the strings "A", "B", and "*" to a set, your set will end up with either one or two elements depending on the order you add them in.
If that's not what you want then I'd recommend putting the wildcard matching into a new method (e.g. EquivalentTo()) rather than overloading equality.
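The order dependence described above can be reproduced with a small sketch. GlobComparer and GlobSetDemo are hypothetical names; the comparer treats "*" as matching any string and returns a constant hash so that the hash-code rule is at least formally satisfied:

```csharp
using System.Collections.Generic;

// Hypothetical comparer: "*" compares equal to every string.
public sealed class GlobComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return x == "*" || y == "*" || x == y;
    }

    // Constant hash: the only value that is consistent with the wildcard.
    public int GetHashCode(string s)
    {
        return 0;
    }
}

public static class GlobSetDemo
{
    // Adds the items in the given order and reports how many the set kept.
    public static int CountAfterAdding(params string[] items)
    {
        var set = new HashSet<string>(new GlobComparer());
        foreach (var item in items) set.Add(item);
        return set.Count;
    }
}
```

GlobSetDemo.CountAfterAdding("A", "B", "*") returns 2, because "*" is rejected as a duplicate of an existing element, while GlobSetDemo.CountAfterAdding("*", "A", "B") returns 1, because both later strings match the "*" already in the set.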

Having GetHashCode() always return a constant is the only 'legal' way of fulfilling the equals/hashcode constraint.
It'll potentially be inefficient if you put it in a hashmap, or similar, but that might be fine (non-equal hashcodes imply non-equality, but equal hashcodes imply nothing).
I think this is the only possible valid option there. Hashcodes essentially exist as keys to look things up by quickly, and since your wildcard must match every item, its key for lookup must equal every item's key, so they must all be the same.
As others have noted though, this isn't what equals is normally for, and breaks assumptions that many other things may use for equals (such as transitivity - EDIT: turns out this is actually contractual requirement, so no-go), so it's definitely worth at least considering comparing these manually, or with an explicitly separate equality comparer.

Since you've changed what "equals" means (adding in wildcards changes things dramatically) then you're already outside the scope of the normal use of Equals and GetHashCode. It's just a recommendation and in your case it seems like it doesn't fit. So don't worry about it.
That said, make sure you're not using your class in places that might use GetHashCode. That can get you in a load of trouble and be hard to debug if you're not watching for it.

It is generally expected that Equals(Object) and IEquatable<T>.Equals(T) should implement equivalence relations, such that if X is observed to be equal to Y, and Y is observed to be equal to Z, and none of the items have been modified, X may be assumed to be equal to Z; additionally, if X is equal to Y and Y does not equal Z, then X may be assumed not to equal Z either. Wild-card and fuzzy comparison methods do not implement equivalence relations, and thus Equals should generally not be implemented with such semantics.
Many collections will kinda-sorta work with objects that implement Equals in a way that doesn't implement an equivalence relation, provided that any two objects that might compare equal to each other always return the same hash code. Doing this will often require that many things that would compare unequal to return the same hash code, though depending upon what types of wildcard are supported it may be possible to separate items to some degree.
For example, if the only wildcard which a particular string supports represents "arbitrary string of one or more digits", one could hash the string by converting all sequences of consecutive digits and/or string-of-digit wildcard characters into a single "string of digits" wildcard character. If # represents any digit, then the strings abc123, abc#, abc456, and abc#93#22#7 would all be hashed to the same value as abc#, but abc#b, abc123b, etc. could hash to a different value. Depending upon the distribution of strings, such distinctions may or may not yield better performance than returning a constant value.
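As a rough sketch of that normalization, assuming ASCII digits and '#' as the only wildcard character (the class and method names below are invented for illustration):

```csharp
using System;
using System.Text;

public static class DigitWildcardHash
{
    // Collapses every run of ASCII digits and/or '#' characters into a single '#'.
    public static string Normalize(string s)
    {
        var sb = new StringBuilder();
        bool inRun = false;
        foreach (char c in s)
        {
            bool digitOrWildcard = (c >= '0' && c <= '9') || c == '#';
            if (digitOrWildcard)
            {
                if (!inRun) sb.Append('#');
                inRun = true;
            }
            else
            {
                sb.Append(c);
                inRun = false;
            }
        }
        return sb.ToString();
    }

    // Strings that could match each other normalize to the same text,
    // so they also share a hash code.
    public static int Hash(string s)
    {
        return StringComparer.Ordinal.GetHashCode(Normalize(s));
    }
}
```

Here abc123, abc#, abc456 and abc#93#22#7 all normalize to abc#, while abc123b normalizes to abc#b and is therefore free to land in a different bucket.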
Note that even if one implements GetHashCode in such a fashion that equal objects yield equal hashes, some collections may still behave oddly if the equality method doesn't implement an equivalence relation. For example, if a collection foo contains items with keys "abc1" and "abc2", attempts to access foo["abc#"] might arbitrarily return the first item or the second. Attempts to delete the key "abc#" may arbitrarily remove one or both items, or may fail after deleting one item (its expected post-condition wouldn't be met, since abc# would be in the collection even after deletion).
Rather than trying to jinx Equals to compare hash-code equality, an alternative approach is to have a dictionary which holds for each possible wildcard string that would match at least one main-collection string a list of the strings it might possibly match. Thus, if there are many strings which would match abc#, they could all have different hash codes; if a user enters "abc#" as a search request, the system would look up "abc#" in the wild-card dictionary and receive a list of all strings matching that pattern, which could then be looked up individually in the main dictionary.

Related

Why does SortedSet<T> ignore CompareTo() values of 0 by default?

When I try to AddRange to a SortedSet, it doesn't add values when their compareto result is zero. This doesn't make any sense to me, as it's not a SortedSet of the values that compareTo is comparing, it's a SortedSet of T. I am trying to understand why Microsoft would implement it like this.
Is there any logical explanation to this that will help me remember in the future?
Is the correct way to implement IComparable on my type by not returning 0 from CompareTo? What happens if I need CompareTo to determine equality explicitly when I use its IComparable.CompareTo() method elsewhere in my code?

It would make a SortedSet non-deterministic
SortedSet
Represents a collection of objects that is maintained in sorted order.
If ties were allowed there would be an arbitrary order. It would be like duplicate keys in a dictionary.
If you want a set of T in which some items return 0 from CompareTo, then SortedSet is not the proper collection.
I get the feeling this is a bit of an XY problem. What are you trying to solve?

The whole point of the comparison is to say whether one item is greater than, less than or equal to another. A set doesn't allow equal values - and in a sorted set, equality is defined by the comparison. It's as simple as that.
For your second question, it sounds like you want to have two different comparisons - one which returns 0 in certain cases where the other one wouldn't. You can do that by implementing IComparer<T> in a separate class, as the separate comparison. Just bear in mind that you'll still want to return 0 for Compare(x, x) for example, otherwise you'd never be able to find anything in the set other than by iterating over it...
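A sketch of the "separate comparison" idea, using a hypothetical Player type: one comparer looks only at the score (so ties collapse in a SortedSet), the other breaks ties by name so that distinct players survive.

```csharp
using System.Collections.Generic;

// Hypothetical element type, for illustration only.
public sealed class Player
{
    public string Name { get; }
    public int Score { get; }

    public Player(string name, int score)
    {
        Name = name;
        Score = score;
    }
}

// Returns 0 for any two players with the same score.
public sealed class ScoreOnlyComparer : IComparer<Player>
{
    public int Compare(Player x, Player y)
    {
        return x.Score.CompareTo(y.Score);
    }
}

// Breaks score ties by name, so 0 is returned only for the same score and name.
public sealed class ScoreThenNameComparer : IComparer<Player>
{
    public int Compare(Player x, Player y)
    {
        int byScore = x.Score.CompareTo(y.Score);
        return byScore != 0 ? byScore : string.CompareOrdinal(x.Name, y.Name);
    }
}
```

A SortedSet<Player> built with ScoreOnlyComparer keeps only one of two players who share a score; the same items in a SortedSet<Player> built with ScoreThenNameComparer are both kept. Neither comparer handles null arguments, which is an omission made for brevity.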

Hashcodes are same, but != returns true?

I am stumped by a seemingly simple problem. I have two objects that I am comparing with a !=.
When I run the application, a != b is true.
When I put a breakpoint and do a Watch, a.GetHashCode() == b.GetHashCode() is true.
These two (reference type) objects are defined in a different assembly, but I cannot find an override to the != method (although GetHashCode is overridden). Is there another explanation for this? Could it be possible that a GetHashCode for two objects could be the same, but a not-overriden != would return true?
Thanks.

When two objects that are different return the same code it is called a "collision". With only ~4 billion possible integer values, and more than 4 billion possible values of [your class name here] some collisions are inevitable. This is why a hash based structure (i.e. Dictionary) can't rely entirely on GetHashCode, it also needs a sensible Equals implementation to be effective. The Equals method is what is used to resolve these collisions.
Of course, it's also possible that the creator of the class overrode either GetHashCode or Equals and made a mistake that violated the "contract" for generating hash codes. Here is one list of guidelines to keep in mind when creating your GetHashCode methods. Remember that there is a fairly small set of things that you have to do, and another set of things that can be done to make it work efficiently.
return 0; is actually a perfectly acceptable GetHashCode implementation. It conforms with all of the rules, it just has a 100% chance of causing collisions, so it will be extraordinarily inefficient and you shouldn't ever actually do that.

It is perfectly legal for two objects that are not equal to have the same hashcode, but it is not valid for two objects that are equal to have different hashcodes.
The Dictionary style collection classes use the hashcode value (the GetHashCode value returned from the object specified as the key) to put the key/value pair into a hashbin. All key/value pairs where the hashcode value is the same for the key go into the same hashbin. If the hashcode generation is effective, there will be very few (hopefully just one) key/value pairs in each non-empty hashbin in the dictionary.
When you access contents in the dictionary by specifying an object as a key, the pseudo logic for finding the correct value to be returned is:
Get the hashcode value for the object specified as the key in the request (GetHashCode())
If there is a non-empty hashbin for that hashcode, iterate over the key objects of all key/value pairs in that hashbin. For each key/value pair in the hashbin, check if the key object Equals() the object that was passed in as the key to the request. If so, return the Value object in that key/value pair.
This is what makes dictionary lookups very effective compared to looking for an object in a List style collection (when the hashcode distribution is good).
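The lookup steps above can be sketched as a toy hash map. This is not how Dictionary<TKey, TValue> is actually implemented; TinyHashMap is a made-up teaching class with a fixed number of bins and no duplicate-key check:

```csharp
using System.Collections.Generic;

public sealed class TinyHashMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] _bins;
    private readonly IEqualityComparer<TKey> _comparer = EqualityComparer<TKey>.Default;

    public TinyHashMap(int binCount = 16)
    {
        _bins = new List<KeyValuePair<TKey, TValue>>[binCount];
    }

    // Step 1: turn the key's hash code into a bin index.
    private int BinIndex(TKey key)
    {
        int hash = _comparer.GetHashCode(key) & 0x7FFFFFFF; // drop the sign bit
        return hash % _bins.Length;
    }

    // Simplification: does not check whether the key is already present.
    public void Add(TKey key, TValue value)
    {
        int i = BinIndex(key);
        if (_bins[i] == null) _bins[i] = new List<KeyValuePair<TKey, TValue>>();
        _bins[i].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    // Step 2: walk only that bin and use Equals to find the matching key.
    public bool TryGetValue(TKey key, out TValue value)
    {
        var bin = _bins[BinIndex(key)];
        if (bin != null)
        {
            foreach (var pair in bin)
            {
                if (_comparer.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}
```

If two equal keys returned different hash codes, TryGetValue could look in the wrong bin and never reach the Equals call at all.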

Why don't I ever have to override GetHashCode when using Dictionaries on personal classes?

It always seems to just "work" without ever having to do anything.
The only thing I can think of is that each class has a hidden sort of static identifier that Object.GetHashCode uses. (also, does anyone know how Object.GetHashCode is implemented? I couldn't find it in the .NET Reflector)
I have never overridden GetHashCode but I was reading around and people say you only need to when overriding Equals and providing custom equality checking to your application so I guess I'm fine?
I'd still like to know how the magic works, though =P

It always seems to just "work" without ever having to do anything.
You didn't tell us if you're using value types or reference types for your keys.
If you're using value types, the default implementation of Equals and GetHashCode are okay (Equals checks if the fields are equals, and GetHashCode is based on the fields (not necessarily all of them!)). If you're using reference types, the default implementation of Equals and GetHashCode use reference equality, which may or may not be okay; it depends on what you're doing.
The only thing I can think of is that each class has a hidden sort of static identifier that Object.GetHashCode uses.
No. The default is a hash code based on the fields for a value type, and the reference for a reference type.
(also, does anyone know how Object.GetHashCode is implemented? I couldn't find it in the .NET Reflector)
It's an implementation detail that you should never ever need to know, and never ever rely on it. It could change on you at any moment.
I have never overridden GetHashCode but I was reading around and people say you only need to when overriding Equals and providing custom equality checking to your application so I guess I'm fine?
Well, is default equality okay for you? If not, override Equals and GetHashCode or implement IEqualityComparer<T> for your T.
I'd still like to know how the magic works, though =P
Every object has Equals and GetHashCode. The default implementations are as follows:
For value types, Equals is value equality.
For reference types, Equals is reference equality.
For value types, GetHashCode is based on the fields (again, not necessarily all of them!).
For reference types, GetHashCode is based on the reference.
If you use an overload of the Dictionary constructor that doesn't take an IEqualityComparer<T> for your T, it will use EqualityComparer<T>.Default. This IEqualityComparer<T> just uses Equals and GetHashCode. So, if you haven't overridden them, you get the implementations as defined above. If you override Equals and GetHashCode then this is what EqualityComparer<T>.Default will use.
Otherwise, pass a custom implementation of IEqualityComparer<T> to the constructor for Dictionary.
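For illustration, a custom comparer might look like the sketch below. Employee and EmployeeIdComparer are hypothetical; the assumption is that two Employee instances should count as the same key whenever their Id matches:

```csharp
using System.Collections.Generic;

// Hypothetical key type that does not override Equals or GetHashCode.
public sealed class Employee
{
    public int Id { get; }
    public string Name { get; }

    public Employee(int id, string name)
    {
        Id = id;
        Name = name;
    }
}

// Equality and hashing are both based on Id only, so they stay consistent.
public sealed class EmployeeIdComparer : IEqualityComparer<Employee>
{
    public bool Equals(Employee x, Employee y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Id == y.Id;
    }

    public int GetHashCode(Employee e)
    {
        return e.Id.GetHashCode();
    }
}
```

A dictionary created with new Dictionary<Employee, string>(new EmployeeIdComparer()) will find an entry when looked up with a different Employee instance that has the same Id, whereas a dictionary created without the comparer falls back to reference equality and will not.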

Are you using your custom classes as keys or values? If you are using them only for values, then their GetHashCode doesn't matter.
If you are using them as keys, then the quality of the hash affects performance. The Dictionary stores a list of elements for each hash code, since the hash codes don't need to be unique. In the worst case scenario, if all of your keys end up having the same hash code, then the lookup time for the dictionary will be like that of a list, O(n), instead of that of a hash table, O(1).
The documentation for Object.GetHashCode is quite clear:
The default implementation of the GetHashCode method does not guarantee unique return values for different objects... Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.

Object's implementations of Equals() and GetHashCode() (which you're inheriting) compare by reference.
Object.GetHashCode is implemented in native code; you can see it in the SSCLI (Rotor).
Two different instances of a class will (usually) have different hashcodes, even if their properties are equal.
You only need to override them if you want to compare by value, that is, if you want different instances with the same properties to compare equal.

It really depends on your definition of Equality.
class Person
{
    public string Name { get; set; }
}

void Test()
{
    var joe1 = new Person { Name = "Joe" };
    var joe2 = new Person { Name = "Joe" };
    Assert.AreNotEqual(joe1, joe2);
}
If you have a different definition for equality, you should override Equals & GetHashCode to get the appropriate behavior.
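As a sketch of what that override could look like for the Person class above, assuming two people are considered equal when their names match exactly (ordinal comparison is an assumption, not something the answer specifies):

```csharp
using System;

public sealed class Person : IEquatable<Person>
{
    public string Name { get; set; }

    // Value equality: same name means same person (assumed definition).
    public bool Equals(Person other)
    {
        return !ReferenceEquals(other, null)
            && string.Equals(Name, other.Name, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Person);
    }

    // Built from the same field that Equals compares, so equal objects
    // always report the same hash code.
    public override int GetHashCode()
    {
        return Name == null ? 0 : StringComparer.Ordinal.GetHashCode(Name);
    }
}
```

With this version, joe1 and joe2 compare equal and a HashSet<Person> keeps only one of them. One caveat: Name is settable, so changing it while the object sits in a dictionary or set changes its hash code and can make the entry unreachable.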

Hash codes are for optimizing lookup performance in hash tables (dictionaries). While hash codes have a goal of colliding as little as possible between instances of objects they are not guaranteed to be unique. The goal should be equal distribution among the int range given a set of typical types of those objects.
The way hash tables work is that each object implements a function to compute a hash code, hopefully distributed as evenly as possible across the int range. Two different objects can produce the same hash code, but an instance of an object, given its data, should always produce the same hash code. Therefore, hash codes are not unique and should not be used for equality. The hash table allocates an array of size n (much smaller than the int range), and when an object is added to the hash table, it calls GetHashCode and the result is mod'd (%) against the size of the allocated array. For collisions in the table, typically a list of objects is chained. Since computing hash codes should be very fast, a lookup is fast: jump to the array offset and walk the chain. The larger the array (more memory), the fewer collisions and the faster the lookup.
Object's GetHashCode cannot possibly produce a good hash code because by definition it knows nothing about the concrete object that's inheriting from it. That's why, if you have custom objects that need to be placed in dictionaries and you want to optimize the lookups (control creating an even distribution with minimal collisions), you should override GetHashCode.
If you need to compare two items, then override Equals. If you need the object to be sortable (which is needed for sorted lists), then implement IComparable.
Hope that helps explain the difference.

C# HashSet with struct and string

I created the following code to verify the uniqueness of a series of "tuples":
struct MyTuple
{
    public MyTuple(string a, string b, string c)
    {
        ValA = a; ValB = b; ValC = c;
    }

    private string ValA;
    private string ValB;
    private string ValC;
}
...
HashSet<MyTuple> tupleList = new HashSet<MyTuple>();
If I'm correct, I will not end up with two tuples with the same values in my HashSet thanks to the fact that I'm using a struct. I could not have the same behavior with a class without implementing IEquatable or something like that (I didn't dig too much how to do that).
I want to know if there is some gotcha about what I do. Performance wise, I wouldn't expect the use of a struct to be a problem considering that the string inside are reference types.
Edit:
I want my HashSet to never contain two tuples having strings with the same values. In other words, I want the strings to behave like value types.

The gotcha is that it will not work. If two strings are "a", they can still be different references. That case would break your implementation.
Implement Equals() and GetHashCode() properly (e.g. using the ones from the supplied strings, and take care with NULL references in your struct), and possibly IEquatable<MyTuple> to make it even nicer.
Edit: The default implementation is explicitly not suitable to be used in hash tables and sets. This is clearly stated in the ValueType.GetHashCode() implementation (added emphasis):
The GetHashCode method applies to types derived from ValueType. One or more fields of the derived type is used to calculate the return value. If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. Additionally, if the value of one or more of those fields changes, the return value might become unsuitable for use as a key in a hash table. In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type.
You should always implement Equals() and GetHashCode() as a pair, and this is even more obvious here since ValueType.Equals() is terribly inefficient and unreliable (it uses reflection and an unknown method of equality comparison). There is also a performance problem when not overriding those two: structs will get boxed when calling the default implementations.
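One way to follow that advice for the MyTuple struct from the question is sketched below. The multiply-by-31 combination is a common convention rather than anything the answer prescribes, and the fields are made readonly as an extra precaution:

```csharp
using System;

public struct MyTuple : IEquatable<MyTuple>
{
    private readonly string ValA;
    private readonly string ValB;
    private readonly string ValC;

    public MyTuple(string a, string b, string c)
    {
        ValA = a; ValB = b; ValC = c;
    }

    // Compares the string contents; string.Equals(a, b) is null-safe.
    public bool Equals(MyTuple other)
    {
        return string.Equals(ValA, other.ValA)
            && string.Equals(ValB, other.ValB)
            && string.Equals(ValC, other.ValC);
    }

    public override bool Equals(object obj)
    {
        return obj is MyTuple other && Equals(other);
    }

    // Combines the three string hashes; null fields contribute 0.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (ValA == null ? 0 : ValA.GetHashCode());
            hash = hash * 31 + (ValB == null ? 0 : ValB.GetHashCode());
            hash = hash * 31 + (ValC == null ? 0 : ValC.GetHashCode());
            return hash;
        }
    }
}
```

Because the struct implements IEquatable<MyTuple>, HashSet<MyTuple> calls the typed Equals directly and avoids boxing on each comparison.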

Your approach should work, but you should make the string values read-only, as Lucero said.
You could also take a look at the new .NET 4.0 Tuple types. Although they are implemented as classes (because of supporting up to quite many parameters), they implement the new interface IStructuralEquatable which is intended exactly for your purpose.

It depends on what you mean by "same values". If you want the strings themselves to be unequal to one another—rather than just the references—then you're going to have to write your own Equals(MyTuple) method.

What can go wrong if one fails to override GetHashCode() when overriding Equals()? [duplicate]

In C#, what specifically can go wrong if one fails to override GetHashCode() when overriding Equals()?

The most visible way is for mapping structures.
Any class which does this will have unpredictable behavior when used as the Key for a Dictionary or HashTable. The reason is that the implementation uses both GetHashCode and Equals to properly find a value in the table. The short version of the algorithm is the following:
Take the modulus of the HashCode by the number of buckets and that's the bucket index
Call .Equals() for the specified Key and every Key in the particular bucket.
If there is a match that is the value, no match = no value.
Failing to keep GetHashCode and Equals in sync will completely break this algorithm (and numerous others).

Think of a hash / dictionary structure as a collection of numbered buckets. If you always put things in the bucket corresponding to their GetHashCode(), then you only have to search one bucket (using Equals()) to see if something is there. This works, provided you're looking in the right bucket.
So the rule is: if Equals() says two objects are Equal(), they must have the same GetHashCode().
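A small sketch of what happens when that rule is broken. To keep the outcome deterministic, the hypothetical BadKey class below returns a per-instance counter from GetHashCode as a stand-in for the inherited, identity-based hash that a class gets when it overrides only Equals:

```csharp
using System.Collections.Generic;

public sealed class BadKey
{
    private static int _nextId;
    private readonly int _id = ++_nextId;

    public string Name { get; }

    public BadKey(string name)
    {
        Name = name;
    }

    // Value equality on Name...
    public override bool Equals(object obj)
    {
        return obj is BadKey other && Name == other.Name;
    }

    // ...but a hash that differs per instance (simulating the default),
    // so equal objects report different hash codes.
    public override int GetHashCode()
    {
        return _id;
    }
}

public static class BadKeyDemo
{
    // Stores a value under one key, then looks it up with an equal key.
    public static bool LookupFindsEqualKey()
    {
        var map = new Dictionary<BadKey, int>();
        map[new BadKey("x")] = 1;
        return map.ContainsKey(new BadKey("x"));
    }
}
```

BadKeyDemo.LookupFindsEqualKey() returns false even though the two keys are Equal: the dictionary compares hash codes before it ever calls Equals, so the stored entry is never considered a candidate.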

If you do not override GetHashCode, anything that compares your objects may get it wrong.
As it is documented that GetHashCode must return the same value if two instances are equal, then it is the prerogative of any code which wishes to test them for equality to use GetHashCode as a first pass to group objects which may be equal (as it knows that objects with different hash codes cannot be equal). If your GetHashCode method returns different values for equal objects then they may get into different groups in the first pass and never be compared using their Equals method.
This could affect any collection-type data structure, but would be particularly problematic in hash-code based ones such as dictionaries and hash-sets.
In short: Always override GetHashCode when you override Equals, and ensure their implementations are consistent.

Any algorithm that uses the Key will fail to work, assuming it relies on the intended behaviour of hash keys.
Two objects that are Equal should have the same hash key value, which is not remotely guaranteed by the default implementation.
