Questions about IEqualityComparer<T> / List<T>.Distinct()

Here is the equality comparer I just wrote because I wanted a distinct set of items from a list containing entities.
class InvoiceComparer : IEqualityComparer<Invoice>
{
public bool Equals(Invoice x, Invoice y)
{
// A
if (Object.ReferenceEquals(x, y)) return true;
// B
if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null)) return false;
// C
return x.TxnID == y.TxnID;
}
public int GetHashCode(Invoice obj)
{
if (Object.ReferenceEquals(obj, null)) return 0;
return obj.TxnID.GetHashCode(); // hash the same member that Equals compares
}
}
1. Why does Distinct require a comparer as opposed to a Func<T,T,bool>?
2. Are (A) and (B) anything other than optimizations, and are there scenarios in which they would not act as expected, due to subtleties in comparing references?
3. If I wanted to, could I replace (C) with
return GetHashCode(x) == GetHashCode(y)

1. So it can use hash codes to be O(n) as opposed to O(n^2).
2. (A) is an optimization.
(B) is necessary; otherwise, it would throw a NullReferenceException.
If Invoice is a struct, however, they're both unnecessary and slower.
3. No. Hash codes are not unique.
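
To make the points above concrete, here is a minimal sketch of the comparer being passed to Distinct. The Invoice class is a hypothetical stand-in (only TxnID is taken from the question; Amount is invented for illustration), and the Demo helper is not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Invoice entity; only TxnID comes from the question.
class Invoice
{
    public string TxnID { get; set; }
    public decimal Amount { get; set; }
}

class InvoiceComparer : IEqualityComparer<Invoice>
{
    public bool Equals(Invoice x, Invoice y)
    {
        if (ReferenceEquals(x, y)) return true;    // (A) same instance, or both null
        if (x is null || y is null) return false;  // (B) exactly one side is null
        return x.TxnID == y.TxnID;                 // (C) a real comparison, not a hash comparison
    }

    public int GetHashCode(Invoice obj)
    {
        // Hash the same member that Equals compares; null maps to 0.
        return obj?.TxnID?.GetHashCode() ?? 0;
    }
}

static class Demo
{
    // Distinct uses GetHashCode to group candidates and Equals to confirm matches.
    public static int CountDistinct()
    {
        var invoices = new List<Invoice>
        {
            new Invoice { TxnID = "A-1", Amount = 10m },
            new Invoice { TxnID = "A-1", Amount = 10m }, // separate instance, same TxnID
            new Invoice { TxnID = "B-2", Amount = 25m },
        };
        return invoices.Distinct(new InvoiceComparer()).Count();
    }
}
```

With this data, Demo.CountDistinct() returns 2, because the two "A-1" invoices hash alike and then compare as equal.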

A is a simple and quick way to check whether both references point to the same object.
B - if one of the references is null, obviously it does not make any sense to do an equality comparison.
C - no, sometimes GetHashCode() can return the same value for different objects (hash collision), so you should do a real equality comparison.
Regarding the same hash code value for different objects, MSDN:
If two objects compare as equal, the GetHashCode method for each
object must return the same value. However, if two objects do not
compare as equal, the GetHashCode methods for the two objects do not
have to return different values.

Distinct() basically works on the notion of "not equal". Therefore, if your list contains types that do not define value equality themselves, you must supply your own equality comparer.
At A, you check whether the objects are identical. Two equal objects don't have to be identical, but if they are identical, you can be sure that they are equal. So part A can make the method more efficient in some cases.

Related

Understanding behavior & overriding GetHashCode()

I tried to follow the guidelines from MSDN and also referred to this great question, but the following does not seem to behave as expected.
I'm trying to represent a structure similar to an FQN (fully qualified name): if P1 is listed before P2, then P2 exists only within the set belonging to P1, much like how scoping works.
On the subject of GetHashCode()
I have a class with properties like this.
class data{
public readonly string p1, p2;
public data(string p1, string p2) {
this.p1 = p1;
this.p2 = p2;
}
public override int GetHashCode()
{
return this.p1.GetHashCode() ^ this.p2.GetHashCode();
}
/*also show the equal for comparison*/
public override bool Equals(System.Object obj)
{
if (obj == null)
return false;
data d = obj as data;
if ((System.Object)d == null)
return false;
/*I thought this would be smart*/
return d.ToString() == this.ToString();
}
public override string ToString() {
return "[" + p1 +"][" + p2+ "]";
}
}
In a Dictionary (dict) I use data as a key, so this would make the scope look like d1.p1.p2 (or rather p2 of p1 of d1, however you prefer to imagine it)
Dictionary<data,int> dict = new Dictionary<data,int>();
I've examined the behavior when d1.p1 and another d2.p1 are different, the operation resolves correctly. However when d1.p1 and d2.p1 are the same and p2 of d1 and d2 are different I observe the following behavior.
data d1 = new data(){ p1="p1", p2="p2" };
data d2 = new data(){ p1="p1", p2="XX" };
dict.add(d1, 0);
dict.add(d2, 1);
dict[d1] = 4;
The result is that both elements are 4
Is GetHashCode() overridden correctly?
Is Equals overridden correctly?
If they are both fine how/why does this behavior happen?
On the subject of a Dictionary
In the Watch window (VS2013), my dictionary's key list does not show a single key per index as I would normally expect; instead, each property of my data object appears as a key for a single index. So I'm not sure whether the problem lies there or whether I'm just misunderstanding the Watch window's representation of an object used as a key. I know that this is how VS displays an object, but I'm not certain that's how I would expect it to be displayed for a key in a dictionary.
I thought GetHashCode() was a Dictionary's primary "comparison" operation, is this always correct?
What's the real "Index" to a Dictionary where the key is an object?
UPDATE
After looking at each hash code directly I noticed that they do repeat. Yet the Dictionary does not determine that the index exists. Below is an example of the data I see.
1132917379 string: [ABC][ABC]
-565659420 string: [ABC][123]
-1936108909 string: [123][123]
//second loop with next set of strings
1132917379 string: [xxx][xxx]
-565659420 string: [xxx][yyy]
//...etc

Is GetHashCode() overridden correctly?
Sure, for some definition of "correct". It may not be overridden well, but it's not an incorrect implementation (since two instances of the class that are considered equal will hash to the same value). Of course with that requirement you could always just return 0 from GetHashCode and it would be "correct". It certainly wouldn't be good.
That said, your particular implementation is not as good as it could be. For example, in your class, the order of the strings matters, i.e. new data( "A", "B" ) != new data( "B", "A" ). However, these will always hash equal because your GetHashCode implementation is symmetric. Better to break the symmetry in some fashion. For example:
public override int GetHashCode()
{
return p1.GetHashCode() ^ ( 13 * p2.GetHashCode() );
}
Now it's less likely that there will be a collision for two instances that are not equal.
Is Equals overridden correctly?
Well, it can definitely be improved. For example, the first null check is redundant and so is the cast to object in the second comparison. The whole thing would probably be better written as:
public override bool Equals( object obj )
{
var other = obj as data;
if( other == null ) return false;
return p1 == other.p1 && p2 == other.p2;
}
I also removed the call to ToString since it doesn't significantly simplify the code or make it more readable. It's also an inefficient way of performing the comparison, since you have to construct two new strings before the comparison even happens. Just comparing the members directly gives you more opportunities for an early out and, more importantly, is a lot easier to read (the actual equality implementation doesn't depend on the string representation).
If they are both fine how/why does this behavior happen?
I don't know, because the code you've given won't do this. It also won't compile. Your data class has two readonly fields, you can't assign those using an initializer list as you've shown in your last code snippet.
I can only speculate as to the behavior you're seeing, because nothing you've shown here would result in the behavior as described.
The best advice I can give is to make sure that your key class is not mutable. Mutable types will not play nice with Dictionary. The Dictionary class does not expect the hash codes of objects to change, so if GetHashCode depends on any part of your class that is mutable, it's very possible for things to get very messed up.
I thought GetHashCode() was a Dictionary's primary "comparison" operation, is this always correct?
Dictionary only uses GetHashCode as a way to "address" objects (to be specific the hash code is used to identify which bucket an item should be placed in). It doesn't use it directly as a comparison necessarily. And if it does, it can only use it to determine that two objects are not equal, it can't use it to determine if they are equal.
What's the real "Index" to a Dictionary where the key is an object?
I'm not entirely sure what you're asking here, but I'm inclined to say that the answer is that it doesn't matter. Where the item goes is unimportant. If you care about that sort of thing, you probably shouldn't be using a Dictionary.
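
As a rough illustration of the advice above, the following sketch uses an immutable two-part key in a Dictionary and repeats the question's sequence of operations. ScopeKey and DictionaryDemo are hypothetical names, and the multiplier 397 is an arbitrary choice to break the symmetry of the XOR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical immutable two-part key, modelled on the question's data class.
sealed class ScopeKey : IEquatable<ScopeKey>
{
    public string P1 { get; }
    public string P2 { get; }

    public ScopeKey(string p1, string p2)
    {
        P1 = p1 ?? throw new ArgumentNullException(nameof(p1));
        P2 = p2 ?? throw new ArgumentNullException(nameof(p2));
    }

    // Compare the members directly; no string building required.
    public bool Equals(ScopeKey other) =>
        other != null && P1 == other.P1 && P2 == other.P2;

    public override bool Equals(object obj) => Equals(obj as ScopeKey);

    public override int GetHashCode()
    {
        // Multiplying one side breaks the symmetry of a plain XOR.
        unchecked { return (P1.GetHashCode() * 397) ^ P2.GetHashCode(); }
    }
}

static class DictionaryDemo
{
    public static (int first, int second) Run()
    {
        var d1 = new ScopeKey("p1", "p2");
        var d2 = new ScopeKey("p1", "XX");
        var dict = new Dictionary<ScopeKey, int>();
        dict.Add(d1, 0);
        dict.Add(d2, 1);
        dict[d1] = 4; // only the entry for d1 changes
        return (dict[d1], dict[d2]);
    }
}
```

DictionaryDemo.Run() returns (4, 1): the hash code only selects a bucket, and Equals decides which entry is actually updated, so the two keys stay separate.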

Is GetHashCode() overridden correctly?
No. You allow passing null for p1 or p2 and null.GetHashCode() throws a NullReferenceException which is not allowed in GetHashCode. Either forbid passing null, or make GetHashCode return an int for nulls (zero is OK).
You are also XORing unaltered ints; this means every instance that contains two of the same values will have a hash code of zero. This is a very common collision; typically one multiplies each hash code by a prime number to avoid this.
Is Equals overridden correctly?
No. The page you linked to is the non-generic Equals used by System.Collections.Hashtable. You are using the generic System.Collections.Generic.Dictionary, whose default comparer uses the generic IEquatable<T> when the key type implements it. You need to implement IEquatable<data> as explained in the accepted answer to the SO question you posted.
It is true that the default comparer will fall back to Equals(System.Object obj) if IEquatable<data> is not implemented, but do not rely on that behavior. Also, building strings just to compare them is not “smart”. Any time you feel you should write a comment excusing something as “smart”, you are probably making a mistake.
A better implementation of 'data' would be:
public class MatPair : IEquatable<MatPair>
{
public readonly string MatNeedsToExplainWhatThisRepresents;
public readonly string MatNeedsToExplainThisToo;
public MatPair(string matNeedsToExplainWhatThisRepresents,
string matNeedsToExplainThisToo)
{
if (matNeedsToExplainWhatThisRepresents == null) throw new ArgumentNullException("matNeedsToExplainWhatThisRepresents");
if (matNeedsToExplainThisToo == null) throw new ArgumentNullException("matNeedsToExplainThisToo");
this.MatNeedsToExplainWhatThisRepresents = matNeedsToExplainWhatThisRepresents;
this.MatNeedsToExplainThisToo = matNeedsToExplainThisToo;
}
[Obsolete]
public override bool Equals(object obj)
{
return Equals(obj as MatPair);
}
public bool Equals(MatPair matPair)
{
return matPair != null
&& matPair.MatNeedsToExplainWhatThisRepresents == MatNeedsToExplainWhatThisRepresents
&& matPair.MatNeedsToExplainThisToo == MatNeedsToExplainThisToo;
}
public override int GetHashCode()
{
unchecked
{
return MatNeedsToExplainWhatThisRepresents.GetHashCode() * 31
^ MatNeedsToExplainThisToo.GetHashCode();
}
}
public override string ToString()
{
return "{" + MatNeedsToExplainWhatThisRepresents + ", "
+ MatNeedsToExplainThisToo + "}";
}
}

Can I compare two arbitrary references in a way that can be used in a CompareTo method?

I'm writing an implementation of IComparable<T>.CompareTo(T) for a struct. I'm doing a member-wise comparison (i.e. d = a.CompareTo(other.a); if (d != 0) { return d; } etc.), but one of the members is of a class (let's call it Y) that doesn't implement IComparable or have any other reasonable way of comparing so I just want to compare the references (in this case I know that all instances of Y are unique). Is there a way of doing that that will yield an int that is suitable for use in a comparison method?

It's not meaningful to compare references looking for order relationships. It's only meaningful to look for equality.

Your statement that the class "doesn't implement IComparable or have any other reasonable way of comparing" seems to contradict finding "a way of doing that that will yield an int that is suitable for use in a comparison method".
If the objects cannot be reasonably compared for ordering, it is best to exclude them from the comparison logic entirely.

Comparing the results of Object.GetHashCode() works well enough for my purposes (I just care about reference equality and it doesn't matter where anything unequal is sorted. I want instances of T that have the same instances of Y as members to be next to each other in the sorted result). My T.CompareTo() uses the following helper:
static int Compare(object a, object b)
{
if (a == null)
{
return b == null ? 0 : 1;
}
if (b == null)
{
return -1;
}
return a.GetHashCode().CompareTo(b.GetHashCode());
}
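
A variation on the helper above, offered only as a sketch: if Y (or a subclass) ever overrides GetHashCode(), RuntimeHelpers.GetHashCode can be used to get the identity-based hash regardless of overrides. ReferenceOrder is a hypothetical name. The same caveat applies as before: two distinct objects can share a hash code, so a result of 0 does not prove reference equality.

```csharp
using System;
using System.Runtime.CompilerServices;

static class ReferenceOrder
{
    // Orders by identity-based hash code; nulls sort last, as in the helper above.
    // A result of 0 for two different instances is possible (hash collision).
    public static int Compare(object a, object b)
    {
        if (ReferenceEquals(a, b)) return 0; // same instance, or both null
        if (a == null) return 1;
        if (b == null) return -1;
        return RuntimeHelpers.GetHashCode(a).CompareTo(RuntimeHelpers.GetHashCode(b));
    }
}
```

The ordering is stable for the lifetime of the objects within one process, but it is arbitrary and should not be persisted or compared across runs.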

GetHashCode() problem using xor

My understanding is that you're typically supposed to use xor with GetHashCode() to produce an int to identify your data by its value (as opposed to by its reference). Here's a simple example:
class Foo
{
int m_a;
int m_b;
public int A
{
get { return m_a; }
set { m_a = value; }
}
public int B
{
get { return m_b; }
set { m_b = value; }
}
public Foo(int a, int b)
{
m_a = a;
m_b = b;
}
public override int GetHashCode()
{
return A ^ B;
}
public override bool Equals(object obj)
{
return this.GetHashCode() == obj.GetHashCode();
}
}
The idea being, I want to compare one instance of Foo to another based on the value of properties A and B. If Foo1.A == Foo2.A and Foo1.B == Foo2.B, then we have equality.
Here's the problem:
Foo one = new Foo(1, 2);
Foo two = new Foo(2, 1);
if (one.Equals(two)) { ... } // This is true!
These both produce a value of 3 for GetHashCode(), causing Equals() to return true. Obviously, this is a trivial example, and with only two properties I could simply compare the individual properties in the Equals() method. However, with a more complex class this would get out of hand quickly.
I know that sometimes it makes good sense to set the hash code only once, and always return the same value. However, for mutable objects where an evaluation of equality is necessary, I don't think this is reasonable.
What's the best way to handle property values that could easily be interchanged when implementing GetHashCode()?
See Also
What is the best algorithm for an overridden System.Object.GetHashCode?

First off - Do not implement Equals() only in terms of GetHashCode() - hashcodes will sometimes collide even when objects are not equal.
The contract for GetHashCode() includes the following:
different hashcodes means that objects are definitely not equal
same hashcodes means objects might be equal (but possibly might not)
Andrew Hare suggested I incorporate his answer:
I would recommend that you read this solution (by our very own Jon Skeet, by the way) for a "better" way to calculate a hashcode.
No, the above is relatively slow and
doesn't help a lot. Some people use
XOR (eg a ^ b ^ c) but I prefer the
kind of method shown in Josh Bloch's
"Effective Java":
public override int GetHashCode()
{
int hash = 23;
hash = hash*37 + craneCounterweightID;
hash = hash*37 + trailerID;
hash = hash*37 + craneConfigurationTypeCode.GetHashCode();
return hash;
}
The 23 and 37 are arbitrary numbers
which are co-prime.
The benefit of the above over the XOR
method is that if you have a type
which has two values which are
frequently the same, XORing those
values will always give the same
result (0) whereas the above will
differentiate between them unless
you're very unlucky.
As mentioned in the above snippet, you might also want to look at Joshua Bloch's book, Effective Java, which contains a nice treatment of the subject (the hashcode discussion applies to .NET as well).
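
Applied to the Foo class from the question, the two recommendations above (compare fields in Equals, combine hashes with multiply-and-add) might look like the sketch below. Making Foo immutable is an assumption made here for simplicity; the question's version has setters.

```csharp
using System;

sealed class Foo
{
    public int A { get; }
    public int B { get; }

    public Foo(int a, int b) { A = a; B = b; }

    // Equality is decided by the fields, never by the hash code.
    public override bool Equals(object obj)
    {
        var other = obj as Foo;
        return other != null && A == other.A && B == other.B;
    }

    // Multiply-and-add combination; the order of the fields affects the result.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 23;
            hash = hash * 37 + A;
            hash = hash * 37 + B;
            return hash;
        }
    }
}
```

With these constants, new Foo(1, 2) hashes to 31526 and new Foo(2, 1) hashes to 31562, and Equals returns false for the pair even if some other pair of values does happen to collide.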

Andrew has posted a good example for generating a better hash code, but also bear in mind that you shouldn't use hash codes as an equality check, since they are not guaranteed to be unique.
For a trivial example of why this is, consider a double. It has more possible values than an int, so it is impossible to have a unique int for each double. Hashes are really just a first pass, used in situations like a dictionary where you need to find the key quickly: by first comparing hashes, a large percentage of the possible keys can be ruled out, and only the keys with matching hashes need to incur the expense of a full equality check (or other collision resolution methods).

Hashing always involves collisions and you have to deal with them (e.g., compare hash values and, if they are equal, compare the values inside the objects exactly to be sure the objects are equal).
Using a simple XOR, you'll get many collisions. If you want fewer, use some mathematical functions that distribute values across different bits (bit shifts, multiplying by primes, etc.).

Read Overriding GetHashCode for mutable objects? C# and think about implementing IEquatable<T>

There are several better hash implementations. FNV hash for example.

Out of curiosity since hashcodes are typically a bad idea for comparison, wouldn't it be better to just do the following code, or am I missing something?
public override bool Equals(object obj)
{
bool isEqual = false;
Foo otherFoo = obj as Foo;
if (otherFoo != null)
{
isEqual = (this.A == otherFoo.A) && (this.B == otherFoo.B);
}
return isEqual;
}

A quick way to generate a hash code (note that a plain XOR is symmetric, so swapping A and B still produces the same value):
public override int GetHashCode()
{
return A.GetHashCode() ^ B.GetHashCode(); // XOR
}

When, typically, do you use == equality on a reference typed variable in Java / C#?

As a kind of follow up to the question titled "Difference Between Equals and ==": in what kind of situations would you find yourself testing for reference equality in Java / C#?

Consolidating answers ...
When, typically, do you use ==
equality on a reference typed variable
in Java / C#?
1. To check for null:
if (a == null) ...
2. For efficiency when you are constructing an equals implementation:
boolean equals(Object o) {
if (o == null) return false;
if (this == o) return true;
// Some people would prefer "if (!(o instanceof ClassName)) ..."
if (getClass() != o.getClass()) return false;
// Otherwise, cast o, leverage super.equals() when possible, and
// compare each instance variable ...
3. For efficiency when you are comparing enums or comparing objects of a class designed such that comparing object identity is equivalent to checking object equivalence (e.g. Class objects):
enum Suit { DIAMONDS, HEARTS, CLUBS, SPADES }
class SomeCardGame {
...
static boolean isATrumpCard(Card c) {
return (c.getSuit() == TRUMP_SUIT);
}
}
4. When you really intend to check object identity, not object equivalence, e.g. a test case that wants to make sure a class isn't giving up a reference to an internal data structure instance.
boolean iChoseNotToUseJUnitForSomeReasonTestCase() {
final List<String> externalList = testObject.getCopyOfList();
final List<String> instanceVariableValue =
getInstanceVariableValueViaCleverReflectionTrickTo(testObject, "list");
if (instanceVariableValue == externalList) {
System.out.println("fail");
} else {
System.out.println("pass");
}
}
Interestingly, for point #3, one article suggests that using == is safer than using .equals() because the compiler will complain if you attempt to compare two object references that are not of the same class (http://schneide.wordpress.com/2008/09/22/or-equals/).

For Java, most commonly to see if a reference is null:
if (someReference == null) {
//do something
}
It is also pretty common with enums, but the most common place to use it that I've seen is in a properly implemented equals method. The first check will check for reference equality and only do the more expensive calculation if that returns false.

Because it's so fast and simple, == is routinely part of the Equals() function. If two objects are ==, then they're Equals, and you don't need any further processing. For primitives in Java, == is your only option (I don't think that's true of C#, but am not sure). I'll use == to test for null (although more frequently it's !=). Other than that... Stick with Equals(). It's rare and probably a sign of trouble to need == over Equals.

You can use it for quick equality checks.
For example, if you are comparing some very large objects that may take a while to iterate through, you can save time by first checking whether they refer to the same object:
bool Compare(LargeObj a, LargeObj b) {
if (a == b)
return true;
// do other comparisons here
}

In C#, the == operator may be overloaded, which would mean that you can compare reference types for LOGICAL equality.
Any time the implementer of a class thought that it would be logical to compare equality, you may do so using their methods.
This being said, be careful when checking if (a == b) in C#. This MAY be a reference equals, or it MAY be a logical comparison.
In order to keep the two cases distinct, it could be a good idea to use the two following methods, just to be clear:
object.ReferenceEquals(a, b)
a.Equals(b)
The first one is a reference comparison, and the second is usually a logical one.
Here is a typical == overload, just for reference:
public static bool operator ==(RRWinRecord lhs, RRWinRecord rhs)
{
if (object.ReferenceEquals(lhs, rhs))
{
return true;
}
else if ((object)lhs == null || (object)rhs == null)
{
return false;
}
else
{
return lhs.Wins == rhs.Wins &&
lhs.Losses == rhs.Losses &&
lhs.Draws == rhs.Draws &&
lhs.OverallScore == rhs.OverallScore;
}
}
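
The difference between the overloaded operator and a reference comparison can be seen with System.String, which overloads ==. This is a small sketch; EqualityKinds is a hypothetical helper name.

```csharp
using System;

static class EqualityKinds
{
    public static (bool byOperator, bool byReference, bool byEquals) Compare()
    {
        string a = "hello";
        // Built from a char array, so this is a distinct instance with the same text.
        string b = new string(new[] { 'h', 'e', 'l', 'l', 'o' });

        bool byOperator = a == b;                        // string overloads ==: compares characters
        bool byReference = object.ReferenceEquals(a, b); // identity only
        bool byEquals = a.Equals(b);                     // logical comparison
        return (byOperator, byReference, byEquals);
    }
}
```

EqualityKinds.Compare() returns (true, false, true): the two strings are logically equal but are not the same object.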

Try to avoid == with objects that are normally defined with final static and then passed as members in classes that are serialized.
For example (before we had enums) I had a class defined that simulated the idea of an enumeration. Since the constructor was private and all allowed instances were defined as final statics I wrongly assumed that == was always safe to use on these objects.
For example (typing code without compiling, so excuse me if there are some issues with the code)
public class CarType implements Serializable {
public final static CarType DODGE = new CarType("Dodge");
public final static CarType JEEP = new CarType("JEEP");
private final String mBrand;
private CarType( String pBrand ) {
mBrand = pBrand;
}
public boolean equals( Object pOther ) {
...
}
}
The CarType instances were serialized when used in other objects... but after deserialization (on a different JVM instance) the == comparison failed.

Basically, you can use == when you "know" that "==" is equivalent to ".equals" (in Java).
By doing so, you can gain performance.
A simple example where you'll see this is in a .equals method.
public boolean equals(Object o) {
// early short circuit test
if (this == o) {
return true;
}
// rest of equals method...
}
But in some algorithms, you have good knowledge of the objects you're working with, so you can rely on == for tests. Such as in containers, working with graphs, etc.

My personal preference (C#) is to always leave == to mean reference equality (referring to the same object). If I need logical equivalence, I'll use Equals<T>. If I follow this rule uniformly, then there's never any confusion.

Why is it important to override GetHashCode when Equals method is overridden?

Given the following class
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Which is preferred?
return base.GetHashCode();
//return this.FooId.GetHashCode();
}
}
I have overridden the Equals method because Foo represent a row for the Foos table. Which is the preferred method for overriding the GetHashCode?
Why is it important to override GetHashCode?

Yes, it is important if your item will be used as a key in a dictionary, or HashSet<T>, etc - since this is used (in the absence of a custom IEqualityComparer<T>) to group items into buckets. If the hash-code for two items does not match, they may never be considered equal (Equals will simply never be called).
The GetHashCode() method should reflect the Equals logic; the rules are:
if two things are equal (Equals(...) == true) then they must return the same value for GetHashCode()
if the GetHashCode() is equal, it is not necessary for them to be the same; this is a collision, and Equals will be called to see if it is a real equality or not.
In this case, it looks like "return FooId;" is a suitable GetHashCode() implementation. If you are testing multiple properties, it is common to combine them using code like below, to reduce diagonal collisions (i.e. so that new Foo(3,5) has a different hash-code to new Foo(5,3)):
In modern frameworks, the HashCode type has methods to help you create a hashcode from multiple values; on older frameworks, you'd need to go without, so something like:
unchecked // only needed if you're compiling with arithmetic checks enabled
{ // (the default compiler behaviour is *disabled*, so most folks won't need this)
int hash = 13;
hash = (hash * 7) + field1.GetHashCode();
hash = (hash * 7) + field2.GetHashCode();
...
return hash;
}
Oh - for convenience, you might also consider providing == and != operators when overriding Equals and GetHashCode.
A demonstration of what happens when you get this wrong is here.
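
For the "modern frameworks" case mentioned above, a minimal sketch using System.HashCode.Combine (available in .NET Core 2.1 and later) could look like this. The Pair type is hypothetical; unlike the question's Foo, its Equals compares both members, so both are fed into the hash.

```csharp
using System;

sealed class Pair : IEquatable<Pair>
{
    public int Id { get; }
    public string Name { get; }

    public Pair(int id, string name) { Id = id; Name = name; }

    public bool Equals(Pair other) =>
        other != null && Id == other.Id && Name == other.Name;

    public override bool Equals(object obj) => Equals(obj as Pair);

    // Combine exactly the members that Equals compares.
    public override int GetHashCode() => HashCode.Combine(Id, Name);
}
```

The values produced by HashCode.Combine are not stable across processes, so they should only be used in memory, never stored or sent elsewhere.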

It's actually very hard to implement GetHashCode() correctly because, in addition to the rules Marc already mentioned, the hash code should not change during the lifetime of an object. Therefore the fields which are used to calculate the hash code must be immutable.
I finally found a solution to this problem when I was working with NHibernate.
My approach is to calculate the hash code from the ID of the object. The ID can only be set through the constructor, so if you want to change the ID, which is very unlikely, you have to create a new object which has a new ID and therefore a new hash code. This approach works best with GUIDs because you can provide a parameterless constructor which randomly generates an ID.

By overriding Equals you're basically stating that you know better how to compare two instances of a given type.
Below you can see an example of how ReSharper writes a GetHashCode() function for you. Note that this snippet is meant to be tweaked by the programmer:
public override int GetHashCode()
{
unchecked
{
var result = 0;
result = (result * 397) ^ m_someVar1;
result = (result * 397) ^ m_someVar2;
result = (result * 397) ^ m_someVar3;
result = (result * 397) ^ m_someVar4;
return result;
}
}
As you can see it just tries to guess a good hash code based on all the fields in the class, but if you know your object's domain or value ranges you could still provide a better one.

Please don't forget to check the obj parameter against null when overriding Equals().
And also compare the type.
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
The reason for this is: Equals must return false on comparison to null. See also http://msdn.microsoft.com/en-us/library/bsc2ak47.aspx

How about:
public override int GetHashCode()
{
return string.Format("{0}_{1}_{2}", prop1, prop2, prop3).GetHashCode();
}
Assuming performance is not an issue :)

As of .NET 4.7 the preferred method of overriding GetHashCode() is shown below. If targeting older .NET versions, include the System.ValueTuple nuget package.
// C# 7.0+
public override int GetHashCode() => (FooId, FooName).GetHashCode();
In terms of performance, this method will outperform most composite hash code implementations. The ValueTuple is a struct so there won't be any garbage, and the underlying algorithm is as fast as it gets.

Just to add on above answers:
If you don't override Equals, the default behavior is that the references of the objects are compared. The same applies to the hash code: the default implementation is based on the identity of the object, not on its field values.
Because you did override Equals, the correct behavior is to compare whatever you implemented in Equals and not the references, so you should do the same for the hash code.
Clients of your class will expect the hash code to follow the same logic as the Equals method. For example, LINQ methods that use an IEqualityComparer first compare the hash codes, and only if they are equal do they call Equals(), which might be more expensive to run. If we didn't implement the hash code, equal objects would probably have different hash codes (because they are different instances) and would wrongly be determined as not equal (Equals() won't even be called).
In addition to the problem that you might not be able to find your object if you used it in a dictionary (because it was inserted under one hash code, and when you look for it the default hash code will probably be different, so again Equals() won't even be called, as Marc Gravell explains in his answer), you also violate the dictionary or hash set concept, which should not allow identical keys. You already declared that those objects are essentially the same when you overrode Equals, so you don't want both of them as different keys in a data structure that is supposed to have unique keys. But because they have different hash codes, the "same" key will be inserted as a different one.

It is because the framework requires that two objects that are the same must have the same hashcode. If you override the equals method to do a special comparison of two objects and the two objects are considered the same by the method, then the hash code of the two objects must also be the same. (Dictionaries and Hashtables rely on this principle).
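
A small sketch of that principle with a HashSet: when Equals and GetHashCode agree, logically equal objects collapse into one entry. Row and HashSetDemo are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

sealed class Row
{
    public int Id { get; }
    public Row(int id) { Id = id; }

    // Both members are based on Id, so they can never disagree.
    public override bool Equals(object obj) => obj is Row other && other.Id == Id;
    public override int GetHashCode() => Id.GetHashCode();
}

static class HashSetDemo
{
    public static int CountUnique()
    {
        // Two of the three rows are logically equal (Id == 7).
        var set = new HashSet<Row> { new Row(7), new Row(7), new Row(8) };
        return set.Count;
    }
}
```

HashSetDemo.CountUnique() returns 2. If the GetHashCode override were removed, the two Row(7) instances would most likely land in different buckets and both be kept, which is exactly the inconsistency described above.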

We have two problems to cope with.
1. You cannot provide a sensible GetHashCode() if any field in the object can be changed. Also, often an object will NEVER be used in a collection that depends on GetHashCode(). So the cost of implementing GetHashCode() is often not worth it, or it is not possible.
2. If someone puts your object in a collection that calls GetHashCode() and you have overridden Equals() without also making GetHashCode() behave in a correct way, that person may spend days tracking down the problem.
Therefore, by default I do this:
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Some comment to explain if there is a real problem with providing GetHashCode()
// or if I just don't see a need for it for the given class
throw new Exception("Sorry I don't know what GetHashCode should do for this class");
}
}

Hash codes are used by hash-based collections like Dictionary, Hashtable, HashSet, etc. The purpose of the code is to very quickly pre-sort an object by putting it into a specific group (bucket). This pre-sorting helps tremendously in finding the object when you need to retrieve it from the hash collection, because the code has to search for your object in just one bucket instead of among all the objects it contains. The better the distribution of hash codes (better uniqueness), the faster the retrieval. In the ideal situation where each object has a unique hash code, finding it is an O(1) operation. In most cases it approaches O(1).

It's not necessarily important; it depends on the size of your collections and your performance requirements and whether your class will be used in a library where you may not know the performance requirements. I frequently know my collection sizes are not very large and my time is more valuable than a few microseconds of performance gained by creating a perfect hash code; so (to get rid of the annoying warning by the compiler) I simply use:
public override int GetHashCode()
{
return base.GetHashCode();
}
(Of course I could use a #pragma to turn off the warning as well but I prefer this way.)
When you are in a position where you do need the performance, then all of the issues mentioned by others here apply, of course. Most important (otherwise you will get wrong results when retrieving items from a hash set or dictionary): the hash code must not vary during the lifetime of an object, or more accurately, during the time the hash code is needed, such as while the object is a key in a dictionary. For example, the following is wrong because Value is public and so can be changed externally during the lifetime of the instance, so you must not use it as the basis for the hash code:
class A
{
public int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //WRONG! Value is not constant during the instance's life time
}
}
On the other hand, if Value can't be changed it's ok to use:
class A
{
public readonly int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //OK Value is read-only and can't be changed during the instance's life time
}
}
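
To show what "wrong results" means in practice, here is a sketch of a key being mutated while it sits in a Dictionary. MutableKey and MutationDemo are hypothetical names, and the field is deliberately left mutable to reproduce the failure.

```csharp
using System;
using System.Collections.Generic;

sealed class MutableKey
{
    public int Value; // mutable on purpose, to show the failure mode

    public override bool Equals(object obj) => obj is MutableKey k && k.Value == Value;
    public override int GetHashCode() => Value.GetHashCode();
}

static class MutationDemo
{
    public static (bool before, bool after) Run()
    {
        var key = new MutableKey { Value = 1 };
        var dict = new Dictionary<MutableKey, string> { [key] = "one" };

        bool before = dict.ContainsKey(key); // found: the hash matches the stored entry
        key.Value = 2;                       // the hash code changes while the key is in use
        bool after = dict.ContainsKey(key);  // not found: the entry was stored under the old hash
        return (before, after);
    }
}
```

MutationDemo.Run() returns (true, false). The entry is still inside the dictionary, but it can no longer be reached through its own key.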

You should always guarantee that if two objects are equal, as defined by Equals(), they return the same hash code. As some of the other comments state, in theory this is not mandatory if the object will never be used in a hash-based container like HashSet or Dictionary. I would advise you to always follow this rule though. The reason is simply that it is way too easy for someone to change a collection from one type to another with the good intention of actually improving the performance or just conveying the code semantics in a better way.
For example, suppose we keep some objects in a List. Sometime later someone actually realizes that a HashSet is a much better alternative because of the better search characteristics for example. This is when we can get into trouble. List would internally use the default equality comparer for the type which means Equals in your case while HashSet makes use of GetHashCode(). If the two behave differently, so will your program. And bear in mind that such issues are not the easiest to troubleshoot.
I've summarized this behavior with some other GetHashCode() pitfalls in a blog post where you can find further examples and explanations.

As of C# 9 (.NET 5), you may want to use records, as they provide value-based equality by default.
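
A minimal sketch of the record approach, assuming a C# 9 or later compiler. FooRecord and RecordDemo are hypothetical names; the compiler generates Equals, GetHashCode, == and != from the declared members.

```csharp
using System;

// Requires C# 9 or later. Equality members are generated by the compiler
// and cover every declared property (FooId and FooName here).
public record FooRecord(int FooId, string FooName);

static class RecordDemo
{
    public static bool ValueEqual()
    {
        var a = new FooRecord(1, "first");
        var b = new FooRecord(1, "first");
        return a == b
            && a.Equals(b)
            && a.GetHashCode() == b.GetHashCode()
            && !ReferenceEquals(a, b); // two instances, one logical value
    }
}
```

RecordDemo.ValueEqual() returns true. Note that a record compares all of its members, which differs from the question's Equals that looks at FooId only.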

It's my understanding that the original GetHashCode() returns the memory address of the object, so it's essential to override it if you wish to compare two different objects.
EDITED:
That was incorrect, the original GetHashCode() method cannot assure the equality of 2 values. Though objects that are equal return the same hash code.

Using reflection, as below, seems to me a better option for public properties, because you don't have to worry about the addition or removal of properties (although that is not a common scenario). I also found this to perform better (timed using the Diagnostics Stopwatch).
// requires: using System.Reflection;
public override int GetHashCode()
{
PropertyInfo[] theProperties = this.GetType().GetProperties();
int hash = 31;
foreach (PropertyInfo info in theProperties)
{
if (info != null)
{
var value = info.GetValue(this,null);
if(value != null)
unchecked
{
hash = 29 * hash ^ value.GetHashCode();
}
}
}
return hash;
}
