Object.GetHashCode - c#

My question may duplicate Default implementation for Object.GetHashCode() but I'm asking again because I didn't understand the accepted answer to that one.
To begin with I have three questions about the accepted answer to the previous question, which quotes some documentation as follows:
"However, because this index can be reused after the object is reclaimed during garbage collection, it is possible to obtain the same hash code for two different objects."
Is this true? It seems to me that two objects won't have the same hash code, because an object's hash code isn't reused until the object is garbage collected (i.e. no longer exists).
"Also, two objects that represent the same value have the same hash code only if they are the exact same object."
Is this a problem? For example, I want to associate some data with each of the node instances in a DOM tree. To do this, the 'nodes' must have an identity or hash code, so that I can use them as keys in a dictionary of the data. Isn't a hash code which identifies whether it's "the exact same object", i.e. "reference equality" rather than "value equality", exactly what I want?
"This implementation is not particularly useful for hashing; therefore, derived classes should override GetHashCode"
Is this true? If it's not good for hashing, then what if anything is it good for, and why is it even defined as a method of Object?
My final (and perhaps most important to me) question is, if I must invent/override a GetHashCode() implementation for an arbitrary type which has "reference equality" semantics, is the following a reasonable and good implementation:
class SomeType
{
//create a new value for each instance
static int s_allocated = 0;
//value associated with this instance
int m_allocated;
//more instance data
... plus other data members ...
//constructor
SomeType()
{
m_allocated = ++s_allocated;
}
//override GetHashCode
public override int GetHashCode()
{
return m_allocated;
}
}
Edit
FYI I tested it, using the following code:
class TestGetHash
{
//default implementation
class First
{
int m_x;
}
//my implementation
class Second
{
static int s_allocated = 0;
int m_allocated;
int m_x;
public Second()
{
m_allocated = ++s_allocated;
}
public override int GetHashCode()
{
return m_allocated;
}
}
//stupid worst-case implementation
class Third
{
int m_x;
public override int GetHashCode()
{
return 0;
}
}
internal static void test()
{
testT<First>(100, 1000);
testT<First>(1000, 100);
testT<Second>(100, 1000);
testT<Second>(1000, 100);
testT<Third>(100, 100);
testT<Third>(1000, 10);
}
static void testT<T>(int objects, int iterations)
where T : new()
{
System.Diagnostics.Stopwatch stopWatch =
System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < iterations; ++i)
{
Dictionary<T, object> dictionary = new Dictionary<T, object>();
for (int j = 0; j < objects; ++j)
{
T t = new T();
dictionary.Add(t, null);
}
for (int k = 0; k < 100; ++k)
{
foreach (T t in dictionary.Keys)
{
object o = dictionary[t];
}
}
}
stopWatch.Stop();
string stopwatchMessage = string.Format(
"Stopwatch: {0} type, {1} objects, {2} iterations, {3} msec",
typeof(T).Name, objects, iterations,
stopWatch.ElapsedMilliseconds);
System.Console.WriteLine(stopwatchMessage);
}
}
On my machine the results/output are as follows:
First type, 100 objects, 1000 iterations, 2072 msec
First type, 1000 objects, 100 iterations, 2098 msec
Second type, 100 objects, 1000 iterations, 1300 msec
Second type, 1000 objects, 100 iterations, 1319 msec
Third type, 100 objects, 100 iterations, 1487 msec
Third type, 1000 objects, 10 iterations, 13754 msec
My implementation takes half the time of the default implementation (but my type is bigger by the size of my m_allocated data member).
My implementation and the default implementation both scale linearly.
In comparison and as a sanity check, the stupid implementation starts bad and scales worse.

The most important property a hash code implementation must have is this:
If two objects compare as equal then they must have identical hash codes.
If you have a class where instances of the class compare by reference equality, then you do not need to override GetHashCode; the default implementation guarantees that two objects that are the same reference have the same hash code. (You're calling the same method twice on the same object, so of course the result is the same.)
If you have written a class which implements its own equality that is different from reference equality then you are REQUIRED to override GetHashCode such that two objects that compare as equal have equal hash codes.
Now, you could do so by simply returning zero every time. That would be a lousy hash function, but it would be legal.
Other properties of good hash functions are:
GetHashCode should never throw an exception
Mutable objects which compare for equality on their mutable state, and therefore hash on their mutable state, are dangerously bug-prone. You can put an object into a hash table, mutate it, and be unable to get it out again. Try to never hash or compare for equality on mutable state.
GetHashCode should be extremely fast -- remember, the purpose of a good hash algorithm is to improve the performance of lookups. If the hash is slow then the lookups can't be made fast.
Objects which do not compare as equal should have dissimilar hash codes, well distributed over the whole range of a 32-bit integer.
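Putting those guidelines together, here is a minimal sketch (the Point type and its fields are invented for this example): it hashes only immutable state, never throws, is cheap to compute, and mixes the fields with a prime so that swapped values don't collide the way a plain XOR would:

sealed class Point
{
    // Immutable fields: hashing mutable state is bug-prone, as noted above.
    private readonly int x;
    private readonly int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    public override bool Equals(object obj)
    {
        Point other = obj as Point;
        return other != null && other.x == x && other.y == y;
    }

    public override int GetHashCode()
    {
        unchecked // overflow is fine for hashing
        {
            int hash = 17;
            hash = hash * 31 + x;
            hash = hash * 31 + y;
            return hash;
        }
    }
}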

Question:
Is this true? It seems to me that two objects won't have the same hash code, because
an object's code isn't reused until the object is garbage collected (i.e. no longer exists).
Two objects may share the same hash code when it is generated by the default GetHashCode implementation, because:
The default GetHashCode result must not change during an object's lifetime, and the default implementation ensures this. If it could change, types such as Hashtable couldn't work with it. That's because the default hash code is expected to behave like a hash of a unique instance identifier (even though no such identifier actually exists :) ).
The range of GetHashCode values is the range of integers (2^32 values).
Conclusion:
It's enough to allocate 2^32 strongly referenced objects (which should be easy on Win64) to reach the limit.
Finally, there is an explicit statement in the object.GetHashCode reference on MSDN: "The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET Framework versions and platforms. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes."

You do not actually need to modify anything on a class which requires only reference equality.
Also, formally speaking, that is not a good implementation because it has poor distribution. A hash function should have a reasonable distribution, since that improves hash bucket distribution and, indirectly, the performance of collections that use hash tables. As I said, that is a formal answer; it's one of the guidelines for designing a hash function.

Related

Best way to keep track of objects when writing a debugger [duplicate]

Is there a way of getting a unique identifier of an instance?
GetHashCode() is the same for the two references pointing to the same instance. However, two different instances can (quite easily) get the same hash code:
Hashtable hashCodesSeen = new Hashtable();
LinkedList<object> l = new LinkedList<object>();
int n = 0;
while (true)
{
object o = new object();
// Remember objects so that they don't get collected.
// This does not make any difference though :(
l.AddFirst(o);
int hashCode = o.GetHashCode();
n++;
if (hashCodesSeen.ContainsKey(hashCode))
{
// Same hashCode seen twice for DIFFERENT objects (n is as low as 5322).
Console.WriteLine("Hashcode seen twice: " + n + " (" + hashCode + ")");
break;
}
hashCodesSeen.Add(hashCode, null);
}
I'm writing a debugging addin, and I need to get some kind of ID for a reference which is unique during the run of the program.
I already managed to get internal ADDRESS of the instance, which is unique until the garbage collector (GC) compacts the heap (= moves the objects = changes the addresses).
Stack Overflow question Default implementation for Object.GetHashCode() might be related.
The objects are not under my control as I am accessing objects in a program being debugged using the debugger API. If I was in control of the objects, adding my own unique identifiers would be trivial.
I wanted the unique ID for building a hashtable ID -> object, to be able to lookup already seen objects. For now I solved it like this:
Build a hashtable: 'hashCode' -> (list of objects with hash code == 'hashCode')
Find if object seen(o) {
candidates = hashtable[o.GetHashCode()] // Objects with the same hashCode.
If no candidates, the object is new
If some candidates, compare their addresses to o.Address
If no address is equal (the hash code was just a coincidence) -> o is new
If some address equal, o already seen
}
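Expressed as a C# sketch (DebuggeeObject with its Address and HashCode members is a hypothetical stand-in for whatever the debugger API exposes; remember that addresses are only stable until the GC compacts the heap):

using System.Collections.Generic;

// Hypothetical wrapper around an object in the debuggee process.
class DebuggeeObject
{
    public int HashCode;   // the debuggee object's GetHashCode() value
    public ulong Address;  // its current heap address (valid until compaction)
}

class SeenObjects
{
    private readonly Dictionary<int, List<DebuggeeObject>> seen =
        new Dictionary<int, List<DebuggeeObject>>();

    // Returns true if o was seen before; otherwise registers it and returns false.
    public bool IsAlreadySeen(DebuggeeObject o)
    {
        List<DebuggeeObject> candidates;
        if (!seen.TryGetValue(o.HashCode, out candidates))
        {
            seen[o.HashCode] = new List<DebuggeeObject> { o }; // no candidates: o is new
            return false;
        }
        foreach (DebuggeeObject candidate in candidates)
        {
            if (candidate.Address == o.Address)
                return true; // same address: already seen
        }
        candidates.Add(o); // equal hash was a coincidence: o is new
        return false;
    }
}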
.NET 4 and later only
Good news, everyone!
The perfect tool for this job is built in .NET 4 and it's called ConditionalWeakTable<TKey, TValue>. This class:
can be used to associate arbitrary data with managed object instances much like a dictionary (although it is not a dictionary)
does not depend on memory addresses, so is immune to the GC compacting the heap
does not keep objects alive just because they have been entered as keys into the table, so it can be used without making every object in your process live forever
uses reference equality to determine object identity; moreover, class authors cannot modify this behavior, so it can be used consistently on objects of any type
can be populated on the fly, so does not require that you inject code inside object constructors
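For instance, a minimal sketch (the ObjectIds class is invented for this example) that uses ConditionalWeakTable to hand out stable per-instance IDs without keeping the objects alive:

using System.Runtime.CompilerServices;
using System.Threading;

static class ObjectIds
{
    private static readonly ConditionalWeakTable<object, object> ids =
        new ConditionalWeakTable<object, object>();
    private static int nextId;

    // Returns a stable ID per object instance; the entry (and the boxed ID)
    // disappears automatically once the key object is garbage collected.
    public static int GetId(object o)
    {
        return (int)ids.GetValue(o, _ => (object)Interlocked.Increment(ref nextId));
    }
}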
Have you checked out the ObjectIDGenerator class? It does what you're attempting to do, and what Marc Gravell describes.
The ObjectIDGenerator keeps track of previously identified objects. When you ask for the ID of an object, the ObjectIDGenerator knows whether to return the existing ID, or generate and remember a new ID.
The IDs are unique for the life of the ObjectIDGenerator instance. Generally, an ObjectIDGenerator's life lasts as long as the Formatter that created it. Object IDs have meaning only within a given serialized stream, and are used for tracking which objects have references to others within the serialized object graph.
Using a hash table, the ObjectIDGenerator retains which ID is assigned to which object. The object references, which uniquely identify each object, are addresses in the runtime garbage-collected heap. Object reference values can change during serialization, but the table is updated automatically so the information is correct.
Object IDs are 64-bit numbers. Allocation starts from one, so zero is never a valid object ID. A formatter can choose a zero value to represent an object reference whose value is a null reference (Nothing in Visual Basic).
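A small usage sketch (the Demo wrapper class is invented for the example):

using System;
using System.Runtime.Serialization;

class Demo
{
    static void Main()
    {
        ObjectIDGenerator generator = new ObjectIDGenerator();
        object a = new object();
        bool firstTime;
        long id1 = generator.GetId(a, out firstTime); // firstTime == true
        long id2 = generator.GetId(a, out firstTime); // same ID, firstTime == false
        Console.WriteLine("{0} {1} {2}", id1, id2, firstTime);
        // Caveat: ObjectIDGenerator holds strong references, so tracked
        // objects stay alive for the generator's lifetime.
    }
}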
The reference is the unique identifier for the object. I don't know of any way of converting this into anything like a string etc. The value of the reference will change during compaction (as you've seen), but every previous value A will be changed to value B, so as far as safe code is concerned it's still a unique ID.
If the objects involved are under your control, you could create a mapping using weak references (to avoid preventing garbage collection) from a reference to an ID of your choosing (GUID, integer, whatever). That would add a certain amount of overhead and complexity, however.
RuntimeHelpers.GetHashCode() may help (MSDN).
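A small illustration (not from the original answer): RuntimeHelpers.GetHashCode ignores GetHashCode overrides and hashes by identity, which is what reference-based tracking needs:

using System;
using System.Runtime.CompilerServices;

class Demo
{
    static void Main()
    {
        string a = "hello";
        string b = new string("hello".ToCharArray()); // same characters, distinct instance

        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True: string hashes by value
        Console.WriteLine(RuntimeHelpers.GetHashCode(a) ==
                          RuntimeHelpers.GetHashCode(b));      // almost certainly False: identity-based
    }
}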
You can develop your own thing in a second. For instance:
class Program
{
static void Main(string[] args)
{
var a = new object();
var b = new object();
Console.WriteLine("", a.GetId(), b.GetId());
}
}
public static class MyExtensions
{
//this dictionary should use weak key references
static Dictionary<object, int> d = new Dictionary<object,int>();
static int gid = 0;
public static int GetId(this object o)
{
if (d.ContainsKey(o)) return d[o];
return d[o] = gid++;
}
}
You can choose what you would like to have as the unique ID on your own, for instance, System.Guid.NewGuid() or simply an integer for fastest access.
How about this method:
Set a field in the first object to a new value. If the same field in the second object has the same value, it's probably the same instance. Otherwise, exit as different.
Now set the field in the first object to a different new value. If the same field in the second object has changed to the different value, it's definitely the same instance.
Don't forget to set the field in the first object back to its original value on exit.
Problems?
It is possible to make a unique object identifier in Visual Studio: In the watch window, right-click the object variable and choose Make Object ID from the context menu.
Unfortunately, this is a manual step, and I don't believe the identifier can be accessed via code.
You would have to assign such an identifier yourself, manually - either inside the instance, or externally.
For records related to a database, the primary key may be useful (but you can still get duplicates). Alternatively, either use a Guid, or keep your own counter, allocating using Interlocked.Increment (and make it large enough that it isn't likely to overflow).
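For example, a minimal counter-based sketch (the Tracked class is invented for the example); a 64-bit counter makes overflow a non-issue in practice:

using System.Threading;

class Tracked
{
    private static long nextId;

    public long Id { get; private set; }

    public Tracked()
    {
        // Interlocked.Increment makes ID allocation safe across threads.
        Id = Interlocked.Increment(ref nextId);
    }
}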
I know that this has been answered, but it's at least useful to note that you can use:
http://msdn.microsoft.com/en-us/library/system.object.referenceequals.aspx
Which will not give you a "unique id" directly, but combined with WeakReferences (and a hashset?) could give you a pretty easy way of tracking various instances.
If you are writing a module in your own code for a specific usage, majkinetor's method MIGHT have worked. But there are some problems.
First, the official documentation does NOT guarantee that GetHashCode() returns a unique identifier (see Object.GetHashCode Method):
You should not assume that equal hash codes imply object equality.
Second, even assuming you have a small enough number of objects that GetHashCode() works in most cases, this method can be overridden by some types.
For example, you are using some class C and it overrides GetHashCode() to always return 0. Then every object of C will get the same hash code.
Unfortunately, Dictionary, Hashtable and some other associative containers make use of this method:
A hash code is a numeric value that is used to insert and identify an object in a hash-based collection such as the Dictionary<TKey, TValue> class, the Hashtable class, or a type derived from the DictionaryBase class. The GetHashCode method provides this hash code for algorithms that need quick checks of object equality.
So, this approach has great limitations.
And even more, what if you want to build a general purpose library?
Not only are you not able to modify the source code of the used classes, but their behavior is also unpredictable.
I appreciate that Jon and Simon have posted their answers, and I will post a code example and a suggestion on performance below.
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Serialization;
using System.Collections.Generic;
namespace ObjectSet
{
public interface IObjectSet
{
/// <summary> check the existence of an object. </summary>
/// <returns> true if object is exist, false otherwise. </returns>
bool IsExist(object obj);
/// <summary> if the object is not in the set, add it in. else do nothing. </summary>
/// <returns> true if successfully added, false otherwise. </returns>
bool Add(object obj);
}
public sealed class ObjectSetUsingConditionalWeakTable : IObjectSet
{
/// <summary> unit test on object set. </summary>
internal static void Main() {
Stopwatch sw = new Stopwatch();
sw.Start();
ObjectSetUsingConditionalWeakTable objSet = new ObjectSetUsingConditionalWeakTable();
for (int i = 0; i < 10000000; ++i) {
object obj = new object();
if (objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.Add(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
public bool IsExist(object obj) {
object ignored; // local out target instead of a shared static field (thread safety)
return objectSet.TryGetValue(obj, out ignored);
}
public bool Add(object obj) {
if (IsExist(obj)) {
return false;
} else {
objectSet.Add(obj, null);
return true;
}
}
/// <summary> internal representation of the set. (only use the key) </summary>
private ConditionalWeakTable<object, object> objectSet = new ConditionalWeakTable<object, object>();
/// <summary> used to fill the out parameter of ConditionalWeakTable.TryGetValue(). </summary>
private static object tryGetValue_out0 = null;
}
[Obsolete("It will crash if there are too many objects and ObjectSetUsingConditionalWeakTable get a better performance.")]
public sealed class ObjectSetUsingObjectIDGenerator : IObjectSet
{
/// <summary> unit test on object set. </summary>
internal static void Main() {
Stopwatch sw = new Stopwatch();
sw.Start();
ObjectSetUsingObjectIDGenerator objSet = new ObjectSetUsingObjectIDGenerator();
for (int i = 0; i < 10000000; ++i) {
object obj = new object();
if (objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.Add(obj)) { Console.WriteLine("bug!!!"); }
if (!objSet.IsExist(obj)) { Console.WriteLine("bug!!!"); }
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
public bool IsExist(object obj) {
bool firstTime;
idGenerator.HasId(obj, out firstTime);
return !firstTime;
}
public bool Add(object obj) {
bool firstTime;
idGenerator.GetId(obj, out firstTime);
return firstTime;
}
/// <summary> internal representation of the set. </summary>
private ObjectIDGenerator idGenerator = new ObjectIDGenerator();
}
}
In my test, the ObjectIDGenerator threw an exception complaining that there were too many objects when creating 10,000,000 objects (10x more than in the code above) in the for loop.
Also, the benchmark result is that the ConditionalWeakTable implementation is 1.8x faster than the ObjectIDGenerator implementation.
The information I give here is not new, I just added this for completeness.
The idea of this code is quite simple:
Objects need a unique ID, which isn't there by default. Instead, we have to rely on the next best thing, which is RuntimeHelpers.GetHashCode to get us a sort-of unique ID
To check uniqueness, this implies we need to use object.ReferenceEquals
However, we would still like to have a unique ID, so I added a GUID, which is by definition unique.
Because I don't like locking everything if I don't have to, I don't use ConditionalWeakTable.
Combined, that will give you the following code:
public class UniqueIdMapper
{
private class ObjectEqualityComparer : IEqualityComparer<object>
{
public bool Equals(object x, object y)
{
return object.ReferenceEquals(x, y);
}
public int GetHashCode(object obj)
{
return RuntimeHelpers.GetHashCode(obj);
}
}
private Dictionary<object, Guid> dict = new Dictionary<object, Guid>(new ObjectEqualityComparer());
public Guid GetUniqueId(object o)
{
Guid id;
if (!dict.TryGetValue(o, out id))
{
id = Guid.NewGuid();
dict.Add(o, id);
}
return id;
}
}
To use it, create an instance of the UniqueIdMapper and use the GUIDs it returns for the objects.
Addendum
So, there's a bit more going on here; let me write down a bit about ConditionalWeakTable.
ConditionalWeakTable does a couple of things. The most important one is that it doesn't keep its keys alive as far as the garbage collector is concerned, that is: the objects that you reference in this table will be collected regardless. If you look up an object, it basically works the same as the dictionary above.
Curious, no? After all, when the GC examines an object, it checks whether there are references to it, and if there are, it does not collect it. So if the ConditionalWeakTable holds an entry for an object, why is the referenced object collected anyway?
ConditionalWeakTable uses a small trick, which some other .NET structures also use: instead of storing a reference to the object, it actually stores an IntPtr. Because that's not a real reference, the object can be collected.
So, at this point there are 2 problems to address. First, objects can be moved on the heap, so what will we use as IntPtr? And second, how do we know that objects have an active reference?
The object can be pinned on the heap, and its real pointer can be stored. When the GC hits the object for removal, it unpins it and collects it. However, that would mean we get a pinned resource, which isn't a good idea if you have a lot of objects (due to memory fragmentation issues). This is probably not how it works.
When the GC moves an object, it invokes a callback, which can then update the references. This might be how it's implemented, judging by the external calls in DependentHandle - but I believe it's slightly more sophisticated.
What is stored is not the pointer to the object itself, but a pointer into the GC's list of all objects. The IntPtr is either an index or a pointer into this list. The list only changes when an object changes generations, at which point a simple callback can update the pointers. If you remember how Mark & Sweep works, this makes more sense. There's no pinning, and removal works as before. I believe this is how it works in DependentHandle.
This last solution does require that the runtime doesn't re-use the list buckets until they are explicitly freed, and it also requires that all objects are retrieved by a call to the runtime.
If we assume they use this solution, we can also address the second problem. The Mark & Sweep algorithm keeps track of which objects have been collected, so we know as soon as that has happened. Once the table checks whether the object is still there and finds it has been collected, it calls 'Free', which removes the pointer and the list entry. The object is really gone.
One important thing to note at this point is that things would go horribly wrong if ConditionalWeakTable were updated from multiple threads without being thread safe. The result would be a memory leak. This is why all calls in ConditionalWeakTable take a simple 'lock', which ensures this doesn't happen.
Another thing to note is that cleaning up entries has to happen once in a while. While the actual objects will be cleaned up by the GC, the entries are not. This is why ConditionalWeakTable only grows in size. Once it hits a certain limit (determined by collision chance in the hash), it triggers a Resize, which checks if objects have to be cleaned up -- if they do, free is called in the GC process, removing the IntPtr handle.
I believe this is also why DependentHandle is not exposed directly - you don't want to mess with things and get a memory leak as a result. The next best thing for that is a WeakReference (which also stores an IntPtr instead of an object) - but unfortunately doesn't include the 'dependency' aspect.
What remains is for you to toy around with the mechanics, so that you can see the dependency in action. Be sure to start it multiple times and watch the results:
class DependentObject
{
public class MyKey : IDisposable
{
public MyKey(bool iskey)
{
this.iskey = iskey;
}
private bool disposed = false;
private bool iskey;
public void Dispose()
{
if (!disposed)
{
disposed = true;
Console.WriteLine("Cleanup {0}", iskey);
}
}
~MyKey()
{
Dispose();
}
}
static void Main(string[] args)
{
var dep = new MyKey(true); // also try passing this to cwt.Add
ConditionalWeakTable<MyKey, MyKey> cwt = new ConditionalWeakTable<MyKey, MyKey>();
cwt.Add(new MyKey(true), dep); // try doing this 5 times f.ex.
GC.Collect(GC.MaxGeneration);
GC.WaitForFullGCComplete();
Console.WriteLine("Wait");
Console.ReadLine(); // Put a breakpoint here and inspect cwt to see that the IntPtr is still there
}

How to implement GetHashcode for a difference-tolerant DateTime comparer? [duplicate]

I am trying to implement an IEqualityComparer that has a tolerance on a date comparison. I have also looked into this question. The problem is that I can't use a workaround because I am using the IEqualityComparer in a LINQ .GroupJoin(). I have tried a few implementations that allow for tolerance. I can get the Equals() to work because I have both objects but I can't figure out how to implement GetHashCode().
My best attempt looks something like this:
public class ThingWithDateComparer : IEqualityComparer<IThingWithDate>
{
private readonly int _daysToAdd;
public ThingWithDateComparer(int daysToAdd)
{
_daysToAdd = daysToAdd;
}
public int GetHashCode(IThingWithDate obj)
{
unchecked
{
var hash = 17;
hash = hash * 23 + obj.BirthDate.AddDays(_daysToAdd).GetHashCode();
return hash;
}
}
public bool Equals(IThingWithDate x, IThingWithDate y)
{
throw new NotImplementedException();
}
}
public interface IThingWithDate
{
DateTime BirthDate { get; set; }
}
Because .GroupJoin() builds a hash table out of the GetHashCode() values, the days to add get applied to both/all objects. This doesn't work.
The problem is impossible, conceptually. You're trying to compare objects in a way that doesn't have the form of equality that is necessary for the operations you're trying to perform. For example, GroupJoin depends on the assumption that if A is equal to B, and B is equal to C, then A is equal to C, but in your situation that's not true. A and B may be "close enough" together for you to want to group them, but A and C may not be.
You're going to need to not implement IEqualityComparer at all, because you cannot fulfill the contract it requires. If you want to create a mapping of items in one collection to all of the items in another collection that are "close enough" to it, then you're going to need to write that algorithm yourself (doing so efficiently is likely to be hard, but doing so inefficiently shouldn't be that difficult) rather than using GroupJoin, because it's not capable of performing that operation; see the sketch after this answer.
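As an illustration, here is a deliberately inefficient sketch (assuming the IThingWithDate interface from the question) that pairs each outer item with every inner item whose date lies within the tolerance - the non-transitive relation that GroupJoin cannot express:

using System;
using System.Collections.Generic;
using System.Linq;

static class CloseGrouping
{
    // Naive O(n*m) matching: for each outer item, collect all inner items
    // whose BirthDate is within `tolerance` of it. No transitivity is assumed.
    public static IEnumerable<Tuple<IThingWithDate, List<IThingWithDate>>> CloseMatches(
        IEnumerable<IThingWithDate> outer,
        IEnumerable<IThingWithDate> inner,
        TimeSpan tolerance)
    {
        var innerList = inner.ToList();
        foreach (var o in outer)
        {
            var matches = innerList
                .Where(i => (i.BirthDate - o.BirthDate).Duration() <= tolerance)
                .ToList();
            yield return Tuple.Create(o, matches);
        }
    }
}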
I can't see any way to generate a logical hash code for your given criteria.
The hash code is used to determine whether two dates should stick together; if they should group together, they must return the same hash code.
If your "float" is 5 days, that means 1/1/2000 must generate the same hash code as 1/4/2000, and 1/4/2000 must generate the same hash code as 1/8/2000 (since each pair is within 5 days of each other). That implies 1/1/2000 has the same code as 1/8/2000 (since if a = b and b = c, then a = c).
But 1/1/2000 and 1/8/2000 are outside the 5-day "float".

Why does initial hash value in GetHashCode() implementation generated for anonymous class depend on property names?

When generating GetHashCode() implementation for anonymous class, Roslyn computes the initial hash value based on the property names. For example, the class generated for
var x = new { Int = 42, Text = "42" };
is going to have the following GetHashCode() method:
public override int GetHashCode()
{
int hash = 339055328;
hash = hash * -1521134295 + EqualityComparer<int>.Default.GetHashCode( Int );
hash = hash * -1521134295 + EqualityComparer<string>.Default.GetHashCode( Text );
return hash;
}
But if we change the property names, the initial value changes:
var x = new { Int2 = 42, Text2 = "42" };
public override int GetHashCode()
{
int hash = 605502342;
hash = hash * -1521134295 + EqualityComparer<int>.Default.GetHashCode( Int2 );
hash = hash * -1521134295 + EqualityComparer<string>.Default.GetHashCode( Text2 );
return hash;
}
What's the reason behind this behaviour? Is there some problem with just picking a big [prime?] number and using it for all the anonymous classes?
Is there some problem with just picking a big [prime?] number and using it for all the anonymous classes?
There is nothing wrong with doing this, it just tends to produce a less efficient value.
The goal of a GetHashCode implementation is to return different results for values which are not equal. This decreases the chance of collisions when the values are used in hash based collections (such as Dictionary<TKey, TValue>).
An anonymous value can never be equal to another anonymous value if they represent different types. The type of an anonymous value is defined by the shape of the properties:
Name of properties
Type of properties
Count of properties
Two anonymous values which differ on any of these characteristics represent different types and hence can never be equal values.
Given this is true it makes sense for the compiler to generate GetHashCode implementations which tend to return different values for different types. This is why the compiler includes the property names when computing the initial hash.
Unless someone from the Roslyn team steps up we can only speculate. I would have done it the same way. Using a different seed for each anonymous type seems like a useful way to have more randomness in the hash codes. For example it causes new { a = 1 }.GetHashCode() != new { b = 1 }.GetHashCode() to be true.
I also wonder whether there are any bad seeds that cause the hash code computation to fall apart. I don't think so. Even a 0 seed would work.
The Roslyn source code can be found in AnonymousTypeGetHashCodeMethodSymbol. The initial hash code value is based on a hash of the names of the anonymous type.

GetHashCode() problem using xor

My understanding is that you're typically supposed to use xor with GetHashCode() to produce an int to identify your data by its value (as opposed to by its reference). Here's a simple example:
class Foo
{
int m_a;
int m_b;
public int A
{
get { return m_a; }
set { m_a = value; }
}
public int B
{
get { return m_b; }
set { m_b = value; }
}
public Foo(int a, int b)
{
m_a = a;
m_b = b;
}
public override int GetHashCode()
{
return A ^ B;
}
public override bool Equals(object obj)
{
return this.GetHashCode() == obj.GetHashCode();
}
}
The idea being, I want to compare one instance of Foo to another based on the value of properties A and B. If Foo1.A == Foo2.A and Foo1.B == Foo2.B, then we have equality.
Here's the problem:
Foo one = new Foo(1, 2);
Foo two = new Foo(2, 1);
if (one.Equals(two)) { ... } // This is true!
These both produce a value of 3 for GetHashCode(), causing Equals() to return true. Obviously, this is a trivial example, and with only two properties I could simply compare the individual properties in the Equals() method. However, with a more complex class this would get out of hand quickly.
I know that sometimes it makes good sense to set the hash code only once, and always return the same value. However, for mutable objects where an evaluation of equality is necessary, I don't think this is reasonable.
What's the best way to handle property values that could easily be interchanged when implementing GetHashCode()?
See Also
What is the best algorithm for an overridden System.Object.GetHashCode?
First off - Do not implement Equals() only in terms of GetHashCode() - hashcodes will sometimes collide even when objects are not equal.
The contract for GetHashCode() includes the following:
different hashcodes means that objects are definitely not equal
same hashcodes means objects might be equal (but possibly might not)
Andrew Hare suggested I incorporate his answer:
I would recommend that you read this solution (by our very own Jon Skeet, by the way) for a "better" way to calculate a hashcode.
No, the above is relatively slow and doesn't help a lot. Some people use XOR (e.g. a ^ b ^ c) but I prefer the kind of method shown in Josh Bloch's "Effective Java":
public override int GetHashCode()
{
int hash = 23;
hash = hash*37 + craneCounterweightID;
hash = hash*37 + trailerID;
hash = hash*37 + craneConfigurationTypeCode.GetHashCode();
return hash;
}
The 23 and 37 are arbitrary numbers which are co-prime.
The benefit of the above over the XOR method is that if you have a type which has two values which are frequently the same, XORing those values will always give the same result (0) whereas the above will differentiate between them unless you're very unlucky.
As mentioned in the above snippet, you might also want to look at Joshua Bloch's book, Effective Java, which contains a nice treatment of the subject (the hashcode discussion applies to .NET as well).
Andrew has posted a good example for generating a better hash code, but also bear in mind that you shouldn't use hash codes as an equality check, since they are not guaranteed to be unique.
For a trivial example of why this is consider a double object. It has more possible values than an int so it is impossible to have a unique int for each double. Hashes are really just a first pass, used in situations like a dictionary when you need to find the key quickly, by first comparing hashes a large percentage of the possible keys can be ruled out and only the keys with matching hashes need to have the expense of a full equality check (or other collision resolution methods).
Hashing always involves collisions and you have to deal with them (e.g., compare hash values and, if they are equal, compare the values inside the classes exactly to be sure the classes are equal).
Using a simple XOR, you'll get many collisions. If you want fewer, use some mathematical functions that distribute values across different bits (bit shifts, multiplication by primes, etc.).
Read Overriding GetHashCode for mutable objects? C# and think about implementing IEquatable<T>
There are several better hash implementations. FNV hash for example.
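For instance, a hedged sketch of an FNV-1a-style GetHashCode for the Foo class from the question (the canonical FNV-1a works byte by byte; this per-field variant reuses the same constants):

public override int GetHashCode()
{
    unchecked
    {
        const uint fnvPrime = 16777619;
        uint hash = 2166136261; // FNV-1a 32-bit offset basis
        hash = (hash ^ (uint)A) * fnvPrime;
        hash = (hash ^ (uint)B) * fnvPrime;
        return (int)hash;
    }
}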
Out of curiosity, since hash codes are typically a bad idea for equality comparison, wouldn't it be better to just do the following, or am I missing something?
public override bool Equals(object obj)
{
bool isEqual = false;
Foo otherFoo = obj as Foo;
if (otherFoo != null)
{
isEqual = (this.A == otherFoo.A) && (this.B == otherFoo.B);
}
return isEqual;
}
A quick way to generate a hash with good distribution:
public override int GetHashCode()
{
return A.GetHashCode() ^ B.GetHashCode(); // XOR
}

Why is it important to override GetHashCode when Equals method is overridden?

Given the following class
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Which is preferred?
return base.GetHashCode();
//return this.FooId.GetHashCode();
}
}
I have overridden the Equals method because Foo represent a row for the Foos table. Which is the preferred method for overriding the GetHashCode?
Why is it important to override GetHashCode?
Yes, it is important if your item will be used as a key in a dictionary, or HashSet<T>, etc - since this is used (in the absence of a custom IEqualityComparer<T>) to group items into buckets. If the hash-code for two items does not match, they may never be considered equal (Equals will simply never be called).
The GetHashCode() method should reflect the Equals logic; the rules are:
if two things are equal (Equals(...) == true) then they must return the same value for GetHashCode()
if the GetHashCode() is equal, it is not necessary for them to be the same; this is a collision, and Equals will be called to see if it is a real equality or not.
In this case, it looks like "return FooId;" is a suitable GetHashCode() implementation. If you are testing multiple properties, it is common to combine them using code like below, to reduce diagonal collisions (i.e. so that new Foo(3,5) has a different hash-code to new Foo(5,3)):
In modern frameworks, the HashCode type has methods to help you create a hashcode from multiple values; on older frameworks, you'd need to go without, so something like:
unchecked // only needed if you're compiling with arithmetic checks enabled
{ // (the default compiler behaviour is *disabled*, so most folks won't need this)
int hash = 13;
hash = (hash * 7) + field1.GetHashCode();
hash = (hash * 7) + field2.GetHashCode();
...
return hash;
}
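On frameworks that include System.HashCode (.NET Core 2.1 and later, or via the Microsoft.Bcl.HashCode package), the loop above collapses to a one-liner; field1 and field2 are the same placeholder fields as in the snippet above:

public override int GetHashCode() => HashCode.Combine(field1, field2);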
Oh - for convenience, you might also consider providing == and != operators when overriding Equals and GetHashCode.
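For example, a minimal sketch of such operators for a Foo class with the FooId-based equality shown in the question:

public static bool operator ==(Foo left, Foo right)
{
    if (ReferenceEquals(left, right)) return true;  // same instance, or both null
    if (ReferenceEquals(left, null) || ReferenceEquals(right, null)) return false;
    return left.Equals(right);                      // defer to the overridden Equals
}

public static bool operator !=(Foo left, Foo right)
{
    return !(left == right);
}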
A demonstration of what happens when you get this wrong is here.
It's actually very hard to implement GetHashCode() correctly because, in addition to the rules Marc already mentioned, the hash code should not change during the lifetime of an object. Therefore the fields which are used to calculate the hash code must be immutable.
I finally found a solution to this problem when I was working with NHibernate.
My approach is to calculate the hash code from the ID of the object. The ID can only be set through the constructor, so if you want to change the ID, which is very unlikely, you have to create a new object, which has a new ID and therefore a new hash code. This approach works best with GUIDs because you can provide a parameterless constructor which randomly generates an ID.
By overriding Equals you're basically stating that you know better how to compare two instances of a given type.
Below you can see an example of how ReSharper writes a GetHashCode() function for you. Note that this snippet is meant to be tweaked by the programmer:
public override int GetHashCode()
{
unchecked
{
var result = 0;
result = (result * 397) ^ m_someVar1;
result = (result * 397) ^ m_someVar2;
result = (result * 397) ^ m_someVar3;
result = (result * 397) ^ m_someVar4;
return result;
}
}
As you can see it just tries to guess a good hash code based on all the fields in the class, but if you know your object's domain or value ranges you could still provide a better one.
Please donĀ“t forget to check the obj parameter against null when overriding Equals().
And also compare the type.
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
The reason for this is: Equals must return false on comparison to null. See also http://msdn.microsoft.com/en-us/library/bsc2ak47.aspx
How about:
public override int GetHashCode()
{
return string.Format("{0}_{1}_{2}", prop1, prop2, prop3).GetHashCode();
}
Assuming performance is not an issue :)
As of .NET 4.7 the preferred method of overriding GetHashCode() is shown below. If targeting older .NET versions, include the System.ValueTuple nuget package.
// C# 7.0+
public override int GetHashCode() => (FooId, FooName).GetHashCode();
In terms of performance, this method will outperform most composite hash code implementations. The ValueTuple is a struct so there won't be any garbage, and the underlying algorithm is as fast as it gets.
Just to add to the above answers:
If you don't override Equals, the default behavior is that the references of the objects are compared. The same applies to the hash code: the default implementation is typically based on a memory address of the reference. Because you did override Equals, the correct behavior is to compare whatever you implemented in Equals and not the references, so you should do the same for the hash code.
Clients of your class will expect the hash code to have logic similar to the Equals method. For example, LINQ methods which use an IEqualityComparer first compare the hash codes, and only if they're equal do they compare with the Equals() method, which might be more expensive to run. If we didn't implement the hash code, equal objects would probably have different hash codes (because they have different memory addresses) and would be determined wrongly as not equal (Equals() wouldn't even be hit).
In addition, apart from the problem that you might not be able to find your object in a dictionary if you used it there (because it was inserted with one hash code and when you look for it the default hash code will probably be different, so Equals() won't even be called, as Marc Gravell explains in his answer), you also introduce a violation of the dictionary or hashset concept, which should not allow identical keys:
you already declared that those objects are essentially the same when you overrode Equals, so you don't want both of them as different keys in a data structure which is supposed to have unique keys. But because they have different hash codes, the "same" key will be inserted as a different one.
It is because the framework requires that two objects that are the same must have the same hashcode. If you override the equals method to do a special comparison of two objects and the two objects are considered the same by the method, then the hash code of the two objects must also be the same. (Dictionaries and Hashtables rely on this principle).
We have two problems to cope with:
1. You cannot provide a sensible GetHashCode() if any field in the object can be changed. Also, often an object will NEVER be used in a collection that depends on GetHashCode(). So the cost of implementing GetHashCode() is often not worth it, or it is not possible.
2. If someone puts your object in a collection that calls GetHashCode(), and you have overridden Equals() without also making GetHashCode() behave correctly, that person may spend days tracking down the problem.
Therefore, by default, I do:
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Some comment to explain if there is a real problem with providing GetHashCode()
// or if I just don't see a need for it for the given class
throw new Exception("Sorry I don't know what GetHashCode should do for this class");
}
}
Hash code is used for hash-based collections like Dictionary, Hashtable, HashSet etc. The purpose of this code is to very quickly pre-sort specific object by putting it into specific group (bucket). This pre-sorting helps tremendously in finding this object when you need to retrieve it back from hash-collection because code has to search for your object in just one bucket instead of in all objects it contains. The better distribution of hash codes (better uniqueness) the faster retrieval. In ideal situation where each object has a unique hash code, finding it is an O(1) operation. In most cases it approaches O(1).
It's not necessarily important; it depends on the size of your collections and your performance requirements and whether your class will be used in a library where you may not know the performance requirements. I frequently know my collection sizes are not very large and my time is more valuable than a few microseconds of performance gained by creating a perfect hash code; so (to get rid of the annoying warning by the compiler) I simply use:
public override int GetHashCode()
{
return base.GetHashCode();
}
(Of course I could use a #pragma to turn off the warning as well but I prefer this way.)
When you are in the position that you do need the performance, all of the issues mentioned by others here apply, of course. Most importantly - otherwise you will get wrong results when retrieving items from a hash set or dictionary - the hash code must not vary over the lifetime of an object (more accurately, during any period in which the hash code is needed, such as while the object is a key in a dictionary). For example, the following is wrong, because Value is public and so can be changed externally to the class during the lifetime of the instance, so you must not use it as the basis for the hash code:
class A
{
public int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //WRONG! Value is not constant during the instance's life time
}
}
On the other hand, if Value can't be changed it's ok to use:
class A
{
public readonly int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //OK Value is read-only and can't be changed during the instance's life time
}
}
You should always guarantee that if two objects are equal, as defined by Equals(), they should return the same hash code. As some of the other comments state, in theory this is not mandatory if the object will never be used in a hash based container like HashSet or Dictionary. I would advice you to always follow this rule though. The reason is simply because it is way too easy for someone to change a collection from one type to another with the good intention of actually improving the performance or just conveying the code semantics in a better way.
For example, suppose we keep some objects in a List. Sometime later someone actually realizes that a HashSet is a much better alternative because of the better search characteristics for example. This is when we can get into trouble. List would internally use the default equality comparer for the type which means Equals in your case while HashSet makes use of GetHashCode(). If the two behave differently, so will your program. And bear in mind that such issues are not the easiest to troubleshoot.
I've summarized this behavior with some other GetHashCode() pitfalls in a blog post where you can find further examples and explanations.
As of C# 9 (.NET 5 or .NET Core 3.1), you may want to use records, as they provide value-based equality by default.
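For example, a minimal sketch:

// The compiler generates Equals, GetHashCode and the ==/!= operators from
// the positional properties, so two instances with equal FooId and FooName
// compare equal and hash identically.
public record Foo(int FooId, string FooName);

// new Foo(1, "a") == new Foo(1, "a") evaluates to true.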
It's my understanding that the original GetHashCode() returns the memory address of the object, so it's essential to override it if you wish to compare two different objects.
EDITED:
That was incorrect: the original GetHashCode() method cannot assure the equality of two values. However, objects that are the exact same instance do return the same hash code.
Using reflection over the public properties, as below, seems to me a better option, because with it you don't have to worry about the addition/removal of properties (although that is not so common a scenario). I also found it to perform better (I compared timings using a System.Diagnostics Stopwatch).
public override int GetHashCode()
{
// requires: using System.Reflection;
PropertyInfo[] theProperties = this.GetType().GetProperties();
int hash = 31;
foreach (PropertyInfo info in theProperties)
{
if (info != null)
{
var value = info.GetValue(this,null);
if(value != null)
unchecked
{
hash = 29 * hash ^ value.GetHashCode();
}
}
}
return hash;
}
