Hashing an array in c#


Short question
How to implement GetHashCode for an Array.
Details
I have an object that overrides Equals, checking that:
this.array[n] == otherObject.array[n]
for all n in array.
Naturally I should implement the complementary GetHashCode.
I was wondering if there is a .NET way to do this, or if I should implement my own, something like
hash = hash ^ array[n]
Clarification
My object contains an array, and I'm interested in GetHashCode for the elements of the array. My code for array equivalence is for example only. As my question says, though maybe I wasn't clear, I'm interested in GetHashCode (not Equals). I say I naturally should implement the complementary GetHashCode because .NET requires it once Equals is overridden (for Dictionary etc. to function correctly). Thanks.

To compute a hash code using the elements of an array, you can cast the array to IStructuralEquatable and then call the GetHashCode(IEqualityComparer) method, passing a comparer for the type of elements in the array.
(The cast is necessary because the Array class implements the method explicitly.)
For example, if your object has an int array, then you can implement GetHashCode like this:
public override int GetHashCode()
{
return ((IStructuralEquatable)this.array).GetHashCode(EqualityComparer<int>.Default);
}
In case you're curious, here's how the Array class implements the GetHashCode method (from the Reference Source):
internal static int CombineHashCodes(int h1, int h2) {
return (((h1 << 5) + h1) ^ h2);
}
int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
if (comparer == null)
throw new ArgumentNullException("comparer");
Contract.EndContractBlock();
int ret = 0;
for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
}
return ret;
}
As you can see, the current implementation only uses the last eight elements of the array.

It depends on what you want ...
One option, as Michael answered above, is to have a hash code based on the array elements. This will be in line with your Equals value semantics. However, because "As a guideline, the hash of an object must be the same over the object's entire lifetime", you will have to ensure your array does not change after its hash code has been taken. Having a mutable container with a demand that it never changes sounds error prone to me.
Your other (IMO better) option is to switch to an immutable container (e.g. ImmutableArray); then a value-based hash code makes sense. You can either use IStructuralEquatable as above or, more generally:
public override int GetHashCode() =>
Value.Aggregate(0, (total, next) => HashCode.Combine(total, next));
which will work for other immutable collections as well.

I don't agree that you should naturally implement GetHashCode on an array. You would have to update it with every change, or calculate it on the fly. I would compare directly on the fly.
SequenceEqual will use the default equality comparer, so you should also implement
public bool Equals
on the objects in the array.
Enumerable.SequenceEqual has an example:
public static void SequenceEqualEx1()
{
Pet pet1 = new Pet { Name = "Turbo", Age = 2 };
Pet pet2 = new Pet { Name = "Peanut", Age = 8 };
// Create two lists of pets.
List<Pet> pets1 = new List<Pet> { pet1, pet2 };
List<Pet> pets2 = new List<Pet> { pet1, pet2 };
bool equal = pets1.SequenceEqual(pets2);
Console.WriteLine(
"The lists {0} equal.",
equal ? "are" : "are not");
}

Using the current framework one could consider using
int value = 0;
for (var i = 0; i < this.array.Length; i++)
{
value = HashCode.Combine(this.array[i], value);
}
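For context, the loop above can be folded into a complete type that keeps Equals and GetHashCode consistent with each other. This is only a sketch: IntArrayHolder is a hypothetical name, and it assumes System.HashCode is available (it ships with .NET Core 2.1 and later; older targets would need a package or a hand-rolled combiner).

```csharp
using System;
using System.Linq;

// Hypothetical wrapper type, used only for illustration.
public sealed class IntArrayHolder : IEquatable<IntArrayHolder>
{
    private readonly int[] array;

    public IntArrayHolder(int[] array)
    {
        // Copy so later changes to the caller's array cannot alter the hash.
        this.array = (int[])array.Clone();
    }

    // Element-by-element comparison, in order.
    public bool Equals(IntArrayHolder other) =>
        other != null && array.SequenceEqual(other.array);

    public override bool Equals(object obj) => Equals(obj as IntArrayHolder);

    public override int GetHashCode()
    {
        // HashCode.Add mixes every element, in order, into the running hash.
        var hash = new HashCode();
        foreach (int item in array)
        {
            hash.Add(item);
        }
        return hash.ToHashCode();
    }
}
```

Two holders built from arrays with the same contents compare equal and return the same hash code within a process, which is what Dictionary and HashSet rely on.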

Related

Array as Dictionary key gives a lot of collisions

I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.
When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).
Here's an example demonstrating my problem:
Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());
Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());
Output:
Int32
43124074
51601393
String
406954194
406954194
As you can see, the two arrays return different hashes even though their contents are identical.
Is there any way of getting the performance of the long array hash, but the uniqueness of the string hash?
See my own answer below for a performance comparison of all the suggestions.
About the potential duplicate -- that question has a lot of useful information, but as this question was primarily about finding high performance alternatives, I think it still provides some useful solutions that are not mentioned there.

That the first one is different is actually good. Arrays are a reference type, and luckily the default GetHashCode is based on the reference (somehow). I would guess it is something like the pointer used at machine-code level, or some garbage-collector-level value: something you have no influence on, but which stays the same if you assign the same instance to another reference variable.
In the second case you are hashing the string that string.Join builds from the array, and both calls build a string with identical content ("1,2,3,0" here). string.GetHashCode is computed from the characters of the string, so equal strings yield the same hash within a run, whether or not they are the same instance.

Another alternative is to leverage the lesser known IEqualityComparer to implement your own hash and equality comparisons. There are some notes you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it'll introduce instability should the keys ever change, but it would certainly be more performant than using string joins.
public class ArrayKeyComparer : IEqualityComparer<int[]>
{
public bool Equals(int[] x, int[] y)
{
return x == null || y == null
? x == null && y == null
: x.SequenceEqual(y);
}
public int GetHashCode(int[] obj)
{
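var seed = 0;
if(obj != null)
foreach (int i in obj)
// Caution: because seed starts at 0, seed %= x leaves it at 0 for every element
// (and throws DivideByZeroException when an element is 0), so all keys share one hash.
seed %= i.GetHashCode();
return seed;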
}
}
Note that this still may not be as performant as a tuple, since it's still iterating the array rather than being able to take a more constant expression.

Your strings are returning the same hash codes for the same strings correctly because string.GetHashCode() is implemented that way.
The implementation of int[].GetHashCode() does something with its memory address to return the hash code, so arrays with identical contents will nevertheless return different hash codes.
So that's why your arrays with identical contents are returning different hash codes.
Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.
The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).
Fortunately you can cache the hash code so it's only computed once.
Another major problem with using a mutable array for a hash code is that if you change the contents of the array after using it for the key of a hashing container such as Dictionary, you will break the container.
Ideally you would only use this kind of hashing for arrays that are never changed.
Bearing all that in mind, a simple wrapper would look like this:
public sealed class IntArrayKey
{
public IntArrayKey(int[] array)
{
Array = array;
_hashCode = hashCode();
}
public int[] Array { get; }
public override int GetHashCode()
{
return _hashCode;
}
int hashCode()
{
int result = 17;
unchecked
{
foreach (var i in Array)
{
result = result * 23 + i;
}
}
return result;
}
readonly int _hashCode;
}
You can use that in place of the actual arrays for more sensible hash code generation.
As per the comments below, here's a version of the class that:
Makes a defensive copy of the array so that it cannot be modified.
Implements equality operators.
Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.
Code:
public sealed class IntArrayKey: IEquatable<IntArrayKey>
{
public IntArrayKey(IEnumerable<int> sequence)
{
_array = sequence.ToArray();
_hashCode = hashCode();
Array = new ReadOnlyCollection<int>(_array);
}
public bool Equals(IntArrayKey other)
{
if (other is null)
return false;
if (ReferenceEquals(this, other))
return true;
return _hashCode == other._hashCode && equals(other.Array);
}
public override bool Equals(object obj)
{
return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
}
public static bool operator == (IntArrayKey left, IntArrayKey right)
{
return Equals(left, right);
}
public static bool operator != (IntArrayKey left, IntArrayKey right)
{
return !Equals(left, right);
}
public IReadOnlyList<int> Array { get; }
public override int GetHashCode()
{
return _hashCode;
}
bool equals(IReadOnlyList<int> other) // other cannot be null.
{
if (_array.Length != other.Count)
return false;
for (int i = 0; i < _array.Length; ++i)
if (_array[i] != other[i])
return false;
return true;
}
int hashCode()
{
int result = 17;
unchecked
{
foreach (var i in _array)
{
result = result * 23 + i;
}
}
return result;
}
readonly int _hashCode;
readonly int[] _array;
}
If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:
public IntArrayKey(int[] array)
{
_array = array;
_hashCode = hashCode();
Array = new ReadOnlyCollection<int>(_array);
}

If you know the length of the arrays you're using, you could use a Tuple.
Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Outputs
Tuple
1248
1248

I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.
The suggestions are as follows:
int[] as key (original attempt -- does not work at all, included as a benchmark)
string as key (original solution -- works, but slow)
Tuple as key (suggested by David)
ValueTuple as key (inspired by the Tuple)
Direct int[] hash as key
IntArrayKey (suggested by Matthew Watson)
int[] as key with Skeet's IEqualityComparer
int[] as key with David's IEqualityComparer
I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).
For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.
The results are as follows (ordered from best to worst):
Algorithm                 Works?   Time usage
NonGenericSkeetEquality   YES      392 ms
SkeetEquality             YES      422 ms
ValueTuple                YES      521 ms
QuickIntArrayKey          YES      747 ms
IntArrayKey               YES      972 ms
Tuple                     YES      1 609 ms
string                    YES      2 291 ms
DavidEquality             YES      1 139 200 ms ***
int[]                     NO       336 ms
IntHash                   NO       386 ms
The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.
** I'm aware that this is not a completely fool proof solution, as I could theoretically get the expected number of collisions without it actually being the collisions I expected, but having run the test a lot of times, I'm fairly certain I don't.
*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.
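The benchmarked comparers are not reproduced on this page, so the following is only a sketch of what an IEqualityComparer<int[]> for this use case could look like. IntArrayComparer is a hypothetical name, and the multiply-and-add hash is borrowed from the IntArrayKey answer above; it is not claimed to be the exact code that was measured.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer; not the exact code that was benchmarked above.
public sealed class IntArrayComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        if (obj == null) return 0;
        unchecked
        {
            // Same multiply-and-add scheme as IntArrayKey: every element contributes.
            int hash = 17;
            foreach (int item in obj)
            {
                hash = hash * 23 + item;
            }
            return hash;
        }
    }
}
```

It would be passed to the dictionary constructor, for example new Dictionary<int[], int>(new IntArrayComparer()). The arrays used as keys must still not be modified after insertion.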

Using Linq Except not Working as I Thought

List1 contains items { A, B } and List2 contains items { A, B, C }.
What I need is to be returned { C } when I use Except Linq extension. Instead I get returned { A, B } and if I flip the lists around in my expression the result is { A, B, C }.
Am I misunderstanding the point of Except? Is there another extension I am not seeing to use?
I have looked through and tried a number of different posts on this matter with no success thus far.
var except = List1.Except(List2); //This is the line I have thus far
EDIT: Yes I was comparing simple objects. I have never used IEqualityComparer, it was interesting to learn about.
Thanks all for the help. The problem was that I had not implemented a comparer. The linked blog post and the example below were helpful.

If you are storing reference types in your list, you have to make sure there is a way to compare the objects for equality. Otherwise they will be checked by comparing if they refer to same address.
You can implement IEqualityComparer<T> and send it as a parameter to Except() function. Here's a blog post you may find helpful.
edit: the original blog post link was broken and has been replaced above

So just for completeness...
// Except gives you the items in the first set but not the second
var InList1ButNotList2 = List1.Except(List2);
var InList2ButNotList1 = List2.Except(List1);
// Intersect gives you the items that are common to both lists
var InBothLists = List1.Intersect(List2);
Edit: Since your lists contain objects you need to pass in an IEqualityComparer for your class... Here is what your except will look like with a sample IEqualityComparer based on made up objects... :)
// Except gives you the items in the first set but not the second
var equalityComparer = new MyClassEqualityComparer();
var InList1ButNotList2 = List1.Except(List2, equalityComparer);
var InList2ButNotList1 = List2.Except(List1, equalityComparer);
// Intersect gives you the items that are common to both lists
var InBothLists = List1.Intersect(List2, equalityComparer);
public class MyClass
{
public int i;
public int j;
}
class MyClassEqualityComparer : IEqualityComparer<MyClass>
{
public bool Equals(MyClass x, MyClass y)
{
return x.i == y.i &&
x.j == y.j;
}
public int GetHashCode(MyClass obj)
{
unchecked
{
if (obj == null)
return 0;
int hashCode = obj.i.GetHashCode();
// Combine both fields that Equals compares (i and j).
hashCode = (hashCode * 397) ^ obj.j.GetHashCode();
return hashCode;
}
}
}

You simply confused the order of arguments. I can see where this confusion arose, because the official documentation isn't as helpful as it could be:
Produces the set difference of two sequences by using the default equality comparer to compare values.
Unless you're versed in set theory, it may not be clear what a set difference actually is—it's not simply what's different between the sets. In reality, Except returns the list of elements in the first set that are not in the second set.
Try this:
var except = List2.Except(List1); // { C }

Writing a custom comparer does seem to solve the problem, but I think https://stackoverflow.com/a/12988312/10042740 is a much simpler and more elegant solution.
It overrides the GetHashCode() and Equals() methods in the class that defines your object; the default comparer then does its magic without extra code cluttering up the place.
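A minimal sketch of that approach, assuming a hypothetical Item type identified only by its Name (neither the type nor the helper method comes from the linked answer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type for illustration.
public sealed class Item : IEquatable<Item>
{
    public Item(string name) { Name = name; }

    public string Name { get; }

    public bool Equals(Item other) =>
        other != null && string.Equals(Name, other.Name, StringComparison.Ordinal);

    public override bool Equals(object obj) => Equals(obj as Item);

    // Must agree with Equals: equal names give equal hash codes.
    public override int GetHashCode() => Name == null ? 0 : Name.GetHashCode();
}

public static class ExceptDemo
{
    // Names of the items that appear in 'second' but not in 'first'.
    // No comparer argument is needed because Item defines its own equality.
    public static List<string> NamesOnlyInSecond(IEnumerable<Item> first, IEnumerable<Item> second) =>
        second.Except(first).Select(item => item.Name).ToList();
}
```

With lists holding A, B and A, B, C, calling ExceptDemo.NamesOnlyInSecond(list1, list2) leaves only C, even though the two lists hold different instances.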

Just for Ref:
I wanted to compare USB Drives connected and available to the system.
So this is the class which implements interface IEqualityComparer
public class DriveInfoEqualityComparer : IEqualityComparer<DriveInfo>
{
public bool Equals(DriveInfo x, DriveInfo y)
{
if (object.ReferenceEquals(x, y))
return true;
if (x == null || y == null)
return false;
// compare with Drive Level
return x.VolumeLabel.Equals(y.VolumeLabel);
}
public int GetHashCode(DriveInfo obj)
{
return obj.VolumeLabel.GetHashCode();
}
}
and you can use it like this
var newDeviceLst = DriveInfo.GetDrives()
.ToList()
.Except(inMemoryDrives, new DriveInfoEqualityComparer())
.ToList();

How does CompareTo sort a list?

Below is a list l of Product objects with Name and Price properties.
The list can be sorted alphabetically by the following class ProductNameComparer, which implements IComparer.
List<Product> l = p.GetList();
l.Sort(new ProductNameComparer());
MessageBox.Show(l[0].Name);
public class ProductNameComparer : IComparer<Product>
{
public int Compare(Product x, Product y)
{
return x.Name.CompareTo(y.Name);
}
}
I do not understand how the list is being sorted. According to MSDN CompareTo returns an Int32 type value of less than zero, zero, or greater than zero. If I have:
string c = "Apple";
string d = "Orange";
return c.CompareTo(d);
The function will return "-1".
But if I write l.Sort(-1) instead of l.Sort(new ProductNameComparer()), the code doesn't compile.
Also, why does Compare(Product x, Product y) take only two Products as arguments and yet manage to compare and sort a list of more than two products?

The Sort method doesn't just call Compare once - it calls it multiple times, whenever it needs to compare two items. It's a general sort algorithm which is able to sort any collection of items, so long as it can compare any two of them in a consistent way.
The code doesn't compile if you try to call l.Sort(-1) because that's just trying to pass in an integer - what would that even mean?
You need to understand that you're not giving the Sort method one comparison result - you're giving it the ability to compare whichever items it needs to.
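One way to see this for yourself is a comparer that counts how often Sort calls it. The sketch below uses strings rather than Product, and the names CountingComparer and SortDemo are made up for illustration; the exact number of calls depends on the sort algorithm and the input.

```csharp
using System.Collections.Generic;

// Hypothetical comparer that counts how often Sort asks it to compare two items.
public sealed class CountingComparer : IComparer<string>
{
    public int Calls { get; private set; }

    public int Compare(string x, string y)
    {
        Calls++;
        return string.CompareOrdinal(x, y);
    }
}

public static class SortDemo
{
    // Sorts the list in place and returns how many comparisons Sort requested.
    public static int SortAndCount(List<string> items)
    {
        var comparer = new CountingComparer();
        items.Sort(comparer);
        return comparer.Calls;
    }
}
```

Sorting a list of three or more names reports more than one call, each of which looked at just two items.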

For the purpose of demonstration here is a possible implementation of the Sort method (a highly inefficient one, I know):
public void Sort(System.Collections.Generic.IComparer<T> comparer)
{
for (int i = 0; i < this.Count - 1; i++)
{
for (int j = i + 1; j < this.Count; j++)
{
if (comparer.Compare(this[i], this[j]) > 0)
{
T tmp = this[i];
this[i] = this[j];
this[j] = tmp;
}
}
}
}

The Sort method overload used in your example (new ProductNameComparer()) requires the parameter to implement the IComparer<T> interface. Calling Sort(-1) won't work since int doesn't implement this interface. As per @JonSkeet, the result of calling CompareTo() is used by the sorting strategy to order the list.

Why do 2 delegate instances return the same hashcode?

Take the following:
var x = new Action(() => { Console.Write("") ; });
var y = new Action(() => { });
var a = x.GetHashCode();
var b = y.GetHashCode();
Console.WriteLine(a == b);
Console.WriteLine(x == y);
This will print:
True
False
Why is the hashcode the same?
It is kinda surprising, and will make using delegates in a Dictionary as slow as a List (aka O(n) for lookups).
Update:
The question is why. IOW who made such a (silly) decision?
A better hashcode implementation would have been:
return Method ^ (Target == null ? 0 : Target.GetHashCode());
// where Method is IntPtr

Easy! Here is the implementation of GetHashCode (sitting on the base class Delegate):
public override int GetHashCode()
{
return base.GetType().GetHashCode();
}
(sitting on the base class MulticastDelegate which will call above):
public sealed override int GetHashCode()
{
if (this.IsUnmanagedFunctionPtr())
{
return ValueType.GetHashCodeOfPtr(base._methodPtr);
}
object[] objArray = this._invocationList as object[];
if (objArray == null)
{
return base.GetHashCode();
}
int num = 0;
for (int i = 0; i < ((int) this._invocationCount); i++)
{
num = (num * 0x21) + objArray[i].GetHashCode();
}
return num;
}
Using tools such as Reflector, we can see the code and it seems like the default implementation is as strange as we see above.
The type value here will be Action. Hence the result above is correct.
UPDATE

My first attempt of a better implementation:
public class DelegateEqualityComparer:IEqualityComparer<Delegate>
{
public bool Equals(Delegate del1,Delegate del2)
{
return (del1 != null) && del1.Equals(del2);
}
public int GetHashCode(Delegate obj)
{
if(obj==null)
return 0;
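int result = obj.Method.GetHashCode() ^ obj.GetType().GetHashCode();
// Hash the target by reference identity rather than hashing the delegate instance itself,
// so that two equal delegates produce the same hash code.
if(obj.Target != null)
result ^= RuntimeHelpers.GetHashCode(obj.Target);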
return result;
}
}
The quality of this should be good for single cast delegates, but not so much for multicast delegates (If I recall correctly Target/Method return the values of the last element delegate).
But I'm not really sure if it fulfills the contract in all corner cases.
Hmm, it looks like equality requires referential equality of the targets.

This smells like some of the cases mentioned in this thread, maybe it will give you some pointers on this behaviour. else, you could log it there :-)
What's the strangest corner case you've seen in C# or .NET?
Rgds GJ

From MSDN:
The default implementation of GetHashCode does not guarantee uniqueness or consistency; therefore, it must not be used as a unique object identifier for hashing purposes. Derived classes must override GetHashCode with an implementation that returns a unique hash code. For best results, the hash code must be based on the value of an instance field or property, instead of a static field or property.
So if you have not overridden the GetHashCode method, it may return the same value. I suspect this is because it generates it from the definition, not the instance.

What's the role of GetHashCode in the IEqualityComparer<T> in .NET?

I'm trying to understand the role of the GetHashCode method of the interface IEqualityComparer.
The following example is taken from MSDN:
using System;
using System.Collections.Generic;
class Example {
static void Main() {
try {
BoxEqualityComparer boxEqC = new BoxEqualityComparer();
Dictionary<Box, String> boxes = new Dictionary<Box,
string>(boxEqC);
Box redBox = new Box(4, 3, 4);
Box blueBox = new Box(4, 3, 4);
boxes.Add(redBox, "red");
boxes.Add(blueBox, "blue");
Console.WriteLine(redBox.GetHashCode());
Console.WriteLine(blueBox.GetHashCode());
}
catch (ArgumentException argEx) {
Console.WriteLine(argEx.Message);
}
}
}
public class Box {
public Box(int h, int l, int w) {
this.Height = h;
this.Length = l;
this.Width = w;
}
public int Height { get; set; }
public int Length { get; set; }
public int Width { get; set; }
}
class BoxEqualityComparer : IEqualityComparer<Box> {
public bool Equals(Box b1, Box b2) {
if (b1.Height == b2.Height & b1.Length == b2.Length
& b1.Width == b2.Width) {
return true;
}
else {
return false;
}
}
public int GetHashCode(Box bx) {
int hCode = bx.Height ^ bx.Length ^ bx.Width;
return hCode.GetHashCode();
}
}
Shouldn't the Equals method implementation be enough to compare two Box objects? That is where we tell the framework the rule used to compare the objects. Why is the GetHashCode needed?
Thanks.
Lucian

A bit of background first...
Every object in .NET has an Equals method and a GetHashCode method.
The Equals method is used to compare one object with another object - to see if the two objects are equivalent.
The GetHashCode method generates a 32-bit integer representation of the object. Since there is no limit to how much information an object can contain, certain hash codes are shared by multiple objects - so the hash code is not necessarily unique.
A dictionary is a really cool data structure that trades a higher memory footprint in return for (more or less) constant costs for Add/Remove/Get operations. It is a poor choice for iterating over though. Internally, a dictionary contains an array of buckets, where values can be stored. When you add a Key and Value to a dictionary, the GetHashCode method is called on the Key. The hashcode returned is used to determine the index of the bucket in which the Key/Value pair should be stored.
When you want to access the Value, you pass in the Key again. The GetHashCode method is called on the Key, and the bucket containing the Value is located.
When an IEqualityComparer is passed into the constructor of a dictionary, the IEqualityComparer.Equals and IEqualityComparer.GetHashCode methods are used instead of the methods on the Key objects.
Now to explain why both methods are necessary, consider this example:
BoxEqualityComparer boxEqC = new BoxEqualityComparer();
Dictionary<Box, String> boxes = new Dictionary<Box, string>(boxEqC);
Box redBox = new Box(100, 100, 25);
Box blueBox = new Box(1000, 1000, 25);
boxes.Add(redBox, "red");
boxes.Add(blueBox, "blue");
Using the BoxEqualityComparer.GetHashCode method in your example, both of these boxes have the same hashcode - 100^100^25 = 1000^1000^25 = 25 - even though they are clearly not the same object. The reason that they have the same hashcode in this case is that you are using the ^ (bitwise exclusive-OR) operator, so 100^100 cancels out leaving zero, as does 1000^1000. When two different objects have the same hashcode, we call that a collision.
When we add two Key/Value pairs with the same hashcode to a dictionary, they are both stored in the same bucket. So when we want to retrieve a Value, the GetHashCode method is called on our Key to locate the bucket. Since there is more than one value in the bucket, the dictionary iterates over all of the Key/Value pairs in the bucket calling the Equals method on the Keys to find the correct one.
In the example that you posted, the two boxes are equivalent, so the Equals method returns true. In this case the dictionary has two identical Keys, so it throws an exception.
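The collision described above is easy to check in code. The sketch below reproduces the XOR hash from the example and, as one possible alternative that is not part of the MSDN sample, combines the three fields with HashCode.Combine (assumed to be available, i.e. .NET Core 2.1 or later). BoxHashDemo is a hypothetical name.

```csharp
using System;

public static class BoxHashDemo
{
    // The XOR-based hash from the example: equal dimensions cancel each other out.
    public static int XorHash(int height, int length, int width) =>
        height ^ length ^ width;

    // One possible alternative (an assumption, not from the MSDN sample):
    // HashCode.Combine mixes the fields so that equal values do not cancel.
    public static int CombinedHash(int height, int length, int width) =>
        HashCode.Combine(height, length, width);
}
```

BoxHashDemo.XorHash(100, 100, 25) and BoxHashDemo.XorHash(1000, 1000, 25) both evaluate to 25, so the two boxes land in the same bucket and Equals has to tell them apart.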
TLDR
So in summary, the GetHashCode method is used to generate an address where the object is stored. So a dictionary doesn't have to search for it. It just computes the hashcode and jumps to that location. The Equals method is a better test of equality, but cannot be used to map an object into an address space.

GetHashCode is used in Dictionary collections, where it creates the hash used for storing objects. Here is a nice article on why and how to use IEqualityComparer and GetHashCode: http://dotnetperls.com/iequalitycomparer

While it would be possible for a Dictionary<TKey,TValue> to have its TryGetValue and similar methods call Equals on every single stored key to see whether it matches the one being sought, that would be very slow. Instead, like many hash-based collections, it relies upon GetHashCode to quickly exclude most non-matching values from consideration. If calling GetHashCode on an item being sought yields 42, and a collection has 53,917 items, but calling GetHashCode on 53,914 of the items yielded a value other than 42, then only 3 items will have to be compared to the one being sought. The other 53,914 may safely be ignored.
The reason a GetHashCode is included in an IEqualityComparer<T> is to allow for the possibility that a dictionary's consumer might want to regard as equal objects that would normally not regard each other as equal. The most common example would be a caller that wants to use strings as keys but use case-insensitive comparisons. In order to make that work efficiently, the dictionary will need to have some form of hash function that will yield the same value for "Fox" and "FOX", but hopefully yield something else for "box" or "zebra". Since the GetHashCode method built into String doesn't work that way, the dictionary will need to get such a method from somewhere else, and IEqualityComparer<T> is the most logical place since the need for such a hash code would be very strongly associated with an Equals method that considers "Fox" and "FOX" identical to each other, but not to "box" or "zebra".
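The framework already ships such a comparer for the case-insensitive string scenario. As a small sketch (CaseInsensitiveDemo and the sample keys are made up for illustration), StringComparer.OrdinalIgnoreCase can be handed to the dictionary constructor:

```csharp
using System;
using System.Collections.Generic;

public static class CaseInsensitiveDemo
{
    public static Dictionary<string, int> Build()
    {
        // StringComparer.OrdinalIgnoreCase supplies both Equals and GetHashCode,
        // so "Fox" and "FOX" hash to the same value and compare as equal.
        var animals = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        animals["Fox"] = 1;
        animals["zebra"] = 2;
        return animals;
    }
}
```

Looking up "FOX" in the returned dictionary finds the entry stored under "Fox", while "box" is not found.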
