Arrays/Lists and computing hash values (VB, C#)

I feel bad asking this question, but I'm currently not able to program and test this, as I'm writing this on my cell phone and not on my dev machine :P (Easy rep points if someone answers! XD)
Anyway, I've had experience with using hash values from String objects. E.g., if I have StringA and StringB both equal to "foo", they'll both produce the same hash value, because they hold equal values.
Now what if I have a List<T>, with T being a native data type? If I tried to compute the hash values of ListA and ListB, assuming they're both the same size and contain the same information, wouldn't they have equal hash values as well?
Assuming a sample dataset of byte with a length of 5:
{5,2,0,1,3}

It depends on how you calculate the hash value and how you define equality. For example, two different instances of an array which happen to contain the same values may not be considered equal depending on your application. In this case you may include the address or some other unique value per array as part of the hash function.
However, if you want to consider two distinct arrays which contain the same values equal, you would calculate the list hash using only the values in the array. Of course, you then have to consider whether ordering matters to you or not in determining equality (and thus in influencing your hash function).

If the order of items is important, then you could generate a sequence hash code like this:
public static int GetOrderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int hash = 269;
        foreach (T item in source)
        {
            // fold each item's hash into the running hash
            hash = (hash * 17) + item.GetHashCode();
        }
        return hash;
    }
}
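With that in place, two lists holding the question's sample data report matching hashes (a quick sketch):

var a = new List<byte> { 5, 2, 0, 1, 3 };
var b = new List<byte> { 5, 2, 0, 1, 3 };
Console.WriteLine(a.GetOrderedHashCode() == b.GetOrderedHashCode()); // True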
If the order of items isn't important, then you could do something like this instead:
public static int GetUnorderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int sum = 907;
        int count = 953;
        foreach (T item in source)
        {
            sum = sum + item.GetHashCode();
            count++;
        }
        return 991 * sum * count;
    }
}
(Note that both of these methods will have poor performance for larger collections, in which case you might want to implement some sort of cache and only recalculate the hashcode when the collection changes.)
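For instance, a minimal sketch of such a cache (HashedList is a hypothetical name; it invalidates the stored hash on every mutation and reuses the GetOrderedHashCode extension above):

public class HashedList<T> : System.Collections.ObjectModel.Collection<T>
{
    private int? _cachedHash;

    protected override void InsertItem(int index, T item)
    {
        _cachedHash = null; // any mutation invalidates the cached hash
        base.InsertItem(index, item);
    }

    protected override void SetItem(int index, T item)
    {
        _cachedHash = null;
        base.SetItem(index, item);
    }

    protected override void RemoveItem(int index)
    {
        _cachedHash = null;
        base.RemoveItem(index);
    }

    protected override void ClearItems()
    {
        _cachedHash = null;
        base.ClearItems();
    }

    public override int GetHashCode()
    {
        // Recompute only when the collection has changed since the last call.
        if (_cachedHash == null)
            _cachedHash = this.GetOrderedHashCode();
        return _cachedHash.Value;
    }
}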

If you're talking about the built-in list types, then no, they will not be equal. Why? Because List<T> is a reference type, so equality does a comparison to see if the references are the same. If you are creating a custom list type, then you could override the Equals and GetHashCode methods to support this behavior, but it isn't going to happen with the built-in types.
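A quick way to see this (SequenceEqual from System.Linq does the content-based comparison):

var listA = new List<byte> { 5, 2, 0, 1, 3 };
var listB = new List<byte> { 5, 2, 0, 1, 3 };

Console.WriteLine(listA.Equals(listB));                        // False - reference comparison
Console.WriteLine(listA.GetHashCode() == listB.GetHashCode()); // False in practice - identity-based hashes
Console.WriteLine(listA.SequenceEqual(listB));                 // True  - compares the contents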

Related

Array as Dictionary key gives a lot of collisions

I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.
When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).
Here's an example demonstrating my problem:
Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());
Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());
Output:
Int32
43124074
51601393
String
406954194
406954194
As you can see, the arrays return a different hash.
Is there any way of getting the performance of the long array hash, but the uniqueness of the string hash?
See my own answer below for a performance comparison of all the suggestions.
About the potential duplicate -- that question has a lot of useful information, but as this question was primarily about finding high performance alternatives, I think it still provides some useful solutions that are not mentioned there.
That the first one is different is actually good. Arrays are a reference type, and the hash is generated from the reference (somehow) - something like the pointer used at machine-code level, or some per-instance value the runtime tracks. It's one of the things you have no influence on, but it is carried along if you assign the same instance to a new reference variable.
In the 2nd case you get the hash of the string "1,2,3,0": string.Join calls ToString() on each element and joins the results with the separator, so both expressions build the same string. And since string has all those funny special rules - "compares like a value type", string interning - equal strings produce equal hash codes.
Another alternative is to leverage the lesser known IEqualityComparer to implement your own hash and equality comparisons. There are some notes you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it'll introduce instability should the keys ever change, but it would certainly be more performant than using string joins.
public class ArrayKeyComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        return x == null || y == null
            ? x == null && y == null
            : x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        // Note: seed starts at 0, and 0 % n is always 0, so this returns 0
        // for every array (and throws if an element is 0) - effectively a
        // constant hash that puts all keys in a single bucket.
        var seed = 0;
        if (obj != null)
            foreach (int i in obj)
                seed %= i.GetHashCode();
        return seed;
    }
}
Note that this still may not be as performant as a tuple, since it's still iterating the array rather than being able to take a more constant expression.
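Wiring the comparer into a dictionary is just a constructor argument, e.g.:

var map = new Dictionary<int[], long>(new ArrayKeyComparer());
map[new[] { 1, 2, 3, 0 }] = 42;
Console.WriteLine(map[new[] { 1, 2, 3, 0 }]); // 42 - matched via Equals across distinct array instances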
Your strings return the same hash codes for the same contents because string.GetHashCode() computes the hash from the characters of the string.
The implementation of GetHashCode() for an array is based on the object's identity (its reference) rather than its contents, so arrays with identical contents will nevertheless return different hash codes.
So that's why your arrays with identical contents are returning different hash codes.
Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.
The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).
Fortunately you can cache the hash code so it's only computed once.
Another major problem with using a mutable array for a hash code is that if you change the contents of the array after using it for the key of a hashing container such as Dictionary, you will break the container.
Ideally you would only use this kind of hashing for arrays that are never changed.
Bearing all that in mind, a simple wrapper would look like this:
public sealed class IntArrayKey
{
    public IntArrayKey(int[] array)
    {
        Array = array;
        _hashCode = hashCode();
    }

    public int[] Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    int hashCode()
    {
        int result = 17;

        unchecked
        {
            foreach (var i in Array)
            {
                result = result * 23 + i;
            }
        }

        return result;
    }

    readonly int _hashCode;
}
You can use that in place of the actual arrays for more sensible hash code generation.
As per the comments below, here's a version of the class that:
Makes a defensive copy of the array so that it cannot be modified.
Implements equality operators.
Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.
Code:
public sealed class IntArrayKey : IEquatable<IntArrayKey>
{
    public IntArrayKey(IEnumerable<int> sequence)
    {
        _array = sequence.ToArray();
        _hashCode = hashCode();
        Array = new ReadOnlyCollection<int>(_array);
    }

    public bool Equals(IntArrayKey other)
    {
        if (other is null)
            return false;
        if (ReferenceEquals(this, other))
            return true;
        return _hashCode == other._hashCode && equals(other.Array);
    }

    public override bool Equals(object obj)
    {
        return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
    }

    public static bool operator ==(IntArrayKey left, IntArrayKey right)
    {
        return Equals(left, right);
    }

    public static bool operator !=(IntArrayKey left, IntArrayKey right)
    {
        return !Equals(left, right);
    }

    public IReadOnlyList<int> Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    bool equals(IReadOnlyList<int> other) // other cannot be null.
    {
        if (_array.Length != other.Count)
            return false;

        for (int i = 0; i < _array.Length; ++i)
            if (_array[i] != other[i])
                return false;

        return true;
    }

    int hashCode()
    {
        int result = 17;

        unchecked
        {
            foreach (var i in _array)
            {
                result = result * 23 + i;
            }
        }

        return result;
    }

    readonly int _hashCode;
    readonly int[] _array;
}
If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:
public IntArrayKey(int[] array)
{
    _array = array;
    _hashCode = hashCode();
    Array = new ReadOnlyCollection<int>(_array);
}
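Either way, usage looks like this (a sketch using the equality-implementing version above):

var map = new Dictionary<IntArrayKey, string>();
map[new IntArrayKey(new[] { 1, 2, 3, 0 })] = "found";

// A different instance wrapping equal contents locates the same entry:
Console.WriteLine(map[new IntArrayKey(new[] { 1, 2, 3, 0 })]); // found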
If you know the length of the arrays you're using, you could use a Tuple.
Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Outputs
Tuple
1248
1248
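If your framework version has ValueTuple support, the same idea works without the Tuple allocation - this is the ValueTuple entry in the benchmark answer below (a sketch):

Console.WriteLine("ValueTuple");
Console.WriteLine((1, 2, 3, 0).GetHashCode());
Console.WriteLine((1, 2, 3, 0).GetHashCode()); // same value both times: ValueTuple hashes by content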
I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.
The suggestions are as follows:
int[] as key (original attempt -- does not work at all, included as a benchmark)
string as key (original solution -- works, but slow)
Tuple as key (suggested by David)
ValueTuple as key (inspired by the Tuple)
Direct int[] hash as key
IntArrayKey (suggested by Matthew Watson)
int[] as key with Skeet's IEqualityComparer
int[] as key with David's IEqualityComparer
I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).
For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.
The results are as follows (ordered from best to worst):
Algorithm                Works?    Time usage
NonGenericSkeetEquality  YES              392 ms
SkeetEquality            YES              422 ms
ValueTuple               YES              521 ms
QuickIntArrayKey         YES              747 ms
IntArrayKey              YES              972 ms
Tuple                    YES            1 609 ms
string                   YES            2 291 ms
DavidEquality            YES        1 139 200 ms ***
int[]                    NO               336 ms
IntHash                  NO               386 ms
The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.
** I'm aware that this is not a completely foolproof check, as I could theoretically get the expected number of collisions without them actually being the collisions I expected, but having run the test a lot of times, I'm fairly certain I don't.
*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.

Dictionary hash function for fuzzy lookups

When an approximated comparison between strings is required, the basic Levenshtein Distance can help. It measures the number of single-character modifications needed to turn one string into another:
"aaaa" vs "aaab" => 1
"abba" vs "aabb" => 2
"aaaa" vs "a" => 3
When using a Dictionary<T, U> one can provide a custom IEqualityComparer<T>. One can implement the Levenshtein Distance as an IEqualityComparer<string>:
public class LevenshteinStringComparer : IEqualityComparer<string>
{
    private readonly int _maximumDistance;

    public LevenshteinStringComparer(int maximumDistance)
        => _maximumDistance = maximumDistance;

    public bool Equals(string x, string y)
        => ComputeLevenshteinDistance(x, y) <= _maximumDistance;

    public int GetHashCode(string obj)
        => 0;

    private static int ComputeLevenshteinDistance(string s, string t)
    {
        // Omitted for simplicity
        // Example can be found here: https://www.dotnetperls.com/levenshtein
    }
}
So we can use a fuzzy dictionary:
var dict = new Dictionary<string, int>(new LevenshteinStringComparer(2));
dict["aaa"] = 1;
dict["aab"] = 2; // Modify existing value under "aaa" key
// Only one key was created:
dict.Keys => { "aaa" }
Having all this set up, you may have noticed that we haven't implemented a proper GetHashCode in the LevenshteinStringComparer, which would be greatly appreciated by the dictionary. As a rule of thumb regarding hash codes, I'd use:
Unequal objects should not have the same hash code
Equal objects must have the same hash code
The only possible hash function following these rules that I can imagine is a constant number, just as implemented in the given code. This isn't optimal, though. If we start, for example, taking the default hash of the string, then aaa and aab would end up with different hashes, even though they are handled as equal. Thinking this through, it means all possible strings would have to share the same hash.
Am I correct? And why does the performance of the dictionary get better when I use the default string hash function, even though it is inconsistent with our comparer? Shouldn't this make the hash buckets inside the dictionary invalid?
public int GetHashCode(string obj)
=> obj.GetHashCode();
I don't think there is a hashing function that could work in your case.
The problem is that you have to assign the bucket based on a single value only, while you can't know what was added before. But the Levenshtein distance of the item being hashed can be anything from 0 to "infinity"; the only thing that matters is what it is compared with. Hence you cannot satisfy the second condition of the hashing function (that equal objects have the same hash code).
Another "pseudo-proof" would be the situation where you want a maximum distance of 2 and you already have two items in the dictionary with a mutual distance of 3. If you then add a string which is at distance 2 from the first item and distance 1 from the second item, how would you decide which item it should match? It satisfies your maximum for both items, but it should probably match the second one rather than the first. Not knowing anything about the contents of the dictionary, you cannot know how to hash it correctly.
For the second question - using the default string.GetHashCode() method does improve performance, but it destroys the functionality of your equality comparer. If you test this solution on your sample code, you can see that the dict will contain two keys now. This is because GetHashCode returned two different hash codes, so there was no conflict and dict now has two buckets and your Equals method was not even executed.
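You can see this by running the question's sample against the default-hash version:

var dict = new Dictionary<string, int>(new LevenshteinStringComparer(2));
dict["aaa"] = 1;
dict["aab"] = 2;

// With GetHashCode(obj) => obj.GetHashCode(), dict.Keys is now { "aaa", "aab" }:
// the two keys hash to different buckets, so Equals is never even called.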
I can understand fuzzy lookup. But not fuzzy storage. Why would you want to overwrite "aaa" when assigning a value for "aab"? If all you want is fuzzy lookup wouldn't it be better to have a normal dictionary which has an extension to do a fuzzy lookup like...
public static class DictionaryExtensions
{
public static IEnumerable<T> FuzzyMatch<T>(this IDictionary<string, T> dictionary, string key, int distance = 2)
{
IEqualityComparer<string> comparer = new LevenshteinStringComparer(distance);
return dictionary
.Keys
.Where(k => comparer.Equals(k, key))
.Select(k => dictionary[k]);
}
}
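Usage would then look something like this (a sketch reusing the question's data):

var dict = new Dictionary<string, int> { ["aaa"] = 1, ["xyz"] = 2 };

foreach (var value in dict.FuzzyMatch("aab"))
    Console.WriteLine(value); // 1 - only "aaa" is within distance 2 of "aab"

Note that each FuzzyMatch call scans every key, so lookups cost one Levenshtein computation per key - the price of keeping the dictionary itself well-behaved.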
This is more of a comment than an answer. To answer your question, if you consider the following example...
"abba" vs "cbbc" => 2
"cddc" vs "cbbc" => 2
"abba" vs "cddc" => 4
You get the gist here? I.e., clearly it's not possible for all of the following to hold at once, since equality - and therefore hashing - would have to be transitive:
abba == cbbc &&
cddc == cbbc &&
abba != cddc

Grouping unique integer arrays

One module in my app generates a small array of integers. Typically the size is 25 integers. The integers tend to be pretty small, less than 10000. I'd like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.
So, for every new array I need to figure out if it already exists, and if it does, what its index is.
A naive approach is to keep all arrays in a list and then just call:
MyList.FindIndex(x=>x.SequenceEqual(Small_Array));
But this becomes very slow once the number of arrays gets into the thousands.
A less naive approach is to store all arrays in a dictionary where the key is a hash value of the array. But if the hash is just another (32-bit) integer, then I cannot find a good hashing algorithm that doesn't collide.
Which, I think, leaves me with using a hashing algorithm like MD5 that can be converted into a 128-bit integer. Is that a good way to tackle my problem?
Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".
The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great lengths to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:
public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}
You'd then create the dictionary like this:
var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());
Then you'd have something like:
public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}
Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.
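As a hedged sketch of the ConcurrentDictionary variant (note the value factory is not atomic with respect to Count, so the index assignment below is only safe without concurrent writers):

using System.Collections.Concurrent;

var concurrentMap = new ConcurrentDictionary<int[], int>(new Int32ArrayEqualityComparer());

int MaybeAddArrayConcurrent(int[] array) =>
    // GetOrAdd folds the lookup and the insert into a single call.
    concurrentMap.GetOrAdd(array, _ => concurrentMap.Count + 1);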

What's the role of GetHashCode in the IEqualityComparer<T> in .NET?

I'm trying to understand the role of the GetHashCode method of the interface IEqualityComparer.
The following example is taken from MSDN:
using System;
using System.Collections.Generic;
class Example {
    static void Main() {
        try {
            BoxEqualityComparer boxEqC = new BoxEqualityComparer();
            Dictionary<Box, string> boxes = new Dictionary<Box, string>(boxEqC);
            Box redBox = new Box(4, 3, 4);
            Box blueBox = new Box(4, 3, 4);
            boxes.Add(redBox, "red");
            boxes.Add(blueBox, "blue");
            Console.WriteLine(redBox.GetHashCode());
            Console.WriteLine(blueBox.GetHashCode());
        }
        catch (ArgumentException argEx) {
            Console.WriteLine(argEx.Message);
        }
    }
}

public class Box {
    public Box(int h, int l, int w) {
        this.Height = h;
        this.Length = l;
        this.Width = w;
    }

    public int Height { get; set; }
    public int Length { get; set; }
    public int Width { get; set; }
}

class BoxEqualityComparer : IEqualityComparer<Box> {
    public bool Equals(Box b1, Box b2) {
        if (b1.Height == b2.Height && b1.Length == b2.Length
            && b1.Width == b2.Width) {
            return true;
        }
        else {
            return false;
        }
    }

    public int GetHashCode(Box bx) {
        int hCode = bx.Height ^ bx.Length ^ bx.Width;
        return hCode.GetHashCode();
    }
}
Shouldn't the Equals method implementation be enough to compare two Box objects? That is where we tell the framework the rule used to compare the objects. Why is the GetHashCode needed?
Thanks.
Lucian
A bit of background first...
Every object in .NET has an Equals method and a GetHashCode method.
The Equals method is used to compare one object with another object - to see if the two objects are equivalent.
The GetHashCode method generates a 32-bit integer representation of the object. Since there is no limit to how much information an object can contain, but only 2^32 possible hash codes, certain hash codes must be shared by multiple objects - so the hash code is not necessarily unique.
A dictionary is a really cool data structure that trades a higher memory footprint in return for (more or less) constant costs for Add/Remove/Get operations. It is a poor choice for iterating over though. Internally, a dictionary contains an array of buckets, where values can be stored. When you add a Key and Value to a dictionary, the GetHashCode method is called on the Key. The hashcode returned is used to determine the index of the bucket in which the Key/Value pair should be stored.
When you want to access the Value, you pass in the Key again. The GetHashCode method is called on the Key, and the bucket containing the Value is located.
When an IEqualityComparer is passed into the constructor of a dictionary, the IEqualityComparer.Equals and IEqualityComparer.GetHashCode methods are used instead of the methods on the Key objects.
Now to explain why both methods are necessary, consider this example:
BoxEqualityComparer boxEqC = new BoxEqualityComparer();
Dictionary<Box, String> boxes = new Dictionary<Box, string>(boxEqC);
Box redBox = new Box(100, 100, 25);
Box blueBox = new Box(1000, 1000, 25);
boxes.Add(redBox, "red");
boxes.Add(blueBox, "blue");
Using the BoxEqualityComparer.GetHashCode method in your example, both of these boxes have the same hashcode - 100^100^25 = 1000^1000^25 = 25 - even though they are clearly not the same object. The reason they have the same hashcode in this case is that you are using the ^ (bitwise exclusive-OR) operator, so 100^100 cancels out leaving zero, as does 1000^1000. When two different objects have the same hash code, we call that a collision.
When we add two Key/Value pairs with the same hashcode to a dictionary, they are both stored in the same bucket. So when we want to retrieve a Value, the GetHashCode method is called on our Key to locate the bucket. Since there is more than one value in the bucket, the dictionary iterates over all of the Key/Value pairs in the bucket calling the Equals method on the Keys to find the correct one.
In the example that you posted, the two boxes are equivalent, so the Equals method returns true. In this case the dictionary has two identical Keys, so it throws an exception.
TLDR
So in summary, the GetHashCode method is used to pick the bucket where the object is stored, so a dictionary doesn't have to search for it - it just computes the hashcode and jumps to that location. The Equals method is the better test of equality, but cannot on its own map an object into an address space.
GetHashCode is used in Dictionary collections, and it creates the hash for storing objects in them. Here is a nice article on why and how to use IEqualityComparer and GetHashCode: http://dotnetperls.com/iequalitycomparer
While it would be possible for a Dictionary<TKey,TValue> to have its GetValue and similar methods call Equals on every single stored key to see whether it matches the one being sought, that would be very slow. Instead, like many hash-based collections, it relies upon GetHashCode to quickly exclude most non-matching values from consideration. If calling GetHashCode on an item being sought yields 42, and a collection has 53,917 items, but calling GetHashCode on 53,914 of the items yielded a value other than 42, then only 3 items will have to be compared to the one being sought. The other 53,914 may safely be ignored.
The reason a GetHashCode is included in an IEqualityComparer<T> is to allow for the possibility that a dictionary's consumer might want to regard as equal objects that would normally not regard each other as equal. The most common example would be a caller that wants to use strings as keys but use case-insensitive comparisons. In order to make that work efficiently, the dictionary will need to have some form of hash function that will yield the same value for "Fox" and "FOX", but hopefully yield something else for "box" or "zebra". Since the GetHashCode method built into String doesn't work that way, the dictionary will need to get such a method from somewhere else, and IEqualityComparer<T> is the most logical place since the need for such a hash code would be very strongly associated with an Equals method that considers "Fox" and "FOX" identical to each other, but not to "box" or "zebra".
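For the case-insensitive example, the framework already ships such a pairing: StringComparer.OrdinalIgnoreCase implements IEqualityComparer<string> with a matching Equals and GetHashCode, so "Fox" and "FOX" land in the same bucket:

var lookup = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
lookup["Fox"] = 1;
Console.WriteLine(lookup["FOX"]);             // 1 - same key under this comparer
Console.WriteLine(lookup.ContainsKey("box")); // False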

Examining two string arrays for equivalence

Is there a better way to examine whether two string arrays have the same contents than this?
string[] first = new string[] { "cat", "and", "mouse" };
string[] second = new string[] { "cat", "and", "mouse" };

bool contentsEqual = true;

if (first.Length == second.Length) {
    foreach (string s in first)
    {
        contentsEqual &= second.Contains(s);
    }
}
else {
    contentsEqual = false;
}

Console.WriteLine(contentsEqual.ToString()); // true
Enumerable.SequenceEqual if they're supposed to be in the same order.
You should consider using the Intersect method. It will give you all the matching values, and then you can just compare the count of the result with the count of one of the arrays being compared.
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.intersect.aspx
This is O(n^2). If the arrays have the same length, sort them, then compare elements in the same position. This is O(n log n).
Or you can use a hash set or dictionary: insert each word in the first array, then see if every word in the second array is in the set or dictionary. This is O(n) on average.
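A sketch of that set-based approach (note it compares the arrays as sets, so duplicate counts are ignored; an exact multiset comparison would need per-item counts):

static bool SameContents(string[] first, string[] second) =>
    first.Length == second.Length &&
    new HashSet<string>(first).SetEquals(second);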
Nothing wrong with the logic of the method, but the fact that you're testing Contains for each item in the first sequence means the algorithm runs in O(n^2) time in general. You can also make one or two other smaller optimisations and improvements:
I would implement such a function as follows. Define an extension method as such (example in .NET 4.0).
public static bool SequenceEquals<T>(this IEnumerable<T> seq1, IEnumerable<T> seq2)
{
    // Note: Zip stops at the end of the shorter sequence, so add a separate
    // length check if the sequences may differ in length.
    foreach (var pair in seq1.Zip(seq2, (first, second) => new { first, second }))
    {
        if (!pair.first.Equals(pair.second))
            return false;
    }

    return true;
}
You could try Enumerable.Intersect: http://msdn.microsoft.com/en-us/library/bb460136.aspx
The result of the operation is every element that is common to both arrays. If the length of the result is equal to the length of both arrays, then the two arrays contain the same items.
Enumerable.Union: http://msdn.microsoft.com/en-us/library/bb341731.aspx would work too; just check that the result of the Union operation has the same length as both arrays (meaning no element is unique to only one array);
Although I'm not exactly sure how the functions handle duplicates.
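For what it's worth, both are set operations, so duplicates collapse, which makes the length test unreliable when arrays contain repeated items, e.g.:

var a = new[] { "cat", "cat", "mouse" };
var b = new[] { "cat", "cat", "mouse" };

// Intersect yields distinct elements only: { "cat", "mouse" }
Console.WriteLine(a.Intersect(b).Count()); // 2, not 3 - the length check fails even for identical arrays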
