Basically, I have the following so far:
class Foo : IEquatable<Foo> {
public override bool Equals(object obj)
{
Foo d = obj as Foo ;
if (d == null)
return false;
return this.Equals(d);
}
#region IEquatable<Foo> Members
public bool Equals(Foo other)
{
if (this.Guid != String.Empty && this.Guid == other.Guid)
return true;
else if (this.Guid != String.Empty || other.Guid != String.Empty)
return false;
if (this.Title == other.Title &&
this.PublishDate == other.PublishDate &&
this.Description == other.Description)
return true;
return false;
}
#endregion
}
So, the problem is this: I have a non-required field Guid, which is a unique identifier. If it isn't set, I need to fall back on less accurate metrics to decide whether two objects are equal. This works fine, but it makes GetHashCode() messy... How should I go about it? A naive implementation would be something like:
public override int GetHashCode() {
if (this.Guid != String.Empty)
return this.Guid.GetHashCode();
int hash = 37;
hash = hash * 23 + this.Title.GetHashCode();
hash = hash * 23 + this.PublishDate.GetHashCode();
hash = hash * 23 + this.Description.GetHashCode();
return hash;
}
But what are the chances of the two types of hash colliding? Certainly, I wouldn't expect it to be 1 in 2 ** 32. Is this a bad idea, and if so, how should I be doing it?
A very easy hash code method for custom classes is to bitwise XOR each of the fields' hash codes together. It can be as simple as this:
int hash = 0;
hash ^= this.Title.GetHashCode();
hash ^= this.PublishDate.GetHashCode();
hash ^= this.Description.GetHashCode();
return hash;
From the link above:
XOR has the following nice properties:
It does not depend on order of computation.
It does not “waste” bits. If you change even one bit in one of the components, the final value will change.
It is quick, a single cycle on even the most primitive computer.
It preserves uniform distribution. If the two pieces you combine are uniformly distributed so will the combination be. In other words, it does not tend to collapse the range of the digest into a narrower band.
XOR doesn't work well if you expect to have duplicate values in your fields as duplicate values will cancel each other out when XORed. Since you're hashing together three unrelated fields that should not be a problem in this case.
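To make the duplicate-value caveat concrete, here is a minimal sketch that compares XOR with the multiply-and-add combination used in the question. The names HashCombiners, Xor and MultiplyAdd are made up for this illustration, and the constants 17 and 23 are only conventional choices.

```csharp
using System;

public static class HashCombiners
{
    // XOR all component hashes together: the result does not depend on
    // order, and two equal components cancel each other out.
    public static int Xor(params int[] hashes)
    {
        int result = 0;
        foreach (int h in hashes)
            result ^= h;
        return result;
    }

    // Multiply-and-add: the result depends on order, and equal
    // components do not cancel.
    public static int MultiplyAdd(params int[] hashes)
    {
        unchecked
        {
            int result = 17;
            foreach (int h in hashes)
                result = result * 23 + h;
            return result;
        }
    }
}
```

With these helpers, Xor(5, 5) is 0 whatever the repeated value is, and Xor(1, 2) equals Xor(2, 1), whereas MultiplyAdd(1, 2) and MultiplyAdd(2, 1) give 9018 and 9040.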
I don't think there is a problem with the approach you have chosen to use. Worrying 'too much' about hash collisions is almost always an indication of over-thinking the problem; as long as the hash is highly likely to be different you should be fine.
Ultimately you may even want to consider leaving out the Description from your hash anyway if it is reasonable to expect that most of the time objects can be distinguished based on their title and publication date (books?).
You could even consider disregarding the GUID in your hash function altogether, and only use it in the Equals implementation to disambiguate the unlikely(?) case of hash clashes.
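For reference, here is a sketch of the question's class with Equals and GetHashCode side by side, assuming the four members are plain properties and that an unset Guid is the empty string (both are assumptions, since the question does not show the member declarations). Under this Equals, two objects can only be equal if they share a non-empty Guid or if both lack a Guid and agree on the other fields, so the "Guid hash if present, field hash otherwise" approach never gives equal objects different hash codes. A Guid-based hash that happens to match a field-based hash is just an ordinary collision.

```csharp
using System;

public sealed class Foo : IEquatable<Foo>
{
    public string Guid { get; set; } = string.Empty;   // optional identifier
    public string Title { get; set; } = string.Empty;
    public DateTime PublishDate { get; set; }
    public string Description { get; set; } = string.Empty;

    public bool Equals(Foo other)
    {
        if (other is null) return false;
        bool thisHasGuid = Guid != string.Empty;
        bool otherHasGuid = other.Guid != string.Empty;
        // As soon as either side has a Guid, only the Guid decides.
        if (thisHasGuid || otherHasGuid)
            return thisHasGuid && Guid == other.Guid;
        return Title == other.Title
            && PublishDate == other.PublishDate
            && Description == other.Description;
    }

    public override bool Equals(object obj) => Equals(obj as Foo);

    public override int GetHashCode()
    {
        if (Guid != string.Empty)
            return Guid.GetHashCode();
        unchecked // overflow is harmless when mixing hash codes
        {
            int hash = 37;
            hash = hash * 23 + Title.GetHashCode();
            hash = hash * 23 + PublishDate.GetHashCode();
            hash = hash * 23 + Description.GetHashCode();
            return hash;
        }
    }
}
```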
Code (from an interactive shell):
> var a = new Dictionary<float, string>();
> a.Add(float.NaN, "it is NaN");
> a[float.NaN]
"it is NaN"
So it is possible, but is it safe?
Paraphrasing from https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/Single.cs:
public const float NaN = (float)0.0 / (float)0.0;
public static unsafe bool IsNaN(float f) => f != f;
public int CompareTo(object? value){
...
if (m_value < f) return -1;
if (m_value > f) return 1;
if (m_value == f) return 0;
if (IsNaN(m_value))
return IsNaN(f) ? 0 : -1;
else // f is NaN.
return 1;
}
public bool Equals(float obj)
{
if (obj == m_value)
{
return true;
}
return IsNaN(obj) && IsNaN(m_value);
}
public override int GetHashCode()
{
int bits = Unsafe.As<float, int>(ref Unsafe.AsRef(in m_value));
// Optimized check for IsNan() || IsZero()
if (((bits - 1) & 0x7FFFFFFF) >= 0x7F800000)
{
// Ensure that all NaNs and both zeros have the same hash code
bits &= 0x7F800000;
}
return bits;
}
You can see that NaN requires special handling in each of these cases. IEEE 754 allows many different bit patterns to represent NaN, and it defines comparisons so that a NaN is unequal to everything, even to a NaN with an identical bit pattern.
However, you can also see that both GetHashCode() and Equals() treat any two NaNs as equivalent. So I believe that using NaN as a dictionary key should be fine.
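The difference between the == operator and Equals() can be checked directly. The sketch below uses a hypothetical helper class, NaNKeyDemo; it relies only on the fact that Dictionary<float, TValue> without an explicit comparer uses the default equality comparer, which ends up in float.Equals and float.GetHashCode.

```csharp
using System;
using System.Collections.Generic;

public static class NaNKeyDemo
{
    // The == operator follows IEEE 754: NaN compares unequal to everything.
    public static bool OperatorEquals(float a, float b) => a == b;

    // float.Equals deliberately treats two NaNs as equal.
    public static bool MethodEquals(float a, float b) => a.Equals(b);

    // Store a value under NaN and read it back through the default comparer.
    public static string Lookup()
    {
        var d = new Dictionary<float, string>();
        d.Add(float.NaN, "it is NaN");
        return d[float.NaN];
    }
}
```

OperatorEquals(float.NaN, float.NaN) is false while MethodEquals(float.NaN, float.NaN) is true, which is why the lookup succeeds.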
That depends on what you mean by safe.
If you expect people to be able to use the dictionary and compare its keys to other floats, they will have to deal with a key value of NaN correctly themselves. And since float.NaN == float.NaN evaluates to false, that may cause issues down the line.
However, the Dictionary succeeds in performing the lookup and other operations work correctly as well.
The question here is really why you need it in the first place?
It's a bad idea to use a float as a Dictionary key.
In theory you can do it. But when you work with float, double or decimal values you should compare two values using some epsilon, with a formula like this:
abs(a1 - a2) < Epsilon
This is needed because floating-point operations round their results and because irrational numbers cannot be represented exactly. For example, how would you compare a value with PI or sqrt(2)?
So in this case, using a float as a dictionary key is a bad idea.
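A small sketch of the rounding problem, using double and a hypothetical helper class FloatKeyDemo (the class name, the dictionary contents and the tolerance of 1e-9 are all assumptions made for the example). A dictionary compares keys exactly, so a key produced by arithmetic can miss an entry stored under the literal it is "supposed" to equal.

```csharp
using System;
using System.Collections.Generic;

public static class FloatKeyDemo
{
    // Tolerance-based comparison, i.e. abs(a1 - a2) < Epsilon.
    public static bool NearlyEqual(double a, double b, double epsilon)
        => Math.Abs(a - b) < epsilon;

    public static double Sum(double a, double b) => a + b;

    // Looks up a computed key in a dictionary keyed by the literal 0.3.
    public static bool LookupSucceeds()
    {
        var names = new Dictionary<double, string> { [0.3] = "three tenths" };
        return names.ContainsKey(Sum(0.1, 0.2));
    }
}
```

With IEEE 754 doubles, 0.1 + 0.2 is slightly larger than 0.3, so LookupSucceeds() returns false even though NearlyEqual(Sum(0.1, 0.2), 0.3, 1e-9) returns true.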
I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.
When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).
Here's an example demonstrating my problem:
Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());
Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());
Output:
Int32
43124074
51601393
String
406954194
406954194
As you can see, the arrays return a different hash.
Is there any way of getting the performance of the long array hash, but the uniqueness of the string hash?
See my own answer below for a performance comparison of all the suggestions.
About the potential duplicate -- that question has a lot of useful information, but as this question was primarily about finding high performance alternatives, I think it still provides some useful solutions that are not mentioned there.
That the first pair differs is actually good. Arrays are a reference type, and their default hash code is based on the reference (object identity), not on the contents. I would guess it is derived from something like the pointer used at machine-code level, or some garbage-collector-level value. It is one of the things you have no influence on, but it stays the same if you assign the same instance to a new reference variable.
In the second case you get the hash of a string. string.Join(",", new[] { 1, 2, 3, 0 }) resolves to the generic IEnumerable<T> overload and produces "1,2,3,0", so both calls build strings with identical contents. And since string hashes its contents ("compares like a value type"), the hash is the same in both cases.
Another alternative is to leverage the lesser known IEqualityComparer to implement your own hash and equality comparisons. There are some notes you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it'll introduce instability should the keys ever change, but it would certainly be more performant than using string joins.
public class ArrayKeyComparer : IEqualityComparer<int[]>
{
public bool Equals(int[] x, int[] y)
{
return x == null || y == null
? x == null && y == null
: x.SequenceEqual(y);
}
public int GetHashCode(int[] obj)
{
var seed = 0;
if(obj != null)
foreach (int i in obj)
seed ^= i.GetHashCode();
return seed;
}
}
Note that this still may not be as performant as a tuple, since it's still iterating the array rather than being able to take a more constant expression.
Your strings are returning the same hash codes for the same strings correctly because string.GetHashCode() is implemented that way.
The implementation of int[].GetHashCode() does something with its memory address to return the hash code, so arrays with identical contents will nevertheless return different hash codes.
So that's why your arrays with identical contents are returning different hash codes.
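The distinction between identity and contents can be seen with two tiny helpers. ArrayEqualityDemo is a hypothetical name; the only library call involved is Enumerable.SequenceEqual from System.Linq.

```csharp
using System;
using System.Linq;

public static class ArrayEqualityDemo
{
    // int[] does not override Equals, so this is reference equality.
    public static bool SameInstance(int[] a, int[] b) => a.Equals(b);

    // Element-by-element comparison of the contents.
    public static bool SameContents(int[] a, int[] b) => a.SequenceEqual(b);
}
```

For two separately allocated arrays holding 1, 2, 3, 0, SameInstance is false and SameContents is true. A Dictionary keyed on int[] without a custom comparer uses the first notion, together with the identity-based hash code.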
Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.
The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).
Fortunately you can cache the hash code so it's only computed once.
Another major problem with using a mutable array as a hash key is that if you change the contents of the array after using it as the key in a hashing container such as Dictionary, you will break the container.
Ideally you would only use this kind of hashing for arrays that are never changed.
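Here is a minimal sketch of how mutation breaks the container. ContentComparer is a hypothetical content-based comparer written only for this demonstration; it uses the same multiply-and-add hashing idea as elsewhere in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that hashes and compares arrays by content.
public sealed class ContentComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        unchecked
        {
            int hash = 17;
            foreach (int i in obj)
                hash = hash * 23 + i;
            return hash;
        }
    }
}

public static class MutatedKeyDemo
{
    // Returns true if the entry can still be found after its key changed.
    public static bool EntrySurvivesMutation()
    {
        var key = new[] { 1, 2, 3 };
        var dict = new Dictionary<int[], string>(new ContentComparer()) { [key] = "value" };

        key[0] = 99; // the entry was filed under the hash of { 1, 2, 3 }

        // The mutated key now hashes differently, and a fresh { 1, 2, 3 }
        // hashes to the old value but no longer equals the stored key.
        return dict.ContainsKey(key) || dict.ContainsKey(new[] { 1, 2, 3 });
    }
}
```

EntrySurvivesMutation() returns false: the entry is still in the dictionary but can no longer be reached through either form of the key.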
Bearing all that in mind, a simple wrapper would look like this:
public sealed class IntArrayKey
{
public IntArrayKey(int[] array)
{
Array = array;
_hashCode = hashCode();
}
public int[] Array { get; }
public override int GetHashCode()
{
return _hashCode;
}
int hashCode()
{
int result = 17;
unchecked
{
foreach (var i in Array)
{
result = result * 23 + i;
}
}
return result;
}
readonly int _hashCode;
}
You can use that in place of the actual arrays for more sensible hash code generation.
As per the comments below, here's a version of the class that:
Makes a defensive copy of the array so that it cannot be modified.
Implements equality operators.
Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.
Code:
public sealed class IntArrayKey: IEquatable<IntArrayKey>
{
public IntArrayKey(IEnumerable<int> sequence)
{
_array = sequence.ToArray();
_hashCode = hashCode();
Array = new ReadOnlyCollection<int>(_array);
}
public bool Equals(IntArrayKey other)
{
if (other is null)
return false;
if (ReferenceEquals(this, other))
return true;
return _hashCode == other._hashCode && equals(other.Array);
}
public override bool Equals(object obj)
{
return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
}
public static bool operator == (IntArrayKey left, IntArrayKey right)
{
return Equals(left, right);
}
public static bool operator != (IntArrayKey left, IntArrayKey right)
{
return !Equals(left, right);
}
public IReadOnlyList<int> Array { get; }
public override int GetHashCode()
{
return _hashCode;
}
bool equals(IReadOnlyList<int> other) // other cannot be null.
{
if (_array.Length != other.Count)
return false;
for (int i = 0; i < _array.Length; ++i)
if (_array[i] != other[i])
return false;
return true;
}
int hashCode()
{
int result = 17;
unchecked
{
foreach (var i in _array)
{
result = result * 23 + i;
}
}
return result;
}
readonly int _hashCode;
readonly int[] _array;
}
If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:
public IntArrayKey(int[] array)
{
_array = array;
_hashCode = hashCode();
Array = new ReadOnlyCollection<int>(_array);
}
If you know the length of the arrays you're using, you could use a Tuple.
Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Outputs
Tuple
1248
1248
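The same fixed-length idea also works with ValueTuple, which compares and hashes by its elements. The sketch below assumes C# 7.0 or later and arrays of exactly four longs; ValueTupleKeyDemo and CountGroups are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ValueTupleKeyDemo
{
    // Counts how often each 4-element combination occurs in the input.
    public static Dictionary<(long, long, long, long), int> CountGroups(IEnumerable<long[]> rows)
    {
        var counts = new Dictionary<(long, long, long, long), int>();
        foreach (long[] r in rows)
        {
            var key = (r[0], r[1], r[2], r[3]);
            counts.TryGetValue(key, out int n); // n is 0 when the key is new
            counts[key] = n + 1;
        }
        return counts;
    }
}
```

Because ValueTuple is a struct, building the key does not allocate on the heap, unlike Tuple.Create.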
I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.
The suggestions are as follows:
int[] as key (original attempt -- does not work at all, included as a benchmark)
string as key (original solution -- works, but slow)
Tuple as key (suggested by David)
ValueTuple as key (inspired by the Tuple)
Direct int[] hash as key
IntArrayKey (suggested by Matthew Watson)
int[] as key with Skeet's IEqualityComparer
int[] as key with David's IEqualityComparer
I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).
For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.
The results are as follows (ordered from best to worst):
Algorithm                 Works?   Time usage
NonGenericSkeetEquality   YES      392 ms
SkeetEquality             YES      422 ms
ValueTuple                YES      521 ms
QuickIntArrayKey          YES      747 ms
IntArrayKey               YES      972 ms
Tuple                     YES      1 609 ms
string                    YES      2 291 ms
DavidEquality             YES      1 139 200 ms ***
int[]                     NO       336 ms
IntHash                   NO       386 ms
The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.
** I'm aware that this is not a completely foolproof solution, as I could theoretically get the expected number of collisions without it actually being the collisions I expected, but having run the test a lot of times, I'm fairly certain I don't.
*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.
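The comparer that won the benchmark is not reproduced in this thread. As a rough idea of what an array comparer of that kind can look like, here is a sketch for long[] keys using a multiply-and-add hash. LongArrayComparer and GroupCounter are hypothetical names and this is not the code that was measured.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical content-based comparer for long[] keys.
public sealed class LongArrayComparer : IEqualityComparer<long[]>
{
    public bool Equals(long[] x, long[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    public int GetHashCode(long[] obj)
    {
        if (obj is null) return 0;
        unchecked
        {
            int hash = 17;
            foreach (long value in obj)
                hash = hash * 31 + value.GetHashCode();
            return hash;
        }
    }
}

public static class GroupCounter
{
    // Mirrors the test described above: add each key, or increment its count.
    public static Dictionary<long[], int> Count(IEnumerable<long[]> arrays)
    {
        var counts = new Dictionary<long[], int>(new LongArrayComparer());
        foreach (long[] a in arrays)
        {
            counts.TryGetValue(a, out int n); // n is 0 when the key is new
            counts[a] = n + 1;
        }
        return counts;
    }
}
```

The arrays themselves stay the dictionary keys, so no wrapper objects or strings are allocated. The arrays must not be modified while they are in the dictionary.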
I have two lists that I am trying to compare. So I have created a class that implements the IEqualityComparer interface, please see below in the bottom section of code.
When I step through my code, execution goes through my GetHashCode implementation but not through Equals. I do not really understand the GetHashCode method and what exactly it is doing, despite reading around on the internet.
List<FactorPayoffs> missingfactorPayoffList =
factorPayoffList.Except(
factorPayoffListOrg,
new FactorPayoffs.Comparer()).ToList();
List<FactorPayoffs> missingfactorPayoffListOrg =
factorPayoffListOrg.Except(
factorPayoffList,
new FactorPayoffs.Comparer()).ToList();
So the two statements above each return every item, telling me that the two lists do not contain any items that are the same. This is not true; there is only one row that is different. I'm guessing this is happening because the Equals method is not getting called, which in turn makes me wonder whether my GetHashCode method is working as it's supposed to.
class FactorPayoffs
{
public string FactorGroup { get; set; }
public string Factor { get; set; }
public DateTime dtPrice { get; set; }
public DateTime dtPrice_e { get; set; }
public double Ret_USD { get; set; }
public class Comparer : IEqualityComparer<FactorPayoffs>
{
public bool Equals(FactorPayoffs x, FactorPayoffs y)
{
return x.dtPrice == y.dtPrice &&
x.dtPrice_e == y.dtPrice_e &&
x.Factor == y.Factor &&
x.FactorGroup == y.FactorGroup;
}
public int GetHashCode(FactorPayoffs obj)
{
int hash = 17;
hash = hash * 23 + (obj.dtPrice).GetHashCode();
hash = hash * 23 + (obj.dtPrice_e).GetHashCode();
hash = hash * 23 + (obj.Factor ?? "").GetHashCode();
hash = hash * 23 + (obj.FactorGroup ?? "").GetHashCode();
hash = hash * 23 + (obj.Ret_USD).GetHashCode();
return hash;
}
}
}
Your Equals and GetHashCode implementations should involve the exact same set of properties; they do not.
In more formal terms, GetHashCode must always return the same value for two objects that compare equal. With your current code, two objects that differ only in the Ret_USD value will always compare equal but are not guaranteed to have the same hash code.
So what happens is that LINQ calls GetHashCode on two objects you consider equal, gets back different values, concludes that since the values were different the objects cannot be equal so there's no point at all in calling Equals and moves on.
To fix the problem, either remove the Ret_USD factor from GetHashCode or introduce it also inside Equals (whatever makes sense for your semantics of equality).
GetHashCode is intended as a fast but rough estimate of equality, so many operations potentially involving large numbers of comparisons start by checking this result instead of Equals, and only use Equals when necessary. In particular, if x.GetHashCode()!=y.GetHashCode(), then we already know x.Equals(y) is false, so there is no reason to call Equals. Had x.GetHashCode()==y.GetHashCode(), then x might equal y, but only a call to Equals will give a definite answer.
If you implement GetHashCode in a way that causes GetHashCode to be different for two objects where Equals returns true, then you have a bug in your code and many collection classes and algorithms relying on these methods will silently fail.
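The effect is easy to reproduce on a reduced example. In the sketch below, Payoff is a hypothetical two-property stand-in for the class in the question, and the two comparers differ only in whether GetHashCode includes a property that Equals ignores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Payoff
{
    public string Factor { get; set; }
    public double RetUsd { get; set; }
}

// Equals ignores RetUsd and so does GetHashCode: consistent.
public sealed class ConsistentComparer : IEqualityComparer<Payoff>
{
    public bool Equals(Payoff x, Payoff y) => x.Factor == y.Factor;
    public int GetHashCode(Payoff p) => (p.Factor ?? "").GetHashCode();
}

// Equals ignores RetUsd but GetHashCode includes it: inconsistent.
public sealed class InconsistentComparer : IEqualityComparer<Payoff>
{
    public bool Equals(Payoff x, Payoff y) => x.Factor == y.Factor;
    public int GetHashCode(Payoff p)
    {
        unchecked
        {
            return (p.Factor ?? "").GetHashCode() * 23 + p.RetUsd.GetHashCode();
        }
    }
}

public static class ExceptDemo
{
    // Number of items Except reports as missing from the original list.
    public static int MissingCount(IEqualityComparer<Payoff> comparer)
    {
        var current = new List<Payoff>
        {
            new Payoff { Factor = "A", RetUsd = 1.0 },
            new Payoff { Factor = "B", RetUsd = 2.0 },
        };
        var original = new List<Payoff>
        {
            new Payoff { Factor = "A", RetUsd = 9.0 },
        };
        return current.Except(original, comparer).Count();
    }
}
```

MissingCount(new ConsistentComparer()) is 1 (only "B" is missing). With InconsistentComparer it is 2: the two "A" items get different hash codes, so Equals is never consulted for them.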
If you want to force the execution of Equals, you can implement GetHashCode as follows (every item then lands in the same bucket, so lookups degrade to linear scans):
public int GetHashCode(FactorPayoffs obj) {
return 1;
}
Rewrite your GetHashCode implementation like this, to match the semantics of your Equals implementation.
public int GetHashCode(FactorPayoffs obj)
{
unchecked
{
int hash = 17;
hash = hash * 23 + obj.dtPrice.GetHashCode();
hash = hash * 23 + obj.dtPrice_e.GetHashCode();
if (obj.Factor != null)
{
hash = hash * 23 + obj.Factor.GetHashCode();
}
if (obj.FactorGroup != null)
{
hash = hash * 23 + obj.FactorGroup.GetHashCode();
}
return hash;
}
}
Note that you should use unchecked because you don't care about overflows. Additionally, coalescing to string.Empty is pointlessly wasteful; just exclude null values from the hash.
See here for the best generic answer I know.
It seems that this problem has already been encountered by quite a few people:
List not working as expected
Contains always giving false
So I read the answers and tried to override Equals and GetHashCode, but it seems that I am coding something wrong.
This is the situation: I have a list of Users (a class); each user has a List property and a Name property, and the list property contains licenses. I am trying to do this:
if (!users.Contains(currentUser))
but it is not working as expected. And this is the code I did to override the Equals and GetHashCode:
public override bool Equals(object obj)
{
return Equals(obj as User);
}
public bool Equals(User otherUser)
{
if (ReferenceEquals(otherUser, null))
return false;
if (ReferenceEquals(this, otherUser))
return true;
return this._userName.Equals(otherUser.UserName) &&
this._licenses.SequenceEqual<string>(otherUser.Licenses);
}
public override int GetHashCode()
{
int hash = 13;
if (!_licenses.Any() && !_userName.Equals(""))
{
unchecked
{
foreach (string str in Licenses)
{
hash *= 7;
if (str != null) hash = hash + str.GetHashCode();
}
hash = (hash * 7) + _userName.GetHashCode();
}
}
return hash;
}
thank you for your suggestions and help in advance!
EDIT 1:
This is the code where I call List.Contains. I am trying to see whether the list already contains a certain user and, if not, add that user. Contains only works the first time; when currentUser changes, the User inside the list changes to the current user. Maybe this is a problem unrelated to Equals. Any ideas?
if (isIn)
{
if (!listOfLicenses.Contains(items[3]))
listOfLicenses.Add(items[3]);
if (!users.Contains(currentUser))
{
User user2Add = new User();
user2Add.UserName = currentUser.UserName;
users.Add(user2Add);
userIndexer++;
}
if (users[userIndexer - 1].UserName.Equals(currentUser.UserName))
{
users[userIndexer - 1].Licenses.Add(items[3]);
}
result.Rows.Add();
}
Well, one problem with your hash code - if either there are no licences or the username is empty, you're ignoring the other component. I'd rewrite it as:
public override int GetHashCode()
{
unchecked
{
int hash = 17;
hash = hash * 31 + _userName.GetHashCode();
foreach (string licence in Licenses)
{
hash = hash * 31 + licence.GetHashCode();
}
return hash;
}
}
Shorter and simpler. It doesn't matter if you use the hash code of the empty string, or if you iterate over an empty collection.
That said, I'd have expected the previous code to work anyway. Note that it's order sensitive for the licences... oh, and List<T> won't use GetHashCode anyway. (You should absolutely override it appropriately, but it won't be the cause of the error.)
It would really help if you could show a short but complete program demonstrating the problem - I strongly suspect that you'll find it's actually a problem with your test data.
After users[userIndexer - 1].Licenses.Add(items[3]) , users[userIndexer - 1] is not the same user anymore. You have changed the Licences which is used in equality comparison(in User.Equals).
--EDIT
See below code
public class Class
{
static void Main(string[] args)
{
User u1 = new User("1");
User u2 = new User("1");
Console.WriteLine(u1.Equals(u2));
u2.Lic = "2";
Console.WriteLine(u1.Equals(u2));
}
}
public class User
{
public string Lic;
public User(string lic)
{
this.Lic = lic;
}
public override bool Equals(object obj)
{
return (obj as User).Lic == Lic;
}
}
You need to implement Equals and GetHashCode for the License class, otherwise SequenceEqual will not work.
Does your class implement IEquatable<User>? From your equality methods it appears it does but just checking.
The documentation for List.Contains states that:
This method determines equality by using the default equality
comparer, as defined by the object's implementation of the
IEquatable(T).Equals method for T (the type of values in the list)
It is very important to make sure that the value returned by GetHashCode never ever ever changes for a specific instance of an object. If the value changes then lists and dictionaries won't work correctly.
Think of GetHashCode as "GetPrimaryKey". You would not change the primary key of a user record in a database if someone added a new license to the user. Likewise you mustn't change the GetHashCode.
It appears from your code that you are changing the licenses collection and you're using that to calculate your hash code. So that is probably causing your issue.
Now, it is perfectly legitimate to use a constant value for every hash code you produce - you could just return 42 for every instance, for example. This will force calling Equals to determine if two objects are equal or not. All that having distinct hash codes does is short circuits the need to call Equals.
If the _userName field doesn't change, then just return its hash code and see if that works.
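A sketch of that suggestion, assuming the user name is fixed at construction time (the constructor and the read-only UserName property are assumptions, as the question does not show the full class). Equals still compares the licenses, but the hash code depends only on the part that never changes, so it stays stable while licenses are added. Equal users necessarily share a name, so they still get equal hash codes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class User : IEquatable<User>
{
    public User(string userName)
    {
        UserName = userName ?? throw new ArgumentNullException(nameof(userName));
    }

    public string UserName { get; }                        // never changes
    public List<string> Licenses { get; } = new List<string>();

    public bool Equals(User other)
    {
        if (other is null) return false;
        if (ReferenceEquals(this, other)) return true;
        return UserName == other.UserName
            && Licenses.SequenceEqual(other.Licenses);
    }

    public override bool Equals(object obj) => Equals(obj as User);

    // Only the immutable name feeds the hash, so adding a license
    // never moves the object to a different bucket.
    public override int GetHashCode() => UserName.GetHashCode();
}
```

A HashSet<User> or Dictionary keyed on User keeps finding an instance after Licenses.Add is called on it. Two different instances with the same name still stop being equal once their license lists diverge.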
I am using the following query
var queryList1Only = (from file in list1
select file).Except(list2, myFileCompare);
where myFileCompare compares two files based on their name and length.
The query returned results while list1 and list2 were small (say 100 files each, while I was testing). Then I increased list1 to 30,000 files and list2 to 20,000 files, and the debugger now says "Function Evaluation Timed Out".
I searched online and found that debugging could cause this, so I removed all the breakpoints and ran the code. Now the program just freezes, without producing any output for queryList1Only, which I am trying to print in order to check it.
EDIT:
This is the code for myFileCompare
class FileCompare : System.Collections.Generic.IEqualityComparer<System.IO.FileInfo>
{
public FileCompare() { }
public bool Equals(System.IO.FileInfo f1, System.IO.FileInfo f2)
{
return (f1.Name == f2.Name && f1.Directory.Name == f2.Directory.Name &&
f1.Length == f2.Length);
}
// Return a hash that reflects the comparison criteria. According to the
// rules for IEqualityComparer<T>, if Equals is true, then the hash codes must
// also be equal. Because equality as defined here is a simple value equality, not
// reference identity, it is possible that two or more objects will produce the same
// hash code.
public int GetHashCode(System.IO.FileInfo fi)
{
string s = String.Format("{0}{1}", fi.Name, fi.Length);
return s.GetHashCode();
}
}
What do you need to do with the items returned by the query?
Generally, such heavy operations are best executed in a separate thread, to avoid the situation you've just faced.
EDIT: An idea
As one option, you can try the following algorithm:
Sort the items in both lists (List<T>.Sort() uses a QuickSort-based algorithm by default; note that sorting relies on a comparison such as IComparer<T>, not on GetHashCode()).
Then, in a plain for() loop, traverse the lists and compare the elements at the same index.
When the index reaches the end of the shorter list, select all remaining items of the longer list as different (they do not exist in the shorter list at all).
I believe that with sorted lists you'll get much better performance. Except() is hash-based, so it runs in roughly O(m + n) with a well-distributed hash code, but it can degrade toward O(m*n) when many hash codes collide.
EDIT: Another idea, which should be really fast
Store the items from one server in a HashSet<T>.
Then loop through the second list and search within the HashSet<T>; this should be VERY fast. It is basically O(m) to build the set plus O(n) for the lookups, because you traverse each list only once, and searching within a set that has a good hash function (use the GetHashCode() I've provided below with the updated logic) is very quick. Try it out!
// Returns the files in secondServerFiles that are missing from firstServerFiles.
public static IEnumerable<FileInfo> GetDifference(
    IEnumerable<FileInfo> firstServerFiles,
    IEnumerable<FileInfo> secondServerFiles,
    IEqualityComparer<FileInfo> comparer)
{
    // Build the set once; each Contains call is then O(1) on average.
    var firstServerFilesSet = new HashSet<FileInfo>(firstServerFiles, comparer);
    foreach (var secondServerFile in secondServerFiles)
    {
        if (!firstServerFilesSet.Contains(secondServerFile))
        {
            yield return secondServerFile;
        }
    }
}
EDIT: More details regarding equality logic were provided in comments
Try out this implementation:
public bool Equals(System.IO.FileInfo f1, System.IO.FileInfo f2)
{
if ( f1 == null || f2 == null)
{
return false;
}
return (f1.Name == f2.Name && f1.Directory.Name == f2.Directory.Name &&
f1.Length == f2.Length);
}
public int GetHashCode(System.IO.FileInfo fi)
{
unchecked
{
int hash = 17;
hash = hash * 23 + fi.Name.GetHashCode();
hash = hash * 23 + fi.Directory.Name.GetHashCode();
hash = hash * 23 + fi.Length.GetHashCode();
return hash;
}
}
Useful links:
GetHashCode Guidelines in C#
What is the best algorithm for an overridden System.Object.GetHashCode?
I haven't tried this myself, but here is an idea:
Implement list1 as HashSet, this way:
HashSet<FileInfo> List1 = new HashSet<FileInfo>(myFileCompare);
Add all files:
foreach(var file in files)
{
List1.Add(file);
}
Then remove elements:
List1.ExceptWith(list2);
Then enumerate:
foreach(var file in List1)
{
//do something
}
I think it's faster, but as I said, I haven't tried it. Here is a link with general information on HashSet.
Edit:
Or better yet, you can initialize and add data in one step:
HashSet<FileInfo> List1 = new HashSet<FileInfo>(files, myFileCompare);
I'd recommend removing the length from the hash code and just hashing fi.FullName. That still satisfies the uniqueness guideline, though there may be hash collisions in some cases (those where you think length is needed to distinguish files). But that is probably preferable to a longer Except execution. Similarly, changing your equality comparison from name and directory to FullName would probably be more performant as well.
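A sketch of a comparer along those lines. FullNameComparer is a hypothetical name, and the choice of ordinal (case-sensitive) string comparison is an assumption; on a case-insensitive file system StringComparison.OrdinalIgnoreCase may be the better fit. It only helps if the two lists are expected to contain the same full paths.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical comparer: two FileInfo objects are equal when their
// full paths match. It does not read file metadata such as Length.
public sealed class FullNameComparer : IEqualityComparer<FileInfo>
{
    public bool Equals(FileInfo x, FileInfo y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(x.FullName, y.FullName, StringComparison.Ordinal);
    }

    public int GetHashCode(FileInfo fi)
        => fi is null ? 0 : StringComparer.Ordinal.GetHashCode(fi.FullName);
}
```

It can be passed to Except or to the HashSet<FileInfo> constructor in place of myFileCompare.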