I believe this question applies equally to C# and Java, because both require that {c,C}ompareTo be consistent with {e,E}quals:
Suppose I want my equals() method to be the same as a reference check, i.e.:
public boolean equals(Object o) {
    return this == o;
}
In that case, how do I implement compareTo(Object o) (or its generic equivalent)? Part of it is easy, but I'm not sure about the other part:
public int compareTo(Object o) {
    MyClass other = (MyClass)o;
    if (this == other) {
        return 0;
    } else {
        int c = foo.compareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
I can't just blindly return 1 or -1, because the solution should adhere to the normal requirements of compareTo. I can check all the instance fields, but if they are all equal, I'd still like compareTo to return a value other than 0. It should be true that a.compareTo(b) == -(b.compareTo(a)), and the ordering should stay consistent as long as the objects' state doesn't change.
I don't care about ordering across invocations of the virtual machine, however. This makes me think that I could use something like memory address, if I could get at it. Then again, maybe that won't work, because the Garbage Collector could decide to move my objects around.
hashCode is another idea, but I'd like something that will always be unique, not just mostly unique.
Any ideas?
First of all, if you are using Java 5 or above, you should implement Comparable<MyClass> rather than the plain old Comparable; your compareTo method should then take a parameter of type MyClass, not Object:
public int compareTo(MyClass other) {
    if (this == other) {
        return 0;
    } else {
        int c = foo.compareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
As for your question, Josh Bloch in Effective Java (Chapter 3, Item 12) says:
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y. (This implies that x.compareTo(y) must throw an exception if and only if y.compareTo(x) throws an exception.)
This means that if c == 0 in the above code, you must return 0.
That in turn means that you can have objects A and B, which are not equal, but their comparison returns 0. What does Mr. Bloch have to say about this?
It is strongly recommended, but not strictly required, that (x.compareTo(y) == 0) == (x.equals(y)). Generally speaking, any class that implements the Comparable interface and violates this condition should clearly indicate this fact. The recommended language is “Note: This class has a natural ordering that is inconsistent with equals.”
And
A class whose compareTo method imposes an order that is inconsistent with equals will still work, but sorted collections containing elements of the class may not obey the general contract of the appropriate collection interfaces (Collection, Set, or Map). This is because the general contracts for these interfaces are defined in terms of the equals method, but sorted collections use the equality test imposed by compareTo in place of equals. It is not a catastrophe if this happens, but it’s something to be aware of.
Update: So IMHO with your current class, you cannot make compareTo consistent with equals. If you really need this, the only way I can see is to introduce a new member which gives a strict natural ordering to your class. Then, in case all the meaningful fields of the two objects compare to 0, you could still decide the order of the two based on their special order values.
This extra member may be an instance counter, or a creation timestamp. Or, you could try using a UUID.
In Java or C#, generally speaking, there is no fixed ordering of objects. Instances can be moved around by the garbage collector while executing your compareTo, or the sort operation that's using your compareTo.
As you stated, hash codes are generally not unique, so they're not usable (two different instances with the same hash code bring you back to the original question). And the Java Object.toString implementation, which many people believe surfaces an object id (MyObject#33c0d9d), is nothing more than the object's class name followed by the hash code. As far as I know, neither the JVM nor the CLR has a notion of an instance id.
If you really want a consistent ordering of your classes, you could try using an incrementing number for each new instance you create. Mind you, incrementing this counter must be thread safe, so it's going to be relatively expensive (in C# you could use Interlocked.Increment).
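To make the counter idea concrete, here is a minimal C# sketch; the class name, the foo field, and its type are assumptions for illustration, not part of the original code:

using System;
using System.Threading;

public class MyClass : IComparable<MyClass>
{
    private static long instanceCounter;

    // Assigned once per instance; Interlocked.Increment keeps the counter thread safe.
    private readonly long instanceId = Interlocked.Increment(ref instanceCounter);

    private readonly int foo;

    public MyClass(int foo) { this.foo = foo; }

    public int CompareTo(MyClass other)
    {
        if (other == null) return 1;
        if (ReferenceEquals(this, other)) return 0;
        int c = foo.CompareTo(other.foo);
        // Tie-break on creation order so two distinct instances never compare as 0.
        return c != 0 ? c : instanceId.CompareTo(other.instanceId);
    }
}

With this tie-breaker, CompareTo returns 0 only when this == other, which keeps it consistent with the reference-based equals from the question.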
Two objects don't need to be reference equal to be in the same equivalence class. In my opinion, it should be perfectly acceptable for two different objects to be the same for a comparison, but not reference equal. It seems perfectly natural to me, for example, that if you hydrated two different objects from the same row in the database, that they would be the same for comparison purposes, but not reference equal.
I'd actually be more inclined to modify the behavior of equals to reflect how they are compared rather than the other way around. For most purposes that I can think of this would be more natural.
The generic equivalent is easier to deal with, in my opinion; it depends on what your external requirements are. This is an IComparable<MyClass> example:
public int CompareTo(MyClass other) {
    if (other == null) return 1;
    if (this == other) {
        return 0;
    } else {
        return foo.CompareTo(other.foo);
    }
}
If the instances are reference-equal or if foo compares equal, that's the end of the comparison, unless there's something secondary to sort on; in that case, return that secondary comparison when foo.CompareTo(other.foo) == 0.
If your classes have an ID or something similar, compare on that as the secondary key; otherwise don't worry about it. The collection they're stored in, and its order when arriving at these instances to compare, is what's going to determine the final order in the case of equal objects or equal object.foo values.
Why is it not good practice to compare two objects by serializing them and then comparing the strings, as in the following example?
public class Obj
{
    public int Prop1 { get; set; }
    public string Prop2 { get; set; }
}

public class Comparator<T> : IEqualityComparer<T>
{
    public bool Equals(T x, T y)
    {
        return JsonConvert.SerializeObject(x) == JsonConvert.SerializeObject(y);
    }

    public int GetHashCode(T obj)
    {
        return JsonConvert.SerializeObject(obj).GetHashCode();
    }
}
Obj o1 = new Obj { Prop1 = 1, Prop2 = "1" };
Obj o2 = new Obj { Prop1 = 1, Prop2 = "2" };
bool result = new Comparator<Obj>().Equals(o1, o2);
I have tested it and it works. It is generic, so it could handle a great diversity of objects. But what I am asking is: what are the downsides of this approach to comparing objects?
I have seen it suggested in this question, where it received some upvotes, but I can't figure out why it isn't considered the best way, if somebody just wants to compare the values of the properties of two objects.
EDIT: I am strictly talking about JSON serialization, not XML.
I am asking this because I want to create a simple and generic comparator for a unit test project, so the performance of the comparison does not bother me much, even though I know it may be one of the biggest downsides. Also, the type-less problem can, in the case of Newtonsoft.Json, be handled by setting the TypeNameHandling property to All.
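For reference, a minimal sketch of that setting (the API names are from Newtonsoft.Json; o1 and o2 are the objects from the example above):

using Newtonsoft.Json;

// Embedding type names in the output makes objects of different runtime
// types serialize to different strings.
var settings = new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.All };
bool equal = JsonConvert.SerializeObject(o1, settings)
          == JsonConvert.SerializeObject(o2, settings);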
The primary problem is that it is inefficient.
As an example, imagine this Equals function:
public bool Equals(T x, T y)
{
    return x.Prop1 == y.Prop1
        && x.Prop2 == y.Prop2
        && x.Prop3 == y.Prop3
        && x.Prop4 == y.Prop4
        && x.Prop5 == y.Prop5
        && x.Prop6 == y.Prop6;
}
If the Prop1 values are not the same, the other five comparisons never need to be checked. If you did this with JSON, you would have to convert both entire objects into JSON strings and then compare the strings every time, on top of serialization being an expensive task all on its own.
Then the next problem is that serialization is designed for communication, e.g. from memory to a file, across a network, etc. If you leverage serialization for comparison, you can degrade your ability to use it for its normal purpose: you can't ignore fields not required for transmission, because ignoring them might break your comparer.
Next, JSON specifically is type-less, which means that values that are in no way, shape, or form equal may be mistaken as equal because they serialize to the same string, and on the flip side, values that are equal may not compare as equal due to formatting differences. This is again unsafe and unstable.
The only upside to this technique is that it requires little effort from the programmer to implement.
You are probably going to keep adding a bounty to the question until somebody tells you that it is just fine to do this. So there you go: don't hesitate to take advantage of the Newtonsoft.Json library to keep the code simple. You just need some good arguments to defend your decision if your code is ever reviewed or if somebody else takes over the maintenance of the code.
Some of the objections they may raise, and their counter-arguments:
This is very inefficient code!
It certainly is; GetHashCode() in particular can make your code brutally slow if you ever use the object in a Dictionary or HashSet.
Best counter-argument is to note that efficiency is of little concern in a unit test. The most typical unit test takes longer to get started than to actually execute and whether it takes 1 millisecond or 1 second is not relevant. And a problem you are likely to discover very early.
You are unit-testing a library you did not write!
That is certainly a valid concern; you are in effect testing Newtonsoft.Json's ability to generate a consistent string representation of an object. There is cause to be alarmed about this; in particular, floating point values (float and double) are always a potential problem. There is also some evidence that the library author is unsure how to do it correctly.
Best counter-argument is that the library is widely used and well maintained, the author has released many updates over the years. Floating point consistency concerns can be reasoned away when you make sure that the exact same program with the exact same runtime environment generates both strings (i.e. don't store it) and you make sure the unit-test is built with optimization disabled.
You are not unit-testing the code that needs to be tested!
Yes, you would only write this code if the class itself provides no way to compare objects; in other words, it does not itself override Equals/GetHashCode and does not expose a comparator. So testing for equality in your unit test exercises a feature that the to-be-tested code does not actually support. That is something a unit test should never do; you can't write a bug report when the test fails.
Counter argument is to reason that you need to test for equality to test another feature of the class, like the constructor or property setters. A simple comment in the code is enough to document this.
By serializing your objects to JSON, you are basically converting all of your objects to another data type, and so everything that applies to your JSON library will have an impact on your results.
So if there is an attribute like [ScriptIgnore] on a member of one of the objects, your code will simply skip it, since it has been omitted from your data.
Also, the string results can be the same for objects that are not the same, like in this example:
static void Main(string[] args)
{
    Xb x1 = new X1()
    {
        y1 = 1,
        y2 = 2
    };

    Xb x2 = new X2()
    {
        y1 = 1,
        y2 = 2
    };

    bool result = new Comparator<Xb>().Equals(x1, x2);
}
class Xb
{
    public int y1 { get; set; }
}

class X1 : Xb
{
    public short y2 { get; set; }
}

class X2 : Xb
{
    public long y2 { get; set; }
}
So as you can see, x1 has a different type from x2, and even the data type of y2 differs between the two, but the JSON results will be the same. Other than that, since both x1 and x2 are of type Xb, I could call your comparator without any problems.
To recap, here is the comparer in question:
public class Comparator<T> : IEqualityComparer<T>
{
    public bool Equals(T x, T y)
    {
        return JsonConvert.SerializeObject(x) == JsonConvert.SerializeObject(y);
    }

    public int GetHashCode(T obj)
    {
        return JsonConvert.SerializeObject(obj).GetHashCode();
    }
}
Okay, now let's discuss the problems with this method.
First, it just won't work for types with circular references. If you have a property linkage as simple as A -> B -> A, it fails. Unfortunately, such cycles are very common in lists or maps that interlink. Worse, there is hardly an efficient generic cycle-detection mechanism.
Second, comparison via serialization is simply inefficient. JSON serialization needs reflection and a lot of type inspection before it can produce its result. Therefore, your comparer will become a serious bottleneck in any algorithm. Even at the scale of mere thousands of records, JSON serialization is usually considered slow.
Third, JSON has to go over every property.
It will become a disaster if your object links to any big object.
What if your object links to a big file?
As a result, C# simply leaves the implementation to the user. One has to know one's class thoroughly before creating a comparator. Comparison requires good cycle detection, early termination, and attention to efficiency. A generic solution simply does not exist.
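For contrast, here is a sketch of such a tailored comparer for the Obj class from the question: early termination, no reflection, no serialization. It is illustrative only, not a drop-in generic replacement.

using System.Collections.Generic;

public class ObjComparer : IEqualityComparer<Obj>
{
    public bool Equals(Obj x, Obj y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Stops at the first difference instead of serializing everything.
        return x.Prop1 == y.Prop1 && x.Prop2 == y.Prop2;
    }

    public int GetHashCode(Obj obj)
    {
        if (obj == null) return 0;
        unchecked
        {
            int h = obj.Prop1.GetHashCode();
            h = h * 31 + (obj.Prop2?.GetHashCode() ?? 0);
            return h;
        }
    }
}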
These are some of the downsides:
a) Performance will be increasingly bad the deeper your object tree is.
b) new Obj { Prop1 = 1 } Equals
   new Obj { Prop1 = "1" } Equals
   new Obj { Prop1 = 1.0 }
c) new Obj { Prop1 = 1.0, Prop2 = 2.0 } Not Equals
   new Obj { Prop2 = 2.0, Prop1 = 1.0 }
First, I notice that you say "serialize them and then compare the strings." In general, ordinary string comparison will not work for comparing XML or JSON strings; you have to be a little more sophisticated than that. As a counterexample to string comparison, consider the following XML strings:
<abc></abc>
<abc/>
They are clearly not string equal but they definitely "mean" the same thing. While this example might seem contrived, it turns out that there are quite a few cases where string comparison doesn't work. For example, whitespace and indentation are significant in string comparison but may not be significant in XML.
The situation isn't all that much better for JSON. You can do similar counterexamples for that.
{ abc : "def" }
{
abc : "def"
}
Again, clearly these mean the same thing, but they're not string-equal.
Essentially, if you're doing string comparison you're trusting the serializer to always serialize a particular object in exactly the same way (without any added whitespace, etc), which ends up being remarkably fragile, especially given that most libraries do not, to my knowledge, provide any such guarantee. This is particularly problematic if you update the serialization libraries at some point and there's a subtle difference in how they do the serialization; in this case, if you try to compare a saved object that was serialized with the previous version of the library with one that was serialized with the current version then it wouldn't work.
Also, just as a quick note on your code itself, the "==" operator is not the proper way to compare objects. In general, "==" tests for reference equality, not object equality.
One more quick digression on hash algorithms: how reliable they are as a means of equality testing depends on how collision resistant they are. In other words, given two different, non-equal objects, what's the probability that they'll hash to the same value? Conversely, if two objects hash to the same value, what are the odds that they're actually equal? A lot of people take it for granted that their hash algorithms are 100% collision resistant (i.e. two objects will hash to the same value if, and only if, they're equal) but this isn't necessarily true. (A particularly well-known example of this is the MD5 cryptographic hash function, whose relatively poor collision resistance has rendered it unsuitable for further use). For a properly-implemented hash function, in most cases the probability that two objects that hash to the same value are actually equal is sufficiently high to be suitable as a means of equality testing but it's not guaranteed.
Comparing objects by serializing them and then comparing the string representations is not effective in the following cases:
When a property of type DateTime exists in the types that need to be compared
public class Obj
{
    public DateTime Date { get; set; }
}

Obj o1 = new Obj { Date = DateTime.Now };
Obj o2 = new Obj { Date = DateTime.Now };

bool result = new Comparator<Obj>().Equals(o1, o2);
It will return false even for objects created very close together in time, unless they happen to share exactly the same Date value.
For objects that have double or decimal values which need to be compared with an epsilon, to check whether they are very close to each other:
public class Obj
{
    public double Double { get; set; }
}

Obj o1 = new Obj { Double = 22222222222222.22222222222 };
Obj o2 = new Obj { Double = 22222222222222.22222222221 };

bool result = new Comparator<Obj>().Equals(o1, o2);
This will also return false even though the double values are really close to each other. In programs that involve calculation, this becomes a real problem because of the loss of precision after multiple divide and multiply operations, and serialization does not offer the flexibility to handle these cases.
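A sketch of the tolerance-based check that a serialized string cannot express; the epsilon value here is an arbitrary assumption:

const double Epsilon = 1e-9;
// True when the two doubles differ by less than the chosen tolerance.
bool nearlyEqual = Math.Abs(o1.Double - o2.Double) < Epsilon;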
Also, considering the above cases, if one wants to exclude a property from the comparison, one faces the problem of introducing a serialization attribute on the actual class, even where it is otherwise unnecessary. This leads to code pollution, or to problems when the type actually does have to be serialized.
Note: These are some of the actual problems of this approach, but I am looking forward to find others.
For unit tests you don't need to write your own comparer. :)
Just use a modern framework. For example, try the FluentAssertions library:
o1.ShouldBeEquivalentTo(o2);
Serialization was made for storing an object or sending it over a pipe (network) that is outside of the current execution context. Not for doing something inside the execution context.
Some serialized values might not be considered equal when in fact they are: decimal 1.0 and integer 1, for instance.
Sure, you can do it, just like you can eat with a shovel, but you don't, because you might break your teeth!
You can use the System.Reflection namespace to get all the properties of the instance, as in this answer. With reflection you can compare not only public properties and fields (as JSON serialization does) but also private, protected, and internal ones. And you can stop at the first difference instead of examining every member, which increases the speed of the comparison (except in the worst case, where only the last property or field differs).
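As a rough sketch of that idea, assuming simple non-indexed properties and no circular references:

using System;
using System.Reflection;

public static class ReflectionComparer
{
    public static bool PropertiesEqual<T>(T x, T y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;

        var flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (PropertyInfo prop in typeof(T).GetProperties(flags))
        {
            if (prop.GetIndexParameters().Length > 0) continue; // skip indexers
            // Early termination: bail out at the first mismatch.
            if (!Equals(prop.GetValue(x), prop.GetValue(y))) return false;
        }
        return true;
    }
}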
The question is simple. Can a type that can change its internal state without it being observable from the outside be considered immutable?
Simplified example:
public struct Matrix
{
    bool determinantEvaluated;
    double determinant;

    public double Determinant
    {
        get // assume thread-safe correctness in the implementation of the getter
        {
            if (!determinantEvaluated)
            {
                determinant = getDeterminant(this);
                determinantEvaluated = true;
            }
            return determinant;
        }
    }
}
UPDATE: Clarified the thread-safeness issue, as it was causing distraction.
It depends.
If you are documenting for authors of client code or reasoning as an author of client code, then you are concerned with the interface of the component (that is, its externally observable state and behavior) and not with its implementation details (like the internal representation).
In this sense, a type is immutable even if it caches state, even if it initializes lazily, etc - as long as these mutations aren't observable externally. In other words, a type is immutable if it behaves as immutable when used through its public interface (or its other intended use cases, if any).
Of course, this can be tricky to get right (with mutable internal state, you may need to concern yourself with thread safety, serialization/marshaling behavior, etc). But assuming you do get it right (to the extent you need, at least) there's no reason not to consider such a type immutable.
Obviously, from the point of view of a compiler or an optimizer, such a type is typically not considered immutable (unless the compiler is sufficiently intelligent or has some "help" like hints or prior knowledge of some types) and any optimizations that were intended for immutable types may not be applicable, if this is the case.
Yes, an immutable object can change its state, provided that the changes are unseen by other components of the software (usually caches). It's rather like quantum physics: an event must have an observer to be an event.
In your case a possible implementation would be something like this:
public class Matrix {
    ...

    private Lazy<Double> m_Determinant = new Lazy<Double>(() => {
        return ... // TODO: Put actual implementation here
    });

    public Double Determinant {
        get {
            return m_Determinant.Value;
        }
    }
}
Note that Lazy<Double> m_Determinant has a changing state, m_Determinant.IsValueCreated, which is, however, unobservable.
I'm going to quote Clojure author Rich Hickey here:
If a tree falls in the woods, does it make a sound?
If a pure function mutates some local data in order to produce an immutable return value, is that ok?
It is perfectly reasonable, for performance reasons, to mutate objects that expose immutable APIs to the outside. The important thing about immutable objects is their immutability to the outside; everything encapsulated within them is fair game.
In a way, in garbage-collected languages like C#, all objects have some changing state because of the GC. As a consumer, that should not usually concern you.
I'll stick my neck out...
No, an immutable object cannot change its internal state in C# because observing its memory is an option and thus you can observe the uninitialised state. Proof:
public struct Matrix
{
    private bool determinantEvaluated;
    private double determinant;

    public double Determinant
    {
        get
        {
            if (!determinantEvaluated)
            {
                determinant = 1.0;
                determinantEvaluated = true;
            }
            return determinant;
        }
    }
}
then...
using System;
using System.Runtime.InteropServices;

public class Example
{
    public static void Main()
    {
        var unobserved = new Matrix();
        var observed = new Matrix();
        Console.WriteLine(observed.Determinant); // triggers the lazy evaluation

        IntPtr unobservedPtr = Marshal.AllocHGlobal(Marshal.SizeOf(typeof(Matrix)));
        IntPtr observedPtr = Marshal.AllocHGlobal(Marshal.SizeOf(typeof(Matrix)));
        byte[] unobservedMemory = new byte[Marshal.SizeOf(typeof(Matrix))];
        byte[] observedMemory = new byte[Marshal.SizeOf(typeof(Matrix))];

        Marshal.StructureToPtr(unobserved, unobservedPtr, false);
        Marshal.StructureToPtr(observed, observedPtr, false);
        Marshal.Copy(unobservedPtr, unobservedMemory, 0, Marshal.SizeOf(typeof(Matrix)));
        Marshal.Copy(observedPtr, observedMemory, 0, Marshal.SizeOf(typeof(Matrix)));
        Marshal.FreeHGlobal(unobservedPtr);
        Marshal.FreeHGlobal(observedPtr);

        for (int i = 0; i < unobservedMemory.Length; i++)
        {
            if (unobservedMemory[i] != observedMemory[i])
            {
                Console.WriteLine("Not the same");
                return;
            }
        }
        Console.WriteLine("The same");
    }
}
The purpose of specifying a type to be immutable is to establish the following invariant:
If two instances of an immutable type are ever observed to be equal, any publicly-observable reference to one may be replaced with a reference to the other without affecting the behavior of either.
Because .NET provides the ability to compare any two references for equality, it's not possible to achieve perfect equivalence among immutable instances. Nonetheless, the above invariant is still very useful if one regards reference-equality checks as being outside the realm of things for which a class object is responsible.
Note that under this rule, a subclass may define fields beyond those included in an immutable base class, but must not expose them in such a fashion as to violate the above invariant. Further, a class may include mutable fields provided that they never change in any way that affects a class's visible state. Consider something like the hash field in Java's string class. If it's non-zero, the hashCode value of the string is equal to the value stored in the field. If it's zero, the hashCode value of the string is the result of performing certain calculations on the immutable character sequence encapsulated by the string. Storing the result of the aforementioned calculations into the hash field won't affect the hash code of the string; it will merely speed up repeated requests for the value.
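Here is a C# sketch of the same caching trick; the class and field names are illustrative. The cached hash is mutable state, but it never affects the externally visible value:

public sealed class ImmutableName
{
    private readonly string value; // assumed non-null
    private int cachedHash;        // mutable, but not externally observable
    private bool hashComputed;

    public ImmutableName(string value) { this.value = value; }

    public override int GetHashCode()
    {
        // Benign race: a concurrent caller may simply recompute the same value.
        if (!hashComputed)
        {
            cachedHash = value.GetHashCode();
            hashComputed = true;
        }
        return cachedHash;
    }
}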
Consider the following objects:
class Route
{
    public int Origin { get; set; }
    public int Destination { get; set; }
}
Route implements equality operators.
class Routing
{
    public List<Route> Paths { get; set; }
}
I used the code below to implement the GetHashCode method for the Routing object, and it seems to work, but I wonder if that's the right way to do it. I rely on equality checks, and as I'm uncertain, I thought I'd ask you guys. Can I just sum the hash codes, or do I need to do more magic to guarantee the desired effect?
public override int GetHashCode()
{
    return (Paths != null
        ? Paths.Select(p => p.GetHashCode()).Sum()
        : 0);
}
I checked several GetHashCode() questions here as well as MSDN and Eric Lippert's article on this topic but couldn't find what I'm looking for.
I think your solution is fine. (Much later remark: LINQ's Sum method will act in checked context, so you can very easily get an OverflowException which means it is not so fine, after all.) But it is more usual to do XOR (addition without carry). So it could be something like
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths)
            hc ^= p.GetHashCode();
    return hc;
}
Addendum (after answer was accepted):
Remember that if you ever use this type Routing in a Dictionary<Routing, Whatever>, a HashSet<Routing> or another situation where a hash table is used, then your instance will be lost if someone alters (mutates) the Routing after it has been added to the collection.
If you're sure that will never happen, use my code above. Dictionary<,> and so on will still work if you make sure no-one alters the Routing that is referenced.
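To illustrate the failure mode, assuming Routing overrides Equals and GetHashCode based on Paths as discussed:

var set = new HashSet<Routing>();
var r = new Routing { Paths = new List<Route> { new Route { Origin = 1, Destination = 2 } } };
set.Add(r);

r.Paths.Add(new Route { Origin = 3, Destination = 4 }); // hash code changes

// The set looks in the bucket for the new hash and (very likely) misses.
Console.WriteLine(set.Contains(r)); // typically False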
Another choice is to just write
public override int GetHashCode()
{
return 0;
}
if you believe the hash code will never be used. If every instance returns 0 for its hash code, you will get very bad performance with hash tables, but your object will not be lost. A third option is to throw a NotSupportedException.
The code from Jeppe Stig Nielsen's answer works, but it can lead to a lot of repeated hash code values. Say you are hashing a list of ints in the range 0-100; then your hash code would be guaranteed to lie between 0 and 127. This makes for a lot of collisions when used in a Dictionary. Here is an improved version:
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths)
        {
            hc ^= p.GetHashCode();
            hc = (hc << 7) | (int)((uint)hc >> (32 - 7)); // rotate hc left so all bits get involved
        }
    return hc;
}
This code will at least involve all bits over time as more and more items are hashed in.
As a guideline, the hash of an object must be the same over the object's entire lifetime. I would leave the GetHashCode function alone and not override it. The hash code is only used if you want to put your objects in a hash table.
You should read Eric Lippert's great article about hash codes in .NET: Guidelines and rules for GetHashCode.
Quoted from that article:
Guideline: the integer returned by GetHashCode should never change
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
The GetHashCode function you implemented will not return the same hash code over the lifetime of the object. If you use this function, you will run into trouble if you add those objects to a hash table: the Contains method will not work.
I don't think this is the right way to do it, because the final hash code should distinguish objects as well as possible. In your case you do a Sum(), which can produce the same result from different hash codes in the collection (in the end, hash codes are just integers).
If your intention is to determine equality based on the content of the collection, then just compare the collections of the two objects directly. That can be a time-consuming operation, by the way.
Is it ok to call GetHashCode as a method to test equality from inside the Equals override?
For example, is this code acceptable?
public class Class1
{
    public string A { get; set; }

    public string B { get; set; }

    public override bool Equals(object obj)
    {
        Class1 other = obj as Class1;
        return other != null && other.GetHashCode() == this.GetHashCode();
    }

    public override int GetHashCode()
    {
        int result = 0;
        result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
        result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
        return result;
    }
}
The others are right; your equality operation is broken. To illustrate:
public static void Main()
{
    var c1 = new Class1() { A = "apahaa", B = null };
    var c2 = new Class1() { A = "abacaz", B = null };
    Console.WriteLine(c1.Equals(c2));
}
I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.
Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.
In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". For my article on why you shouldn't rely upon hash codes for equality, click here.
No, it is not ok, because it's not
equality <=> hashcode equality.
It's just
equality => hashcode equality.
or in the other direction:
hashcode inequality => inequality.
Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
I would say that unless you want Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different yet share the same hash code. The probability may be small, but it isn't zero.
No, this is not an acceptable way to test for equality. It is very possible for two non-equal values to have the same hash code, which would cause your implementation of Equals to return true when it should return false.
You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.
If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.
For example:
public override bool Equals(object obj)
{
    Class1 other = obj as Class1;
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // The hash codes are the same, so do a full member-by-member compare.
    return this.A == other.A && this.B == other.B;
}
You cannot say that just because the hash codes are equal then the objects must be equal.
The only time you would call GetHashCode inside of Equals was if it was much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you could quickly verify that the objects were not equal.
So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals overload, causing the C# compiler to warn me that I was overloading GetHashCode without overloading Equals.
There is one case where using hashcodes as a short-cut on equality comparisons makes sense.
Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).
There are various approaches one can take, but in all of them you have a small number of slots that hashed values can be placed in, and we take either the open or the closed approach (and, just for fun, some people use the jargon the opposite way around); if we collide on the same slot for two different objects, we can either store them in the same slot (keeping a linked list or similar for where the objects are actually stored) or re-probe to pick a different slot (there are various strategies for this).
Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).
Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).
Now, when we come to insert, our logic is going to be something like:
1. Get the hash code for the object.
2. Get the slot for the object.
3. If the slot is empty, place the object in it and return.
4. If the slot contains an equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
5. Try the next slot according to the collision strategy, and return to item 3 (perhaps resizing if we loop this too often).
So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:
private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
    return ReferenceEquals(newItem, storedItem) // fast; false negatives, but no false positives (only applicable to reference types)
        ||
        (
            newHash == oldHash // fast; false positives, but no false negatives
            &&
            _cmp.Equals(newItem, storedItem) // slow for some types, but always the correct result
        );
}
Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type was int, this would be a total waste. If our key type were string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length), then the saving could well be worth it.
Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.
It's a broken implementation, as others have stated.
You should only short-circuit the equality check using GetHashCode, like:
if (other.GetHashCode() != this.GetHashCode())
    return false;
in the Equals method if you're certain the ensuing Equals implementation is much more expensive than GetHashCode, which is not the case for the vast majority of types.
In the implementation you have shown (which covers 99% of the cases), it's not only broken, it's also much slower. And the reason? Computing the hashes of your properties would almost certainly be slower than comparing them, so you're not even gaining in performance terms. The advantage of implementing a proper GetHashCode comes when your class can be the key type for hash tables, where the hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if the object is in a collection. Even though GetHashCode itself should be fast, it's mostly not faster than the equivalent Equals.
To benchmark, time your Equals (a proper implementation, replacing the current hash-based one) and GetHashCode like this:
var watch = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    action(); // Equals and GetHashCode called here to test for performance.
}
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
I created the following code to verify the uniqueness of a series of "tuples":
struct MyTuple
{
    public MyTuple(string a, string b, string c)
    {
        ValA = a; ValB = b; ValC = c;
    }

    private string ValA;
    private string ValB;
    private string ValC;
}

...

HashSet<MyTuple> tupleList = new HashSet<MyTuple>();
If I'm correct, I will not end up with two tuples with the same values in my HashSet, thanks to the fact that I'm using a struct. I could not get the same behavior with a class without implementing IEquatable or something like that (I didn't dig too much into how to do that).
I want to know if there is some gotcha in what I'm doing. Performance-wise, I wouldn't expect the use of a struct to be a problem, considering that the strings inside are reference types.
Edit:
I want my HashSet to never contain two tuples whose strings have the same values. In other words, I want the strings to behave like value types.
The gotcha is that it will not work. If two strings are "a", they can still be different references. That case would break your implementation.
Implement Equals() and GetHashCode() properly (e.g. using those of the supplied strings, and taking care with null references in your struct), and possibly IEquatable<MyTuple> to make it even nicer.
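A sketch of what that could look like for MyTuple; it makes the fields read-only and guards against null strings:

struct MyTuple : IEquatable<MyTuple>
{
    private readonly string ValA;
    private readonly string ValB;
    private readonly string ValC;

    public MyTuple(string a, string b, string c) { ValA = a; ValB = b; ValC = c; }

    // string's == operator compares values and handles nulls.
    public bool Equals(MyTuple other) =>
        ValA == other.ValA && ValB == other.ValB && ValC == other.ValC;

    public override bool Equals(object obj) => obj is MyTuple t && Equals(t);

    public override int GetHashCode()
    {
        unchecked
        {
            int h = ValA?.GetHashCode() ?? 0;
            h = h * 31 + (ValB?.GetHashCode() ?? 0);
            h = h * 31 + (ValC?.GetHashCode() ?? 0);
            return h;
        }
    }
}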
Edit: The default implementation is explicitly not suitable for use in hash tables and sets. This is clearly stated in the ValueType.GetHashCode() documentation (emphasis added):
The GetHashCode method applies to types derived from ValueType. One or more fields of the derived type is used to calculate the return value. If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. Additionally, if the value of one or more of those fields changes, the return value might become unsuitable for use as a key in a hash table. In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type.
You should always implement Equals() and GetHashCode() as a pair, and that is even more important here, since ValueType.Equals() is terribly inefficient and unreliable (it uses reflection and an unknown method of equality comparison). There is also a performance problem when you don't override those two: structs get boxed when the default implementations are called.
Your approach should work, but you should make the string values read-only, as Lucero said.
You could also take a look at the new .NET 4.0 Tuple types. Although they are implemented as classes (because they support up to quite many parameters), they implement the new IStructuralEquatable interface, which is intended exactly for your purpose.
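For instance, a quick sketch with the built-in Tuple class, whose structural equality already behaves the way you want:

var set = new HashSet<Tuple<string, string, string>>();
set.Add(Tuple.Create("a", "b", "c"));
set.Add(Tuple.Create("a", "b", "c")); // second Add returns false
Console.WriteLine(set.Count);         // prints 1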
It depends on what you mean by "same values". If you want the strings themselves to be unequal to one another—rather than just the references—then you're going to have to write your own Equals(MyTuple) method.