How can the method Dictionary.Add be O(1) amortized? - c#

I am trying to get a better understanding of how hashtables and dictionaries work in C# from a complexity perspective (but I guess the language is not an important factor, that's probably just a theoretical question).
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
However, let's look at that code:
public class Foo {
    public Foo() { }
    public override int GetHashCode() {
        return 5; // arbitrary value, purposely a constant
    }
}

static void Main(string[] args) {
    Dictionary<Foo, int> test = new Dictionary<Foo, int>();
    Foo a = new Foo();
    Foo b = new Foo();
    test.Add(a, 5);
    test.Add(b, 6);  // 1. no exception raised, even though GetHashCode() returns the same hash
    test.Add(a, 10); // 2. exception raised
}
I understand that behind the scenes there is a hash collision at 1., and that separate chaining probably handles it.
However, at 2. an ArgumentException is raised. That means that internally the Dictionary keeps track of each key inserted after having determined its hash. That also means that each time we add an entry to the dictionary, it checks whether the key has already been inserted, using the Equals method.
My question is: why is it considered O(1) complexity when it seems like it should be O(n), given that it checks the already inserted keys?

But it doesn't have to check all of the keys. It only has to check keys that hash to the same value. And, as you say, a good hash code will minimize the number of hash collisions, so on average it doesn't have to make any key comparisons at all.
Remember, the rules for GetHashCode say that if a.GetHashCode() != b.GetHashCode(), then a is not equal to b. But if a.GetHashCode() == b.GetHashCode(), a might be equal to b.
Also, you say:
I know that the method Add of a Dictionary is supposed to be O(1) if Count is less than the capacity (which is kind of obvious).
That's not entirely true. That's the ideal, assuming a perfect hash function that will give a unique number for every key. But the perfect hash function doesn't exist, in the general case, so typically you'll see O(1) (or very close to it) performance until Count exceeds some fairly large percentage of the capacity: say 85% or 90%.
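To see why only same-bucket keys are ever compared, here is a rough sketch of separate chaining. This is illustrative only, not the actual Dictionary<TKey, TValue> source; the fixed bucket count and class names are assumptions for the sketch.

using System;
using System.Collections.Generic;

// On Add, Equals is only ever called on keys whose stored hash matches the
// new key's hash, so with a decent hash function the expected number of
// comparisons per Add stays constant.
class ChainedMap<TKey, TValue>
{
    private class Node
    {
        public int Hash;
        public TKey Key;
        public TValue Value;
        public Node Next;
    }

    private readonly Node[] _buckets = new Node[1024]; // fixed size for the sketch; the real type resizes
    private readonly IEqualityComparer<TKey> _cmp = EqualityComparer<TKey>.Default;

    public void Add(TKey key, TValue value)
    {
        int hash = _cmp.GetHashCode(key) & 0x7FFFFFFF;
        int bucket = hash % _buckets.Length;

        // Walk only this bucket's chain; keys in other buckets are never compared.
        for (Node n = _buckets[bucket]; n != null; n = n.Next)
        {
            if (n.Hash == hash && _cmp.Equals(n.Key, key))
                throw new ArgumentException("An item with the same key has already been added.");
        }

        _buckets[bucket] = new Node { Hash = hash, Key = key, Value = value, Next = _buckets[bucket] };
    }
}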

The answer is simple and difficult.
Simple part: it is because (you can check it yourself)
a.Equals(b) == false
If you want an exception when adding "b", just implement the Equals method as well.
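For example, a minimal sketch (the Equals logic here is hypothetical; it simply treats every Foo as equal to every other Foo):

public class Foo
{
    public override int GetHashCode()
    {
        return 5; // same constant hash as in the question
    }

    // Hypothetical: consider every Foo equal to every other Foo.
    // With this override in place, test.Add(b, 6) would also throw an
    // ArgumentException, because the dictionary now sees b as a duplicate key.
    public override bool Equals(object obj)
    {
        return obj is Foo;
    }
}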
Now the difficult part:
The default object implementation of Equals calls RuntimeHelpers.Equals. The source of RuntimeHelpers is here. Unfortunately the method is extern:
[System.Security.SecuritySafeCritical] // auto-generated
[ResourceExposure(ResourceScope.None)]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
public new static extern bool Equals(Object o1, Object o2);
What exactly the implementation of this method is, I do not know, but I think it is based on pointers (so addresses in memory).

Related

implement GetHashCode() for objects that contain collections

Consider the following objects:
class Route
{
public int Origin { get; set; }
public int Destination { get; set; }
}
Route implements equality operators.
class Routing
{
public List<Route> Paths { get; set; }
}
I used the code below to implement the GetHashCode method for the Routing object and it seems to work, but I wonder if that's the right way to do it. I rely on equality checks, and as I'm uncertain I thought I'd ask you guys. Can I just sum the hash codes or do I need to do more magic in order to guarantee the desired effect?
public override int GetHashCode()
{
    return (Paths != null
        ? Paths.Select(p => p.GetHashCode()).Sum()
        : 0);
}
I checked several GetHashCode() questions here as well as MSDN and Eric Lippert's article on this topic but couldn't find what I'm looking for.
I think your solution is fine. (Much later remark: LINQ's Sum method operates in a checked context, so you can very easily get an OverflowException, which means it is not so fine after all.) But it is more usual to do XOR (addition without carry). So it could be something like
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths)
            hc ^= p.GetHashCode();
    return hc;
}
Addendum (after answer was accepted):
Remember that if you ever use this type Routing in a Dictionary<Routing, Whatever>, a HashSet<Routing> or another situation where a hash table is used, then your instance will be lost if someone alters (mutates) the Routing after it has been added to the collection.
If you're sure that will never happen, use my code above. Dictionary<,> and so on will still work if you make sure no-one alters the Routing that is referenced.
Another choice is to just write
public override int GetHashCode()
{
return 0;
}
if you believe the hash code will never be used. If every instance returns 0 for its hash code, you will get very bad performance with hash tables, but your object will not be lost. A third option is to throw a NotSupportedException.
The code from Jeppe Stig Nielsen's answer works, but it could lead to a lot of repeated hash code values. Let's say you are hashing a list of ints in the range of 0-100; then your hash code would be guaranteed to be between 0 and 255. This makes for a lot of collisions when used in a Dictionary. Here is an improved version:
public override int GetHashCode()
{
    int hc = 0;
    if (Paths != null)
        foreach (var p in Paths)
        {
            hc ^= p.GetHashCode();
            hc = (hc << 7) | (hc >> (32 - 7)); // rotate hc to the left so that, over time, all bits are used
        }
    return hc;
}
This code will at least involve all bits over time as more and more items are hashed in.
As a guideline, the hash of an object must be the same over the object's entire lifetime. I would leave the GetHashCode function alone and not override it. The hash code is only used if you want to put your objects in a hash table.
You should read Eric Lippert's great article about hash codes in .NET: Guidelines and rules for GetHashCode.
Quoted from that article:
Guideline: the integer returned by GetHashCode should never change
Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable
If an object's hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn't find it.
The GetHashCode function you implemented will not return the same hash code over the lifetime of the object. If you use this function, you will run into trouble if you add those objects to a hash table: the Contains method will not work.
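A short sketch of that failure mode (MutableKey is a hypothetical type, used only to make the effect visible):

using System;
using System.Collections.Generic;

// Hypothetical type whose hash code depends on mutable state.
class MutableKey
{
    public int Value { get; set; }

    public override int GetHashCode()
    {
        return Value;
    }

    public override bool Equals(object obj)
    {
        MutableKey other = obj as MutableKey;
        return other != null && other.Value == Value;
    }
}

class Program
{
    static void Main()
    {
        var set = new HashSet<MutableKey>();
        var key = new MutableKey { Value = 5 };
        set.Add(key);       // stored under hash code 5
        key.Value = 74;     // mutate after insertion

        // The set recomputes the hash (now 74), which no longer matches the
        // hash stored with the entry, so the object is effectively lost.
        Console.WriteLine(set.Contains(key)); // False
    }
}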
I don't think it's the right way to do it, because to determine the final hash code it has to be unique for the specified object. In your case you do a Sum(), which can produce the same result from different hash codes in the collection (in the end, hash codes are just integers).
If your intention is to determine equality based on the content of the collection, at that point just compare those collections between the two objects. It could be a time-consuming operation, by the way.

GetHashCode for Dictionary items

I override the Equals method for one of my classes. In the method, I check the equality of each pair of a dictionary with those of the dictionary of another instance, like the following does:
public override bool Equals (object obj)
{
    ...
    // compare to make sure all <key, value> pair of this.dict have
    // the match in obj.dict
    ...
}
Now, I need to override the GetHashCode method as well, as is suggested.
Do I need to do that for all the keys of the dictionary, or keys plus values?
Basically, would the following be good or overkill?
public override int GetHashCode ()
{
    int iHash = 0;
    foreach (KeyValuePair<string, T> pair in this.dict)
    {
        iHash ^= pair.Key.GetHashCode();
        iHash ^= pair.Value.GetHashCode();
    }
    return iHash;
}
Going along with what @Mitch Wheat linked to, that's not the best way to do a GetHashCode() if you use this class with a Dictionary or HashSet.
Imagine your internal Dictionary had only one entry. Your hash is now the value of that single KeyValuePair. You stick the whole class in a HashSet. You add another item to your internal Dictionary. Now the hashcode of your class has changed, because you're iterating over two items in your class.
When you call HashSet.Contains(obj), it calls obj.GetHashCode(), which has now changed, even though it's the same class instance. HashSet.Contains() will find that it doesn't contain this new hash and return false, never calling Equals (which would return true, since the references are the same).
Suddenly it's like your object has disappeared from the HashSet, even though the class is in there, with an outdated hash.
You really don't want your hash to change. It's okay to have collisions in your GetHashCode, because if it collides, it will call the (slower) .Equals() method. It's a handy optimization, that, if improperly implemented, can cause some headaches along the way.
As a side note, as pointed out in the link above, it's a good idea to multiply your hash by a prime number before XOR-ing it with another value. It helps with keeping the hash unique.
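For illustration, one common shape of that pattern, a sketch with hypothetical fields FieldA and FieldB (note that an order-sensitive combiner like this suits plain fields or ordered sequences; for the unordered dictionary in this question, a commutative combiner such as plain XOR is safer, because two equal dictionaries may enumerate their pairs in different orders):

public override int GetHashCode()
{
    unchecked // let the arithmetic wrap instead of throwing
    {
        int hash = 17;
        // FieldA and FieldB are hypothetical members; multiply the running
        // hash by a prime before XOR-ing in the next value so that position matters.
        hash = hash * 31 ^ (FieldA == null ? 0 : FieldA.GetHashCode());
        hash = hash * 31 ^ (FieldB == null ? 0 : FieldB.GetHashCode());
        return hash;
    }
}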
Do you plan to use the object in a HashSet? You really only need to implement GetHashCode if the object is going to be used in such a way that it requires it be uniquely identifiable by its hash. It is good practice to always implement GetHashCode taking into account the same fields that are used in equality, but not always necessary.
If it is necessary in your case, I believe you have the right idea.

Using GetHashCode to test equality in Equals override

Is it ok to call GetHashCode as a method to test equality from inside the Equals override?
For example, is this code acceptable?
public class Class1
{
    public string A { get; set; }

    public string B { get; set; }

    public override bool Equals(object obj)
    {
        Class1 other = obj as Class1;
        return other != null && other.GetHashCode() == this.GetHashCode();
    }

    public override int GetHashCode()
    {
        int result = 0;
        result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
        result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
        return result;
    }
}
The others are right; your equality operation is broken. To illustrate:
public static void Main()
{
    var c1 = new Class1() { A = "apahaa", B = null };
    var c2 = new Class1() { A = "abacaz", B = null };
    Console.WriteLine(c1.Equals(c2));
}
I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.
Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.
In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". For my article on why you shouldn't rely upon hash codes for equality, click here.
No, it is not ok, because it's not
equality <=> hashcode equality.
It's just
equality => hashcode equality.
or in the other direction:
hashcode inequality => inequality.
Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
I would say, unless you want for Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different but share the same hash code. The probability may be small, but it isn't zero.
No, this is not an acceptable way to test for equality. It is very possible for two non-equal values to have the same hash code. This would cause your implementation of Equals to return true when it should return false.
You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.
If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.
For example:
public override bool Equals(object obj)
{
    Class1 other = obj as Class1;
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // the hash codes are the same, so you have to do a full object compare,
    // e.g. by comparing the actual properties:
    return this.A == other.A && this.B == other.B;
}
You cannot say that just because the hash codes are equal then the objects must be equal.
The only time you would call GetHashCode inside of Equals was if it was much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you could quickly verify that the objects were not equal.
So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals overload, causing the C# compiler to warn me that I was overloading GetHashCode without overloading Equals.
There is one case where using hashcodes as a short-cut on equality comparisons makes sense.
Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).
There are various different approaches one can take, but in all of them you have a small number of slots the hashed values can be placed in, and we take either the open or closed approach (and, confusingly, some people use those two terms with the opposite meanings): if we collide on the same slot for two different objects, we can either store them in the same slot (with a linked list or similar for where the objects are actually stored) or re-probe to pick a different slot (there are various strategies for this).
Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).
Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).
Now, when we come to insert our logic is going to be something like:
1) Get hash code for object.
2) Get slot for object.
3) If slot is empty, place object in it and return.
4) If slot contains equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
5) Try next slot according to collision strategy, and return to item 3 (perhaps resizing if we loop this too often).
So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:
private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
    return ReferenceEquals(newItem, storedItem) // fast, false negatives, no false positives (only applicable to reference types)
        ||
        (
            newHash == oldHash // fast, false positives, no false negatives
            &&
            _cmp.Equals(newItem, storedItem) // slow for some types, but always the correct result
        );
}
Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type was int then this would be a total waste. If our key type were string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length), then the saving could well be worth it.
Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.
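A minimal sketch of that idea (not the BCL layout; the field names are assumptions): each stored entry carries the hash code that was computed when it was inserted, so resizes and the IsMatch-style comparison above can reuse it instead of calling GetHashCode() again.

// Hypothetical entry layout for a chained hash set that memoises hash codes.
struct Entry<TKey>
{
    public int HashCode; // cached result of comparer.GetHashCode(Key), computed once on insert
    public TKey Key;
    public int Next;     // index of the next entry in the same bucket, or -1
}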
It's a wrong implementation, as others have stated.
You should short-circuit the equality check using GetHashCode, like:
if (other.GetHashCode() != this.GetHashCode())
    return false;
in the Equals method only if you're certain the ensuing Equals implementation is much more expensive than GetHashCode, which is not the case in the vast majority of situations.
In the one implementation you have shown (which is 99% of the cases) it's not only broken, it's also much slower. And the reason? Computing the hash of your properties would almost certainly be slower than comparing them, so you're not even gaining in performance terms. The advantage of implementing a proper GetHashCode is when your class can be the key type for hash tables, where the hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if it's in a collection. Even though GetHashCode itself should be fast, it's mostly not faster than the equivalent Equals.
To benchmark, run your Equals (a proper implementation, taking out the current hash-based implementation) and GetHashCode in a loop like this:
var watch = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    action(); // Equals and GetHashCode called here to test for performance
}
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);

Why is ValueType.GetHashCode() implemented like it is?

From ValueType.cs
**Action: Our algorithm for returning the hashcode is a little bit complex. We look
** for the first non-static field and get it's hashcode. If the type has no
** non-static fields, we return the hashcode of the type. We can't take the
** hashcode of a static member because if that member is of the same type as
** the original type, we'll end up in an infinite loop.
I got bitten by this today when I was using a KeyValuePair as a key in a Dictionary (it stored an xml attribute name (enum) and its value (string)), and expected it to have its hash code computed based on all its fields, but according to the implementation it only considered the Key part.
Example (c/p from Linqpad):
void Main()
{
    var kvp1 = new KeyValuePair<string, string>("foo", "bar");
    var kvp2 = new KeyValuePair<string, string>("foo", "baz");

    // true
    (kvp1.GetHashCode() == kvp2.GetHashCode()).Dump();
}
The first non-static field, I guess, means the first field in declaration order, which could also cause trouble when changing the field order in source for whatever reason, while believing it doesn't change the code semantically.
The actual implementation of ValueType.GetHashCode() doesn't quite match the comment. It has two versions of the algorithm, fast and slow. It first checks if the struct contains any members of a reference type and if there is any padding between the fields. Padding is empty space in a structure value, created when the JIT compiler aligns the fields. There's padding in a struct that contains bool and int (3 bytes) but no padding when it contains int and int, they fit snugly together.
Without a reference and without padding, it can do the fast version since every bit in the structure value is a bit that belongs to a field value. It simply xors 4 bytes at a time. You'll get a 'good' hash code that considers all the members. Many simple structure types in the .NET framework behave this way, like Point and Size.
Failing that test, it does the slow version, the moral equivalent of reflection. That's what you get, your KeyValuePair<> contains references. And this one only checks the first candidate field, like the comment says. This is surely a perf optimization, avoiding burning too much time.
Yes, nasty detail and not that widely known. It is usually discovered when somebody notices that their collection code sucks mud.
One more excruciating detail: the fast version has a bug that bites when the structure contains a field of type decimal. The values 12m and 12.0m are logically equal, but they don't have the same bit pattern. GetHashCode() will say that they are not equal. Ouch.
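A quick way to check this for yourself. This is only a sketch: DecimalHolder is a hypothetical struct, and whether the two hash codes actually differ depends on the runtime version and on which of the two code paths described above is taken.

using System;

struct DecimalHolder // hypothetical single-field struct
{
    public decimal Value;
}

class Program
{
    static void Main()
    {
        var a = new DecimalHolder { Value = 12m };
        var b = new DecimalHolder { Value = 12.0m };

        Console.WriteLine(a.Value == b.Value);  // True: the decimals are logically equal
        Console.WriteLine(a.GetHashCode());     // compare these two outputs; on affected
        Console.WriteLine(b.GetHashCode());     // runtimes they are different
    }
}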
UPDATE: This answer was (in part) the basis of a blog article I wrote which goes into more details about the design characteristics of GetHashcode. Thanks for the interesting question!
I didn't implement it and I haven't talked to the people who did. But I can point out a few things.
(Before I go on, note that here I am specifically talking about hash codes for the purposes of balancing hash tables where the contents of the table are chosen by non-hostile users. The problems of hash codes for digital signing, redundancy checking, or ensuring good performance of a hash table when some of the users are mounting denial-of-service attacks against the table provider are beyond the scope of this discussion.)
First, as Jon correctly notes, the given algorithm does implement the required contract of GetHashCode. It might be sub-optimal for your purposes, but it is legal. All that is required is that things that compare equal have equal hash codes.
So what are the "nice to haves" in addition to that contract? A good hash code implementation should be:
1) Fast. Very fast! Remember, the whole point of the hash code in the first place is to rapidly find a relatively empty slot in a hash table. If the O(1) computation of the hash code is in practice slower than the O(n) time taken to do the lookup naively then the hash code solution is a net loss.
2) Well distributed across the space of 32 bit integers for the given distribution of inputs. The worse the distribution across the ints, the more like a naive linear lookup the hash table is going to be.
So, how would you make a hash algorithm for arbitrary value types given those two conflicting goals? Any time you spend on a complex hash algorithm that guarantees good distribution is time poorly spent.
A common suggestion is "hash all of the fields and then XOR together the resulting hash codes". But that is begging the question; XORing two 32 bit ints only gives good distribution when the inputs themselves are extremely well-distributed and not related to each other, and that is an unlikely scenario:
// (Updated example based on good comment!)
struct Control
{
string name;
int x;
int y;
}
What is the likelihood that x and y are well-distributed over the entire range of 32 bit integers? Very low. Odds are much better that they are both small and close to each other, in which case xoring their hash codes together makes things worse, not better. xoring together integers that are close to each other zeros out most of the bits.
Furthermore, this is O(n) in the number of fields! A value type with a lot of small fields would take a comparatively long time to compute the hash code.
Basically the situation we're in here is that the user didn't provide a hash code implementation themselves; either they don't care, or they don't expect this type to ever be used as a key in a hash table. Given that you have no semantic information whatsoever about the type, what's the best thing to do? The best thing to do is whatever is fast and gives good results most of the time.
Most of the time, two struct instances that differ will differ in most of their fields, not just one of their fields, so just picking one of them and hoping that it's the one that differs seems reasonable.
Most of the time, two struct instances that differ will have some redundancy in their fields, so combining the hash values of many fields together is likely to decrease, not increase, the entropy in the hash value, even as it consumes the time that the hash algorithm is designed to save.
Compare this with the design of anonymous types in C#. With anonymous types we do know that it is highly likely that the type is being used as a key to a table. We do know that it is highly likely that there will be redundancy across instances of anonymous types (because they are results of a cartesian product or other join). And therefore we do combine the hash codes of all of the fields into one hash code. If that gives you bad performance due to the excess number of hash codes being computed, you are free to use a custom nominal type rather than the anonymous type.
It should still obey the contract of GetHashCode even if the field order changes: equal values will have equal hash codes, within the lifetime of that process.
In particular:
Non-equal values don't have to have non-equal hash codes
Hash codes don't have to be consistent across processes (you can change an implementation, rebuild, and everything should still work - you shouldn't be persisting hash codes, basically)
Now I'm not saying that ValueType's implementation is a great idea - it will cause performance suckage in various ways... but I don't think it's actually broken.
Well, there are pros and cons to any implementation of GetHashCode(). These are of course the things we weigh up when implementing our own, but in the case of ValueType.GetHashCode() there is a particular difficulty in that they don't have much information on what the actual details of the concrete type will be. Of course, this often happens to us when we are creating an abstract class or one intended to be the base of classes that will add a lot more in terms of state, but in those cases we have an obvious solution of just using the default implementation of object.GetHashCode() unless a derived class cares to override it there.
With ValueType.GetHashCode() they don't have this luxury, as the primary difference between a value type and a reference type is, despite the popularity of talking about implementation details of stack vs. heap, the fact that for a value type equivalence relates to value, while for an object type equivalence relates to identity (even when an object defines a different form of equivalence by overriding Equals() and GetHashCode(), the concept of reference equality still exists and is still useful).
So, for the Equals() method the implementation is obvious; check the two objects are the same type, and if it is then check also that all fields are equal (actually there's an optimisation which does a bitwise comparison in some cases, but that's an optimisation on the same basic idea).
What to do for GetHashCode()? There simply is no perfect solution. One thing they could do is some sort of multiply-then-add or shift-then-xor on every field. That would likely give a pretty good hashcode, but could be expensive if there were a lot of fields (never mind that it's not recommended to have value types with lots of fields; the implementer has to consider that they still can, and indeed there might even be times when it makes sense, though I honestly can't imagine a time where it both makes sense and it also makes sense to hash it). If they knew some fields were rarely different between instances they could ignore those fields and still have a pretty good hashcode, while also being quite fast. Finally, they can ignore most fields, and hope that the one(s) they don't ignore vary in value most of the time. They went for the most extreme version of the latter.
(What is done when there are no instance fields is another matter, and a pretty good choice: such value types are equal to all other instances of the same type, and they have a hashcode that matches that.)
So, it's an implementation that sucks if you are hashing lots of values where the first field is the same (or otherwise returns the same hashcode), but other implementations would suck in other cases (Mono goes for xoring all the fields' hashcodes together, better in your case, worse in others).
The matter of changing field order doesn't matter, as hashcode is quite clearly stated as only remaining valid for the lifetime of a process and not being suitable for most cases where they could be persisted beyond that (can be useful in some caching situations where it doesn't hurt if things aren't found correctly after a code change).
So, not great, but nothing would be perfect. It goes to show that one must always consider both sides of what "equality" means when using an object as a key. It's easily fixed in your case with:
public class KVPCmp<TKey, TValue> : IEqualityComparer<KeyValuePair<TKey, TValue>>, IEqualityComparer
{
    bool IEqualityComparer.Equals(object x, object y)
    {
        if (x == null)
            return y == null;
        if (y == null)
            return false;
        if (!(x is KeyValuePair<TKey, TValue>) || !(y is KeyValuePair<TKey, TValue>))
            throw new ArgumentException("Comparison of KeyValuePairs only.");
        return Equals((KeyValuePair<TKey, TValue>)x, (KeyValuePair<TKey, TValue>)y);
    }

    public bool Equals(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
    {
        return x.Key.Equals(y.Key) && x.Value.Equals(y.Value);
    }

    public int GetHashCode(KeyValuePair<TKey, TValue> obj)
    {
        // Rotate the key's hash by 16 bits before mixing in the value's hash,
        // so that both halves of the pair contribute to the result.
        int keyHash = obj.Key.GetHashCode();
        return ((keyHash << 16) | (keyHash >> 16)) ^ obj.Value.GetHashCode();
    }

    public int GetHashCode(object obj)
    {
        if (obj == null)
            return 0;
        if (!(obj is KeyValuePair<TKey, TValue>))
            throw new ArgumentException();
        return GetHashCode((KeyValuePair<TKey, TValue>)obj);
    }
}
Use this as your comparer when creating your dictionary, and all should be well (you only need the generic comparer methods really, but leaving the rest in does no harm and can be useful to have sometimes).
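For example (hypothetical usage), passing the comparer into the dictionary constructor so that both Key and Value contribute to the hash:

var map = new Dictionary<KeyValuePair<string, string>, int>(
    new KVPCmp<string, string>());

map.Add(new KeyValuePair<string, string>("foo", "bar"), 1);
map.Add(new KeyValuePair<string, string>("foo", "baz"), 2); // no longer hashes the same as "foo"/"bar"

Console.WriteLine(map[new KeyValuePair<string, string>("foo", "baz")]); // 2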
Thank you all for the very, very informative answers. I knew that there had to be some rationale behind that decision, but I wish it were documented better. I'm not able to use v4 of the framework, so there is no Tuple<>, and that was the main reason I decided to piggyback on the KeyValuePair struct. But I guess there is no cutting corners and I'll have to roll my own. Once more, thank you all.

C#/Java: Proper Implementation of CompareTo when Equals tests reference identity

I believe this question applies equally well to C# as to Java, because both require that {c,C}ompareTo be consistent with {e,E}quals:
Suppose I want my equals() method to be the same as a reference check, i.e.:
public bool equals(Object o) {
return this == o;
}
In that case, how do I implement compareTo(Object o) (or its generic equivalent)? Part of it is easy, but I'm not sure about the other part:
public int compareTo(Object o) {
    MyClass other = (MyClass)o;
    if (this == other) {
        return 0;
    } else {
        int c = foo.CompareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
I can't just blindly return 1 or -1, because the solution should adhere to the normal requirements of compareTo. I can check all the instance fields, but if they are all equal, I'd still like compareTo to return a value other than 0. It should be true that a.compareTo(b) == -(b.compareTo(a)), and the ordering should stay consistent as long as the objects' state doesn't change.
I don't care about ordering across invocations of the virtual machine, however. This makes me think that I could use something like memory address, if I could get at it. Then again, maybe that won't work, because the Garbage Collector could decide to move my objects around.
hashCode is another idea, but I'd like something that will be always unique, not just mostly unique.
Any ideas?
First of all, if you are using Java 5 or above, you should implement Comparable<MyClass> rather than the plain old Comparable; therefore your compareTo method should take a parameter of type MyClass, not Object:
public int compareTo(MyClass other) {
    if (this == other) {
        return 0;
    } else {
        int c = foo.CompareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
As of your question, Josh Bloch in Effective Java (Chapter 3, Item 12) says:
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y. (This implies that x.compareTo(y) must throw an exception if and only if y.compareTo(x) throws an exception.)
This means that if c == 0 in the above code, you must return 0.
That in turn means that you can have objects A and B, which are not equal, but their comparison returns 0. What does Mr. Bloch have to say about this?
It is strongly recommended, but not strictly required, that (x.compareTo(y) == 0) == (x.equals(y)). Generally speaking, any class that implements the Comparable interface and violates this condition should clearly indicate this fact. The recommended language is “Note: This class has a natural ordering that is inconsistent with equals.”
And
A class whose compareTo method imposes an order that is inconsistent with equals will still work, but sorted collections containing elements of the class may not obey the general contract of the appropriate collection interfaces (Collection, Set, or Map). This is because the general contracts for these interfaces are defined in terms of the equals method, but sorted collections use the equality test imposed by compareTo in place of equals. It is not a catastrophe if this happens, but it's something to be aware of.
Update: So IMHO, with your current class, you cannot make compareTo consistent with equals. If you really need to have this, the only way I see would be to introduce a new member which gives a strict natural ordering to your class. Then, in case all the meaningful fields of the two objects compare to 0, you could still decide the order of the two based on their special order values.
This extra member may be an instance counter, or a creation timestamp. Or, you could try using a UUID.
In Java or C#, generally speaking, there is no fixed ordering of objects. Instances can be moved around by the garbage collector while executing your compareTo, or the sort operation that's using your compareTo.
As you stated, hash codes are generally not unique, so they're not usable (two different instances with the same hash code bring you back to the original question). And the Java Object.toString implementation which many people believe to surface an object id (MyObject#33c0d9d), is nothing more than the object's class name followed by the hash code. As far as I know, neither the JVM nor the CLR have a notion of an instance id.
If you really want a consistent ordering of your classes, you could try using an incrementing number for each new instance you create. Mind you, incrementing this counter must be thread safe, so it's going to be relatively expensive (in C# you could use Interlocked.Increment).
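A sketch of that approach (MyClass and Foo are hypothetical names): each instance takes a unique, thread-safely incremented id at construction, and CompareTo falls back to it only when everything else compares equal.

using System;
using System.Threading;

public class MyClass : IComparable<MyClass>
{
    private static long _nextId;
    private readonly long _id = Interlocked.Increment(ref _nextId); // unique per instance, assigned in creation order

    public int Foo { get; set; } // stand-in for the class's real state

    public int CompareTo(MyClass other)
    {
        if (other == null) return 1;
        if (ReferenceEquals(this, other)) return 0;

        int c = Foo.CompareTo(other.Foo);
        if (c != 0) return c;

        // All meaningful fields are equal; fall back to creation order so that
        // distinct instances never compare as 0 and a.CompareTo(b) == -b.CompareTo(a).
        return _id.CompareTo(other._id);
    }
}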
Two objects don't need to be reference equal to be in the same equivalence class. In my opinion, it should be perfectly acceptable for two different objects to be the same for a comparison, but not reference equal. It seems perfectly natural to me, for example, that if you hydrated two different objects from the same row in the database, that they would be the same for comparison purposes, but not reference equal.
I'd actually be more inclined to modify the behavior of equals to reflect how they are compared rather than the other way around. For most purposes that I can think of this would be more natural.
The generic equivalent is easier to deal with in my opinion; it depends on what your external requirements are. This is an IComparable<MyClass> example:
public int CompareTo(MyClass other) {
    if (other == null) return 1;
    if (this == other) {
        return 0;
    } else {
        return foo.CompareTo(other.foo);
    }
}
If the classes are equal or if foo is equal, that's the end of the comparison, unless there's something secondary to sort on; in that case, add it as the return value when foo.CompareTo(other.foo) == 0.
If your classes have an ID or something, then compare on that as a secondary key; otherwise don't worry about it. The collection they're stored in, and the order in which you arrived at these classes to compare, is what's going to determine the final order in the case of equal objects or equal object.foo values.

Categories

Resources