When would == be overridden in a different way to .Equals? (C#)

I understand the difference between == and .Equals. There are plenty of other questions on here that explain the difference in detail, e.g. this one: What is the difference between .Equals and ==, and this one: Bitwise equality, amongst many others.
My question is: why have them both (I realise there must be a very good reason) - they both appear to do the same thing (unless overridden differently).
When would == be overloaded in a different way to how .Equals is overridden?

== is bound statically, at compile-time, because operators are always static. You overload operators - you can't override them. Equals(object) is executed polymorphically, because it's overridden.
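A quick sketch of that difference (using string only because it conveniently overrides Equals; the program itself is just for illustration):

using System;

class DispatchDemo
{
    static void Main()
    {
        object a = "hello";
        object b = new string("hello".ToCharArray()); // distinct instance, same characters

        Console.WriteLine(a == b);      // False: == is bound at compile time from the static
                                        // type (object), so this is reference equality
        Console.WriteLine(a.Equals(b)); // True: Equals is virtual, so string's override runs
    }
}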
In terms of when you'd want them to be different...
Often reference types will override Equals but not overload == at all. It can be useful to easily tell the difference between "these two references refer to the same object" and "these two references refer to equal objects". (You can use ReferenceEquals if necessary, of course - and as Eric points out in comments, that's clearer.) You want to be really clear about when you do that, mind you.
double has this behavior for NaN values; ==(double, double) will always return false when either operand is NaN, even if they're the same NaN. Equals can't do that without invalidating its contract. (Admittedly GetHashCode is broken for different NaN values, but that's a different matter...)
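For example (plain double behaviour; nothing is overridden or customized here):

// inside any method:
double x = double.NaN;
Console.WriteLine(x == x);      // False: IEEE 754 comparison rules for NaN
Console.WriteLine(x != x);      // True
Console.WriteLine(x.Equals(x)); // True: Equals keeps its reflexivity contract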
I can't remember ever implementing them to give different results, personally.

My question is: why have them both (I realise there must be a very good reason)
If there's a good reason it has yet to be explained to me. Equality comparisons in C# are a godawful mess, and were #9 on my list of things I regret about the design of C#:
http://www.informit.com/articles/article.aspx?p=2425867
Mathematically, equality is the simplest equivalence relation and it should obey the rules: x == x should always be true, x == y should always be the same as y == x, x == y and x != y should always be opposite valued, if x == y and y == z are true then x == z must be true. C#'s == and Equals mechanisms guarantee none of these properties! (Though, thankfully, ReferenceEquals guarantees all of them.)
As Jon notes in his answer, == is dispatched based on the compile-time types of both operands, and .Equals(object) and .Equals(T) from IEquatable<T> are dispatched based on the runtime type of the left operand. Why are either of those dispatch mechanisms correct? Equality is not a predicate that favours its left hand side, so why should some but not all of the implementations do so?
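One small way that left-hand bias shows up in everyday code (a throwaway illustration, not from the article):

// inside any method:
string s = null;
Console.WriteLine(s == "x");        // False: the operator copes with a null operand
Console.WriteLine("x".Equals(s));   // False: null on the right is fine
Console.WriteLine(s.Equals("x"));   // NullReferenceException: virtual dispatch needs a
                                    // non-null left operand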
Really what we want for user-defined equality is a multimethod, where the runtime types of both operands have equal weight, but that's not a concept that exists in C#.
Worse, it is incredibly common that Equals and == are given different semantics -- usually that one is reference equality and the other is value equality. There is no reason by which the naive developer would know which was which, or that they were different. This is a considerable source of bugs. And it only gets worse when you realize that GetHashCode and Equals must agree, but == need not.
Were I designing a new language from scratch, and I for some crazy reason wanted operator overloading -- which I don't -- then I would design a system that would be much, much more straightforward. Something like: if you implement IComparable<T> on a type then you automatically get <, <=, ==, !=, and so on, operators defined for you, and they are implemented so that they are consistent. That is x<=y must have the semantics of x<y || x==y and also the semantics of !(x>y), and that x == y is always the same as y == x, and so on.
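One way to approximate that discipline in today's C# is to route every operator through a single CompareTo, so the relations stay consistent by construction. A minimal sketch (the Money type is made up for illustration):

using System;

public readonly struct Money : IComparable<Money>, IEquatable<Money>
{
    public decimal Amount { get; }
    public Money(decimal amount) { Amount = amount; }

    public int CompareTo(Money other) => Amount.CompareTo(other.Amount);
    public bool Equals(Money other) => CompareTo(other) == 0;
    public override bool Equals(object obj) => obj is Money m && Equals(m);
    public override int GetHashCode() => Amount.GetHashCode();

    // Every operator is defined in terms of CompareTo, so x <= y is exactly
    // x < y || x == y, and x == y is always the same as y == x.
    public static bool operator ==(Money x, Money y) => x.CompareTo(y) == 0;
    public static bool operator !=(Money x, Money y) => !(x == y);
    public static bool operator <(Money x, Money y) => x.CompareTo(y) < 0;
    public static bool operator >(Money x, Money y) => x.CompareTo(y) > 0;
    public static bool operator <=(Money x, Money y) => !(x > y);
    public static bool operator >=(Money x, Money y) => !(x < y);
}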
Now, if your question really is:
How on earth did we get into this godawful mess?
Then I wrote down some thoughts on that back in 2009:
https://blogs.msdn.microsoft.com/ericlippert/2009/04/09/double-your-dispatch-double-your-fun/
The TLDR is: framework designers and language designers have different goals and different constraints, and they sometimes fail to take each other's goals and constraints into account, so the result is not a consistent, logical experience across the platform. It's a failure of the design process.
When would == be overloaded in a different way to how .Equals is overridden?
I would never do so unless I had a very unusual, very good reason. When I implement arithmetic types I always implement all of the operators to be consistent with each other.

One case that can come up is when you have a previous codebase that depends on reference equality via ==, but you decide you want to add value equality checking. One way to do this is to implement IEquatable<T>, which is great, but now what about all that existing code that was assuming only references were equal? Should the inherited Object.Equals be different from how IEquatable<T>.Equals works? This doesn't have an easy answer, as ideally you want all of those functions/operators to act in a consistent way.
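A minimal sketch of that situation (the type and members here are hypothetical): value equality is added via IEquatable<T>, while == and the inherited Object.Equals are left alone so existing reference-comparison code keeps its old behaviour.

using System;

public sealed class ZoneRecord : IEquatable<ZoneRecord>
{
    public string Id { get; }
    public TimeSpan BaseOffset { get; }

    public ZoneRecord(string id, TimeSpan baseOffset)
    {
        Id = id;
        BaseOffset = baseOffset;
    }

    // New, opt-in value equality.
    public bool Equals(ZoneRecord other) =>
        other != null && Id == other.Id && BaseOffset == other.BaseOffset;

    // Object.Equals and operator == are deliberately not touched, so
    // a == b and a.Equals((object)b) still mean reference equality.
    // Caveat: EqualityComparer<ZoneRecord>.Default will now prefer the
    // IEquatable implementation while GetHashCode still reflects identity,
    // which is exactly the kind of inconsistency discussed above.
}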
For a concrete case in the BCL where this happened, look at TimeZoneInfo. In that particular case, == and Object.Equals were kept the same, but it's not clear-cut that this was the best choice.
As an aside, one way you can mitigate the above problem is to make the class immutable. In that case, code is less likely to be broken by having previously relied on reference equality, since you can't mutate the instance via a reference and invalidate an equality that was previously checked.

Generally, you want them to do the same thing, particularly if your code is going to be used by anyone other than yourself and the person next to you. Ideally, for anyone who uses your code, you want to adhere to the principle of least surprise, which having randomly different behaviours violates. Having said this:
Overloading equality is generally a bad idea, unless a type is immutable, and sealed. If you're at the stage where you have to ask questions about it, then the odds of getting it right in any other case are slim. There are lots of reasons for this:
A. Equals and GetHashCode play together to enable dictionaries and hash sets to work - if you have an inconsistent implementation (or if the hash code changes over time) then one of the following can occur:
Dictionary/set lookups degrade to effectively linear time.
Items get lost in dictionaries/sets.
B. What were you really trying to do? Generally, the identity of an object in an object-oriented language IS its reference. So having two equal objects with different references is just a waste of memory. There was probably no need to create a duplicate in the first place.
C. What you often find when you start implementing equality for objects is that you're looking for a definition of equality that is "for a particular purpose". This makes it a really bad idea to burn your one-and-only Equals for this - much better to define different EqualityComparers for those uses (see the sketch after these points).
D. As others have pointed out, you overload operators but override methods. This means that unless the operators call the methods, horribly amusing and inconsistent results occur when someone tries to use == and finds the wrong (unexpected) method gets called at the wrong level of the hierarchy.
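Here is the kind of purpose-specific comparer point C refers to; the Customer type and the "same email means same customer" rule are made up for the example.

using System;
using System.Collections.Generic;

public sealed class Customer
{
    public int Id { get; set; }
    public string Email { get; set; }
}

// Equality "for a particular purpose", kept out of Customer.Equals entirely.
public sealed class CustomerByEmailComparer : IEqualityComparer<Customer>
{
    public bool Equals(Customer x, Customer y) =>
        ReferenceEquals(x, y) ||
        (x != null && y != null &&
         string.Equals(x.Email, y.Email, StringComparison.OrdinalIgnoreCase));

    public int GetHashCode(Customer obj) =>
        obj?.Email == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Email);
}

// Usage: var byEmail = new HashSet<Customer>(new CustomerByEmailComparer());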

Related

C# 7.2 use of "in parameter" for operators

In C# 7.2, we saw the introduction of the in modifier for method parameters to pass read-only references to objects. I'm working on a new .NET Standard project using 7.2, and out of curiosity I tried compiling with the in keyword on the parameters for the equality operators for a struct.
i.e. - public static bool operator == (in Point l, in Point r)
not - public static bool operator == (Point l, Point r)
I was initially a bit surprised that this worked, but after thinking about it a bit I realized that there is probably no functional difference between the two versions of the operator. I wanted to confirm these suspicions, but after a somewhat thorough search around, I can't find anything that explicitly talks about using the in keyword in operator overloads.
So my question is whether or not this actually has a functional difference, and if it does, is there any particular reason to encourage or discourage the use of in with operator arguments. My initial thoughts are that there is no difference, particularly if the operator is inlined. However, if it does make a difference, it seems like in parameters should be used everywhere (everywhere that readonly references make sense, that is), as they provide a speed bonus, and, unlike ref and out, don't require the user to prepend those keywords when passing arguments. This would allow more efficient value-type passing without a single change by the users of the methods and operators.
Overall, this may go beyond the sort of small-scale optimizations that most C# developers worry about, but I am curious as to whether or not it has an effect.
whether or not this actually has a functional difference... My initial thoughts are that there is no difference, particularly if the operator is inlined
Since the operator == overload is invoked like a regular static method in MSIL, it can make a functional difference: it can help avoid unnecessary copying, just as in a regular method.
is there any particular reason to encourage or discourage the use of in with operator arguments.
According to this article, it is recommended to apply the in modifier when value types are larger than System.IntPtr.Size. But it is important that the value type be a readonly struct. Otherwise the in modifier can harm performance, because the compiler will create a defensive copy when calling the struct's methods and properties, since they could change the state of the argument.
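A sketch of that guidance (the Point3D type is invented for the example): a readonly struct larger than IntPtr.Size, with in parameters on the operators so the arguments are passed by read-only reference instead of being copied.

using System;

public readonly struct Point3D : IEquatable<Point3D>
{
    public double X { get; }
    public double Y { get; }
    public double Z { get; }

    public Point3D(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public bool Equals(Point3D other) => X == other.X && Y == other.Y && Z == other.Z;
    public override bool Equals(object obj) => obj is Point3D p && Equals(p);
    public override int GetHashCode() => (X, Y, Z).GetHashCode();

    // Because the struct is readonly, the compiler needs no defensive copies
    // when members are invoked on the in parameters.
    public static bool operator ==(in Point3D l, in Point3D r) => l.Equals(r);
    public static bool operator !=(in Point3D l, in Point3D r) => !(l == r);
}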

Why are casting and conversion operations syntactically indistinguishable?

Stack Overflow has several questions about casting boxed values: 1, 2.
The solution requires first unboxing the value and only then casting it to another type. Nevertheless, a boxed value "knows" its own type, and I see no reason why a conversion operator could not be called.
Moreover, the same issue is valid for reference types:
void Main()
{
    object obj = new A();
    B b = (B)obj;
}

public class A
{
}

public class B {}
This code throws InvalidCastException. So it's not a matter of value vs. reference types; it's how the compiler behaves.
For the code above it emits castclass B, and for the code
void Main()
{
    A obj = new A();
    B b = (B)obj;
}

public class A
{
    public static explicit operator B(A obj)
    {
        return new B();
    }
}

public class B
{
}
it emits call A.op_Explicit.
Aha! Here the compiler sees that A has an operator and calls it. But what then happens if B inherits from A? Not so fast, the compiler is quite clever... it just says:
'A.explicit operator B(A)': user-defined conversions to or from a derived class are not allowed
Ha! No ambiguity!
But why on Earth did they allow two rather different operations to look the same?! What was the reason?
Your observation is, as far as I can tell, the observation that I made here:
http://ericlippert.com/2009/03/03/representation-and-identity/
There are two basic usages of the cast operator in C#:
(1) My code has an expression of type B, but I happen to have more information than the compiler does. I claim to know for certain that at runtime, this object of type B will actually always be of derived type D. I will inform the compiler of this claim by inserting a cast to D on the expression. Since the compiler probably cannot verify my claim, the compiler might ensure its veracity by inserting a run-time check at the point where I make the claim. If my claim turns out to be inaccurate, the CLR will throw an exception.
(2) I have an expression of some type T which I know for certain is not of type U. However, I have a well-known way of associating some or all values of T with an “equivalent” value of U. I will instruct the compiler to generate code that implements this operation by inserting a cast to U. (And if at runtime there turns out to be no equivalent value of U for the particular T I’ve got, again we throw an exception.)
The attentive reader will have noticed that these are opposites. A neat trick, to have an operator which means two contradictory things, don’t you think?
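To make the two meanings concrete, here is a small side-by-side example (Animal and Giraffe are placeholder types, not from the article):

using System;

class Animal { }
class Giraffe : Animal { }

class CastDemo
{
    static void Main()
    {
        // Meaning (1): "trust me, this Animal is really a Giraffe" - no conversion
        // happens, only a runtime type check (castclass); it throws if the claim is wrong.
        Animal a = new Giraffe();
        Giraffe g = (Giraffe)a;

        // Meaning (2): a representation-changing conversion - a genuinely different
        // value is produced (truncation here), not a type check.
        double d = 3.7;
        int i = (int)d;   // i == 3

        Console.WriteLine(g != null);
        Console.WriteLine(i);
    }
}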
So apparently you are one of the "attentive readers" I called out who have noticed that we have one operation that logically means two rather different things. This is a good observation!
Your question is "why is that the case?" This is not a good question! :-)
As I have noted many times on this site, I cannot answer "why" questions satisfactorily. "Because that's what the specification says" is a correct answer but unsatisfactory. Really what the questioner is usually looking for is a summary of the language design process.
When the C# language design team designs features, the debates can go on for literally months; they can involve a dozen people discussing many different proposals, each with their own pros and cons, and generate hundreds of pages of notes. Even if I had the relevant information from the late 1990s meetings about cast operations, which I don't, it seems hard to summarize it concisely in a manner that would be satisfactory to the original questioner.
Moreover, in order to satisfactorily answer this question one would of course have to discuss the entire historical perspective. C# was designed to be immediately productive for existing C, C++ and Java programmers, and so it borrows many of the conventions of these languages, including its basic mechanisms for conversion operators. In order to properly answer the question we would have to discuss the history of the cast operator in C, C++ and Java as well. This seems like far too much information to expect in an answer on StackOverflow.
Frankly, the most likely explanation is that this decision was not the result of long debate between the merits of different positions. Rather, it's likely the language design team considered how it is done in C, C++ and Java, made a reasonable compromise position that didn't look too terrible, and moved on to other more interesting business. A proper answer would therefore be almost entirely historical; why did Ritchie design the cast operator like he did for C? I don't know, and we can't ask him.
My advice to you is that you stop asking "why?" questions about the history of programming language design and instead ask specific technical questions about specific code, questions that have a short answers.
Conversion operators are essentially "glorified method calls", so the compiler (as opposed to the runtime) already needs to know that you want to use the conversion operator and not a typecast. Basically the compiler needs to check whether a conversion exists to be able to generate the appropriate bytecode for it.
Your first code sample essentially looks like "convert from object to B", as the compiler has no idea that the variable can only contain an A. According to the rules, that means the compiler must emit a typecast operation.
Your second code sample is obvious to the compiler, because "convert from A to B" can be done with the conversion operator.

Implement GetHashCode on a class that has wildcard Equatability

Suppose I want to be able to compare 2 lists of ints and treat one particular value as a wild card.
e.g.
If -1 is a wild card, then
{1,2,3,4} == {1,2,-1,4} //returns true
And I'm writing a class to wrap all this logic, so it implements IEquatable and has the relevant logic in public override bool Equals()
But I have always thought that you more-or-less had to implement GetHashCode if you were overriding .Equals(). Granted it's not enforced by the compiler, but I've been under the impression that if you don't then you're doing it wrong.
Except I don't see how I can implement .GetHashCode() without either breaking its contract (objects that are Equal have different hashes), or just having the implementation be return 1.
Thoughts?
This implementation of Equals is already invalid, as it is not transitive. You should probably leave Equals with the default implementation, and write a new method like WildcardEquals (as suggested in the other answers here).
In general, whenever you have changed Equals, you must implement GetHashCode if you want to be able to store the objects in a hashtable (e.g. a Dictionary<TKey, TValue>) and have it work correctly. If you know for certain that the objects will never end up in a hashtable, then it is in theory optional (but it would be safer and clearer in that case to override it to throw a "NotSupportedException" or always return 0).
The general contract is to always implement GetHashCode if you override Equals, as you can't always be sure in advance that later users won't put your objects in hashtables.
In this case, I would create a new or extension method, WildcardEquals(other), instead of using the operators.
I wouldn't recommend hiding this kind of complexity.
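A sketch of what such a WildcardEquals might look like as an extension method (the name, signature and -1 default are just illustrative):

using System.Collections.Generic;

public static class WildcardExtensions
{
    // Element-by-element comparison, where the wildcard value matches anything.
    public static bool WildcardEquals(this IReadOnlyList<int> left,
                                      IReadOnlyList<int> right,
                                      int wildcard = -1)
    {
        if (left == null || right == null) return ReferenceEquals(left, right);
        if (left.Count != right.Count) return false;

        for (int i = 0; i < left.Count; i++)
        {
            if (left[i] != right[i] && left[i] != wildcard && right[i] != wildcard)
                return false;
        }
        return true;
    }
}

// new[] { 1, 2, 3, 4 }.WildcardEquals(new[] { 1, 2, -1, 4 }) -> true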
From a logical point of view, we break the concept of equality. It is not transitive any longer. So in the case of wildcards, A==B and B==C does not mean that A==C.
From a technical point of view, returning the same value from GetHashCode() is not something unforgivable.
The only possible idea I see is to exploit at least the length, e.g.:
public override int GetHashCode()
{
    return this.Length.GetHashCode();
}
It's recommended, but not mandatory at all. If you don't need that custom implementation of GetHashCode, just don't do it.
GetHashCode is generally only important if you're going to be storing elements of your class in some kind of collection, such as a set. If that's the case here then I don't think you're going to be able to achieve consistent semantics since, as @AlexD points out, equality is no longer transitive.
For example, (using string globs rather than integer lists) if you add the strings "A", "B", and "*" to a set, your set will end up with either one or two elements depending on the order you add them in.
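A rough demonstration of that order-dependence, using a hypothetical comparer in which "*" matches any string (so its hash code has to be constant):

using System;
using System.Collections.Generic;

sealed class GlobComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => x == "*" || y == "*" || x == y;
    public int GetHashCode(string s) => 0; // constant, so potentially equal items collide
}

static class GlobDemo
{
    static void Main()
    {
        var set1 = new HashSet<string>(new GlobComparer()) { "A", "B", "*" };
        var set2 = new HashSet<string>(new GlobComparer()) { "*", "A", "B" };
        Console.WriteLine(set1.Count); // 2: "*" was swallowed by "A"
        Console.WriteLine(set2.Count); // 1: "A" and "B" were swallowed by "*"
    }
}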
If that's not what you want then I'd recommend putting the wildcard matching into a new method (e.g. EquivalentTo()) rather than overloading equality.
Having GetHashCode() always return a constant is the only 'legal' way of fulfilling the equals/hashcode constraint.
It'll potentially be inefficient if you put it in a hashmap, or similar, but that might be fine (non-equal hashcodes imply non-equality, but equal hashcodes imply nothing).
I think this is the only possible valid option there. Hashcodes essentially exist as keys to look things up by quickly, and since your wildcard must match every item, its key for lookup must equal every item's key, so they must all be the same.
As others have noted though, this isn't what equals is normally for, and breaks assumptions that many other things may use for equals (such as transitivity - EDIT: turns out this is actually a contractual requirement, so no-go), so it's definitely worth at least considering comparing these manually, or with an explicitly separate equality comparer.
Since you've changed what "equals" means (adding in wildcards changes things dramatically) then you're already outside the scope of the normal use of Equals and GetHashCode. It's just a recommendation and in your case it seems like it doesn't fit. So don't worry about it.
That said, make sure you're not using your class in places that might use GetHashCode. That can get you in a load of trouble and be hard to debug if you're not watching for it.
It is generally expected that Equals(Object) and IEquatable<T>.Equals(T) should implement equivalence relations, such that if X is observed to be equal to Y, and Y is observed to be equal to Z, and none of the items have been modified, X may be assumed to be equal to Z; additionally, if X is equal to Y and Y does not equal Z, then X may be assumed not to equal Z either. Wild-card and fuzzy comparison methods do not implement equivalence relations, and thus Equals should generally not be implemented with such semantics.
Many collections will kinda-sorta work with objects that implement Equals in a way that doesn't implement an equivalence relation, provided that any two objects that might compare equal to each other always return the same hash code. Doing this will often require that many things which compare unequal return the same hash code, though depending upon what types of wildcard are supported it may be possible to separate items to some degree.
For example, if the only wildcard which a particular string supports represents "arbitrary string of one or more digits", one could hash the string by converting all sequences of consecutive digits and/or string-of-digit wildcard characters into a single "string of digits" wildcard character. If # represents any digit, then the strings abc123, abc#, abc456, and abc#93#22#7 would all be hashed to the same value as abc#, but abc#b, abc123b, etc. could hash to a different value. Depending upon the distribution of strings, such distinctions may or may not yield better performance than returning a constant value.
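A sketch of that normalization idea, assuming '#' is the digit wildcard (the method name and shape are made up, not from any library):

using System.Text;

public static class WildcardHashing
{
    // Collapse every run of digits and/or '#' characters into a single '#',
    // then hash the normalized string, so every string that could compare
    // equal under the wildcard rule ends up with the same hash code.
    public static int WildcardHash(string s)
    {
        var sb = new StringBuilder(s.Length);
        bool inDigitRun = false;
        foreach (char c in s)
        {
            if (char.IsDigit(c) || c == '#')
            {
                if (!inDigitRun) sb.Append('#');
                inDigitRun = true;
            }
            else
            {
                sb.Append(c);
                inDigitRun = false;
            }
        }
        return sb.ToString().GetHashCode();
    }
}

// WildcardHash("abc123"), WildcardHash("abc#") and WildcardHash("abc#93#22#7") all
// produce the same value; WildcardHash("abc123b") and WildcardHash("abc#b") share a
// different one.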
Note that even if one implements GetHashCode in such a fashion that equal objects yield equal hashes, some collections may still behave oddly if the equality method doesn't implement an equivalence relation. For example, if a collection foo contains items with keys "abc1" and "abc2", attempts to access foo["abc#"] might arbitrarily return the first item or the second. Attempts to delete the key "abc#" may arbitrarily remove one or both items, or may fail after deleting one item (its expected post-condition wouldn't be met, since "abc#" would still be in the collection even after deletion).
Rather than trying to bend Equals into doing hash-code-compatible wildcard comparisons, an alternative approach is to have a dictionary which holds, for each possible wildcard string that would match at least one main-collection string, a list of the strings it might possibly match. Thus, if there are many strings which would match abc#, they could all have different hash codes; if a user enters "abc#" as a search request, the system would look up "abc#" in the wild-card dictionary and receive a list of all strings matching that pattern, which could then be looked up individually in the main dictionary.

When to specify constraint `T : IEquatable<T>` even though it is not strictly required?

In short, I am looking for guidance on which of the following two methods should be preferred (and why):
static IEnumerable<T> DistinctA<T>(this IEnumerable<T> xs)
{
    return new HashSet<T>(xs);
}

static IEnumerable<T> DistinctB<T>(this IEnumerable<T> xs) where T : IEquatable<T>
{
    return new HashSet<T>(xs);
}
Argument in favour of DistinctA: Obviously, the constraint on T is not required, because HashSet<T> does not require it, and because instances of any T are guaranteed to be convertible to System.Object, which provides the same functionality as IEquatable<T> (namely the two methods Equals and GetHashCode). (While the non-generic methods will cause boxing with value types, that's not what I'm concerned about here.)
Argument in favour of DistinctB: The generic parameter constraint, while not strictly necessary, makes visible to callers that the method will compare instances of T, and is therefore a signal that Equals and GetHashCode should work correctly for T. (After all, defining a new type without explicitly implementing Equals and GetHashCode happens very easily, so the constraint might help catch some errors early.)
Question: Is there an established and documented best practice that recommends when to specify this particular constraint (T : IEquatable<T>), and when not to? And if not, is one of the above arguments flawed in any way? (In that case, I'd prefer well-thought-out arguments, not just personal opinions.)
While my comment at Pieter's answer is fully true, I've rethought the exact case of Distinct that you refer to.
This is a LINQ method contract, not just a method.
LINQ is meant to be a common facade implemented by various providers. Linq2Objects may require an IEquatable; Linq2Sql might require one too, but it might equally not require it, not use it at all, and completely ignore the IEquatable-ness, as the comparison is made by the DB SQL engine.
Therefore, at the layer of LINQ method definitions, it does not make sense to specify the requirement for IEquatable. It would limit and constrain future LINQ providers to things they really do not need to care about in their specific domains. Note that LINQ is actually very often all about domain-specificness, as very often the LINQ expressions and parameters are never actually run as code but are analyzed and retranslated into other constructs like SQL or XPath. Omission of constraints in this case is reasonable, because you cannot really know what your future-and-unknown-domain provider will really need to request from the caller. Linq-to-Mushrooms may need to use an IToxinEquality instead of IEquatable!
However, when you are designing interfaces/contracts that will clearly be used as runnable code (not just expression trees that will only work as configuration or declarations), then I see no valid reason for not providing the constraints.
Start by considering when it might matter which of the two mechanisms is used; I can think of only two situations:
When the code is being translated to another language (either a subsequent version of C#, or a related language like Java, or a completely dissimilar language such as Haskell). In this case the second definition is clearly better by providing the translator, whether automated or manual, with more information.
When a user unfamiliar with the code is reading it to learn how to invoke the method. Again, I believe the second is clearly better by providing more information readily to such a user.
I cannot think of any circumstance in which the first definition would be preferred, and where it actually matters beyond personal preference.
Other thoughts?

Why does x++ have higher precedence than ++x?

What's the point of the post-increment ++ operator having higher precedence than the pre-increment ++ operator? That is, is there a situation where x++ having the same level of precedence as ++x would cause an expression to return a wrong result?
Let's start with defining some terms, so that we're all talking about the same thing.
The primary operators are postfix "x++" and "x--", the member access operator "x.y", the call operator "f(x)", the array dereference operator "a[x]", and the new, typeof, default, checked, unchecked and delegate operators.
The unary operators are "+x", "-x", "~x", "!x", "++x", "--x" and the cast "(T)x".
The primary operators are by definition of higher precedence than the unary operators.
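For example, just to make that grouping concrete (a throwaway snippet, independent of the rest of the answer):

// inside any method:
int x = 5;
int a = -x++;   // parsed as -(x++): postfix ++ is a primary operator, so it binds
                // tighter than unary minus; a == -5 and x == 6 afterwards.
// "(-x)++" would not even compile: the operand of ++ must be a variable,
// property or indexer, not the value of an expression.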
Your question is
is there a situation where x++ having same level of precedence as ++x would cause an expression to return a wrong result?
It is not at all clear to me what you mean logically by "the wrong result". If we changed the rules of precedence in such a way that the value of an expression changed then the new result would be the right result. The right result is whatever the rules say the right result is. That's how we define "the right result" -- the right result is what you get when you correctly apply the rules.
We try to set up the rules so that they are useful and make it easy to express the meaning you intend to express. Is that what you mean by "the wrong result" ? That is, are you asking if there is a situation where one's intuition about what the right answer is would be incorrect?
I submit to you that if that is the case, then this is not a helpful angle to pursue because almost no one's intuition about the "correct" operation of the increment operators actually matches the current specification, much less some hypothetical counterfactual specification. In almost every C# book I have edited, the author has in some subtle or gross way mis-stated the meaning of the increment operators.
These are side-effecting operations with unusual semantics, and they come out of a language - C - with deliberately vague operational semantics. We have tried hard in the definition of C# to make the increment and decrement operators sensible and strictly defined, but it is impossible to come up with something that makes intuitive sense to everyone, since everyone has a different experience with the operators in C and C++.
Perhaps it would be helpful to approach the problem from a different angle. Your question presupposes a counterfactual world in which postfix and prefix ++ are specified to have the same precedence, and then asks for a criticism of that design choice in that counterfactual world. But there are many different ways that could happen. We could make them have the same precedence by putting both into the "primary" category. Or we could make them have the same precedence by putting them both into the "unary" category. Or we could invent a new level of precedence between primary and unary. Or below unary. Or above primary. We could also change the associativity of the operators, not just their precedence.
Perhaps you could clarify the question as to which counterfactual world you'd like to have criticized. Given any of those counterfactuals, I can give you a criticism of how that choice would lead to code that was unnecessarily verbose or confusing, but without a clear concept of the counterfactual design you're asking for criticism on, I worry that I'd spend a lot of time criticising something other than what you actually have in mind.
Make a specific proposed design change, and we'll see what its consequences are.
John, you have answered the question yourself: these two constructions are mostly used in function calls: ++x when you want to increase the value first and then call the function, and x++ when you want to call the function first and then make the increase. That can be very useful, depending on the context. Looking at return x++ vs return ++x, I see no room for error: the code means exactly what it reads :) The only problem is the programmer, who might use these two constructions without understanding operator precedence, and thus miss the meaning.
