One of the stated benefits of using IIncrementalGenerator over ISourceGenerator is that different stages of the pipeline can recognize that the results of the current iteration are the same as those of a previous iteration and reuse cached results.
In order for this to work, the type parameter of any IncrementalValueProvider or IncrementalValuesProvider would presumably need to be value-equatable (as opposed to the default reference-equatable), which is presumably why you see a lot of source generators implemented using record types.
This leads to the question: if the data compared by the IncrementalValueProvider equality check includes an ISymbol, and two iterations produce ISymbols representing the same semantic object, will they be considered equal for the purposes of the caching comparison across different iterations of the pipeline?
Would it be better to forgo referencing an ISymbol and just extract the data I need from it (name, namespace, members, etc.)?
I have dug into the Equals implementations of the Symbol classes, and it is unclear. The base class Symbol.Equals() does use reference comparison, but this appears to be overridden in most (possibly all) of the derived classes, and a spot check of these overrides suggests they are attempting value equality. There are a lot of them to check, though, and even if a reference-equality check does end up being used, it is also possible that the symbol references are cached between runs, so even a reference-equals check could return true.
Don't include symbols in the pipeline. Not only will they compare unequal, they also root compilations in memory, which can lead to high memory usage.
The following small snippet shows how symbols compare unequal, even when they result from the exact same syntax tree:

using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

var trees = new[] { SyntaxFactory.ParseSyntaxTree("public class C { }") };

// Two compilations built from the exact same syntax tree:
var comp1 = CSharpCompilation.Create(null, trees);
var s1 = comp1.GetTypeByMetadataName("C");
var comp2 = CSharpCompilation.Create(null, trees);
var s2 = comp2.GetTypeByMetadataName("C");

// The two symbols describe the same type, yet this prints False:
Console.WriteLine(s1.Equals(s2, SymbolEqualityComparer.Default));
But even if you provide your own equality comparer, rooting compilations is still an issue.
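A common alternative is to project the symbol into a fully value-equatable model as early as possible. A minimal sketch, with invented names (TypeModel, EquatableArray<T>): note that EquatableArray<T> is a typical hand-rolled wrapper, not a BCL type, because ImmutableArray<T> itself compares by reference.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;

var a = new TypeModel("C", "N", new EquatableArray<string>(ImmutableArray.Create("M1", "M2")));
var b = new TypeModel("C", "N", new EquatableArray<string>(ImmutableArray.Create("M1", "M2")));
Console.WriteLine(a.Equals(b)); // True: value equality, unlike two ISymbols

// Capture only the facts the generator needs; every member is value-equatable,
// so the pipeline's cache comparison works and no compilation is rooted.
readonly record struct TypeModel(string Name, string Namespace, EquatableArray<string> Members);

// Wrapper giving element-wise equality to an immutable array.
readonly struct EquatableArray<T> : IEquatable<EquatableArray<T>> where T : IEquatable<T>
{
    private readonly ImmutableArray<T> _array;
    public EquatableArray(ImmutableArray<T> array) => _array = array;

    public bool Equals(EquatableArray<T> other) => _array.SequenceEqual(other._array);
    public override bool Equals(object? obj) => obj is EquatableArray<T> o && Equals(o);

    public override int GetHashCode()
    {
        var hash = new HashCode();
        foreach (var item in _array) hash.Add(item);
        return hash.ToHashCode();
    }
}
```

Two such models built from different compilations of the same source compare equal, so downstream stages can reuse cached output.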
I'm trying to exclude entities from being added to the database if they already exist there, so I decided newBillInstances.Except(dbContext.BillInstances) would be the best approach. However, it doesn't work at all (no entities are excluded), though for a List<string> it works perfectly.
I read this discussion and the actual description of .Except() on MSDN. It states that the class used with .Except() should implement IEquatable<T> in order to use the default comparer.
Actually, the MSDN article doesn't fully describe the process of comparing two instances. I still don't get why both Equals() and GetHashCode() have to be overridden.
I implemented the IEqualityComparer<BillInstance> interface and put breakpoints in both methods, but when calling .Except(IEnumerable) the comparer isn't used. Only when I changed to .Except(IEnumerable, new BillInstanceComparer()) did I hit a breakpoint in GetHashCode(), but no breakpoints were hit in Equals().
Then I implemented IEqualityComparer<BillInstance> directly on the BillInstance class and expected it to be used by .Except(IEnumerable), but breakpoints weren't hit in either method.
So I've got two questions:
What should be done to use .Except(IEnumerable)?
Why isn't Equals() used at all? Is it used only when the hash codes of two instances are the same?
Because Equals() is used only if two objects have the same GetHashCode(). If no two objects share a GetHashCode() value, there is no occasion to use Equals() at all.
Internally the Except() uses a Set<> (you can see it here), that is an internal class that you should consider to be equivalent to HashSet<>. This class uses the hash of the object to "index" them, then uses the Equals() to check if two objects that have the same hash are the same or different-but-with-the-same-hash.
Link to other relevant answer: https://stackoverflow.com/a/371348/613130
Somewhere in the code a set or a map/dictionary is hidden.
These typically contain a number of buckets that grows with the number of elements stored in the set. An element is assigned to a bucket based on its hash code, and the actual identity comparison within a bucket is done using Equals.
So the hash code is used to find the correct bucket (which is why GetHashCode is needed), whereupon Equals is used to compare the element to the other elements in that bucket.
That's why you need to implement both.
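A minimal sketch of what that means for the question's scenario. The BillInstance shape here is invented; the point is that overriding both methods as a pair is what makes the parameterless Except work:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var fresh = new List<BillInstance>
{
    new BillInstance(1, "2024-01"),
    new BillInstance(2, "2024-01"),
};
var existing = new List<BillInstance> { new BillInstance(1, "2024-01") };

// With both overrides in place, Except's default comparer treats the two
// AccountId = 1 instances as equal even though they are different references.
var toAdd = fresh.Except(existing).ToList();
Console.WriteLine(toAdd.Count); // 1

class BillInstance
{
    public int AccountId { get; }
    public string Period { get; }
    public BillInstance(int accountId, string period) { AccountId = accountId; Period = period; }

    // GetHashCode picks the bucket; Equals is only consulted for candidates
    // whose hash codes collide. Both must agree, or Except silently fails.
    public override bool Equals(object? obj) =>
        obj is BillInstance b && AccountId == b.AccountId && Period == b.Period;
    public override int GetHashCode() => HashCode.Combine(AccountId, Period);
}
```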
OK, from the Enumerable source (thanks to m0sa) I've understood the internals of calling Except(IEnumerable):
enumerable1.Except(enumerable2) calls ExceptIterator(enumerable1, enumerable2, null), where null stands in for an instance of IEqualityComparer<TElement>.
ExceptIterator() creates an instance of the internal class Set, passing null as the comparer.
Since comparer is null the property EqualityComparer<TElement>.Default is used.
The Default property creates a comparer for TElement, unless one has already been created, by calling CreateComparer(). Two points were particularly interesting for me:
If TElement implements the IEquatable<TElement> interface, then, as far as I understood, a generic comparer for IEquatable<TElement> is created. I believe it then uses IEquatable<TElement>.Equals() together with Object.GetHashCode().
For the general case (not byte, not implementing IEquatable<TElement>, not Nullable, not an enum) an ObjectEqualityComparer instance is returned. ObjectEqualityComparer.GetHashCode() and ObjectEqualityComparer.Equals() simply call the corresponding methods on TElement.
So this gave me the understanding that in my case (each instance of BillInstance is essentially immutable) it should be sufficient to override Object.GetHashCode() and Object.Equals().
Okay, I'm reading up on all the advice on how to override object.Equals and == for value and reference types. In short: always override equality for structs, and don't override equality for reference types unless you have some unusual circumstance, like a class that wraps a single string. (But don't make a struct unless it is small, even if in semantic or DDD terms it is a value type.)
But most of my types that hold data are DTOs: classes with lots of properties. They have more properties than is suitable for a struct (more than 16 bytes) and will be consumed by developers who expect == and object.Equals to behave as usual. All three scenarios come up: needing to check equality by reference, by value (especially in unit testing), and by key (especially when working with data that came from or is going to a relational database).
Is there a .NET framework way to implement equality-by-value or equality-by-key without stomping the default behavior of object.Equals? Or must I create my own ad hoc interface, like ISameByValue<T>, ISameByKey<T>?
Create IEqualityComparer<T> types. This allows you to create any number of different types capable of comparing your objects by any number of different definitions of equality, all without changing any behavior on the type itself.
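A short sketch of that approach, with an invented Customer DTO: the type's default reference equality stays untouched, while a comparer supplies equality-by-key only where it's asked for:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var a = new Customer { Id = 1, Name = "old name" };
var b = new Customer { Id = 1, Name = "new name" };

Console.WriteLine(a.Equals(b));                             // False: default reference equality intact
Console.WriteLine(new CustomerByIdComparer().Equals(a, b)); // True: equality by key
Console.WriteLine(new[] { a, b }.Distinct(new CustomerByIdComparer()).Count()); // 1

class Customer
{
    public int Id { get; init; }
    public string Name { get; init; } = "";
}

// Equality-by-key defined outside the type itself.
class CustomerByIdComparer : IEqualityComparer<Customer>
{
    public bool Equals(Customer? x, Customer? y) =>
        ReferenceEquals(x, y) || (x is not null && y is not null && x.Id == y.Id);
    public int GetHashCode(Customer c) => c.Id;
}
```

An equality-by-value comparer for unit tests would be a second, separate IEqualityComparer<Customer> type alongside this one.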
I had a problem where I had implemented property based comparisons in the overridden Equals method (to implement HasChanges type functionality), but it caused all sorts of problems when I updated property values of items in a collection.
My solution (found by helpful users of this website) was to move the property-based comparisons into a new, custom method and to return the default object.Equals value instead. However, this meant that there were no longer any property-based comparisons when calling the Equals method.
The solution was then to provide custom implementations of the IEqualityComparer<T> Interface and to pass the instances through to any methods that require object comparisons, like the IEnumerable Intersect or Except methods for example:
if (digitalServiceProvider.PriceTiers[index].Territories.Count > 0 &&
digitalServiceProvider.PriceTiers[index].Territories.Intersect(
release.TerritorialRights, new CountryEqualityComparer()).Count() == 0) { ... }
Suppose I want to be able to compare 2 lists of ints and treat one particular value as a wild card.
e.g.
If -1 is a wild card, then
{1,2,3,4} == {1,2,-1,4} //returns true
And I'm writing a class to wrap all this logic, so it implements IEquatable and has the relevant logic in public override bool Equals()
But I have always thought that you more-or-less had to implement GetHashCode if you were overriding .Equals(). Granted it's not enforced by the compiler, but I've been under the impression that if you don't then you're doing it wrong.
Except I don't see how I can implement .GetHashCode() without either breaking its contract (objects that are Equal have different hashes), or just having the implementation be return 1.
Thoughts?
This implementation of Equals is already invalid, as it is not transitive. You should probably leave Equals with the default implementation, and write a new method like WildcardEquals (as suggested in the other answers here).
In general, whenever you have changed Equals, you must implement GetHashCode if you want to be able to store the objects in a hashtable (e.g. a Dictionary<TKey, TValue>) and have it work correctly. If you know for certain that the objects will never end up in a hashtable, then it is in theory optional (but it would be safer and clearer in that case to override it to throw a "NotSupportedException" or always return 0).
The general contract is to always implement GetHashCode if you override Equals, as you can't always be sure in advance that later users won't put your objects in hashtables.
In this case, I would create a new or extension method, WildcardEquals(other), instead of using the operators.
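A sketch of such an extension method, assuming -1 is the wild card as in the question (the names here are invented):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

Console.WriteLine(new[] { 1, 2, 3, 4 }.WildcardEquals(new[] { 1, 2, -1, 4 })); // True
Console.WriteLine(new[] { 1, 2, 3, 4 }.WildcardEquals(new[] { 1, 2, 5, 4 }));  // False

static class WildcardExtensions
{
    private const int Wildcard = -1;

    // Element-wise match where a wildcard on either side matches anything.
    // Deliberately a separate method, not an Equals override: the relation
    // is not transitive, so it must not masquerade as equality.
    public static bool WildcardEquals(this IReadOnlyList<int> left, IReadOnlyList<int> right) =>
        left.Count == right.Count &&
        left.Zip(right, (x, y) => x == Wildcard || y == Wildcard || x == y).All(match => match);
}
```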
I wouldn't recommend hiding this kind of complexity.
From a logical point of view, we break the concept of equality. It is not transitive any longer. So in case of wildcards, A==B and B==C does not mean that A==C.
From a technical point of view, returning the same value from GetHashCode() is not something unforgivable.
The only possible idea I see is to exploit at least the length, e.g.:
public override int GetHashCode()
{
    return this.Length.GetHashCode();
}
It's recommended, but not mandatory at all. If you don't need that custom implementation of GetHashCode, just don't do it.
GetHashCode is generally only important if you're going to be storing elements of your class in some kind of collection, such as a set. If that's the case here, then I don't think you're going to be able to achieve consistent semantics, since as @AlexD points out, equality is no longer transitive.
For example, (using string globs rather than integer lists) if you add the strings "A", "B", and "*" to a set, your set will end up with either one or two elements depending on the order you add them in.
If that's not what you want then I'd recommend putting the wildcard matching into a new method (e.g. EquivalentTo()) rather than overloading equality.
Having GetHashCode() always return a constant is the only 'legal' way of fulfilling the equals/hashcode constraint.
It'll potentially be inefficient if you put it in a hashmap, or similar, but that might be fine (non-equal hashcodes imply non-equality, but equal hashcodes imply nothing).
I think this is the only possible valid option there. Hashcodes essentially exist as keys to look things up by quickly, and since your wildcard must match every item, its key for lookup must equal every item's key, so they must all be the same.
As others have noted though, this isn't what equals is normally for, and breaks assumptions that many other things may use for equals (such as transitivity - EDIT: turns out this is actually contractual requirement, so no-go), so it's definitely worth at least considering comparing these manually, or with an explicitly separate equality comparer.
Since you've changed what "equals" means (adding in wildcards changes things dramatically) then you're already outside the scope of the normal use of Equals and GetHashCode. It's just a recommendation and in your case it seems like it doesn't fit. So don't worry about it.
That said, make sure you're not using your class in places that might use GetHashCode. That can get you in a load of trouble and be hard to debug if you're not watching for it.
It is generally expected that Equals(Object) and IEquatable<T>.Equals(T) should implement equivalence relations, such that if X is observed to be equal to Y, and Y is observed to be equal to Z, and none of the items have been modified, X may be assumed to be equal to Z; additionally, if X is equal to Y and Y does not equal Z, then X may be assumed not to equal Z either. Wild-card and fuzzy comparison methods do not implement equivalence relations, and thus Equals should generally not be implemented with such semantics.
Many collections will kinda-sorta work with objects that implement Equals in a way that doesn't implement an equivalence relation, provided that any two objects that might compare equal to each other always return the same hash code. Doing this will often require that many things that would compare unequal to return the same hash code, though depending upon what types of wildcard are supported it may be possible to separate items to some degree.
For example, if the only wildcard which a particular string supports represents "arbitrary string of one or more digits", one could hash the string by converting all sequences of consecutive digits and/or string-of-digit wildcard characters into a single "string of digits" wildcard character. If # represents any digit, then the strings abc123, abc#, abc456, and abc#93#22#7 would all be hashed to the same value as abc#, but abc#b, abc123b, etc. could hash to a different value. Depending upon the distribution of strings, such distinctions may or may not yield better performance than returning a constant value.
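The digit-wildcard hashing described above might be sketched as follows (the Canonicalize helper and the '#' convention follow the example; everything else is invented):

```csharp
using System;
using System.Text;

Console.WriteLine(Canonicalize("abc123"));      // abc#
Console.WriteLine(Canonicalize("abc#93#22#7")); // abc#
Console.WriteLine(Canonicalize("abc123b"));     // abc#b

// Collapse each run of digits and/or '#' wildcards into a single '#', so
// every string that could compare equal under the wildcard rule lands in
// the same bucket; GetHashCode would return Canonicalize(s).GetHashCode().
static string Canonicalize(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        bool wild = char.IsDigit(c) || c == '#';
        if (wild && sb.Length > 0 && sb[^1] == '#') continue; // extend the current run
        sb.Append(wild ? '#' : c);
    }
    return sb.ToString();
}
```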
Note that even if one implements GetHashCode in such a fashion that equal objects yield equal hashes, some collections may still behave oddly if the equality method doesn't implement an equivalence relation. For example, if a collection foo contains items with keys "abc1" and "abc2", attempts to access foo["abc#"] might arbitrarily return the first item or the second. Attempts to delete the key "abc#" may arbitrarily remove one or both items, or may fail after deleting one item (its expected post-condition wouldn't be met, since "abc#" would still be in the collection even after deletion).
Rather than contorting Equals in this fashion, an alternative approach is to keep a dictionary which maps each possible wildcard string that would match at least one main-collection string to a list of the strings it might possibly match. Thus, if there are many strings which would match abc#, they can all have different hash codes; if a user enters "abc#" as a search request, the system would look up "abc#" in the wildcard dictionary, receive a list of all strings matching that pattern, and then look each one up individually in the main dictionary.
In several programming languages there are Set collections which are supposed to be implementations of the mathematical concept of a finite set.
However, this is not necessarily true: for example, in C# and Java, both implementations of HashSet<T> allow you to add a HashSet<T> collection as a member of itself, which by the modern definition of a mathematical set is not allowed.
Background:
According to naive set theory, the definition of a set is:
A set is a collection of distinct objects.
However, this definition famously leads to Russell's Paradox, as well as other paradoxes. For convenience, Russell's Paradox is:
Let R be the set of all sets that are not members of themselves. If R is not a member of itself, then its definition dictates that it must contain itself, and if it contains itself, then it contradicts its own definition as the set of all sets that are not members of themselves.
So according to modern set theory (See: ZFC), the definition of a set is:
A set is a collection of distinct objects, none of which is the set itself.
Specifically, this is a result of the axiom of regularity.
Well so what? What are the implications of this? Why is this question on StackOverflow?
One of the implications of Russell's Paradox is that not all collections are sets. Additionally, this was the point where mathematicians dropped the definition of a set as being the usual English definition. So I believe this question has a lot of weight when it comes to programming language design in general.
Question(s):
So why would programming languages which, in some form, use these principles in their very design, ignore them in the implementations of Set in their standard libraries?
Secondly, is this a common occurrence with other implementations of mathematical concepts?
Perhaps I'm being a bit nitpicky, but if these are to be true implementations of sets, then why would part of the very definition be ignored?
Update
Addition of C# and Java code snippets exemplifying behavior:
Java Snippet:
Set<Object> hashSet = new HashSet<Object>();
hashSet.add(1);
hashSet.add("Tiger");
hashSet.add(hashSet);
hashSet.add('f');
Object[] array = hashSet.toArray();
HashSet<Object> hash = (HashSet<Object>)array[3];
System.out.println("HashSet in HashSet:");
for (Object obj : hash)
System.out.println(obj);
System.out.println("\nPrinciple HashSet:");
for (Object obj : hashSet)
System.out.println(obj);
Which prints out:
HashSet in HashSet:
f
1
Tiger
[f, 1, Tiger, (this Collection)]
Principle HashSet:
f
1
Tiger
[f, 1, Tiger, (this Collection)]
C# Snippet:
HashSet<object> hashSet = new HashSet<object>();
hashSet.Add(1);
hashSet.Add("Tiger");
hashSet.Add(hashSet);
hashSet.Add('f');
object[] array = hashSet.ToArray();
var hash = (HashSet<object>)array[2];
Console.WriteLine("HashSet in HashSet:");
foreach (object obj in hash)
Console.WriteLine(obj);
Console.WriteLine("\nPrinciple HashSet:");
foreach (object obj in hashSet)
Console.WriteLine(obj);
Which prints out:
HashSet in HashSet:
1
Tiger
System.Collections.Generic.HashSet`1[System.Object]
f
Principle HashSet:
1
Tiger
System.Collections.Generic.HashSet`1[System.Object]
f
Update 2
In regards to Martijn Courteaux's second point which was that it could be done in the name of computational efficiency:
I made two test collections in C#. They were identical, except that in the Add method of one of them I added the following check: if (this != obj), where obj is the item being added to the collection.
I clocked both of them separately where they were to add 100,000 random integers:
With Check: ~ 28 milliseconds
Without Check: ~ 21 milliseconds
This is a fairly significant performance hit.
Programming language sets really aren't like ZFC sets, but for quite different reasons than you suppose:
You can't form a set by comprehension (i.e. set of all objects such that ...). Note that this already blocks all (I believe) naive set theory paradoxes, so they are irrelevant.
They usually can't be infinite.
There exist objects which aren't sets (in ZFC there are only sets).
They are usually mutable (i.e. you can add/remove elements to/from the set).
The objects they contain can be mutable.
So the answer to
So why would programming languages which, in some form, use these principles in their very design, ignore them in the implementations of Set in their standard libraries?
is that the languages don't use these principles.
I can't speak for C#, but as far as Java is concerned a set is a set. If you look at the javadoc for the Set interface, you will see (emphasis mine):
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Whether the prohibition is actively enforced is unclear (adding a HashSet to itself does not throw any exceptions for example), but at least it is clearly documented that you should not try because the behaviour would then be unspecified.
Well, I think that this is because of some reasons:
For very specific programming purposes, you might want to create a set that contains itself. When you are doing this, you don't really care what a set means mathematically; you just want the functionality a Set offers: adding elements without creating duplicate entries. (To be honest, I can't think of a situation where you would want to do that.)
For performance purposes. The chance that you would want a Set to contain itself is very, very rare, so it would be a waste of computing power, energy, and time to check, every time you try to add an element, whether it is the set itself.
In Java, the Set models a mathematical set. You can insert objects, but only one instance of each can be in the set. From the javadoc for Set:
"A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction."
Source: http://docs.oracle.com/javase/7/docs/api/
I created the following code to verify the uniqueness of a series of "tuples":
struct MyTuple
{
    public MyTuple(string a, string b, string c)
    {
        ValA = a; ValB = b; ValC = c;
    }

    private string ValA;
    private string ValB;
    private string ValC;
}
...
HashSet<MyTuple> tupleList = new HashSet<MyTuple>();
If I'm correct, I will not end up with two tuples with the same values in my HashSet, thanks to the fact that I'm using a struct. I could not get the same behavior with a class without implementing IEquatable<T> or something like that (I didn't dig too much into how to do that).
I want to know if there are any gotchas in what I'm doing. Performance-wise, I wouldn't expect the use of a struct to be a problem, considering that the strings inside are reference types.
Edit:
I want my HashSet to never contain two tuples whose strings have the same values. In other words, I want the strings to behave like value types.
The gotcha is that it will not work. If two strings are "a", they can still be different references. That case would break your implementation.
Implement Equals() and GetHashCode() properly (e.g. using those of the supplied strings, and taking care with null references in your struct), and possibly IEquatable<MyTuple> to make it even nicer.
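A sketch of one reasonable reading of that advice (field names adapted from the question; HashCode.Combine is a modern convenience that tolerates null fields):

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<MyTuple>();
set.Add(new MyTuple("a", "b", "c"));
set.Add(new MyTuple("a", "b", "c")); // equal by value: not added again
Console.WriteLine(set.Count); // 1

readonly struct MyTuple : IEquatable<MyTuple>
{
    private readonly string valA;
    private readonly string valB;
    private readonly string valC;

    public MyTuple(string a, string b, string c)
    {
        valA = a; valB = b; valC = c;
    }

    // string's == compares values and is null-safe.
    public bool Equals(MyTuple other) =>
        valA == other.valA && valB == other.valB && valC == other.valC;

    public override bool Equals(object? obj) => obj is MyTuple t && Equals(t);

    // HashCode.Combine handles null strings, unlike calling GetHashCode on them.
    public override int GetHashCode() => HashCode.Combine(valA, valB, valC);
}
```

Implementing IEquatable<MyTuple> also lets EqualityComparer<MyTuple>.Default compare instances without boxing.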
Edit: The default implementation is explicitly not suitable for use in hash tables and sets. This is clearly stated in the ValueType.GetHashCode() documentation (added emphasis):

The GetHashCode method applies to types derived from ValueType. One or more fields of the derived type is used to calculate the return value. If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. Additionally, if the value of one or more of those fields changes, the return value might become unsuitable for use as a key in a hash table. In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type.
You should always implement Equals() and GetHashCode() as a pair, and that is even more important here, since ValueType.Equals() is terribly inefficient and unreliable (it uses reflection and an unknown method of equality comparison). There is also a performance problem when not overriding the two: structs get boxed when the default implementations are called.
Your approach should work, but you should make the string values read-only, as Lucero said.
You could also take a look at the new .NET 4.0 Tuple types. Although they are implemented as classes (because they support up to quite a few type parameters), they implement the new IStructuralEquatable interface, which is intended exactly for your purpose.
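For illustration, a HashSet of Tuple<> already deduplicates by the contained values, since Tuple<> overrides Equals and GetHashCode structurally:

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<Tuple<string, string, string>>();
set.Add(Tuple.Create("a", "b", "c"));
set.Add(Tuple.Create("a", "b", "c")); // structurally equal: not added again
Console.WriteLine(set.Count); // 1
```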
It depends on what you mean by "same values". If you want the string values themselves to be compared, rather than just the references, then you're going to have to write your own Equals(MyTuple) method.