Most efficient way to compare two List<T> - c#

I'm trying to compare two lists. Here's the extension method I'm trying to use:
public static bool EqualsAll<T>(this IList<T> a, IList<T> b)
{
    if (a == null || b == null)
        return (a == null && b == null);
    if (a.Count != b.Count)
        return false;
    EqualityComparer<T> comparer = EqualityComparer<T>.Default;
    for (int i = 0; i < a.Count; i++)
    {
        if (!comparer.Equals(a[i], b[i]))
            return false;
    }
    return true;
}
I've already asked this question here, but I need more information. The answerer said that it's better to use SequenceEqual instead of a for loop.
Now my question is: which of the two approaches is the most efficient way to compare two List<T>?

I'm sure they're both relatively equal in performance since both are doing a sequence equal check. SequenceEqual doesn't do any magic - it still loops.
As for which one to use, go with SequenceEqual: it's built into the framework, other programmers already know its functionality, and most importantly, there's no need to reinvent the wheel.

As IList<T> implements IEnumerable<T>, as documented here, I would believe that
a.SequenceEqual(b)
is a reasonable way to do the comparison, with the extension method being documented here.
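As a rough sketch of the SequenceEqual approach suggested above, the helper below keeps the null handling and the length check from the question's EqualsAll and delegates the element comparison to Enumerable.SequenceEqual. The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ListComparison
{
    // Hypothetical helper name: same null semantics as the question's EqualsAll,
    // but the element-by-element work is left to Enumerable.SequenceEqual.
    public static bool SequenceEqualOrBothNull<T>(IList<T> a, IList<T> b)
    {
        // Two nulls count as equal; one null and one non-null do not.
        if (a == null || b == null)
            return a == null && b == null;

        // Cheap early exit: lists of different length can never be equal.
        if (a.Count != b.Count)
            return false;

        // Compares pairwise using EqualityComparer<T>.Default.
        return a.SequenceEqual(b);
    }
}
```

Calling ListComparison.SequenceEqualOrBothNull(first, second) should give the same answers as the hand-written loop; which of the two is faster is best decided by measuring on the target runtime.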

The most efficient way to compare two List<T> objects is to avoid interface dispatch through IList<T>. Instead, specialize the parameter types to List<T>.
This saves a lot of indirect calls. It's clearly faster than the IEnumerable<T>-based SequenceEqual, which requires two indirect calls per element (MoveNext and Current).
Also, cache the count in a local variable instead of reading it on every iteration.
It would be best if List<T> had a built-in method, or implemented a certain interface or allowed access to its internal buffer. But none of that is available.
I guess the fastest possible way is to runtime-compile a function that returns you the internal buffers of each of those lists. Then, you can compare arrays which is much faster. Clearly, this relies on Full Trust and undocumented internals... Proceed with care.
There are also things you can do to avoid the cost of the comparer. It really depends on how important this issue is. You can go to great lengths to optimize this.
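A minimal sketch of the first two suggestions in this answer (concrete List<T> parameters and a cached count) might look like the following. The method name EqualsAllFast is hypothetical, and the sketch does not attempt the internal-buffer trick.

```csharp
using System.Collections.Generic;

public static class ListExtensions
{
    // Hypothetical variant of the question's EqualsAll, typed to List<T>
    // so that Count and the indexer are not called through an interface.
    public static bool EqualsAllFast<T>(this List<T> a, List<T> b)
    {
        if (a == null || b == null)
            return a == null && b == null;

        // Read the count once and reuse it in the loop condition.
        int count = a.Count;
        if (count != b.Count)
            return false;

        EqualityComparer<T> comparer = EqualityComparer<T>.Default;
        for (int i = 0; i < count; i++)
        {
            if (!comparer.Equals(a[i], b[i]))
                return false;
        }
        return true;
    }
}
```

How much this gains over the IList<T> version depends on the runtime and on T, so it is worth benchmarking before committing to the less general signature.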

Related

Where are the methods from the IEnumerable interface implemented for built in classes of the MSDN?

I've been searching MSDN for a class in which the methods of IEnumerator are implemented, for example for the ArrayList class. Is it possible to see the helper class that implements the IEnumerator interface, so that GetEnumerator() from the IEnumerable interface can return an instance of it and foreach becomes available?
I realize that there is no practical use for this. It would be only for "academic" purposes, to better understand how the developers of the language built it.
ArrayList is considered to be an integral part of the platform, so you may not necessarily find it where you think it should be. (It is also obsolete, and you should prefer a generic collection such as List<T>.) For example, in .NET, it is not defined in the collections solution, but rather in mscorlib, in mscorlib/system/collections/arraylist.cs. The definition of GetEnumerator is essentially just a single line:
return new ArrayListEnumeratorSimple(this);
There is also an overload which iterates over a sublist, which is again essentially a single line:
return new ArrayListEnumerator(this, index, count);
The definition of ArrayListEnumerator is again rather simple. If you strip out argument validation and such, it is basically just:
public bool MoveNext() {
    if (index < endIndex) {
        currentElement = list[++index];
        return true;
    }
    else {
        index = endIndex + 1;
    }
    return false;
}
public Object Current => currentElement;
The definition of ArrayListEnumeratorSimple is ironically more complex, but still essentially the same.
Note that the .NET source code is quite old, and ArrayList is one of the oldest classes in there.
For a more modern perspective, you could look at CoreFX's implementation in src/System.Runtime.Extensions/src/System/Collections/ArrayList.cs, but apart from a more modern coding style, and the use of a sentinel value instead of null, it is exactly the same.
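To make the pattern described above concrete, here is a small self-contained sketch of a collection with a nested enumerator class. TinyList and TinyListEnumerator are invented names for illustration; this is not the framework's ArrayList code, only the same general shape.

```csharp
using System;
using System.Collections;

// Hypothetical collection used only to illustrate the pattern.
public class TinyList : IEnumerable
{
    private readonly object[] _items;

    public TinyList(params object[] items)
    {
        _items = items;
    }

    // foreach calls this to obtain the helper object that walks the items.
    public IEnumerator GetEnumerator()
    {
        return new TinyListEnumerator(this);
    }

    // The nested helper class that actually implements IEnumerator.
    private sealed class TinyListEnumerator : IEnumerator
    {
        private readonly TinyList _list;
        private int _index = -1; // positioned before the first element

        public TinyListEnumerator(TinyList list)
        {
            _list = list;
        }

        public bool MoveNext()
        {
            if (_index < _list._items.Length - 1)
            {
                _index++;
                return true;
            }
            _index = _list._items.Length; // park past the end
            return false;
        }

        public object Current
        {
            get
            {
                if (_index < 0 || _index >= _list._items.Length)
                    throw new InvalidOperationException("Enumerator is not positioned on an element.");
                return _list._items[_index];
            }
        }

        public void Reset()
        {
            _index = -1;
        }
    }
}
```

With this in place, foreach (object o in new TinyList("a", "b", "c")) compiles and visits the three items in order, because the compiler only needs GetEnumerator(), MoveNext() and Current.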

Is there any performance difference between compare method and compare class?

Is there any difference in performance between
List<T>.Sort Method (Comparison<T>)
and
List<T>.Sort Method (IComparer<T>)?
Are there any structural (software architectural) benefits?
When do you use the compare method instead of the compare class, and vice versa?
EDIT:
The List<T>.Sort Method (IComparer<T>) is faster. Thanks Jim Mischel!
The performance difference is around 1% on my PC.
It seems that the compare class is the faster one.
The difference is that the first accepts a method (anonymous or not) and the second accepts an instance of a comparer object. Sometimes it is easier to define complex and customizable comparer classes rather than write everything inside a single function.
I prefer the first for simple sorting in one dimension and the latter for multidimensional sorting in e.g. data grids.
Using a comparer you can have private members which can often help with caching. This is useful in certain scenarios (again, in complex sorting of a large data set displayed in a grid).
As I recall, List.Sort(Comparison<T>) instantiates an IComparer<T> and then calls List.Sort(IComparer<T>).
It looks something like this:
class SortComparer<T> : IComparer<T>
{
    private readonly Comparison<T> _compare;
    public SortComparer(Comparison<T> comp)
    {
        _compare = comp;
    }
    public int Compare(T x, T y)
    {
        // Forward every comparison to the wrapped delegate.
        return _compare(x, y);
    }
}
public void Sort(Comparison<T> comp)
{
    Sort(new SortComparer<T>(comp));
}
So they really end up doing the same thing. When I timed this stuff (back in .NET 3.5), Sort(IComparer<T>) was slightly faster because it didn't have to do the extra dereference on every call. But the difference really wasn't big enough to worry about. This is definitely a case of use whatever works best in your code rather than what performs the fastest.
A little more about it, including information about default IComparer implementations: Of Comparison and IComparer
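The two overloads discussed above can be compared side by side in a short sketch. The comparer class and the SortDemo helpers below are hypothetical names; the point is only that a comparer object can carry configuration as private state, while the delegate form is shorter for one-off orderings. The sketch assumes the strings are non-null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: orders strings by length, then ordinally.
public sealed class LengthThenOrdinalComparer : IComparer<string>
{
    // Configuration held as private state of the comparer instance.
    private readonly bool _descending;

    public LengthThenOrdinalComparer(bool descending)
    {
        _descending = descending;
    }

    public int Compare(string x, string y)
    {
        int result = x.Length.CompareTo(y.Length);
        if (result == 0)
            result = string.CompareOrdinal(x, y);
        return _descending ? -result : result;
    }
}

public static class SortDemo
{
    // Uses the Sort(Comparison<T>) overload with a lambda.
    public static List<string> SortWithDelegate(IEnumerable<string> words)
    {
        var list = new List<string>(words);
        list.Sort((x, y) => x.Length.CompareTo(y.Length));
        return list;
    }

    // Uses the Sort(IComparer<T>) overload with a configurable comparer.
    public static List<string> SortWithComparer(IEnumerable<string> words, bool descending)
    {
        var list = new List<string>(words);
        list.Sort(new LengthThenOrdinalComparer(descending));
        return list;
    }
}
```

Note that List<T>.Sort is not a stable sort, so the delegate version above leaves the relative order of equal-length strings unspecified; the comparer version avoids that by adding the ordinal comparison as a second key.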

C#/Java: Proper Implementation of CompareTo when Equals tests reference identity

I believe this question applies equally well to C# as to Java, because both require that {c,C}ompareTo be consistent with {e,E}quals:
Suppose I want my equals() method to be the same as a reference check, i.e.:
public boolean equals(Object o) {
    return this == o;
}
In that case, how do I implement compareTo(Object o) (or its generic equivalent)? Part of it is easy, but I'm not sure about the other part:
public int compareTo(Object o) {
    MyClass other = (MyClass)o;
    if (this == other) {
        return 0;
    } else {
        int c = foo.compareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
I can't just blindly return 1 or -1, because the solution should adhere to the normal requirements of compareTo. I can check all the instance fields, but if they are all equal, I'd still like compareTo to return a value other than 0. It should be true that a.compareTo(b) == -(b.compareTo(a)), and the ordering should stay consistent as long as the objects' state doesn't change.
I don't care about ordering across invocations of the virtual machine, however. This makes me think that I could use something like memory address, if I could get at it. Then again, maybe that won't work, because the Garbage Collector could decide to move my objects around.
hashCode is another idea, but I'd like something that will be always unique, not just mostly unique.
Any ideas?
First of all, if you are using Java 5 or above, you should implement Comparable<MyClass> rather than the plain old Comparable, so your compareTo method should take a parameter of type MyClass, not Object:
public int compareTo(MyClass other) {
    if (this == other) {
        return 0;
    } else {
        int c = foo.compareTo(other.foo);
        if (c == 0) {
            // what here?
        } else {
            return c;
        }
    }
}
As for your question, Josh Bloch in Effective Java (Chapter 3, Item 12) says:
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y. (This implies that x.compareTo(y) must throw an exception if and only if y.compareTo(x) throws an exception.)
This means that if c == 0 in the above code, you must return 0.
That in turn means that you can have objects A and B, which are not equal, but their comparison returns 0. What does Mr. Bloch have to say about this?
It is strongly recommended, but not strictly required, that (x.compareTo(y) == 0) == (x.equals(y)). Generally speaking, any class that implements the Comparable interface and violates this condition should clearly indicate this fact. The recommended language is “Note: This class has a natural ordering that is inconsistent with equals.”
And
A class whose compareTo method imposes an order that is inconsistent with equals will still work, but sorted collections containing elements of the class may not obey the general contract of the appropriate collection interfaces (Collection, Set, or Map). This is because the general contracts for these interfaces are defined in terms of the equals method, but sorted collections use the equality test imposed by compareTo in place of equals. It is not a catastrophe if this happens, but it’s something to be aware of.
Update: So IMHO with your current class, you can not make compareTo consistent with equals. If you really need to have this, the only way I see would be to introduce a new member, which would give a strict natural ordering to your class. Then in case all the meaningful fields of the two objects compare to 0, you could still decide the order of the two based on their special order values.
This extra member may be an instance counter, or a creation timestamp. Or, you could try using a UUID.
In Java or C#, generally speaking, there is no fixed ordering of objects. Instances can be moved around by the garbage collector while executing your compareTo, or the sort operation that's using your compareTo.
As you stated, hash codes are generally not unique, so they're not usable (two different instances with the same hash code bring you back to the original question). And the Java Object.toString implementation, which many people believe surfaces an object id (MyObject@33c0d9d), is nothing more than the object's class name followed by the hash code. As far as I know, neither the JVM nor the CLR has a notion of an instance id.
If you really want a consistent ordering of your classes, you could try using an incrementing number for each new instance you create. Mind you, incrementing this counter must be thread safe, so it's going to be relatively expensive (in C# you could use Interlocked.Increment).
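The incrementing-number idea can be sketched in C# as follows. MyClass and Foo stand in for the question's class and its foo field, and the _id field is an invented name; the sketch keeps the default reference-based Equals and uses the per-instance number only to break ties.

```csharp
using System;
using System.Threading;

// Hypothetical class: Foo stands in for the question's foo field.
public sealed class MyClass : IComparable<MyClass>
{
    // Shared counter; Interlocked.Increment keeps it thread safe.
    private static long _nextId;

    // Each instance takes the next number when it is constructed.
    private readonly long _id = Interlocked.Increment(ref _nextId);

    public int Foo { get; }

    public MyClass(int foo)
    {
        Foo = foo;
    }

    public int CompareTo(MyClass other)
    {
        if (other is null) return 1;
        if (ReferenceEquals(this, other)) return 0;

        // Primary ordering on the meaningful field.
        int c = Foo.CompareTo(other.Foo);
        if (c != 0) return c;

        // Tie-break on creation order, so CompareTo returns 0
        // only for the same instance, matching reference equality.
        return _id.CompareTo(other._id);
    }
}
```

The ordering is stable for the lifetime of the process and antisymmetric, but it is not meaningful across runs, which the question says is acceptable.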
Two objects don't need to be reference equal to be in the same equivalence class. In my opinion, it should be perfectly acceptable for two different objects to be the same for a comparison, but not reference equal. It seems perfectly natural to me, for example, that if you hydrated two different objects from the same row in the database, that they would be the same for comparison purposes, but not reference equal.
I'd actually be more inclined to modify the behavior of equals to reflect how they are compared rather than the other way around. For most purposes that I can think of this would be more natural.
The generic equivalent is easier to deal with, in my opinion. It depends on what your external requirements are, but here is an IComparable<MyClass> example:
public int CompareTo(MyClass other) {
    if (other == null) return 1;
    if (this == other) {
        return 0;
    } else {
        return foo.CompareTo(other.foo);
    }
}
If the instances are the same or if foo is equal, that's the end of the comparison, unless there's something secondary to sort on; in that case, return that comparison when foo.CompareTo(other.foo) == 0.
If your classes have an ID or something similar, compare on that as the secondary key. Otherwise don't worry about it: the collection they're stored in, and the order in which the elements are presented for comparison, will determine the final order of equal objects or objects with equal foo values.

Should I return an array or a collection from a function?

What's the preferred container type when returning multiple objects of the same type from a function?
Is it against good practice to return a simple array (like MyType[]), or should you wrap it in some generic container (like ICollection<MyType>)?
Thanks!
Eric Lippert has a good article on this. In case you can't be bothered to read the entire article, the answer is: return the interface.
Return an IEnumerable<T> using a yield return.
I would return an IList<T> as that gives the consumer of your function the greatest flexibility. That way if the consumer of your function only needed to enumerate the sequence they can do so, but if they want to use the sequence as a list they can do that as well.
My general rule of thumb is to accept the least restrictive type as a parameter and return the richest type I can. This is, of course, a balancing act as you don't want to lock yourself into any particular interface or implementation (but always, always try to use an interface).
This is the least presumptuous approach that you, the API developer, can take. It is not up to you to decide how a consumer of your function will use what you send them; that is why you would return an IList<T> in this case, to give them the greatest flexibility. For the same reason, you would never presume to know what type of parameter a consumer will send you. If you only need to iterate a sequence sent to you as a parameter, then make the parameter an IEnumerable<T> rather than a List<T>.
EDIT (monoxide): Since it doesn't look like the question is going to be closed, I just want to add a link from the other question about this: Why arrays are harmful
Why not List<T>?
From the Eric Lippert post mentioned by others, I thought I will highlight this:
If I need a sequence I’ll use IEnumerable<T>, if I need a mapping from contiguous numbers to data I’ll use a List<T>, if I need a mapping across arbitrary data I’ll use a Dictionary<K,V>, if I need a set I’ll use a HashSet<T>. I simply don’t need arrays for anything, so I almost never use them. They don’t solve a problem I have better than the other tools at my disposal.
A good piece of advice that I've oft heard quoted is this:
Be liberal in what you accept, precise in what you provide.
In terms of designing your API, I'd suggest you should be returning an Interface, not a concrete type.
Taking your example method, I'd rewrite it as follows:
public IList<object> Foo()
{
    List<object> retList = new List<object>();
    // Blah, blah, [snip]
    return retList;
}
The key is that your internal implementation choice - to use a List - isn't revealed to the caller, but you're returning an appropriate interface.
Microsoft's own guidelines on framework development recommend against returning specific types, favoring interfaces. (Sorry, I couldn't find a link for this)
Similarly, your parameters should be as general as possible - instead of accepting an array, accept an IEnumerable of the appropriate type. This is compatible with arrays as well as lists and other useful types.
Taking your example method again:
public IList<object> Foo(IEnumerable<object> bar)
{
    List<object> retList = new List<object>();
    // Blah, blah, [snip]
    return retList;
}
If the collection being returned is read-only, meaning you never want the elements in the collection to be changed, then use IEnumerable<T>. This is the most basic representation of a read-only sequence of immutable (at least from the perspective of the enumeration itself) elements.
If you want it to be a self-contained collection that can be changed, then use ICollection<T> or IList<T>.
For example, if you wanted to return the results of searching for a particular set of files, then return IEnumerable<FileInfo>.
If you wanted to expose the files in a directory, however, you would expose IList<FileInfo> or ICollection<FileInfo>, as it makes sense that you would want to possibly change the contents of the collection.
return ICollection<type>
The advantage of generic return types is that you can change the underlying implementation without changing the code that uses it. The advantage of returning the specific type is that you can use more type-specific methods.
Always return an interface type that presents the greatest amount of functionality to the caller. So in your case ICollection<YourType> ought to be used.
Something interesting to note is that the BCL developers actually got this wrong in some places in the .NET Framework; see this Eric Lippert blog post for that story.
Why not IList<MyType>?
It supports direct indexing, which is the hallmark of an array, without removing the possibility of returning a List<MyType> some day. If you want to suppress this feature, you probably want to return IEnumerable<MyType>.
It depends on what you plan to do with the collection you're returning. If you're just iterating, or if you only want the user to iterate, then I agree with @Daniel: return IEnumerable<T>. If you actually want to allow list-based operations, however, I'd return IList<T>.
Use generics. It's easier to interoperate with other collections classes and the type system is more able to help you with potential errors.
The old style of returning an array was a crutch before generics.
Whatever makes your code more readable, more maintainable, and easier for YOU.
I would have used the simple array, simpler==better most of the time.
Although I really have to see the context to give the right answer.
There are big advantages to favouring IEnumerable over anything else, as this gives you the greatest implementation flexibility and allows you to use yield return or Linq operators for lazy implementation.
If the caller wants a List<T> instead they can simply call ToList() on whatever you returned, and the overall performance will be roughly the same as if you had created and returned a new List<T> from your method.
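As an illustration of the lazy style this answer favours, the sketch below returns IEnumerable<int> from an iterator method and leaves it to the caller to materialize a list. Numbers and EvenSquares are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Numbers
{
    // Iterator method: the body does not run until the result is enumerated,
    // and each element is produced on demand.
    public static IEnumerable<int> EvenSquares(IEnumerable<int> source)
    {
        foreach (int n in source)
        {
            if (n % 2 == 0)
                yield return n * n;
        }
    }
}
```

A caller who only iterates can use the result directly in foreach, while a caller who needs indexing or Count can write Numbers.EvenSquares(values).ToList() and pay for the list only then.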
Array is harmful, but ICollection<T> is also harmful.
ICollection<T> cannot guarantee the object will be immutable.
My recommendation is to wrap the returning object with ReadOnlyCollection<T>
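A small sketch of that recommendation, using List<T>.AsReadOnly() to produce the ReadOnlyCollection<T>. The Inventory class is a made-up example. Keep in mind that the wrapper is a live view over the internal list rather than a snapshot, and that it prevents adding or removing elements but does not make the elements themselves immutable.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical class exposing its internal list in read-only form.
public class Inventory
{
    private readonly List<string> _items = new List<string>();

    public void Add(string item)
    {
        _items.Add(item);
    }

    // Returns a read-only wrapper around the internal list (not a copy).
    // Callers can index and count, but cannot add, remove, or replace items.
    public ReadOnlyCollection<string> Items
    {
        get { return _items.AsReadOnly(); }
    }
}
```

If a caller casts the result to IList<string> and calls Add, the wrapper throws NotSupportedException, so the internal list stays under the owner's control.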

Which do you prefer for interfaces: T[], IEnumerable<T>, IList<T>, or other?

Ok, I'm hoping the community at large will aid us in solving a workplace debate that has been ongoing for a while. This has to do with defining interfaces that either accept or return lists of some type. There are several ways of doing this:
public interface Foo
{
    // Alternatives under discussion (only one would exist in a real interface):
    Bar[] Bars { get; }
    IEnumerable<Bar> Bars { get; }
    ICollection<Bar> Bars { get; }
    IList<Bar> Bars { get; }
}
My own preference is to use IEnumerable for arguments and arrays for return values:
public interface Foo
{
    void Do(IEnumerable<Bar> bars);
    Bar[] Bars { get; }
}
My argument for this approach is that the implementation class can create a List directly from the IEnumerable and simply return it with List.ToArray(). However, some believe that IList should be returned instead of an array. The problem I have with that is that you're then required to copy it again with a ReadOnlyCollection before returning. The option of returning IEnumerable seems troublesome for client code?
What do you use/prefer? (especially with regards to libraries that will be used by other developers outside your organization)
My preference is IEnumerable<T>. Any other of the suggested interfaces gives the appearance of allowing the consumer to modify the underlying collection. This is almost certainly not what you want to do as it's allowing consumers to silently modify an internal collection.
Another good one IMHO, is ReadOnlyCollection<T>. It allows for all of the fun .Count and Indexer properties and unambiguously says to the consumer "you cannot modify my data".
I don't return arrays - they really are a terrible return type to use when creating an API - if you truly need a mutable sequence use the IList<T> or ICollection<T> interface or return a concrete Collection<T> instead.
Also I would suggest that you read Arrays considered somewhat harmful by Eric Lippert:
I got a moral question from an author of programming language textbooks the other day requesting my opinions on whether or not beginner programmers should be taught how to use arrays. Rather than actually answer that question, I gave him a long list of my opinions about arrays, how I use arrays, how we expect arrays to be used in the future, and so on. This gets a bit long, but like Pascal, I didn't have time to make it shorter.
Let me start by saying when you definitely should not use arrays, and then wax more philosophical about the future of modern programming and the role of the array in the coming world.
For property collections that are indexed (and the indices have necessary semantic meaning), you should use ReadOnlyCollection<T> (read only) or IList<T> (read/write). It's the most flexible and expressive. For non-indexed collections, use IEnumerable<T> (read only) or ICollection<T> (read/write).
Method parameters should use IEnumerable<T> unless they 1) need to add/remove items to the collection (ICollection<T>) or 2) require indexes for necessary semantic purposes (IList<T>). If the method can benefit from indexing when it is available (such as a sorting routine), the implementation can always try a cast with as IList<T> and fall back to .ToList() when the cast fails.
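The cast-then-fallback idea from the last sentence can be sketched like this. SequenceHelpers and MiddleElement are hypothetical names chosen for the example; the method accepts any IEnumerable<T> but uses the indexer directly when the argument already supports it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceHelpers
{
    // Hypothetical helper: returns the element at the middle index.
    public static T MiddleElement<T>(IEnumerable<T> source)
    {
        // Reuse the caller's collection when it already supports indexing
        // (arrays and List<T> both do); otherwise materialize it once.
        IList<T> list = source as IList<T> ?? source.ToList();

        if (list.Count == 0)
            throw new InvalidOperationException("Sequence contains no elements.");

        return list[list.Count / 2];
    }
}
```

Callers keep the flexibility of passing an array, a List<T>, or a lazy query, and only the lazy case pays for a copy.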
I think about this in terms of writing the most useful code possible: code that can do more.
Put in those terms, it means I like to accept the weakest interface possible as method arguments, because that makes my code useful from more places. In this case, that's an IEnumerable<T>. Have an array? You can call my method. Have a List? You can call my method. Have an iterator block? You get the idea.
It also means I like my methods to return the strongest interface that is convenient, so that code that relies on the method can easily do more. In this case, that would be IList<T>. Note that this doesn't mean I will construct a list just so I can return it. It just means that if I already have something that implements IList<T>, I may as well use it.
Note that I'm a little unconventional with regards to return types. A more typical approach is to also return weaker types from methods to avoid locking yourself into a specific implementation.
I would prefer IEnumerable, as it is the most high-level of the interfaces and gives the end user the opportunity to convert it as they wish. Even though this may provide the user with minimal functionality to begin with (basically only enumeration), it would still be enough to cover virtually any need, especially with the extension methods ToArray(), ToList(), etc.
IEnumerable<T> is very useful for lazy-evaluated iteration, especially in scenarios that use method chaining.
But as a return type for a typical data access tier, a Count property is often useful, and I would prefer to return an ICollection<T> with a Count property or possibly IList<T> if I think typical consumers will want to use an indexer.
This is also an indication to the caller that the collection has actually been materialized. And thus the caller can iterate through the returned collection without getting exceptions from the data access tier. This can be important. For example, a service may generate a stream (e.g. SOAP) from the returned collection. It can be awkward if an exception is thrown from the data access layer while generating the stream due to lazy-evaluated iteration, as the output stream is already partially written when the exception is thrown.
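The timing difference described here can be shown with a small sketch. DataAccess and ReadRows are hypothetical stand-ins for a data access tier whose source fails partway through; the two public methods differ only in whether the failure surfaces during the call or later during enumeration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DataAccess
{
    // Hypothetical data source that fails after producing two rows.
    private static IEnumerable<int> ReadRows()
    {
        yield return 1;
        yield return 2;
        throw new InvalidOperationException("connection lost");
    }

    // Lazy: returns immediately; the exception surfaces later,
    // while the caller is already enumerating (e.g. writing a response).
    public static IEnumerable<int> GetRowsLazy()
    {
        return ReadRows();
    }

    // Materialized: all rows are read here, so any failure
    // happens inside this call, before the caller starts using the data.
    public static ICollection<int> GetRowsMaterialized()
    {
        return ReadRows().ToList();
    }
}
```

With GetRowsLazy, a consumer would process two rows before the exception arrives; with GetRowsMaterialized, the exception is thrown before any row is handed out, which is easier to handle cleanly.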
Since the Linq extension methods were added to IEnumerable<T>, I've found that my use of the other interfaces has declined considerably; probably around 80%. I used to use List<T> religiously as it had methods that accepted delegates for lazy evaluation like Find, FindAll, ForEach and the like. Since that's available through System.Linq's extensions, I've replaced all those references with IEnumerable<T> references.
I wouldn't go with an array; it's a type that allows modification yet doesn't have add/remove... kind of the worst of the pack. If I want to allow modifications, then I would use a type that supports add/remove.
When you want to prevent modifications, you are already wrapping or copying the data, so I don't see what's wrong with an IEnumerable or a ReadOnlyCollection. I would go with the latter... Something I don't like about IEnumerable is that it's lazy by nature, yet when you use it only to wrap pre-loaded data, calling code that works with it tends either to assume pre-loaded data or to carry extra "unnecessary" lines :( ... and that can give ugly results when things change.
