I have an IEnumerable<T> that I wanted to filter with a LINQ predicate. I tried using Where on the IEnumerable as I normally do, but this time I stumbled upon something interesting. When I call Where on the IEnumerable with the predicate, I get an empty list in return, even though I know it should produce a list with two items. If I instead use FindAll with the same predicate, it produces the correct result.
Can anyone explain why this is happening? I always thought that Where was a kind of lazy version of FindAll that also returns an IEnumerable instead of a List. There must be more to it than that. (I have done some research, but to no avail.)
Code:
IEnumerable<View> views = currentProject.Views.Where(
v => v.Entries.Any(e => e.Type == InputType.IMAGE || e.Type == InputType.VIDEO));
IEnumerable<View> views = currentProject.Views.FindAll(
v => v.Entries.Any(e => e.Type == InputType.IMAGE || e.Type == InputType.VIDEO));
You can find your answer here: LINQ, Where() vs FindAll(). Basically, if you call .ToList() on the result of Where, the two would be the same.
You can find more about the differences between deferred and immediate execution: https://code.msdn.microsoft.com/LINQ-Query-Execution-ce0d3b95
My best guess would be that something happens between calling Where, which creates an enumerator and the place in your code where the results are actually used (i.e. where MoveNext and (get_)Current of that enumerator are actually called, e.g. from ToList).
Yes, Where is a lazy version of FindAll. FindAll() is a method on the List type, not a LINQ extension method like Where. It is an instance method on List that returns a new List with the same element type. FindAll can only be used on List instances, whereas LINQ extension methods work on any type that implements IEnumerable.
The main difference (besides what they're implemented on: IEnumerable vs. List) is that Where uses deferred execution: it doesn't actually do the lookup until you need the results (for example, when you use it in a foreach loop). FindAll executes immediately.
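To make the difference concrete, here is a minimal sketch using a plain List<int> rather than the asker's Views collection. The class name WhereVsFindAll and its Run method are made up for illustration. It changes the source list after both calls and then compares what each result contains.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WhereVsFindAll
{
    // Returns how many even numbers each approach reports after the
    // source list has been changed.
    public static (int LazyCount, int EagerCount) Run()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };

        // Where only builds a query object; nothing is filtered yet.
        IEnumerable<int> lazy = numbers.Where(n => n % 2 == 0);

        // FindAll filters immediately and copies the matches into a new List.
        List<int> eager = numbers.FindAll(n => n % 2 == 0);

        // Change the source after both calls.
        numbers.Add(6);

        // The Where query runs now and sees 2, 4 and 6;
        // the FindAll snapshot still holds only 2 and 4.
        return (lazy.Count(), eager.Count);
    }
}
```

Run() returns (3, 2). If the source collection changes between the Where call and the point where its result is enumerated, the two approaches can disagree, which is one possible explanation for the behavior in the question.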
To understand deferred execution, I will refer to a data structure called an expression tree. You only need to grasp that an expression tree is a data structure, like a list or a queue. It holds a LINQ to SQL query: not the results of the query, but the actual elements of the query itself.
To understand how Where works, consider the following code:
var query = from customer in db.Customers
where customer.City == "Paris"
select customer;
The query does not execute here; it executes in the foreach loop that enumerates it.
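You can observe this without a database. The sketch below is an assumption-laden stand-in: it uses LINQ to Objects over an in-memory array instead of db.Customers, and the names DeferredQueryDemo and predicateCalls are invented for the example. It counts how often the filter actually runs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredQueryDemo
{
    public static (int CallsBeforeLoop, int CallsAfterLoop) Run()
    {
        var cities = new[] { "Paris", "Oslo", "Paris" };
        int predicateCalls = 0;

        // Defining the query does not run the predicate.
        IEnumerable<string> query = cities.Where(city =>
        {
            predicateCalls++;
            return city == "Paris";
        });

        int callsBeforeLoop = predicateCalls;

        // Enumerating the query is what runs the predicate, once per element.
        foreach (string match in query)
        {
        }

        return (callsBeforeLoop, predicateCalls);
    }
}
```

Run() returns (0, 3): no predicate calls when the query is defined, three once the foreach has walked the sequence.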
To understand more, see: LINQ and Deferred Execution
I believe the answer to this question is well explained here: LINQ Ring: Any() vs Contains() for Huge Collections
But my question is specific for the current implementation
IEnumerable<T> msgs = null;
/// ...
/// some method calls which returns a long list of messages
/// The return type of the method is IEnumerable<T>
/// List<T> ret = new List<T>();
/// ...
/// return ret
/// ...
if (msgs.Any())
last = msgs.Last(); // "last" is assumed to be declared elsewhere
msgs is said to be an in-memory collection (IEnumerable). How does Any() work here? There's no condition in this Any() call, so isn't it just O(1)? Or does it still look through each element?
I assume that IEnumerable<BaseJournalMessage> msgs is not a collection like an array or list; otherwise Any and Last would be no problem (but you do have performance issues). So it seems to be an expensive LINQ query which gets executed twice, once at Any and again at Last.
Any needs to start enumerating the sequence to see if there is at least one element. Last needs to enumerate it fully to get the last one. You can make it more efficient in this way:
BaseJournalMessage last = msgs.LastOrDefault();
if (last != null)
time = last.JournalTime;
To explain a bit more, suppose msgs were an array:
IEnumerable<BaseJournalMessage> msgs = new BaseJournalMessage[0];
Here Any is simple and efficient, since it just needs to check whether the array's enumerator has at least one element; the same goes for other collections. The complexity is O(1).
Now consider that it's a complex query, as it seems to be in your case. Here the complexity of a following Any is clearly not O(1):
IEnumerable<BaseJournalMessage> msgs = hugeMessageList
.Where(msg => ComplexMethod(msg) && OtherComplexCondition(msg))
.OrderBy(msg => msg.SomeProperty);
This is not a collection, since you haven't appended ToList/ToArray/ToHashSet. Instead it's a LINQ query with deferred execution, and it is executed every time it is enumerated. That could be a foreach loop, an Any or Last call, or any other method that enumerates it. Sometimes it's useful to always get the current state, but normally you should materialize the query into a collection if you have to access it multiple times. So append ToList and everything's fine.
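The repeated execution is easy to measure. In the sketch below, ReadMessages is a hypothetical stand-in for an expensive source (it is not the asker's code), and sourceReads counts how many elements the source has to produce for the Any/Last pattern, with and without ToList.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RepeatedEnumerationDemo
{
    private static int sourceReads;

    // Hypothetical stand-in for an expensive message source.
    private static IEnumerable<int> ReadMessages()
    {
        for (int id = 1; id <= 3; id++)
        {
            sourceReads++;
            yield return id;
        }
    }

    public static (int DeferredReads, int MaterializedReads) Run()
    {
        // Deferred query: Any and Last each start a fresh enumeration.
        sourceReads = 0;
        IEnumerable<int> deferred = ReadMessages().Where(id => id > 0);
        if (deferred.Any())
        {
            _ = deferred.Last();
        }
        int deferredReads = sourceReads;

        // Materialized query: the source is read once, by ToList.
        sourceReads = 0;
        List<int> materialized = ReadMessages().Where(id => id > 0).ToList();
        if (materialized.Any())
        {
            _ = materialized.Last();
        }
        int materializedReads = sourceReads;

        return (deferredReads, materializedReads);
    }
}
```

Run() returns (4, 3): Any reads one element and Last reads all three again, whereas the materialized version reads the source exactly once.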
Look for the term "deferred execution" in the documentation of each LINQ method (for example Where, Select or OrderBy) if you want to know whether it executes the query or not. You can chain as many deferred methods as you want without actually executing the query. But if the documentation of a method contains "forces immediate query evaluation" (as for ToList), the query gets executed (so avoid those methods in the middle of a query).
How does Any() work here? There's
no condition for this Any() method call, isn't it just O(1) instead?
Or it still looks through each element?
In LINQ to Objects, implemented in the static class System.Linq.Enumerable, Any() just gets the IEnumerator and invokes MoveNext(). If the result is true, Any() returns true; otherwise it returns false. It never iterates any further.
So it is a pure O(1) algorithm.
EDIT: I have to correct myself: The time depends on the enumerable "Any" iterates. I had a misconception of the Big O notation and the meaning of "O(1)" and "O(n)".
This is the source code (source available at GitHub these days):
public static bool Any<TSource>(this IEnumerable<TSource> source) {
if (source == null) throw Error.ArgumentNull("source");
using (IEnumerator<TSource> e = source.GetEnumerator()) {
if (e.MoveNext()) return true;
}
return false;
}
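The correction above (the cost depends on the enumerable that Any iterates) can be seen in a small sketch. The class name AnyCostDemo and both predicates are invented for illustration. Any makes a single MoveNext call in both cases, but that one call may have to do a lot of work upstream.

```csharp
using System.Linq;

public static class AnyCostDemo
{
    public static (int ChecksWhenFirstMatches, int ChecksWhenLastMatches) Run()
    {
        // The first element already satisfies the filter,
        // so one MoveNext costs one predicate call.
        int checksWhenFirstMatches = 0;
        _ = Enumerable.Range(1, 1000)
            .Where(n => { checksWhenFirstMatches++; return n >= 1; })
            .Any();

        // Only the last element satisfies the filter, so the same single
        // MoveNext has to test every element before it can answer.
        int checksWhenLastMatches = 0;
        _ = Enumerable.Range(1, 1000)
            .Where(n => { checksWhenLastMatches++; return n == 1000; })
            .Any();

        return (checksWhenFirstMatches, checksWhenLastMatches);
    }
}
```

Run() returns (1, 1000).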
I have some doubts over how Enumerators work, and LINQ. Consider these two simple selects:
List<Animal> sel = (from animal in Animals
join race in Species
on animal.SpeciesKey equals race.SpeciesKey
select animal).Distinct().ToList();
or
IEnumerable<Animal> sel = (from animal in Animals
join race in Species
on animal.SpeciesKey equals race.SpeciesKey
select animal).Distinct();
I changed the names of my original objects so that this looks like a more generic example. The query itself is not that important. What I want to ask is this:
foreach (Animal animal in sel) { /*do stuff*/ }
I noticed that if I use IEnumerable, when I debug and inspect "sel", which in that case is the IEnumerable, it has some interesting members: "inner", "outer", "innerKeySelector" and "outerKeySelector", these last 2 appear to be delegates. The "inner" member does not have "Animal" instances in it, but rather "Species" instances, which was very strange for me. The "outer" member does contain "Animal" instances. I presume that the two delegates determine which goes in and what goes out of it?
I noticed that if I use "Distinct", the "inner" contains 6 items (this is incorrect as only 2 are Distinct), but the "outer" does contain the correct values. Again, probably the delegated methods determine this but this is a bit more than I know about IEnumerable.
Most importantly, which of the two options is the best performance-wise?
The evil List conversion via .ToList()?
Or maybe using the enumerator directly?
If you can, please also explain a bit or throw some links that explain this use of IEnumerable.
IEnumerable describes behavior, while List is an implementation of that behavior. When you use IEnumerable, you give LINQ a chance to defer work until later, possibly optimizing along the way. If you use ToList() you force the results to be materialized right away.
Whenever I'm "stacking" LINQ expressions, I use IEnumerable, because by only specifying the behavior I give LINQ a chance to defer evaluation and possibly optimize the program. Remember how LINQ doesn't generate the SQL to query the database until you enumerate it? Consider this:
public IEnumerable<Animals> AllSpotted()
{
return from a in Zoo.Animals
where a.coat.HasSpots == true
select a;
}
public IEnumerable<Animals> Feline(IEnumerable<Animals> sample)
{
return from a in sample
where a.race.Family == "Felidae"
select a;
}
public IEnumerable<Animals> Canine(IEnumerable<Animals> sample)
{
return from a in sample
where a.race.Family == "Canidae"
select a;
}
Now you have a method that selects an initial sample ("AllSpotted"), plus some filters. So now you can do this:
var Leopards = Feline(AllSpotted());
var Hyenas = Canine(AllSpotted());
So is it faster to use List over IEnumerable? Only if you want to prevent a query from being executed more than once. But is it better overall? Well in the above, Leopards and Hyenas get converted into single SQL queries each, and the database only returns the rows that are relevant. But if we had returned a List from AllSpotted(), then it may run slower because the database could return far more data than is actually needed, and we waste cycles doing the filtering in the client.
In a program, it may be better to defer converting your query to a list until the very end, so if I'm going to enumerate through Leopards and Hyenas more than once, I'd do this:
List<Animals> Leopards = Feline(AllSpotted()).ToList();
List<Animals> Hyenas = Canine(AllSpotted()).ToList();
There is a very good article on Claudio Bernasconi's TechBlog: When to use IEnumerable, ICollection, IList and List
Here are some basic points about scenarios and functions:
A class that implements IEnumerable allows you to use the foreach syntax.
Basically it has a method to get the next item in the collection. It doesn't need the whole collection to be in memory and doesn't know how many items are in it, foreach just keeps getting the next item until it runs out.
This can be very useful in certain circumstances, for instance in a massive database table you don't want to copy the entire thing into memory before you start processing the rows.
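As a rough illustration of "one item at a time", here is a sketch where ReadRows is a hypothetical stand-in for reading a huge table (no real database is involved). It uses yield return, so rows are only produced when the consumer asks for them.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class StreamingDemo
{
    // Counts how many rows the source actually had to produce.
    public static int RowsProduced;

    // Hypothetical stand-in for a huge table: rows are generated on demand
    // and never held in memory all at once.
    public static IEnumerable<int> ReadRows(int total)
    {
        for (int row = 1; row <= total; row++)
        {
            RowsProduced++;
            yield return row;
        }
    }

    // Asks for only the first three rows of a million-row source.
    public static List<int> FirstThree()
    {
        RowsProduced = 0;
        return ReadRows(1_000_000).Take(3).ToList();
    }
}
```

After calling FirstThree(), RowsProduced stays tiny compared to the million rows the source could have produced, because the enumeration stops as soon as Take has what it needs.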
Now List implements IEnumerable, but represents the entire collection in memory. If you have an IEnumerable and you call .ToList() you create a new list with the contents of the enumeration in memory.
Your LINQ expression returns an enumeration, and by default the expression executes when you iterate through it with foreach, but you can force it to execute sooner by calling .ToList().
Here's what I mean:
var things =
from item in BigDatabaseCall()
where ....
select item;
// this will iterate through the entire linq statement:
int count = things.Count();
// this will stop after iterating the first one, but will execute the linq again
bool hasAnyRecs = things.Any();
// this will execute the linq statement *again*
foreach( var thing in things ) ...
// this will copy the results to a list in memory
var list = things.ToList();
// this won't iterate through again, the list knows how many items are in it
int count2 = list.Count();
// this won't execute the linq statement - we have it copied to the list
foreach( var thing in list ) ...
Nobody mentioned one crucial difference, ironically answered in a question closed as a duplicate of this one.
IEnumerable is read-only and List is not.
See Practical difference between List and IEnumerable
The most important thing to realize is that, using Linq, the query does not get evaluated immediately. It is only run as part of iterating through the resulting IEnumerable<T> in a foreach - that's what all the weird delegates are doing.
So, the first example evaluates the query immediately by calling ToList and putting the query results in a list.
The second example returns an IEnumerable<T> that contains all the information needed to run the query later on.
In terms of performance, the answer is it depends. If you need the results to be evaluated at once (say, you're mutating the structures you're querying later on, or if you don't want the iteration over the IEnumerable<T> to take a long time) use a list. Else use an IEnumerable<T>. The default should be to use the on-demand evaluation in the second example, as that generally uses less memory, unless there is a specific reason to store the results in a list.
The advantage of IEnumerable is deferred execution (usually with databases). The query will not get executed until you actually loop through the data. It's a query waiting until it's needed (aka lazy loading).
If you call ToList, the query will be executed, or "materialized" as I like to say.
There are pros and cons to both. If you call ToList, you may remove some mystery as to when the query gets executed. If you stick to IEnumerable, you get the advantage that the program doesn't do any work until it's actually required.
I will share one misused concept that I fell into one day:
var names = new List<string> {"mercedes", "mazda", "bmw", "fiat", "ferrari"};
var startingWith_M = names.Where(x => x.StartsWith("m"));
var startingWith_F = names.Where(x => x.StartsWith("f"));
// updating existing list
names[0] = "ford";
// Guess what should be printed before continuing
print( startingWith_M.ToList() );
print( startingWith_F.ToList() );
Expected result
// I was expecting
print( startingWith_M.ToList() ); // mercedes, mazda
print( startingWith_F.ToList() ); // fiat, ferrari
Actual result
// what was actually printed
print( startingWith_M.ToList() ); // mazda
print( startingWith_F.ToList() ); // ford, fiat, ferrari
Explanation
As per the other answers, the evaluation of the result was deferred until ToList (or a similar method, for example ToArray) was called.
So I can rewrite the code in this case as:
var names = new List<string> {"mercedes", "mazda", "bmw", "fiat", "ferrari"};
// updating existing list
names[0] = "ford";
// before calling ToList directly
var startingWith_M = names.Where(x => x.StartsWith("m"));
var startingWith_F = names.Where(x => x.StartsWith("f"));
print( startingWith_M.ToList() );
print( startingWith_F.ToList() );
Play around
https://repl.it/E8Ki/0
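If the originally expected output is what you want, one option is to take the snapshot before the list is modified. A minimal sketch (the class name SnapshotDemo is invented; the data is the same as above):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    public static (List<string> WithM, List<string> WithF) Run()
    {
        var names = new List<string> { "mercedes", "mazda", "bmw", "fiat", "ferrari" };

        // ToList runs each query right away and stores the results.
        List<string> startingWithM = names.Where(x => x.StartsWith("m")).ToList();
        List<string> startingWithF = names.Where(x => x.StartsWith("f")).ToList();

        // This change no longer affects the two snapshots.
        names[0] = "ford";

        return (startingWithM, startingWithF);
    }
}
```

Run() returns the lists (mercedes, mazda) and (fiat, ferrari), because both queries were evaluated before names[0] was replaced.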
If all you want to do is enumerate them, use the IEnumerable.
Beware, though, that changing the original collection while it is being enumerated is a dangerous operation; in that case, you will want to call ToList first. This enumerates the IEnumerable and copies every element into a new list in memory, so it is less performant if you only enumerate once, but it is safer, and sometimes the List methods are handy (for instance for random access).
In addition to all the answers posted above, here are my two cents. Many types other than List implement IEnumerable, such as ICollection, ArrayList, etc. So if a method takes an IEnumerable parameter, we can pass any collection type to it, i.e. the method can operate on an abstraction rather than on a specific implementation.
The downside of IEnumerable (deferred execution) is that until you invoke .ToList(), the underlying data can potentially change. For a really simple example of this, this would work:
List<Person> persons; // "var" cannot be used without an initializer
using (MyEntities db = new MyEntities()) {
persons = db.Persons.ToList(); // It's mine now. In the memory
}
// do what you want with the list of persons;
and this would not work
IEnumerable<Person> persons;
using (MyEntities db = new MyEntities()) {
persons = db.Persons; // nothing is brought until you use it;
}
persons = persons.ToList(); // trying to use it...
// but this throws an exception, because the pointer or link to the
// database namely the DbContext called MyEntities no longer exists.
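The same failure can be reproduced without Entity Framework. In the sketch below, FakeContext is a made-up stand-in for a DbContext (it is not an EF type): its Persons sequence can only be read while the context has not been disposed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a DbContext: rows are only readable
// until Dispose has been called.
public sealed class FakeContext : IDisposable
{
    private bool disposed;

    public IEnumerable<string> Persons
    {
        get
        {
            // This body runs when the sequence is enumerated,
            // not when the property is read.
            if (disposed) throw new ObjectDisposedException(nameof(FakeContext));
            yield return "Ann";
            yield return "Bob";
        }
    }

    public void Dispose() => disposed = true;
}

public static class DisposedContextDemo
{
    // Works: ToList enumerates while the context is still alive.
    public static List<string> LoadEagerly()
    {
        using (var db = new FakeContext())
        {
            return db.Persons.ToList();
        }
    }

    // Fails: the query is only enumerated after the context is disposed.
    public static List<string> LoadLazily()
    {
        IEnumerable<string> persons;
        using (var db = new FakeContext())
        {
            persons = db.Persons; // nothing is read yet
        }
        return persons.ToList(); // throws ObjectDisposedException
    }
}
```

LoadEagerly() returns both names, while LoadLazily() throws, mirroring the two snippets above.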
There are many cases (such as an infinite list or a very large list) where an IEnumerable cannot be transformed into a List. The most obvious examples are all the prime numbers, all the users of Facebook with their details, or all the items on eBay.
The difference is that "List" objects are stored "right here and right now", whereas "IEnumerable" objects work "just one at a time". So if I am going through all the items on eBay, one at a time would be something even a small computer can handle, but ".ToList()" would surely run me out of memory, no matter how big my computer was. No computer can by itself contain and handle such a huge amount of data.
[Edit] Needless to say, it's not "either this or that". Often it makes good sense to use both a List and an IEnumerable in the same class. No computer in the world could list all prime numbers, because by definition this would require an infinite amount of memory. But you could easily think of a class PrimeContainer which exposes an IEnumerable<long> primes and which, for obvious reasons, also contains a sorted collection _primes (for example a SortedSet<long>) holding all the primes calculated so far. The next candidate would only be checked against the existing primes (up to its square root). That way you gain both: primes one at a time (IEnumerable) and a good list of "primes so far", which is a pretty good approximation of the entire (infinite) list.
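Here is one way such a PrimeContainer might look. This is only a sketch of the idea, not a tuned implementation: it uses a plain List<long> as the cache (primes are found in ascending order, so the list is already sorted), and the member names knownPrimes, FirstPrimes and KnownCount are my own.

```csharp
using System.Collections.Generic;
using System.Linq;

public class PrimeContainer
{
    // All primes calculated so far, in ascending order.
    private readonly List<long> knownPrimes = new List<long>();

    public int KnownCount => knownPrimes.Count;

    // An endless sequence: primes are computed only when a consumer asks.
    public IEnumerable<long> Primes
    {
        get
        {
            int index = 0;
            while (true)
            {
                if (index == knownPrimes.Count)
                {
                    FindNextPrime();
                }
                yield return knownPrimes[index++];
            }
        }
    }

    // Convenience helper: take a finite prefix of the infinite sequence.
    public List<long> FirstPrimes(int count) => Primes.Take(count).ToList();

    private void FindNextPrime()
    {
        long candidate = knownPrimes.Count == 0
            ? 2
            : knownPrimes[knownPrimes.Count - 1] + 1;
        while (!IsPrime(candidate))
        {
            candidate++;
        }
        knownPrimes.Add(candidate);
    }

    // Tests the candidate only against known primes up to its square root.
    private bool IsPrime(long candidate)
    {
        foreach (long prime in knownPrimes)
        {
            if (prime * prime > candidate) break;
            if (candidate % prime == 0) return false;
        }
        return true;
    }
}
```

new PrimeContainer().FirstPrimes(5) yields 2, 3, 5, 7, 11. Calling Primes.ToList() would never finish, which is exactly the point made above. The class is not thread-safe.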
So basically I have this method.
public List<Customer> FilterCustomersByStatus(List<Customer> source, string status)
{
return (List<Customer>)source.Where(c => c.Status == status);
}
It throws an error saying that it cannot cast:
Unable to cast object of type 'WhereListIterator`1[AppDataAcces.Customer]' to type 'System.Collections.Generic.List`1[AppDataAcces.Customer]'.
Why? Since the underlying type is the same, does Enumerable.Where create a new instance of WhereListIterator? And if so, why would anyone do this? That's an unnecessary loss of performance and functionality, since I always have to create a new list (.ToList()).
does the Enumerable.Where create a new instance of WhereListIterator
Yes.
and if so why would anyone do this
Because it allows lazy streaming behavior. Where won't have to filter all the list if its consumer wants only first or second entry. This is normal for LINQ.
because thats an unnecessary loss of performance and functionality since i always have to create a new list (.ToList())
That "loss of performance and functionality" comes from your design. You don't need List<Customer> after filtering, because it's pointless to do any modifications on it.
Update: "why is it implemented so"
Because it is implemented over IEnumerable, not IList. And thus it looks like IEnumerable, it quacks like IEnumerable.
Besides, it's just so much easier to implement it this way. Imagine for a moment that you have to write Where over IList. Which has to return IList. What should it do? Return a proxy over original list? You'll suffer huge performance penalties on every access. Return new list with filtered items? It'll be the same as doing Where().ToList(). Return original list but with all non-matching items deleted? That's what RemoveAll is for, why make another method.
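To illustrate two of the options discussed above, here is a small sketch contrasting Where().ToList() with RemoveAll. The sample data and the class name FilterChoicesDemo are made up.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FilterChoicesDemo
{
    public static (int OriginalAfterWhere, int CopyCount, int OriginalAfterRemoveAll) Run()
    {
        var statuses = new List<string> { "active", "closed", "active" };

        // Where + ToList builds a new list and leaves the original untouched.
        List<string> copy = statuses.Where(s => s == "active").ToList();
        int originalAfterWhere = statuses.Count;

        // RemoveAll deletes the non-matching items from the original list itself.
        statuses.RemoveAll(s => s != "active");

        return (originalAfterWhere, copy.Count, statuses.Count);
    }
}
```

Run() returns (3, 2, 2): the original still has three items after Where().ToList(), the copy has two, and RemoveAll then shrinks the original to two.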
And remember, LINQ tries to play functional, and tries to treat objects as immutables.
As others pointed out, you need to use ToList to convert the result to List<T>.
The reason is that Where is lazily evaluated, so Where does not really filter the data.
What it does is create an IEnumerable which filters data as needed.
Lazy evaluation has several benefits. It might be faster, it allows using Where with infinite IEnumerables, etc.
ToList forces the result to be converted to List<T>, which seems to be what you want.
The Where extension filters and returns IEnumerable<TSource> hence you need to call .ToList() to convert it back
public List<Customer> FilterCustomersByStatus(List<Customer> source, string status)
{
return source.Where(c => c.Status == status).ToList();//This will return a list of type customer
}
The difference between IEnumerable and IList is that the enumerable doesn't contain any data; it contains an iterator that goes through the data as you request each new item (for example, with a foreach loop). The list, on the other hand, is a copy of the data. In your case, to create the List, the ToList() method iterates through the entire data and adds each item to a List object.
Depending on the usage you are planning, both have advantages and disadvantages. For example, if you are planning to use the entire data more than once, you should go with the list, but if you are planning to use it once, or to query it again using LINQ, the enumerable should be your choice.
Edit:
As for why the return type of Where is WhereListIterator instead of List: it's partly because of how LINQ works. For example, if another Where or another LINQ call followed the first, the whole method chain would be composed into a single query, and you would get the iterator for the final query. If, on the other hand, the first Where returned a List, each LINQ method in the chain would execute separately on the data.
Try this:
public List<Customer> FilterCustomersByStatus(List<Customer> source, string status)
{
return source.Where(c => c.Status == status).ToList();
}
If I know there is only one matching item in a collection, is there any way to tell Linq about this so that it will abort the search when it finds it?
I am assuming that both of these search the full collection before returning one item?
var fred = _people.Where((p) => p.Name == "Fred").First();
var bill = _people.Where((p) => p.Name == "Bill").Take(1);
EDIT: People seem fixated on FirstOrDefault, or SingleOrDefault. These have nothing to do with my question. They simply provide a default value if the collection is empty. As I stated, I know that my collection has a single matching item.
AakashM's comment is of most interest to me. It would appear my assumption is wrong, but I'm interested in why.
For instance, when linq to objects is running the Where() function in my example code, how does it know that there are further operations on its return value?
Your assumption is wrong. LINQ uses deferred execution and lazy evaluation a lot. What this means is that, for example, when you call Where(), it doesn't actually iterate the collection. Only when you iterate the object it returns, will it iterate the original collection. And it will do it in a lazy manner: only as much as is necessary.
So, in your case, neither query will iterate the whole collection: both will iterate it only up to the point where they find the first element, and then stop.
Actually, the second query (with Take()) won't do even that, it will iterate the source collection only if you iterate the result.
This all applies to LINQ to objects. Other providers (LINQ to SQL and others) can behave differently, but at least the principle of deferred execution usually still holds.
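A quick way to convince yourself is to count the predicate calls. The sketch below uses a small string array in place of _people; the class name EarlyExitDemo and the sample names are assumptions for the example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EarlyExitDemo
{
    public static (int ChecksForFirst, int ChecksBeforeTakeIsEnumerated, int ChecksAfterTakeIsEnumerated) Run()
    {
        var people = new[] { "Alice", "Fred", "Bill", "Carol" };

        // First() pulls items through Where only until the first match.
        int checks = 0;
        _ = people.Where(p => { checks++; return p == "Fred"; }).First();
        int checksForFirst = checks;

        // Take(1) is itself deferred: nothing is checked until it is enumerated.
        checks = 0;
        IEnumerable<string> bill = people.Where(p => { checks++; return p == "Bill"; }).Take(1);
        int checksBeforeTakeIsEnumerated = checks;

        // Enumerating stops once the single requested item has been found.
        _ = bill.ToList();
        int checksAfterTakeIsEnumerated = checks;

        return (checksForFirst, checksBeforeTakeIsEnumerated, checksAfterTakeIsEnumerated);
    }
}
```

Run() returns (2, 0, 3): "Carol" is never examined in either query.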
I think First() will not scan the whole collection. It will return immediately after the first match. But I suggest using FirstOrDefault() instead.
EDIT:
Difference between First() and FirstOrDefault() (from MSDN):
The First() method throws an exception if source contains no elements. To instead return a default value when the source sequence is empty, use the FirstOrDefault() method.
Enumerable.First
Substitute .Where( with .SingleOrDefault(
This will find the first and only item for you.
But you can't do this for any given number. If you need 2 items, you'll need to get the entire collection.
However, you shouldn't worry about time. The most effort is used in opening a database connection and establishing a query. Executing the query doesn't take that much time, so there's no real reason to stop a query halfway :-)
I've noticed that certain command cause LINQtoSQL to connect to the database and download the records that are part of the query, for example, .ToArray().
Does the command .Cast() cause a query to execute (and how can I tell these things in the future?). For example...
IRevision<T> current = context.GetTable(typeof(T))
.Cast<IRevision<T>>()
.SingleOrDefault(o => o.ID == recordId);
I know there is a command for .GetTable that allows you to specify a generic type, but for strange and unexplainable reasons, it cannot be used in this situation.
From Enumerable.Cast()'s remarks:
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
The documentation of each LINQ operator states whether it uses deferred execution or immediate query execution. Additionally, here are the standard LINQ operators which are NOT deferred:
Aggregate
All
Any
Average
Contains
Count
ElementAt
ElementAtOrDefault
First
FirstOrDefault
Last
LastOrDefault
LongCount
Max
Min
SequenceEqual
Single
SingleOrDefault
Sum
ToArray
ToDictionary
ToList
ToLookup
No, it does not. It simply will perform a cast when you iterate through the IEnumerable.
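With LINQ to Objects you can see that the cast happens only during iteration. The sketch below is an illustration on an in-memory object[] (not LINQ to SQL, and the class name CastDeferredDemo is made up): the array deliberately contains one element that cannot be cast.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CastDeferredDemo
{
    // Returns true if the invalid cast only surfaced during enumeration.
    public static bool Run()
    {
        object[] mixed = { 1, 2, "three" };

        // No exception here: Cast only wraps the source.
        IEnumerable<int> numbers = mixed.Cast<int>();

        try
        {
            // Enumerating performs the casts, and "three" is not an int.
            numbers.ToList();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```

Run() returns true: the call to Cast succeeds, and the InvalidCastException appears only when ToList enumerates the sequence.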
There isn't any definitive way (in code) to know whether a method uses deferred execution. The documentation is going to be your best friend here, as it will tell you whether execution is deferred.
However, that doesn't mean that you can't make some assumptions if the documentation is unclear.
If you have a method that returns another list/structure (like ToList, ToArray), then it will have to execute the query in order to populate the new data structure.
If the method returns a scalar value, then it will have to execute the query to generate that scalar value.
Other than that, if it simply returns IEnumerable<T>, then it more-than-likely is deferring execution. However, that doesn't mean that it's guaranteed, it just means it is more-than-likely.
What you are looking for is called "Deferred Execution". Statements that defer execution only run when you attempt to access the data. Statements such as ToList execute immediately, as the data is needed to transform it into a list.
Cast can wait until you actually access it, so it is a deferred statement.