I was reading up on LINQ. I know that there are deferred and immediate queries. I know with deferred types running the query when it's enumerated allows any changes to the data set to be reflected in each enumeration. But I can't seem to find an answer if there's a mechanism in place to prevent the query from running if no changes occurred to the data set since the last enumeration.
I read on MSDN referring to LINQ queries:
Therefore, it follows that if a query is enumerated twice it will be executed twice.
Have I overlooked an obvious - but...?
Indeed, there is none. Actually, that's not quite true - some LINQ providers will spot trivial but common examples like:
int id = ...
var customer = ctx.Customers.SingleOrDefault(x => x.Id == id);
and will intercept that via the identity-manager, i.e. it will check whether it has a matching record already in the context; if it does: it doesn't execute anything.
You should note that the re-execution also has nothing to do with whether or not data has changed; it will re-execute (or not) regardless.
There are two take-away messages here:
don't iterate any enumerable more than once: not least, it isn't guaranteed to work at all
if you want to buffer data, put it into a list/array
So:
var list = someQuery.ToList();
will never re-execute the query, no matter how many times you iterate over list. Because list is not a query: it is a list.
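To make the difference concrete, here is a minimal sketch. The counter and the small array are made up for illustration; they are not from the question. It counts how often the filter actually runs when a query is enumerated twice, compared with iterating a buffered list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int executions = 0;
int[] source = { 1, 2, 3, 4 };

// Hypothetical query: the counter records every time the filter actually runs.
IEnumerable<int> query = source.Where(x => { executions++; return x % 2 == 0; });

// Enumerating the query twice runs the filter over the source twice.
foreach (var item in query) { }
foreach (var item in query) { }
int afterTwoEnumerations = executions; // 8: four elements, two passes

// ToList() runs the query once more and copies the results into a list.
List<int> list = query.ToList();
int afterToList = executions; // 12

// Iterating the list does not touch the query again.
foreach (var item in list) { }
foreach (var item in list) { }
int afterListLoops = executions; // still 12

Console.WriteLine($"{afterTwoEnumerations} {afterToList} {afterListLoops}");
```

Note that nothing here checks whether source changed between passes; the filter simply runs again on every enumeration.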
A third take-away would be:
if you have a context that lives long enough that it is interesting to ask about data migration, then you are probably holding your data-context for far, far too long - they are intended to be short-lived
Assuming the two following possible blocks of code inside of a view, with a model passed to it using something like return View(db.Products.Find(id));
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();

IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using Linq queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to linq. The short version is that your data is never touched (it remains idle in the source be that in-memory, or in-database, or wherever). When you construct a linq query all you are doing is creating an IEnumerable class that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating and then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break your loop early, you have saved some work - later items are never processed. That's the simple version.
Some linq operations cannot work this way. Orderby is the best example. Orderby has to know every piece of data because it is possible that the last piece retrieved from the source is the first piece that you are supposed to get. So when an operation such as orderby is in the pipe, it will actually cache your dataset internally. All data is pulled from the source and goes through the pipeline up to the orderby, and then the orderby becomes your new temporary data source for any operations that come afterwards in the expression. Even so, orderby tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including orderby in your query still doesn't do any work immediately; the work begins once you start enumerating.
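Here is a small sketch of both behaviours in Linq2Objects. The Source() iterator and its counter are hypothetical helpers that record how many items were actually pulled from the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int produced = 0;

// Hypothetical source that counts how many items have been pulled from it.
IEnumerable<int> Source()
{
    foreach (var value in new[] { 3, 1, 2, 5, 4 })
    {
        produced++;
        yield return value;
    }
}

// Streaming operator: stopping after the first match leaves the rest untouched.
int firstPositive = Source().Where(v => v > 0).First();
int producedByWhere = produced; // 1

// Building an OrderBy query does no work yet.
produced = 0;
IEnumerable<int> sorted = Source().OrderBy(v => v);
int producedBeforeEnumeration = produced; // 0

// Asking for just the first sorted item still forces the whole source to be read.
int smallest = sorted.First();
int producedByOrderBy = produced; // 5

Console.WriteLine($"{producedByWhere} {producedBeforeEnumeration} {producedByOrderBy}");
```

So Where streams one item at a time, while OrderBy defers its work until the first item is requested and then has to consume everything.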
To answer your question directly, your call to ToList is doing exactly that. OrderByDescending is caching the data from your datasource => ToList additionally persists it into a variable that you can actually touch (reviews) => where starts pulling records one at a time from reviews, and if it matches then your final ToList call is storing the results into yet another list in memory.
Beyond the memory implications, ToList is additionally thwarting deferred execution because it also STOPS the processing of your view at the time of the call, to entirely process that entire linq expression, in order to build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about is in the dozens. You'll never notice the difference at runtime because it happens so quickly. But when dealing with large scale datasets, deferring as much as possible for as long as possible in hopes that something will happen allowing you to cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still cache a copy of your dataset as processed through the pipeline up to that point, internally, sorted of course. That's ok, you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and for every next record you are then pulling from that cache, checking against the where clause, you get it or you don't based upon that where and have saved a couple of in memory copies and a lot of work.
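One caveat about your lead-in: db.Products.Find(id) is not deferred. Find returns a single entity rather than a query, so it runs as soon as it is called, before the view is rendered. What may still be deferred is the loading of Model.UserReviews: if lazy loading is enabled, that navigation property is fetched the first time the view touches it, and OrderByDescending and Where then run in memory over the loaded collection. Only when you build on a queryable source such as db.Products itself (a Linq2SQL datasource, say) will the operators you've specified reduce into SQL verbiage, with the entire expression deferred.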
Hope this helps! Read further on Deferred Execution. And if you want to know 'how' that works, look into c# iterators (yield return). There's a page somewhere on MSDN that I'm having trouble finding that contains the common linq operations, and whether they defer execution or not. I'll update if I can track that down.
/*edit*/ to clarify - all of the above is in the context of raw linq, or Linq2Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing orderby, or any other linq operation, if you can't think of a way to implement it with 'yield return', or without caching, then execution is not likely deferred and a cache copy is likely and/or a full enumeration... orderby, distinct, count, sum, etc... Linq2SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results because a list, is a list, and is in memory. But Linq2SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even orderby can be deferred in this way because the clause will be pushed down into your original datasource and then ignored in the pipe.
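As an illustration of the "can you implement it with yield return" test, here is a rough sketch of a Where-like operator. MyWhere is a made-up name; this is not the real implementation in System.Linq, just the same idea in a few lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical re-implementation of a Where-style filter as an iterator.
// Nothing in the body runs until the caller starts enumerating.
IEnumerable<T> MyWhere<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (var item in source)
    {
        if (predicate(item))
            yield return item; // hand back one item, then pause here
    }
}

var numbers = new List<int> { 1, 2, 3 };

// Only builds the iterator; the list has not been read yet.
IEnumerable<int> evens = MyWhere(numbers, n => n % 2 == 0);

// A change made before enumeration shows up in the results.
numbers.Add(4);

List<int> result = evens.ToList(); // 2 and 4

Console.WriteLine(string.Join(", ", result));
```

An operator like orderby has no equivalent one-item-at-a-time body, which is why it has to buffer.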
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, not possible to say. You haven't defined "better". Personally, I tend to prefer the second example...why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.
Today I noticed that when I run several LINQ statements on big data, the time taken may vary extremely.
Suppose we have a query like this:
var conflicts = features.Where(/* some condition */);
foreach (var c in conflicts) // log the conflicts
Where features is a list of objects representing rows in a table. These objects are quite complex, and even querying one simple property of them is a huge operation (including the actual database query, validation, state changes...), so I supposed that performing such a query would take a long time. Far from it: the first statement executes in a very small amount of time, whereas simply looping over the results takes forever.
However, if I convert the collection retrieved by the LINQ expression to a List using Enumerable.ToList(), the first statement runs a bit slower and looping over the results is very fast. Having said this, the total duration of the second approach is much less than when not converting to a list.
var conflicts = features.Where(/* some condition */).ToList();
foreach (var c in conflicts) // log the conflicts
So I suppose that var conflicts = features.Where does not actually query but prepares the data. But I do not understand why converting to a list and afterwards looping is so much faster. That's the actual question.
Does anybody have an explanation for this?
This statement just declares your intention:
var conflicts = features.Where(...);
to get the data that fulfills the criteria in the Where clause. Then when you write this
foreach (var c in conflicts)
the actual query will be executed and will start getting the results. This is called lazy loading. Another term we use for this is deferred execution. We defer the execution of the query until we need its data.
On the other hand, if you had done something like this:
var conflicts = features.Where(...).ToList();
an in-memory collection would have been created, in which the results of the query would have been stored. In this case the query would have been executed immediately.
Generally speaking, as you could read in wikipedia:
Lazy loading is a design pattern commonly used in computer programming to defer initialization of an object until the point at which it is needed. It can contribute to efficiency in the program's operation if properly and appropriately used. The opposite of lazy loading is eager loading.
Update
And I suppose this in-memory collection is much faster than when doing lazy load?
Here is a great article that answers your question.
Welcome to the wonderful world of lazy evaluation. With LINQ the query is not executed until the result is needed. Before you try to get the result (ToList() gets the result and puts it in a list) you are just creating the query. Think of it as writing code vs running the program. While this may be confusing and may cause the code to execute at unexpected times, and even multiple times if, for example, you foreach the query twice, this is actually a good thing. It allows you to have a piece of code that returns a query (not the result but the actual query) and have another piece of code create a new query based on the original query. For example, you may add additional filters on top of the original or page it.
The performance difference you are seeing is basically the database call happening at different places in your code.
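A small sketch shows where the work actually happens. The Expensive check and the log are invented for illustration; imagine the check standing in for the costly property access described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<string>();
int[] features = { 1, 2, 3 };

// Hypothetical stand-in for the costly per-row check.
bool Expensive(int feature)
{
    log.Add($"check {feature}");
    return feature != 2;
}

// Deferred: building the query logs nothing; the checks run inside the loop.
var conflicts = features.Where(f => Expensive(f));
log.Add("query built");
foreach (var c in conflicts)
    log.Add($"loop {c}");
string deferredOrder = string.Join(" | ", log);

// Immediate: ToList() runs every check up front; the loop only reads the list.
log.Clear();
var conflictList = features.Where(f => Expensive(f)).ToList();
log.Add("list built");
foreach (var c in conflictList)
    log.Add($"loop {c}");
string eagerOrder = string.Join(" | ", log);

Console.WriteLine(deferredOrder);
Console.WriteLine(eagerOrder);
```

In both variants each check runs exactly once. A large difference in total time, as reported in the question, would therefore have to come from something else, for example the query being enumerated more than once.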
In my application there are a fair number of existing "service commands" which generally return a List<TEntity>. However, I wrote them in such a way that any queries would not be evaluated until the very last statement, when they are cast ToList<TEntity> (or at least I think I did).
Now I need to start obtaining some "context-specific" information from the commands, and I am thinking about doing the following:
Keep existing commands largely the same as they are today, but make sure they return an IEnumerable<TEntity> rather than an IList<TEntity>.
Create new commands that call the old commands but return IEnumerable<TResult> where TResult is not an entity but rather a view model, result model, etc - some representation of the data that is useful for the application.
The first case in which I have needed this is while doing a search for a Group entity. In my schema, Groups come with User-specific permissions, but it is not realistic for me to spit out the entire list of users and permissions in my result - first, because there could be many users, second, because there are many permissions, and third, because that information should not be available to insufficiently-privileged users (ie a "guest" should not be able to see what a "member" can do).
So, I want to be able to take the result of the original command, an IEnumerable<Group>, and describe how each Group ought to be transformed into a GroupResult, given a specific input of User (by Username in this case).
If I try to iterate over the result of the original command with ForEach I know this will force the execution of the result and therefore potentially result in a needlessly longer execution time. What if I wanted to further compose the result of the "new" command (that returns GroupResult) and filter out certain groups? Then maybe I would be calculating a ton of privileges for the inputted user, only to filter out the parent GroupResult objects later on anyway!
I guess my question boils down to... how do I tell C# how I'd like to transform each member of the IEnumerable without necessarily doing it at the time the method is run?
To lazily cast an enumerable from one type to another you do this:
IEnumerable<TResult> result = source.Cast<TResult>();
This assumes that the elements of the source enumerable can be cast to TResult. If they can't you need to use a standard projection with .Select(x => ... ).
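A minimal sketch of both options follows, using boxed integers and a made-up string projection in place of Group and GroupResult. The counter is only there to show that the projection does not run until something enumerates the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<object> boxed = new object[] { 1, 2, 3 };

// Cast<T>() is lazy: each element is cast only when it is pulled.
IEnumerable<int> ints = boxed.Cast<int>();

// Select is the general projection; the counter shows when it really runs.
int projections = 0;
IEnumerable<string> labels = ints.Select(i => { projections++; return $"#{i}"; });
int beforeEnumeration = projections; // 0: nothing has been projected yet

// A later filter or Take composes on top; only the needed items are projected.
List<string> firstTwo = labels.Take(2).ToList();
int afterTake = projections; // 2

Console.WriteLine($"{beforeEnumeration} {afterTake} {string.Join(",", firstTwo)}");
```

This is the composition you are after: the transformation is described once, and further filtering by the caller decides how much of it is ever executed.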
Also, be careful returning IEnumerable<T> from a service or database, as often there are resources that you need to open to obtain the data, so you would need to make sure those resources are open whenever you try to evaluate the enumerable. Keeping a database connection open is a bad idea. I would be more inclined to return an array that you've cast as an IEnumerable<>.
However, if you really want to get an IEnumerable<> from a service or database that is truly lazy and will automatically refresh the data, then you need to try Microsoft's Reactive Framework Team's "Interactive Extensions" to help with it.
They have a nice IEnumerable<> extension called Using that makes a "hot" enumerable that opens a resource for each iteration.
It would look something like this:
var d =
    EnumerableEx
        .Using(
            () => new DB(),
            db => db.Data.Where(x => x == 2));
It creates a new DB instance every time the enumerable is iterated and will dispose of the database when the enumerable is completed. Something worth considering.
Use NuGet and look for "Ix-Main" for the Interactive Extensions.
You're looking for the yield return command.
When you define a method returning an IEnumerable, and return its data by yield return, the return value is iterated over in the consuming method. This is what it could look like:
IEnumerable<GroupResult> GetGroups(string userName)
{
    foreach (var group in context.Groups.Where(g => <some user-specific condition>))
    {
        var result = new GroupResult();
        ... // Further compose the result.
        yield return result;
    }
}
In consuming code:
var groups = GetGroups("tacos");
// At this point no enumeration has occurred yet. Any breakpoints in GetGroups
// have not been hit.
foreach (var g in groups)
{
    // Now iteration in GetGroups starts!
}
I have a report in which the user can select different filters to apply to a dataset.
At design time, I don't know what filters the user will apply, so I grab the whole dataset.
For instance, if the user wants to see contacts that live in Chicago, I first grab all contacts...
IList<contact> contacts = db.contacts.ToList();
then I check the form collection for filters and apply them...
contacts = contacts.Where(t => t.state == form["state"]).ToList();
The issue is that getting all contacts to filter down from is resource intensive. Is there a way to wait to retrieve the contacts from the db until I have all the parameters for the query?
I realize I'm doing this wrong, but I need to know the correct way to approach this.
Don't call ToList().
You're getting an IQueryable (which is also an IEnumerable) back from EF. Use the lazy evaluation to your advantage - it's especially useful with large datasets, as every call to .ToList() fills the memory with the list.
Just don't do the first call to ToList(). This pulls down all the data into memory. Until you do that, the data is in the form of an IEnumerable. This uses 'lazy evaluation', which means that the items to be enumerated aren't actually produced until they are requested. In the case of Linq To SQL, the database query won't be run until you do the second call to ToList().
An IEnumerable is not really a container, rather it's an object, containing a piece of code that knows how to get the next element in a sequence. This piece of code could be various things - it could be reading from an actual container, generating each value on-the-fly, or getting the values from somewhere else, like a database. A list however, is a container. When you call ToList, the next item is repeatedly extracted from the IEnumerable and placed in a List. At this point you have exactly a collection of objects in memory. You often call ToList when you want to stop having a nebulous 'thing' which gets values from somewhere and have an actual container full of actual elements. For example, an IEnumerable might not give you the same objects each time you use it. Once you have them in a List you know that they can't change. In your case you want to stay with that nebulous thing until you have decided exactly what you are going to have to ask the database for.
The way this works is fundamental to LINQ, and also is the main reason that Linq To SQL is so nice. I highly recommend that you read something about how Linq works, and also play around with List and IEnumerable a little bit. C# legend Jon Skeet's blog has loads of discussion of this, including an example reimplementation of much of Linq so you can see how it might work. If you read all of it you will be a Linq expert (and it will also be very late), but this is an introduction to how laziness works.
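As a rough sketch of the pattern, assuming a made-up in-memory source in place of db.contacts and a plain dictionary in place of the form collection: filters are stacked onto the query first, and ToList() is called once at the end. With a real Linq To SQL or EF context you would keep the variable typed as IQueryable<contact> so the filters end up in the generated SQL; the stand-in below can only show that no row is read before the final call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int rowsRead = 0;

// Hypothetical stand-in for db.contacts: counts every row that is actually read.
IEnumerable<(string Name, string State)> Contacts()
{
    var rows = new[] { ("Ann", "IL"), ("Bob", "NY"), ("Cid", "IL") };
    foreach (var row in rows)
    {
        rowsRead++;
        yield return row;
    }
}

// Hypothetical stand-in for the posted form values.
var form = new Dictionary<string, string> { ["state"] = "IL" };

// Start from the query, not from a list.
var query = Contacts();

// Add one filter per parameter the user actually supplied.
if (form.TryGetValue("state", out var state))
    query = query.Where(c => c.State == state);
if (form.TryGetValue("name", out var name))
    query = query.Where(c => c.Name == name);

int rowsReadBeforeToList = rowsRead; // 0: still only a description of the query

// Materialize once, after all filters are known.
var contacts = query.ToList();

Console.WriteLine($"{rowsReadBeforeToList} {rowsRead} {contacts.Count}");
```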
So I create this projection of a dictionary of items I would like to remove.
var toRemoveList =
this.outputDic.Keys.Where(key =>
this.removeDic.ContainsKey(key));
Then I iterate through the result removing from the actual dictionary
foreach(var key in toRemoveList)
this.outputDic.Remove(key);
However during that foreach an exception is thrown saying that the list was modified during the loop. But, how so? is the linq query somewhat dynamic and gets re evaluated every time the dictionary changes? A simple .ToArray() call on the end of the query solves the issues, but imo, it shouldn't even occur in the first place.
So I create this projection of a dictionary of items I would like to remove.
var toRemoveList =
this.outputDic.Keys.Where(key =>
this.removeDic.ContainsKey(key));
As I have often said, if I can teach people one thing about LINQ it is this: the result of a query expression is a query, not the results of executing the query. You now have an object that means "the keys of a dictionary such that the key is... something". It is not the results of that query, it is that query. The query is an object unto itself; it does not give you a result set until you ask for one.
Then you do this:
foreach(var key in toRemoveList)
this.outputDic.Remove(key);
So what are you doing? You are iterating over the query. Iterating over the query executes the query, so the query is iterating over the original dictionary. But you then remove an item from the dictionary, while you are iterating over it, which is illegal.
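The failure is easy to reproduce in isolation. The sketch below is an assumption-laden stand-in: it uses a List<string> rather than the dictionary from the question, because newer .NET runtimes tolerate Dictionary.Remove during enumeration while a List does not, so the list shows the problem on any runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var keys = new List<string> { "a", "b", "c" };

// A query over the list, not a copy of it.
IEnumerable<string> toRemove = keys.Where(k => k != "b");

string error = "";
try
{
    // Iterating the query iterates the list itself, so removing from the
    // list inside the loop modifies the collection being enumerated.
    foreach (var key in toRemove)
        keys.Remove(key);
}
catch (InvalidOperationException ex)
{
    error = ex.GetType().Name;
}

Console.WriteLine($"{error}, remaining: {string.Join(",", keys)}");
```

The first removal succeeds; the exception surfaces when the loop asks the query for the next item.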
imo, it shouldn't even occur in the first place.
Your opinion about how the world should be is a common one, but doing it your way leads to deep inefficiencies. Let us suppose that creating a query executes the query immediately rather than creates a query object. What does this do?
var query = expensiveRemoteDatabase
.Where(somefilter)
.Where(someotherfilter)
.OrderBy(something);
The first call to Where produces a query, which in your world is then executed, pulling down from the remote database all records which match that query. The second call to Where then says "oh, sorry, I meant to also apply this filter here as well, can we do that whole query again, this time with the second filter?" and so then that whole record set is computed, and then we say "oh, no, wait, I forgot to tell you when you built that last query object, we're going to need to sort it, so database, can you run this query for me a third time?"
Now perhaps do you see why queries produce a query that then does not execute until it needs to?
The reason is that toRemoveList does not contain a list of things to be removed, it contains a description of how to get a list of things that can be removed.
If you step through this in a debugger using F11 you can see this quite clearly for yourself. The first point it stops is with the cursor on foreach which is what you would expect.
Next you stop at toRemoveList (the one in foreach(var key in toRemoveList)). This is where it is setting up the iterator.
When you step through var key (with F11) however it now jumps into the original definition of toRemoveList, specifically the this.removeDic.ContainsKey(key) part. Now you get an idea of what is really happening.
The foreach is calling the iterator's MoveNext method to move to the next point in the dictionary's keys, and the iterator holds onto the dictionary. After you call this.outputDic.Remove(key), the next MoveNext detects that the collection was modified before the iterator had finished, and therefore stops you with this error.
As everybody is saying on here, the correct way to solve this is to use ToArray()/ToList(), as what these do is give you another copy of the list. So then you have one to step through, and one to remove from.
The .ToArray() solves the issues because it forces you to evaluate the entire enumeration and cache the local values. Without doing so, when you enumerate through it the enumerable attempts to calculate the first index, return that, then return to the collection and calculate the next index. If the underlying collection you're iterating over changes, you can no longer guarantee that the enumeration will return the appropriate value.
In short: just force the evaluation with .ToArray() (or .ToList(), or whatever).
The LINQ query uses deferred execution. It streams the items one by one, returning them as they're requested. So yes, every time you remove a key you change the collection the query is still reading from, which is why it throws an exception. When you invoke ToArray() it forces execution of the query, which is why it works.
EDIT: This is somewhat in response to your comments. Check out iterator blocks on MSDN; this is the mechanism being used when your foreach executes. Your query just gets turned into a chain of iterators, and the filters, projections, operations, etc. are applied to the elements one by one as they're retrieved, unless that is not possible.
The reason you are getting this error is the deferred execution of LINQ. To explain it fully: the data is actually fetched from the dictionary while your loop runs. The modification of outputDic therefore takes place during that enumeration, and you are not allowed to modify the collection you are looping over. This is why you get this error. You can get rid of it by forcing the query to execute before you run the loop.
var toRemoveList =
this.outputDic.Keys.Where(key =>
this.removeDic.ContainsKey(key)).ToList();
Notice the ToList() in the above statement. It will make sure that your query has been executed and you have your list in toRemoveList.
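For completeness, here is a self-contained sketch of the fixed version. The dictionary contents are invented; the names follow the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample data standing in for the dictionaries from the question.
var outputDic = new Dictionary<string, int> { ["a"] = 1, ["b"] = 2, ["c"] = 3 };
var removeDic = new Dictionary<string, int> { ["a"] = 0, ["c"] = 0 };

// ToList() executes the query now and stores the matching keys in a separate list.
List<string> toRemoveList = outputDic.Keys
    .Where(k => removeDic.ContainsKey(k))
    .ToList();

// The loop walks the snapshot, so modifying the dictionary is safe.
foreach (var key in toRemoveList)
    outputDic.Remove(key);

Console.WriteLine(string.Join(",", outputDic.Keys)); // only "b" is left
```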