Here is a test that i have setup this evening. It was made to prove something different, but the outcome was not quite as i expected.
I'm running a test with 10000 random queries on an IQueryable and while testing i found out that if i do the same on a List, my test is 20 times faster.
See below. My CarBrandManager.GetList originally returns an IQueryable, but now i first issue a ToList(), and then it's way faster.
Can anyone tell me something about why i see this big difference?
var sw = new Stopwatch();
sw.Start();
int queries = 10000;
//IQueryable<Model.CarBrand> carBrands = CarBrandManager.GetList(context);
List<Model.CarBrand> carBrands = CarBrandManager.GetList(context).ToList();
Random random = new Random();
int randomChar = 65;
for (int i = 0; i < queries; i++)
{
randomChar = random.Next(65, 90);
Model.CarBrand carBrand = carBrands.Where(x => x.Name.StartsWith(((char)randomChar).ToString())).FirstOrDefault();
}
sw.Stop();
lblStopWatch.Text = String.Format("Queries: {0} Elapsed ticks: {1}", queries, sw.ElapsedTicks);
There are potentially two issues at play here. First: It's not obvious what type of collection is returned from GetList(context), apart from the knowledge that it implements IQueryable. That means when you evaluate the result, it could very well be creating an SQL query, sending that query to a database, and materializing the result into objects. Or it could be parsing an XML file. Or downloading an RSS feed or invoking an OData endpoint on the internet. These would obviously take more time than simply filtering a short list in memory. (After all, how many car brands can there really be?)
But let's suppose that the implementation it returns is actually a List, and therefore the only difference you're testing is whether it's cast as an IEnumerable or as an IQueryable. Compare the method signatures on the Enumerable class's extension methods with those on Queryable. When you treat the list as an IQueryable, you are passing in Expressions, which need to be evaluated, rather than just Funcs which can be run directly.
When you're using a custom LINQ provider like Entity Framework, this gives the framework the ability to evaluate the actual expression trees and produce a SQL query and materialization plan from them. However, LINQ to Objects just wants to evaluate the lambda expressions in-memory, so it has to either use reflection or compile the expressions into Funcs, both of which have a large performance hit associated with them.
You may be tempted to just call .ToList() or .AsEnumerable() on the result set to force it to use Funcs, but from an information hiding perspective this would be a mistake. You would be assuming that you know that the data returned from the GetList(context) method is some kind of in-memory object. That may be the case at the moment, or it may not. Regardless, it's not part of the contract that is defined for the GetList(context) method, and therefore you cannot assume it will always be that way. You have to assume that the type you get back could very well be something that you can query. And even though there are probably only a dozen car brands to search through at the moment, it's possible that some day there will be thousands (I'm talking in terms of programming practice here, not necessarily saying this is the case with the car industry). So you shouldn't assume that it will always be faster to download the entire list of cars and filter them in memory, even if that happens to be the case right now.
If the CarBrandManager.GetList(context) might return an object backed by a custom LINQ provider (like an Entity Framework collection), then you probably want to leave the data cast as an IQueryable: even though your benchmark shows it being 20 times faster to use a list, that difference is so small that no user is ever going to be able to tell the difference. You may one day see performance gains of several orders of magnitude by calling .Where().Take().Skip() and only loading the data you really need from the data store, whereas you'd end up loading the whole table into your system's memory if you call .ToList() on right off the bat.
However, if you know that CarBrandManager.GetList(context) will always return an in-memory list (as the name implies), it should be changed to return an IEnumerable<Model.CarBrand> instead of an IQueryable<Model.CarBrand>. Or, if you're on .NET 4.5, perhaps an IReadOnlyList<Model.CarBrand> or IReadOnlyCollection<Model.CarBrand>, depending on what contract you're willing to force your CarManager to abide by.
Related
Assuming the two following possible blocks of code inside of a view, with a model passed to it using something like return View(db.Products.Find(id));
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();
IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using Linq queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to linq. The short version is that your data is never touched (it remains idle in the source be that in-memory, or in-database, or wherever). When you construct a linq query all you are doing is creating an IEnumerable class that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating and then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break your loop early, you have saved some work - later items are never processed. That's the simple version.
Some linq operations cannot work this way. Orderby is the best example. Orderby has to know every piece of data because it possible that the last piece retrieved from the source very well could be the first piece that you are supposed to get. So when an operation such as orderby is in the pipe, it will actually cache your dataset internally. So all data has been pulled from the source, and has gone through the pipeline, up to the orderby, and then the orderby becomes your new temporary data source for any operations that come afterwards in the expression. Even so, orderby tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including orderby in your query still doesn't do any work, immediately, but the work begins once you start enumerating.
To answer your question directly, your call to ToList is doing exactly that. OrderByDescending is caching the data from your datasource => ToList additionally persists it into a variable that you can actually touch (reviews) => where starts pulling records one at a time from reviews, and if it matches then your final ToList call is storing the results into yet another list in memory.
Beyond the memory implications, ToList is additionally thwarting deferred execution because it also STOPS the processing of your view at the time of the call, to entirely process that entire linq expression, in order to build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about are in the dozens. You'll never notice the difference at runtime because it happens so quick. But when dealing with large scale datasets, deferring as much as possible for as long as possible in hopes that something will happen allowing you to cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still cache a copy of your dataset as processed through the pipeline up to that point, internally, sorted of course. That's ok, you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and for every next record you are then pulling from that cache, checking against the where clause, you get it or you don't based upon that where and have saved a couple of in memory copies and a lot of work.
Magically, I bet even your lead-in of db.Products.Find(id) doesn't even start spinning until your view starts enumerating (if not using ToList). If db.Products is a Linq2SQL datasource, then every other element you've specified will reduce into SQL verbiage, and your entire expression will be deferred.
Hope this helps! Read further on Deferred Execution. And if you want to know 'how' that works, look into c# iterators (yield return). There's a page somewhere on MSDN that I'm having trouble finding that contains the common linq operations, and whether they defer execution or not. I'll update if I can track that down.
/*edit*/ to clarify - all of the above is in the context of raw linq, or Linq2Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing orderby, or any other linq operation, if you can't think of a way to implement it with 'yield return', or without caching, then execution is not likely deferred and a cache copy is likely and/or a full enumeration... orderby, distinct, count, sum, etc... Linq2SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results because a list, is a list, and is in memory. But Linq2SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even orderby can be deferred in this way because the clause will be pushed down into your original datasource and then ignored in the pipe.
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, not possible to say. You haven't defined "better". Personally, I tend to prefer the second example...why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.
today I noticed that when I run several LINQ-statements on big data the time taken may vary extremely.
Suppose we have a query like this:
var conflicts = features.Where(/* some condition */);
foreach (var c in conflicts) // log the conflicts
Where features is a list of objects representing rows in a table. Hence these objects are quite complex and even querying one simple property of them is a huge operation (including the actual database-query, validation, state-changes...) I suppose performing such a query takes much time. Far wrong: the first statement executes in a quite small amount of time, whereas simply looping the results takes eternally.
However, If I convert the collection retrieved by the LINQ-expression to a List using IEnumerable#ToList() the first statement runs a bit slower and looping the results is very fast. Having said this the complete duration-time of second approach is much less than when not converting to a list.
var conflicts = features.Where(/* some condition */).ToList();
foreach (var c in conflicts) // log the conflicts
So I suppose that var conflicts = features.Where does not actually query but prepares the data. But I do not understand why converting to a list and afterwards looping is so much faster then. That´s the actual question
Has anybody an explanation for this?
This statement, just declare your intention:
var conflicts = features.Where(...);
to get the data that fullfils the criteria in Where clause. Then when you write this
foreach (var c in conflicts)
The the actual query will be executed and will start getting the results. This is called lazy loading. Another term we use for this is the deffered execution. We deffer the execution of the query, until we need it's data.
On the other hand, if you had done something like this:
var conflicts = features.Where(...).ToList();
an in memory collection would have been created, in which the results of the query would had been stored. In this case the query, would had been executed immediately.
Generally speaking, as you could read in wikipedia:
Lazy loading is a design pattern commonly used in computer programming
to defer initialization of an object until the point at which it is
needed. It can contribute to efficiency in the program's operation if
properly and appropriately used. The opposite of lazy loading is eager
loading.
Update
And I suppose this in-memory collection is much faster then when doing
lazy load?
Here is a great article that answers your question.
Welcome to the wonderful world of lazy evaluation. With LINQ the query is not executed until the result is needed. Before you try to get the result (ToList() gets the result and puts it in a list) you are just creating the query. Think of it as writing code vs running the program. While this may be confusing and may cause the code to execute at unexpected times and even multiple times if you for example you foreach the query twice, this is actually a good thing. It allows you to have a piece of code that returns a query (not the result but the actual query) and have another piece of code create a new query based on the original query. For example you may add additional filters on top of the original or page it.
The performance difference you are seeing is basically the database call happening at different places in your code.
I'm looking to retrieve the last row of a table by the table's ID column. What I am currently using works:
var x = db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault();
Is there any way to get the same result with more efficient speed?
I cannot see that this queries through the entire table.
Do you not have an index on the ID column?
Can you add the results of analysing the query to your question, because this is not how it should be.
As well as the analysis results, the SQL produced. I can't see how it would be anything other than select top 1 * from MyTable order by id desc only with explicit column names and some aliasing. Nor if there's an index on id how it's anything other than a scan on that index.
Edit: That promised explanation.
Linq gives us a set of common interfaces, and in the case of C# and VB.NET some keyword support, for a variety of operations upon sources which return 0 or more items (e.g. in-memory collections, database calls, parsing of XML documents, etc.).
This allows us to express similar tasks regardless of the underlying source. Your query for example includes the source, but we could do a more general form of:
public static YourType FinalItem(IQueryable<YourType> source)
{
return source.OrderByDesending(d => d.ID).FirstOrDefault();
}
Now, we could do:
IEnumerable<YourType> l = SomeCallThatGivesUsAList();
var x = FinalItem(db.MyTable);//same as your code.
var y = FinalItem(l);//item in list with highest id.
var z = FinalItem(db.MyTable.Where(d => d.ID % 10 == 0);//item with highest id that ends in zero.
But the really important part, is that while we've a means of defining the sort of operation we want done, we can have the actual implementation hidden from us.
The call to OrderByDescending produces an object that has information on its source, and the lambda function it will use in ordering.
The call to FirstOrDefault in turn has information on that, and uses it to obtain a result.
In the case with the list, the implementation is to produce the equivalent Enumerable-based code (Queryable and Enumerable mirror each other's public members, as do the interfaces they use such as IOrderedQueryable and IOrderedEnumerable and so on).
This is because, with a list that we don't know is already sorted in the order we care about (or in the opposite order), there isn't any faster way than to examine each element. The best we can hope for is an O(n) operation, and we might get an O(n log n) operation - depending on whether the implementation of the ordering is optimised for the possibility of only one item being taken from it*.
Or to put it another way, the best we could hand-code in code that only worked on enumerables is only slightly more efficient than:
public static YourType FinalItem(IEnumerable<YourType> source)
{
YourType highest = default(YourType);
int highestID = int.MinValue;
foreach(YourType item in source)
{
curID = item.ID;
if(highest == null || curID > highestID)
{
highest = item;
highestID = curID;
}
}
return highest;
}
We can do slightly better with some micro-opts on handling the enumerator directly, but only slightly and the extra complication would just make for less-good example code.
Since we can't do any better than that by hand, and since the linq code doesn't know anything more about the source than we do, that's the best we could possibly hope for it matching. It might do less well (again, depending on whether the special case of our only wanting one item was thought of or not), but it won't beat it.
However, this is not the only approach linq will ever take. It'll take a comparable approach with an in-memory enumerable source, but your source isn't such a thing.
db.MyTable represents a table. To enumerate through it gives us the results of an SQL query more or less equivalent to:
SELECT * FROM MyTable
However, db.MyTable.OrderByDescending(d => d.ID) is not the equivalent of calling that, and then ordering the results in memory. Because queries get processed as a whole when they are executed, we actually get the result of an SQL query more or less like:
SELECT * FROM MyTable ORDER BY id DESC
Finally, the entire query db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault() results in a query like:
SELECT TOP 1 * FROM MyTable ORDER BY id DESC
Or
SELECT * FROM MyTable ORDER BY id DESC LIMIT 1
Depending upon what sort of database server you are using. Then the results get passed to code equivalent to the following ADO.NET-based code:
return dataReader.Read() ?
new MyType{ID = dataReader.GetInt32(0), dataReader.GetInt32(1), dataReader.GetString(2)}//or similar
: null;
You can't get much better.
And as for that SQL query. If there's an index on the id column (and since it looks like a primary key, there certainly should be), then that index will be used to very quickly find the row in question, rather than examining each row.
In all, because different linq providers use different means to fulfil the query, they can all try their best to do so in the best way possible. Of course, being in an imperfect world we'll no doubt find that some are better than others. What's more, they can even work to pick the best approach for different conditions. One example of this is that database-related providers can choose different SQL to take advantage of features of different versions of databases. Another is that the implementation of the the version of Count() that works with in memory enumerations works a bit like this;
public static int Count<T>(this IEnumerable<T> source)
{
var asCollT = source as ICollection<T>;
if(asCollT != null)
return asCollT.Count;
var asColl = source as ICollection;
if(asColl != null)
return asColl.Count;
int tally = 0;
foreach(T item in source)
++tally;
return tally;
}
This is one of the simpler cases (and a bit simplified again in my example here, I'm showing the idea not the actual code), but it shows the basic principle of code taking advantage of more efficient approaches when they're available - the O(1) length of arrays and the Count property on collections that is sometimes O(1) and it's not like we've made things worse in the cases where it's O(n) - and then when they aren't available falling back to a less efficient but still functional approach.
The result of all of this is that Linq tends to give very good bang for buck, in terms of performance.
Now, a decent coder should be able to match or beat its approach to any given case most of the time†, and even when Linq comes up with the perfect approach there are some overheads to it itself.
However, over the scope of an entire project, using Linq means that we can concisely create reasonably efficient code that relates to a relatively constrained number of well defined entities (generally one per table as far as databases go). In particular, the use of anonymous functions and joins means that we get queries that are very good. Consider:
var result = from a in db.Table1
join b in db.Table2
on a.relatedBs = b.id
select new {a.id, b.name};
Here we're ignoring columns we don't care about here, and the SQL produced will do the same. Consider what we would do if we were creating the objects that a and b relate to with hand-coded DAO classes:
Create a new class to represent this combination of a's id and b's name, and relevant code to run the query we need to produce instances.
Run a query to obtain all information about each a and the related b, and live with the waste.
Run a query to obtain the information on each a and b that we care of, and just set default values for the other fields.
Of these, option 2 will be wasteful, perhaps very wasteful. Option 3 will be a bit wasteful and very error prone (what if we accidentally try to use a field elsewhere in the code that wasn't set correctly?). Only option 1 will be more efficient than what the linq approach will produce, but this is just one case. Over a large project this could mean producing dozens or even hundreds or thousands of slightly different classes (and unlike the compiler, we won't necessarily spot the cases where they're actually the same). In practice, therefore, linq can do us some great favours when it comes to efficiency.
Good policies for efficient linq are:
Stay with the type of query you start with as long as you can. Whenever you grab items into memory with ToList() or ToArray etc, consider if you really need to. Unless you need to or you can clearly state the advantage doing so gives you, don't.
If you do need to move to processing in memory, favour AsEnumerable() over ToList() and the other means, so you only grab one at a time.
Examine long-running queries with SQLProfiler or similar. There are a handful of cases where policy 1 here is wrong and moving to memory with AsEnumerable() is actually better (most relate to uses of GroupBy that don't use aggregates on the non-grouped fields, and hence don't actually have a single SQL query they correspond with).
If a complicated query is hit many times, then CompiledQuery can help (less so with 4.5 since it has automatic optimisations that cover some of the cases it helps in), but it's normally better to leave that out of the first approach and then use it only in hot-spots that are efficiency problems.
You can get EF to run arbitrary SQL, but avoid it unless it's a strong gain because too much such code reduces the consistent readability using a linq approach throughout gives (I have to say though, I think Linq2SQL beats EF on calling stored procedures and even more so on calling UDFs, but even there this still applies - it's less clear from just looking at the code how things relate to each other).
*AFAIK, this particular optimisation isn't applied, but we're talking about the best possible implementation at this point, so it doesn't matter if it is, isn't, or is in some versions only.
†I'll admit though that Linq2SQL would often produce queries that use APPLY that I would not think of, as I was used to thinking of how to write queries in versions of SQLServer before 2005 introduced it, while code doesn't have those sort of human tendencies to go with old habits. It pretty much taught me how to use APPLY.
I have a List/IEnumerable of objects and I'd like to perform a calculation on some of them.
e.g.
myList.Where(f=>f.Calculate==true).Calculate();
to update myList, based on the Where clause, so that the required calulcation is performed and the entire list updated as appropriate.
The list contains "lines" where an amount is either in Month1, Month2, Month3...Month12, Year1, Year2, Year3-5 or "Long Term"
Most lines are fixed and always fall into one of these months, but some "lines" are calulcated based upon their "Maturity Date".
Oh, and just to complicate things! the list (at the moment) is of an anonymous type from a couple of linq queries. I could make it a concrete class if required though, but I'd prefer not to if I can avoid it.
So, I'd like to call a method that works on only the calculated lines, and puts the correct amount into the correct "month".
I'm not worried about the calculation logic, but rather how to get this into an easily readable method that updates the list without, ideally, returning a new list.
[Is it possible to write a lambda extension method to do both the calculation AND the where - or is this overkill anyway as Where() already exists?]
Personally, if you want to update the list in place, I would just use a simple loop. It will be much simpler to follow and maintain:
for (int i=0;i<list.Count;++i)
{
if (list[i].ShouldCalculate)
list[i] = list[i].Calculate();
}
This, at least, is much more obvious that it's going to update. LINQ has the expectation of performing a query, not mutating the data.
If you really want to use LINQ for this, you can - but it will still require a copy if you want to have a List<T> as your results:
myList = myList.Select(f => f.ShouldCalculate ? f.Calculate() : f).ToList();
This would call your Calculate() method as needed, and copy the original when not needed. It does require a copy to create a new List<T>, though, as you mentioned that was a requirement (in comments).
However, my personal preference would still be to use a loop in this case. I find the intent much more clear - plus, you avoid the unnecessary copy operation.
Edit #2:
Given this comment:
Oh, and just to complicate things! the list (at the moment) is of an anonymous type from a couple of linq queries
If you really want to use LINQ style syntax, I would recommend just not calling ToList() on your original queries. If you leave them in their original, IEnumerable<T> form, you can easily do my second option above, but on the original query:
var myList = query.Select(f => f.ShouldCalculate ? f.Calculate() : f).ToList();
This has the advantage of only constructing the list one time, and preventing the copy, as the original sequence will not get evaluated until this operation.
LINQ is mostly geared around side-effect-free queries, and anonymous types themselves are immutable (although of course they can maintain references to mutable types).
Given that you want to mutate the list in place, LINQ isn't a great fit.
As per Reed's suggestion, I would use a straight for loop. However, if you want to perform different calculations at different points, you could encapsulate this:
public static void Recalculate<T>(IList<T> list,
Func<T, bool> shouldCalculate,
Func<T, T> calculation)
{
for (int i = 0; i < list.Count; i++)
{
if (shouldCalculate(items[i]))
{
items[i] = calculation(items[i]);
}
}
}
If you really want to use this in a fluid way, you could make it return the list - but I would personally be against that, as it would then look like it was side-effect-free like LINQ.
And like Reed, I'd also prefer to do this by creating a new sequence...
Select doesn't copy or clone the objects it passes to the passed delegate, any state changes to that object will be reflected through the reference in the container (unless it is a value type).
So updating reference types is not a problem.
To replace the objects (or when working with value types1) this are more complex and there is no inbuilt solution with LINQ. A for loop is clearest (as with the other answers).
1 Remembering, of course, that mutable value types are evil.
More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n) unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, If I understand correctly, this query will be resolved through enumeration, so it should take O(n), even in list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special case for when the object implementing IEnumerable actually implements e.g. ICollection. (I've seen this for IEnumerable.Contains at least.)
In practice this means that LINQ IEnumerable.Contains calls the fast HashSet.Contains for example if the IEnumerable actually is a HashSet.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }
You can use reflector to check exactly how the LINQ methods are defined, that is how I figured this out.
Oh, and also LINQ contains methods IEnumerable.ToDictionary (maps key to single value) and IEnumerable.ToLookup (maps key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.