I have a full outer join query pulling data from a SQL Compact database (I use EF6 for mapping):
var query =
from entry in left.Union(right).AsEnumerable()
select new
{
...
} into e
group e by e.Date.Year into year
select new
{
Year = year.Key,
Quartals = from x in year
group x by (x.Date.Month - 1) / 3 + 1 into quartal
select new
{
Quartal = quartal.Key,
Months = from x in quartal
group x by x.Date.Month into month
select new
{
Month = month.Key,
Contracts = from x in month
group x by x.Contract.extNo into contract
select new
{
ExtNo = contract.Key,
Entries = contract,
}
}
}
};
As you can see, I use nested groupings to structure the results.
The interesting thing is that if I remove the AsEnumerable() call, the query takes about 3.5x longer to execute: ~210 ms vs ~60 ms. And when it runs for the first time, the difference is much greater: 39000(!) ms vs 1300 ms.
My questions are:
What am I doing wrong? Maybe those groupings should be done in a different way?
Why does the first execution take so much time? I know expression trees have to be built etc., but 39 seconds?
Why is linq to db slower than linq to entities in my case? Is it generally slower, and is it better to load data from the db before processing, if possible?
Thanks!
To answer your three questions:
Maybe those groupings should be done in a different way?
No. If you want nested groupings you can only do that by groupings within groupings.
You can group by multiple fields at once:
from entry in left.Union(right)
select new
{
...
} into e
group e by new
{
e.Date.Year,
Quartal = (e.Date.Month - 1) / 3 + 1,
e.Date.Month,
contract = e.Contract.extNo
} into grp
select new
{
    Year = grp.Key.Year,
    Quartal = grp.Key.Quartal,
    Month = grp.Key.Month,
    ExtNo = grp.Key.contract,
    Entries = grp
};
This will remove a lot of complexity from the generated query so it's likely to be (much) faster without AsEnumerable(). But the result is quite different: a flat group (Year, Quartal, etc, in one row), not a nested grouping.
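If you still need the nested shape for presentation, one possible compromise (a sketch I'm adding, not part of the original answer, using a hypothetical flatQuery variable for the query above) is to let the database do the flat grouping and rebuild the hierarchy in memory afterwards:

// Sketch: materialize the flat-group query once, then regroup the
// (much smaller) result set in memory with LINQ-to-Objects.
var flat = flatQuery.ToList();

var nested =
    from row in flat
    group row by row.Year into year
    select new
    {
        Year = year.Key,
        Quartals = from y in year
                   group y by y.Quartal into quartal
                   select new
                   {
                       Quartal = quartal.Key,
                       Months = from q in quartal
                                group q by q.Month into month
                                select new
                                {
                                    Month = month.Key,
                                    Contracts = month // each row already carries ExtNo and Entries
                                }
                   }
    };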
Why does the first execution take so much time?
Because the generated SQL query is probably pretty complex and the database engine's query optimizer can't find a fast execution path.
3a. Why is linq to db slower than linq to entities in my case?
Because, apparently, in this case it's much more efficient to fetch the data into memory first and do the groupings with LINQ-to-Objects. This effect will be more significant if left and right represent more or less complex queries themselves. In that case, the generated SQL can get hugely bloated, because it has to process two sources of complexity in one statement, which may lead to many repeated identical subqueries. By outsourcing the grouping, the database is probably left with a relatively simple query, and of course the grouping in memory is never affected by the complexity of the SQL query.
3b. Is it generally slower and is it better to load data from the db before processing, if possible?
No, not generally. I'd even say hardly ever. In this case it is because (as far as I can see) you don't filter the data. If, however, the part before AsEnumerable() returned millions of records and you applied filtering afterwards, the query without AsEnumerable() would probably be much faster, because the filtering would be done in the database.
Therefore, you should always keep an eye on the generated SQL. It's unrealistic to expect that EF will always generate a super optimized SQL statement. It hardly ever will. Its primary focus is on correctness (and it does an exceptional job there); performance is secondary. It's the developer's job to make LINQ-to-Entities and LINQ-to-Objects work together as a slick team.
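If you are on EF6, one low-effort way to keep that eye on the generated SQL (a minimal sketch, assuming a DbContext-based model and a hypothetical MyContext class) is to hook up the built-in command logger:

using (var context = new MyContext())
{
    // EF6: every SQL command the context sends is passed to this delegate.
    context.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

    // Run the grouping query here and inspect the generated SQL in the output window.
}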
Using AsEnumerable() will convert a type that implements IEnumerable<T> to IEnumerable<T> itself.
Read this topic https://msdn.microsoft.com/en-us/library/bb335435.aspx
AsEnumerable<TSource>(IEnumerable<TSource>) can be used to choose between query implementations when a sequence implements IEnumerable<T> but also has a different set of public query methods available. For example, given a generic class Table that implements IEnumerable<T> and has its own methods such as Where, Select, and SelectMany, a call to Where would invoke the public Where method of Table. A Table type that represents a database table could have a Where method that takes the predicate argument as an expression tree and converts the tree to SQL for remote execution. If remote execution is not desired, for example because the predicate invokes a local method, the AsEnumerable<TSource> method can be used to hide the custom methods and instead make the standard query operators available.
When you invoke AsEnumerable() first, the query isn't converted to SQL; instead the table is loaded into memory as the Where enumerates it. Since the data is now in memory, its execution is faster.
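To make that boundary concrete, here is a minimal sketch (hypothetical Products table and local method) showing which part becomes SQL and which part runs in memory:

// Everything before AsEnumerable() stays IQueryable and is translated to SQL;
// everything after it is LINQ-to-Objects over the streamed results.
var result = context.Products
    .Where(p => p.Price < 100)        // runs in the database: WHERE Price < 100
    .AsEnumerable()                   // switch to in-memory operators from here on
    .Where(p => SomeLocalMethod(p))   // runs in memory; could not be translated to SQL
    .ToList();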
Related
Our front end UI has a filtering system that, in the back end, operates over millions of rows. It uses an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that have both of what is selected in common). This is not a problem. However, Dropdown3 has two types of data in it, and the checked items need to be ORed together, then ANDed with the rest of the query.
Due to the large number of rows it is operating over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
driverIds = db.CarDriversManyToManyTable.Where(
    cd =>
        filter.CarIds.Contains(cd.CarId) //get driver IDs for each car ID listed in filter object
    ).Select(cd => cd.DriverId).Distinct().ToList();
driverIds = driverIds.Concat(
db.DriverShopManyToManyTable.Where(ds => filter.ShopIds.Contains(ds.ShopId)) //Get driver IDs for each Shop listed in filter object
.Select(ds => ds.DriverId)
.Distinct()).Distinct().ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?
There are several ways to produce a single SQL query. All of them require keeping the parts of the query as IQueryable<T>, i.e. not using ToList, ToArray, AsEnumerable, or other methods that force them to be executed and evaluated in memory.
One way is to create a Union query containing the filtered Ids (which will be unique by definition) and use the join operator to apply it to the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
.Where(cd => filter.CarIds.Contains(cd.CarId))
.Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
.Where(ds => filter.ShopIds.Contains(ds.ShopId))
.Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way is to use two OR-ed Any-based conditions, which translate to an EXISTS(...) OR EXISTS(...) filter in the SQL query:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination: .Skip(PageNum * PageSize).Take(PageSize). I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever smaller number seems reasonable to you.
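As a sketch (note that EF requires an OrderBy before Skip/Take can be translated):

// 0-based page number; ordering is required for a deterministic page.
var page = query
    .OrderBy(d => d.Id)
    .Skip(PageNum * PageSize)
    .Take(PageSize)
    .ToList();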
You've mentioned that you need to use joins to get the data you need. These joins can be done while forming your IQueryable (entity framework), rather than in-memory (linq to objects). Read up on join syntax in linq.
HOWEVER - performing explicit joins in LINQ is not the best practice, especially if you are designing the database yourself. If you are doing database-first generation of your entities, consider placing foreign-key constraints on your tables. This will allow database-first entity generation to pick those up and provide you with navigation properties, which will greatly simplify your code.
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an entity framework linq query that uses explicit joins as a last resort.
To speed such queries up, you will likely need indexes on all of the "key" columns that you are joining on. The best way to figure out what indexes you need to improve performance is to take the SQL query generated by your EF LINQ and bring it over to SQL Server Management Studio. From there, update the generated SQL to provide some predefined values for its @p parameters, just as an example. Once you've done this, right-click on the query and use either Display Estimated Execution Plan or Include Actual Execution Plan. If indexing can improve your query performance, there is a pretty good chance that this feature will tell you about it and even provide you with scripts to create the indexes you need.
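If you're not sure how to get at that generated SQL in the first place, in EF6 calling ToString() on the not-yet-enumerated query should give you the SELECT it intends to send (a quick sketch):

// EF6: the query has not been executed yet, so ToString() returns the SQL text.
// Paste this into SSMS, fill in values for the parameters, and check the plan.
string generatedSql = query.ToString();
System.Diagnostics.Debug.WriteLine(generatedSql);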
It looks to me like using the instance versions of the LINQ extensions is creating several collections before you're done. Using the from-statement versions should cut that down quite a bit:
driverIds = (from record in db.CarDriversManyToManyTable
             where filter.CarIds.Contains(record.CarId)
             select record.DriverId)
            .Concat
            (from record in db.DriverShopManyToManyTable
             where filter.ShopIds.Contains(record.ShopId)
             select record.DriverId)
            .Distinct()
            .ToList();
Also, using the GroupBy extension would give better performance than querying each driver Id individually.
I'm using Entity Framework and building up a LINQ query so that the query is executed at the database, to minimize the data coming back. The query can have some search criteria, which are optional, and some ordering, which is applied every time. I am working with parents and children (the mummy and daddy type). The filter I am trying to implement is on the age of the children.
So if I have some data like so...
parent 1
- child[0].Age = 5
- child[1].Age = 10
parent 2
- child[0].Age = 7
- child[1].Age = 23
...and I specify a minimum age of 8, my intended result to display is...
parent 1
- child[1].Age = 10
parent 2
- child[1].Age = 23
...and if I specify a minimum age of 15 I intend to display...
parent 2
- child[1].Age = 23
I can re-create my expected result with this horrible query (which I assume is actually doing more than one query):
var parents = context.Parents;
if(minimumChildAge.HasValue)
{
parents = parents.Where(parent => parent.Children.Any(child => child.Age >= minimumChildAge.Value));
foreach(var parent in parents)
{
parent.Children = parent.Children.Where(child => child.Age >= minimumChildAge.Value).ToList();
}
}
parents = parents.OrderBy(x => x.ParentId).Take(50);
So I tried the other method instead...
var query = from parent in context.Parents
select parent;
if (minimumChildAge.HasValue)
query = from parent in query
join child in context.Children
on parent.ParentId equals child.ParentId
where child.Age >= minimumChildAge.Value
select parent;
query = query.OrderBy(x => x.ParentId).Take(50);
When I run this in linqpad the query generated looks good. So my question...
Is this the correct way of doing this? Is there a better way? It seems a bit funny that if I now specified a maximum age, I would be writing the same join and hoping that Entity Framework works it out. In addition, how does this impact lazy loading? I expect only the children which match the criteria to be returned. So when I access parent.Children, does Entity Framework know that it just queried these and that it's working on a filtered collection?
Assuming your context is backed by an Entity Framework database or similar, then yes, your first option is going to do more than one SQL query. When you begin executing the foreach it will run a SQL query to get the parents (since you've forced enumeration on the query). Then, for each attempt to populate the Children property of a single parent object it will make another database call.
The second form should only produce a single SQL query; it will have a ton of redundant data but it will use JOIN statements to bring back all of the parent and child data in a single SQL call, then enumerate through it and populate the data on the client side as needed.
A rule of thumb I tend to follow is that, if you have fewer than 4 nested tables in your query, try to run it all at once. Both SQL and Entity Framework's query parsers seem to be very, very efficient when producing joins at that level.
If you get much beyond that, the SQL queries that EF can produce may get messy, and SQL itself (assuming MSSQL) gets less effective when you have 5+ joins on a single query. There's no hard and fast limit, because it depends on a number of specific factors, but if I find myself needing very deep nesting I tend to break it up into smaller LINQ queries and recombine them client-side.
(Side note: you can reproduce your second query in method syntax easily enough, since that's what the compiler is going to end up doing anyway, by using the Join method, but the syntax for that can get very complex; I typically go with query syntax for anything more complex than a single method call.)
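For reference, a method-syntax sketch of that join, using the same names as the question, might look like this:

// Same shape as the query-syntax version: join, filter on child age, keep the parent.
if (minimumChildAge.HasValue)
{
    query = query
        .Join(context.Children,
              parent => parent.ParentId,
              child => child.ParentId,
              (parent, child) => new { parent, child })
        .Where(x => x.child.Age >= minimumChildAge.Value)
        .Select(x => x.parent);
}
query = query.OrderBy(x => x.ParentId).Take(50);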
In my C# class library project I have a method, GetFaultRate, that needs to compute a statistic: given a date, it computes the number of products with faults over the number of products produced.
float GetFaultRate(DateTime date)
{
var products = GetProducts(date);
var faultyProducts = GetFaultyProducts(date);
var rate = (float) faultyProducts.Count() / products.Count();
return rate;
}
Both methods, GetProducts and GetFaultyProducts take the data from a Repository class _productRepository.
IEnumerable<Product> GetProducts(DateTime date)
{
var products = _productRepository.GetAll().ToList();
var periodProducts = products.Where(p => CustomFunction(p.ProductionDate) == date);
return periodProducts;
}
IEnumerable<Product> GetFaultyProducts(DateTime date)
{
var products = _productRepository.GetAll().ToList();
var periodFaultyProducts = products.Where(p => CustomFunction(p.ProductionDate) == date && p.Faulty == true);
return periodFaultyProducts;
}
Where GetAll has signature:
IQueryable<Product> GetAll();
There are many products in the database and it takes a lot of time to retrieve them and convert them with ToList(). I need to enumerate the collection, since a custom function such as CustomFunction cannot be executed in an IQueryable<T>.
My application gets stuck for a long time before obtaining the fault rate. I guess it is because of the large number of objects to be retrieved. I could indeed remove the two functions GetProducts and GetFaultyProducts and implement the logic inside GetFaultRate. However, since I have other functions that use GetProducts and GetFaultyProducts, with the latter solution I would have only one database access but a lot of duplicate code.
What can be a good compromise?
First off, don't convert the IQueryable to a list. It forces the entire data set to be brought into memory all at once, rather than just calling Where directly on the query which will allow you to filter the data as it comes in. This will substantially decrease your memory footprint, and (very) marginally increase the runtime speed. If you need to convert an IQueryable to an IEnumerable so that the Where isn't executed by the database simply use AsEnumerable.
Next, getting all of the data is something you should avoid if at all possible, especially multiple times. You'd need to show us what your date function does, but it's possible that it is something that could be done on the database. Any filtering you can do at all at the database will substantially increase performance.
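For instance, if CustomFunction only truncates the production date to a calendar day (an assumption on my part), the same filter can be expressed as a range comparison that does translate to SQL:

// Assumption: CustomFunction(p.ProductionDate) == date is really a "same day" check.
// Range comparisons on the raw column translate cleanly; a local method call does not.
var dayStart = date.Date;
var dayEnd = dayStart.AddDays(1);

var periodProducts = _productRepository.GetAll()
    .Where(p => p.ProductionDate >= dayStart && p.ProductionDate < dayEnd);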
Next, you really don't need two queries here. The second query is just a subset of the first, so if you know that you'll always be using both queries then you should just perform the first query, bring the results into memory (i.e. with a ToList that you store) and then use a Where on that to filter the results further. This will avoid another database trip as well as all of the data processing/filtering.
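In code, that could look something like this (a sketch reusing the question's names):

// One database round trip; the faulty subset is filtered from the in-memory list.
var periodProducts = GetProducts(date).ToList();
var faultyProducts = periodProducts.Where(p => p.Faulty).ToList();

float rate = (float)faultyProducts.Count / periodProducts.Count;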
If you won't always be using both queries, but will sometimes use just one or the other, then you can improve the second query by filtering out on Faulty before getting all items. Add Where(p => p.Faulty) before you call AsEnumerable and filter on the date information after calling AsEnumerable (and that's if you can't convert any of the date filtering to filtering that can be done at the database).
It appears that in the end you only need to compute the ratio of items that are faulty as compared to the total. That can easily be done with a single query, rather than two.
You've said that Count is running really slowly in your code, but that's not really true. Count is simply the method that is actually enumerating your query, whereas all of the other methods were simply building the query, not executing it. However, you can cut your performance costs drastically by combining the queries entirely.
var lookup = _productRepository.GetAll()
.AsEnumerable()//if at all possible, try to re-write the `Where`
//to be a valid SQL query so that you don't need this call here
.Where(p => CustomFunction(p.ProductionDate) == date)
.ToLookup(product => product.Faulty);
int totalCount = lookup[true].Count() + lookup[false].Count();
double rate = lookup[true].Count() / (double) totalCount;
var products = GetProducts(date);
var periodFaultyProducts = (from p in products.AsParallel()
where p.Faulty == true
select p).AsEnumerable();
You need to reduce the number of database requests. ToList, First, FirstOrDefault, Any, Take and Count force your query to run against the database. As Servy pointed out, AsEnumerable converts your query from IQueryable to IEnumerable. If you have to find subsets you can use Where.
I have many queries to run and I was wondering if there is a significant performance difference between querying a List, a DataTable, or even an indexed SQL Server table. Or maybe would it be faster if I went with another type of collection?
In general, what do you think?
Thank you!
It should almost always be faster querying anything in memory, like a List<T> or a DataTable vis-a-vis a database.
Having said that, you have to get the data into an in-memory object like a List before it can be queried, so I certainly hope you're not thinking of dumping your DB into a List<T> for fast querying. That would be a very bad idea.
Am I getting the point of your question?
You might be confusing Linq with a database query language. I would suggest reading up on Linq, particularly IQueryable vs IEnumerable.
In short, Linq is an in-code query language, which can be pointed at nearly any collection of data to perform searches, projections, aggregates, etc. in a similar fashion as SQL, but not limited to RDBMSes. It is not, on its face, a DB query language like SQL; it can merely be translated into one by use of an IQueryable provider, like Linq2SQL, Linq2Azure, Linq for Entities... the list goes on.
The IEnumerable side of Linq, which works on in-memory objects that are already in the heap, will almost certainly perform better than the IQueryable side, which exists to be translated into a native query language like SQL. However, that's not because of any inherent weakness or strength in either side of the language. It is instead a factor of (usually) having to send the translated IQueryable command over a network channel and get the results back over the same channel, which performs much more slowly than your local computer's memory.
However, the "heavy lifting" of pulling records out of a data store and creating in-memory object representations has to be done at some time, and IQueryable Linq will almost certainly be faster than instantiating ALL records as in-memory objects, THEN using IEnumerable Linq (Linq 2 Objects) to filter to get your actual data.
To illustrate: You have a table MyTable; it contains a relatively modest 200 million rows. Using a Linq provider like Linq2SQL, your code might look like this:
//GetContext<>() is a method that will return the IQueryable provider
//used to produce MyTable entity objects
//pull all records for the past 5 days
var results = from t in Repository.GetContext<MyTable>()
where t.SomeDate >= DateTime.Today.AddDays(-5)
&& t.SomeDate <= DateTime.Now
select t;
This will be digested by the Linq2SQL IQueryable provider into a SQL string like this:
SELECT [each of MyTable's fields] FROM MyTable WHERE SomeDate Between #p1 and #p2; #p1 = '2/26/2011', #p2 = '3/3/2011 9:30:00'
This query can be easily digested by the SQL engine to return EXACTLY the information needed (say 500 rows).
Without a Linq provider, but wanting to use Linq, you may do something like this:
//GetAllMyTable() is a method that will execute and return the results of
//"Select * from MyTable"
//pull all records for the past 5 days
var results = from t in Repository.GetAllMyTable()
where t.SomeDate >= DateTime.Today.AddDays(-5)
&& t.SomeDate <= DateTime.Now
select t;
On the surface, the difference is subtle. Behind the scenes, the devil's in those details. This second query relies on a method that retrieves and instantiates an object for every record in the database. That means it has to pull all those records, and create a space in memory for them. That will give you a list of 200 MILLION records, which isn't so modest anymore now that each of those records was transmitted over the network and is now taking up residence in your page file. The first query MAY introduce some overhead in building and then digesting the expression tree into SQL, but it's MUCH preferred over dumping an entire table into an in-memory collection and iterating over it.
I have created a single LINQ query that creates a main group and then two nested groups. Within the last nest there is also a simple OrderBy. The issue I am running into is that while writing the query or trying to edit it, Visual Studio's memory consumption skyrockets to ~500 MB and it eats 50% of my CPU, which makes Visual Studio unresponsive for a few minutes. If I comment out the query then Visual Studio behaves just fine. So my question is: why is Visual Studio consuming so much memory at design time for a LINQ query, granted it is kind of complicated?
The DataTable I am using is 10,732 rows long by 21 columns across.
var results = from p in m_Scores.AsEnumerable()
group p by p.Field<string>("name") into x
select new
{
Name = x.Key,
Members = from z in x
group z by z.Field<string>("id") into zz
select new
{
Id = zz.Key,
Plots = from a in zz
group a by a.Field<string>("foo") into bb
select new
{
Foo = bb.Key,
Bars = bb
}.Bars.OrderBy(m => m.Field<string>("foo"))
}
};
Hardware Specs:
Dell Latitude with a 2.20GHz Dual Core Processor and 4GB ram
The problem with groups and orders is that they require knowledge of the entire collection in order to perform the operations. Same thing with aggregates like min, max, sum, avg, etc. Some of these operations cannot interrogate the true type of the IEnumerable being passed, or it wouldn't matter, since they are "destructive" in nature so they have to create a working copy. When you chain these things together, you end up with at least two copies of the full enumerable; the one produced by the previous method that is being iterated by the current method, and the one being generated by the current method. Any copy of the enumerable that has a reference outside the query (like the source enumerable) also stays in memory, and enumerables that have become orphaned stick around until the GC thread has time to dispose of and finalize them. For a large source enumerable, all of this can create a massive demand on the heap.
In addition, nesting aggregates in clauses can very quickly make the query expensive. Unlike a DBMS which can engineer a "query plan", Linq isn't quite that smart. Min(), for example, requires iteration of the whole enumerable to find the smallest value of the specified projection. When that's a criteria of a Where clause, a good DBMS will find that value once per context and then inline the value in subsequent evaluations as necessary. Linq just runs the extension method each time it is called, and when you have a condition like enumerable.Where(x=>x.Value == enumerable.Min(x2=>x2.Value)), that's an O(N^2)-complexity operation just to evaluate the filter. Add multiple levels of grouping and the Big-O can easily reach high-polynomial complexity.
Generally, you can reduce query time by performing the optimizations that a DBMS would given the same query. If the value of an aggregate can be known for the full scope of the query (e.g. result = source.Where(s=>s.Value == source.Min(x=>x.value))), evaluate that into a variable using the let clause (or an external query), and replace Min() calls with the alias. Iterating the enumerable twice is generally cheaper than iterating it N^2 times, especially if the enumerable stays in memory between iterations.
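As a sketch of that hoisting:

// Evaluate the aggregate once, up front, instead of recomputing Min()
// for every element the Where clause examines.
var min = source.Min(x => x.Value);              // one full pass
var result = source.Where(s => s.Value == min);  // second pass: O(N) overall instead of O(N^2)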
Also, make sure your query order reduces the sample space as much and as cheaply as possible before you start grouping. You may be able to make educated guesses about conditions that must be evaluated expensively, such as Where(s=>s.Value < threshold).Where(s=>s.Value == source.Min(x=>x.Value)) or, more concisely, Where(s=>s.Value < threshold && s.Value == source.Min(x=>x.Value)) (the second works in C# because of short-circuit condition evaluation, but not all languages short-circuit). That cuts the number of evaluations of Min() down to the number of elements meeting the first criterion. You can use existing criteria to do the same thing, wherever criteria A and B are independent enough that A && B == B && A.