How to retrieve data from very large datasets with optional parameters? - c#

I have an app that retrieves data requested by the user. All parameters except Type are optional. If a parameter is not specified, all items are retrieved; if it is specified, only items matching that parameter are retrieved. For example, here I retrieve products by year of release (-1 is the default value if the user hasn't specified one):
var products = context.Products.Where(p => p.type == Type).ToList();
if (Year != -1)
    products = products.Where(p => p.year == Year).ToList();
This works fine for some of the years. E.g., if I search 2001, I get all the entries I need. But since products has a limited size and only retrieves 1500 entries, later years never make it into the products list, and the search reports no data for those years even though the data is in the DB.
How can I get around this problem?

One of the nice things about deferred execution in LINQ is that it can make code with variable filtering rules much neater and more readable. If you're not sure what deferred execution is: in a nutshell, it's a mechanism that only runs the LINQ query when you ask for the results, rather than when you write the statements that make up the query.
In essence this means we can have code like:
// always adults
var p = person.Where(x => x.Age > 18);

// we maybe filter on these
if (email != null)
    p = p.Where(x => x.Email == email);
if (socialSN != null)
    p = p.Where(x => x.SSN == socialSN);

var r = p.ToList(); // the query is only actually run now
The multiple calls to Where here are cumulative; they conceptually build up a WHERE clause but don't execute the query until ToList is called. At that point, if a database is in use, the DB sees the query with all its Where clauses at once and can leverage indexes and statistics.
If we were to call ToList after every Where, the first Where would hit the DB and its whole result set would download to the client app, where the runtime would set about converting an enumerable to a list (a lot of copying and memory allocation). Each subsequent Where would then filter the list in the client app, enumerating it and converting it to a list again. The big problem is that this happens in the client app's memory as a naive, unindexed loop, and all those millions of dollars of R&D Microsoft poured into making the SQL Server query optimizer pull huge amounts of data very quickly are wasted :)
Consider also that the first clause in my example, Age > 18, could match a huge set; a million people with ages spread from 12 upwards, for example. Email or SSN would narrow it to a far smaller set, and those columns are probably indexed. It's a contrived example, sure, but hopefully it illustrates the point about performance: by calling ToList too early, we end up downloading far too much data.
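To tie this back to the original question, a minimal sketch using the same names as the asker's snippet:

var query = context.Products.Where(p => p.type == Type);
if (Year != -1)
    query = query.Where(p => p.year == Year); // still only building the query
var products = query.ToList(); // one round trip; both filters run in the database

Because nothing executes until ToList, the year filter reaches the database instead of being applied to a client-side list that may already have been truncated at 1500 rows.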

Related

Entity Framework DbContext filtered query for count is extremely slow using a variable

Using an ADO.NET entity data model, I've constructed the two queries below against a table containing 1800 records with just over 30 fields, and they yield staggeringly different results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
I can see from the database log that Entity Framework provides that the two queries are almost identical, except that the filter portion of the slow one uses a parameter. Copying this query into SSMS, then declaring and setting the parameter there, results in a near-instant query, so it doesn't appear to be a problem on the database end of things.
Has anyone encountered this and can explain what's happening? I'm at the mercy of a third-party control that adds this filter to the query in an attempt to limit the number of rows returned, and getting the count is a must. It is used for several queries, so a generic solution is needed. Unfortunately it doesn't work as advertised; it makes the query take 5-10 times as long as simply loading the entire view into memory. When no filter is used, however, it works like a dream.
Use of these components includes the source code so I can change this behavior but need to consider which approaches can be used to provide a reusable solution.
You didn't mention the design details of your model, but if you only need a count of the records matching a condition, this can be optimized by counting over a single column. For example, given:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If your view has 10 columns and, based on the above statement, say 100 records are returned, then the result set carries 10 columns' worth of data per record that is of no use for a count. You can optimize this by counting a result set projected down to a single column:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x => new { x.column }).Count();
Other optimizations, such as the async variant CountAsync, can also be used.
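For reference, a minimal sketch of the CountAsync variant (assuming EF6's System.Data.Entity extension methods and an async calling context):

using System.Data.Entity; // brings the CountAsync extension method into scope

int count = await context.viewCustomers.AsNoTracking()
    .Where(c => c.Cust_ID == _custID)
    .CountAsync(); // the COUNT runs on the server without blocking the calling thread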

Linq query timing out, how to streamline query

Our front-end UI has a filtering system that, in the back end, operates over millions of rows. It uses an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that match both selections). This is not a problem. However, Dropdown3 has two types of data in it, and the checked items need to be ORed together, then ANDed with the rest of the query.
Due to the large number of rows it operates over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
// The end list has driver ids in it, but the data comes from two different places.
driverIds = db.CarDriversManyToManyTable
    .Where(cd => filter.CarIds.Contains(cd.CarId)) // get driver IDs for each car ID listed in the filter object
    .Select(cd => cd.DriverId)
    .Distinct()
    .ToList();
driverIds = driverIds.Concat(
        db.DriverShopManyToManyTable
            .Where(ds => filter.ShopIds.Contains(ds.ShopId)) // get driver IDs for each shop listed in the filter object
            .Select(ds => ds.DriverId)
            .Distinct())
    .Distinct()
    .ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?
There are several ways to produce a single SQL query. All of them require keeping the parts of the query typed as IQueryable<T>, i.e. not using ToList, ToArray, AsEnumerable, etc., which force them to be executed and evaluated in memory.
One way is to create a Union query containing the filtered Ids (which will be unique by definition) and use the Join operator to apply it to the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
    .Where(cd => filter.CarIds.Contains(cd.CarId))
    .Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
    .Where(ds => filter.ShopIds.Contains(ds.ShopId))
    .Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way is to use two OR-ed Any-based conditions, which translate to an EXISTS(...) OR EXISTS(...) filter in SQL:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination: .Skip(PageNum * PageSize).Take(PageSize). I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever other smaller number seems reasonable to you.
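A minimal sketch (pageNum and pageSize are hypothetical UI inputs; note that EF requires an OrderBy before Skip/Take can be translated):

var page = query
    .OrderBy(d => d.Id) // a stable ordering is required for Skip/Take to translate to SQL
    .Skip(pageNum * pageSize)
    .Take(pageSize)
    .ToList(); // only one page of rows crosses the wire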
You've mentioned that you need joins to get the data you need. These joins can be done while forming your IQueryable (Entity Framework), rather than in memory (LINQ to Objects). Read up on join syntax in LINQ.
HOWEVER, performing explicit joins in LINQ is not best practice, especially if you are designing the database yourself. If you are doing database-first generation of your entities, consider placing foreign-key constraints on your tables. This allows database-first entity generation to pick those up and provide you with navigation properties, which will greatly simplify your code; the sketch below contrasts the two styles.
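For illustration only (the db.Drivers set and the CarDrivers navigation property are hypothetical, since the real model isn't shown):

// explicit join, composed while everything is still IQueryable
var drivers = from d in db.Drivers
              join cd in db.CarDriversManyToManyTable on d.Id equals cd.DriverId
              where filter.CarIds.Contains(cd.CarId)
              select d;

// the same filter via a navigation property, once the FK is mapped
var drivers2 = db.Drivers.Where(d => d.CarDrivers.Any(cd => filter.CarIds.Contains(cd.CarId)));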
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an Entity Framework LINQ query that uses explicit joins as a last resort.
To speed such queries up, you will likely need indexes on all of the key columns that you are joining on. The best way to figure out which indexes you need is to take the SQL query generated by your EF LINQ and bring it over to SQL Server Management Studio. From there, update the generated SQL to provide some predefined values for the @p parameters, just as an example. Once you've done this, right-click on the query and use either Display Estimated Execution Plan or Include Actual Execution Plan. If indexing can improve your query performance, there is a pretty good chance this feature will tell you about it and even provide you with scripts to create the indexes you need.
It looks to me like using the instance versions of the LINQ extensions is creating several collections before you're done. Using the from-statement versions should cut that down quite a bit:
driverIds = (from record in db.CarDriversManyToManyTable
             where filter.CarIds.Contains(record.CarId)
             select record.DriverId)
            .Concat(from record in db.DriverShopManyToManyTable
                    where filter.ShopIds.Contains(record.ShopId)
                    select record.DriverId)
            .Distinct();
Also, using the GroupBy extension might give better performance than querying each driver Id individually.
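The answer doesn't show what that would look like; one possible reading, as a hedged sketch, is to group the link rows by DriverId so the distinct keys come back directly:

var driverIds = db.CarDriversManyToManyTable
    .Where(cd => filter.CarIds.Contains(cd.CarId))
    .GroupBy(cd => cd.DriverId)
    .Select(g => g.Key); // one key per driver, so the ids come back distinct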

Entity Framework COUNT is doing a SELECT of all records

While profiling my code because it was taking a long time to execute, I found that it is generating a SELECT instead of a COUNT, and as there are 20,000 records it is very, very slow.
This is the code:
var catViewModel= new CatViewModel();
var catContext = new CatEntities();
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
catViewModel.NumberOfCats = catAccount.Cats.Count();
It is straightforward stuff, but the code that the profiler is showing is:
exec sp_executesql N'SELECT
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy -- you get the idea
FROM [dbo].[Cats] AS [Extent1]
WHERE Cats.[AccountId] = @EntityKeyValue1',N'@EntityKeyValue1 int',@EntityKeyValue1=7
I've never seen this behaviour before, any ideas?
Edit: It is fixed if I simply do this instead:
catViewModel.NumberOfRecords = catContext.Cats.Where(c => c.AccountId == accountId).Count();
I'd still like to know why the former didn't work though.
So you have two completely separate queries going on here, and I think I can explain why you get different results. Let's look at the first one:
// pull a single account record
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
// count all the associated Cat records against said account
catViewModel.NumberOfCats = catAccount.Cats.Count();
Going on the assumption that Cats has a 0..* relationship with Account, and assuming you are leveraging the framework's ability to lazily load foreign tables, your first access to catAccount.Cats is going to result in a SELECT of all the associated Cat records for that particular account. This brings the whole collection into memory, so the call to Count() becomes an internal check of the Count property of the in-memory collection (hence no COUNT SQL is generated).
The second query
catViewModel.NumberOfRecords =
catContext.Cats.Where(c => c.AccountId == accountId).Count();
is directly against the Cats table (which is IQueryable<T>), therefore the only operations performed against the table are Where/Count, and both of these will be evaluated on the DB side, so it's obviously a lot more efficient than the first.
However, if you need both the Account and its Cats, then I would recommend you eager-load the data on the fetch; that way you take the hit upfront only once:
var catAccount = catContext.Account.Include(a => a.Cats).Single(...);
Most times, when somebody accesses a sub-collection of an entity, it is because there are a limited number of records, and it is acceptable to populate the collection. Thus, when you access:
catAccount.Cats
(regardless of what you do next), it is filling that collection. Your .Count() is then operating on the local in-memory collection. The problem is that you don't want that. Now you have a few options (the first is sketched just after this list):
- check whether your provider offers some mechanism to make that a query rather than a collection
- build the query dynamically
- access the core data model instead
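For the first option, EF's DbContext exposes an explicit-loading API that keeps the relationship a query; a minimal sketch, assuming EF 4.1+ and that Cats is a mapped navigation property:

int numberOfCats = catContext.Entry(catAccount)
    .Collection(a => a.Cats)
    .Query()  // an IQueryable over the relationship; nothing is loaded yet
    .Count(); // translates to a COUNT on the server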
I'm pretty confident that if you did:
catViewModel.NumberOfRecords =
catContext.Cats.Count(c => c.AccountId == accountId);
it will work just fine. Less convenient? Sure. But "works" is better than "convenient".

Optimize the number of accesses to a database when working with IQueryable<T> and custom functions

In my C# class library project I have a method, GetFaultRate, that computes a statistic: given a date, it computes the number of faulty products divided by the number of products produced.
float GetFaultRate(DateTime date)
{
    var products = GetProducts(date);
    var faultyProducts = GetFaultyProducts(date);
    var rate = (float) faultyProducts.Count() / products.Count(); // cast before dividing to avoid integer division
    return rate;
}
Both methods, GetProducts and GetFaultyProducts, take their data from a repository class, _productRepository.
IEnumerable<Product> GetProducts(DateTime date)
{
    var products = _productRepository.GetAll().ToList();
    var periodProducts = products.Where(p => CustomFunction(p.ProductionDate) == date);
    return periodProducts;
}

IEnumerable<Product> GetFaultyProducts(DateTime date)
{
    var products = _productRepository.GetAll().ToList();
    var periodFaultyProducts = products.Where(p => CustomFunction(p.ProductionDate) == date && p.Faulty == true);
    return periodFaultyProducts;
}
Where GetAll has signature:
IQueryable<Product> GetAll();
There are many products in the database, and it takes a long time to retrieve them and convert them with ToList(). I need to enumerate the collection, because a custom function such as CustomFunction cannot be executed by an IQueryable<T> provider.
My application gets stuck for a long time before obtaining the fault rate. I guess it is because of the large number of objects being retrieved. I could remove the two functions GetProducts and GetFaultyProducts and implement the logic inside GetFaultRate; that way there is only one database access, but since other functions also use GetProducts and GetFaultyProducts, I would end up with a lot of duplicate code.
What can be a good compromise?
First off, don't convert the IQueryable to a list. That forces the entire data set to be brought into memory all at once, rather than calling Where directly on the query, which allows the data to be filtered as it comes in. This will substantially decrease your memory footprint and (very) marginally increase runtime speed. If you need to convert an IQueryable to an IEnumerable, so that a Where isn't executed by the database, simply use AsEnumerable.
Next, getting all of the data is something you should avoid if at all possible, especially multiple times. You'd need to show us what your date function does, but it's possible that it's something that could be done in the database. Any filtering you can push to the database will substantially increase performance.
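For instance, if CustomFunction merely strips the time component from the production timestamp, EF can translate that itself; a hedged sketch, assuming EF6's DbFunctions:

using System.Data.Entity; // DbFunctions lives here in EF6

var periodProducts = _productRepository.GetAll()
    .Where(p => DbFunctions.TruncateTime(p.ProductionDate) == date.Date); // runs in SQL; no ToList needed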
Next, you really don't need two queries here. The second query is just a subset of the first, so if you know you'll always be using both, you should perform the first query, bring the results into memory (i.e. with a ToList whose result you store), and then use a Where on that to filter the results further. This avoids another database trip as well as all of that data processing/filtering.
If you won't always be using both queries, but will sometimes use just one or the other, then you can improve the second query by filtering on Faulty before fetching all the items. Add Where(p => p.Faulty) before you call AsEnumerable, and filter on the date information after calling AsEnumerable (that is, if you can't convert any of the date filtering into filtering that can be done at the database).
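In other words, a sketch of that ordering (assuming the date check really can't be translated):

var periodFaultyProducts = _productRepository.GetAll()
    .Where(p => p.Faulty)   // translated to SQL; only faulty rows cross the wire
    .AsEnumerable()         // switch to LINQ to Objects from here on
    .Where(p => CustomFunction(p.ProductionDate) == date); // runs in memory on the reduced set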
It appears that in the end you only need to compute the ratio of items that are faulty as compared to the total. That can easily be done with a single query, rather than two.
You've said that Count is running really slowly, but that's not really true. Count is simply the method that actually enumerates your query, whereas all of the other methods were merely building the query, not executing it. However, you can cut your performance costs drastically by combining the queries entirely.
var lookup = _productRepository.GetAll()
    .AsEnumerable() // if at all possible, try to rewrite the Where as a valid
                    // SQL-translatable query so that you don't need this call here
    .Where(p => CustomFunction(p.ProductionDate) == date)
    .ToLookup(product => product.Faulty);

int totalCount = lookup[true].Count() + lookup[false].Count();
double rate = lookup[true].Count() / (double) totalCount;
var products = GetProducts(date);
var periodFaultyProducts = (from p in products.AsParallel()
                            where p.Faulty == true
                            select p).AsEnumerable();
You need to reduce the number of database requests. ToList, First, FirstOrDefault, Any, Take, and Count each force your query to run at the database. As Servy pointed out, AsEnumerable converts your query from IQueryable to IEnumerable. If you need to find subsets, you can use Where.

how to append IQueryable within a loop

I have a simple foreach loop that goes through the product IDs stored in a user's basket and looks up each product's details from the database.
As you can see from my code, what I have at present returns only the very last item on screen, as the variable is overwritten within the loop. I'd like to be able to concatenate the queries so that I can display the product details for all the items in the basket.
I know I could do something easy like storing only the ProductIDs in the repeater I use and calling the database in OnItemDataBound, but I'd like to make just one database call if possible.
Currently I have the following (removed complex joins from example, but if this matters let me know):
IQueryable productsInBasket = null;
foreach (var thisproduct in store.BasketItems)
{
    productsInBasket = (from p in db.Products
                        where p.Active == true && p.ProductID == thisproduct.ProductID
                        select new
                        {
                            p.ProductID,
                            p.ProductName,
                            p.BriefDescription,
                            p.Details,
                            p.ProductCode,
                            p.Barcode,
                            p.Price
                        });
}
BasketItems.DataSource = productsInBasket;
BasketItems.DataBind();
Thanks for your help!
It sounds like you really want something like:
var productIds = store.BasketItems.Select(x => x.ProductID).ToList();
var query = from p in db.Products
            where p.Active && productIds.Contains(p.ProductID)
            select new
            {
                p.ProductID,
                p.ProductName,
                p.BriefDescription,
                p.Details,
                p.ProductCode,
                p.Barcode,
                p.Price
            };
In Jon's answer, which works just fine, the IQueryable is nonetheless converted to an IEnumerable where ToList() is called, which causes that part to be executed and its results retrieved immediately. For your situation this may be OK, since you want to retrieve the products for a basket, and the number of products will probably be fairly small.
I am, however, facing a similar situation, where I want to retrieve friends for a member. Friendship depends on which groups two members belong to: if they share at least one group, they are friends. I thus have to retrieve all group memberships for a certain member, then retrieve all members of those groups.
The ToList approach is not applicable in my case, since it would execute the query each time I want to handle my friends in various ways, e.g. find stuff that we can share. Retrieving all members from the database, instead of just composing the query and executing it at the last possible moment, would kill performance.
Still, my first attempt at this situation was to do just that: retrieve all groups I belonged to (IQueryable), initialize a List for the result (IEnumerable), then loop over all groups and append all members to the result if they were not already in the list. Finally, since my interface required that an IQueryable be returned, I returned the list with AsQueryable().
This was a nasty piece of code, but at least it worked. It looked something like this:
var result = new List<Member>();
foreach (var group in GetGroupsForMember(member))
    result.AddRange(group.GroupMembers
        .Where(x => x.MemberId != member.Id && !result.Contains(x.Member))
        .Select(groupMember => groupMember.Member));
return result.AsQueryable();
However, this is BAD, since I add ALL shared members to a list, then convert the list to an IQueryable just to satisfy my postcondition. I retrieve every affected member from the database, every time I want to do anything with them.
Imagine a paginated list: I would then just want to pick out a certain range from this list. If this is done with an IQueryable, the query is simply extended with a pagination statement. If it is done with an IEnumerable, the query has already been executed and all operations are applied to the in-memory result.
(As you may also notice, I navigate down the entity's relations (GroupMember => Member), which increases coupling and can cause all kinds of nasty situations further on. I wanted to remove this behavior as well.)
So, tonight, I took another pass at it and ended up with a much simpler approach, where I select data like this:
var groups = GetGroupsForMember(member);
var groupMembers = GetGroupMembersForGroups(groups);
var memberIds = groupMembers.Select(x => x.MemberId);
var members = memberService.GetMembers(memberIds);
The two Get methods honor the IQueryable and never convert it to a list or any other IEnumerable. The third line just composes another LINQ operation on top of that query. The last line takes the member IDs and retrieves the members from another service, which also works exclusively with IQueryables.
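To make that concrete, hypothetical signatures for the two Get methods might look like this (a sketch only, since the real model isn't shown; everything stays IQueryable end to end):

IQueryable<Group> GetGroupsForMember(Member member)
{
    return db.GroupMembers
        .Where(gm => gm.MemberId == member.Id)
        .Select(gm => gm.Group);
}

IQueryable<GroupMember> GetGroupMembersForGroups(IQueryable<Group> groups)
{
    // composing one IQueryable inside another keeps everything in a single translated query
    return db.GroupMembers.Where(gm => groups.Any(g => g.Id == gm.GroupId));
}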
This is probably still horrible in terms of performance, but I can optimize it further later on, if needed. At least, I avoid loading unnecessary data.
Let me know if I am terribly wrong here.
