Linq to SQL/Entities: Greatest N-Per group problem/performance increase

All right, so I too have encountered what I believe is the greatest-n-per-group problem. Although this question has been answered before, I do not think it has been solved well yet with LINQ. I have a table with a few million entries, so queries take a long time. I would like these queries to take less than a second; currently they take anywhere from about 10 seconds to never finishing.
var query =
from MD in _context.MeasureDevice
where MD.DbdistributorMap.DbcustomerId == 6 // Filter the devices based on customer
select new
{
DbMeasureDeviceId = MD.DbMeasureDeviceId,
// includes measurements and alarms which have 1-m and 1-m relations
Measure = _context.Measures.Include(e=> e.MeasureAlarms)
.FirstOrDefault(e => e.DbMeasureDeviceId == MD.DbMeasureDeviceId && e.MeasureTimeStamp == _context.Measures
.Where(x => x.DbMeasureDeviceId == MD.DbMeasureDeviceId)
.Max(e=> e.MeasureTimeStamp)),
Address = MD.Dbaddress // includes address 1-1 relation
};
In this query I'm selecting data from 4 different tables. First, the MeasureDevice table, which is the primary entity I'm after. Second, I want the latest measurement from the measures table, which should also include alarms from another table if any exist. Last, I need the address of the device, which is located in its own table.
There are a few thousand devices, each with several thousand measures, which amounts to several million rows in the measurement table.
I wonder if anyone knows how to improve the performance of LINQ queries using EF5, or a better method for solving the greatest-n-per-group problem. I've analyzed the query using SQL Server Management Studio, and most of the time is spent fetching the measurements.
Query generated as requested:
SELECT [w].[DBMeasureDeviceID], [t].[DBMeasureID], [t].[AlarmDBAlarmID], [t].[batteryValue], [t].[DBMeasureDeviceID], [t].[MeasureTimeStamp], [t].[Stand], [t].[Temperature], [t].[c], [a].[DBAddressID], [a].[AmountAvtalenummere],
[a].[DBOwnerID], [a].[Gate], [a].[HouseCharacter], [a].[HouseNumber], [a].[Latitude], [a].[Longitude], [d].[DBDistributorMapID], [m1].[DBMeasureID], [m1].[DBAlarmID], [m1].[AlarmDBAlarmID], [m1].[MeasureDBMeasureID]
FROM [MeasureDevice] AS [w]
INNER JOIN [DistribrutorMap] AS [d] ON [w].[DBDistributorMapID] = [d].[DBDistributorMapID]
LEFT JOIN [Address] AS [a] ON [w].[DBAddressID] = [a].[DBAddressID]
OUTER APPLY (
SELECT TOP(1) [m].[DBMeasureID], [m].[AlarmDBAlarmID], [m].[batteryValue], [m].[DBMeasureDeviceID], [m].[MeasureTimeStamp], [m].[Stand], [m].[Temperature], 1 AS [c]
FROM [Measure] AS [m]
WHERE ([m].[MeasureTimeStamp] = (
SELECT MAX([m0].[MeasureTimeStamp])
FROM [Measure] AS [m0]
WHERE [m0].[DBMeasureDeviceID] = [w].[DBMeasureDeviceID])) AND ([w].[DBMeasureDeviceID] = [m].[DBMeasureDeviceID])
) AS [t]
LEFT JOIN [MeasureAlarm] AS [m1] ON [t].[DBMeasureID] = [m1].[MeasureDBMeasureID]
WHERE [d].[DBCustomerID] = 6
ORDER BY [w].[DBMeasureDeviceID], [d].[DBDistributorMapID], [a].[DBAddressID], [t].[DBMeasureID], [m1].[DBMeasureID], [m1].[DBAlarmID]
Entity Relations

You have navigation properties defined, so it stands to reason that MeasureDevice should have a reference to its Measures:
var query = _context.MeasureDevice
    .Include(md => md.Measures.Select(m => m.MeasureAlarms))
    .Where(md => md.DbdistributorMap.DbcustomerId == 6)
    .Select(md => new
    {
        DbMeasureDeviceId = md.DbMeasureDeviceId,
        Measure = md.Measures.OrderByDescending(m => m.MeasureTimeStamp).FirstOrDefault(),
        Address = md.Dbaddress
    });
The possible bugbear here is including the MeasureAlarms with the required Measure. AFAIK you cannot put an .Include() within a .Select() (where we might have tried Measure = md.Measures.Include(m => m.MeasureAlarms)...).
Caveat: it has been quite a while since I used EF 5 (unless you are referring to EF Core 5). If you are using the (very old) EF5 in your project, I would recommend arguing for an upgrade to EF6, given that EF6 brought a number of performance and capability improvements over EF5. If you are instead using EF Core 5, the Include statement above would be slightly different:
.Include(md => md.Measures).ThenInclude(m => m.MeasureAlarms)
Rather than returning entities, my go-to advice is to use Projection to select precisely the data we need. That way we don't need to worry about eager or lazy loading. If there are details about the Measure and MeasureAlarms we need:
var query = _context.MeasureDevice
    .Where(md => md.DbdistributorMap.DbcustomerId == 6)
    .Select(md => new
    {
        md.DbMeasureDeviceId,
        // The address hangs off the device (1-1 relation); project only the fields you need from it.
        Address = md.Dbaddress,
        Measure = md.Measures
            .OrderByDescending(m => m.MeasureTimeStamp) // newest first, before projecting
            .Select(m => new
            {
                m.MeasureId,
                m.MeasureTimeStamp,
                // any additional needed fields from Measure
                Alarms = m.MeasureAlarms.Select(ma => new
                {
                    ma.MeasureAlarmId,
                    ma.Label // etc. Whatever fields needed from Alarm...
                }).ToList()
            })
            .FirstOrDefault()
    });
This example selects anonymous types. Alternatively, you can define DTOs/ViewModels and use a library like AutoMapper to map the fields, replacing all of that with something like ProjectTo<LatestMeasureSummaryDTO>, where AutoMapper has rules to resolve the latest Measure for a MeasureDevice and extract the needed fields.
The benefits of projection are that it handles otherwise complex or clumsy eager loading, builds optimized payloads with only the fields a consumer needs, and stays resilient in a changing system where new relationships could otherwise introduce lazy loading performance issues by accident. For example, if Measure currently only has MeasureAlarm to eager load, everything works. But if a new relationship is later added to Measure or MeasureAlarm and your payload containing those entities is serialized, that serialization call will now "trip" lazy loading on the new relationship unless you revisit all queries retrieving these entities and add more eager loads, or start worrying about disabling lazy loading entirely. Projections stay the same until the fields they return actually need to change.
Beyond that, the next thing to investigate is running the resulting query through an analyzer, such as SQL Server Management Studio, to get the execution plan and identify whether the query could benefit from indexing changes.

Related

Linq query timing out, how to streamline query

Our front end UI has a filtering system that, in the back end, operates over millions of rows. It uses an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that have both of what is selected in common). This is not a problem. However, Dropdown3 has two types of data in it, and the checked items need to be ORed together, then ANDed with the rest of the query.
Due to the large number of rows it is operating over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
driverIds = db.CarDriversManyToManyTable
    .Where(cd => filter.CarIds.Contains(cd.CarId)) //get driver IDs for each car ID listed in filter object
    .Select(cd => cd.DriverId).Distinct().ToList();
driverIds = driverIds.Concat(
    db.DriverShopManyToManyTable.Where(ds => filter.ShopIds.Contains(ds.ShopId)) //Get driver IDs for each Shop listed in filter object
        .Select(ds => ds.DriverId)
        .Distinct()).Distinct().ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?

There are several ways to produce a single SQL query. They all require keeping the parts of the query as IQueryable<T>, i.e. not using ToList, ToArray, AsEnumerable or similar methods that force the parts to be executed and evaluated in memory.
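To illustrate why the materializing calls matter, here is a minimal in-memory sketch of deferred execution. It uses LINQ to Objects as a stand-in (an assumption for the sake of a runnable example; the class and variable names are hypothetical), but the principle carries over to IQueryable<T>: composing a query does no work, and it is ToList that forces execution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    // Returns how many times the predicate had run before and after
    // the query was enumerated.
    public static (int beforeEnumeration, int afterEnumeration) Run()
    {
        var evaluations = 0;
        var driverIds = new List<int> { 1, 2, 3, 4 }; // hypothetical sample data

        // Composing the query does not run the predicate yet.
        IEnumerable<int> evenIds = driverIds.Where(id =>
        {
            evaluations++;
            return id % 2 == 0;
        });
        var before = evaluations;

        // ToList() forces the query to execute and materialize its results.
        var materialized = evenIds.ToList();
        var after = evaluations;

        return (before, after);
    }
}
```

With an EF-backed IQueryable<T>, the equivalent of that forced execution is a round trip to the database, which is why each early ToList in the question pulls a list of IDs into memory.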
One way is to create a Union query containing the filtered IDs (which will be unique by definition) and use the join operator to apply it to the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
.Where(cd => filter.CarIds.Contains(cd.CarId))
.Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
.Where(ds => filter.ShopIds.Contains(ds.ShopId))
.Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way could be using two OR-ed Any based conditions, which would translate to EXISTS(...) OR EXISTS(...) SQL query filter:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
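As a sanity check that the two shapes select the same drivers, here is a small in-memory sketch of both approaches. The data, the tuple-based link rows, and the class name are all hypothetical, and it runs on LINQ to Objects, so it says nothing about which form is faster in SQL; it only shows that the filters are equivalent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DriverFilterSketch
{
    // Hypothetical link rows standing in for the two many-to-many tables.
    public static readonly (int DriverId, int CarId)[] CarDrivers =
        { (1, 10), (2, 10), (2, 11), (3, 12) };
    public static readonly (int DriverId, int ShopId)[] DriverShops =
        { (2, 20), (4, 20), (5, 21) };
    public static readonly int[] Drivers = { 1, 2, 3, 4, 5 };

    // Approach 1: Union the two ID filters, then join against the drivers.
    public static List<int> ViaUnionJoin(int[] carIds, int[] shopIds)
    {
        var ids1 = CarDrivers.Where(cd => carIds.Contains(cd.CarId)).Select(cd => cd.DriverId);
        var ids2 = DriverShops.Where(ds => shopIds.Contains(ds.ShopId)).Select(ds => ds.DriverId);
        return Drivers
            .Join(ids1.Union(ids2), d => d, id => id, (d, id) => d)
            .OrderBy(d => d)
            .ToList();
    }

    // Approach 2: two OR-ed Any conditions.
    public static List<int> ViaAny(int[] carIds, int[] shopIds)
    {
        return Drivers
            .Where(d =>
                CarDrivers.Any(cd => cd.DriverId == d && carIds.Contains(cd.CarId)) ||
                DriverShops.Any(ds => ds.DriverId == d && shopIds.Contains(ds.ShopId)))
            .OrderBy(d => d)
            .ToList();
    }
}
```

Because Union removes duplicates, a driver matched by both a car and a shop (driver 2 in the sample data, for car 10 and shop 20) still appears only once after the join.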

The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination, e.g. .Skip(PageNum * PageSize).Take(PageSize). I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever other smaller number seems reasonable to you.
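A minimal sketch of such a paging helper follows. The helper name and signature are hypothetical. It orders before skipping, since pages are only stable over a sorted sequence and LINQ to Entities rejects Skip on unsorted input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class PagingSketch
{
    // Returns one zero-based page of an ordered query.
    // The query stays an IQueryable<T> until ToList(), so with an EF-backed
    // source the ordering and paging would be part of the generated SQL.
    public static List<T> GetPage<T, TKey>(
        IQueryable<T> query,
        Expression<Func<T, TKey>> orderBy,
        int pageNum,
        int pageSize)
    {
        return query
            .OrderBy(orderBy)
            .Skip(pageNum * pageSize)
            .Take(pageSize)
            .ToList();
    }
}
```

For example, PagingSketch.GetPage(query, d => d.Id, 0, 100) would fetch the first hundred drivers by Id, assuming query is the IQueryable over the Driver table from the question.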
You've mentioned that you need to use joins to get the data you need. These joins can be done while forming your IQueryable (entity framework), rather than in-memory (linq to objects). Read up on join syntax in linq.
HOWEVER - performing explicit joins in LINQ is not the best practice, especially if you are designing the database yourself. If you are doing database first generation of your entities, consider placing foreign-key constraints on your tables. This will allow database-first entity generation to pick those up and provide you with Navigation Properties which will greatly simplify your code.
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an entity framework linq query that uses explicit joins as a last resort.
To speed such queries up, you will likely need to index all of the "key" columns that you are joining on. The best way to figure out which indexes you need is to take the SQL query generated by your EF LINQ and bring it over to SQL Server Management Studio. There, update the generated SQL to provide some predefined values for your @p parameters, just to make an example. Once you've done this, right-click on the query and use either Display Estimated Execution Plan or Include Actual Execution Plan. If indexing can improve your query performance, there is a pretty good chance that this feature will tell you about it and even provide you with scripts to create the indexes you need.

It looks to me as though using the method versions of the LINQ extensions is creating several collections before you're done. Using the from statement versions should cut that down quite a bit:
driverIds = (from record in db.CarDriversManyToManyTable
             where filter.CarIds.Contains(record.CarId)
             select record.DriverId).Concat
            (from record in db.DriverShopManyToManyTable
             where filter.ShopIds.Contains(record.ShopId)
             select record.DriverId).Distinct();
Also, using the GroupBy extension would give better performance than querying each driver ID.

Entity Framework COUNT is doing a SELECT of all records

I am profiling my code because it takes a long time to execute. It is generating a SELECT of all records instead of a COUNT, and as there are 20,000 records it is very slow.
This is the code:
var catViewModel= new CatViewModel();
var catContext = new CatEntities();
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
catViewModel.NumberOfCats = catAccount.Cats.Count();
It is straightforward stuff, but the code that the profiler is showing is:
exec sp_executesql N'SELECT
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy // You get the idea
FROM [dbo].[Cats] AS [Extent1]
WHERE Cats.[AccountId] = @EntityKeyValue1',N'@EntityKeyValue1 int',@EntityKeyValue1=7
I've never seen this behaviour before, any ideas?
Edit: It is fixed if I simply do this instead:
catViewModel.NumberOfRecords = catContext.Cats.Where(c => c.AccountId == accountId).Count();
I'd still like to know why the former didn't work though.

So you have 2 completely separate queries going on here and I think I can explain why you get different results. Let's look at the first one
// pull a single account record
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
// count all the associated Cat records against said account
catViewModel.NumberOfCats = catAccount.Cats.Count();
Going on the assumption that Cats has a 0..* relationship with Account, and assuming you are leveraging the framework's ability to lazily load related tables, your first access to catAccount.Cats is going to result in a SELECT for all the associated Cat records for that particular account. This brings those rows into memory, so the call to Count() results in an internal check of the Count property of the in-memory collection (hence no COUNT SQL generated).
The second query
catViewModel.NumberOfRecords =
catContext.Cats.Where(c => c.AccountId == accountId).Count();
Is directly against the Cats table (which would be IQueryable<T>), so the only operations performed against the table are Where/Count, and both of these are translated to SQL and evaluated on the DB side, which is obviously a lot more efficient than the first.
However, if you need both Account and Cats, then I would recommend you eager load the data on the fetch. That way you take the hit once, up front:
var catAccount = catContext.Account.Include(a => a.Cats).Single(...);

Most times, when somebody accesses a sub-collection of an entity, it is because there are a limited number of records, and it is acceptable to populate the collection. Thus, when you access:
catAccount.Cats
(regardless of what you do next), it is filling that collection. Your .Count() is then operating on the local in-memory collection. The problem is that you don't want that. Now you have three options:
check whether your provider offers some mechanism to make that a query rather than a collection
build the query dynamically
access the core data-model instead
I'm pretty confident that if you did:
catViewModel.NumberOfRecords =
catContext.Cats.Count(c => c.AccountId == accountId);
it will work just fine. Less convenient? Sure. But "works" is better than "convenient".

NHibernate: Object hierarchy and performance

I have a database with a Customer table. Each of these customers has a foreign key to an Installation table, which in turn has a foreign key to an Address table (tables renamed for simplicity).
In NHibernate I'm trying to query the Customer table like this:
ISession session = tx.Session;
var customers = session.QueryOver<Customer>().Where(x => x.Country == country);
var installations = customers.JoinQueryOver(x => x.Installation, JoinType.LeftOuterJoin);
var addresses = installations.JoinQueryOver(x => x.Address, JoinType.LeftOuterJoin);
if (installationType != null)
{
installations.Where(x => x.Type == installationType);
}
return customers.TransformUsing(new DistinctRootEntityResultTransformer()).List<Customer>();
Which results in a SQL query similar to this (captured by NHibernate Profiler):
SELECT *
FROM Customer this_
left outer join Installation installati1_
on this_.InstallationId = installati1_.Id
left outer join Address address2_
on installati1_.AddressId = address2_.Id
WHERE this_.CountryId = 4
and installati1_.TypeId = 1
When I execute the above SQL query in Microsoft SQL Server Management Studio, it executes in about 5 seconds and returns ~200,000 records. Retrieving the List when running the code, however, takes far longer: I've waited 10 minutes without any results. The debug log indicates that a lot of objects are constructed and initialized because of the object hierarchy. Is there a way to fix this performance issue?

I'm not sure what you are trying to do, but loading and saving 200,000 records through any OR mapper is not feasible. 200,000 objects will take a lot of memory and time to create. Depending on what you want to do, loading them in pages or running an update query directly on the database (stored procedure or named query) can fix your performance. Paging can be done with:
criteria.SetFirstResult(START).SetMaxResults(PAGESIZE);

NHibernate Profiler shows two times in the duration column x/y, with x being the time to execute the query and y the time to initialize the objects. The first step is to determine where the problem lies. If the query is slow, get the actual query sent to the database using SQL Profiler (assuming SQL Server) and check its performance in SSMS.
However, I suspect your issue may be the logging level. If you have the logging level set to DEBUG, NHibernate will generate very verbose logs and this will significantly impact performance.
Even if you can get it to perform well with 200000 records that's more than you can display to the user in a meaningful way. You should use paging/filtering to reduce the size of the result set.

Entity Framework include vs where

My database structure is this: an OptiUser belongs to multiple UserGroups through the IdentityMap table, which is a matching table (many to many) with some additional properties attached to it. Each UserGroup has multiple OptiDashboards.
I have a GUID string which identifies a particular user (wlid in this code). I want to get an IEnumerable of all of the OptiDashboards for the user identified by wlid.
Which of these two Linq-to-Entities queries is the most efficient? Do they run the same way on the back-end?
Also, can I shorten option 2's Include statements to just .Include("IdentityMaps.UserGroup.OptiDashboards")?
using (OptiEntities db = new OptiEntities())
{
// option 1
IEnumerable<OptiDashboard> dashboards = db.OptiDashboards
.Where(d => d.UserGroups
.Any(u => u.IdentityMaps
.Any(i => i.OptiUser.WinLiveIDToken == wlid)));
// option 2
OptiUser user = db.OptiUsers
.Include("IdentityMaps")
.Include("IdentityMaps.UserGroup")
.Include("IdentityMaps.UserGroup.OptiDashboards")
.Where(r => r.WinLiveIDToken == wlid).FirstOrDefault();
// then I would get the dashboards through user.IdentityMaps.UserGroup.OptiDashboards
// (through foreach loops...)
}

You may be misunderstanding what the Include function actually does. Option 1 is purely query syntax, which has no effect on what is returned by the Entity Framework. Option 2, with the Include function, instructs the Entity Framework to eagerly fetch the related rows from the database when it returns the results of the query.
So option 1 will result in some joins, but the "select" part of the query will be restricted to the OptiDashboards table.
Option 2 will result in joins as well, but in this case it will be returning the results from all the included tables, which obviously is going to introduce more of a performance hit. But at the same time, the results will include all the related entities you need, avoiding the [possible] need for more round-trips to the database.

I think the Include will render as joins, and you will be able to access the data from those tables in your user object (eager loading the properties).
The Any query will render as EXISTS and will not load the user object with info from the other tables.
For best performance, if you don't need the additional info, use the Any query.

As has already been pointed out, the first option would almost certainly perform better, simply because it would be retrieving less information. Besides that, I wanted to point out that you could also write the query this way:
var dashboards =
from u in db.OptiUsers where u.WinLiveIDToken == wlid
from im in u.IdentityMaps
from d in im.UserGroup.OptiDashboards
select d;
I would expect the above to perform similarly to the first option, but you may (or may not) prefer the above form.

Data Loading Strategy/Syntax in EF4

Long time lurker, first time posting, and newly learning EF4 and MVC3.
I need help making sure I'm using the correct data loading strategy in this case as well as some help finalizing some details of the query. I'm currently using the eager loading approach outlined here for somewhat of a "dashboard" view that requires a small amount of data from about 10 tables (all have FK relationships).
var query = from l in db.Leagues
.Include("Sport")
.Include("LeagueContacts")
.Include("LeagueContacts.User")
.Include("LeagueContacts.User.UserContactDatas")
.Include("LeagueEvents")
.Include("LeagueEvents.Event")
.Include("Seasons")
.Include("Seasons.Divisions")
.Include("Seasons.Divisions.Teams")
.Where(l => l.URLPart.Equals(leagueName))
select (l);
model = (Models.League) query.First();
However, I need to do some additional filtering, sorting, and shaping of the data that I haven't been able to work out. Here are my chief needs/concerns from this point:
Several child objects still need additional filtering but I haven't been able to figure out the syntax or best approach yet. Example: "TOP 3 LeagueEvents.Event WHERE StartDate >= getdate() ORDER BY LeagueEvents.Event.StartDate"
I need to sort some of the fields. Examples: ORDERBY Seasons.StartDate, LeagueEvents.Event.StartDate, and LeagueContacts.User.SortOrder, etc.
I'm already very concerned about the overall size of the SQL generated by this query and the number of joins, and am thinking that I may need a different data loading approach altogether. (Explicit loading? Multiple QueryObjects? POCO?)
Any input, direction, or advice on how to resolve these remaining needs as well as ensuring the best performance is greatly appreciated.

Your concerns about the size of the query and the size of the result set are justified.
As @BrokenGlass mentioned, EF doesn't allow you to filter or order on includes. If you want to order or filter relations, you must use a projection, either to an anonymous type or to a custom (non-mapped) type:
var query = db.Leagues
.Where(l => l.URLPart.Equals(leagueName))
.Select(l => new
{
League = l,
Events = l.LeagueEvents.Where(...)
.OrderBy(...)
.Take(3)
.Select(e => e.Event)
...
});
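To make the shape of such a projection concrete, here is a sketch run against in-memory data. The classes are hypothetical, simplified stand-ins for the mapped entities, and the filter and ordering are assumptions taken from the question's stated requirement (the top 3 events with StartDate on or after the current date, ordered by StartDate), not from the elided parts of the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the mapped entities.
public class Event
{
    public string Name { get; set; }
    public DateTime StartDate { get; set; }
}

public class LeagueEvent
{
    public Event Event { get; set; }
}

public class League
{
    public string URLPart { get; set; }
    public List<LeagueEvent> LeagueEvents { get; set; } = new List<LeagueEvent>();
}

public static class UpcomingEventsSketch
{
    // Names of the next three events starting at or after 'now', soonest first.
    // Returns an empty list when no league matches leagueName.
    public static List<string> NextThreeEventNames(
        IQueryable<League> leagues, string leagueName, DateTime now)
    {
        var shaped = leagues
            .Where(l => l.URLPart == leagueName)
            .Select(l => new
            {
                League = l,
                Events = l.LeagueEvents
                    .Where(le => le.Event.StartDate >= now)
                    .OrderBy(le => le.Event.StartDate)
                    .Take(3)
                    .Select(le => le.Event)
            })
            .FirstOrDefault();

        return shaped == null
            ? new List<string>()
            : shaped.Events.Select(e => e.Name).ToList();
    }
}
```

The filtering, ordering, and Take all live inside the projection, which is the part an Include cannot express.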

Unfortunately, EF doesn't allow you to selectively load related entities using its navigation properties; it will always load all Foos if you specify Include("Foo").
You will have to do a join on each of the related entities, using your Where() clauses as filters where they apply.
