optimize linq query with related entities

optimize linq query with related entities - c#

i am new to linq, i started writing this query:
var dProjects = Projects
.Select(p => new Models.Project {
ProjectID = p.ProjectID,
Status = p.Status,
ExpiresOn = p.ExpiresOn,
LatestComments = p.ProjectComments
.OrderByDescending(pc => pc.CreatedOn)
.Select(pc => pc.Comments)
.FirstOrDefault(),
ProjectFileIDs = p.ProjectFiles
.Select(pf => pf.BinaryFileID)
.AsQueryable()
})
.AsQueryable<Models.Project>();
I already know this query will perform really slow because related entities like ProjectComments and ProjectFiles will create nested selects, though it works and gives me right results that i need.
How can i optimize this query and get the same results? One of my guesses would be using inner join but ProjectComments and ProjectFiles already has a relationship in database through keys, so not sure what we can achieve by setting the relationship again.
Basically, need to know which is the best approach to take here from performance perspective. One thing to note is i am sorting ProjectComments and only taking the most recent one. Should i be using combination of join and group by into? Help will be much appreciated. Thanks.
UPDATED:
Sorry, if i wasn't clear enough on what i am trying to do. Basically, in front end, i have a grid, which shows list of projects with latest project comments and list of all the files associated to project, so users can click on those links and actually open those documents. So the query that i have above is working and it does show the following in the grid:
Project ID (From Project table)
Status (From Project table)
ExpiresOn (From Project table)
LatestComments (latest entry from ProjectComments table which has project ID as foreign key)
ProjectFileIDs (list of file ids from ProjectFiles table which has Project ID as foreign key - i am using those File IDs and creating links so users can open those files).
So everything is working, i have it all setup, but the query is little slow. Right now we have very little data (only test data), but once this is launched, i am expecting lot of users/data and thus i want to optimize this query to the best, before it goes live. So, the goal here is to basically optimize. I am pretty sure this is not the best approach, because this will create nested selects.

In Entity Framework, you can drastically improve the performance of the queries by returning the objects back as an object graph instead of a projection. Entity Framework is extremely efficient at optimizing all but the most complex SQL queries, and can take advantage of deferred "Eager" loading vs. "Lazy" Loading (not loading related items from the db until they are actually accessed). This MSDN reference is a good place to start.
As far as your specific query is concerned, you could use this technique something like the following:
var dbProjects = yourContext.Projects
.Include(p => p.ProjectComments
.OrderByDescending(pc => pc.CreatedOn)
.Select(pc => pc.Comments)
.FirstOrDefault()
)
.Include(p => p.ProjectFileIDs)
.AsQueryable<Models.Project>();
note the .Include() being used to imply Eager Loading.
From the MDSN Reference on Loading Related Objects,
Performance Considerations
When you choose a pattern for loading related entities, consider the behavior of each approach with regard to the number and timing of connections made to the data source versus the amount of data returned by and the complexity of using a single query. Eager loading returns all related entities together with the queried entities in a single query. This means that, while there is only one connection made to the data source, a larger amount of data is returned in the initial query. Also, query paths result in a more complex query because of the additional joins that are required in the query that is executed against the data source.
Explicit and lazy loading enables you to postpone the request for related object data until that data is actually needed. This yields a less complex initial query that returns less total data. However, each successive loading of a related object makes a connection to the data source and executes a query. In the case of lazy loading, this connection occurs whenever a navigation property is accessed and the related entity is not already loaded.

Do you get any boost in performance if you add Include statements before the Select?
Example:
var dProjects = Projects
.Include(p => p.ProjectComments)
.Include(p => p.ProjectFiles)
Include allows all matching ProjectComments and ProjectFiles to be eagerly loaded. See Loading Related Entities for more details.

Related

LINQ Include slowing down performance when searching

We have the following method that allows us to search a table of Projects for a DataGrid:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
var projects = _context.Projects.Where(p => p.Current);
projects.Include(p => p.Client);
projects.Include(p => p.Architect);
projects.Include(p => p.ProjectManager);
if (!string.IsNullOrEmpty(searchString))
{
projects = projects
.Where(p => p.NormalizedFullProjectName.Contains(searchString)
|| p.Client.NormalizedName.Contains(searchString)
|| p.Architect.NormalizedFullName.Contains(searchString)
|| p.ProjectManager.NormalizedFullName.Contains(searchString));
}
projects = projects.OrderBy(p => p.Name).Take(10);
return await projects.ToListAsync();
}
If we do not use the Include on the projects then the searching is instantaneous. However, after adding them in the search can take over 3 seconds.
We need to include the other Entities to allow the Users to search on them should they want to.
How are we able to improve performance but still keep the Include to allow searching on them?
Without Incldue the method looks like so:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
var projects = _context.Projects.Where(p => p.Current);
if (!string.IsNullOrEmpty(searchString))
{
projects = projects
.Where(p => p.Name.Contains(searchString));
}
projects = projects.OrderBy(p => p.Name).Take(10);
return await projects.ToListAsync();
}
Without Include the performance looks like so:
With Include:

The short answer is that including all the extra entities takes time and effort, thus increasing the load times.
However, there is a flaw in your assumption:
We need to include the other Entities to allow the Users to search on them should they want to.
That is not (necessarily) correct. Filtering happens on the database level. Include tells Entity Framework to load the records from the database. These are two separate things.
Look at the following examples:
_context.Projects
.Include(p => p.Architect)
.Where(p => p.Architect.Name == "Bob")
.ToList()
This will give you a list of projects (and their architects) who have an architect named Bob.
_context.Projects
.Where(p => p.Architect.Name == "Bob")
.ToList()
This will give you a list of projects (without architects) who have an architect named Bob; but it does not actually load the Architect object into memory.
_context.Projects
.Include(p => p.Architect)
.ToList()
This will give you a list of projects (and their architects). It will contain every project, the list is not filtered.
You only need to use Include when you want to do in-memory filtering, i.e. on a collection that was already loaded from the database.
Whether that is the case for you depends on this part:
projects = projects
.Where(p => p.NormalizedFullProjectName.Contains(searchString)
|| p.Client.NormalizedName.Contains(searchString)
|| p.Architect.NormalizedFullName.Contains(searchString)
|| p.ProjectManager.NormalizedFullName.Contains(searchString));
If NormalizedFullProjectName (and the other properties) are database columns, then EF is able to perform the filtering at the database level. You do not need the Include for filtering the items.
If NormalizedFullProjectName (and the other properties) are not database columns, then EF will first have to load the items in memory before it can apply the filter. In this case, you do need the Include, because the architects (and others) need to be loaded in memory.
If you are only loading the related entities for filtering purposes (not display purposes), and you are doing the filtering on the database level; then you can simply remove the include statements.
If you need those related entities to be loaded (for in-memory filtering, or for display purposes), then you can't easily remove the Include statements, unless you write a Select that specifies the fields you need.
For example:
_context.Projects
.Select(p => new { Project = p, ArchitectName = p.Architect.Name })
.ToList()
This will load the project entities (in their entirety) but only the name of the architect and none of the other properties. This can be a significant performance boost if your related entities have many properties that you currently do not need.
Note: The current example uses an anonymous type. I generally advocate creating a custom type for this; but that's unrelated to the performance issue we're addressing here.
Update
Based on your update, you seemingly imply that the intended filtering happens after the objects have been loaded from the database.
This is the source of your performance problems. You are fetching a lot of data but only show part of it. The data that does not get shown still needs to be loaded, which is wasted effort.
There are separate arguments for performance here:
Load everything once - Load all the data once (which might take a long time), but then allow the user to filter the loaded data (which is very fast)
Load chunks - Only load the data that matches the applied filters. If the user changes the filters, you load the data again. The first load will be much faster, but the subsequent filtering actions will take longer compared to in-memory filtering.
What you should do here is not my decision. It's a matter of priorities. Some customers prefer one over the other. I would say that in most cases, the second option (loading chunks) is the better option here, as it prevents needlessly loading a massive dataset if the user never looks through 90% of it. That's a waste of performance and network load.
The answer I gave applies to the "load chunks" approach.
If you decide to take the "load everything once" approach, then you will have to accept the performance hit of that initial load. The best you can do is severely limit the returned data columns (like I showed with the Select) in order to minimize the performance/network cost.
I see no reasonable argument to mix these two approaches. You'll end up with both drawbacks.

How to explicit loading long navigation properties' chain using Entity Framework?

I have this code to explicit loading for an entity:
dbContext.StorageRequests.Add(storageRequest);
dbContext.SaveChanges();
//Here I want to explict loading some navigation properties
dbContext.Entry(storageRequest).Reference(c => c.Manager).Load();
dbContext.Entry(storageRequest).Reference(c => c.Facility).Load();
dbContext.Entry(storageRequest).Collection(x=> x.PhysicalObjects).Query().Include(x => x.Classification).Load();
My question is two parts:
The first one how can I load all together (I want to call Load() once)?
The second part does the above code sends query for each Load() calling which in turn hit the database to load related data?

I had a similar question withEF core. Turning on SQL logging to the debugoutput window helped answer a lot of my questions as to what it was doing, and why. In terms of your questions:
1) You can't, though you can eager load it with a series of dbContext.Collection.Include(otherCollection).ThenInclude(stuffRelatedToOtherCollection) type chains
2) Yes it does, even eager loading in one c# statement bangs out multiple queries. I presumed this was because it's too much of a complex artificial intelligence problem to do it any way other than its most naive multiple-sql, because it's hard for the framework to deal with cartesian products when multiple tables are joined together in one rectangular dataset. (A school has students and teachers, teacher:students is a many:many relationship, decomposed by class. If you wrote one query to join school, class, student and teachers, you'd get repeated data all over the place and though conceptually possible to pick through it looking for unique school, class teacher and student primary key values, you could be downloading tens of thousands of duplicated rows only to have to unique them all again. EF tends to select the school ,then school join class, then school join class join students, then school join class join teachers (if that's how you coded your school include class theninclude students then include teachers. Changing your include strategy will change the queries that are run)

Nice question! Let me answer differently, in reverse order, with new info.
2.)
Each Load() will cause a query to the database as of the documentation (Querying and Finding Entities - 10/23/2016):
A query is executed against the database when:
It is enumerated by a foreach (C#) or For Each (Visual Basic) statement.
It is enumerated by a collection operation such as ToArray, ToDictionary, or ToList.
LINQ operators such as First or Any are specified in the outermost part of the query.
The following methods are called: the Load extension method on a DbSet, DbEntityEntry.Reload, and Database.ExecuteSqlCommand.
People often uses eager loading with Include() to let EF optimize as much as possible:
in most cases, EF will combine the joins when generating SQL
// ef 6
using System.Data.Entity;
var storageRequests = dbContext.StorageRequests
.Include(r => r.PhysicalObjects.Select(p => p.Classification))
.Include(r => r.Manager)
.Include(r => r.Facility);
// evaluate "storageRequests" here by linq method or foreach
or:
// ef core
var storageRequests = dbContext.StorageRequests
.Include(r => r.PhysicalObjects)
.ThenInclude(p => p.Classification)
.Include(r => r.Manager)
.Include(r => r.Facility);
// evaluate "storageRequests" here by linq method or foreach
1.)
Only possible way I can imagine is having above code with storageRequests.Load().
You could inspect whether it:
generates single/multiple queries,
loads navigation property data along StorageRequest.
FYI: these query executions are also called network roundtrips in microsoft docs:
Multiple network roundtrips can degrade performance, especially where latency to the database is high (for example, cloud services).
Point of interest:
There is a relative new option Single vs. Split Queries (10/03/2019) in .Net Core 5.
The default is single queries (behavior described above). After that you can decide to request/load data per table instead, by adding .AsSplitQuery() to your linq query, before the evaluation. Splitted queries increases roundtrips and memory usage (does not loads distinct data) but helps performance.
There is also .AsSingleQuery() if your global choice was:
.UseSqlServer(
connectionString,
o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));

Mark some child entities as "Never Load" in LINQ to Entities query

TL;DR
I want to write a query in LINQ to Entities and tell it that I'll never load the child entities of an entity. How do I do that without projecting?
Eg,
return (from a in this.Db.Assets
join at in this.Db.AssetTypes on a.AssetTypeId equals at.AssetTypeId
join ast in this.Db.AssetStatuses on a.AssetStatusId equals ast.AssetStatusId
select new {
a = a,
typeDesc = at.AssetTypeDesc,
statusDesc = ast.AssetStatusDesc
}).ToList().Select(anon => new AssetViewModel(anon.a, anon.typeDesc, anon.statusDesc)).ToList();
I want the entity called Asset pulled into a on the anonymous type I'm defining, and when I call ToList(), I don't want the Assets' children, Status and Type, to lazy load.
EDIT: After some random Visual Studio autcomplete investigation, much of this can be accomplished by turning off lazy loading in the DbContext:
this.Db.Configuration.LazyLoadingEnabled = false;
Unfortunately, if your work with the query results does have a few child tables, even with LazyLoadingEnabled turned off, things may still "work" for some subset of them iff the data for those children has already been loaded earlier in this DbContext -- that is, if those children have already had their context cached -- which can make for some surprising and temporarily confusing results.
That is to say, I want to explicitly load some children at query time and completely sever any relationship to other child entities.
Best would be some way to actively load some entities and to ignore the rest. That is, I could call ToList() and not have to worry about throwing off lots of db connections.
Context
I have a case where I'm hydrating a view model with the results of a LINQ to Entities query from an entity called Asset. The Asset table has a couple of child tables, Type and Status. Both Type and Status have Description fields, and my view model contains both descriptions in it. Let's pretend that's as complicated as this query gets.
So I'd like to pull everything from the Asset table joined to Type and Status in one database query, during which I pull the Type and Status descriptions. In other words, I don't want to lazy load that info.
WET (Woeful Entity reTranscription?)
What we're doing now, which does exactly what I want from a connection standpoint, is the usual .Select into the view model, with a tedious field matchup.
return (from a in this.Db.Assets
join at in this.Db.AssetTypes on a.AssetTypeId equals at.AssetTypeId
join ast in this.Db.AssetStatuses on a.AssetStatusId equals ast.AssetStatusId
select new AssetViewModel
{
AssetId = a.AssetId,
// *** LOTS of fields from Asset removed ***
AssetStatusDesc = ast.AssetStatusDesc,
AssetTypeDesc = at.AssetTypeDesc
}).ToList();
That's good in that the Status and Type child entities of Asset are never accessed, and there's no lazy load. The SQL is one join in one database hit for all the assets. Perfect.
The worry is all the repeated jive in // *** LOTS of fields from Asset removed ***. Currently, we've got that projection in every freakin query, which obviously isn't DRY. And it means that when the Asset table changes, it's rare that the new field is included in every projection (because humans), which stinks.
I don't see a quick way around the query, btw. If I want to do it in a single query, I have to have the joins. I could add wheres to it in separate methods, but I'm not sure how I'd skip the projection each time. Or I could add joins to the query in cascading methods, but then my projection is still "repository bound", which isn't best case if I'm using these sorts of queries elsewhere. But I'm betting I'm stupiding something here.
Dumb
When I tried adding a cast to my view model from asset and changing to something like this, which is beautiful from a code standpoint, though I get bitten by lazy loading -- two extra database hits per Asset, one for Status and one for Type.
return (from a in this.Db.Assets
select a).ToList().Select(asset => (AssetViewModel)asset).ToList();
Just as we would expect, since I'm using lines like...
AssetTypeDesc = a.AssetType.AssetTypeDesc,
... inside of the casting code. So that was dumb. Concise, reusable, but dumb. This is why we hate folks who use ORMs without checking the SQL. ;^)
Overly clever, sorta
But then I tried getting too clever, with a new constructor for the view model that took the asset entity & the two description values as strings, which ended up with the same lazy load issue (because, duh, the first ToList() before selecting the anonymous objects means we don't know how the Assets are going to be used, and we're stuck pulling back everything to be safe (I assume)).
//Use anon type to skirt "Only parameterless constructors
//and initializers are supported in LINQ to Entities,"
//issue.
return (from a in this.Db.Assets
join at in this.Db.AssetTypes on a.AssetTypeId equals at.AssetTypeId
join ast in this.Db.AssetStatuses on a.AssetStatusId equals ast.AssetStatusId
select new {
a = a,
typeDesc = at.AssetTypeDesc,
statusDesc = ast.AssetStatusDesc
}).ToList().Select(anon => new AssetViewModel(anon.a, anon.typeDesc, anon.statusDesc)).ToList();
If only there was some way to say, "cast these anonymous objects to a List, but don't lazy load the Asset's children while you're doing it." <<< That's my question, natch.
I've read some about DataLoadOptions.LoadWith(), which probably provides an okay solution, and I might end up just doing that, but that's not precisely what I'm asking. I think that's a global-esque setting (? I think just for the life of the data context, which should be the single controller interaction), which I might not necessarily want to set. I may also want ObjectTrackingEnabled = false, but I'm not grokking yet.
I also don't want to use an automapper.

Painfully, after some random Visual Studio autcomplete investigation, this might be as easy as turning off lazy loading in your DbContext:
this.Db.Configuration.LazyLoadingEnabled = false;
The wacky thing is that if your work with the query results does have a few child tables, even with LazyLoadingEnabled turned off, things may still "work" for some subset of them iff the data for those children has already been loaded earlier in this DbContext -- that is, if those children have already had their context cached -- which can make for some surprising and temporarily confusing results.
Better would be to be able to cherry pick what children are "lazy-loading eligible".
I may need to update the question to make it cover this variation of the original question.

Entity Framework 5 performance concerns

Right now I'm working on a pretty complex database. Our object model is designed to be mapped to the database. We're using EF 5 with POCO classes, manually generated.
Everything is working, but there's some complaining about the performances. I've never had performance problems with EF so I'm wondering if this time I just did something terribly wrong, or the problem could reside somewhere else.
The main query may be composed of dynamic parameters. I have several if and switch blocks that are conceptually like this:
if (parameter != null) { query = query.Where(c => c.Field == parameter); }
Also, for some complex And/Or combinations I'm using LinqKit extensions from Albahari.
The query is against a big table of "Orders", containing years and years of data. The average use is a 2 months range filter though.
Now when the main query is composed, it gets paginated with a Skip/Take combination, where the Take is set to 10 elements.
After all this, the IQueryable is sent through layers, reaches the MVC layer where Automapper is employed.
Here, when Automapper starts iterating (and thus the query is really executed) it calls a bunch of navigation properties, which have their own navigation properties and so on. Everything is set to Lazy Loading according to EF recommendations to avoid eager loading if you have more than 3 or 4 distinct entities to include. My scenario is something like this:
Orders (maximum 10)
Many navigation properties under Order
Some of these have other navigation under them (localization entities)
Order details (many order details per order)
Many navigation properties under each Order detail
Some of these have other navigation under them (localization entities)
This easily leads to a total of 300+ queries for a single rendered "page". Each of those queries is very fast, running in a few milliseconds, but still there are 2 main concerns:
The lazy loaded properties are called in sequence and not parallelized, thus taking more time
As a consequence of previous point, there's some dead time between each query, as the database has to receive the sql, run it, return it and so on for each query.
Just to see how it went, I tried to make the same query with eager loading, and as I predicted it was a total disaster, with a translated sql of more than 7K lines (yes, seven thousands) and way more slow overall.
Now I'm reluctant to think that EF and Linq are not the right choice for this scenario. Some are saying that if they were to write a stored procedure which fetches all the needed data, it would run tens of times faster. I don't believe that to be true, and we would lose the automatic materialization of all related entities.
I thought of some things I could do to improve, like:
Table splitting to reduce the selected columns
Turn off object tracking, as this scenario is read only (have untracked entities)
With all of this said, the main complaint is that the result page (done in MVC 4) renders too slowly, and after a bit of diagnostics it seems all "Server Time" and not "Network Time", taking about from 8 to 12 seconds of server time.
From my experience, this should not be happening. I'm wondering if I'm approaching this query need in a wrong way, or if I have to turn my attention to something else (maybe a bad configured IIS server, or anything else I'm really clueless). Needles to say, the database has its indexes ok, checked very carefully by our dba.
So if anyone has any tip, advice, best practice I'm missing about this, or just can tell me that I'm dead wrong in using EF with Lazy Loading for this scenario... you're all welcome.

For a very complex query that brings up tons of hierarchical data, stored procs won't generally help you performance-wise over LINQ/EF if you take the right approach. As you've noted, the two "out of the box" options with EF (lazy and eager loading) don't work well in this scenario. However, there are still several good ways to optimize this:
(1) Rather than reading a bunch of entities into memory and then mapping via automapper, do the "automapping" directly in the query where possible. For example:
var mapped = myOrdersQuery.Select(o => new OrderInfo { Order = o, DetailCount = o.Details.Count, ... })
// by deferring the load until here, we can bring only the information we actually need
// into memory with a single query
.ToList();
This approach works really well if you only need a subset of the fields in your complex hierarchy. Also, EF's ability to select hierarchical data makes this much easier than using stored procs if you need to return something more complex than flat tabular data.
(2) Run multiple LINQ queries by hand and assemble the results in memory. For example:
// read with AsNoTracking() since we'll be manually setting associations
var myOrders = myOrdersQuery.AsNoTracking().ToList();
var orderIds = myOrders.Select(o => o.Id);
var myDetails = context.Details.Where(d => orderIds.Contains(d.OrderId)).ToLookup(d => d.OrderId);
// reassemble in memory
myOrders.ForEach(o => o.Details = myDetails[o.Id].ToList());
This works really well when you need all the data and still want to take advantage of as much EF materialization as possible. Note that, in most cases a stored proc approach can do no better than this (it's working with raw SQL, so it has to run multiple tabular queries) but can't reuse logic you've already written in LINQ.
(3) Use Include() to manually control which associations are eager-loaded. This can be combined with #2 to take advantage of EF loading for some associations while giving you the flexibility to manually load others.

Try to think of an efficient yet simple sql query to get the data for your views.
Is it even possible?
If not, try to decompose (denormalize) your tables so that less joins is required to get data. Also, are there efficient indexes on table colums to speed up data retrieval?
If yes, forget EF, write a stored procedure and use it to get the data.
Turning tracking off for selected queries is a-must for a read-only scenario. Take a look at my numbers:
http://netpl.blogspot.com/2013/05/yet-another-orm-micro-benchmark-part-23_15.html
As you can see, the difference between tracking and notracking scenario is significant.
I would experiment with eager loading but not everywhere (so you don't end up with 7k lines long query) but in selected subqueries.

One point to consider, EF definitely helps make development time much quicker. However, you must remember that when you're returning lots of data from the DB, that EF is using dynamic SQL. This means that EF must 1. Create the SQL, 2.SQL Server then needs to create an execution plan. this happens before the query is run.
When using stored procedures, SQL Server can cache the execution plan (which can be edited for performance), which does make it faster than using EF. BUT... you can always create your stored proc and then execute it from EF. Any complex procedures or queries I would convert to stored procs and then call from EF. Then you can see your performance gain(s) and reevaluate from there.

In some cases, you can use Compiled Queries MSDN to improve query performance drastically. The idea is that if you have a common query that is run many times that might generate the same SQL call with different parameters, you compile the query tie first time it's run then pass it as a delegate, eliminating the overhead of Entity Framework re-generating the SQL for each subsequent call.

How eager does LINQ and Entity Framework load by default?

I'm new to LINQ and the Entity Framework. I've been fetching collections from the database using the following:
var Publications = from pubs in db.RecurringPublications
select pubs;
The Publications table is linked to other tables via foreign keys. I've been using this to reference properties like this:
Publications.Single().LinkedTable.LinkedTableColumn
and sometimes even further down the chain:
Publications.Single().LinkedTable.LinkedTable.LinkedLinkedTableColumn
I know you can specify lazy loading or eager loading, I was wondering how it's handled by default. Is there a maximum depth by default? Does it figure out how many joins to use at compile time?

It's only going to eager load what's in that specific table.
var Publications = from pubs in db.RecurringPublications
select pubs;
Will only get the data from your RecurringPublications table. You can specify if you want to load additional properties, but if you don't specify anything, it will only give you exactly what you ask for - nothing more.
Publications.Single().LinkedTable.LinkedTableColumn
Is lazy loading your LinkedTableColumn - now if your return is Queryable (and it is so far), it's going to do a join and return a single SQL query.
However, if the call has already been enumerated, it will make a second call.
Blog post to MSDN for info

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.