We have the following method that allows us to search a table of Projects for a DataGrid:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
    var projects = _context.Projects.Where(p => p.Current);

    projects = projects.Include(p => p.Client);
    projects = projects.Include(p => p.Architect);
    projects = projects.Include(p => p.ProjectManager);

    if (!string.IsNullOrEmpty(searchString))
    {
        projects = projects
            .Where(p => p.NormalizedFullProjectName.Contains(searchString)
                || p.Client.NormalizedName.Contains(searchString)
                || p.Architect.NormalizedFullName.Contains(searchString)
                || p.ProjectManager.NormalizedFullName.Contains(searchString));
    }

    projects = projects.OrderBy(p => p.Name).Take(10);
    return await projects.ToListAsync();
}
If we do not use the Includes on the projects, the searching is instantaneous; after adding them, the search can take over 3 seconds.
We need to include the other entities to allow the users to search on them should they want to.
How can we improve performance while still keeping the Includes, so that searching on the related entities remains possible?
Without Include the method looks like this:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
    var projects = _context.Projects.Where(p => p.Current);

    if (!string.IsNullOrEmpty(searchString))
    {
        projects = projects
            .Where(p => p.Name.Contains(searchString));
    }

    projects = projects.OrderBy(p => p.Name).Take(10);
    return await projects.ToListAsync();
}
(Screenshots comparing the timings without and with Include were attached here.)
The short answer is that including all the extra entities takes time and effort, thus increasing the load times.
However, there is a flaw in your assumption:
We need to include the other Entities to allow the Users to search on them should they want to.
That is not (necessarily) correct. Filtering happens on the database level. Include tells Entity Framework to load the records from the database. These are two separate things.
Look at the following examples:
_context.Projects
    .Include(p => p.Architect)
    .Where(p => p.Architect.Name == "Bob")
    .ToList()
This will give you a list of projects (and their architects) that have an architect named Bob.
_context.Projects
    .Where(p => p.Architect.Name == "Bob")
    .ToList()
This will give you a list of projects (without architects) that have an architect named Bob; it does not actually load the Architect objects into memory.
_context.Projects
    .Include(p => p.Architect)
    .ToList()
This will give you a list of projects (and their architects). It will contain every project; the list is not filtered.
You only need to use Include when you want to do in-memory filtering, i.e. on a collection that was already loaded from the database.
Whether that is the case for you depends on this part:
projects = projects
    .Where(p => p.NormalizedFullProjectName.Contains(searchString)
        || p.Client.NormalizedName.Contains(searchString)
        || p.Architect.NormalizedFullName.Contains(searchString)
        || p.ProjectManager.NormalizedFullName.Contains(searchString));
If NormalizedFullProjectName (and the other properties) are database columns, then EF is able to perform the filtering at the database level. You do not need the Include for filtering the items.
If NormalizedFullProjectName (and the other properties) are not database columns, then EF will first have to load the items in memory before it can apply the filter. In this case, you do need the Include, because the architects (and others) need to be loaded in memory.
If you are only loading the related entities for filtering purposes (not display purposes), and you are doing the filtering at the database level, then you can simply remove the Include statements.
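In that first case (the Normalized* properties are mapped columns), the method from the question could be trimmed to something like the following sketch, with the Includes simply dropped:

```csharp
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
    var projects = _context.Projects.Where(p => p.Current);

    if (!string.IsNullOrEmpty(searchString))
    {
        // EF translates these navigation property accesses into SQL joins,
        // so the filter runs on the database without loading the related entities
        projects = projects
            .Where(p => p.NormalizedFullProjectName.Contains(searchString)
                || p.Client.NormalizedName.Contains(searchString)
                || p.Architect.NormalizedFullName.Contains(searchString)
                || p.ProjectManager.NormalizedFullName.Contains(searchString));
    }

    return await projects.OrderBy(p => p.Name).Take(10).ToListAsync();
}
```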
If you need those related entities to be loaded (for in-memory filtering or for display purposes), then you can't easily remove the Include statements, unless you write a Select that specifies the fields you need.
For example:
_context.Projects
    .Select(p => new { Project = p, ArchitectName = p.Architect.Name })
    .ToList()
This will load the project entities (in their entirety) but only the name of the architect and none of the other properties. This can be a significant performance boost if your related entities have many properties that you currently do not need.
Note: The current example uses an anonymous type. I generally advocate creating a custom type for this; but that's unrelated to the performance issue we're addressing here.
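For illustration, a sketch of the same projection with a named type instead of the anonymous one (the ProjectSummary class is made up for this example):

```csharp
public class ProjectSummary
{
    public Project Project { get; set; }
    public string ArchitectName { get; set; }
}

var summaries = _context.Projects
    .Select(p => new ProjectSummary
    {
        Project = p,
        // only the architect's name column is fetched, not the whole entity
        ArchitectName = p.Architect.Name
    })
    .ToList();
```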
Update
Based on your update, you seemingly imply that the intended filtering happens after the objects have been loaded from the database.
This is the source of your performance problems. You are fetching a lot of data but only show part of it. The data that does not get shown still needs to be loaded, which is wasted effort.
There are two competing approaches to performance here:
Load everything once - Load all the data once (which might take a long time), but then allow the user to filter the loaded data (which is very fast)
Load chunks - Only load the data that matches the applied filters. If the user changes the filters, you load the data again. The first load will be much faster, but the subsequent filtering actions will take longer compared to in-memory filtering.
What you should do here is not my decision. It's a matter of priorities. Some customers prefer one over the other. I would say that in most cases, the second option (loading chunks) is the better option here, as it prevents needlessly loading a massive dataset if the user never looks through 90% of it. That's a waste of performance and network load.
The answer I gave applies to the "load chunks" approach.
If you decide to take the "load everything once" approach, then you will have to accept the performance hit of that initial load. The best you can do is severely limit the returned data columns (like I showed with the Select) in order to minimize the performance/network cost.
I see no reasonable argument to mix these two approaches. You'll end up with both drawbacks.
Related
I have the method below to load dependent data from navigation property. However, it generates an error. I can remove the error by adding ToList() or ToArray(), but I'd rather not do that for performance reasons. I also cannot set the MARS property in my web.config file because it causes a problem for other classes of the connection.
How can I solve this without using extension methods or editing my web.config?
public override void Load(IEnumerable<Ques> data)
{
    if (data.Any())
    {
        foreach (var pstuu in data)
        {
            if (pstuu?.Id_user != null)
            {
                db.Entry(pstuu).Reference(q => q.Users).Load();
            }
        }
    }
}
I take it from this question you've got a situation something like:
// (outside code)
var query = db.SomeEntity.Where(x => x.SomeCondition == someCondition);
LoadDependent(query);
Chances are there's a call stack of various methods that build up search expressions and such, but ultimately what gets passed into LoadDependent() is an IQueryable<TEntity>.
If you instead call:
// (outside code)
var query = db.SomeEntity.Where(x => x.SomeCondition == someCondition);
var data = query.ToList();
LoadDependent(data);
Or, inside your LoadDependent(), doing something like:
base.LoadDependent(data);
data = data.ToList();
or better,
foreach (Ques qst in data.ToList())
Then your LoadDependent() call works, but in the first example you get an error that a DataReader is already open. This is because the foreach as-is iterates over the IQueryable, which leaves EF's data reader open, so no further calls can be made on db (which I assume is a module-level DbContext variable that is injected) until that reader is closed.
Replacing this:
db.Entry(qst).Reference(q => q.AspNetUsers).Load();
with this:
db.Entry(qst).Reference(q => q.AspNetUsers).LoadAsync();
... does not actually work. This just dispatches the load call asynchronously; without awaiting it, it too would fail, just raising the exception on a continuation rather than the calling thread.
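If asynchronous loading is genuinely what you want, the call would have to be awaited, which means making the containing method async; roughly like this (the signature change is an assumption on my part):

```csharp
public override async Task LoadAsync(IEnumerable<Ques> data)
{
    // materialize first so EF's data reader is closed before issuing new queries
    foreach (var qst in data.ToList())
    {
        if (qst?.Id_user != null)
            await db.Entry(qst).Reference(q => q.AspNetUsers).LoadAsync();
    }
}
```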
As mentioned in the comments to your question this is a very poor design choice to handle loading references. You are far, far better off enabling lazy loading and taking the Select n+1 hit if/when a reference is actually needed if you aren't going to implement the initial fetch properly with either eager loading or projection.
Code like this forces a Select n+1 pattern throughout your code.
A good example of loading a Ques with its associated User eager loaded:
var ques = db.Ques
    .Include(x => x.AspNetUsers)
    .Where(x => x.SomeCondition == someCondition)
    .ToList();
Whether "SomeCondition" results in 1 Ques returned or 1000, the data is fetched with a single query to the DB.
Select n+1 scenarios are bad because in the case where 1000 Ques are returned with a call to fetch dependencies you get:
var ques = db.Ques
    .Where(x => x.SomeCondition == someCondition)
    .ToList(); // 1 query

foreach (var q in ques)
    db.Entry(q).Reference(x => x.AspNetUsers).Load(); // 1 query x 1000
1001 queries run. This compounds with each reference you want to load.
This looks even more problematic when later code wants to offer pagination, such as taking only 25 items where the total record count could run into the tens of thousands or more. This is where lazy loading would be the lesser of two Select n+1 evils: with lazy loading you know that AspNetUsers would only be selected if a returned Ques actually references it, and only for those Ques that actually reference it. So if the pagination only "touched" 25 rows, lazy loading would result in 26 queries. Lazy loading is a trap, however, as later code changes could inadvertently lead to performance issues appearing in seemingly unrelated areas, when new references or code changes result in far more references being "touched", each kicking off a query.
If you are going to pursue a LoadDependent() type method, then you need to ensure that it is called as late as possible, once you have a known set size to load (i.e. after pagination), because you will need to materialize the collection to load related entities with the same DbContext instance. Trying to work around it using detached instances (AsNoTracking()) or a completely new DbContext instance may give you some headway, but it will invariably lead to more problems later, as you will have a mix of tracked and untracked entities, or worse, entities tracked by different DbContexts, depending on how these loaded entities are consumed.
An alternative teams pursue is rather than a LoadReference() type method would be an IncludeReference() type method. The goal here being to build .Include statements into the IQueryable. This can be done two ways, either by magic strings (property names) or by passing in expressions for the references to include. Again this can turn into a bit of a rabbit hole when handling more deeply nested references. (I.e. building .Include().ThenInclude() chains.) This avoids the Select n+1 issue by eager loading the required related data.
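A minimal sketch of the expression-based variant, where callers pass in the references they want eager-loaded (the helper name is illustrative, and deeper .ThenInclude() chains are deliberately not handled here):

```csharp
public static IQueryable<TEntity> IncludeReferences<TEntity>(
    this IQueryable<TEntity> query,
    params Expression<Func<TEntity, object>>[] includes) where TEntity : class
{
    // append one .Include per requested reference
    foreach (var include in includes)
        query = query.Include(include);
    return query;
}

// usage: eager-load AspNetUsers without a Select n+1
var ques = db.Ques
    .IncludeReferences(x => x.AspNetUsers)
    .Where(x => x.SomeCondition == someCondition)
    .ToList();
```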
I solved the problem by deleting the Load method and using Include() in my initial data query to load the reference data in the navigation property.
I have this code to explicit loading for an entity:
dbContext.StorageRequests.Add(storageRequest);
dbContext.SaveChanges();
// Here I want to explicitly load some navigation properties
dbContext.Entry(storageRequest).Reference(c => c.Manager).Load();
dbContext.Entry(storageRequest).Reference(c => c.Facility).Load();
dbContext.Entry(storageRequest).Collection(x => x.PhysicalObjects).Query().Include(x => x.Classification).Load();
My question has two parts:
First, how can I load all of them together (I want to call Load() only once)?
Second, does the above code send a query for each Load() call, hitting the database once per related entity?
I had a similar question with EF Core. Turning on SQL logging to the debug output window helped answer a lot of my questions as to what it was doing, and why. In terms of your questions:
1) You can't, though you can eager load it with a series of dbContext.Collection.Include(otherCollection).ThenInclude(stuffRelatedToOtherCollection) type chains
2) Yes it does; even eager loading in one C# statement bangs out multiple queries. I presume this is because doing it any way other than the naive multiple-SQL approach is too hard for the framework: when multiple tables are joined together into one rectangular dataset, you get a cartesian product. (A school has students and teachers; teacher:students is a many:many relationship, decomposed by class. If you wrote one query joining school, class, student and teacher, you'd get repeated data all over the place, and though it's conceptually possible to pick through it looking for unique school, class, teacher and student primary key values, you could be downloading tens of thousands of duplicated rows only to have to de-duplicate them all again. EF tends to select the school, then school join class, then school join class join students, then school join class join teachers, if that's how you coded your school-include-class-ThenInclude-students-then-include-teachers chain. Changing your include strategy will change the queries that are run.)
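For reference, turning that SQL logging on is a one-liner in both stacks (assuming EF6's Database.Log and EF Core's LogTo, respectively):

```csharp
// EF6: write every generated SQL statement to the debug output window
dbContext.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

// EF Core: configure logging where the context options are built
optionsBuilder.LogTo(message => System.Diagnostics.Debug.WriteLine(message));
```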
Nice question! Let me answer differently, in reverse order, with new info.
2.)
Each Load() will cause a query to the database, according to the documentation (Querying and Finding Entities - 10/23/2016):
A query is executed against the database when:
It is enumerated by a foreach (C#) or For Each (Visual Basic) statement.
It is enumerated by a collection operation such as ToArray, ToDictionary, or ToList.
LINQ operators such as First or Any are specified in the outermost part of the query.
The following methods are called: the Load extension method on a DbSet, DbEntityEntry.Reload, and Database.ExecuteSqlCommand.
People often use eager loading with Include() to let EF optimize as much as possible:
in most cases, EF will combine the joins when generating SQL
// EF 6
using System.Data.Entity;

var storageRequests = dbContext.StorageRequests
    .Include(r => r.PhysicalObjects.Select(p => p.Classification))
    .Include(r => r.Manager)
    .Include(r => r.Facility);
// evaluate "storageRequests" here by LINQ method or foreach
or:
// EF Core
var storageRequests = dbContext.StorageRequests
    .Include(r => r.PhysicalObjects)
        .ThenInclude(p => p.Classification)
    .Include(r => r.Manager)
    .Include(r => r.Facility);
// evaluate "storageRequests" here by LINQ method or foreach
1.)
The only possible way I can imagine is to take the code above and end it with storageRequests.Load().
You could inspect whether it:
generates a single query or multiple queries,
loads the navigation property data along with StorageRequest.
FYI: these query executions are also called network roundtrips in microsoft docs:
Multiple network roundtrips can degrade performance, especially where latency to the database is high (for example, cloud services).
Point of interest:
There is a relatively new option, Single vs. Split Queries (10/03/2019), in EF Core 5.
The default is a single query (the behavior described above). Alternatively, you can request/load the data per table by adding .AsSplitQuery() to your LINQ query, before the evaluation. Split queries increase the number of round trips, but they avoid the duplicated data of large joins, which can help performance.
There is also .AsSingleQuery() if your global choice was:
.UseSqlServer(
    connectionString,
    o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));
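Applied to the storage request query from earlier, the per-query opt-in would look something like this sketch:

```csharp
var storageRequests = dbContext.StorageRequests
    .Include(r => r.PhysicalObjects)
        .ThenInclude(p => p.Classification)
    .Include(r => r.Manager)
    .Include(r => r.Facility)
    .AsSplitQuery()   // one SELECT per included collection instead of one large join
    .ToList();
```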
So I'm rewriting a dated application and trying to structure things the right way. I have a Category > Product > Part > Options basic structure, but there are multiple layers in each and I don't know how to simplify my data structure, and navigate children effectively.
I feel like it's a little complicated for e-commerce, but we have a multitude of product, part, and part options.
So just for kicks I tried to see if I could round up all the data from a top level category all the way down to the swatches for the different part options to see if an entire page could display the entire category/product line at once (I know not to do this in production). The immediate problem I ran into was including all the descendants in my LINQ queries, as there are several that require intermediary objects due to the extra columns in the relational tables. That's necessary, I understand, but it gets messy quickly as this is setup to have a potentially unlimited number of category/subcategory levels. For example:
IQueryable<Category> Categories = Context.CategoryHierarchies
    .Where(w => w.ParentCategoryId == null)
    .Where(w => w.Active == true)
    .OrderBy(o => o.Sort)
    .Select(s => s.Category)
    .Include("ParentCategories")
    .Include("CategoryProducts.Product.ProductParts.Part")
    .Include("SubCategories.Category.CategoryProducts.Product.ProductParts.Part")
    .Include("SubCategories.Category.SubCategories.Category.CategoryProducts.Product.ProductParts.Part");
I didn't use lambdas on the includes to keep things shorter for the paste. Now obviously this could go on even longer, as I didn't get into the part options; from there, I would potentially need three more lines for each level of part option, right? Like:
.Include("SubCategories.Category.CategoryProducts.Product.ProductParts.Part.PartMaterials.Swatch")
All the way down. Yikes. So for my question, when I'm loading my ViewModels into a view, and I want to access categories, products, and potentially parts, is there a better way to do this? I have a view that does a foreach on each level that I can, but it starts getting tedious real fast. Do I just load them all as separate objects in the view model and access them directly, and populated through separate queries? I'm pretty new to this and would really appreciate anyone's suggestions.
I did see the .NET Core .ThenInclude() stuff, which does look helpful, but I wasn't completely sure it would clean things up that much. It's a lot of descending.
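For reference, a lambda-based version of one of those deep string includes would look roughly like this in EF Core (property names copied from the question, and a Categories DbSet is assumed; this is a sketch, not a drop-in replacement for the full query):

```csharp
var categories = Context.Categories
    .Include(c => c.CategoryProducts)
        .ThenInclude(cp => cp.Product)
            .ThenInclude(p => p.ProductParts)
                .ThenInclude(pp => pp.Part)
                    .ThenInclude(part => part.PartMaterials)
                        .ThenInclude(pm => pm.Swatch)
    .ToList();
```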
I am new to LINQ. I started writing this query:
var dProjects = Projects
    .Select(p => new Models.Project {
        ProjectID = p.ProjectID,
        Status = p.Status,
        ExpiresOn = p.ExpiresOn,
        LatestComments = p.ProjectComments
            .OrderByDescending(pc => pc.CreatedOn)
            .Select(pc => pc.Comments)
            .FirstOrDefault(),
        ProjectFileIDs = p.ProjectFiles
            .Select(pf => pf.BinaryFileID)
            .AsQueryable()
    })
    .AsQueryable<Models.Project>();
I already know this query will perform really slowly, because related entities like ProjectComments and ProjectFiles will create nested selects, though it works and gives me the right results.
How can I optimize this query and get the same results? One of my guesses would be to use an inner join, but ProjectComments and ProjectFiles already have a relationship in the database through keys, so I'm not sure what we would achieve by setting up the relationship again.
Basically, I need to know which approach is best here from a performance perspective. One thing to note is that I am sorting ProjectComments and only taking the most recent one. Should I be using a combination of join and group by into? Help will be much appreciated. Thanks.
UPDATED:
Sorry if I wasn't clear enough about what I am trying to do. Basically, on the front end I have a grid which shows a list of projects with the latest project comment and a list of all the files associated with each project, so users can click those links and actually open the documents. The query I have above works and shows the following in the grid:
Project ID (From Project table)
Status (From Project table)
ExpiresOn (From Project table)
LatestComments (latest entry from the ProjectComments table, which has Project ID as a foreign key)
ProjectFileIDs (list of file IDs from the ProjectFiles table, which has Project ID as a foreign key; I am using those file IDs to create links so users can open the files).
So everything is working and I have it all set up, but the query is a little slow. Right now we have very little data (only test data), but once this launches I am expecting a lot of users/data, and thus I want to optimize this query as much as possible before it goes live. So the goal here is simply to optimize. I am pretty sure this is not the best approach, because it will create nested selects.
In Entity Framework, you can drastically improve the performance of the queries by returning the objects back as an object graph instead of a projection. Entity Framework is extremely efficient at optimizing all but the most complex SQL queries, and can take advantage of "eager" loading vs. "lazy" loading (deferring the load of related items from the db until they are actually accessed). This MSDN reference is a good place to start.
As far as your specific query is concerned, you could use this technique something like the following:
var dbProjects = yourContext.Projects
    .Include(p => p.ProjectComments)
    .Include(p => p.ProjectFiles)
    .AsQueryable();
note the .Include() being used to imply Eager Loading.
From the MSDN reference on Loading Related Objects:
Performance Considerations
When you choose a pattern for loading related entities, consider the behavior of each approach with regard to the number and timing of connections made to the data source versus the amount of data returned by and the complexity of using a single query. Eager loading returns all related entities together with the queried entities in a single query. This means that, while there is only one connection made to the data source, a larger amount of data is returned in the initial query. Also, query paths result in a more complex query because of the additional joins that are required in the query that is executed against the data source.
Explicit and lazy loading enables you to postpone the request for related object data until that data is actually needed. This yields a less complex initial query that returns less total data. However, each successive loading of a related object makes a connection to the data source and executes a query. In the case of lazy loading, this connection occurs whenever a navigation property is accessed and the related entity is not already loaded.
Do you get any boost in performance if you add Include statements before the Select?
Example:
var dProjects = Projects
    .Include(p => p.ProjectComments)
    .Include(p => p.ProjectFiles)
Include allows all matching ProjectComments and ProjectFiles to be eagerly loaded. See Loading Related Entities for more details.
Right now I'm working on a pretty complex database. Our object model is designed to be mapped to the database. We're using EF 5 with POCO classes, manually generated.
Everything is working, but there's some complaining about the performances. I've never had performance problems with EF so I'm wondering if this time I just did something terribly wrong, or the problem could reside somewhere else.
The main query may be composed of dynamic parameters. I have several if and switch blocks that are conceptually like this:
if (parameter != null) { query = query.Where(c => c.Field == parameter); }
Also, for some complex And/Or combinations I'm using LinqKit extensions from Albahari.
The query is against a big table of "Orders", containing years and years of data. The average use is a 2 months range filter though.
Now when the main query is composed, it gets paginated with a Skip/Take combination, where the Take is set to 10 elements.
After all this, the IQueryable is sent through layers, reaches the MVC layer where Automapper is employed.
Here, when Automapper starts iterating (and thus the query is really executed) it calls a bunch of navigation properties, which have their own navigation properties and so on. Everything is set to Lazy Loading according to EF recommendations to avoid eager loading if you have more than 3 or 4 distinct entities to include. My scenario is something like this:
Orders (maximum 10)
Many navigation properties under Order
Some of these have other navigation under them (localization entities)
Order details (many order details per order)
Many navigation properties under each Order detail
Some of these have other navigation under them (localization entities)
This easily leads to a total of 300+ queries for a single rendered "page". Each of those queries is very fast, running in a few milliseconds, but still there are 2 main concerns:
The lazy loaded properties are called in sequence and not parallelized, thus taking more time
As a consequence of previous point, there's some dead time between each query, as the database has to receive the sql, run it, return it and so on for each query.
Just to see how it went, I tried the same query with eager loading, and as I predicted it was a total disaster, with a translated SQL of more than 7K lines (yes, seven thousand) and way slower overall.
Now I'm reluctant to think that EF and Linq are not the right choice for this scenario. Some are saying that if they were to write a stored procedure which fetches all the needed data, it would run tens of times faster. I don't believe that to be true, and we would lose the automatic materialization of all related entities.
I thought of some things I could do to improve, like:
Table splitting to reduce the selected columns
Turn off object tracking, as this scenario is read only (have untracked entities)
With all of this said, the main complaint is that the result page (done in MVC 4) renders too slowly, and after a bit of diagnostics it seems all "Server Time" and not "Network Time", taking about from 8 to 12 seconds of server time.
From my experience, this should not be happening. I'm wondering if I'm approaching this query need in the wrong way, or if I have to turn my attention to something else (maybe a badly configured IIS server, or anything else; I'm really clueless). Needless to say, the database has its indexes in order, checked very carefully by our DBA.
So if anyone has any tip, advice, best practice I'm missing about this, or just can tell me that I'm dead wrong in using EF with Lazy Loading for this scenario... you're all welcome.
For a very complex query that brings up tons of hierarchical data, stored procs won't generally help you performance-wise over LINQ/EF if you take the right approach. As you've noted, the two "out of the box" options with EF (lazy and eager loading) don't work well in this scenario. However, there are still several good ways to optimize this:
(1) Rather than reading a bunch of entities into memory and then mapping via automapper, do the "automapping" directly in the query where possible. For example:
var mapped = myOrdersQuery
    .Select(o => new OrderInfo { Order = o, DetailCount = o.Details.Count, ... })
    // by deferring the load until here, we can bring only the information we actually need
    // into memory with a single query
    .ToList();
This approach works really well if you only need a subset of the fields in your complex hierarchy. Also, EF's ability to select hierarchical data makes this much easier than using stored procs if you need to return something more complex than flat tabular data.
(2) Run multiple LINQ queries by hand and assemble the results in memory. For example:
// read with AsNoTracking() since we'll be manually setting associations
var myOrders = myOrdersQuery.AsNoTracking().ToList();
var orderIds = myOrders.Select(o => o.Id);
var myDetails = context.Details.Where(d => orderIds.Contains(d.OrderId)).ToLookup(d => d.OrderId);
// reassemble in memory
myOrders.ForEach(o => o.Details = myDetails[o.Id].ToList());
This works really well when you need all the data and still want to take advantage of as much EF materialization as possible. Note that in most cases a stored proc approach can do no better than this (it's working with raw SQL, so it has to run multiple tabular queries), but it can't reuse logic you've already written in LINQ.
(3) Use Include() to manually control which associations are eager-loaded. This can be combined with #2 to take advantage of EF loading for some associations while giving you the flexibility to manually load others.
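A sketch combining #3 with #2: eager-load the small association and hand-load the heavy one (the Customer navigation property is illustrative; the rest reuses the earlier example's names):

```csharp
// eager-load the cheap association and skip change tracking for a read-only view
var myOrders = myOrdersQuery
    .AsNoTracking()
    .Include(o => o.Customer)
    .ToList();

// load the heavy Details association manually, as in approach #2
var orderIds = myOrders.Select(o => o.Id).ToList();
var myDetails = context.Details
    .AsNoTracking()
    .Where(d => orderIds.Contains(d.OrderId))
    .ToLookup(d => d.OrderId);
myOrders.ForEach(o => o.Details = myDetails[o.Id].ToList());
```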
Try to think of an efficient yet simple sql query to get the data for your views.
Is it even possible?
If not, try to decompose (denormalize) your tables so that fewer joins are required to get the data. Also, are there efficient indexes on the table columns to speed up data retrieval?
If yes, forget EF, write a stored procedure and use it to get the data.
Turning tracking off for selected queries is a must for a read-only scenario. Take a look at my numbers:
http://netpl.blogspot.com/2013/05/yet-another-orm-micro-benchmark-part-23_15.html
As you can see, the difference between the tracking and no-tracking scenarios is significant.
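Turning tracking off is a per-query switch; for the paginated orders query from the question it would look something like this (the filter fields are made up for the sketch):

```csharp
var page = context.Orders
    .AsNoTracking()                 // read-only scenario: skip change-tracking bookkeeping
    .Where(o => o.Date >= from && o.Date < to)
    .OrderByDescending(o => o.Date)
    .Skip(pageIndex * 10)
    .Take(10)
    .ToList();
```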
I would experiment with eager loading but not everywhere (so you don't end up with 7k lines long query) but in selected subqueries.
One point to consider: EF definitely makes development much quicker. However, you must remember that when you're returning lots of data from the DB, EF is using dynamic SQL. This means EF must (1) create the SQL, and (2) SQL Server then needs to create an execution plan. Both happen before the query is run.
When using stored procedures, SQL Server can cache the execution plan (which can be edited for performance), which does make it faster than using EF. BUT... you can always create your stored proc and then execute it from EF. Any complex procedures or queries I would convert to stored procs and then call from EF. Then you can see your performance gain(s) and reevaluate from there.
In some cases, you can use Compiled Queries (MSDN) to improve query performance drastically. The idea is that if you have a common query that runs many times and might generate the same SQL call with different parameters, you compile the query the first time it's run and then reuse it as a delegate, eliminating the overhead of Entity Framework regenerating the SQL for each subsequent call.
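A sketch of what that looks like (CompiledQuery.Compile requires an ObjectContext-derived context; the MyEntities context and the Orders filter are illustrative):

```csharp
static readonly Func<MyEntities, int, IQueryable<Order>> ordersByCustomer =
    CompiledQuery.Compile((MyEntities ctx, int customerId) =>
        ctx.Orders.Where(o => o.CustomerId == customerId));

// the query is translated once; subsequent calls reuse the compiled plan
var orders = ordersByCustomer(context, 42).ToList();
```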