So I'm rewriting a dated application and trying to structure things the right way. I have a basic Category > Product > Part > Options structure, but there are multiple layers at each level, and I don't know how to simplify the data structure or navigate children effectively.
I feel like it's a little complicated for e-commerce, but we have a multitude of products, parts, and part options.
So just for kicks, I tried to see if I could round up all the data from a top-level category all the way down to the swatches for the different part options, to see whether an entire page could display the whole category/product line at once (I know not to do this in production). The immediate problem I ran into was including all the descendants in my LINQ queries, since several of them require intermediary objects due to the extra columns in the relational tables. That's necessary, I understand, but it gets messy quickly because this is set up to allow a potentially unlimited number of category/subcategory levels. For example:
IQueryable<Category> Categories = Context.CategoryHierarchies
.Where(w => w.ParentCategoryId == null)
.Where(w => w.Active == true)
.OrderBy(o => o.Sort)
.Select(s => s.Category)
.Include("ParentCategories")
.Include("CategoryProducts.Product.ProductParts.Part")
.Include("SubCategories.Category.CategoryProducts.Product.ProductParts.Part")
.Include("SubCategories.Category.SubCategories.Category.CategoryProducts.Product.ProductParts.Part")
I didn't use lambdas on the includes, to keep things shorter for the paste. Now obviously this could go on even longer, as I didn't get into the part options, but from there I would potentially need 3 more lines for each level of part option, right? Like:
.Include("SubCategories.Category.CategoryProducts.Product.ProductParts.Part.PartMaterials.Swatch")
All the way down. Yikes. So, for my question: when I'm loading my ViewModels into a view, and I want to access categories, products, and potentially parts, is there a better way to do this? I have a view that does a foreach on each level that I can, but it starts getting tedious real fast. Do I just load them all as separate objects in the view model and access them directly, each populated through a separate query? I'm pretty new to this and would really appreciate anyone's suggestions.
I did see the .NET Core .ThenInclude() stuff, which does look helpful, but I wasn't completely sure it would clean things up that much. It's a lot of descending.
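For comparison, a lambda-based version of the first include chain might look roughly like this. This is only a sketch: it assumes EF Core's Include/ThenInclude syntax, and the DbSet and navigation property names (Categories, CategoryProducts, ProductParts, etc.) are guessed from the strings above.

```csharp
// Sketch only: hypothetical EF Core equivalent of one string-based include chain.
// Each nesting level still needs its own ThenInclude, so this does not remove
// the "one chain per level" problem, but typos become compile errors.
var categories = Context.Categories
    .Where(c => c.Active)
    .OrderBy(c => c.Sort)
    .Include(c => c.CategoryProducts)
        .ThenInclude(cp => cp.Product)
            .ThenInclude(p => p.ProductParts)
                .ThenInclude(pp => pp.Part)
    .ToList();
```

Note that each additional subcategory level would still need its own `Include(...)` chain starting again from the root, just as with the string version.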
Related
I am new to working with Entity Framework, and cannot seem to find any information on how to solve this, or whether it is even possible. I am writing code-first and trying to get an object from the db.
My structure is that it is possible to make a mainThing. Each mainThing can contain multiple thing1s, each with multiple thing2s. However, each thing1 can also contain a new thing1, and so on.
Now to the issue, when I am getting this mainThing from the db, I do not know how many levels of thing1s to go into.
My query is like this:
var copy = _db.Mainthing
.Include(x=>x.thing1).ThenInclude(x => x.thing1).ThenInclude(x =>x.thing1).XXX.ThenInclude(x =>x.thing2)
.AsSplitQuery()
.AsNoTrackingWithIdentityResolution()
.FirstOrDefault(x => x.Id == memainthingId);
The XXX marks where I do not know how many levels to go down, because there is no way of knowing how many levels of thing1s there are. Is there a way to find out how many levels it is possible to go down, or to get them dynamically?
We have the following method that allows us to search a table of Projects for a DataGrid:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
var projects = _context.Projects.Where(p => p.Current);
projects = projects.Include(p => p.Client);
projects = projects.Include(p => p.Architect);
projects = projects.Include(p => p.ProjectManager);
if (!string.IsNullOrEmpty(searchString))
{
projects = projects
.Where(p => p.NormalizedFullProjectName.Contains(searchString)
|| p.Client.NormalizedName.Contains(searchString)
|| p.Architect.NormalizedFullName.Contains(searchString)
|| p.ProjectManager.NormalizedFullName.Contains(searchString));
}
projects = projects.OrderBy(p => p.Name).Take(10);
return await projects.ToListAsync();
}
If we do not use the Includes on the projects, then the searching is instantaneous. However, after adding them, the search can take over 3 seconds.
We need to include the other Entities to allow the Users to search on them should they want to.
How are we able to improve performance but still keep the Include to allow searching on them?
Without Include, the method looks like this:
public async Task<IEnumerable<Project>> GetFilteredProjects(string searchString)
{
var projects = _context.Projects.Where(p => p.Current);
if (!string.IsNullOrEmpty(searchString))
{
projects = projects
.Where(p => p.Name.Contains(searchString));
}
projects = projects.OrderBy(p => p.Name).Take(10);
return await projects.ToListAsync();
}
The short answer is that including all the extra entities takes time and effort, thus increasing the load times.
However, there is a flaw in your assumption:
We need to include the other Entities to allow the Users to search on them should they want to.
That is not (necessarily) correct. Filtering happens at the database level; Include tells Entity Framework to load the related records from the database. These are two separate things.
Look at the following examples:
_context.Projects
.Include(p => p.Architect)
.Where(p => p.Architect.Name == "Bob")
.ToList()
This will give you a list of projects (and their architects) that have an architect named Bob.
_context.Projects
.Where(p => p.Architect.Name == "Bob")
.ToList()
This will give you a list of projects (without architects) that have an architect named Bob; it does not actually load the Architect objects into memory.
_context.Projects
.Include(p => p.Architect)
.ToList()
This will give you a list of projects (and their architects). It will contain every project; the list is not filtered.
You only need to use Include when you want the related entities in memory, for example to do in-memory filtering on a collection that was already loaded from the database.
Whether that is the case for you depends on this part:
projects = projects
.Where(p => p.NormalizedFullProjectName.Contains(searchString)
|| p.Client.NormalizedName.Contains(searchString)
|| p.Architect.NormalizedFullName.Contains(searchString)
|| p.ProjectManager.NormalizedFullName.Contains(searchString));
If NormalizedFullProjectName (and the other properties) are database columns, then EF is able to perform the filtering at the database level. You do not need the Include for filtering the items.
If NormalizedFullProjectName (and the other properties) are not database columns, then EF will first have to load the items in memory before it can apply the filter. In this case, you do need the Include, because the architects (and others) need to be loaded in memory.
If you are only loading the related entities for filtering purposes (not display purposes), and you are doing the filtering on the database level; then you can simply remove the include statements.
If you need those related entities to be loaded (for in-memory filtering, or for display purposes), then you can't easily remove the Include statements, unless you write a Select that specifies the fields you need.
For example:
_context.Projects
.Select(p => new { Project = p, ArchitectName = p.Architect.Name })
.ToList()
This will load the project entities (in their entirety) but only the name of the architect and none of the other properties. This can be a significant performance boost if your related entities have many properties that you currently do not need.
Note: The current example uses an anonymous type. I generally advocate creating a custom type for this; but that's unrelated to the performance issue we're addressing here.
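As a sketch of what that custom type could look like (the DTO name and property names here are hypothetical, not from the original code):

```csharp
// Hypothetical DTO replacing the anonymous type above
public class ProjectListItem
{
    public Project Project { get; set; }
    public string ArchitectName { get; set; }
}

// Same projection as before, but with a named type that can cross method boundaries
var items = _context.Projects
    .Select(p => new ProjectListItem
    {
        Project = p,
        ArchitectName = p.Architect.Name // only this column of Architect is fetched
    })
    .ToList();
```

The generated SQL is the same as with the anonymous type; the named type simply lets you return the result from a repository method or pass it to a view.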
Update
Based on your update, you seemingly imply that the intended filtering happens after the objects have been loaded from the database.
This is the source of your performance problems. You are fetching a lot of data but only show part of it. The data that does not get shown still needs to be loaded, which is wasted effort.
There are separate arguments for performance here:
Load everything once - Load all the data once (which might take a long time), but then allow the user to filter the loaded data (which is very fast)
Load chunks - Only load the data that matches the applied filters. If the user changes the filters, you load the data again. The first load will be much faster, but the subsequent filtering actions will take longer compared to in-memory filtering.
What you should do here is not my decision. It's a matter of priorities. Some customers prefer one over the other. I would say that in most cases, the second option (loading chunks) is the better option here, as it prevents needlessly loading a massive dataset if the user never looks through 90% of it. That's a waste of performance and network load.
The answer I gave applies to the "load chunks" approach.
If you decide to take the "load everything once" approach, then you will have to accept the performance hit of that initial load. The best you can do is severely limit the returned data columns (like I showed with the Select) in order to minimize the performance/network cost.
I see no reasonable argument to mix these two approaches. You'll end up with both drawbacks.
I have an object that I want to eagerly load, where I want to eagerly load several parent elements, but also some grandparent elements. I've set up my select like so:
var events = (from ed in eventRepo._session.Query<EventData>() where idsAsList.Contains(ed.Id) select ed)
.FetchMany(ed => ed.Parents)
.ThenFetchMany(pa => pa.Grandparents)
.ThenFetch(gp => gp.GreatGrandparents)
// other fetches here for other attributes
.ToList();
My problem is that if I just .FetchMany the parents, I get the right number of elements. Once I add the grandparents, I get way too many, and that grows even more with great grandparents.
It's clearly doing some kind of cartesian product, so I had a look around and saw that some people use Transformers to solve this. I had a look at that and tried to implement it, but adding a .TransformUsing() causes a compiler error, since I don't seem to be able to call .TransformUsing() on this type of call.
What is the right way to get the right number of elements from such a call, without duplicates due to computing the cartesian product?
Here is a pretty popular post that uses Futures to do this type of loading to avoid Cartesian products. It isn't as elegant as doing it in a single query, but it gets the job done.
Fighting cartesian product (x-join) when using NHibernate 3.0.0
One other possible solution would be to define your collections as sets instead of bags. This would also avoid Cartesian product issues. I don't really like this solution, considering you have to use an NHibernate-specific collection type, but it is known to work.
There is not much you can do about it if you force NHibernate to join explicitly. The database will return the same entities multiple times (this is perfectly normal, since your query produces a Cartesian product), and NHibernate cannot distinguish whether you asked for the same item multiple times or not; it does not know your intention. There is a workaround, though. You can add the line below:
var eventsSet = new HashSet<Events>(events);
Assuming your entity overrides Equals and GetHashCode, you will end up with unique events.
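A minimal sketch of such an override, assuming the database Id alone identifies an event (if unsaved entities with Id == 0 can occur, the comparison needs refining):

```csharp
public class Events
{
    public virtual int Id { get; set; }

    // Identity is based solely on the persistent Id, so two objects
    // materialized from the same row compare as equal.
    public override bool Equals(object obj) =>
        obj is Events other && other.Id == Id;

    public override int GetHashCode() => Id.GetHashCode();
}
```

With this in place, `new HashSet<Events>(events)` collapses the duplicate rows produced by the Cartesian join into one object per event.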
Right now I'm working on a pretty complex database. Our object model is designed to be mapped to the database. We're using EF 5 with POCO classes, manually generated.
Everything is working, but there's some complaining about performance. I've never had performance problems with EF, so I'm wondering if this time I just did something terribly wrong, or if the problem lies somewhere else.
The main query may be composed of dynamic parameters. I have several if and switch blocks that are conceptually like this:
if (parameter != null) { query = query.Where(c => c.Field == parameter); }
Also, for some complex And/Or combinations I'm using LinqKit extensions from Albahari.
The query is against a big table of "Orders", containing years and years of data. The average use is a 2-month range filter, though.
Now when the main query is composed, it gets paginated with a Skip/Take combination, where the Take is set to 10 elements.
After all this, the IQueryable is sent through the layers and reaches the MVC layer, where AutoMapper is employed.
Here, when AutoMapper starts iterating (and thus the query is really executed), it calls a bunch of navigation properties, which have their own navigation properties, and so on. Everything is set to Lazy Loading, following EF recommendations to avoid eager loading when you have more than 3 or 4 distinct entities to include. My scenario is something like this:
Orders (maximum 10)
Many navigation properties under Order
Some of these have other navigation under them (localization entities)
Order details (many order details per order)
Many navigation properties under each Order detail
Some of these have other navigation under them (localization entities)
This easily leads to a total of 300+ queries for a single rendered "page". Each of those queries is very fast, running in a few milliseconds, but there are still 2 main concerns:
The lazy-loaded properties are called in sequence, not in parallel, thus taking more time
As a consequence of the previous point, there's some dead time between each query, as the database has to receive the SQL, run it, and return the results, and so on for each query
Just to see how it went, I tried the same query with eager loading, and as I predicted it was a total disaster, with a translated SQL of more than 7K lines (yes, seven thousand) and far slower overall.
Now I'm reluctant to conclude that EF and LINQ are not the right choice for this scenario. Some are saying that if they were to write a stored procedure which fetches all the needed data, it would run tens of times faster. I don't believe that to be true, and we would lose the automatic materialization of all related entities.
I thought of some things I could do to improve, like:
Table splitting to reduce the selected columns
Turn off object tracking, as this scenario is read only (have untracked entities)
With all of this said, the main complaint is that the result page (done in MVC 4) renders too slowly, and after a bit of diagnostics it seems to be all "Server Time" and not "Network Time", taking about 8 to 12 seconds of server time.
From my experience, this should not be happening. I'm wondering if I'm approaching this query need in the wrong way, or if I have to turn my attention to something else (maybe a badly configured IIS server, or anything else; I'm really clueless). Needless to say, the database has its indexes in order, checked very carefully by our DBA.
So if anyone has any tip, advice, best practice I'm missing about this, or just can tell me that I'm dead wrong in using EF with Lazy Loading for this scenario... you're all welcome.
For a very complex query that brings up tons of hierarchical data, stored procs won't generally help you performance-wise over LINQ/EF if you take the right approach. As you've noted, the two "out of the box" options with EF (lazy and eager loading) don't work well in this scenario. However, there are still several good ways to optimize this:
(1) Rather than reading a bunch of entities into memory and then mapping via automapper, do the "automapping" directly in the query where possible. For example:
var mapped = myOrdersQuery.Select(o => new OrderInfo { Order = o, DetailCount = o.Details.Count, ... })
// by deferring the load until here, we can bring only the information we actually need
// into memory with a single query
.ToList();
This approach works really well if you only need a subset of the fields in your complex hierarchy. Also, EF's ability to select hierarchical data makes this much easier than using stored procs if you need to return something more complex than flat tabular data.
(2) Run multiple LINQ queries by hand and assemble the results in memory. For example:
// read with AsNoTracking() since we'll be manually setting associations
var myOrders = myOrdersQuery.AsNoTracking().ToList();
var orderIds = myOrders.Select(o => o.Id);
var myDetails = context.Details.Where(d => orderIds.Contains(d.OrderId)).ToLookup(d => d.OrderId);
// reassemble in memory
myOrders.ForEach(o => o.Details = myDetails[o.Id].ToList());
This works really well when you need all the data and still want to take advantage of as much EF materialization as possible. Note that in most cases a stored proc approach can do no better than this (it's working with raw SQL, so it has to run multiple tabular queries), but it can't reuse logic you've already written in LINQ.
(3) Use Include() to manually control which associations are eager-loaded. This can be combined with #2 to take advantage of EF loading for some associations while giving you the flexibility to manually load others.
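A rough sketch of combining #3 with #2, reusing the names from the example above (the Customer navigation is hypothetical):

```csharp
// Let EF eager-load one small, cheap association via Include,
// while the large Details collection is stitched in by hand as in #2.
var myOrders = myOrdersQuery
    .AsNoTracking()
    .Include(o => o.Customer) // hypothetical navigation; one extra join, no row explosion
    .ToList();

var orderIds = myOrders.Select(o => o.Id).ToList();
var myDetails = context.Details
    .Where(d => orderIds.Contains(d.OrderId))
    .ToLookup(d => d.OrderId);

// reassemble in memory
myOrders.ForEach(o => o.Details = myDetails[o.Id].ToList());
```

The rule of thumb being sketched here: Include the one-to-one or many-to-one associations that join cheaply, and load the wide one-to-many collections separately to avoid multiplying rows.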
Try to think of an efficient yet simple SQL query to get the data for your views.
Is it even possible?
If not, try to decompose (denormalize) your tables so that fewer joins are required to get the data. Also, are there efficient indexes on the table columns to speed up data retrieval?
If yes, forget EF, write a stored procedure, and use it to get the data.
Turning tracking off for selected queries is a must for a read-only scenario. Take a look at my numbers:
http://netpl.blogspot.com/2013/05/yet-another-orm-micro-benchmark-part-23_15.html
As you can see, the difference between tracking and notracking scenario is significant.
I would experiment with eager loading, but not everywhere (so you don't end up with a 7k-line query), only in selected subqueries.
One point to consider: EF definitely helps make development time much quicker. However, you must remember that when you're returning lots of data from the DB, EF is using dynamic SQL. This means that 1. EF must create the SQL, and 2. SQL Server then needs to create an execution plan. This happens before the query is run.
When using stored procedures, SQL Server can cache the execution plan (which can be edited for performance), which does make it faster than using EF. BUT... you can always create your stored proc and then execute it from EF. Any complex procedures or queries I would convert to stored procs and then call from EF. Then you can see your performance gain(s) and reevaluate from there.
In some cases, you can use Compiled Queries (MSDN) to improve query performance drastically. The idea is that if you have a common query that is run many times and might generate the same SQL call with different parameters, you compile the query the first time it's run, then pass it around as a delegate, eliminating the overhead of Entity Framework re-generating the SQL for each subsequent call.
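A sketch of what that looks like with classic EF's CompiledQuery (this assumes an ObjectContext-derived context; the context, entity, and property names here are hypothetical):

```csharp
// Compile once into a static delegate, then reuse it for every call.
// MyEntities (ObjectContext-derived), Order, and CustomerName are assumed names.
static readonly Func<MyEntities, string, IQueryable<Order>> OrdersByCustomer =
    CompiledQuery.Compile<MyEntities, string, IQueryable<Order>>(
        (ctx, customer) => ctx.Orders.Where(o => o.CustomerName == customer));

// Usage: the LINQ-to-SQL translation happens only on the first invocation.
using (var ctx = new MyEntities())
{
    var orders = OrdersByCustomer(ctx, "Contoso").ToList();
}
```

Note that this applies to ObjectContext-based models; DbContext-based EF 5+ caches translated queries automatically, so the manual compile step matters most on older API surfaces.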
I'm creating an auction application of sorts, and I have to decide on the most efficient approach to this problem. I'm using BL Toolkit as my OR mapper (it has nice LINQ support) and ASP.NET MVC 2.
Background
I've got multiple Category objects that are created dynamically and that are saved in my database as a representation of this class:
class Category
{
public int Id { get; set; }
public int ParentId { get; set; }
public string Name { get; set; }
}
Every Category object can have multiple associated InformationClass objects, each representing a single piece of information in that category, for example a price or a colour. Those classes are also dynamically created by the administrator and stored in the database. They are specific to a group of categories. The class that represents one looks like the following:
class InformationClass
{
public int Id { get; set; }
public InformationDataType InformationDataType { get; set; }
public string Name { get; set; }
public string Label { get; set; }
}
Now I've got a third table that represents the join between them, like this:
class CategoryInformation
{
public int InformationClassId { get; set; }
public int AuctionCategoryId { get; set; }
}
Problem
Now the problem is that child categories need to inherit all of a category's InformationClass objects. For example, every product will have a price, so I need to add that InformationClass only to my root category. The frequency information can be added to the base CPU category, and it should be available in the AMD and Intel categories that derive from the CPU category.
I very often have to know which InformationClass objects are related to a specified Category in my application.
So here is my question: what would be the most efficient solution to this problem? I've got some ideas, but I can't decide.
Load all categories from the database into an application-level table and take them from there every time. As long as the categories do not change too often, this will reduce the number of database requests, but it will still require a tree search using LINQ-to-Objects.
Invent (I don't know if it's possible) some fancy LINQ query that can do the tree search and get all the InformationClass ids without stressing the database too much.
Some other nice ideas?
I will be grateful for any answers and ideas. Thank you all in advance.
Sounds like a case for an idea I once had which I blogged about:
Tree structures and DAGs in SQL with efficient querying using transitive closures
The basic idea is this: In addition to the Category table, you also have a CategoryTC table which contains the transitive closure of the parent-child relationship. It allows you to quickly and efficiently retrieve a list of all ancestor or descendant categories of a particular category. The blog post explains how you can keep the transitive closure up-to-date every time a new category is created, deleted, or a parent-child relationship changed (it’s at most two queries each time).
The post uses SQL to express the idea, but I’m sure you can translate it to LINQ.
You didn’t specify in your question how the InformationClass table is linked to the Category table, so I have to assume that you have a CategoryInformation table that looks something like this:
class CategoryInformation
{
public int CategoryId { get; set; }
public int InformationClassId { get; set; }
}
Then you can get all the InformationClasses associated with a specific category by using something like this:
var categoryId = ...;
var infoClasses = db.CategoryInformation
.Where(cinf => db.CategoryTC.Where(tc => tc.Descendant == categoryId)
.Any(tc => tc.Ancestor == cinf.CategoryId))
.Select(cinf => db.InformationClass
.FirstOrDefault(ic => ic.Id == cinf.InformationClassId));
Does this make sense? Any questions, please ask.
In the past (pre-SQL Server 2005 and pre-LINQ), when dealing with this sort of structure (or the more general case of a directed acyclic graph, implemented with a junction table so that items can have more than one "parent"), I've either done this by loading the entire graph into memory, or by creating a trigger-updated lookup table in the database that cached the relationship of ancestor to descendant.
There are advantages to either, and which wins out depends on update frequency and on the complexity of the objects outside of the parent-child relationship itself. In general, loading into memory allows for faster individual look-ups, but with a large graph it doesn't natively scale as well, due to the amount of memory used on each webserver ("each" here, because webfarm situations are one where having items cached in memory brings extra issues), meaning that you will have to be very careful about how things are kept in sync to counteract that effect.
A third option now available is to do ancestor lookup with a recursive CTE:
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
SELECT parentID, id
FROM Categories
WHERE parentID IS NOT NULL
UNION ALL
SELECT ancestorID, id
FROM recurseCategoryParentage
INNER JOIN Categories ON parentID = descendantID
)
SELECT DISTINCT ancestorID, descendantID
FROM recurseCategoryParentage
Assuming that root categories are indicated by having a null parentID.
(We use UNION ALL since we're going to SELECT DISTINCT afterwards anyway, and this way we have a single DISTINCT operation rather than repeating it).
This allows us to take the look-up table approach without the redundancy of that denormalised table. The efficiency trade-off is obviously different, and generally poorer than with a table, but not by much (a slight hit on select, a slight gain on insert and delete, a negligible space gain), while the guarantee of correctness is greater.
I've ignored the question of where LINQ fits into this, as the trade-offs are much the same whichever way this is queried. LINQ plays nicer with "tables" that have individual primary keys, so we can change the select clause to SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID and define that as the primary key in the [Column] attribute. Of course, all columns should be indicated as DB-generated.
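That mapping might be sketched like so with LINQ-to-SQL attributes (the class name is hypothetical; the view and column names come from the SQL above):

```csharp
using System.Data.Linq.Mapping;

// Sketch: mapping the ancestry view with the synthetic primary key column
[Table(Name = "vwCategoryAncestry")]
public class CategoryAncestry
{
    // id = cast(ancestorID as bigint) * 0x100000000 + descendantID
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long Id { get; set; }

    [Column(Name = "ancestorID", IsDbGenerated = true)]
    public int AncestorID { get; set; }

    [Column(Name = "descendantID", IsDbGenerated = true)]
    public int DescendantID { get; set; }
}
```

With this in place, the view can be queried like any other table, e.g. filtering on DescendantID to walk up to all ancestors of a category.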
Edit. Some more on the trade-offs involved.
Comparing the CTE approach with look-up maintained in database:
Pro CTE:
The CTE code is simple, the above view is all the extra DB code you need, and the C# needed is identical.
The DB code is all in one place, rather than there being both a table and a trigger on a different table.
Inserts and deletes are faster; this doesn't affect them, while the trigger does.
While semantically recursive, it is so in a way the query planner understands and can deal with, so it's typically (for any depth) implemented in just two index scans (likely clustered), two light-weight spools, a concatenation, and a distinct sort, rather than in the many, many scans that you might imagine. So while certainly a heavier scan than a simple table lookup, it's nowhere near as bad as one might imagine at first. Indeed, even the nature of those two index scans (same table, different rows) makes it less expensive than you might think when reading that.
It is very very easy to replace this with the table look-up if later experience proves that to be the way to go.
A lookup table will, by its very nature, denormalise the database. Purity issues aside, the "bad smell" involved means that it will have to be explained and justified to any new dev, since until then it may simply "look wrong", and their instincts will send them on a wild-goose chase trying to remove it.
Pro Lookup-Table:
While the CTE is faster to select from than one might imagine, the lookup is still faster, especially when used as part of a more complicated query.
While CTEs (and the WITH keyword used to create them) are part of the SQL 99 standard, they are relatively new and some devs don't know them (though I think this particular CTE is so straightforward to read that it counts as a good learning example anyway, so maybe this is actually pro CTE!)
While CTEs are part of the SQL 99 standard, they aren't implemented by some SQL databases, including older versions of SQL Server (which are still in live use), which may affect any porting efforts. (They are supported by Oracle and Postgres, among others, so at this point this may not really be an issue.)
It's reasonably easy to replace this with the CTE version later, if later experience suggests you should.
Comparing (both) the db-heavy options with in-memory caching.
Pro In-Memory:
Unless your implementation really sucks, it is going to be much faster than DB lookups.
It makes some secondary optimisations possible on the back of this change.
It is reasonably difficult to change from DB to in-memory if later profiling shows that in-memory is the way to go.
Pro Querying DB:
Start-up time can be very slow with in-memory.
Changes to the data are much, much simpler. Most of the following points are aspects of this. Really, if you go the in-memory route, then the question of how to handle changes invalidating the cached information becomes a whole new ongoing concern for the lifetime of the project, and not a trivial one at all.
If you use in-memory, you are probably going to have to use this in-memory store even for operations where it is not relevant, which may complicate where it fits with the rest of your data-access code.
It is not necessary to track changes and cache freshness.
It is not necessary to ensure that every webserver in a web-farm and/or web-garden solution (a certain level of success will necessitate this) has precisely the same degree of freshness.
Similarly, the degree of scalability across machines (how close to 100% extra performance you get by doubling the number of webservers and DB slaves) is higher.
With in-memory, memory use can become very high if either (a) the number of objects is high or (b) the objects are large (many fields, especially strings, collections, and objects which themselves have a string or collection). Possibly "we need a bigger webserver" amounts of memory, and that goes for every machine in the farm.
That heavy memory use is particularly likely to continue to grow as the project evolves.
Unless changes cause an immediate refresh of the in-memory store, the in-memory solution will mean that the view used by the people in charge of administrating these categories will differ from what is seen by customers, until they are re-synchronised.
In-memory resyncing can be very expensive. Unless you're very clever with it, it can cause random (to the user) massive performance spikes. If you are clever with it, it can exacerbate the other issues (especially in terms of keeping different machines at an equivalent level of freshness).
Unless you're clever with in-memory, those spikes can accumulate, putting the machine into a long-term hang. If you are clever with avoiding this, you may exacerbate other issues.
It is very difficult to move from in-memory to hitting the db should that prove the way to go.
None of this leans with 100% certainty to one solution or the other, and I certainly am not going to give a clear answer, as doing so would be premature optimisation. What you can do a priori is make a reasonable decision about which is likely to be the optimal solution. Whichever you go for, you should profile afterwards, especially if the code does turn out to be a bottleneck, and possibly change approach. You should also do so over the lifetime of the product, as both changes to the code (fixes and new features) and changes to the dataset can certainly change which option is optimal (indeed, it can change from one to another and then back to the previous one over the course of the lifetime). This is why I included considerations of the ease of moving from one approach to another in the above lists of pros and cons.