I need to populate some tree hierarchies and traverse them to build up a category menu. Each category can have more than one parent. The problem is how to do this efficiently and avoid the select N+1 problem.
Currently, it is implemented by using two tables / entities:
Category
--------
ID
Title
CategoryLink
---------
ID
CategoryID
ParentID
Ideally, I would use normal object traversal to go through the nodes, i.e. by going through Category.ChildCategories, etc. Can this be done in one SQL statement? And can it be done in NHibernate?
Specify a batch-size on the Category.ChildCategories mapping. That will cause NHibernate to fetch children in batches of the specified size rather than one at a time, which alleviates the N+1 problem.
If you are using .hbm files, you can specify the batch-size like this:
<bag name="ChildCategories" batch-size="30">
or using fluent mapping
HasMany(x => x.ChildCategories).KeyColumn("ParentId").BatchSize(30);
See the NHibernate documentation for more info.
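To illustrate what the batch size buys you, here is a sketch of a recursive menu-building traversal (the entity shape is assumed from the tables above, not taken from the question's code). Without batching, each access to a lazy ChildCategories collection issues its own SELECT; with batch-size="30", NHibernate initializes up to 30 pending collections per round-trip:

```csharp
// Hypothetical entity matching the Category/CategoryLink tables above.
public class Category
{
    public virtual int Id { get; set; }
    public virtual string Title { get; set; }
    public virtual IList<Category> ChildCategories { get; set; }
}

// Depth-first traversal to build a menu. Each ChildCategories access may
// trigger lazy loading; batch-size makes NHibernate load the collections
// in groups instead of one SELECT per category.
void PrintMenu(Category category, int depth = 0)
{
    Console.WriteLine(new string(' ', depth * 2) + category.Title);
    foreach (var child in category.ChildCategories)
        PrintMenu(child, depth + 1);
}
```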
EDIT
OK, I believe I understand your requirements. With the following configuration
HasManyToMany<Category>(x => x.ChildCategories)
.Table("CategoryLink")
.ParentKeyColumn("ParentId")
.ChildKeyColumn("CategoryID")
.BatchSize(100)
.Not.LazyLoad()
.Fetch.Join();
you should be able to get the entire hierarchy in one call using the following line.
var result = session.CreateCriteria(typeof(Category)).List();
For some reason, retrieving a single category like this
var categoryId = 1;
var result = session.Get<Category>(categoryId);
results in one call per level in the hierarchy. I believe this should still significantly reduce the number of calls to the database, but I was not able to get the example above to work with a single call to the database.
This would retrieve all the categories with their children:
var result = session.Query<Category>()
.FetchMany(x => x.ChildCategories)
.ToList();
The problem is determining what the root categories are. You could either use a flag, or map the inverse collection (ParentCategories) and do this:
var root = session.Query<Category>()
.FetchMany(x => x.ChildCategories)
.FetchMany(x => x.ParentCategories)
.ToList()
.Where(x => !x.ParentCategories.Any());
All sorting should be done client-side (i.e. after ToList)
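For example (assuming a Title property on Category, which is not shown in the original mapping), the ordering would be applied in memory only after the query has been materialized:

```csharp
// Materialize first (single round-trip with the fetches), then filter
// and sort client-side; sorting a fetched query on the server can
// interact badly with the duplicated rows the joins produce.
var roots = session.Query<Category>()
    .FetchMany(x => x.ChildCategories)
    .FetchMany(x => x.ParentCategories)
    .ToList()                              // execute the SQL here
    .Where(x => !x.ParentCategories.Any()) // roots have no parents
    .OrderBy(x => x.Title)                 // client-side sort
    .ToList();
```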
Alright, so I too have encountered what I believe is the greatest-n-per-group problem. Although this question has been answered before, I don't think it has been solved well with LINQ yet. I have a table with a few million entries, so queries take a long time. I would like these queries to take less than a second; currently they take anywhere from about 10 seconds to effectively forever.
var query =
from MD in _context.MeasureDevice
where MD.DbdistributorMap.DbcustomerId == 6 // Filter the devices based on customer
select new
{
DbMeasureDeviceId = MD.DbMeasureDeviceId,
    // includes measurements and alarms, which are one-to-many relations
    Measure = _context.Measures.Include(e => e.MeasureAlarms)
.FirstOrDefault(e => e.DbMeasureDeviceId == MD.DbMeasureDeviceId && e.MeasureTimeStamp == _context.Measures
.Where(x => x.DbMeasureDeviceId == MD.DbMeasureDeviceId)
.Max(e=> e.MeasureTimeStamp)),
Address = MD.Dbaddress // includes address 1-1 relation
};
In this query I'm selecting data from four different tables. First, the MeasureDevice table, which is the primary entity I'm after. Second, I want the latest measurement from the Measure table, which should also include alarms from another table if any exist. Lastly, I need the address of the device, which is located in its own table.
There are a few thousand devices, but they have between themselves several thousands of measures which amount to several million rows in the measurement table.
I wonder if anyone knows how to either improve the performance of LINQ queries using EF5, or a better method for solving the greatest-n-per-group problem. I've analyzed the query using SQL Server Management Studio, and most of the time is spent fetching the measurements.
Query generated as requested:
SELECT [w].[DBMeasureDeviceID], [t].[DBMeasureID], [t].[AlarmDBAlarmID], [t].[batteryValue], [t].[DBMeasureDeviceID], [t].[MeasureTimeStamp], [t].[Stand], [t].[Temperature], [t].[c], [a].[DBAddressID], [a].[AmountAvtalenummere],
[a].[DBOwnerID], [a].[Gate], [a].[HouseCharacter], [a].[HouseNumber], [a].[Latitude], [a].[Longitude], [d].[DBDistributorMapID], [m1].[DBMeasureID], [m1].[DBAlarmID], [m1].[AlarmDBAlarmID], [m1].[MeasureDBMeasureID]
FROM [MeasureDevice] AS [w]
INNER JOIN [DistribrutorMap] AS [d] ON [w].[DBDistributorMapID] = [d].[DBDistributorMapID]
LEFT JOIN [Address] AS [a] ON [w].[DBAddressID] = [a].[DBAddressID]
OUTER APPLY (
SELECT TOP(1) [m].[DBMeasureID], [m].[AlarmDBAlarmID], [m].[batteryValue], [m].[DBMeasureDeviceID], [m].[MeasureTimeStamp], [m].[Stand], [m].[Temperature], 1 AS [c]
FROM [Measure] AS [m]
WHERE ([m].[MeasureTimeStamp] = (
SELECT MAX([m0].[MeasureTimeStamp])
FROM [Measure] AS [m0]
WHERE [m0].[DBMeasureDeviceID] = [w].[DBMeasureDeviceID])) AND ([w].[DBMeasureDeviceID] = [m].[DBMeasureDeviceID])
) AS [t]
LEFT JOIN [MeasureAlarm] AS [m1] ON [t].[DBMeasureID] = [m1].[MeasureDBMeasureID]
WHERE [d].[DBCustomerID] = 6
ORDER BY [w].[DBMeasureDeviceID], [d].[DBDistributorMapID], [a].[DBAddressID], [t].[DBMeasureID], [m1].[DBMeasureID], [m1].[DBAlarmID]
Entity Relations
You have navigation properties defined, so it stands to reason that MeasureDevice should have a reference to its Measures:
var query = _context.MeasureDevice
    .Include(md => md.Measures.Select(m => m.MeasureAlarms))
.Where(md => md.DbDistributorMap.DbCustomerId == 6)
.Select(md => new
{
DbMeasureDeviceId = md.DbMeasureDeviceId,
Measure = md.Measures.OrderByDescending(m => m.MeasureTimeStamp).FirstOrDefault(),
Address = md.Address
});
The possible bugbear here is including the MeasureAlarms with the required Measure. AFAIK you cannot put an .Include() within a .Select() (where we might have been tempted to try Measure = md.Measures.Include(m => m.MeasureAlarms)...).
Caveat: it has been quite a while since I have used EF 5 (unless you are referring to EF Core 5). If you are using the (very old) EF5 in your project, I would recommend arguing for an upgrade to EF6, which brought a number of performance and capability improvements over EF5. If you are instead using EF Core 5, the Include statement above would be slightly different:
.Include(md => md.Measures).ThenInclude(m => m.MeasureAlarms)
Rather than returning entities, my go-to advice is to use Projection to select precisely the data we need. That way we don't need to worry about eager or lazy loading. If there are details about the Measure and MeasureAlarms we need:
var query = _context.MeasureDevice
.Where(md => md.DbDistributorMap.DbCustomerId == 6)
.Select(md => new
{
md.DbMeasureDeviceId,
        Address = new
        {
            md.Address.Gate,
            md.Address.HouseNumber
            // any additional needed fields from Address (a 1-1 relation on the device)
        },
        Measure = md.Measures
            .OrderByDescending(m => m.MeasureTimestamp)
            .Select(m => new
            {
                m.MeasureId,
                m.MeasureTimestamp,
                // any additional needed fields from Measure
                Alarms = m.MeasureAlarms.Select(ma => new
                {
                    ma.MeasureAlarmId,
                    ma.Label // etc. Whatever fields needed from Alarm...
                }).ToList()
            })
            .FirstOrDefault()
    });
This example selects anonymous types; alternatively you can define DTOs/ViewModels and leverage libraries like AutoMapper to map the fields to the respective entity values, replacing all of that with something like ProjectTo<LatestMeasureSummaryDTO>(), where AutoMapper has rules to map a MeasureDevice, resolve the latest Measure, and extract the needed fields.
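A minimal sketch of that AutoMapper approach follows; the DTO and its fields are assumptions for illustration, not from the original code:

```csharp
// Hypothetical summary DTO; fields are illustrative only.
public class LatestMeasureSummaryDto
{
    public int DbMeasureDeviceId { get; set; }
    public DateTime? LatestMeasureTimestamp { get; set; }
}

// Mapping rule: resolve the latest measure's timestamp inside the projection,
// so ProjectTo can translate it into the SQL SELECT.
var config = new MapperConfiguration(cfg =>
    cfg.CreateMap<MeasureDevice, LatestMeasureSummaryDto>()
        .ForMember(d => d.LatestMeasureTimestamp,
            opt => opt.MapFrom(src => src.Measures
                .OrderByDescending(m => m.MeasureTimestamp)
                .Select(m => (DateTime?)m.MeasureTimestamp)
                .FirstOrDefault())));

var results = _context.MeasureDevice
    .Where(md => md.DbDistributorMap.DbCustomerId == 6)
    .ProjectTo<LatestMeasureSummaryDto>(config)
    .ToList();
```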
The benefits of projection are: it handles otherwise complex or clumsy eager loading, it builds optimized payloads with only the fields a consumer needs, and it is resilient in a changing system where new relationships don't accidentally introduce lazy-loading performance issues. For example, if Measure currently only has MeasureAlarm to eager load, everything works. But if a new relationship is later added to Measure or MeasureAlarm and a payload containing those entities is serialized, that serialization call will "trip" lazy loading on the new relationship, unless you revisit every query retrieving these entities and add more eager loads, or start disabling lazy loading entirely. Projections remain the same unless and until the fields they need to return actually change.
Beyond that, the next thing to investigate is running the resulting query through an analyzer, such as the execution plan in SQL Server Management Studio, to identify whether the query could benefit from indexing changes.
We are working on a project in which many LINQ queries are not optimized, because when the project was started the developers marked every property on their models as virtual (enabling lazy loading everywhere).
My task is to optimize as many of the queries as possible in order to improve the app's performance.
The problem is that if I use the Include function and delete all the virtual properties from the models, a lot of things stop working, and the number of affected functions is huge.
So I wondered whether there is something resembling an "exclude", to leave out the unnecessary sub-queries in some cases.
(assuming your result set implements IEnumerable)
My first choice would be:
ListMain.Except(ItemsToExclude);
Or I would go with a negated Contains as follows, with a check in between to exclude the records. This may not be the best way out there, but it could work:
!ListMain.Contains(ItemsToExclude)
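As a LINQ-to-objects illustration of both options (in memory; this is not translated to SQL):

```csharp
var listMain = new List<int> { 1, 2, 3, 4, 5 };
var itemsToExclude = new List<int> { 2, 4 };

// Except removes every element that also appears in itemsToExclude.
var filtered = listMain.Except(itemsToExclude).ToList();   // 1, 3, 5

// Equivalent: keep only elements not contained in the exclusion list.
var filtered2 = listMain.Where(x => !itemsToExclude.Contains(x)).ToList();
```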
I don't know if I got the question right, but to avoid loading certain attributes or related objects, you can add a Select() that includes only what's needed.
Example:
A simple ToList() will bring the entire object from the table:
var resultList = await dbContext.ABTests.AsNoTracking().ToListAsync();
This results in the query:
SELECT [a].[Id], [a].[AssignedUsers], [a].[EndDate], [a].[Groups], [a].[Json], [a].[MaxUsers], [a].[Name], [a].[NextGroup], [a].[StartDate]
FROM [ABTests] AS [a]
(which includes all mapped fields of the ABTest object)
To avoid fetching all, you can do as follow:
var resultList = await dbContext.ABTests.AsNoTracking().Select(x => new ABTest
{
Id = x.Id,
Name = x.Name
}).ToListAsync();
(supposing that you only want the fields Id and Name to be loaded)
The resulting SQL Query will be:
SELECT [a].[Id], [a].[Name]
FROM [ABTests] AS [a]
I'm using Entity Framework and building up a LINQ query so that the query executes at the database, to minimize the data coming back. The query has some optional search criteria plus some ordering that is applied every time. I am working with parents and children (the mummy-and-daddy type). The filter I am trying to implement is on the age of the children.
So if I have some data like so...
parent 1
- child[0].Age = 5
- child[1].Age = 10
parent 2
- child[0].Age = 7
- child[1].Age = 23
...and I specify a minimum age of 8, my intended result to display is...
parent 1
- child[1].Age = 10
parent 2
- child[1].Age = 23
...and if I specify a minimum age of 15 I intend to display...
parent 2
- child[1].Age = 23
I can re-create my expected result with this horrible query (which I assume actually issues more than one query):
var parents = context.Parents.AsQueryable();
if (minimumChildAge.HasValue)
{
    parents = parents.Where(parent => parent.Children.Any(child => child.Age >= minimumChildAge.Value));
    foreach (var parent in parents)
    {
        parent.Children = parent.Children.Where(child => child.Age >= minimumChildAge.Value).ToList();
    }
}
parents = parents.OrderBy(x => x.ParentId).Take(50);
So I tried the other method instead...
var query = from parent in context.Parents
select parent;
if (minimumChildAge.HasValue)
query = from parent in query
join child in context.Children
on parent.ParentId equals child.ParentId
where child.Age >= minimumChildAge.Value
select parent;
query = query.OrderBy(x => x.ParentId).Take(50);
When I run this in linqpad the query generated looks good. So my question...
Is this the correct way of doing this? Is there a better way? It seems a bit odd that if I now specified a maximum age, I would be writing the same joins and hoping that Entity Framework works it out. Also, how does this impact lazy loading? I expect only the children which match the criteria to be returned, so when I access parent.Children, does Entity Framework know that it just queried these and that it is working on a filtered collection?
Assuming your context is backed by an Entity Framework database or similar, then yes, your first option is going to issue more than one SQL query. When you begin executing the foreach, it will run a SQL query to get the parents (since you've forced enumeration of the query). Then, each attempt to populate the Children property of a single parent object will make another database call.
The second form should only produce a single SQL query; it will have a ton of redundant data but it will use JOIN statements to bring back all of the parent and child data in a single SQL call, then enumerate through it and populate the data on the client side as needed.
A rule of thumb I tend to follow is: if you have fewer than four nested tables in your query, try to run it all at once. Both SQL Server's and Entity Framework's query planners seem to be very efficient at producing joins at that level.
If you get much beyond that, the SQL queries that EF can produce may get messy, and SQL itself (assuming MSSQL) gets less effective when you have 5+ joins on a single query. There's no hard and fast limit, because it depends on a number of specific factors, but if I find myself needing very deep nesting I tend to break it up into smaller LINQ queries and recombine them client-side.
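As a hedged sketch of that split-and-recombine approach (the entity names here are made up for illustration, not taken from the question):

```csharp
// Two smaller server-side queries instead of one deeply nested join.
var orders = context.Orders
    .Where(o => o.CustomerId == customerId)
    .ToList();

var orderIds = orders.Select(o => o.OrderId).ToList();
var lines = context.OrderLines
    .Where(l => orderIds.Contains(l.OrderId))   // translates to IN (...)
    .ToList();

// Stitch the results together client-side with a lookup.
var linesByOrder = lines.ToLookup(l => l.OrderId);
var combined = orders
    .Select(o => new { Order = o, Lines = linesByOrder[o.OrderId].ToList() })
    .ToList();
```

The trade-off is an extra round-trip per sub-query, but each query stays simple enough for the optimizer, and you avoid the row-duplication blow-up that wide multi-join results produce.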
(Side note: you can reproduce your second query in method syntax easily enough, since that's what the compiler is going to end up doing anyway, by using the Join method, but the syntax for that can get very complex; I typically go with query syntax for anything more complex than a single method call.)
My database structure is this: an OptiUser belongs to multiple UserGroups through the IdentityMap table, which is a matching table (many to many) with some additional properties attached to it. Each UserGroup has multiple OptiDashboards.
I have a GUID string which identifies a particular user (wlid in this code). I want to get an IEnumerable of all of the OptiDashboards for the user identified by wlid.
Which of these two Linq-to-Entities queries is the most efficient? Do they run the same way on the back-end?
Also, can I shorten option 2's Include statements to just .Include("IdentityMaps.UserGroup.OptiDashboards")?
using (OptiEntities db = new OptiEntities())
{
// option 1
IEnumerable<OptiDashboard> dashboards = db.OptiDashboards
.Where(d => d.UserGroups
.Any(u => u.IdentityMaps
.Any(i => i.OptiUser.WinLiveIDToken == wlid)));
// option 2
OptiUser user = db.OptiUsers
.Include("IdentityMaps")
.Include("IdentityMaps.UserGroup")
.Include("IdentityMaps.UserGroup.OptiDashboards")
.Where(r => r.WinLiveIDToken == wlid).FirstOrDefault();
// then I would get the dashboards through user.IdentityMaps.UserGroup.OptiDashboards
// (through foreach loops...)
}
You may be misunderstanding what the Include function actually does. Option 1 is purely query syntax and has no effect on what is returned by Entity Framework. Option 2, with the Include function, instructs Entity Framework to eagerly fetch the related rows from the database when returning the results of the query.
So option 1 will result in some joins, but the "select" part of the query will be restricted to the OptiDashboards table.
Option 2 will result in joins as well, but in this case it will be returning the results from all the included tables, which obviously is going to introduce more of a performance hit. But at the same time, the results will include all the related entities you need, avoiding the [possible] need for more round-trips to the database.
I think the Includes will render as joins and you will be able to access the data from those tables in your user object (eager loading the properties).
The Any query will render as EXISTS and will not load the user object with info from the other tables.
For best performance, if you don't need the additional info, use the Any query.
As has already been pointed out, the first option would almost certainly perform better, simply because it would be retrieving less information. Besides that, I wanted to point out that you could also write the query this way:
var dashboards =
from u in db.OptiUsers where u.WinLiveIDToken == wlid
from im in u.IdentityMaps
from d in im.UserGroup.OptiDashboards
select d;
I would expect the above to perform similarly to the first option, but you may (or may not) prefer the above form.
Long time lurker, first time posting, and newly learning EF4 and MVC3.
I need help making sure I'm using the correct data loading strategy in this case as well as some help finalizing some details of the query. I'm currently using the eager loading approach outlined here for somewhat of a "dashboard" view that requires a small amount of data from about 10 tables (all have FK relationships).
var query = from l in db.Leagues
.Include("Sport")
.Include("LeagueContacts")
.Include("LeagueContacts.User")
.Include("LeagueContacts.User.UserContactDatas")
.Include("LeagueEvents")
.Include("LeagueEvents.Event")
.Include("Seasons")
.Include("Seasons.Divisions")
.Include("Seasons.Divisions.Teams")
.Where(l => l.URLPart.Equals(leagueName))
select (l);
model = (Models.League) query.First();
However, I need to do some additional filtering, sorting, and shaping of the data that I haven't been able to work out. Here are my chief needs/concerns from this point:
Several child objects still need additional filtering but I haven't been able to figure out the syntax or best approach yet. Example: "TOP 3 LeagueEvents.Event WHERE StartDate >= getdate() ORDER BY LeagueEvents.Event.StartDate"
I need to sort some of the fields. Examples: ORDER BY Seasons.StartDate, LeagueEvents.Event.StartDate, and LeagueContacts.User.SortOrder, etc.
I'm already very concerned about the overall size of the SQL generated by this query and the number of joins, and am thinking that I may need a different data loading approach altogether. (Explicit loading? Multiple QueryObjects? POCO?)
Any input, direction, or advice on how to resolve these remaining needs as well as ensuring the best performance is greatly appreciated.
Your concerns about the size of the query and the size of the result set are justified.
As @BrokenGlass mentioned, EF doesn't allow filtering or ordering on includes. If you want to order or filter relations, you must use projection, either to an anonymous type or to a custom (non-mapped) type:
var query = db.Leagues
.Where(l => l.URLPart.Equals(leagueName))
.Select(l => new
{
League = l,
Events = l.LeagueEvents.Where(...)
.OrderBy(...)
.Take(3)
.Select(e => e.Event)
...
});
Unfortunately, EF doesn't allow you to selectively load related entities via its navigation properties; it will always load all Foos if you specify Include("Foo").
You will have to do a join on each of the related entities, using your Where() clauses as filters where they apply.
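For example, instead of an Include (whose collection EF will not filter), the "TOP 3 upcoming events" requirement from the question can be projected explicitly; this sketch reuses the league names from the question above:

```csharp
// Project the filtered relation instead of relying on Include;
// EF translates Where/OrderBy/Take inside the Select into SQL.
var result = db.Leagues
    .Where(l => l.URLPart.Equals(leagueName))
    .Select(l => new
    {
        League = l,
        UpcomingEvents = l.LeagueEvents
            .Select(le => le.Event)
            .Where(e => e.StartDate >= DateTime.Now) // filter server-side
            .OrderBy(e => e.StartDate)
            .Take(3)                                 // TOP 3
    })
    .First();
```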