I am running into an interesting performance issue with Entity Framework. I am using Code First.
Here is the structure of my entities:
A Book can have many Reviews.
A Review is associated with a single Book.
A Review can have one or many Comments.
A Comment is associated with one Review.
public class Book
{
public int BookId { get; set; }
// ...
public ICollection<Review> Reviews { get; set; }
}
public class Review
{
public int ReviewId { get; set; }
public int BookId { get; set; }
public Book Book { get; set; }
public ICollection<Comment> Comments { get; set; }
}
public class Comment
{
public int CommentId { get; set; }
public int ReviewId { get; set; }
public Review Review { get; set; }
}
I populated my database with a lot of data and added the proper indexes. I am trying to retrieve a single book that has 10,000 reviews on it using this query:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.FirstOrDefault();
This particular book has 10,000 reviews, and the query takes around 4 seconds. Running the exact same SQL (captured via SQL Profiler) directly returns almost instantly, and using that same query with a SqlDataAdapter and custom objects retrieves the data in under 500 milliseconds.
Using ANTS Performance Profiler, it looks like the bulk of the time is being spent on one thing:
The Equals method is being called 50 million times.
Does anyone know why it would need to call this 50 million times, and how I could improve the performance?
Why is Equals called 50M times?
That sounds quite suspicious: you have 10,000 reviews and 50,000,000 calls to Equals. Suppose this is caused by the identity map EF implements internally. The identity map ensures that each entity with a unique key is tracked by the context only once: if the context already has an instance with the same key as a record loaded from the database, it will not materialize a new instance but will use the existing one instead. Now, how could that produce those numbers? My terrifying guess:
=============================================
1st record read | 0 comparisons
2nd record read | 1 comparison
3rd record read | 2 comparisons
...
10.000th record read | 9.999 comparisons
That means that each new record is compared with every record already in the identity map. Summing all those comparisons is a simple arithmetic series:
a(n) = a(n-1) + 1
Sum(n) = (n / 2) * (a(1) + a(n))
Sum(10,000) = 5,000 * (0 + 9,999) = 49,995,000 ≈ 50,000,000
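The series is easy to verify with a few lines of plain C# (a standalone sanity check, not EF code):

```csharp
// Record i (0-based) is compared against the i records already in the
// identity map, so the total is 0 + 1 + ... + 9,999.
long comparisons = 0;
for (int i = 0; i < 10_000; i++)
    comparisons += i;
System.Console.WriteLine(comparisons); // prints 49995000, roughly the 50M calls observed
```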
I hope I didn't make a mistake in my assumptions or calculation. Wait, actually I hope I did, because this doesn't look good.
Try turning off change tracking, which should also turn off the identity-map checks.
It can be tricky. Start with:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.AsNoTracking()
.FirstOrDefault();
But there is a big chance that your navigation property will not be populated (fixing up navigation properties is handled by change tracking). In that case, use this approach:
var book = db.Books.Where(b => b.BookId == id).AsNoTracking().FirstOrDefault();
book.Reviews = db.Reviews.Where(r => r.BookId == id).AsNoTracking().ToList();
Anyway, can you see what object type is passed to Equals? It should be comparing only primary keys, and even 50M integer comparisons should not be such a problem.
As a side note, EF is slow; that is a well-known fact. It also uses reflection internally when materializing entities, so simply materializing 10,000 records can take some time. Unless you have already done so, you can also turn off dynamic proxy creation (db.Configuration.ProxyCreationEnabled = false).
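Putting those tweaks together, a read-only query could look like this. This is only a sketch: BookContext is a stand-in name for the question's (unnamed) context, and the EF6 DbContext.Configuration flags are assumed to be available:

```csharp
using System.Data.Entity; // needed for the lambda-based Include
using System.Linq;

// Sketch: read-path tuning for a one-off query (EF6)
using (var db = new BookContext())
{
    db.Configuration.ProxyCreationEnabled = false;     // no dynamic proxies
    db.Configuration.AutoDetectChangesEnabled = false; // skip change-tracking scans

    // id = the book id being looked up
    var bookAndReviews = db.Books
        .Include(b => b.Reviews)
        .AsNoTracking()
        .FirstOrDefault(b => b.BookId == id);
}
```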
I know this sounds lame, but have you tried it the other way around, e.g.:
var reviewsAndBooks = db.Reviews.Where(r => r.Book.BookId == id)
.Include(r => r.Book);
I have noticed that EF sometimes performs better when you approach your queries this way (but I haven't had the time to figure out why).
Related
public class Student
{
public int StudentId { get; set; }
public string StudentName { get; set; }
public int CourseId { get; set; }
public virtual Course Courses { get; set; }
}
public class Course
{
public int CourseId { get; set; }
public string CourseName { get; set; }
public string Description { get; set; }
public ICollection<Student> Students { get; set; }
public ICollection<Lecture> Lectures { get; set; }
}
public class Lecture
{
public int LectureId { get; set; }
public string LectureName { get; set; }
public int CourseId { get; set; }
public virtual Course Courses { get; set; }
}
What is the keyword virtual used for here?
I was told a virtual is for lazy loading but I don't understand why.
Because when we do
_context.Lecture.FirstOrDefault()
the result returns the first Lecture and it does not include the attribute Course.
To get the Lecture with the Course, we have to use:
_context.Lecture.Include("Courses").FirstOrDefault()
So even without the virtual keyword, it already behaves lazily: the Course is only loaded when I ask for it with Include.
Then why do we need the keyword?
By declaring it virtual, you allow EF to substitute the property with a proxy that enables lazy loading. Using Include() tells the EF query to eager-load the related data instead.
In EF6 and earlier, lazy loading was enabled by default. In EF Core it is disabled by default (and was not supported at all in the earliest versions).
Take the following query:
var lecture = _context.Lecture.Single(x => x.LectureId == lectureId);
to load one lecture.
If you omit virtual, then accessing lecture.Course would do one of two things. If the DbContext (_context) was not already tracking an instance of the Course that lecture.CourseId points at, lecture.Course would return null. If the DbContext was already tracking that instance, lecture.Course would return it. So without lazy loading you might, or might not, get a reference; don't count on it being there.
With virtual and lazy loading in the same scenario, the proxy checks if the Course has been provided by the DbContext and returns it if so. If it hasn't been loaded then it will automatically go to the DbContext if it is still in scope and attempt to query it. In this way if you access lecture.Course you can count on it being returned if there is a record in the DB.
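As a small illustration (assuming EF6 with lazy loading enabled and the model above), the second query fires at the moment the navigation property is touched:

```csharp
// One query here: loads only the Lecture row
var lecture = _context.Lecture.Single(x => x.LectureId == lectureId);

// Second query here (if the course isn't already tracked):
// the proxy intercepts the property access and loads the Course
var courseName = lecture.Courses.CourseName;
```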
Think of lazy loading as a safety net. It comes with a potentially significant performance cost if relied on, but one could argue that a performance hit is the lesser of two evils compared to runtime bugs with inconsistent data. This can be very evident with collections of related entities. In your above example the ICollection<Student> and such should be marked as virtual as well to ensure those can lazy load. Without that you would get back whatever students might have been tracked at the time, which can be very inconsistent data state at runtime.
Take for example: you have 2 courses, Course #1 and #2, and 4 students, A, B, C, and D. All 4 are registered to Course #1 and only A & B are registered to Course #2. If we ignore lazy loading by removing the virtual keyword, the behavior will change depending on which course we load first, if we happen to eager-load in one case and forget in the other...
using (var context = new MyAppDbContext())
{
var course1 = context.Courses
.Include(x => x.Students)
.Single(x => x.CourseId == 1);
var course2 = context.Courses
.Single(x => x.CourseId == 2);
var studentCount = course2.Students.Count();
}
Disclaimer: With collections in entities you should ensure these are always initialized so they are ready to go. This can be done in the constructor or on an auto-property:
public ICollection<Student> Students { get; set; } = new List<Student>();
In the above example, studentCount would come back as "2" because in loading Course #1, Students A & B were loaded via the Include(x => x.Students). This is a pretty obvious example, loading the two courses right after one another, but this situation can easily occur when loading multiple records that share data, such as search results, etc. It is also affected by how long the DbContext has been alive: this example uses a using block to scope a new DbContext instance, but a context scoped to a web request or similar could be tracking related instances from earlier in the call.
Now reverse the scenario:
using (var context = new MyAppDbContext())
{
var course2 = context.Courses
.Include(x => x.Students)
.Single(x => x.CourseId == 2);
var course1 = context.Courses
.Single(x => x.CourseId == 1);
var studentCount = course1.Students.Count();
}
In this case, only Students A & B were eager loaded. While Course 1 actually references 4 students, studentCount here would return "2" for the two students associated with Course 1 that the DbContext was tracking when Course 1 was loaded. You might expect 4, or 0 knowing that you didn't eager-load the students. The resulting related data is unreliable and what you might or might not get back will be situational.
Where lazy loading gets expensive is when loading sets of data. Say we load a list of 100 students and, while working with those students, we access student.Courses. Eager loading will generate 1 SQL statement to load the 100 students and their related courses. Lazy loading will end up executing 1 query for the students, then 1 query per student to load that student's course (i.e. SELECT * FROM Courses WHERE CourseId = 1; SELECT * FROM Courses WHERE CourseId = 2; ...), so 100 extra queries. If Student had several lazy-loaded navigation properties, that's another 100 queries per property.
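A hedged sketch of both patterns, using the Student/Course model from the question (context is a stand-in DbContext instance):

```csharp
// Eager loading: a single SQL statement with a join
var studentsEager = context.Students
    .Include(s => s.Courses)
    .ToList();

// Lazy loading (virtual navs + proxies): 1 query for the students,
// then up to 1 more query per student when the nav property is accessed
var studentsLazy = context.Students.ToList();
foreach (var student in studentsLazy)
    System.Console.WriteLine(student.Courses.CourseName); // may trigger a query
```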
I wrote a query which is pretty simple:
var locations = await _context.Locations
.Include(x => x.LocationsOfTheUsers)
.Include(x => x.Address)
.ThenInclude(x => x.County)
.Where(CalculateFilters(searchObj))
.ToListAsync(cancellationToken);
And every time, LocationsOfTheUsers was null, so I added .Include(x => x.LocationsOfTheUsers) and received the results I expected. But I'm not sure why I have to include this collection, since it's defined like this:
public class Location
{
public string Title { get; set; }
public long? RegionId { get; set; }
public Region Region { get; set; }
public long? AddressId { get; set; }
public Address Address { get; set; }
public long? CountyId { get; set; }
public County County { get; set; }
public ICollection<LocationsOfTheUsers> LocationsOfTheUsers { get; set; }
}
I thought this would be included automatically, since it exists as an ICollection in the Location class.
So why is .Include() on LocationsOfTheUsers needed here?
Thanks guys
Cheers
In Entity Framework, the non-virtual properties represent the columns of the tables; the virtual properties represent the relations between the tables (one-to-many, many-to-many, ...).
So your property should have been defined as:
public virtual ICollection<LocationsOfTheUsers> LocationsOfTheUsers { get; set; }
One of the slower parts of a database query is the transfer of the selected data from the database management system to your local process. Hence it is wise to limit the selected data to the values you actually plan to use.
If you have a one-to-many relation between Schools and Students, and you ask for School [10] you don't want automatically to fetch its 2000 Students.
Even if you would like to have "School [10] with all its Students", it would not be efficient to use Include to fetch the Students. Every Student has a foreign key SchoolId with a value of 10; if you used Include, you would transfer this foreign key 2000 times. What a waste!
When using Entity Framework, always use Select to fetch data, and select only the properties you actually plan to use. Only use Include if you plan to change the included items.
This way you can separate your database table structure from the actual query. If your database structure changes, only the query changes, users of your query don't notice the internal changes.
Apart from better performance and more robustness against changes, readers of your code can more easily see what values are in their query.
Certainly don't use Include just to save yourself some typing. Having to debug one error after future changes will take far more time than you will ever save by typing Include instead of Select.
Finally: limit your data early in your process, so put the Where in front.
So your query should be:
var predicate = CalculateFilters(searchObj);
var queryLocations = dbContext.Locations
.Where(predicate)
.Select(location => new
{
// Select only the location properties that you plan to use
Id = location.Id,
Name = location.Name,
// Locations of the users:
UserLocations = location.LocationsOfTheUsers
.Select(userLocation => new
{
// again: only the properties that you plan to use
Id = userLocation.Id,
...
// Not needed, you already know the value:
// LocationId = userLocation.LocationId
})
.ToList(),
Address = new
{
Street = location.Address.Street,
PostCode = location.Address.PostCode,
...
// County = location.Address.County.Name // if you only want one property
// or if you want more properties:
County = new
{
Name = location.Address.County.Name,
Abbr = location.Address.County.Abbr,
...
},
},
});
I thought this will be automatically included since it exist as ICollection in Location class.
Well, it's not automatically included, probably for performance reasons, as the graph of related entities and their recursive child entities may be rather deep.
That's why you use eager loading to explicitly include the related entities that you want using the Include method.
The other option is to use lazy loading, which means that related entities are loaded as soon as you access the navigation property in your code, assuming certain prerequisites are fulfilled and that the context is still around when this happens.
Please refer to the docs for more information.
I believe you are using Entity Framework Core. In Entity Framework (EF6), lazy loading is enabled by default; in Entity Framework Core, however, lazy loading of related entities is handled by a separate package, Microsoft.EntityFrameworkCore.Proxies.
To enable the behaviour you are seeking, install that package and add the following code:
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseLazyLoadingProxies();
}
After this, the related entities will be loaded without the Include call.
I have a huge transactions table in an Azure database, into which we import files with over 1 million objects.
public class Transaction
{
[Key]
public int Id { get; set; }
public int TransactionId { get; set; }
public DateTime Date { get; set; }
public decimal Price { get; set; }
public int UserId { get; set; }
public string Product { get; set; }
public int ClientId { get; set; }
public int Uploaed { get; set; }
public string UniqueId { get; set; }
public string Custom1 { get; set; }
public string Custom2 { get; set; }
public string Custom3 { get; set; }
}
After importing all the new data, I take all the new transaction IDs, and all the transaction IDs for that client from the database.
// ids from import
string transactionsString = string.Join(",", transactionIdsCsv);
var result = await _transactionsDataRepository.GetByTransactionIdsAndClientId(transactionIdsCsv.ToArray(), clientId);
// ids from repository
string transactionsDBString = string.Join(",", result.ToList());
// remove rows in db where duplicate transactions ids and clientId=ClientId
but I am struggling to find the most effective way. I wanted to do something like:
delete from transactions where transactionId IN (transactionsDBString) and clientId = ClientID
but that would delete both rows, and I only want the new row to be deleted (and the old one to stay). Would that even be a good approach? Even fetching var result = await _transactionsDataRepository... can take a lot of time, since there are millions of rows.
I only want new value to be deleted (and old value to stay)
Since you already know how to identify the transaction IDs you want to delete, you could delete the necessary rows while keeping the oldest, like so (you didn't mention it, but I'm assuming you're using Entity Framework, given your use of the [Key] attribute; correct me if I'm wrong):
var transToRemove = dbContext.Transactions
.Where(t => t.ClientId == clientId && transIds.Contains(t.TransactionId))
.GroupBy(t => t.TransactionId, t => t) // Group transactions with the same TransactionId
.SelectMany(
group => group.OrderBy(t => t.Date) // Order the oldest first
.Skip(1) // Skip the oldest (we want to keep it)
);
dbContext.Transactions.RemoveRange(transToRemove);
dbContext.SaveChanges();
Edit: Included an example that should work for Dapper...
var cn = // Create your DbConnection
// This query should select all transactions you want to delete excluding
// those with the oldest Date. This is just like 'transToRemove' above
var selectQuery = @"
SELECT t1.Id FROM Transactions t1
INNER JOIN (
SELECT
MIN(tInner.Date) AS FirstTransDate,
tInner.TransactionId,
tInner.ClientId
FROM Transactions tInner
WHERE tInner.ClientId = @clientId
AND tInner.TransactionId IN @transIds
GROUP BY tInner.TransactionId, tInner.ClientId
) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
WHERE t1.Date != t2.FirstTransDate
";
var idsToDelete = cn.Query<int>(
selectQuery,
new { clientId, transIds }).ToList();
// Delete the whole list in one go
cn.Execute("DELETE FROM Transactions WHERE Id IN @idsToDelete", new { idsToDelete });
(inspiration from here and here)
I haven't tested this using Dapper but the list of idsToDelete should be correct according to this fiddle I made. A couple things to note:
Depending on how long your list of transIds is (I believe those ID's are in result in your own example) you might want to repeat this in smaller batches instead of trying to delete the whole list in one go.
The SQL query above doesn't account for two "duplicate" transactions having the same "oldest" Date. If that can happen in your table, this query will keep both of those rows rather than just one.
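For the batching suggestion above, a minimal sketch (reusing the cn connection and idsToDelete list; the batch size of 1,000 is an arbitrary assumption to tune for your setup):

```csharp
// Delete in chunks instead of one giant IN clause
const int batchSize = 1000;
for (int i = 0; i < idsToDelete.Count; i += batchSize)
{
    var batch = idsToDelete.Skip(i).Take(batchSize).ToList();
    cn.Execute("DELETE FROM Transactions WHERE Id IN @batch", new { batch });
}
```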
Improvements
There are a couple of things that seem a little out of place with your setup that I think you should consider:
even fetching var result = await _transactionsDataRepository... can take a lot of time since there are millions of rows
Millions of rows should not be an issue for any decent database server to handle. It sounds like you are missing some indexes on your table. With proper indexes your queries should be pretty swift as long as you can keep them simple.
but would that be a good way?
Not quite sure what you are referring to as good or bad here, but I'll interpret a little... Right now you are writing tons of rows to a table that seems to contain duplicate data. In a transaction-based system, no two transactions should share the same ID: for two different ClientIds there should never be a case where t1.TransactionId == t2.TransactionId. If that holds, you could avoid checking ClientId in my code snippet above.
Since you only want to keep one transaction for each TransactionId, will you ever need two transactions with the same TransactionId? If not, you can go even further and make the TransactionId column unique, preventing two rows with the same TransactionId from ever being inserted. You can use the Entity Framework [Index(IsUnique = true)] attribute to create a unique index, which also speeds up queries on that column/property.
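In the model, that suggestion could look like the following (EF6's IndexAttribute; a sketch, not your exact schema):

```csharp
public class Transaction
{
    [Key]
    public int Id { get; set; }

    [Index(IsUnique = true)] // one row per TransactionId, plus fast lookups
    public int TransactionId { get; set; }

    // ... remaining properties unchanged
}
```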
Every time I use the Include extension, it returns an error when a value from the included entity is used in the WHERE clause.
I included System.Data.Entity, which is the common answer, but I still have the same issue.
Model:
public partial class business_partner
{
public int id { get; set; }
public string accountid { get; set; }
}
public partial class order
{
public int id { get; set; }
public string doc_number { get; set; }
public int vendor_id { get; set; }
public int status { get; set; }
[ForeignKey("vendor_id")]
public virtual business_partner businessPartnerVendor { get; set; }
}
public IQueryable<order> GetOrder()
{
return (context.order);
}
Query:
_orderService.GetOrder()
.Include(a => a.businessPartnerVendor)
.Where(o => o.doc_number == "Order Number"
&& o.businessPartnerVendor.accountid == "TEST"
&& o.status > 2 && o.status != 9).Count() > 0
Exception:
The specified type member 'businessPartnerVendor' is not supported in LINQ to Entities. Only initializers, entity members, and entity navigation properties are supported.
Alas, you forgot to state your requirement. Your code doesn't do what you want, so I might come to the incorrect conclusion, but looking at your code it seems that you want the following:
Tell me whether there are Orders, that
- have a value of DocNumber that equals "Order_Number",
- AND that are orders of a BusinessPartnerVendor with a value of AccountId equal to "TEST",
- AND have a value of Status which is more than 2 and not equal to 9.
The part "Tell me whether there are Orders that" was deduced from the fact that you only want to know whether Count() > 0.
Your Count would have joined all elements, included all columns of BusinessPartnerVendor, removed all rows that didn't match your Where, and counted how many joined items were left. That integer would then be transferred, after which your process would check whether it is larger than zero.
One of the slower parts of a database query is the transfer of the selected data from the database management system to your local process. Hence it is wise to limit the amount of transferred data.
Quite often I see people using Include to get the items that are stored in a different table (quite often a one-to-many). This will select the complete row. From the businessPartnerVendor, you only want to use property AccountId. So why select the complete object?
In Entity Framework, use Select to select only the properties you want to query. Only use Include if you want to update the fetched data.
bool areTestOrdersAvailable = orderService.GetOrder()
.Where(order => order.doc_number == "Order Number"
&& order.businessPartnerVendor.accountid == "TEST"
&& order.status > 2 && order.status != 9)
.Any();
Because of the virtual keyword in your classes (and maybe some fluent API), Entity Framework knows about the one-to-many relation and will perform the correct join for you. Any() only needs to detect whether at least one matching row exists (something like SQL TOP 1 / EXISTS), so only a single Boolean is transferred.
Some Advices about entity framework
It is good practice to stick as much as possible to the Entity Framework code-first conventions. The more you do this, the fewer attributes and less fluent API you need. There will also be less discrepancy between Microsoft's choice of identifiers for classes, fields, properties, methods, etc. and yours.
In entity framework, all columns of a table are represented by non-virtual properties, the virtual properties represent the relations between tables (one-to-many, many-to-many, ...)
My advice would be: add the foreign keys to your classes, and stick to one identifier to describe one row in your tables.
So decide whether to use business_partner or BusinessPartnerVendor, if they are in fact the same kind of thing.
Add the foreign key:
// Every Order is the Order of exactly one BusinessPartner, using foreign key (one-to-many)
public int BusinessPartnerId {get; set;}
public virtual BusinessPartner BusinessPartner {get; set;}
This has the advantage, that if you want to select the Ids of all BusinessPartners that have one or more Orders that ..., you don't have to perform a join:
var businessPartnerIds = myDbContext.Orders
.Where(order => ...)
.Select(order => order.BusinessPartnerId)
.Distinct();
Only one database table will be accessed
I am trying to design a new system for tracking sales. A simplistic version of my data models are:
public class Sale
{
public int SaleId { get; set; }
public DateTime CompletedDateTime { get; set; }
public virtual List<SaleItem> SaleItems { get; set; }
public decimal Total
{
get
{
return SaleItems.Sum(i => i.Price);
}
}
}
public class SaleItem
{
public int SaleItemId { get; set; }
public decimal Price { get; set; }
public int SaleId { get; set; }
public virtual Sale Sale { get; set; }
}
I am now writing some reports which total the sales value for between a specified period. I have the following code to do that:
List<Sale> dailySales = db.Sales
.Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) >= fromParam)
.Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) <= toParam)
.ToList();
decimal total = dailySales.Sum(x => x.Total);
This works and gives me the expected result, but I feel it might cause problems further down the line once large datasets get involved. I assume having to load all the Sales into a list will become resource-intensive; plus, my actual implementation has tax, costs, etc. associated with each SaleItem, so it again becomes more complex.
The following would allow me to do all the processing on the database; however, it is not possible because the DB has no representation of Total, so EF throws an error:
Decimal total = db.Sales.Sum(x=>x.Total);
Which leads me to my question. I could set my model as the following and, each time I add a SaleItem, make sure I update the Total:
public class Sale
{
...
public decimal Total { get; set; }
}
This would then allow me to query the database as required, and I assume it will be less resource-intensive. The flip side, though, is that I have introduced redundancy into the database. Is the latter the better method of dealing with this, or is there an alternative I haven't even considered?
It depends on many factors. For instance, how often will you require the "Total" amount to be available? And how many SaleItems are there usually in a Sale?
If we're talking about, say, a supermarket kind of sale with, at the absolute maximum, 200 items, it's quite okay to just calculate it quickly on the fly. Then again, once this gets mapped to an RDBMS and all the SaleItems live in one table, an index on the foreign key (which links each individual SaleItem to its Sale) is a must; otherwise performance will take a huge hit once you have millions of transactions to sift through.
Answering the second half of your question: redundancy is not always a bad thing. You just need to make sure that whenever a Sale's item list is modified, the Total is recalculated afterwards. It's slightly dangerous (redundancy always carries this burden), but you just need to ensure that whatever can change a Sale does so in a way (maybe even with a trigger in the RDBMS) that automatically recalculates the total.
Hope it helps!
You're right that it's much more effective to calculate totals on the DB side instead of loading the whole list and calculating them in the application.
I think you're missing that you can write a LINQ query that gets the SUM of related child entities.
using (var ctx = new MyDbContext())
{
var totalSales = ctx.Sales
.Select(s => s.SaleItems.Sum(si => si.Price)) // Total of each Sale
.Sum(tsi => tsi); // Sum of the total of each sale
}
You can of course shape the query to bring additional information, projecting the result in an anonymous class or in a class created ad-hoc for this purpose.
Of course, this EF query will be translated into a SQL query and executed on the server side.
When you start using LINQ to Entities, it's not always obvious how to get what you want, but on most occasions you can do it.
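Applied to the report query from the question, the period filter and the sum can both run server-side. This is a sketch; the nullable cast is a common EF6 idiom that guards against summing an empty set:

```csharp
// Server-side total for the period: EF translates this to a single
// SQL statement; only one decimal comes back over the wire.
decimal total = db.Sales
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) >= fromParam
             && DbFunctions.TruncateTime(x.CompletedDateTime) <= toParam)
    .SelectMany(s => s.SaleItems)
    .Sum(si => (decimal?)si.Price) ?? 0m;
```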