Virtual keyword in Entity Framework properties - c#

public class Student
{
public int StudentId;
public string StudentName;
public int CourseId;
public virtual Course Courses { get; set; }
}
public class Course
{
public int CourseId;
public string CourseName;
public string Description;
public ICollection<Student> Students {get;set;}
public ICollection<Lecture> Lectures { get; set; }
}
public class Lecture
{
public int LectureId;
public string LectureName;
public int CourseId;
public virtual Course Courses { get; set; }
}
What is the keyword virtual used for here?
I was told a virtual is for lazy loading but I don't understand why.
Because when we do
_context.Lecture.FirstOrDefault()
the result returns the first Lecture and it does not include the attribute Course.
To get the Lecture with the Course, we have to use:
_context.Lecture.Include("Courses").FirstOrDefault()
without using a virtual keyword, it's already a lazy-loading.
Then why do we need the keyword?

By declaring it virtual you allow EF to substitute the value property with a proxy to enable lazy loading. Using Include() is telling the EF query to eager-load the related data.
In EF6 and prior, lazy loading was enabled by default. With EF Core it is disabled by default. (Or not supported in the earliest versions)
Take the following query:
var lecture = _context.Lecture.Single(x => x.LectureId == lectureId);
to load one lecture.
If you omit virtual then accessing lecture.Course would do one of two things. If the DbContext (_context) was not already tracking an instance of the Course that lecture.CourseId was pointing at, lecture.Course would return null. If the DbContext was already tracking that instance, then lecture.Course would return that instance. So without lazy loading you might or might not get a reference; don't count on it being there.
With virtual and lazy loading in the same scenario, the proxy checks if the Course has been provided by the DbContext and returns it if so. If it hasn't been loaded then it will automatically go to the DbContext if it is still in scope and attempt to query it. In this way if you access lecture.Course you can count on it being returned if there is a record in the DB.
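To make the proxy idea concrete, here is a minimal plain-C# sketch of the mechanism. It is not the class EF actually generates: LectureProxy, the loadCourse callback and LazyLoadingDemo are hypothetical names made up for illustration, and the "database" is just a lambda. It only shows why the navigation property (Courses in the question's Lecture class) has to be virtual: a derived class can then override the getter and load on first access.

```csharp
using System;

public class Course
{
    public int CourseId { get; set; }
    public string CourseName { get; set; }
}

public class Lecture
{
    public int LectureId { get; set; }
    public int CourseId { get; set; }

    // virtual is what lets a derived class intercept access to this property.
    public virtual Course Courses { get; set; }
}

// Hand-written stand-in for the kind of proxy class EF generates at runtime.
public class LectureProxy : Lecture
{
    private readonly Func<int, Course> _loadCourse; // hypothetical "go to the database" callback
    private bool _loaded;

    public LectureProxy(Func<int, Course> loadCourse)
    {
        _loadCourse = loadCourse;
    }

    public override Course Courses
    {
        get
        {
            // Load only on first access, and only if nothing was set already.
            if (!_loaded && base.Courses == null)
            {
                base.Courses = _loadCourse(CourseId);
            }
            _loaded = true;
            return base.Courses;
        }
        set
        {
            base.Courses = value;
            _loaded = true;
        }
    }
}

public static class LazyLoadingDemo
{
    // Reads the navigation property twice and returns how many simulated queries ran.
    public static int Run()
    {
        int queries = 0;
        Lecture lecture = new LectureProxy(id =>
        {
            queries++;
            return new Course { CourseId = id, CourseName = "Course " + id };
        })
        { LectureId = 1, CourseId = 7 };

        Console.WriteLine(lecture.Courses.CourseName); // first access triggers the load
        Console.WriteLine(lecture.Courses.CourseName); // second access reuses the loaded instance
        return queries;
    }

    public static void Main()
    {
        Console.WriteLine("Simulated queries: " + Run());
    }
}
```

If Courses were not virtual, the override would not compile, and a plain Lecture instance has no hook to fetch anything: the property just returns whatever was assigned.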
Think of lazy loading as a safety net. It comes with a potentially significant performance cost if relied on, but one could argue that a performance hit is the lesser of two evils compared to runtime bugs with inconsistent data. This can be very evident with collections of related entities. In your above example the ICollection<Student> and such should be marked as virtual as well to ensure those can lazy load. Without that you would get back whatever students might have been tracked at the time, which can be very inconsistent data state at runtime.
Take for example 2 courses, Course #1 and Course #2. There are 4 students: A, B, C, and D. All 4 are registered to Course #1 and only A & B are registered to Course #2. If we ignore lazy loading by removing the virtual, then the behavior will change depending on which course we load first, if we happen to eager-load in one case and forget in the second...
using (var context = new MyAppDbContext())
{
var course1 = context.Courses
.Include(x => x.Students)
.Single(x => x.CourseId == 1);
var course2 = context.Courses
.Single(x => x.CourseId == 2);
var studentCount = course2.Students.Count();
}
Disclaimer: With collections in entities you should ensure these are always initialized so they are ready to go. This can be done in the constructor or on an auto-property:
public ICollection<Student> Students { get; set; } = new List<Student>();
In the above example, studentCount would come back as "2" because in loading Course #1, both Student A & B were loaded via the Include(x => x.Students). This is a pretty obvious example, loading the two courses right after one another, but this situation can easily occur when loading multiple records that share data, such as search results. It is also affected by how long the DbContext has been alive. This example uses a using block for a new DbContext instance scope; one scoped to the web request or such could be tracking related instances from earlier in the call.
Now reverse the scenario:
using (var context = new MyAppDbContext())
{
var course2 = context.Courses
.Include(x => x.Students)
.Single(x => x.CourseId == 2);
var course1 = context.Courses
.Single(x => x.CourseId == 1);
var studentCount = course1.Students.Count();
}
In this case, only Students A & B were eager loaded. While Course 1 actually references 4 students, studentCount here would return "2" for the two students associated with Course 1 that the DbContext was tracking when Course 1 was loaded. You might expect 4, or 0 knowing that you didn't eager-load the students. The resulting related data is unreliable and what you might or might not get back will be situational.
Where lazy loading gets expensive is when loading sets of data. Say we load a list of 100 students and, when working with those students, we access student.Course. Eager loading will generate 1 SQL statement to load 100 students and their related courses. Lazy loading will end up executing 1 query for the students, then 100 queries to load the course for each student (i.e. SELECT * FROM Courses WHERE CourseId = 1; SELECT * FROM Courses WHERE CourseId = 2; ...). If Student had several lazy-loaded properties then that's another 100 queries per lazy-loaded property.
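The 1 versus 1 + 100 arithmetic can be sketched without EF at all. The following is a simulation under stated assumptions, not real EF behavior: FakeDatabase is a hypothetical in-memory stand-in that only counts round trips, and every student is assumed to point at a different course (the worst case for per-row loading).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Course
{
    public int CourseId { get; set; }
    public string CourseName { get; set; }
}

public class Student
{
    public int StudentId { get; set; }
    public int CourseId { get; set; }
}

// Hypothetical stand-in for a database that counts round trips.
public class FakeDatabase
{
    private readonly List<Course> _courses = Enumerable.Range(1, 100)
        .Select(i => new Course { CourseId = i, CourseName = "Course " + i })
        .ToList();

    // Assumption: student i attends course i, so no two students share a course.
    private readonly List<Student> _students = Enumerable.Range(1, 100)
        .Select(i => new Student { StudentId = i, CourseId = i })
        .ToList();

    public int QueryCount { get; private set; }

    public List<Student> LoadStudents()
    {
        QueryCount++;
        return _students.ToList();
    }

    public Course LoadCourse(int courseId)
    {
        QueryCount++;
        return _courses.Single(c => c.CourseId == courseId);
    }

    // One round trip that returns students already joined to their courses.
    public List<KeyValuePair<Student, Course>> LoadStudentsWithCourses()
    {
        QueryCount++;
        return _students
            .Join(_courses, s => s.CourseId, c => c.CourseId,
                  (s, c) => new KeyValuePair<Student, Course>(s, c))
            .ToList();
    }
}

public static class NPlusOneDemo
{
    // One query for the list, then one more per student.
    public static int LazyStyleQueryCount()
    {
        var db = new FakeDatabase();
        foreach (var student in db.LoadStudents())
        {
            _ = db.LoadCourse(student.CourseId);
        }
        return db.QueryCount;
    }

    // A single joined query.
    public static int EagerStyleQueryCount()
    {
        var db = new FakeDatabase();
        _ = db.LoadStudentsWithCourses();
        return db.QueryCount;
    }

    public static void Main()
    {
        Console.WriteLine("Lazy style:  " + LazyStyleQueryCount());
        Console.WriteLine("Eager style: " + EagerStyleQueryCount());
    }
}
```

In this simulation the lazy-style loop makes 101 round trips and the eager-style call makes 1.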

Related

LINQ Query optimisation using EF6

I'm trying my hand at LINQ for the first time and just wanted to post a small question to make sure if this was the best way to go about it. I want a list of every value in a table. So far this is what I have, and it works, but is this the best way to go about collecting everything in a LINQ friendly way?
public static List<Table1> GetAllDatainTable()
{
List<Table1> Alldata = new List<Table1>();
using (var context = new EFContext())
{
Alldata = context.Tablename.ToList();
}
return Alldata;
}
For simple entities, that is an entity that has no references to other entities (navigation properties) your approach is essentially fine. It can be condensed down to:
public static List<Table1> GetAllDatainTable()
{
using (var context = new EFContext())
{
return context.Table1s.ToList();
}
}
However, in most real-world scenarios you are going to want to leverage things like navigation properties for the relationships between entities. I.e. an Order references a Customer with Address details, and contains OrderLines which each reference a Product, etc. Returning entities this way becomes problematic because any code that accepts the entities returned by a method like this should be getting either complete, or completable entities.
For instance if I have a method that returns an order, and I have various code that uses that order information: Some of that code might try to get info about the order's customer, other code might be interested in the products. EF supports lazy loading so that related data can be pulled if, and when needed, however that only works within the lifespan of the DbContext. A method like this disposes the DbContext so Lazy Loading is off the cards.
One option is to eager load everything:
using (var context = new EFContext())
{
var order = context.Orders
.Include(o => o.Customer)
.ThenInclude(c => c.Addresses)
.Include(o => o.OrderLines)
.ThenInclude(ol => ol.Product)
.Single(o => o.OrderId == orderId);
return order;
}
However, there are two drawbacks to this approach. Firstly, it means loading considerably more data every time we fetch an order. The consuming code may not care about the customer or order lines, but we've loaded it all anyway. Secondly, as systems evolve, new relationships may be introduced, and older code won't necessarily be updated to include them, leading to potential NullReferenceExceptions, bugs, or performance issues as more and more related data gets included. The view or whatever initially consumes this entity may not expect to reference these new relationships, but once you start passing entities to views, from views, and to other methods, any code accepting an entity should be able to rely on the entity being complete or completable. It can be a nightmare to have an Order potentially loaded in various levels of "completeness", with code having to handle whether data is loaded or not. As a general recommendation, I advise against passing entities around outside the scope of the DbContext that loaded them.
The better solution is to leverage projection to populate view models from the entities, suited to your code's consumption. WPF often utilizes the MVVM pattern, so this means using EF's Select method or Automapper's ProjectTo method to populate view models based on each of your consumer's needs. When your code works with view models containing the data that views and such need, and loads and populates entities only as needed, you can produce far more efficient (faster) and more resilient queries to get data out.
If I have a view that lists orders with a created date, customer name, and list of products /w quantities we define a view model for the view:
[Serializable]
public class OrderSummary
{
public int OrderId { get; set; }
public string OrderNumber { get; set; }
public DateTime CreatedAt { get; set; }
public string CustomerName { get; set; }
public ICollection<OrderLineSummary> OrderLines { get; set; } = new List<OrderLineSummary>();
}
[Serializable]
public class OrderLineSummary
{
public int OrderLineId { get; set; }
public int ProductId { get; set; }
public string ProductName { get; set; }
public int Quantity { get; set; }
}
then project the view models in the Linq query:
using (var context = new EFContext())
{
var orders = context.Orders
// add filters & such /w Where() / OrderBy() etc.
.Select(o => new OrderSummary
{
OrderId = o.OrderId,
OrderNumber = o.OrderNumber,
CreatedAt = o.CreatedAt,
CustomerName = o.Customer.Name,
OrderLines = o.OrderLines.Select( ol => new OrderLineSummary
{
OrderLineId = ol.OrderLineId,
ProductId = ol.Product.ProductId,
ProductName = ol.Product.Name,
Quantity = ol.Quantity
}).ToList()
}).ToList();
return orders;
}
Note that we don't need to worry about eager loading related entities, and if later down the road an order or customer or such gains new relationships, the above query will continue to work, only being updated if the new relationship information is useful for the view(s) it serves. It can compose a faster, less memory intensive query fetching fewer fields to be passed over the wire from the database to the application, and indexes can be employed to tune this even further for high-use queries.
Update:
Additional performance tips: Generally avoid methods like GetAll*() as a lowest common denominator method. Far too many performance issues I come across with methods like this are in the form of:
var ordersToShip = GetAllOrders()
.Where(o => o.OrderStatus == OrderStatus.Pending)
.ToList();
foreach (var order in ordersToShip)
{
// do something that only needs order.OrderId.
}
Where GetAllOrders() returns List<Order> or IEnumerable<Order>. Sometimes there is code like GetAllOrders().Count() > 0 or such.
Code like this is extremely inefficient because GetAllOrders() fetches all records from the database, only to load them into memory in the application to later be filtered down or counted etc.
If you're following a path to abstract away the EF DbContext and entities into a service / repository through methods then you should ensure that the service exposes methods to produce efficient queries, or forgo the abstraction and leverage the DbContext directly where data is needed.
var orderIdsToShip = context.Orders
.Where(o => o.OrderStatus == OrderStatus.Pending)
.Select(o => o.OrderId)
.ToList();
var customerOrderCount = context.Customer
.Where(c => c.CustomerId == customerId)
.Select(c => c.Orders.Count())
.Single();
EF is extremely powerful and when selected to service your application should be embraced as part of the application to give the maximum benefit. I recommend avoiding coding to abstract it away purely for the sake of abstraction unless you are looking to employ unit testing to isolate the dependency on data with mocks. In this case I recommend leveraging a unit of work wrapper for the DbContext and the Repository pattern leveraging IQueryable to make isolating business logic simple.
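As a rough sketch of what a repository "leveraging IQueryable" could look like, the code below keeps the filtering and projection in the consumer while still allowing the data source to be swapped in a test. All names here (IOrderRepository, InMemoryOrderRepository, ShippingService, RepositoryDemo) are hypothetical; the in-memory implementation is only a test double, and a production implementation would presumably return the DbContext's Orders set instead so the Where/Select are translated to SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum OrderStatus { Pending, Shipped }

public class Order
{
    public int OrderId { get; set; }
    public OrderStatus OrderStatus { get; set; }
}

// Exposes a composable query rather than a materialized List<Order>.
public interface IOrderRepository
{
    IQueryable<Order> GetOrders();
}

// Test double backed by a list (assumption: used only for isolating business logic).
public class InMemoryOrderRepository : IOrderRepository
{
    private readonly List<Order> _orders;

    public InMemoryOrderRepository(IEnumerable<Order> orders)
    {
        _orders = orders.ToList();
    }

    public IQueryable<Order> GetOrders()
    {
        return _orders.AsQueryable();
    }
}

public class ShippingService
{
    private readonly IOrderRepository _repository;

    public ShippingService(IOrderRepository repository)
    {
        _repository = repository;
    }

    // The consumer decides what to filter and which fields to select.
    public List<int> GetOrderIdsToShip()
    {
        return _repository.GetOrders()
            .Where(o => o.OrderStatus == OrderStatus.Pending)
            .Select(o => o.OrderId)
            .ToList();
    }
}

public static class RepositoryDemo
{
    public static List<int> Run()
    {
        var repository = new InMemoryOrderRepository(new[]
        {
            new Order { OrderId = 1, OrderStatus = OrderStatus.Pending },
            new Order { OrderId = 2, OrderStatus = OrderStatus.Shipped },
            new Order { OrderId = 3, OrderStatus = OrderStatus.Pending }
        });
        return new ShippingService(repository).GetOrderIdsToShip();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

Because GetOrders() returns IQueryable<Order>, nothing is materialized until ToList() is called, which is the difference from a GetAllOrders() that returns a List<Order>.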

why ef lost relationship once SaveChanges?

If I simply do this:
var medical = ctx.Medicals.FirstOrDefault(p => p.ID == medicalViewModel.ID);
var sizeClinics = medical.Clinics.Count;
The amount is (for example) 10 (i.e. I have 10 clinics for that medical).
Now, if I do this:
var medical = mapper.Map<MedicalViewModel, Medicals>(medicalViewModel);
ctx.Entry(medical).State = medical.ID == 0 ? EntityState.Added : EntityState.Modified;
ctx.SaveChanges();
medical = ctx.Medicals.FirstOrDefault(p => p.ID == medicalViewModel.ID);
var sizeClinics = medical.Clinics.Count;
The size is 0. Why? It seems it removes the relationship after SaveChanges?
Here's the Medicals object:
public partial class Medicals
{
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2214:DoNotCallOverridableMethodsInConstructors")]
public Medicals()
{
this.Activities = new HashSet<Activities>();
this.MedicalsRefunds = new HashSet<MedicalsRefunds>();
this.Clinics = new HashSet<Clinics>();
}
public int ID { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
public string Email { get; set; }
public string Phone { get; set; }
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2227:CollectionPropertiesShouldBeReadOnly")]
public virtual ICollection<Activities> Activities { get; set; }
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2227:CollectionPropertiesShouldBeReadOnly")]
public virtual ICollection<MedicalsRefunds> MedicalsRefunds { get; set; }
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2227:CollectionPropertiesShouldBeReadOnly")]
public virtual ICollection<Clinics> Clinics { get; set; }
}
One thing I've noticed: if I inspect the medical object with QuickWatch the first time (without the SaveChanges part), it shows as {System.Data.Entity.DynamicProxies.Medicals_650D310387E78A83885649345ED0FB2870EC304BF647B59321DFA0E4FBC78047}.
Instead, if I do SaveChanges and then I retrieve that medical, it is as {MyNamespace.Models.Medicals}.
What can it be?
This question is answered by understanding how Entity Framework works internally. I'll try to highlight the key features here.
Change tracking
Entity Framework has a sort of cache of entities in-memory, called the change tracker.
In your first example, when you fetch an entity from the database:
var medical = ctx.Medicals.FirstOrDefault(p => p.ID == medicalViewModel.ID);
Entity Framework creates the Medicals instance that you receive. When it does so, it also uses that opportunity to store a reference to that object, for its own reasons. It will keep an eye on those objects and track any changes made to them.
For example, if you now call ctx.SaveChanges(); at any point, it's going to look at everything in its change tracker, see which things have been changed, and update those in the database.
There are several benefits attached to this: you don't have to explicitly tell EF that you made changes to some of the entities it was already tracking in its cache, and EF can also spot which specific fields have changed, so it only has to update those specific fields and it can ignore the unchanged fields.
Update from comments: EF only allows the tracking of one instance of a given entity, based on the PK value. So if you've already tracked the Medical with ID 123, you can't track another instance of the same Medical entity with ID 123.
Lazy loading
The code you use suggests that you are lazy loading. I'm going to gloss over the intricate details here, to keep it simple. If you don't know what lazy/eager loading is, I suggest you look this up, as the explanation is too long to write down here. Lazy/eager loading is a key concept in Entity Framework for dealing with entity relations and how to fetch related entities.
When dealing with lazy loading, EF slightly tinkers with your entity when it fetches it for you. It puts a special lazy collection in all the entity's navigational properties (such as medical.Clinics), so that it will fetch the related data only when you actually try to access it, i.e. by enumerating the collection in any way.
Comparatively, if you were using eager loading, EF wouldn't do this for you and the nav prop simply wouldn't be filled in with anything unless you explicitly called Include on it.
Updating untracked entities
In your second example, you are working with an entity object which was not created by Entity Framework. You made it yourself:
var medical = mapper.Map<MedicalViewModel, Medicals>(medicalViewModel);
And now you manually add it to the change tracker:
ctx.Entry(medical).State = medical.ID == 0 ? EntityState.Added : EntityState.Modified;
There's nothing wrong with this, but you have to realize that the entity in the change tracker was not generated by EF, and therefore it doesn't contain these special "lazy navigational properties". And because it doesn't contain these lazy navigational properties...
var sizeClinics = medical.Clinics.Count;
... the above code doesn't actually try to fetch the data from the database. It simply works with the entity object you generated and what it already contains in-memory.
And since you didn't add anything to medical.Clinics yourself, the collection is therefore empty.
The answer
Lazy loading only works on entity objects generated by EF, not on entity objects generated by you, regardless of whether you manually added it to EF's change tracker afterwards or not.
So to get the count, you can specifically query the clinics from the database:
var medical = mapper.Map<MedicalViewModel, Medicals>(medicalViewModel);
var clinicCount = ctx.Clinics.Count(p => p.MedicalId == medical.ID);
Or you could detach the entity and fetch it from the db, though I'm not a fan of this:
var medical = mapper.Map<MedicalViewModel, Medicals>(medicalViewModel);
ctx.Entry(medical).State = medical.ID == 0 ? EntityState.Added : EntityState.Modified;
ctx.SaveChanges();
// Detach
ctx.Entry(medical).State = EntityState.Detached;
// Now fetch from db
var medical2 = ctx.Medicals.FirstOrDefault(p => p.ID == medical.ID);
var sizeClinics = medical2.Clinics.Count;
Why detach? Remember how I mentioned that EF only allows tracking of one entity of a given type and PK. Since the object referred to by medical is already being tracked, you can't fetch and track another new instance of Medicals with the same PK.
By detaching the first, medical2 can be fetched and tracked since the change tracker "forgot" the other instance.
But to be honest, it would be easier to just open a new context instead of trying to manually detach and re-query.
var medical = mapper.Map<MedicalViewModel, Medicals>(medicalViewModel);
ctx.Entry(medical).State = medical.ID == 0 ? EntityState.Added : EntityState.Modified;
ctx.SaveChanges();
using(var ctx2 = new MyContext())
{
var medical2 = ctx2.Medicals.FirstOrDefault(p => p.ID == medical.ID);
var sizeClinics = medical2.Clinics.Count;
}
More info if you're interested
If you're using code first, lazy loading is why EF requires you to make these properties virtual. EF needs to be able to inherit from your entity class and make a special derived class which overrides the navigational property behavior.
You already stumbled on this, when you said:
One thing I've noticed: if I inspect the medical object with QuickWatch the first time (without the SaveChanges part), it shows as {System.Data.Entity.DynamicProxies.Medicals_650D310387E78A83885649345ED0FB2870EC304BF647B59321DFA0E4FBC78047}.
Instead, if I do SaveChanges and then I retrieve that medical, it is as {MyNamespace.Models.Medicals}.
That System.Data.Entity.DynamicProxies.Medicals_65 (and so on) class was dynamically generated by Entity Framework, inherits the Medicals class, and overrides the virtual navigational properties so that it lazily loads this information when the collection is enumerated.
This is the hidden magic of how EF achieves lazy loading.

Why do I need to .Include() collections

I wrote a query which is pretty simple:
var locations = await _context.Locations
.Include(x => x.LocationsOfTheUsers)
.Include(x => x.Address)
.ThenInclude(x => x.County)
.Where(CalculateFilters(searchObj))
.ToListAsync(cancellationToken);
And every time LocationsOfTheUsers was null, so I decided to add .Include(x => x.LocationsOfTheUsers) and I received the results as expected. But I'm not sure why I have to include this collection, since it's defined like this:
public class Location
{
public string Title { get; set; }
public long? RegionId { get; set; }
public Region Region { get; set; }
public long? AddressId { get; set; }
public Address Address { get; set; }
public long? CountyId { get; set; }
public County County { get; set; }
public ICollection<LocationsOfTheUsers> LocationsOfTheUsers { get; set; }
}
I thought this will be automatically included since it exist as ICollection in Location class.
So why is .Include() on LocationsOfTheUsers needed here?
Thanks guys
Cheers
In entity framework the non-virtual properties represent the columns of the tables, the virtual properties represent the relations between the tables (one-to-many, many-to-many, ...)
So your property should have been defined as:
public virtual ICollection<LocationsOfTheUsers> LocationsOfTheUsers { get; set; }
One of the slower parts of a database query is the transfer of the selected data from the database management system to your local process. Hence it is wise to limit the selected data to the values you actually plan to use.
If you have a one-to-many relation between Schools and Students, and you ask for School [10] you don't want automatically to fetch its 2000 Students.
Even if you would like to have "School [10] with all its Students" it would not be efficient to use Include to also fetch the Students. Every Student will have a foreign key SchoolId with a Value of [10]. If you would use Include you would transfer this foreign key 2000 times. What a waste!
When using entity framework always use Select to fetch data and select only the properties that you actually plan to use. Only use Include if you plan to change the included items.
This way you can separate your database table structure from the actual query. If your database structure changes, only the query changes, users of your query don't notice the internal changes.
Apart from better performance and more robustness against changes, readers of your code can more easily see what values are in their query.
Certainly don't use Include to save yourself some typing. Having to debug one error after future changes will take way more time than you will ever save by typing Include instead of Select.
Finally: limit your data early in your process, so put the Where in front.
So your query should be:
var predicate = CalculateFilters(searchObj);
var queryLocations = dbContext.Locations
.Where(predicate)
.Select(location => new
{
// Select only the location properties that you plan to use
Id = location.Id,
Name = location.Name,
// Locations Of the users:
UserLocations = location.LocationsOfTheUsers
.Select(userLocation => new
{
// again: only the properties that you plan to use
Id = userLocation.Id,
...
// Not needed, you already know the value
// LocationId = userLocation.LocationId
})
.ToList(),
Address = new
{
Street = location.Address.Street,
PostCode = location.Address.PostCode,
...
County = location.Address.County.Name, // if you only want one property
// or, if you want more properties, use this instead:
// County = new
// {
// Name = location.Address.County.Name,
// Abbr = location.Address.County.Abbr,
// ...
// },
},
});
I thought this will be automatically included since it exist as ICollection in Location class.
Well, it's not automatically included, probably for performance reasons as the graph of related entities and their recursive child entities may be rather deep.
That's why you use eager loading to explicitly include the related entities that you want using the Include method.
The other option is to use lazy loading which means that the related entities are loaded as soon as you access the navigation property in your code, assuming some prerequisites are fulfilled and that the context is still around when this happens.
Please refer to the docs for more information.
I believe you are using EntityFrameworkCore. In EntityFramework (EF6), lazy loading is enabled by default. However, in EntityFrameworkCore, lazy loading of related entities is handled by a separate package, Microsoft.EntityFrameworkCore.Proxies.
To enable the behaviour you are seeking, install the above package and add the following code
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseLazyLoadingProxies();
}
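After this, the related entities will be loaded on access without the Include call, provided the navigation properties are declared virtual so the proxies can override them.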

Entity Framework 6.2 copy many to many from one DbContext to another DbContext

When working with a network database such as MySQL, the DbContext should be short lived, but according to https://www.entityframeworktutorial.net/EntityFramework4.3/persistence-in-entity-framework.aspx the DbContext can be long lived when working with a local database, such as SQLite.
My app is using a long lived DbContext to work with SQLite on HDD and I want to copy many-to-many entities to another DbContext for the same type of SQLite database on USB.
I am using the Code-First approach.
public class Student
{
public Student()
{
this.Courses = new HashSet<Course>();
}
public int StudentId { get; set; }
[Required]
public string StudentName { get; set; }
public virtual ICollection<Course> Courses { get; set; }
}
public class Course
{
public Course()
{
this.Students = new HashSet<Student>();
}
public int CourseId { get; set; }
public string CourseName { get; set; }
public virtual ICollection<Student> Students { get; set; }
}
DbContextHDD contains students StudentA, StudentB and StudentC and courses Course1, Course2 and Course3:
StudentA attends Course1 and Course3
StudentB attends Course2 and Course3
StudentC attends Course1 and Course2
DbContextUSB contains no students and no courses.
var courses = DbContextHDD.Courses.AsNoTracking();
List<Student> students = new List<Student>();
foreach(Course course in courses)
{
foreach(Student student in course.Students)
{
if(!students.Any(s => s.StudentId == student.StudentId))
{
students.Add(student);
}
}
}
Debug.WriteLine(students.Count); // output: 3
Debug.WriteLine(DbContextUSB.Students.Local.Count); // output: 0
DbContextUSB.Students.AddRange(students);
Debug.WriteLine(DbContextUSB.Students.Local.Count); // output: 4
DbContextUSB.SaveChanges(); // exception: UNIQUE constraint failed
DbContextUSB.Courses.AddRange(courses);
DbContextUSB.SaveChanges();
Why are there 4 students (3 unique and 1 duplicate) after I insert 3 unique students in to a DbSet with 0 students? What is the proper way to do this?
As I said, I am using a long lived DbContext because I am working with SQLite.
First, don't use AsNoTracking:
var courses = DbContextHDD.Courses. ...
Second, Include the required data:
var courses = DbContextHDD.Courses
.Include(c => c.Students)
.ToList();
Third, add the courses to the other context:
DbContextUSB.Courses.AddRange(courses);
DbContextUSB.SaveChanges();
You may not believe it, but in essence that's all!
One caveat is that you should disable proxy creation in the source context:
DbContextHDD.Configuration.ProxyCreationEnabled = false;
Otherwise EF creates proxy objects, which have a reference to the context they came from. They can't be attached to another context.
Another is that there may be students that don't attend courses. You'll miss them when querying courses. So you have to add them separately:
var lazyStudents = DbContextHDD.Students.Where(s => s.Courses.Count() == 0).ToList();
...
DbContextUSB.Students.AddRange(lazyStudents);
...
DbContextUSB.SaveChanges();
Why does this work?
Without tracking, Entity Framework can't detect that StudentA in Course1 is the same student as in Course3. As a consequence, StudentA in Course3 is a new Student instance. You'll end up having 6 students, 3 duplicates (if there's no unique index on StudentName preventing this). With tracking, EF does detect that both courses have the same Student instance.
When adding an entity to a context, EF also marks nested entities as Added when they're not yet attached to the context. That's why it's enough to add courses only, and that's why EF doesn't complain when courses contain the same student instances.
Since the added courses have their Students collections properly populated, EF also inserts the required junction records in the StudentCourse table. This didn't happen in your code (well maybe, or partly, see later).
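The "6 students, 3 duplicates" point can be illustrated with a small simulation. This is not EF code: IdentityMapDemo and its Rows array are hypothetical, and the dictionary only mimics what the change tracker does conceptually, namely handing back the instance it already has for a given key instead of creating a new one.

```csharp
using System;
using System.Collections.Generic;

public class Student
{
    public int StudentId { get; set; }
    public string StudentName { get; set; }
}

public static class IdentityMapDemo
{
    // Rows as a join of courses and students might return them,
    // using the course/student assignments from the question.
    private static readonly (int CourseId, int StudentId, string Name)[] Rows =
    {
        (1, 1, "StudentA"), (1, 3, "StudentC"),
        (2, 2, "StudentB"), (2, 3, "StudentC"),
        (3, 1, "StudentA"), (3, 2, "StudentB"),
    };

    // Materializes one Student per row and counts distinct object instances.
    public static int CountInstances(bool useIdentityMap)
    {
        var map = new Dictionary<int, Student>();
        // Student does not override Equals, so this set compares by reference.
        var instances = new HashSet<Student>();

        foreach (var row in Rows)
        {
            Student student = null;
            if (useIdentityMap)
            {
                // "Tracking": reuse the instance already known for this key.
                map.TryGetValue(row.StudentId, out student);
            }
            if (student == null)
            {
                student = new Student { StudentId = row.StudentId, StudentName = row.Name };
                map[row.StudentId] = student;
            }
            instances.Add(student);
        }
        return instances.Count;
    }

    public static void Main()
    {
        Console.WriteLine("Without identity map: " + CountInstances(false));
        Console.WriteLine("With identity map:    " + CountInstances(true));
    }
}
```

Without the map every row yields a fresh object (6 instances for 3 students); with it, each key maps to a single shared instance (3 instances).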
Now why did you get 4 students?
Look at the courses:
Course1 StudentA*, StudentC*
Course2 StudentB*, StudentC
Course3 StudentA , StudentB
Because of AsNoTracking all students are different instances, but only the marked* students are in students because of how you add them. But here's the tricky part. Even with AsNoTracking(), Entity Framework executes relationship fixup with related entities that are materialized in one query. That means that the foreach(Course course in courses) loop produces courses with populated Students collections of which each student has one course in its Courses collection. It's almost impossible to keep track of what exactly happens, esp. because debugging also triggers lazy loading, but for sure, the line...
DbContextUSB.Students.AddRange(students);
also marks their nested courses and their students as Added as far as they ended up being different instances. The end result in this case is that one more student instance is added to the cache. Also, a number of junction records was created but not necessarily the correct ones.
The conclusion is that EF is a great tool for cloning object graphs, but the graph must be populated correctly, the right relationships and no duplicates, and should be added in one go.

Entity Framework Performance Issue

I am running into an interesting performance issue with Entity Framework. I am using Code First.
Here is the structure of my entities:
A Book can have many Reviews.
A Review is associated with a single Book.
A Review can have one or many Comments.
A Comment is associated with one Review.
public class Book
{
public int BookId { get; set; }
// ...
public ICollection<Review> Reviews { get; set; }
}
public class Review
{
public int ReviewId { get; set; }
public int BookId { get; set; }
public Book Book { get; set; }
public ICollection<Comment> Comments { get; set; }
}
public class Comment
{
public int CommentId { get; set; }
public int ReviewId { get; set; }
public Review Review { get; set; }
}
I populated my database with a lot of data and added the proper indexes. I am trying to retrieve a single book that has 10,000 reviews on it using this query:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.FirstOrDefault();
This particular book has 10,000 reviews. The performance of this query is around 4 seconds. Running the exact same query (via SQL Profiler) actually returns in no time at all. I used the same query and a SqlDataAdapter and custom objects to retrieve the data and it happens in under 500 milliseconds.
Using ANTS Performance Profiler it looks like a bulk of the time is being spent doing a few different things:
The Equals method is being called 50 million times.
Does anyone know why it would need to call this 50 million times and how I could increase the performance for this?
Why is Equals called 50M times?
It sounds quite suspicious. You have 10.000 reviews and 50.000.000 calls to Equals. Suppose this is caused by the identity map that EF implements internally. The identity map ensures that each entity with a unique key is tracked by the context only once, so if the context already has an instance with the same key as a record loaded from the database, it will not materialize a new instance and will use the existing one instead. Now how can this coincide with those numbers? My terrifying guess:
=============================================
1st record read | 0 comparisons
2nd record read | 1 comparison
3rd record read | 2 comparisons
...
10.000th record read | 9.999 comparisons
That means that each new record is compared with every existing record in the identity map. To compute the sum of all comparisons we can use the formula for an arithmetic series:
a(n) = a(n-1) + 1
Sum(n) = (n / 2) * (a(1) + a(n))
Sum(10.000) = 5.000 * (0 + 9.999) = 49.995.000, i.e. roughly 5.000 * 10.000 = 50.000.000
I hope I didn't make a mistake in my assumptions or calculation. Wait! I hope I did make a mistake, because this doesn't look good.
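The arithmetic can be double-checked by brute force. This small sketch just adds up k - 1 comparisons for the k-th record, following the guess above; ComparisonCount is a made-up helper name, and it says nothing about what EF really does internally.

```csharp
using System;

public static class ComparisonCount
{
    // Total comparisons if the k-th record is compared with the k - 1 records before it.
    public static long Total(int records)
    {
        long total = 0;
        for (int k = 1; k <= records; k++)
        {
            total += k - 1;
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(Total(10000)); // prints 49995000
    }
}
```

For 10.000 records this gives 49.995.000, which matches the roughly 50 million Equals calls reported by the profiler.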
Try turning off change tracking = hopefully turning off identity map checking.
It can be tricky. Start with:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.AsNoTracking()
.FirstOrDefault();
But there is a big chance that your navigation property will not be populated (because it is handled by change tracking). In such case use this approach:
var book = db.Books.Where(b => b.BookId == id).AsNoTracking().FirstOrDefault();
book.Reviews = db.Reviews.Where(r => r.BookId == id).AsNoTracking().ToList();
Anyway can you see what object type is passed to Equals? I think it should compare only primary keys and even 50M integer comparisons should not be such a problem.
As a side note, EF is slow; it is a well-known fact. It also uses reflection internally when materializing entities, so even 10.000 records can take "some time". Unless you have already done so, you can also turn off dynamic proxy creation (db.Configuration.ProxyCreationEnabled = false).
I know this sounds lame, but have you tried the other way around, e.g.:
var reviewsAndBooks = db.Reviews.Where(r => r.Book.BookId == id)
.Include(r => r.Book);
I have noticed sometimes better performance from EF when you approach your queries this way (but I haven't had the time to figure out why).
