I've just been noodling about with a profiler looking at performance bottlenecks in a WCF application after some users complained of slowness.
To my surprise, almost all the problems came down to Entity Framework operations. We use a repository pattern and most of the "Add/Modify" code looks very much like this:
public void Thing_Add(Thing thing)
{
    Log.Trace("Thing_Add called with ThingID " + thing.ThingID);
    if (db.Things.Any(m => m.ThingID == thing.ThingID))
    {
        db.Entry(thing).State = System.Data.EntityState.Modified;
    }
    else
    {
        db.Things.Add(thing);
    }
}
This is obviously a convenient way to wrap an add/update check into a single function.
Now, I'm aware that EF isn't the most efficient thing when it comes to doing inserts and updates. However, my understanding was (which a little research bears out) that it should be capable of processing a few hundred records faster than a user would likely notice.
But this is causing big bottlenecks on small upserts. For example, in one case it takes six seconds to process about fifty records. That's a particularly bad example but there seem to be instances all over this application where small EF upserts are taking upwards of a second or two. Certainly enough to annoy a user.
We're using Entity Framework 5 with a Database First model. The profiler says it's not the Log.Trace that's causing the issue. What could be causing this, and how can I investigate and fix the issue?
I found the root of the problem on another SO post: DbContext is very slow when adding and deleting
It turns out that when you're working with a large number of objects, especially in a loop, the gradual accumulation of tracked entities makes each automatic DetectChanges pass slower and slower, so EF grinds to a crawl.
Refreshing the DbContext isn't enough in this instance as we're still working with too many linked entities. So I put this inside the repository:
public void AutoDetectChangesEnabled(bool detectChanges)
{
    db.Configuration.AutoDetectChangesEnabled = detectChanges;
}
And can now use it to turn AutoDetectChangesEnabled on and off before doing looped inserts:
try
{
    rep.AutoDetectChangesEnabled(false);
    foreach (var thing in thingsInFile)
    {
        rep.Thing_Add(new Thing(thing));
    }
}
finally
{
    rep.AutoDetectChangesEnabled(true);
}
This makes a hell of a difference, although it needs to be used with care, since it stops EF from recognizing potential updates to changed objects.
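If you still need EF to pick up modifications that were made while automatic detection was off, one option is to run change detection explicitly, once, right before saving. A minimal sketch of a hypothetical repository helper (the method name is mine, not from the original code):

public void SaveChangesWithDetection()
{
    // One explicit DetectChanges pass replaces the per-call passes
    // that AutoDetectChangesEnabled = false suppressed.
    db.ChangeTracker.DetectChanges();
    db.SaveChanges();
}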
Related
I was playing around with Entity Framework 6 on my home computer and decided to try out inserting a fairly large amount of rows, around 430k.
My first try looked like this (yes, I know it can be better, but it was for research anyway):
var watch = System.Diagnostics.Stopwatch.StartNew();
foreach (var @event in group)
{
    db.Events.Add(@event);
    db.SaveChanges();
}
var dbCount = db.Events.Count(x => x.ImportInformation.FileName == group.Key);
if (dbCount != group.Count())
{
    throw new Exception("Mismatch between rows added for file and current number of rows!");
}
watch.Stop();
Console.WriteLine($"Added {dbCount} events to database in {watch.Elapsed}");
I started it in the evening and checked back when I got home from work. This was the result:
As you can see, 64523 events were added in the first 4 hours and 41 minutes, but then it got a lot slower: the next 66985 events took 14 hours and 51 minutes. I checked the database and the program was still inserting events, but at an extremely low speed. I then decided to try the "new" AddRange method for DbSet.
I switched my models from IDbSet to DbSet and replaced the foreach loop with this:
db.Events.AddRange(group);
db.SaveChanges();
I could now add 60k+ events in around 30 seconds. It is perhaps not SqlBulkCopy fast, but it is still a huge improvement. What is happening under the hood to achieve this? I was planning to check SQL Server Profiler for the queries tomorrow, but it would be nice to have an explanation of what happens in the code as well.
As Jakub answered, calling SaveChanges after every added entity does not help. But you would still have performance problems even if you moved it out of the loop, because that does not fix the performance issue caused by the Add method itself.
Add vs AddRange
Using the Add method to add multiple entities is a very common mistake. In fact, it's the DetectChanges method that's INSANELY slow.
The Add method calls DetectChanges after every record is added.
The AddRange method calls DetectChanges once, after all records are added.
See: Entity Framework - Performance Add
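To make the difference concrete, here is a minimal sketch (the `events` collection name is illustrative; the `Events` DbSet matches the question's model):

// Option 1 - Add in a loop: DetectChanges runs on every Add call,
// so each call gets slower as the number of tracked entities grows.
foreach (var ev in events)
{
    db.Events.Add(ev);
}

// Option 2 - AddRange: DetectChanges runs only once for the whole batch.
db.Events.AddRange(events);

// Either way, call SaveChanges once at the end.
db.SaveChanges();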
It is perhaps not SqlBulkCopy fast, but it is still a huge improvement
It's possible to get performance VERY close to SqlBulkCopy.
Disclaimer: I'm the owner of the project Entity Framework Extensions
(This library is NOT free)
This library can make your code more efficient by allowing you to save multiple entities at once. All bulk operations are supported:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
BulkSynchronize
Example:
// Easy to use
context.BulkSaveChanges();

// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);

// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);

// Customize Primary Key
context.BulkMerge(customers, operation => {
    operation.ColumnPrimaryKeyExpression = customer => customer.Code;
});
We noticed that some very small web service calls were taking much longer than we expected. We did some investigation, put some timers in place, and narrowed it down to creating an instance of our Entity Framework 6 DbContext. Not the query itself, just the creation of the context. I've since added some logging to see how long it actually takes, on average, to create an instance of our DbContext, and it seems to be around 50 ms.
After the application is warmed up, context creation is not slow. After an app recycle it starts out at 2-4 ms (which is what we see in our dev environments). Context creation then seems to slow down over time: over the next couple of hours it creeps up to the 50-80 ms range and levels off.
Our context is a fairly large code-first context with around 300 entities, including some pretty complex relationships between some of them. We are running EF 6.1.3. We do "one context per request", but most of our web API calls only run one or two queries. Creating a context that takes 60+ ms and then executing a 1 ms query is a bit dissatisfying. We have about 10k requests per minute, so we aren't a lightly used site.
Here is a snapshot of what we see. Times are in ms; the big dip is a deploy, which recycled the app domain. Each line is one of 4 different web servers. Notice that it's not always the same server, either.
I took a memory dump to try to figure out what's going on, and here are the heap stats:
00007ffadddd1d60 70821 2266272 System.Reflection.Emit.GenericFieldInfo
00007ffae02e88a8 29885 2390800 System.Linq.Enumerable+WhereSelectListIterator`2[[NewRelic.Agent.Core.WireModels.MetricDataWireModel, NewRelic.Agent.Core],[System.Single, mscorlib]]
00007ffadda7c1a0 1462 2654992 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[System.Object, mscorlib],[System.Object, mscorlib]][]
00007ffadd4eccf8 83298 2715168 System.RuntimeType[]
00007ffadd4e37c8 24667 2762704 System.Reflection.Emit.DynamicMethod
00007ffadd573180 30013 3121352 System.Web.Caching.CacheEntry
00007ffadd2dc5b8 35089 3348512 System.String[]
00007ffadd6734b8 35233 3382368 System.RuntimeMethodInfoStub
00007ffadddbf0a0 24667 3749384 System.Reflection.Emit.DynamicILGenerator
00007ffae04491d8 67611 4327104 System.Data.Entity.Core.Metadata.Edm.MetadataProperty
00007ffadd4edaf0 57264 4581120 System.Signature
00007ffadd4dfa18 204161 4899864 System.RuntimeMethodHandle
00007ffadd4ee2c0 41900 5028000 System.Reflection.RuntimeParameterInfo
00007ffae0c9e990 21560 5346880 System.Data.SqlClient._SqlMetaData
00007ffae0442398 79504 5724288 System.Data.Entity.Core.Metadata.Edm.TypeUsage
00007ffadd432898 88807 8685476 System.Int32[]
00007ffadd433868 9985 9560880 System.Collections.Hashtable+bucket[]
00007ffadd4e3160 92105 10315760 System.Reflection.RuntimeMethodInfo
00007ffadd266668 493622 11846928 System.Object
00007ffadd2dc770 33965 16336068 System.Char[]
00007ffadd26bff8 121618 17335832 System.Object[]
00007ffadd2df8c0 168529 68677312 System.Byte[]
00007ffadd2d4d08 581057 127721734 System.String
0000019cf59e37d0 166894 143731666 Free
Total 5529765 objects
Fragmented blocks larger than 0.5 MB:
Addr Size Followed by
0000019ef63f2140 2.9MB 0000019ef66cfb40 Free
0000019f36614dc8 2.8MB 0000019f368d6670 System.Data.Entity.Core.Query.InternalTrees.SimpleColumnMap[]
0000019f764817f8 0.8MB 0000019f76550768 Free
0000019fb63a9ca8 0.6MB 0000019fb644eb38 System.Data.Entity.Core.Common.Utils.Set`1[[System.Data.Entity.Core.Metadata.Edm.EntitySet, EntityFramework]]
000001a0f6449328 0.7MB 000001a0f64f9b48 System.String
000001a0f65e35e8 0.5MB 000001a0f666e2a0 System.Collections.Hashtable+bucket[]
000001a1764e8ae0 0.7MB 000001a17659d050 System.RuntimeMethodHandle
000001a3b6430fd8 0.8MB 000001a3b6501aa0 Free
000001a4f62c05c8 0.7MB 000001a4f636e8a8 Free
000001a6762e2300 0.6MB 000001a676372c38 System.String
000001a7761b5650 0.6MB 000001a776259598 System.String
000001a8763c4bc0 2.3MB 000001a8766083a8 System.String
000001a876686f48 1.4MB 000001a8767f9178 System.String
000001a9f62adc90 0.7MB 000001a9f63653c0 System.String
000001aa362b8220 0.6MB 000001aa36358798 Free
That seems like quite a bit of metadata and TypeUsage.
Things we've tried:
Creating a simple test harness to replicate the issue. It failed; my guess is because we weren't varying the traffic or the types of queries run. Just loading the context and executing a couple of queries over and over didn't reproduce the timing increase.
We've reduced the context significantly: it was 500 entities, now it's 300. That didn't make a difference in speed; my guess is because we weren't using those 200 entities at all.
(Edit) We use SimpleInjector to create our "context per request". To validate that it isn't SimpleInjector, I spun up an instance of the context by just new'ing it up. Same slow creation times.
(Edit) We have ngen'd EF. It didn't make any impact.
What can we investigate further? I understand that EF uses extensive caching to speed things up. Do more things in the cache slow down context creation? Is there a way to see exactly what's in that cache, to spot anything weird in there? Does anyone know what specifically we can do to speed up context creation?
Update - 5/30/17
I took the EF6 source and compiled our own version to stick in some timings. We run a pretty popular site, so collecting a huge amount of timing info is tricky and I didn't get as far as I wanted, but basically we found that all of the slowdown is coming from this method:
public void ForceOSpaceLoadingForKnownEntityTypes()
{
    if (!_oSpaceLoadingForced)
    {
        // Attempting to get o-space data for types that are not mapped is expensive so
        // only try to do it once.
        _oSpaceLoadingForced = true;
        Initialize();
        foreach (var set in _genericSets.Values.Union(_nonGenericSets.Values))
        {
            set.InternalSet.TryInitialize();
        }
    }
}
That foreach runs once for each entity defined by a DbSet in our context. Each iteration is relatively short (0.1-0.3 ms), but with the 254 entities we had, it adds up. We still haven't figured out why it's fast at the beginning and slows down over time.
Here is where I would start solving the problem, short of moving to a more enterprise-friendly solution.
Our context is a fairly large code-first context with around 300 entities
While EF has greatly improved over time, I would still start seriously looking at chopping things up once you get to 100 entities (actually, I would start well before that, but that seems to be the magic number many people have stated - consensus?). Think of it as designing for "contexts", but use the word "domain" instead. That way you can sell your execs on the idea that you are applying "domain-driven design" to fix the application. Maybe you are designing for future "microservices" - then you can use two buzzwords in a single paragraph. ;-)
I am not a huge fan of EF in the Enterprise space, so I tend to avoid it for high scale or high performance applications. Your mileage may vary. For SMB, it is probably perfectly fine. I do run into clients that use it, however.
I am not sure the following ideas are completely up to date, but they are some other things I would consider, based on experience.
Pre-generate your views. View generation is one of the most expensive parts of the first query, and pre-generation helps even more with large models.
Move your model to a separate assembly. Not so much a perf thing as a pet peeve of mine about code organization.
Examine your application and model for caching possibilities. Query plan caching can often shave off quite a bit of time.
Use CompiledQuery.
When feasible, use NoTracking. This is a huge saving if you do not need change tracking (see the sketch below).
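For the NoTracking point, a minimal sketch of what a read-only query might look like in EF6 (the context, entity, and property names here are illustrative, not from the question):

using System.Data.Entity; // AsNoTracking() extension method

// Read-only query: EF skips the change-tracking bookkeeping entirely.
var orders = context.Orders
    .AsNoTracking()
    .Where(o => o.CustomerId == customerId)
    .ToList();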
It looks like you are already running some type of profiler on the application, so I am going to assume you also looked at your SQL queries and possible performance gains. Yes, I know that is not the problem you are looking to solve, but it is something that can contribute to the entire issue from the user perspective.
In response to #WiktorZichia's comment about not answering the question about the performance problem: the best way to get rid of these types of problems in an enterprise system is to get rid of Entity Framework. There are trade-offs in every decision. EF is a great abstraction and speeds up development, but it comes with some unnecessary overhead that can hurt systems at scale. Now, technically, I still did not answer the "how do I solve this problem the way I am trying to solve it" question, so this might still be seen as a fail.
I'm going through old projects at work trying to make them faster, and I'm currently looking at some web APIs. One API is running particularly slowly; the problem is in the data service it calls. Specifically, it is in a lambda method that maps a stored procedure result to a domain model. A simplified version of the code:
public IEnumerable<DomainModelResult> GetData()
{
    return this.EntityFrameworkDB.GetDataSproc().ToList()
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
This is a simplified version, but after profiling it I found the major hang-up is in the lambda function. I am assuming this is because the EF context is still open and some goofy Entity Framework stuff is happening.
The problem is that I'm relatively new to Entity Framework (I'm an intern) and pretty ignorant of its inner workings. Could someone explain why this is so slow? I feel it should be very fast: DomainModelResult is a POCO and only property setters are used in ToDomainModelResult.
Edit:
I thought ToList() would do that, but I started to doubt myself because I couldn't think of another explanation. All the ToDomainModelResult() stuff is extremely simple. Something like:
public static DomainModelResult ToDomainModelResult(SprocResult source)
{
    return new DomainModelResult
    {
        FirstName = source.description,
        MiddleName = source._middlename,
        LastName = source.lastname,
        UserName = source.expr2,
        Address = source.uglyName
    };
}
It's just a bunch of simple setters; I think the model causing problems has 17 properties. The reason this is being done is that the project is an old database-first one and the stored procedures have ugly names that aren't descriptive at all. It also makes switching the stored procedures in the data services easy without breaking the rest of the project.
Edit 2: For some reason, using ToArray and breaking apart the LINQ statements makes the assignment from procedure result to domain model result extremely fast. Now the whole data service method is faster, which is odd; I don't know where the rest of the time went.
This might be a more esoteric question than I originally thought. My question hasn't been answered, but the problem is no longer there. Thanks to all who replied. I'm keeping this unanswered for now.
Edit 3: Please flag this question for removal; I can't remove it. I found the problem, but it is totally unrelated to my original question; I misunderstood the problem when I asked it. The increase in speed I'm chalking up to compiler optimization and to running the code in the profiler. The real issue wasn't in my lambda but in a dynamic lambda called by Entity Framework when the context is closed or an object is accessed: it was doing data validation, and GetString, GetInt32, and IsDBNull were eating up the most time. I'm assuming Microsoft has already optimized these methods, so the only way to speed this up is probably to make some variables not nullable in the procedure. This question is misleading and so esoteric that I don't think it belongs here; it will just confuse people. Sorry.
You should split the code and check which part is taking the time.
public IEnumerable<DomainModelResult> GetData()
{
    var lst = this.EntityFrameworkDB.GetDataSproc().ToList();

    return lst
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
I am pretty sure the GetDataSproc procedure is taking most of your time. You need to optimize the stored procedure code.
Update
If possible, it is better to do more work on the SQL side instead of retrieving 60,000 rows into memory. A few possible solutions:
If you need to display this information, do paging (top and skip).
If you are doing any filtering, calculating, or grouping after you retrieve the rows in memory, do it in your stored proc instead.
On the .NET side, since you are returning IEnumerable, you may be able to use yield for the second part, depending on your architecture (see the sketch below).
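For that last point, a rough sketch of what a streaming version might look like (this assumes the context stays alive while the caller enumerates, and that the GetDataSproc result can be enumerated lazily):

public IEnumerable<DomainModelResult> GetData()
{
    // Project each row as the caller enumerates it instead of
    // materializing the full result set into a list first.
    foreach (var row in this.EntityFrameworkDB.GetDataSproc())
    {
        yield return row.ToDomainModelResult();
    }
}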
I'm very familiar with using a transaction RDBMS, but how would I make sure that changes made to my in-memory data are rolled back if the transaction fails? What if I'm not even using a database?
Here's a contrived example:
public void TransactionalMethod()
{
    var items = GetListOfItems();
    foreach (var item in items)
    {
        MethodThatMayThrowException(item);
        item.Processed = true;
    }
}
In my example, I might want the changes made to the items in the list to somehow be rolled back, but how can I accomplish this?
I am aware of "software transactional memory" but don't know much about it and it seems fairly experimental. I'm aware of the concept of "compensatable transactions", too, but that incurs the overhead of writing do/undo code.
Subversion seems to deal with errors updating a working copy by making you run the "cleanup" command.
Any ideas?
UPDATE:
Reed Copsey offers an excellent answer, including:
Work on a copy of data, update original on commit.
This takes my question one level further - what if an error occurs during the commit? We so often think of the commit as an immediate operation, but in reality it may be making many changes to a lot of data. What happens if there are unavoidable things like OutOfMemoryExceptions while the commit is being applied?
On the flip side, if one goes for a rollback option, what happens if there's an exception during the rollback? I understand that things like Oracle RDBMS have the concept of rollback segments and UNDO logs, but assuming there's no serialisation to disk (where, if it isn't serialised to disk, it didn't happen, and a crash means you can investigate those logs and recover), is this really possible?
UPDATE 2:
An answer from Alex made a good suggestion: namely, that one updates a different object, and the commit phase then simply switches the reference from the current object over to the new object. He went further, suggesting that the object you change is effectively a list of the modified objects.
I understand what he's saying (I think), and I want to make the question more complex as a result:
How, given this scenario, do you deal with locking? Imagine you have a list of customers:
var customers = new Dictionary<CustomerKey, Customer>();
Now, you want to make a change to some of those customers. How do you apply those changes without locking and replacing the entire list? For example:
var customerTx = new Dictionary<CustomerKey, Customer>();
foreach (var customer in customers.Values)
{
    var updatedCust = customer.Clone();
    customerTx.Add(GetKey(updatedCust), updatedCust);
    if (CalculateRevenueMightThrowException(customer) >= 10000)
    {
        updatedCust.Preferred = true;
    }
}
How do I commit? This (Alex's suggestion) will mean locking all customers while replacing the list reference:
lock (customers)
{
    customers = customerTx;
}
Whereas if I loop through, modifying the references in the original list, it's not atomic, and it falls foul of the "what if it crashes partway through" problem:
foreach (var kvp in customerTx)
{
    customers[kvp.Key] = kvp.Value;
}
Pretty much every option for doing this requires one of three basic methods:
Make a copy of your data before modifications, to revert to a rollback state if aborted.
Work on a copy of data, update original on commit.
Keep a log of changes to your data, to undo them in the case of an abort.
For example, Software Transactional Memory, which you mentioned, follows the third approach. The nice thing about that is that it can work on the data optimistically, and just throw away the log on a successful commit.
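As a rough illustration of the third approach, here is a minimal undo-log sketch built around the question's example (the compensating action here only reverts the Processed flag; anything MethodThatMayThrowException does would need its own compensation):

public void TransactionalMethod()
{
    var items = GetListOfItems();
    var undoLog = new Stack<Action>(); // compensations, most recent first

    try
    {
        foreach (var item in items)
        {
            MethodThatMayThrowException(item);
            var captured = item; // copy for the closure (foreach capture changed in C# 5)
            undoLog.Push(() => captured.Processed = false);
            item.Processed = true;
        }
        undoLog.Clear(); // commit: throw the log away
    }
    catch
    {
        while (undoLog.Count > 0)
        {
            undoLog.Pop()(); // abort: replay compensations in reverse order
        }
        throw;
    }
}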
Take a look at the Microsoft Research project, SXM.
From Maurice Herlihy's page, you can download documentation as well as code samples.
You asked: "What if an error occurs during the commit?"
It doesn't matter. You can commit to somewhere/something else in memory and check whether the operation succeeds. If it did, you change the reference of the intended object (object A) to point to where you committed (object B). That gives you failsafe commits: the reference is only updated on a successful commit, and a reference change is atomic.
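A minimal sketch of that pattern, using the question's customer dictionary (ApplyChanges is a placeholder for whatever mutation logic you run, and customers is assumed to be a field shared between threads so the reference can be swapped):

// Build the new state off to the side; 'customers' is untouched if this throws.
var updated = new Dictionary<CustomerKey, Customer>(customers);
ApplyChanges(updated);

// Publish the new state with a single atomic reference swap (System.Threading).
Interlocked.Exchange(ref customers, updated);

Readers that still hold the old reference continue to see a consistent, if slightly stale, snapshot.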
public void TransactionalMethod()
{
    var items = GetListOfItems();
    try
    {
        foreach (var item in items)
        {
            MethodThatMayThrowException(item);
            item.Processed = true;
        }
    }
    catch (Exception ex)
    {
        foreach (var item in items)
        {
            if (item.Processed)
            {
                UndoProcessingForThisItem(item);
            }
        }
    }
}
Obviously, the implementation of the "Undo..." is left as an exercise for the reader.
Good morning!
I'm playing around with EF a little bit at the moment, and I need your help.
Here's the scenario: I have a table with a lot of data in it. If I query this table through EF, all the records get loaded into memory.
eg.
var counter = default(int);
using (var myEntities = new MyEntities())
{
    foreach (var record in myEntities.MySpecialTable)
    {
        counter++;
    }
}
So I run through all of the records of MySpecialTable and increment counter.
When I look at Task Manager, or anything else that shows me the memory consumption of my app, it says 400 MB (because of the data).
If I run through the table another time, memory consumption doubles.
I already tried calling the GC, but it doesn't help.
Why do all of the records from each run get stored somewhere in memory (and never get released)?
Where do they get stored?
Is there any way to make EF queries behave like a DataReader?
Or is there any other ORM which is as elegant as EF?
edit:
No, I'm not doing a count like that ... this is just to show the iteration :)
First of all, I hope you're not actually doing a count like that; the Count method is far more efficient. But presuming this is just demo code to show the memory issue:
Change the MergeOption property of the ObjectQuery to NoTracking. This tells the Entity Framework that you have no intention of modifying the entities, and hence it needn't bother keeping track of their original state.
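A minimal sketch of what that might look like, assuming MySpecialTable is an ObjectSet<T> on an ObjectContext-based model (with a DbContext you would use the AsNoTracking() extension instead):

var counter = default(int);
using (var myEntities = new MyEntities())
{
    // No change tracking: entities are not cached by the state manager,
    // so iterating the table repeatedly no longer accumulates memory.
    // MergeOption lives in System.Data.Objects (EF4) / System.Data.Entity.Core.Objects (EF6).
    myEntities.MySpecialTable.MergeOption = MergeOption.NoTracking;

    foreach (var record in myEntities.MySpecialTable)
    {
        counter++;
    }
}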