How to Optimize Performance for a Full Table Update - C#

I am writing a fairly large service centered around Stanford's Folding@home project. This portion of the project is a WCF service hosted inside of a Windows Service. With proper database indices and a dual-core Core2Duo/7200rpm platter I am able to run approximately 1500 rows per second (SQL 2012 Datacenter instance). Each hour when I run this update, it takes a considerable amount of time to iterate through all 1.5 million users and add updates where necessary.
Looking at the performance profiler in SQL Server Management Studio 2012, I see that every user is being loaded via individual queries. Is there a way with EF to eagerly load a set of a given size of users, update them in memory, then save the updated users - using queries more elegant than single-select, single-update? I am currently using EF5, but if I need to move to 6 for improved performance, I will. The main source of delay on this process is waiting for database results.
Also, if there is anything I should change about the ForAll or pre-processing, feel free to mention it. The group pre-processing is very quick and dramatically increases the speed of the update by controlling each EF context's size - but if I can pre-process more and improve the overall time, I am more than willing to look into it!
private void DoUpdate(IEnumerable<Update> table)
{
    var t = table.ToList();
    var numberOfRowsInGroups = t.Count() / (Properties.Settings.Default.UpdatesPerContext); //Control each local context size. 120 works well on most systems I have.

    //Split work groups out of the table of updates.
    var groups = t.AsParallel()
                  .Select((update, index) => new { Value = update, Index = index })
                  .GroupBy(a => a.Index % numberOfRowsInGroups)
                  .ToList();

    groups.AsParallel().ForAll(group =>
    {
        var ents = new FoldingDataEntities();
        ents.Configuration.AutoDetectChangesEnabled = false;
        ents.Configuration.LazyLoadingEnabled = true;
        ents.Database.Connection.Open();

        var count = 0;

        foreach (var a in group)
        {
            var update = a.Value;
            var data = UserData.GetUserData(update.Name, update.Team, ents); //(Name, Team) is a superkey; passing ents allows external context control
            if (data.TotalPoints < update.NewCredit)
            {
                data.addUpdate(update.NewCredit, update.Sum); //basic arithmetic, very quick - may attach a row to the UserData.Updates collection. (does not SaveChanges here)
            }
        }

        ents.ChangeTracker.DetectChanges();
        ents.SaveChanges();
    });
}
//from the UserData class which wraps the EF code.
public static UserData GetUserData(string name, long team, FoldingDataEntities ents)
{
    return ents.Users.Local.FirstOrDefault(u => (u.Team == team && u.Name == name))
           ?? ents.Users.FirstOrDefault(u => (u.Team == team && u.Name == name))
           ?? ents.Users.Add(new User { Name = name, Team = team, StartDate = DateTime.Now, LastUpdate = DateTime.Now });
}
internal struct Update
{
    public string Name;
    public long NewCredit;
    public long Sum;
    public long Team;
}
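Something along these lines is roughly what I am hoping EF can do per group - an untested sketch reusing the entities and variables from the code above, pulling each group's users in a single round trip and resolving the rest in memory:
//Untested sketch of the batched read I am after: one SELECT ... WHERE Name IN (...) per group.
var names = group.Select(a => a.Value.Name).Distinct().ToList();
var users = ents.Users
                .Where(u => names.Contains(u.Name))
                .ToList(); //single round trip; rows are now tracked locally

foreach (var a in group)
{
    var update = a.Value;
    var data = users.FirstOrDefault(u => u.Name == update.Name && u.Team == update.Team)
               ?? ents.Users.Add(new User { Name = update.Name, Team = update.Team, StartDate = DateTime.Now, LastUpdate = DateTime.Now });
    if (data.TotalPoints < update.NewCredit)
    {
        data.addUpdate(update.NewCredit, update.Sum);
    }
}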

EF is not the tool for raw performance. It's the "easy way" to build a data access layer (DAL), but it comes with a fair bit of overhead. I'd highly recommend using Dapper or raw ADO.NET to do a bulk update; it would be a lot faster.
http://www.ormbattle.net/
Now, to answer your question: to do a batch update in EF, you'll need to pull in extensions or third-party packages that add that capability. See: Batch update/delete EF5
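For illustration, here is a rough sketch of the staging-table approach with raw ADO.NET and SqlBulkCopy. The table and column names, the updates collection (the Update structs from the question) and connectionString are assumptions, and inserting missing users / Update rows is left out:
//Rough sketch only: bulk-copy the hourly updates into a temp table, then apply them
//with one set-based UPDATE. Requires System.Data and System.Data.SqlClient.
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    var staging = new DataTable();
    staging.Columns.Add("Name", typeof(string));
    staging.Columns.Add("Team", typeof(long));
    staging.Columns.Add("NewCredit", typeof(long));
    foreach (var u in updates)
        staging.Rows.Add(u.Name, u.Team, u.NewCredit);

    using (var create = new SqlCommand("CREATE TABLE #Staging (Name nvarchar(450), Team bigint, NewCredit bigint)", conn))
        create.ExecuteNonQuery();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Staging" })
        bulk.WriteToServer(staging);

    using (var apply = new SqlCommand(@"UPDATE u SET u.TotalPoints = s.NewCredit, u.LastUpdate = GETDATE()
                                        FROM Users u JOIN #Staging s ON s.Name = u.Name AND s.Team = u.Team
                                        WHERE u.TotalPoints < s.NewCredit;", conn))
        apply.ExecuteNonQuery();
}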

Related

Migrating legacy data with EF Core is very slow

So, quick background: I am working on setting up code to migrate a legacy database, which is a mess in numerous ways, to a new database designed code-first in Entity Framework Core. I have made new models for the new database and auto-generated models for the old database in EF Core to simplify things.
One of the tables I'm creating is based on 2 tables in the legacy database that have essentially the same data (but different models), which combined represent a little over 300,000 rows. Therefore, I am creating 300,000 of the new database models along with their related tables, totaling about 1.2 million rows in the new database.
My problem is that when I run a test of about 1000 rows and extrapolate the time to complete the whole migration, it comes out to about 3.5 hours. This feels very slow, even for such a large number of rows.
Note: both databases are local to my computer, so network delays are not a factor.
Here is an example of the logic of my code:
//old rows are selected earlier in the function from the legacyDbContext
//provider = IServiceProvider. I use dependency injection for this
foreach (var oldRow in oldRows)
{
    using (var scope = provider.CreateScope())
    {
        var legacyDbContext = scope.ServiceProvider.GetRequiredService<LegacyDbContext>();
        var newDatabaseDbContext = scope.ServiceProvider.GetRequiredService<NewDatabaseDbContext>();

        var newRow = new NewRow();
        newDatabaseDbContext.NewRows.Add(newRow);
        newDatabaseDbContext.SaveChanges(); //generates the id for the newRow

        //transfer data from oldRow to newRow with some very light processing
        //Example
        newRow.Name = oldRow.Name;
        newRow.IsActive = Convert.ToBoolean(oldRow.IsActive);
        //for some reason boolean values were saved as C# short ints
        //which corresponds to tinyint on the database side

        var oldRelatedItems = legacyDbContext.oldRelatedItems
            .Where(m => m.oldItemId == oldRow.Id)
            .ToList();
        //in general this list's count is only 2, sometimes 3

        foreach (var oldRelatedItem in oldRelatedItems)
        {
            var newRelatedItem = new NewRelatedItem();
            newDatabaseDbContext.NewRelatedItems.Add(newRelatedItem);
            newRelatedItem.newItem = newItem;
            //transfer data from oldRelatedItem to newRelatedItem
            newDatabaseDbContext.SaveChanges();
        }

        //Some more data transferred
        newDatabaseDbContext.SaveChanges();
    }
}
One note here is that the legacy database does not have any foreign keys. There are columns which contain the id of rows in other tables but they are not configured as foreign keys (I told you this database was messy). So that may present itself as a bottleneck.
I have tried a couple of things that have been mostly ineffective. I tried running the foreach loops entirely inside the using statement, only saving the context once at the end. This saved some time, but not much (5 minutes out of 215 or so).
I also tried running the loops as Parallel.ForEach loops, and actually saw the extrapolated time increase with this approach (though it could be that I used it incorrectly).
Any thoughts on how I could improve my code's performance? The final migration will only have to be run once, as this whole project is being rebuilt from the ground up (it really is THAT bad). Even so, I would like to know how to improve this, for my own understanding and so that the migration won't take days (remember, this question covers just a couple of tables).
Maybe try something like this.
Main themes are:
take the context creation out of the loop
try using bulk insert for everything at the end
using (var scope = provider.CreateScope())
{
    //can this live outside the loop?
    var legacyDbContext = scope.ServiceProvider.GetRequiredService<LegacyDbContext>();

    var newRows = new List<NewRow>();
    var newRelatedItems = new List<NewRelatedItem>();

    foreach (var oldRow in oldRows)
    {
        var newRow = new NewRow();
        newRow.Name = oldRow.Name;
        newRow.IsActive = Convert.ToBoolean(oldRow.IsActive);
        newRows.Add(newRow);

        var oldRelatedItems = legacyDbContext.oldRelatedItems
            .Where(m => m.oldItemId == oldRow.Id)
            .ToList();

        foreach (var oldRelatedItem in oldRelatedItems)
        {
            var newRelatedItem = new NewRelatedItem();
            //is this supposed to be newRow?
            newRelatedItem.newRow = newRow;
            //newRelatedItem.newItem = newItem;
            newRelatedItems.Add(newRelatedItem);
        }
    }

    var newDatabaseDbContext = scope.ServiceProvider.GetRequiredService<NewDatabaseDbContext>();
    newDatabaseDbContext.BulkInsert(newRows);
    newDatabaseDbContext.BulkInsert(newRelatedItems);
}
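Note that BulkInsert here assumes a third-party extension such as EFCore.BulkExtensions. If adding a package is not an option, plain EF Core still gets most of the benefit from a single SaveChanges over the accumulated lists - a minimal sketch:
//Minimal sketch without third-party packages: add everything, save once.
//EF Core batches these inserts into far fewer round trips than one SaveChanges per row.
newDatabaseDbContext.ChangeTracker.AutoDetectChangesEnabled = false;
newDatabaseDbContext.NewRows.AddRange(newRows);
newDatabaseDbContext.NewRelatedItems.AddRange(newRelatedItems);
newDatabaseDbContext.SaveChanges();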

C# EF 5.0 Adding Million Records to MySQL DB takes hours

Below is the code I am using to add records to the database. I know that calling SaveChanges() every time is expensive, but if I call SaveChanges() once after all of them, I might get a duplicate key exception. So I am looking for ideas to improve performance while keeping duplicate records in mind.
using (var db = new dbEntities())
{
    for (int i = 0; i < csvCustomers.Count; i++)
    {
        var csvCustomer = csvCustomers[i];
        dbcustomer customer = new dbcustomer() { ADDRESS = csvCustomer.ADDRESS, FIRSTNAME = csvCustomer.FIRSTNAME, LASTNAME = csvCustomer.LASTNAME, PHONE = csvCustomer.PHONE, ZIPCODE = csvCustomer.ZIP };
        try
        {
            dbzipcode z = db.dbzipcodes.FirstOrDefault(x => x.ZIP == customer.ZIPCODE);
            //TODO: Handle if Zip Code not Found in DB
            if (z == null)
            {
                db.dbcustomers.Add(customer);
                throw new DbEntityValidationException("Zip code not found in database.");
            }
            customer.dbzipcode = z;
            z.dbcustomers.Add(customer);
            db.SaveChanges();
        }
        catch (DbEntityValidationException)
        {
            //zip code not found; log and continue with the next customer
        }
    }
}
One solution I have in mind is to add data in batches, call db.SaveChanges() per batch, and in case of an exception recursively reduce the batch size for those records.
Using EF to insert huge numbers of records is going to incur a significant cost compared to more direct approaches, but there are a few things you can do to markedly improve performance.
Firstly, batching the requests with one SaveChanges per batch will be preferable to saving individual records or attempting to commit all of the changes at once. You will need to deal with exceptions if/when a batch fails (possibly committing that batch one record at a time to fully isolate duplicate rows).
Next, you can pre-cache your zip codes rather than looking them up on each iteration. Don't load the entire entity; just cache the zip code and the ID in an in-memory list:
(If the zip code entity amounts to little more than this, then just load the entity)
var zipCodes = db.dbzipcodes.Select(x => new {x.ZIPCODEID, x.ZIP}).ToList();
This will require a bit of extra attention when it comes to associating a zipcode to the customer within the batched calls since the zip code will initially not be known by the DbContext but may be known when the second customer for the same zip code is added.
To associate a zip code without loading it in a DbContext:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
If you did load the entire zip code entity into the cached list, then the var zipCode = new dbzipcode ... is not needed, just attach the cached entity.
However, if that zip code has already been associated to the DbContext earlier in the batch, you will get an error (regardless of whether you cached the entity or just the ID/code), so you need to check the DbContext's in-memory zip codes first:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = db.dbzipcodes.Local.SingleOrDefault(x => x.ZIPCODEID == customerZipCode.ZIPCODEID)
?? new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
Lastly, EF tracks a lot of extra info in memory in the context, so the other consideration, along with batching, would be to avoid using the same DbContext across all batches and instead open a new DbContext for each batch. As you add items and call SaveChanges on a DbContext, it is still tracking each entity that gets added. If you did batches of 1000 or so, the context would only ever be tracking that 1000, rather than 1000, then 2000, then 3000, etc., up to 5 million rows.
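Putting those pieces together, a sketch of the batched version might look like the following. Entity and property names are taken from the question and answer above; the batch size and error handling are placeholders:
//Sketch only: cached zip codes, a fresh DbContext per batch, one SaveChanges per batch.
const int batchSize = 1000;

List<dbzipcode> zipStubs;
using (var db = new dbEntities())
{
    //project to an anonymous type first; EF cannot project directly onto a mapped entity
    zipStubs = db.dbzipcodes
                 .Select(x => new { x.ZIPCODEID, x.ZIP })
                 .ToList()
                 .Select(x => new dbzipcode { ZIPCODEID = x.ZIPCODEID, ZIP = x.ZIP })
                 .ToList();
}

for (var start = 0; start < csvCustomers.Count; start += batchSize)
{
    using (var db = new dbEntities())
    {
        db.Configuration.AutoDetectChangesEnabled = false;

        foreach (var csvCustomer in csvCustomers.Skip(start).Take(batchSize))
        {
            var cached = zipStubs.FirstOrDefault(z => z.ZIP == csvCustomer.ZIP);
            if (cached == null)
                continue; //TODO: collect customers with unknown zip codes

            //reuse the zip code if this batch has already seen it, otherwise attach a stub
            var zipCode = db.dbzipcodes.Local.FirstOrDefault(z => z.ZIPCODEID == cached.ZIPCODEID)
                          ?? db.dbzipcodes.Attach(new dbzipcode { ZIPCODEID = cached.ZIPCODEID });

            db.dbcustomers.Add(new dbcustomer
            {
                ADDRESS = csvCustomer.ADDRESS,
                FIRSTNAME = csvCustomer.FIRSTNAME,
                LASTNAME = csvCustomer.LASTNAME,
                PHONE = csvCustomer.PHONE,
                ZIPCODE = csvCustomer.ZIP,
                dbzipcode = zipCode
            });
        }

        db.SaveChanges(); //on a duplicate key exception, retry this batch in smaller pieces
    }
}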

EntityFramework is painfully slow at executing an update query

We're investigating a performance issue where EF 6.1.3 is being painfully slow, and we cannot figure out what might be causing it.
The database context is initialized with:
Configuration.ProxyCreationEnabled = false;
Configuration.AutoDetectChangesEnabled = false;
Configuration.ValidateOnSaveEnabled = false;
We have isolated the performance issue to the following method:
protected virtual async Task<long> UpdateEntityInStoreAsync(T entity, string[] changedProperties)
{
    using (var session = sessionFactory.CreateReadWriteSession(false, false))
    {
        var writer = session.Writer<T>();
        writer.Attach(entity);
        await writer.UpdatePropertyAsync(entity, changedProperties.ToArray()).ConfigureAwait(false);
    }
    return entity.Id;
}
There are two names in the changedProperties list, and EF correctly generated an update statement that updates just these two properties.
This method is called repeatedly (to process a collection of data items) and takes about 15-20 seconds to complete.
If we replace the method above with the following, execution time drops to 3-4 seconds:
protected virtual async Task<long> UpdateEntityInStoreAsync(T entity, string[] changedProperties)
{
    var sql = $"update {entity.TypeName()}s set";
    var separator = false;
    foreach (var property in changedProperties)
    {
        sql += (separator ? ", " : " ") + property + " = @" + property;
        separator = true;
    }
    sql += " where id = @Id";

    var parameters = (from parameter in changedProperties.Concat(new[] { "Id" })
                      let property = entity.GetProperty(parameter)
                      select ContextManager.CreateSqlParameter(parameter, property.GetValue(entity))).ToArray();

    using (var session = sessionFactory.CreateReadWriteSession(false, false))
    {
        await session.UnderlyingDatabase.ExecuteSqlCommandAsync(sql, parameters).ConfigureAwait(false);
    }
    return entity.Id;
}
The UpdatePropertyAsync method called on the writer (a repository implementation) looks like this:
public virtual async Task UpdatePropertyAsync(T entity, string[] changedPropertyNames, bool save = true)
{
    if (changedPropertyNames == null || changedPropertyNames.Length == 0)
    {
        return;
    }

    Array.ForEach(changedPropertyNames, name => context.Entry(entity).Property(name).IsModified = true);

    if (save)
        await context.SaveChangesAsync().ConfigureAwait(false);
}
What is EF doing that completely kills performance? And is there anything we can do to work around it (short of using another ORM)?
By timing the code I was able to see that the additional time spent by EF was in the call to Attach the object to the context, and not in the actual query to update the database.
By eliminating all object references (setting them to null before attaching the object and restoring them after the update is complete) the EF code runs in "comparable times" (5 seconds, but with lots of logging code) to the hand-written solution.
So it looks like EF has a "bug" (some might call it a feature) causing it to inspect the attached object recursively even though change tracking and validation have been disabled.
Update: EF 7 appears to have addressed this issue by allowing you to pass in a GraphBehavior enum when calling Attach.
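For reference, the workaround described above looks roughly like this (the navigation property names here are made up for illustration):
//Illustration only: null out navigation properties so Attach does not walk the object graph,
//then restore them once the update has been sent.
var parent = entity.Parent;      //hypothetical navigation properties
var children = entity.Children;
entity.Parent = null;
entity.Children = null;

writer.Attach(entity);
await writer.UpdatePropertyAsync(entity, changedProperties).ConfigureAwait(false);

entity.Parent = parent;          //restore references after the update completes
entity.Children = children;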
The problem with Entity Framework is that when you call SaveChanges(), insert statements are sent to the database one by one; that's how EF works.
And there are actually 2 database hits per insert: the first is the INSERT statement for the record, and the second is a SELECT statement to get the id of the inserted record.
So you have numOfRecords * 2 database trips * the time for one database trip.
Add context.Database.Log = message => Debug.WriteLine(message); to your code to log the generated SQL to the console, and you will see what I am talking about.
You can use BulkInsert, here is the link: https://efbulkinsert.codeplex.com/
Seeing as you have already tried setting:
Configuration.AutoDetectChangesEnabled = false;
Configuration.ValidateOnSaveEnabled = false;
And you are not using ordered lists, I think you are going to have to refactor your code and do some benchmarking.
I believe the bottleneck is coming from the foreach, as the context is having to deal with a potentially large amount of bulk data (not sure how much that is in your case).
Try cutting the items in your array down into smaller batches before calling the SaveChanges(); or SaveChangesAsync(); methods, and note the performance deviations as opposed to letting the context grow too large.
Also, if you are still not seeing further gains, try disposing of the context after SaveChanges(); and creating a new one; depending on the size of your entity list, flushing out the context may yield even further improvements.
But this all depends on how many entities we are talking about and may only be noticeable in the hundreds and thousands of record scenarios.

Performance of Related tables in calculated properties

Looking to see if there is a better way to do this.
I am using DB first and have a table called Items. Below is a calculated property that I specify on a partial class to extend it; it uses related tables to derive the result. This technically works fine. I like the ease of using it, the fact that all this business logic is defined once in the domain, and that you can use complex code to derive the results.
The only issue I am concerned with is performance when you pull back multiple records. Using SQL Profiler, I can see that if you pull back 50 rows of Item, it will execute an additional query to retrieve the WorkOrderDetails, in this case, 50 times! Not sure why it is not doing a join instead of doing 50 additional reads. And I have more than one calculated property like this going out to multiple tables, and each one is doing an explicit read per row = slow!
The result of pulling back 50 rows from the Item table is 2,735 reads from the database, as indicated by SQL Profiler! I am not that familiar with SQL Profiler, so maybe I am misinterpreting something, but I know it is doing a lot of DB reads.
Why doesn't it do a join instead of doing an explicit read to the related tables for each row in Items?
What is "Best Practice" to accomplish this? Is there a better way?
[Display(Name = "Qty Allocated")]
public decimal QtyAllocated
{
get
{
if (this.TrackInventory)
{
var inProcessNonRemnantWorkOrderDetails = this.WorkOrderDetails.Where(wod =>
new[]
{
(int)WorkOrderStatus.Created,
(int)WorkOrderStatus.Released,
(int)WorkOrderStatus.InProcess
}.Contains(wod.WorkOrderHeader.StatusId)
&& wod.EstimatedQuantity >= 1 //Don't count remnants as allocated
);
var inProcessRemnantWorkOrderDetails = this.WorkOrderDetails.Where(wod =>
new[]
{
(int)WorkOrderStatus.Created,
(int)WorkOrderStatus.Released,
(int)WorkOrderStatus.InProcess
}.Contains(wod.WorkOrderHeader.StatusId)
&& wod.EstimatedQuantity > 0 && wod.EstimatedQuantity < 1 //gets just remnants
);
decimal qtyAllocated =
this.WorkOrderDetails == null
? 0
: inProcessNonRemnantWorkOrderDetails.Sum(a => (a.EstimatedQuantity - a.ActualQuantity));
if (qtyAllocated == 0 && inProcessRemnantWorkOrderDetails.Any())
{
qtyAllocated = 0.1M;
}
return qtyAllocated;
}
else
{
return 0;
}
}
}
Aron was correct. When I eager load the related entities by using the Include() method in my query, there is only 1 hit to the database.
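For anyone landing here, the eager-loading fix looks something like this (EF6 lambda Include; the Items DbSet name is assumed):
//Requires using System.Data.Entity; for the lambda Include overload.
var items = context.Items
                   .Include(i => i.WorkOrderDetails.Select(wod => wod.WorkOrderHeader))
                   .Take(50)
                   .ToList(); //one query with joins; QtyAllocated no longer lazy-loads per row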

C# EF / LINQ hack fix hitting performance? Other way of fixing?

I've been learning C# / LINQ / ASP.NET / MVC 3 / EF for a few months now, coming from a Java / Icefaces / Ibatis background (the real world uses .NET D;). I really enjoy LINQ / Entity Framework, but I'm having a few issues understanding what's really happening behind the scenes.
Here's my problem:
I'm using an AJAX / JSON-fed jQuery DataTable (which I highly recommend to anyone in need of a free web datatable system, by the way). I have a method in my MVC3 application that returns a JSON result of the data needed by the table, doing the sorting and all. Everything is working nicely and smoothly. However, I have a concern about the "dirty" hack I had to do to make this work.
Here's the complete code:
//inEntities is the Entity Framework Database Context
//It includes the following entities:
// Poincon
// Horaire
// HoraireDetail
//Poincon, Horaire and HoraireDetail are "decorated" using the metadata technique, which
//adds properties, methods and such to the entity (like getEmploye, which you will see in
//the following snippet).
//
//The Employe entity is not database data and is therefore not handled by EF.
//Instead, it is a simple object with properties that lazy-load information, such as the
//employee name, from Active Directory based on an employee ID. An Employe object
//can be constructed with an employee ID, which exposes the possibility of getting
//the employee name from AD if needed.
[HttpPost]
public JsonResult List(FormCollection form)
{
    String sEcho;
    int iDisplayStart;
    int iDisplayLength;
    String sSearch;
    int iSortingCols;
    Dictionary<String, String> sorting;

    try
    {
        sEcho = form["sEcho"];
        iDisplayStart = int.Parse(form["iDisplayStart"]);
        iDisplayLength = int.Parse(form["iDisplayLength"]);
        sSearch = form["sSearch"];
        iSortingCols = int.Parse(form["iSortingCols"]);
        sorting = new Dictionary<string, string>();
        for (int i = 0; i < iSortingCols; i++)
            sorting.Add(form["mDataProp_" + form["iSortCol_" + i]].ToUpper(), form["sSortDir_" + i].ToUpper());
    }
    catch
    {
        HttpContext.Response.StatusCode = 500;
        return null;
    }

    var qPoincon = inEntities.Poincons.AsEnumerable();
    var lPoincon = qPoincon.Select(o => new
    {
        o.id,
        emp = o.getEmploye(),
        o.poinconStart,
        o.poinconEnd,
        o.commentaire,
        o.codeExceptions
    }).AsEnumerable();

    //Search
    lPoincon = lPoincon.Where(p => (p.emp.empNoStr.Contains(sSearch) || p.emp.empNom.Contains(sSearch) || (p.commentaire != null && p.commentaire.Contains(sSearch))));

    //Keep count
    int iTotalDisplayRecords = lPoincon.Count();

    //Sorting
    foreach (KeyValuePair<String, String> col in sorting)
    {
        switch (col.Key)
        {
            case "EMPNO":
                if (col.Value == "ASC")
                    lPoincon = lPoincon.OrderBy(h => h.emp.empNo);
                else
                    lPoincon = lPoincon.OrderByDescending(h => h.emp.empNo);
                break;
            case "POINCONSTART":
                if (col.Value == "ASC")
                    lPoincon = lPoincon.OrderBy(h => h.poinconStart);
                else
                    lPoincon = lPoincon.OrderByDescending(h => h.poinconStart);
                break;
            case "POINCONEND":
                if (col.Value == "ASC")
                    lPoincon = lPoincon.OrderBy(h => h.poinconEnd);
                else
                    lPoincon = lPoincon.OrderByDescending(h => h.poinconEnd);
                break;
            case "COMMENTAIRE":
                if (col.Value == "ASC")
                    lPoincon = lPoincon.OrderBy(h => h.commentaire);
                else
                    lPoincon = lPoincon.OrderByDescending(h => h.commentaire);
                break;
        }
    }

    //Paging
    lPoincon = lPoincon.Skip(iDisplayStart).Take(iDisplayLength);

    //Building Response
    var jdt = new
    {
        iTotalDisplayRecords = iTotalDisplayRecords,
        iTotalRecords = inEntities.Poincons.Count(),
        sEcho = sEcho,
        aaData = lPoincon
    };

    return Json(jdt);
}
As you can see, I'm grabbing the entire list of "Poincons" from EF and turning it into an Enumerable. From my current understanding, turning the LINQ query into an Enumerable "kills" the link to EF; in other words, it generates the SQL required to get that list at that point, instead of keeping the query composable until the end and executing a precise query that returns only the data you require. After turning this LINQ query into an Enumerable, I'm heavily filtering it (since there is paging, sorting and searching in the datatable). This leads me to think that what my code is currently doing is: grab all the "Poincons" from the database, put them into the web server's memory as an Enumerable, do the work on that Enumerable, then serialize the result as a JSON string and send it to the client.
If I'm correct, the performance hit is quite heavy once you hit a couple thousand entries (which will happen quite fast in production... every time an employee comes to work, it adds 1 entry; 100 employees, ~300 work days a year, you get the idea).
The reason for this hack is that EF does not know what the "getEmploye" method of "Poincon" is, therefore throwing an exception at runtime similar to this:
LINQ to Entities ne reconnaît pas la méthode « PortailNorclair.Models.Employe getEmploye() », et cette dernière ne peut pas être traduite en expression de magasin.
Approximate translation (if anyone can let me know in a comment how to configure IIS / ASP.NET to display errors in English while keeping globalization in a foreign language, I would be really grateful; French information about error messages is sometimes lacking):
LINQ to Entities does not recognize the method "PortailNorclair.Models.Employe getEmploye()", and this method cannot be translated into a store expression.
The "getEmploye" method instantiates and returns an Employe object with the employee id found in the Poincon object. That Employe object has properties that lazy-load information, such as the employee name, from Active Directory.
So the question is: How can I avoid the performance hit from using .AsEnumerable() on the non-filtered list of objects?
Thanks a lot!
The "getEmploye" method instances and returns a Employe object with
the employe id found in the Poincon object. That Employe object has
properties that "lazy loads" information like the employe name from
the Active Directory.
You should be storing the employee name in the database, so you can then order, sort, skip and take in your LINQ query without having to load every employee object.
If empNoStr, empNom, and empNo were all in the database, you could retrieve just the records you want, and call getEmploye() (loading whatever else you need from Active Directory, or wherever) for each of those.
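Assuming the employee fields were stored on the Poincon row as suggested, the controller could keep everything as an IQueryable so the search, count, sort and paging all run in SQL - a rough sketch:
//Rough sketch: no AsEnumerable() until a single page has been selected.
var query = inEntities.Poincons.AsQueryable();

if (!string.IsNullOrEmpty(sSearch))
    query = query.Where(p => p.empNoStr.Contains(sSearch)
                          || p.empNom.Contains(sSearch)
                          || (p.commentaire != null && p.commentaire.Contains(sSearch)));

var iTotalDisplayRecords = query.Count(); //COUNT(*) in the database

var page = query.OrderBy(p => p.poinconStart) //ordering could be built from the sorting dictionary
                .Skip(iDisplayStart)
                .Take(iDisplayLength)
                .ToList(); //only one page is materialized

var aaData = page.Select(p => new { p.id, emp = p.getEmploye(), p.poinconStart, p.poinconEnd, p.commentaire, p.codeExceptions });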
There are some classes on which your program performs its main work.
There are other classes which represent database rows.
If you keep them separated, you can also separate actions you intend to occur in the database from actions you intend to perform locally. This makes it trivial to avoid loading the full table when only specific rows are required.
I see you're also doing paging locally, while the database could do that and save your web server some memory.
