Migrating legacy data with EF Core is very slow - C#

So, quick background: I am working on setting up code to migrate a legacy database, which is a mess in numerous ways, to a new database designed code-first in Entity Framework Core. I have made new models for the new database and scaffolded the old database in EF Core to simplify things.
One of the tables I'm creating is based on 2 tables in the legacy database that hold essentially the same data (but with different models) and together represent a little over 300,000 rows. I am therefore creating 300,000 of the new database models, which along with their related tables total about 1.2 million rows in the new database.
My problem is that when I run a test of about 1,000 rows and extrapolate the time to complete the whole migration, it comes out to about 3.5 hours. This feels very slow, even for such a large number of rows.
Note that both databases are local to my computer, so network delays aren't a factor.
Here is an example of the logic of my code:
//old rows are selected earlier in the function from the legacyDbContext
//provider = IServiceProvider. I use dependency injection for this
foreach (var oldRow in oldRows)
{
    using (var scope = provider.CreateScope())
    {
        var legacyDbContext = scope.ServiceProvider.GetRequiredService<LegacyDbContext>();
        var newDatabaseDbContext = scope.ServiceProvider.GetRequiredService<NewDatabaseDbContext>();

        var newRow = new NewRow();
        newDatabaseDbContext.NewRows.Add(newRow);
        newDatabaseDbContext.SaveChanges(); //generates the id for the newRow

        //transfer data from oldRow to newRow with some very light processing
        //Example
        newRow.Name = oldRow.Name;
        newRow.IsActive = Convert.ToBoolean(oldRow.IsActive);
        //for some reason boolean values were saved as C# short ints
        //which corresponds to tinyint on the database side

        var oldRelatedItems = legacyDbContext.oldRelatedItems
            .Where(m => m.oldItemId == oldRow.Id)
            .ToList();
        //in general this list's count is only 2, sometimes 3
        foreach (var oldRelatedItem in oldRelatedItems)
        {
            var newRelatedItem = new NewRelatedItem();
            newDatabaseDbContext.NewRelatedItems.Add(newRelatedItem);
            newRelatedItem.newRow = newRow;
            //transfer data from oldRelatedItem to newRelatedItem
            newDatabaseDbContext.SaveChanges();
        }

        //Some more data transferred
        newDatabaseDbContext.SaveChanges();
    }
}
One note here: the legacy database does not have any foreign keys. There are columns which contain the IDs of rows in other tables, but they are not configured as foreign keys (I told you this database was messy), so that may present itself as a bottleneck.
I have tried a couple of things that have been mostly ineffective. I tried running the foreach loops entirely inside the using statement, only calling SaveChanges once at the end. This saved some time, but not much (about 5 minutes out of 215 or so).
I also tried running the loops as Parallel.ForEach loops and actually saw the extrapolated time increase with this approach (though it could be that I used it incorrectly).
Any thoughts on how I could improve my code's performance? The final migration will only have to be run once, as this whole project is being rebuilt from the ground up (it really is THAT bad). Even still, I would like to know how to improve this, both for my own understanding and so that the full migration won't take days (remember, this question covers just a couple of tables).

Maybe try something like this. The main themes are to take the context creation out of the loop and to use a bulk insert for everything at the end.
using (var scope = provider.CreateScope())
{
    //can this live outside the loop?
    var legacyDbContext = scope.ServiceProvider.GetRequiredService<LegacyDbContext>();

    var newRows = new List<NewRow>();
    var newRelatedItems = new List<NewRelatedItem>();

    foreach (var oldRow in oldRows)
    {
        var newRow = new NewRow();
        newRow.Name = oldRow.Name;
        newRow.IsActive = Convert.ToBoolean(oldRow.IsActive);
        newRows.Add(newRow);

        var oldRelatedItems = legacyDbContext.oldRelatedItems
            .Where(m => m.oldItemId == oldRow.Id)
            .ToList();
        foreach (var oldRelatedItem in oldRelatedItems)
        {
            var newRelatedItem = new NewRelatedItem();
            newRelatedItem.newRow = newRow;
            //transfer data from oldRelatedItem to newRelatedItem
            newRelatedItems.Add(newRelatedItem);
        }
    }

    var newDatabaseDbContext = scope.ServiceProvider.GetRequiredService<NewDatabaseDbContext>();
    //BulkInsert comes from a third-party extension (e.g. EFCore.BulkExtensions)
    newDatabaseDbContext.BulkInsert(newRows);
    newDatabaseDbContext.BulkInsert(newRelatedItems);
}
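If you would rather stay on plain EF Core instead of a third-party bulk extension, the two BulkInsert calls can be swapped for AddRange plus a single SaveChanges; EF Core batches the generated INSERTs on its own. This is only a minimal sketch using the same hypothetical names as above:
var newDatabaseDbContext = scope.ServiceProvider.GetRequiredService<NewDatabaseDbContext>();

// Plain EF Core alternative to the BulkInsert calls above (sketch only).
// Assumes newRows and newRelatedItems were built as in the snippet above,
// with each NewRelatedItem.newRow navigation already set.
newDatabaseDbContext.NewRows.AddRange(newRows);
newDatabaseDbContext.NewRelatedItems.AddRange(newRelatedItems);
newDatabaseDbContext.SaveChanges(); // parents are inserted first, then children with their FKs fixed up
Tracking ~1.2 million entities in one context will use a fair amount of memory, so if that becomes a problem, process the old rows in chunks (say, 10,000 at a time) and use a fresh context and one SaveChanges per chunk.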

Related

C# EF 5.0 Adding Million Records to MySQL DB takes hours

Below is the code I am using to add records to the database. I know I am calling SaveChanges() every time, which is expensive, but if I call SaveChanges() once after all of them I might get a duplicate key exception. So I am looking for any ideas to improve performance while keeping duplicate records in mind.
using (var db = new dbEntities())
{
    for (int i = 0; i < csvCustomers.Count; i++)
    {
        var csvCustomer = csvCustomers[i];
        dbcustomer customer = new dbcustomer()
        {
            ADDRESS = csvCustomer.ADDRESS,
            FIRSTNAME = csvCustomer.FIRSTNAME,
            LASTNAME = csvCustomer.LASTNAME,
            PHONE = csvCustomer.PHONE,
            ZIPCODE = csvCustomer.ZIP
        };
        try
        {
            dbzipcode z = db.dbzipcodes.FirstOrDefault(x => x.ZIP == customer.ZIPCODE);
            //TODO: Handle if Zip Code not Found in DB
            if (z == null)
            {
                db.dbcustomers.Add(customer);
                throw new DbEntityValidationException("Zip code not found in database.");
            }
            customer.dbzipcode = z;
            z.dbcustomers.Add(customer);
            db.SaveChanges();
        }
        catch (DbEntityValidationException)
        {
            //catch block was omitted in the original post; log/skip the record here
        }
    }
}
One solution that I have in mind is to add data in batches and then call db.SaveChanges(), and in case of an exception recursively reduce the batch size for those records.
Using EF to insert huge numbers of records is going to incur a significant cost compared to more direct approaches, but there are a few things you can do to markedly improve performance.
Firstly, batching the requests and calling SaveChanges per batch will be preferable to saving individual records, or to attempting to commit all of the changes at once. You will need to deal with exceptions if/when a batch fails (possibly committing that batch one record at a time to fully isolate duplicate rows).
Next, you can pre-cache your zip codes rather than looking them up on each iteration. Don't load the entire entity; just cache the zip code and the ID into an in-memory list:
(If the zip code entity amounts to little more than this, then just load the entity.)
var zipCodes = db.dbzipcodes.Select(x => new { x.ZIPCODEID, x.ZIP }).ToList();
This will require a bit of extra attention when it comes to associating a zip code with a customer within the batched calls, since the zip code will initially not be known by the DbContext but may be known once a second customer with the same zip code is added.
To associate a zip code without loading it into the DbContext:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
If you did load the entire zip code entity into the cached list, then the var zipCode = new dbzipcode ... line is not needed; just attach the cached entity.
However, if that zip code has already been associated with the DbContext earlier in the batch, you will get an error (regardless of whether you cached the entity or just the ID/code), so you first need to check the DbContext's in-memory (Local) zip codes:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = db.dbzipcodes.Local.SingleOrDefault(x => x.ZIPCODEID == customerZipCode.ZIPCODEID)
    ?? new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
Lastly, EF tracks a lot of extra info in memory in the context, so another consideration alongside batching is to avoid using the same DbContext across all batches; instead, open a new DbContext for each batch. As you add items and call SaveChanges on a single DbContext, it keeps tracking every entity that has been added. If you do batches of 1000 or so, each context only tracks those 1000, rather than 1000, then 2000, then 3000, and so on up to 5 million rows.
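Putting those pieces together, here is a rough sketch of what a fresh-context-per-batch approach could look like. It is only illustrative: the batch size and error handling are assumptions, and the entity/property names come from the question above.
// Illustrative sketch only: batch size and error handling are assumptions.
const int batchSize = 1000;

// Cache just the zip code and its key once, up front.
Dictionary<string, int> zipIdByCode;
using (var db = new dbEntities())
{
    zipIdByCode = db.dbzipcodes
        .Select(x => new { x.ZIP, x.ZIPCODEID })
        .ToDictionary(x => x.ZIP, x => x.ZIPCODEID); // assumes ZIPCODEID is an int
}

for (var start = 0; start < csvCustomers.Count; start += batchSize)
{
    using (var db = new dbEntities()) // fresh context per batch keeps tracking small
    {
        foreach (var csvCustomer in csvCustomers.Skip(start).Take(batchSize))
        {
            int zipId;
            if (!zipIdByCode.TryGetValue(csvCustomer.ZIP, out zipId))
                continue; // zip code not found in DB; handle as appropriate

            // Reuse the instance the context already tracks, or attach a stub.
            var zip = db.dbzipcodes.Local.FirstOrDefault(z => z.ZIPCODEID == zipId);
            if (zip == null)
            {
                zip = new dbzipcode { ZIPCODEID = zipId };
                db.dbzipcodes.Attach(zip);
            }

            var customer = new dbcustomer
            {
                ADDRESS = csvCustomer.ADDRESS,
                FIRSTNAME = csvCustomer.FIRSTNAME,
                LASTNAME = csvCustomer.LASTNAME,
                PHONE = csvCustomer.PHONE,
                ZIPCODE = csvCustomer.ZIP,
                dbzipcode = zip
            };
            db.dbcustomers.Add(customer);
        }

        db.SaveChanges(); // one commit per batch; on failure, retry the batch row by row
    }
}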

EntityFramework is painfully slow at executing an update query

We're investigating a performance issue where EF 6.1.3 is being painfully slow, and we cannot figure out what might be causing it.
The database context is initialized with:
Configuration.ProxyCreationEnabled = false;
Configuration.AutoDetectChangesEnabled = false;
Configuration.ValidateOnSaveEnabled = false;
We have isolated the performance issue to the following method:
protected virtual async Task<long> UpdateEntityInStoreAsync(T entity,
    string[] changedProperties)
{
    using (var session = sessionFactory.CreateReadWriteSession(false, false))
    {
        var writer = session.Writer<T>();
        writer.Attach(entity);
        await writer.UpdatePropertyAsync(entity, changedProperties.ToArray()).ConfigureAwait(false);
    }
    return entity.Id;
}
There are two property names in the changedProperties list, and EF correctly generates an update statement that updates just those two properties.
This method is called repeatedly (to process a collection of data items) and takes about 15-20 seconds to complete.
If we replace the method above with the following, execution time drops to 3-4 seconds:
protected virtual async Task<long> UpdateEntityInStoreAsync(T entity,
    string[] changedProperties)
{
    var sql = $"update {entity.TypeName()}s set";
    var separator = false;
    foreach (var property in changedProperties)
    {
        sql += (separator ? ", " : " ") + property + " = @" + property;
        separator = true;
    }
    sql += " where id = @Id";
    var parameters = (from parameter in changedProperties.Concat(new[] { "Id" })
                      let property = entity.GetProperty(parameter)
                      select ContextManager.CreateSqlParameter(parameter, property.GetValue(entity))).ToArray();
    using (var session = sessionFactory.CreateReadWriteSession(false, false))
    {
        await session.UnderlyingDatabase.ExecuteSqlCommandAsync(sql, parameters).ConfigureAwait(false);
    }
    return entity.Id;
}
The UpdatePropertyAsync method called on the writer (a repository implementation) looks like this:
public virtual async Task UpdatePropertyAsync(T entity, string[] changedPropertyNames, bool save = true)
{
    if (changedPropertyNames == null || changedPropertyNames.Length == 0)
    {
        return;
    }
    Array.ForEach(changedPropertyNames, name => context.Entry(entity).Property(name).IsModified = true);
    if (save)
        await context.SaveChangesAsync().ConfigureAwait(false);
}
What is EF doing that completely kills performance? And is there anything we can do to work around it (short of using another ORM)?
By timing the code I was able to see that the additional time spent by EF was in the call to Attach the object to the context, not in the actual query to update the database.
By eliminating all object references (setting them to null before attaching the object and restoring them after the update completes), the EF code runs in comparable time (5 seconds, but with lots of logging code) to the hand-written solution.
So it looks like EF has a "bug" (some might call it a feature) causing it to inspect the attached object graph recursively even though change tracking and validation have been disabled.
Update: EF 7 appears to have addressed this issue by allowing you to pass in a GraphBehavior enum when calling Attach.
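A minimal sketch of that workaround, assuming a hypothetical Order entity with a Customer navigation property and a MyDbContext with an Orders set (none of these are the poster's actual types):
// Hypothetical model; the real entities are not shown in the post.
public class Customer
{
    public long Id { get; set; }
}

public class Order
{
    public long Id { get; set; }
    public decimal Total { get; set; }
    public Customer Customer { get; set; } // navigation that makes Attach walk the graph
}

public async Task<long> UpdateTotalAsync(MyDbContext context, Order order)
{
    // Temporarily drop object references so Attach does not inspect the whole graph.
    var customer = order.Customer;
    order.Customer = null;
    try
    {
        context.Orders.Attach(order);
        context.Entry(order).Property(o => o.Total).IsModified = true;
        await context.SaveChangesAsync().ConfigureAwait(false);
    }
    finally
    {
        order.Customer = customer; // restore the reference afterwards
    }
    return order.Id;
}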
The problem with Entity Framework is that when you call SaveChanges(), insert statements are sent to the database one by one; that is how EF works.
And there are actually 2 database hits per insert: the first is the insert statement for the record, and the second is a select statement to get the id of the inserted record.
So you have numOfRecords * 2 database trips * the time for one database trip.
Add context.Database.Log = message => Debug.WriteLine(message); to your code to log the generated SQL, and you will see what I am talking about.
You can use BulkInsert, here is the link: https://efbulkinsert.codeplex.com/
Seeing as you have already tried setting:
Configuration.AutoDetectChangesEnabled = false;
Configuration.ValidateOnSaveEnabled = false;
and you are not using ordered lists, I think you are going to have to refactor your code and do some benchmarking.
I believe the bottleneck is the foreach, as the context has to deal with a potentially large amount of bulk data (not sure how much that is in your case).
Try cutting the items in your array down into smaller batches before calling SaveChanges() or SaveChangesAsync(), and note the performance difference as opposed to letting the context grow too large.
Also, if you are still not seeing further gains, try disposing of the context after SaveChanges() and creating a new one; depending on the size of your entity list, flushing out the context may yield even further improvements.
But this all depends on how many entities we are talking about, and may only be noticeable in scenarios with hundreds of thousands of records.

LINQ to Entities- SaveChanges take too much time

Currently, I am struggling with an issue regarding Entity Framework (LINQ to Entities). Most of the time when I execute entity.SaveChanges() everything works fine, but at some points entity.SaveChanges() takes too long and times out. I searched a lot but was unable to find the answer.
(According to company policy, I cannot copy the code elsewhere, so I do not have the exact code, but I will try to lay out the basic structure. I hope it helps you figure out the problem, but if it doesn't, let me know.)
Task:
My task is to scan the whole network for some specific files. Match the content of each file with the content of the database and, based on the matching, either insert or update the database with the content of the file. I have around 3000 files on the network.
Problem:
public void PerformAction()
{
    DbTransaction tran = null;
    entity.Connection.Open(); //entity is a global variable declared like myDatabaseEntity entity = new myDatabaseEntity();
    tran = entity.Connection.BeginTransaction();
    foreach (string path in listOfPaths)
    {
        //returns 1 - multiple matches in database OR
        //        2 - one matching file in database OR
        //        3 - no match found
        int returnValue = SearchDatabase();
        if (returnValue == 1)
            DoSomething(); //All inserts/updates work perfectly. SaveChanges also works correctly.
        else if (returnValue == 2)
            DoSomething(); //Again, everything ok. SaveChanges works perfectly here.
        else
        {
            //This function uses an XML file to generate all the queries dynamically
            //For example: INSERT INTO TABLEA(1,2,3);
            GenerateInsertQueriesFromXML();
            ExecuteQueries();
            SaveChanges(); // <---- Problem here. Sometimes takes too much time.
        }
        //Transaction commit/rollback code here
    }
}
public bool ExecuteQueries()
{
    int result = 0;
    foreach (string query in listOfInsertQueries)
    {
        result = entity.ExecuteStoreCommand(query); //Execute the insert queries
        if (result <= 0)
            return false;
    }
    entity.TestEntityA a = new entity.TestEntityA();
    a.PropertyA = 123;
    a.PropertyB = 345;
    //I have around 25 properties here
    entity.AddToTestEntityA(a);
    return true;
}
Found the issue.
The main table where I was inserting all the data had a trigger on INSERT and DELETE.
So, whenever I inserted new data into the main table, the trigger fired in the background and that was taking all the time.
Entity framework is FAST and INNOCENT :D

How to Optimize Performance for a Full Table Update

I am writing a fairly large service centered around Stanford's Folding@home project. This portion of the project is a WCF service hosted inside a Windows Service. With proper database indices and a dual-core Core2Duo/7200 rpm platter drive I am able to process approximately 1500 rows per second (SQL Server 2012 Datacenter instance). Each hour when I run this update, it takes a considerable amount of time to iterate through all 1.5 million users and add updates where necessary.
Looking at the performance profiler in SQL Server Management Studio 2012, I see that every user is being loaded via an individual query. Is there a way with EF to eagerly load a set of users of a given size, update them in memory, and then save the updated users, using queries more elegant than single-select, single-update? I am currently using EF5, but if I need to move to EF6 for improved performance, I will. The main source of delay in this process is waiting for database results.
Also, if there is anything I should change about the ForAll or the pre-processing, feel free to mention it. The group pre-processing is very quick and dramatically increases the speed of the update by controlling each EF context's size, but if I can pre-process more and improve the overall time, I am more than willing to look into it!
private void DoUpdate(IEnumerable<Update> table)
{
    var t = table.ToList();
    var numberOfRowsInGroups = t.Count() / (Properties.Settings.Default.UpdatesPerContext); //Control each local context size. 120 works well on most systems I have.

    //Split work groups out of the table of updates.
    var groups = t.AsParallel()
        .Select((update, index) => new { Value = update, Index = index })
        .GroupBy(a => a.Index % numberOfRowsInGroups)
        .ToList();

    groups.AsParallel().ForAll(group =>
    {
        var ents = new FoldingDataEntities();
        ents.Configuration.AutoDetectChangesEnabled = false;
        ents.Configuration.LazyLoadingEnabled = true;
        ents.Database.Connection.Open();

        var count = 0;
        foreach (var a in group)
        {
            var update = a.Value;
            var data = UserData.GetUserData(update.Name, update.Team, ents); //(Name,Team) is a superkey; passing ents allows external context control
            if (data.TotalPoints < update.NewCredit)
            {
                data.addUpdate(update.NewCredit, update.Sum); //basic arithmetic, very quick - may attach a row to the UserData.Updates collection. (does not SaveChanges here)
            }
        }

        ents.ChangeTracker.DetectChanges();
        ents.SaveChanges();
    });
}

//from the UserData class which wraps the EF code.
public static UserData GetUserData(string name, long team, FoldingDataEntities ents)
{
    return ents.Users.Local.FirstOrDefault(u => (u.Team == team && u.Name == name))
        ?? ents.Users.FirstOrDefault(u => (u.Team == team && u.Name == name))
        ?? ents.Users.Add(new User { Name = name, Team = team, StartDate = DateTime.Now, LastUpdate = DateTime.Now });
}

internal struct Update
{
    public string Name;
    public long NewCredit;
    public long Sum;
    public long Team;
}
EF is not the solution for raw performance. It's the "easy way" to build a data access layer (DAL), but it comes with a fair bit of overhead. I'd highly recommend using Dapper or raw ADO.NET to do a bulk update; it would be a lot faster.
http://www.ormbattle.net/
Now, to answer your question: to do a batch update in EF, you'll need to download some extensions and third-party plugins that enable such abilities. See: Batch update/delete EF5
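As an illustration of the Dapper route, here is a minimal sketch. The table and column names are assumptions based on the entities above, not the poster's actual schema; Dapper executes the parameterised statement once per element of the list passed as the parameter.
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

// Assumed shape; mirrors the Update struct above plus the credit to write.
public class CreditUpdate
{
    public string Name { get; set; }
    public long Team { get; set; }
    public long NewCredit { get; set; }
}

public static void BulkUpdateCredits(string connectionString, IEnumerable<CreditUpdate> updates)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        {
            // Dapper runs this statement once per item in 'updates'.
            // Table/column names are assumptions for the sketch.
            connection.Execute(
                @"UPDATE Users
                     SET TotalPoints = @NewCredit, LastUpdate = GETDATE()
                   WHERE Name = @Name AND Team = @Team",
                updates,
                transaction);
            transaction.Commit();
        }
    }
}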

Entity Framework 4 Out of Memory on SaveChanges

I have a table that contains more than half a million records. Each record contains about 60 fields, but we only make changes to three of them.
We make a small modification to each entity based on a calculation and a look-up.
Clearly I can't update each entity in turn and call SaveChanges each time, as that would take far too long.
So at the end of the whole process I call SaveChanges on the context.
This causes an Out of Memory error when I call SaveChanges.
I'm using the DataRepository pattern.
//Update code
DataRepository<ExportOrderSKUData> repoExportOrders = new DataRepository<ExportOrderSKUData>();
foreach (ExportOrderSKUData grpDCItem in repoExportOrders.All())
{
    //...make changes to entity...
}
repoExportOrders.SaveChanges();

//Data repository snip
public DataRepository()
{
    _context = new tomEntities();
    _objectSet = _context.CreateObjectSet<T>();
}

public List<T> All()
{
    return _objectSet.ToList<T>();
}

public void SaveChanges()
{
    _context.SaveChanges();
}
What should I be looking for in this instance?
Making changes to half a million records through EF within one transaction is not the intended use case. Doing it in small batches is a better technical solution, and doing it on the database side through a stored procedure can be an even better one.
I would start by slightly modifying your code (translate it to your repository API yourself):
using (var readContext = new YourContext())
{
    var set = readContext.CreateObjectSet<ExportOrderSKUData>();
    foreach (var item in set.ToList())
    {
        readContext.Detach(item);
        using (var updateContext = new YourContext())
        {
            updateContext.Attach(item);
            // make your changes
            updateContext.SaveChanges();
        }
    }
}
This code uses a separate context for saving each item, so each save is in its own transaction. Don't be afraid of that. Even if you try to save more records within one call to SaveChanges, EF will use a separate roundtrip to the database for every updated record. The only difference is whether you want multiple updates in the same transaction (but having half a million updates in a single transaction will cause issues anyway).
Another option may be:
using (var readContext = new YourContext())
{
    var set = readContext.CreateObjectSet<ExportOrderSKUData>();
    set.MergeOption = MergeOption.NoTracking;
    foreach (var item in set)
    {
        using (var updateContext = new YourContext())
        {
            updateContext.Attach(item);
            // make your changes
            updateContext.SaveChanges();
        }
    }
}
This can in theory consume even less memory, because you don't need to have all entities loaded before the foreach. The first example probably needs to load all entities prior to enumeration (by calling ToList) to avoid an exception when calling Detach (modifying the collection during enumeration), but I'm not sure that really happens.
Modifying those examples to use some batches should be easy.
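For example, the second snippet could be batched along these lines. It is only a rough sketch: the batch size is arbitrary, YourContext / ExportOrderSKUData are the placeholder names from above, and marking the whole object as Modified updates every column, which is usually acceptable for a one-off bulk job.
// Rough batching sketch based on the second example above.
const int batchSize = 500; // arbitrary; tune for your workload

using (var readContext = new YourContext())
{
    var set = readContext.CreateObjectSet<ExportOrderSKUData>();
    set.MergeOption = MergeOption.NoTracking; // entities come back detached

    var batch = new List<ExportOrderSKUData>(batchSize);
    foreach (var item in set)
    {
        // make your changes to item here
        batch.Add(item);

        if (batch.Count == batchSize)
        {
            SaveBatch(batch); // hypothetical helper shown below
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        SaveBatch(batch);
}

// One short-lived context (and therefore one transaction) per batch.
static void SaveBatch(List<ExportOrderSKUData> batch)
{
    using (var updateContext = new YourContext())
    {
        var updateSet = updateContext.CreateObjectSet<ExportOrderSKUData>();
        foreach (var item in batch)
        {
            updateSet.Attach(item);
            // the items were changed while detached, so mark them as modified
            updateContext.ObjectStateManager.ChangeObjectState(item, EntityState.Modified);
        }
        updateContext.SaveChanges();
    }
}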
