Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?
static void ExportFromLegacy()
{
var usersQuery = _oldDb.users.Where(x =>
x.status == 'active');
int BatchSize = 1000;
var errorCount = 0;
var successCount = 0;
var batchCount = 0;
// Using MoreLinq's Batch for sequences
// https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
Console.WriteLine(String.Format("Batch count at {0}", batchCount));
batchCount++;
foreach(var user in batch)
{
try
{
var userData = _oldDb.userData.Where(x =>
x.user_id == user.user_id).ToList();
if (userData.Count > 0)
{
// Insert into table
var newData = new newData()
{
UserId = user.user_id; // shortened code for brevity.
};
_db.newUserData.Add(newData);
_db.SaveChanges();
// Insert item(s) into table
foreach (var item in userData.items)
{
if (!_db.userDataItems.Any(x => x.id == item.id)
{
var item = new Item()
{
UserId = user.user_id, // shortened code for brevity.
DataId = newData.id // id from object created above
};
_db.userDataItems.Add(item);
}
_db.SaveChanges();
successCount++;
}
}
}
catch(Exception ex)
{
errorCount++;
Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id.ToString(), DateTime.Now));
Console.WriteLine("Message: " + ex.Message);
Console.WriteLine("InnerException: " + ex.InnerException);
}
}
}
Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
Console.WriteLine(String.Format("Total running time: {0}", (exportStart - DateTime.Now).ToString(#"hh\:mm\:ss")));
}
Unfortunately, the major issue is the number of database round-trip.
You make a round-trip:
For every user, you retrieve user data by user id in the old database
For every user, you save user data in the new database
For every user, you save user data item in the new database
So if you say you have 3 million users, and every user has an average of 5 user data item, it mean you do at least 3m + 3m + 15m = 21 million database round-trip which is insane.
The only way to dramatically improve the performance is by reducing the number of database round-trip.
Batch - Retrieve user by id
You can quickly reduce the number of database round-trip by retrieving all user data at once and since you don't have to track them, use "AsNoTracking()" for even more performance gains.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
foreach(var userData in userDatas)
{
....
}
You should already have saved a few hours only with this change.
Batch - Save Changes
Every time you save a user data or item, you perform a database round-trip.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library allows to perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
You can either call BulkSaveChanges at the end of the batch or create a list to insert and use directly BulkInsert instead for even more performance.
You will, however, have to use a relation to the newData instance instead of using the ID directly.
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
_db.BulkInsert(newDatas);
_db.BulkInsert(newDataItems);
}
EDIT: Answer subquestion
One of the properties of a newDataItem, is the id of newData. (ex.
newDataItem.newDataId.) So newData would have to be saved first in
order to generate its id. How would I BulkInsert if there is a
dependency of an another object?
You must use instead navigation properties. By using navigation property, you will never have to specify parent id but set the parent object instance instead.
public class UserData
{
public int UserDataID { get; set; }
// ... properties ...
public List<UserDataItem> Items { get; set; }
}
public class UserDataItem
{
public int UserDataItemID { get; set; }
// ... properties ...
public UserData OwnerData { get; set; }
}
var userData = new UserData();
var userDataItem = new UserDataItem();
// Use navigation property to set the parent.
userDataItem.OwnerData = userData;
Tutorial: Configure One-to-Many Relationship
Also, I don't see a BulkSaveChanges in your example code. Would that
have to be called after all the BulkInserts?
Bulk Insert directly insert into the database. You don't have to specify "SaveChanges" or "BulkSaveChanges", once you invoke the method, it's done ;)
Here is an example using BulkSaveChanges:
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
var context = new UserContext();
context.userDatas.AddRange(newDatas);
context.userDataItems.AddRange(newDataItems);
context.BulkSaveChanges();
}
BulkSaveChanges is slower than BulkInsert due to having to use some internal method from Entity Framework but still way faster than SaveChanges.
In the example, I create a new context for every batch to avoid memory issue and gain some performance. If you re-use the same context for all batchs, you will have millions of tracked entities in the ChangeTracker which is never a good idea.
Entity Framework is a very bad choice for importing large amounts of data. I know this from personal experience.
That being said, I found a few ways to optimize things when I tried to use it in the same way you are.
The Context will cache objects as you add them, and the more inserts you do, the slower future inserts will get. My solution was to limit each context to about 500 inserts before I disposed of that instance and created a new one. This boosted performance significantly.
I was able to make use of multiple threads to increase performance, but you will have to be very careful about resource contention. Each thread will definitely need its own Context, don't even think about trying to share it between threads. My machine had 8 cores, so threading will probably not help you as much; with a single core I doubt it will help you at all.
Turn off ChangeTracking with AutoDetectChangesEnabled = false;, change tracking is incredibly slow. Unfortunately this means you have to modify your code to make all changes directly through the context. No more Entity.Property = "Some Value";, it becomes Context.Entity(e=> e.Property).SetValue("Some Value"); (or something like that, I don't remember the exact syntax), which makes the code ugly.
Any queries you do should definitely use AsNoTracking.
With all that, I was able to cut a ~20 hour process down to about 6 hours, but I still don't recommend using EF for this. It was an extremely painful project due almost entirely to my poor choice of EF to add data. Please use something else... anything else...
I don't want to give the impression that EF is a bad data access library, it is great at what it was designed to do, unfortunately this is not what it was designed for.
I can think on a few options.
1) A little speed increase could be done by moving your _db.SaveChanges() under your foreach() close bracket
foreach (...){
}
successCount += _db.SaveChanges();
2) Add items to a list, and then to context
List<ObjClass> list = new List<ObjClass>();
foreach (...)
{
list.Add(new ObjClass() { ... });
}
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
3) If it's a big amount of dada, save on bunches
List<ObjClass> list = new List<ObjClass>();
int cnt=0;
foreach (...)
{
list.Add(new ObjClass() { ... });
if (++cnt % 100 == 0) // bunches of 100
{
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
// Optional if a HUGE amount of data
if (cnt % 1000 == 0)
{
_db = new MyDbContext();
}
}
}
// Don't forget that!
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
4) If TOOOO big, considere using bulkinserts. There are a few examples on internet and a few free libraries around.
Ref: https://blogs.msdn.microsoft.com/nikhilsi/2008/06/11/bulk-insert-into-sql-from-c-app/
On most of these options you loose some control on error handling as it is difficult to know which one failed.
Related
I have more than 15000 POCO elements stored in a Redis List. I'm using ServiceStack in order to save and get them. However, I'm not pleased about the response times that I have when I get them into a grid. As I read , it would be better to store these object in hash - but unfortunately I could not find any good example for my case :(
This is the method I use, in order to get them into my grid
public IEnumerable<BookingRequestGridViewModel> GetAll()
{
try
{
var redisManager = new RedisManagerPool(Global.RedisConnector);
using (var redis = redisManager.GetClient())
{
var redisEntities = redis.As<BookingRequestModel>();
var result =redisEntities.Lists["BookingRequests"].GetAll().Select(z=> new BookingRequestGridViewModel
{
CreatedDate =z.CreatedDate,
DropOffBranchName =z.DropOffBranch !=null ? z.DropOffBranch.Name : string.Empty,
DropOffDate =z.DropOffDate,
DropOffLocationName = z.DropOffLocation != null ? z.DropOffLocation.Name : string.Empty,
Id =z.Id.Value,
Number =z.Number,
PickupBranchName =z.PickUpBranch !=null ? z.PickUpBranch.Name :string.Empty,
PickUpDate =z.PickUpDate,
PickupLocationName = z.PickUpLocation != null ? z.PickUpLocation.Name : string.Empty
}).OrderBy(z=>z.Id);
return result;
}
}
catch (Exception ex)
{
return null;
}
}
Note that I use redisEntities.Lists["BookingRequests"].GetAll() which is causing performance issues (I would like to use just redisEntities.Lists["BookingRequests"] but I lose last updates from grid - after editing)
I would like to know if saving them into list is a good approach as for me it's very important to have a fast grid (I have now 1 second at paging which is huge).
Please, advice!
Firstly you should not create a new Redis Client Manager like RedisManagerPool instance each time, there should only be a singleton instance of RedisManagerPool in your App which all clients are resolved from.
But otherwise I would rethink your data access strategy, downloading 15K items in a batch is not an ideal strategy. You can create indexes by storing ids in Sets or you could store items in a sorted set with a value that you can page against like an incrementing id, e.g:
var redisEntities = redis.As<BookingRequestModel>();
var bookings = redisEntities.SortedSets["bookings"];
foreach (var item in new BookingRequestModel[0])
{
redisEntities.AddItemToSortedSet(bookings, item, item.Id);
}
That way you will be able to fetch them in batches, e.g:
var batch = bookings.GetRangeByLowestScore(fromId, toId, skip, take);
I have two List. One with new objects, lets call it NewCol, and one that I have read out of my database(lets call it CurrentCol). The CurrentCol is about 1.2 million rows(and will grow daily) and the NewCol can be any thing from 1 to thousands of rows at a time. Can any one please help me improve the following code as it runs way too slow:
foreach (PortedNumberCollection element in NewCol)
{
if (CurrentCol.AsParallel().Select(x => x.MSISDN).Contains(element.MSISDN))
{
CurrentCol.AsParallel().Where(y => y.MSISDN == element.MSISDN)
.Select
(x =>
{
//if the MSISDN exists in both the old and new collection
// then update the old collection
x.RouteAction = element.RouteAction;
x.RoutingLabel = element.RoutingLabel;
return x;
}
);
}
else
{
//The item does not exist in the old collection so add it
CurrentCol.Add(element);
}
}
It's a pretty bad idea to read the whole database into memory and search it there. Searching is what databases are optimized for. The best thing you can do is to move that whole code into the database somehow, for example by executing a MERGE statement for each item in NewCol.
However, if you can't do that, the next best thing would be to make CurrentCol a dictionary with MSISDN as the key:
// Somewhere... Only once, not every time you want to add new items
// In fact, there shouldn't be any CurrentCol, only the dictionary.
var currentItems = CurrentCol.ToDictionary(x => x.MSISDN, x => x);
// ...
foreach(var newItem in NewCol)
{
PortedNumberCollection existingItem;
if(currentItems.TryGetValue(newItem.MSISDN, out existingItem)
{
existingItem.RouteAction = newItem.RouteAction;
existingItem.RoutingLabel = newItem.RoutingLabel;
}
else
{
currentItems.Add(newItem.MSISDN, newItem);
}
}
The following code results in deletions instead of updates.
My question is: is this a bug in the way I'm coding against Entity Framework or should I suspect something else?
Update: I got this working, but I'm leaving the question now with both the original and the working versions in hopes that I can learn something I didn't understand about EF.
In this, the original non working code, when the database is fresh, all the additions of SearchDailySummary object succeed, but on the second time through, when my code was supposedly going to perform the update, the net result is a once again empty table in the database, i.e. this logic manages to be the equiv. of removing each entity.
//Logger.Info("Upserting SearchDailySummaries..");
using (var db = new ClientPortalContext())
{
foreach (var item in items)
{
var campaignName = item["campaign"];
var pk1 = db.SearchCampaigns.Single(c => c.SearchCampaignName == campaignName).SearchCampaignId;
var pk2 = DateTime.Parse(item["day"].Replace('-', '/'));
var source = new SearchDailySummary
{
SearchCampaignId = pk1,
Date = pk2,
Revenue = decimal.Parse(item["totalConvValue"]),
Cost = decimal.Parse(item["cost"]),
Orders = int.Parse(item["conv1PerClick"]),
Clicks = int.Parse(item["clicks"]),
Impressions = int.Parse(item["impressions"]),
CurrencyId = item["currency"] == "USD" ? 1 : -1 // NOTE: non USD (if exists) -1 for now
};
var target = db.Set<SearchDailySummary>().Find(pk1, pk2) ?? new SearchDailySummary();
if (db.Entry(target).State == EntityState.Detached)
{
db.SearchDailySummaries.Add(target);
addedCount++;
}
else
{
// TODO?: compare source and target and change the entity state to unchanged if no diff
updatedCount++;
}
AutoMapper.Mapper.Map(source, target);
itemCount++;
}
Logger.Info("Saving {0} SearchDailySummaries ({1} updates, {2} additions)", itemCount, updatedCount, addedCount);
db.SaveChanges();
}
Here is the working version (although I'm not 100% it's optimized, it's working reliably and performing fine as long as I batch it out in groups of 500 or less items in a shot - after that it slows down exponentially but I think that just may be a different question/subject)...
//Logger.Info("Upserting SearchDailySummaries..");
using (var db = new ClientPortalContext())
{
foreach (var item in items)
{
var campaignName = item["campaign"];
var pk1 = db.SearchCampaigns.Single(c => c.SearchCampaignName == campaignName).SearchCampaignId;
var pk2 = DateTime.Parse(item["day"].Replace('-', '/'));
var source = new SearchDailySummary
{
SearchCampaignId = pk1,
Date = pk2,
Revenue = decimal.Parse(item["totalConvValue"]),
Cost = decimal.Parse(item["cost"]),
Orders = int.Parse(item["conv1PerClick"]),
Clicks = int.Parse(item["clicks"]),
Impressions = int.Parse(item["impressions"]),
CurrencyId = item["currency"] == "USD" ? 1 : -1 // NOTE: non USD (if exists) -1 for now
};
var target = db.Set<SearchDailySummary>().Find(pk1, pk2);
if (target == null)
{
db.SearchDailySummaries.Add(source);
addedCount++;
}
else
{
AutoMapper.Mapper.Map(source, target);
db.Entry(target).State = EntityState.Modified;
updatedCount++;
}
itemCount++;
}
Logger.Info("Saving {0} SearchDailySummaries ({1} updates, {2} additions)", itemCount, updatedCount, addedCount);
db.SaveChanges();
}
The thing that keeps popping up in my mind is that maybe the Entry(entity) or Find(pk) method has some side effects? I should probably be consulting the documentation but any advice is appreciated..
It's a slight assumption on my part (without looking into your models/entities), but have a look at what's going on within this block (see if the objects being attached here are related to the deletions):
if (db.Entry(target).State == EntityState.Detached)
{
db.SearchDailySummaries.Add(target);
addedCount++;
}
Your detached object won't be able to use its navigation properties to locate its related objects; you'll be re-attaching an object in a potentially conflicting state (without the correct relationships).
You haven't mentioned what is being deleted above, so I may be way off. Just off out, so this is a little rushed, hope there's something useful in there.
As a follow-up to my earlier question, I now know that EF doesn't just save all of the changes of the entire entity for me automatically. If my entity has a List<Foo>, I need to update that list and save it. But how? I've tried a few things, but I can't get the list to save properly.
I have a many-to-many association between Application and CustomVariableGroup. An app can have one or more groups, and a group can belong to one or more apps. I believe I have this set up correctly with my Code First implementation because I see the many-to-many association table in the DB.
The bottom line is that the Application class has a List<CustomVariableGroup>. My simple case is that the app already exists, and now a user has selected a group to belong to the app. I want to save that change in the DB.
Attempt #1
this.Database.Entry(application).State = System.Data.EntityState.Modified;
this.Database.SaveChanges();
Result: Association table still has no rows.
Attempt #2
this.Database.Applications.Attach(application);
var entry = this.Database.Entry(application);
entry.CurrentValues.SetValues(application);
this.Database.SaveChanges();
Result: Association table still has no rows.
Attempt #3
CustomVariableGroup group = application.CustomVariableGroups[0];
application.CustomVariableGroups.Clear();
application.CustomVariableGroups.Add(group);
this.Database.SaveChanges();
Result: Association table still has no rows.
I've researched quite a bit, and I've tried more things than I've shown, and I simply don't know how to update an Application's list with a new CustomVariableGroup. How should it be done?
EDIT (Solution)
After hours of trial and error, this seems to be working. It appears that I need to get the objects from the DB, modify them, then save them.
public void Save(Application application)
{
Application appFromDb = this.Database.Applications.Single(
x => x.Id == application.Id);
CustomVariableGroup groupFromDb = this.Database.CustomVariableGroups.Single(
x => x.Id == 1);
appFromDb.CustomVariableGroups.Add(groupFromDb);
this.Database.SaveChanges();
}
While I consider this a bit of a hack, it works. I'm posting this in the hopes that it helps someone else save an entire day's worth of work.
public void Save(Application incomingApp)
{
if (incomingApp == null) { throw new ArgumentNullException("incomingApp"); }
int[] groupIds = GetGroupIds(incomingApp);
Application appToSave;
if (incomingApp.IdForEf == 0) // New app
{
appToSave = incomingApp;
// Clear groups, otherwise new groups will be added to the groups table.
appToSave.CustomVariableGroups.Clear();
this.Database.Applications.Add(appToSave);
}
else
{
appToSave = this.Database.Applications
.Include(x => x.CustomVariableGroups)
.Single(x => x.IdForEf == incomingApp.IdForEf);
}
AddGroupsToApp(groupIds, appToSave);
this.Database.SaveChanges();
}
private void AddGroupsToApp(int[] groupIds, Application app)
{
app.CustomVariableGroups.Clear();
List<CustomVariableGroup> groupsFromDb2 =
this.Database.CustomVariableGroups.Where(g => groupIds.Contains(g.IdForEf)).ToList();
foreach (CustomVariableGroup group in groupsFromDb2)
{
app.CustomVariableGroups.Add(group);
}
}
private static int[] GetGroupIds(Application application)
{
int[] groupIds = new int[application.CustomVariableGroups.Count];
int i = 0;
foreach (CustomVariableGroup group in application.CustomVariableGroups)
{
groupIds[i] = group.IdForEf;
i++;
}
return groupIds;
}
This section simply reads from an excel spreadsheet. This part works fine with no performance issues.
IEnumerable<ImportViewModel> so=data.Select(row=>new ImportViewModel{
PersonId=(row.Field<string>("person_id")),
ValidationResult = ""
}).ToList();
Before I pass to a View I want to set ValidationResult so I have this piece of code. If I comment this out the model is passed to the view quickly. When I use the foreach it will take over a minute. If I hardcode a value for item.PersonId then it runs quickly. I know I'm doing something wrong, just not sure where to start and what the best practice is that I should be following.
foreach (var item in so)
{
if (db.Entity.Any(w => w.ID == item.PersonId))
{
item.ValidationResult = "Successful";
}
else
{
item.ValidationResult = "Error: ";
}
}
return View(so.ToList());
You are now performing a database call per item in your list. This is really hard on your database and thus your performance. Try to itterate trough your excel result, gather all users and select them in one query. Make a list from this query result (else the query call is performed every time you access the list). Then perform a match between the result list and your excel.
You need to do something like this :
var ids = so.Select(i=>i.PersonId).Distinct().ToList();
// Hitting Database just for this time to get all Users Ids
var usersIds = db.Entity.Where(u=>ids.Contains(u.ID)).Select(u=>u.ID).ToList();
foreach (var item in so)
{
if (usersIds.Contains(item.PersonId))
{
item.ValidationResult = "Successful";
}
else
{
item.ValidationResult = "Error: ";
}
}
return View(so.ToList());