I have what would seem to be a common problem, yet I cannot figure out how to achieve the desired outcome. I have a nested entity with navigation properties defined on it, as seen in the following diagram.
The MapPoints collection can potentially be quite large for a given MapLine, and there can be a large number of MapLines for a MapLayer.
The question here is what is the best approach for getting a MapLayer object inserted into the database using Entity Framework and still maintain the relationships that are defined by the navigation properties?
A standard Entity Framework implementation
dbContext.MapLayers.Add(mapLayer);
dbContext.SaveChanges();
causes a large memory spike and pretty poor return times.
I have tried implementing the EntityFramework.BulkInsert package but it does not honor the relationships of the objects.
This seems like a problem someone has run into before, but I can't seem to find any resources that explain how to accomplish this task.
Update
I have tried to implement the suggestion provided by Richard, but I am not understanding how I would go about this for a nested entity such as the one I have described. I am running under the assumption that I need to insert the MapLayer object, then the MapLines, then the MapPoints to honor the PK/FK relationships in the database. I am currently trying the following code, but it does not appear to be correct.
dbContext.MapLayers.Add(mapLayer);
dbContext.SaveChanges();
List<MapLine> mapLines = new List<MapLine>();
List<MapPoint> mapPoints = new List<MapPoint>();
foreach (MapLine mapLine in mapLayer.MapLines)
{
//Update the mapPoints.MapLine properties to reflect the current line object
var updatedLines = mapLine.MapPoints.Select(x => { x.MapLine = mapLine; return x; }).ToList();
mapLines.AddRange(updatedLines);
}
using (TransactionScope scope = new TransactionScope())
{
MyDbContext context = null;
try
{
context = new MyDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
int count = 0;
foreach (var entityToInsert in mapLines)
{
++count;
context = AddToContext(context, entityToInsert, count, 100, true);
}
context.SaveChanges();
}
finally
{
if (context != null)
context.Dispose();
}
scope.Complete();
}
Update 2
After trying multiple ways to achieve this I finally gave up and just inserted the MapLayer as an entity, storing the MapLines => MapPoints relationship as raw JSON in a byte array on the MapLayer entity (since I am not querying against those structures, this works for me).
As the saying goes, "It ain't pretty, but it works".
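For illustration, a minimal sketch of that workaround, assuming Json.NET is available and that MapLayer has a byte[] property (called LineData here, a made-up name) to hold the payload:
using System.Text;
using Newtonsoft.Json;

// Serialize the nested MapLines => MapPoints graph to JSON and store it as bytes.
// ReferenceLoopHandling.Ignore avoids errors from the back-references in the navigation properties.
var settings = new JsonSerializerSettings { ReferenceLoopHandling = ReferenceLoopHandling.Ignore };
string json = JsonConvert.SerializeObject(mapLayer.MapLines, settings);
mapLayer.LineData = Encoding.UTF8.GetBytes(json);
mapLayer.MapLines = null; // the relationship is no longer persisted through EF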
I did have some success with the BulkInsert package and managing the relationships outside of EF, but again ran into a memory problem when trying to use EF to pull the data back into the system. It seems that currently, EF is not capable of handling large datasets and complex relationships efficiently.
I had a bad experience with huge context saves. All those recommendations about saving in iterations of 100 or 1,000 rows, then disposing the context, clearing lists, detaching objects, assigning null to everything, etc. - none of it really helps. We had a requirement to insert millions of rows daily into many tables. You definitely should not use Entity Framework under these conditions; you will be fighting memory leaks and a drop in insertion speed as the iterations proceed.
Our first improvement was creating stored procedures and adding them to the model. That is about 100 times faster than Context.SaveChanges(), with no leaks and no decrease in speed over time.
But it was not sufficient for us, so we decided to use SqlBulkCopy. It is super fast - on the order of 1000 times faster than using stored procedures.
So my suggestion is:
if you have many rows to insert but the count is under something like 50,000 rows, use stored procedures imported into the model;
if you have hundreds of thousands of rows, go and try SqlBulkCopy.
Here is some code:
// Context here is an ObjectContext; with a DbContext the store connection is also reachable via Database.Connection
EntityConnection ec = (EntityConnection)Context.Connection;
SqlConnection sc = (SqlConnection)ec.StoreConnection;

var copy = new SqlBulkCopy(sc, SqlBulkCopyOptions.CheckConstraints | SqlBulkCopyOptions.Default, null);
copy.DestinationTableName = "TableName";
copy.ColumnMappings.Add("SourceColumn", "DBColumn");
copy.WriteToServer(dataTable);
copy.Close();
If you use a DbTransaction with the context, you can manage to bulk insert within that transaction as well, but it needs some hacks.
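The snippet above assumes dataTable has already been populated. As a rough sketch under assumptions (a hypothetical in-memory list of MapPoint objects and destination columns named Id, MapLineId, Lat and Lng), the DataTable could be built like this:
using System.Data;

var dataTable = new DataTable();
dataTable.Columns.Add("Id", typeof(int));
dataTable.Columns.Add("MapLineId", typeof(int));
dataTable.Columns.Add("Lat", typeof(double));
dataTable.Columns.Add("Lng", typeof(double));

foreach (var p in mapPoints) // mapPoints is an assumed in-memory list of points
{
    // One DataRow per entity, column order matching the definitions above
    dataTable.Rows.Add(p.Id, p.MapLineId, p.Lat, p.Lng);
}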
Bulk Insert is not the only way to add data efficiently using Entity Framework - a number of alternatives are detailed in this answer. You can use the optimisations suggested there (such as disabling change tracking) and then just add things as normal.
Note that as you are adding many items at once, you'll need to recreate your context fairly frequently to avoid the memory growth and slowdown that you'll otherwise get.
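For example, a rough sketch of that pattern (the batch size, the entitiesToInsert list, and the context name are placeholders):
var context = new MyDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
int count = 0;

foreach (var entity in entitiesToInsert)
{
    context.Set<MapLine>().Add(entity);

    // Save and recreate the context every 1000 entities so the change tracker stays small
    if (++count % 1000 == 0)
    {
        context.SaveChanges();
        context.Dispose();
        context = new MyDbContext();
        context.Configuration.AutoDetectChangesEnabled = false;
    }
}

context.SaveChanges();
context.Dispose();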
Related
I apologise if this has been asked already; I am struggling with the terminology for what I am trying to find out, as it conflicts with functionality in Entity Framework.
What I am trying to do:
I would like to create an application that, on setup, gives the user the option to use one database as a "trial"/"startup" database, i.e. a non-production database. This would allow a user to trial the application but would not have backups etc.; in no way would this be a "production" database. This could be SQLite, for example.
When the user is ready, they could then click "convert to production" (or similar) and give it the target database machine/database. This would be considered the "production" environment, and could be something like MySQL, SQL Server, or whatever else EF connects to these days.
The question:
Does EF support this type of live migration/data transfer? Would it need another app where you could configure the EF source and EF destination, which would then run through the process of converting/seeding/populating one data source from the other?
Why I have asked here:
I have tried to search for things around this topic, but transferring/migration brings up totally unrelated subjects, so any help would be much appreciated.
From what you describe I don't think there is anything out of the box to support that. You can map a DbContext to either database, then it would be a matter of fetching and detaching entities from the evaluation DbContext and attaching them to the production one.
For a relatively simple schema / object graph this would be fairly straightforward to implement.
ICollection<Customer> customers = new List<Customer>();
using (var context = new AppDbContext(evalConnectionString))
{
    customers = context.Customers.AsNoTracking().ToList();
}
using (var context = new AppDbContext(productionConnectionString))
{   // Assuming an empty database...
    context.Customers.AddRange(customers);
    context.SaveChanges();
}
Though for more complex models this could take some work, especially when dealing with things like existing lookups/references. Where you want to move objects that might share the same reference to another object you would need to query the destination DbContext for existing relatives and substitute them before saving the "parent" entity.
ICollection<Order> orders = new List<Order>();
using (var context = new AppDbContext(evalConnectionString))
{
    orders = context.Orders
        .Include(x => x.Customer)
        .AsNoTracking()
        .ToList();
}
using (var context = new AppDbContext(productionConnectionString))
{
    var customerIds = orders.Select(x => x.Customer.CustomerId)
        .Distinct().ToList();
    var existingCustomers = context.Customers
        .Where(x => customerIds.Contains(x.CustomerId))
        .ToList();

    foreach (var order in orders)
    {   // Assuming all existing customers were loaded above, substitute them where found
        var existingCustomer = existingCustomers.SingleOrDefault(x => x.CustomerId == order.Customer.CustomerId);
        if (existingCustomer != null)
            order.Customer = existingCustomer;
        else
            existingCustomers.Add(order.Customer);

        context.Orders.Add(order);
    }
    context.SaveChanges();
}
This is a very simple example to outline how to handle scenarios where you may be inserting data with references that may or may not exist in the target DbContext. If we are copying across Orders and want to deal with their respective Customers, we first need to check whether a matching customer already exists in the target and use that reference, to avoid inserting a duplicate row or throwing an exception.
Normally, loading the orders and related references from one DbContext ensures that multiple orders referencing the same Customer entity all share the same entity reference. However, when we load detached entities via AsNoTracking() so that we can associate them with the new DbContext, detached references to the same record will not be the same object reference, so we need to treat these with care.
For example, where there are two orders for the same customer:
var ordersA = context.Orders.Include(x => x.Customer).ToList();
Assert.AreSame(ordersA[0].Customer, ordersA[1].Customer); // Passes

var ordersB = context.Orders.Include(x => x.Customer).AsNoTracking().ToList();
Assert.AreSame(ordersB[0].Customer, ordersB[1].Customer); // Fails
In the second example both orders are still for the same customer: each has a Customer reference with the same ID, but they are two different references because the DbContext is not tracking the references used. This is one of several "gotchas" with detached entities and efforts to boost performance. Using tracked references isn't ideal since those entities will still think they are associated with another DbContext; we can detach them, but that means diving through the object graph and detaching all references. (Doable, but messy compared to just loading them detached.)
Where it can also get complicated is when migrating data in batches (disposing of a DbContext regularly to avoid performance pitfalls with larger data volumes) or synchronizing data over time. It is generally advisable to first check the destination DbContext for matching records and use those, to avoid duplicate data being inserted (or exceptions being thrown). A batched copy might look something like the sketch below.
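As a rough sketch under assumptions (the page size and the OrderId ordering key are arbitrary choices, and the duplicate-customer handling from the earlier example is omitted for brevity):
const int batchSize = 1000;
int page = 0;
List<Order> batch;

do
{
    // Read one page of detached orders from the source database
    using (var source = new AppDbContext(evalConnectionString))
    {
        batch = source.Orders
            .Include(x => x.Customer)
            .AsNoTracking()
            .OrderBy(x => x.OrderId)
            .Skip(page * batchSize)
            .Take(batchSize)
            .ToList();
    }

    // Write the page with a fresh context so the change tracker stays small
    using (var target = new AppDbContext(productionConnectionString))
    {
        target.Orders.AddRange(batch);
        target.SaveChanges();
    }

    page++;
} while (batch.Count == batchSize);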
So for simple data models this is fairly straightforward. For more complex ones, where there is more data to bring across and more relationships between that data, it gets more complicated. For those systems I'd probably look at generating a database-to-database migration, such as creating INSERT statements for the desired target DB from the data in the source database; there it is just a matter of inserting the data in relational order to comply with the data constraints (either using a tool or rolling your own script generation).
(A) This version is slow... duration is measured in multiple minutes
DB is a typical EF data context to a SQL Server database
AA_Words_100 is a simple SQL Server table which is added to the EF designer
DB.AA_Words_100.Add is called ~3,000 times (confirmed via debugging with counter variables)
I have confirmed that >99% of the runtime is inside the inner loop
XCampaign is a Dictionary<string, Dictionary<string, _Word>> where _Word is a trivial non-EF object.
foreach (var XCampaign in Words100)
foreach (var KVP in XCampaign.Value)
DB.AA_Words_100.Add(KVP.Value.To_AA_Word_100());
DB.SaveChanges();
(B) This version is fast... - .Add() is simply commented out to narrow the scope
var iTemp = 0;
foreach (var XCampaign in Words100)
foreach (var KVP in XCampaign.Value)
iTemp++;
DB.SaveChanges();
(C) This version is fast. I simply fill up a List before calling DB.AddRange(...)
var LIST_WordsToAdd = new List<AA_Words_100>();
foreach (var XCampaign in Words100)
{
    foreach (var KVP in XCampaign.Value)
    {
        LIST_WordsToAdd.Add(KVP.Value.To_AA_Word_100());
    }
}
DB.AA_Words_100.AddRange(LIST_WordsToAdd);
DB.SaveChanges();
(D) Documentation
According to the DbContext.Add documentation:
Begins tracking the given entity, and any other reachable entities that are not already being tracked, in the Added state such that they will be inserted into the database when SaveChanges() is called.
In particular, when SaveChanges() is called.
I recently migrated to EF from Linq-to-SQL in this application. Linq-to-SQL did not have this problem.
What reason could there be for the DB.AA_Words_100.Add(...) command being so slow?
Thank you!
Update - To_AA_Word_100() Code
public AA_Words_100 To_AA_Word_100()
{
var W = new AA_Words_100();
W.Word = Word;
W.Word_Clean = Word.Replace("'", "");
W.PhraseCount = PhraseCount;
W.Clicks = Clicks;
W.Impressions = Impressions;
W.SalesAmt = SalesAmt;
W.SalesOrders = SalesOrders;
W.SalesUnits = SalesUnits;
W.Spend = Spend;
W.Campaign = Campaign;
return W;
}
As stated in the comments, Entity Framework (not sure about Entity Framework Core) by default calls DetectChanges on every Add. This function, among other things, scans all entities already tracked by the context to detect changes in and between them. That means the time complexity of this function is O(n), where n is the number of entities already tracked by the context. When you do a lot of adds in a loop, the overall time complexity becomes O(n^2), where n is the total number of items added. So even with small numbers like 3,000 rows, performance drops very significantly.
To fix this (arguably a design issue) there are a couple of options (see the sketch after this list):
set AutoDetectChangesEnabled on the context to false, then manually call DetectChanges before SaveChanges;
or use AddRange instead of adding entities one by one; it calls DetectChanges just once.
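A minimal sketch of both options, assuming the question's AA_Words_100 entity and a placeholder context name:
// Option 1: turn off automatic change detection, add in a loop, then detect changes once
using (var db = new MyDbContext())
{
    db.Configuration.AutoDetectChangesEnabled = false;

    foreach (var word in wordsToAdd)
        db.AA_Words_100.Add(word);

    db.ChangeTracker.DetectChanges();
    db.SaveChanges();
}

// Option 2: let AddRange call DetectChanges just once for the whole batch
using (var db = new MyDbContext())
{
    db.AA_Words_100.AddRange(wordsToAdd);
    db.SaveChanges();
}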
Some other notes:
Try to avoid reusing a context between operations. You said there were already 3000 entities tracked by the context before you called the first Add. It is better to create a new context every time you need it, do the work, then dispose it. The performance impact is negligible (connections are managed by the connection pool and are not necessarily opened or closed every time you create/dispose a context), but you will have far fewer problems like this one (reusing a context can bite you not only in the scenario you have now, but in several others).
Use AsNoTracking queries if you do not intend to modify the entities returned by a specific query (or if you intend to modify some of them later by attaching them to a context). The context will then not track them, which reduces the chance of this and other performance problems.
As for LINQ to SQL: it has a similar concept of "detect changes", but it is automatically called only before committing changes to the database, not on every add, so you do not see the same problem there.
Just a bit of an outline of what I am trying to accomplish.
We keep a local copy of a remote (3rd party) database within our application. To download the information we use an API.
We currently download the information on a schedule, which then either inserts new records into the local database or updates the existing records.
Here is how it currently works:
public void ProcessApiData(List<Account> apiData)
{
    // Get the existing accounts from the local database
    List<Account> existingAccounts = _accountRepository.GetAllList();

    foreach (var account in apiData)
    {
        // Check if it already exists in the local database
        var existingAccount = existingAccounts.SingleOrDefault(a => a.AccountId == account.AccountId);

        // If it's null then it's a new record
        if (existingAccount == null)
        {
            _accountRepository.Insert(account);
            continue;
        }

        // Otherwise it's an existing record, so it needs updating
        existingAccount.AccountName = account.AccountName;
        // ... continue updating the rest of the properties
    }

    CurrentUnitOfWork.SaveChanges();
}
This works fine; however, it feels like it could be improved.
There is one of these methods per entity, and they all do the same thing (inserting a different entity or updating different properties). Would there be any way to make this more generic?
It also seems like a lot of database calls; would there be any way to do this in "bulk"? I've had a look at this package, which I have seen mentioned on a few other posts: https://github.com/loresoft/EntityFramework.Extended
But it seems to focus on bulk updating a single property with the same value, as far as I can tell.
Any suggestions on how I can improve this would be brilliant. I'm still fairly new to C#, so I'm still searching for the best way to do things.
I'm using .NET 4.5.2 and Entity Framework 6.1.3 with MSSQL 2014 as the backend database.
For EFCore you can use this library:
https://github.com/borisdj/EFCore.BulkExtensions
Note: I'm the author of this one.
And for EF 6 this one:
https://github.com/TomaszMierzejowski/EntityFramework.BulkExtensions
Both extend DbContext with bulk operations and have the same calling syntax:
context.BulkInsert(entitiesList);
context.BulkUpdate(entitiesList);
context.BulkDelete(entitiesList);
The EFCore version additionally has a BulkInsertOrUpdate method.
Assuming that the classes in apiData are the same as your entities, you should be able to use Attach(newAccount, originalAccount) to update an existing entity.
For bulk inserts I use AddRange(listOfNewEntities). If you have a lot of entities to insert, it is advisable to batch them. You may also want to dispose and recreate the DbContext on each batch so that it's not using too much memory.
var accounts = new List<Account>();
var context = new YourDbContext();
context.Configuration.AutoDetectChangesEnabled = false;

foreach (var account in apiData)
{
    accounts.Add(account);

    // Play with this batch size to see what works best
    if (accounts.Count % 1000 == 0)
    {
        context.Set<Account>().AddRange(accounts);
        accounts = new List<Account>();
        context.ChangeTracker.DetectChanges();
        context.SaveChanges();

        // Dispose and recreate the context so the change tracker stays small
        context.Dispose();
        context = new YourDbContext();
        context.Configuration.AutoDetectChangesEnabled = false;
    }
}

// Flush the final partial batch
context.Set<Account>().AddRange(accounts);
context.ChangeTracker.DetectChanges();
context.SaveChanges();
context.Dispose();
For bulk updates, there's nothing built into LINQ to SQL. There are, however, libraries and solutions to address this. See e.g. here for a solution using expression trees.
List vs. Dictionary
You scan the list every time to check whether the entity exists, which is slow. You should create a dictionary instead to improve performance.
var existingAccounts = _accountRepository.GetAllList().ToDictionary(x => x.AccountId);

Account existingAccount;
if (existingAccounts.TryGetValue(account.AccountId, out existingAccount))
{
    // ...code....
}
Add vs. AddRange
You should be aware of the Add vs. AddRange performance difference when you add multiple records.
Add: calls DetectChanges after every record is added
AddRange: calls DetectChanges once after all records are added
So at 10,000 entities, the Add method takes 875x more time simply to add the entities to the context.
To fix it:
CREATE a list
ADD entity to the list
USE AddRange with the list
SaveChanges
Done!
In your case, you will need to create an InsertRange method on your repository, along the lines of the sketch below.
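A minimal sketch of such a repository method, assuming the repository wraps a DbContext (the class and field names here are placeholders):
public class Repository<TEntity> where TEntity : class
{
    private readonly DbContext _dbContext;

    public Repository(DbContext dbContext)
    {
        _dbContext = dbContext;
    }

    // AddRange calls DetectChanges only once for the whole batch
    public void InsertRange(IEnumerable<TEntity> entities)
    {
        _dbContext.Set<TEntity>().AddRange(entities);
    }
}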
EF Extended
You are right. This library updates all data with the same value. That is not what you are looking for.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library may be a perfect fit for your enterprise if you want to improve performance dramatically.
You can easily perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
Example:
public void ProcessApiData(List<Account> apiData)
{
// Insert or Update using the primary key (AccountID)
CurrentUnitOfWork.BulkMerge(apiData);
}
I have to take data from an existing database and move it into a new database that has a new design. So the new database has other columns and tables than the old one.
So basically I need to read tables from the old database and put that data into the new structure, some data won't be used anymore and other data will be placed in other columns or tables etc.
My plan was to just read the data from the old database with basic queries like
Select * from mytable
and use Entity Framework to map the new database structure. Then I can basically do something similar to this:
while (result.Read())
{
    context.Customer.Add(new Customer
    {
        Description = (string) result["CustomerDescription"],
        Address = (string) result["CuAdress"],
        //and go on like this for all properties
    });
}
context.SaveChanges();
I think it is more convenient to do it like this to avoid writing massive INSERT statements, but are there any problems with this approach? Is it considered bad for some reason that I don't understand, such as poor performance or other pitfalls? If anyone has any input on this it would be appreciated, so I don't start down this path only to find out it's a big no-no for some reason.
Something you could perhaps also try is to write a new DbContext class for the new target database.
Then simply write a console application with a static method that copies entities and properties from the one context to the other.
This will ensure that your referential integrity remains intact and saves you a lot of hassle in terms of having to write SQL, since EF does all the heavy lifting for you in this regard.
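A rough sketch of what that copy method might look like (the context, entity, and property names here are illustrative only, borrowing the column names from the question):
static void CopyCustomers()
{
    using (var oldDb = new OldDbContext())
    using (var newDb = new NewDbContext())
    {
        // Read detached rows from the old schema
        var oldCustomers = oldDb.Customers.AsNoTracking().ToList();

        foreach (var old in oldCustomers)
        {
            // Map old columns onto the new structure, dropping what is no longer used
            newDb.Customers.Add(new Customer
            {
                Description = old.CustomerDescription,
                Address = old.CuAdress
            });
        }

        newDb.SaveChanges();
    }
}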
If the DbContext contains a lot of entity DbSets, I recommend that you use some sort of automapper.
But this depends on the amount of data that you are trying to move. If we are talking terabytes, I would suggest you do not take this approach.
I am new to Entity Framework.
And I have one concern:
I need to walk through quite a big amount of data that is gathered via a LINQ to Entities query that combines a couple of properties from different entities in an anonymous type.
If I need to read the returned items of this query one by one until the end, am I at risk of an OutOfMemory exception because the collection is BIG, or does EF use a SqlDataReader implicitly?
(Or should I use EntityDataReader to ensure that I am reading the DB sequentially? But then I would have to build my query as a string, I guess.)
As I see it there are two things you can do. Firstly, turn off tracking by using .AsNoTracking(); in most cases this will cut your memory usage roughly in half, which may be enough.
If your set is still too big, use Skip and Take to pull down the result set in chunks. You should use this in conjunction with AsNoTracking to ensure no memory is consumed by tracking.
EDIT:
For example, you could use something like the following to loop through all items in chunks of 1000. The code below should only hold about 1000 items in memory at a time.
int numberOfItems = ctx.MySet.Count();

for (int i = 0; i < numberOfItems; i += 1000)
{
    // An OrderBy is required before Skip/Take in LINQ to Entities; Id is an assumed key property
    foreach (var item in ctx.MySet.AsNoTracking().OrderBy(x => x.Id).Skip(i).Take(1000).AsEnumerable())
    {
        //do stuff with your entity
    }
}
If the amount of data is as big as you say, I would recommend that you don't use EF for this case. EF is great, but you sometimes need to fall back to plain SQL to get better performance.
Take a look at Dapper.NET: https://github.com/SamSaffron/dapper-dot-net
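As a rough sketch of what that could look like (the table, column, and DTO names are placeholders):
using System.Data.SqlClient;
using Dapper;

using (var connection = new SqlConnection(connectionString))
{
    // buffered: false streams rows instead of materializing the whole result set in memory
    var items = connection.Query<MyDto>(
        "SELECT Id, Name, Value FROM dbo.MyTable WHERE Value > @minValue",
        new { minValue = 10 },
        buffered: false);

    foreach (var item in items)
    {
        // do stuff with each row
    }
}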
If you really want to use EF in every case, I would recommend that you use Bounded Contexts (multiple DbContexts).
Splitting your model into multiple smaller contexts will improve performance, as you will use fewer resources when EF creates an in-memory model of the context.
The larger the context is, the more resources are expended to generate and maintain that in-memory model.
Also, you can Detach and Attach entities when sharing instances across multiple contexts, so you still don't load the whole model.
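A minimal sketch of what bounded contexts could look like (the entity groupings here are hypothetical):
// Each context maps only the entities one area of the application needs,
// so EF builds a much smaller in-memory model for it.
public class OrderingContext : DbContext
{
    public DbSet<Order> Orders { get; set; }
    public DbSet<Customer> Customers { get; set; }
}

public class ReportingContext : DbContext
{
    public DbSet<SalesSummary> SalesSummaries { get; set; }
}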