C# - Entity Framework - Large seed data code-first - c#

I'm going to create an initial table in my app to store all the cities/states of my country.
This is a relative large data set: 5k+ registries.
Reading this post enlightened to me a good way to do this, altough I think that leaving a sql file, that will be imported by the EF, in the open is a security flaw.
The file format is irrelevant: I can make it a XLS or a TXT if I want; instead of executing it as a SQL command as show in the post I can simply read it as a stream and generate the objects as shown in the tutorial of the next hiperlink.
Reading this tutorial about data seed in code-first, I saw that the seed method will be executed in the database initialization process and the seed objects are generated in the seed method.
My questions:
About the seed methods, what is the best approach, the SQL-file approach or the object approach?
I personally think that the object approach is more secure, but can be slower, possibility that generates my second question:
The seed method that is executed in the database initialization process, is executed ONLY when the DB is created? This is a little unclear to me.
Thanks.

Reference System.Data in your database project and add the NuGet package EntityFramework.BulkInsert.
Insert your data in the seed method if you detect that it's not there yet:
protected override void Seed(BulkEntities context)
{
if (!context.BulkItems.Any())
{
var items = Enumerable.Range(0, 100000)
.Select(s => new BulkItem
{
Name = s.ToString(),
Status = "asdf"
});
context.BulkInsert(items, 1000);
}
}
Inserting 100,000 items takes about 3 seconds over here.

Related

Entity Framework update/insert multiple entities

Just a bit of an outline of what i am trying to accomplish.
We keep a local copy of a remote database (3rd party) within our application. To download the information we use an api.
We currently download the information on a schedule which then either inserts new records into the local database or updates the existing records.
here is how it currently works
public void ProcessApiData(List<Account> apiData)
{
// get the existing accounts from the local database
List<Account> existingAccounts = _accountRepository.GetAllList();
foreach(account in apiData)
{
// check if it already exists in the local database
var existingAccount = existingAccounts.SingleOrDefault(a => a.AccountId == account.AccountId);
// if its null then its a new record
if(existingAccount == null)
{
_accountRepository.Insert(account);
continue;
}
// else its a new record so it needs updating
existingAccount.AccountName = account.AccountName;
// ... continue updating the rest of the properties
}
CurrentUnitOfWork.SaveChanges();
}
This works fine, however it just feels like this could be improved.
There is one of these methods per Entity, and they all do the same thing (just updating different properties) or inserting a different Entity. Would there be anyway to make this more generic?
It just seems like a lot of database calls, would there be anyway to "Bulk" do this. I've had a look at this package which i have seen mentioned on a few other posts https://github.com/loresoft/EntityFramework.Extended
But it seems to focus on bulk updating a single property with the same value, or so i can tell.
Any suggestions on how i can improve this would be brilliant. I'm still fairly new to c# so i'm still searching for the best way to do things.
I'm using .net 4.5.2 and Entity Framework 6.1.3 with MSSQL 2014 as the backend database
For EFCore you can use this library:
https://github.com/borisdj/EFCore.BulkExtensions
Note: I'm the author of this one.
And for EF 6 this one:
https://github.com/TomaszMierzejowski/EntityFramework.BulkExtensions
Both are extending DbContext with Bulk operations and have the same syntax call:
context.BulkInsert(entitiesList);
context.BulkUpdate(entitiesList);
context.BulkDelete(entitiesList);
EFCore version have additionally BulkInsertOrUpdate method.
Assuming that the classes in apiData are the same as your entities, you should be able to use Attach(newAccount, originalAccount) to update an existing entity.
For bulk inserts I use AddRange(listOfNewEntitities). If you have a lot of entities to insert it is advisable to batch them. Also you may want to dispose and recreate the DbContext on each batch so that it's not using too much memory.
var accounts = new List<Account>();
var context = new YourDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
foreach (var account in apiData)
{
accounts.Add(account);
if (accounts.Count % 1000 == 0)
// Play with this number to see what works best
{
context.Set<Account>().AddRange(accounts);
accounts = new List<Account>();
context.ChangeTracker.DetectChanges();
context.SaveChanges();
context?.Dispose();
context = new YourDbContext();
}
}
context.Set<Account>().AddRange(accounts);
context.ChangeTracker.DetectChanges();
context.SaveChanges();
context?.Dispose();
For bulk updates, there's not anything built in in LINQ to SQL. There are however libraries and solutions to address this. See e.g. Here for a solution using expression trees.
List vs. Dictionary
You check in a list every time if the entity exists which is bad. You should create a dictionary instead to improve performance.
var existingAccounts = _accountRepository.GetAllList().ToDictionary(x => x.AccountID);
Account existingAccount;
if(existingAccounts.TryGetValue(account.AccountId, out existingAccount))
{
// ...code....
}
Add vs. AddRange
You should be aware of Add vs. AddRange performance when you add multiple records.
Add: Call DetectChanges after every record is added
AddRange: Call DetectChanges after all records is added
So at 10,000 entities, Add method have taken 875x more time to add entities in the context simply.
To fix it:
CREATE a list
ADD entity to the list
USE AddRange with the list
SaveChanges
Done!
In your case, you will need to create an InsertRange method to your repository.
EF Extended
You are right. This library updates all data with the same value. That is not what you are looking for.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library may perfectly fit for your enterprise if you want to improve your performance dramatically.
You can easily perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
Example:
public void ProcessApiData(List<Account> apiData)
{
// Insert or Update using the primary key (AccountID)
CurrentUnitOfWork.BulkMerge(apiData);
}

How to... Display Data from Database

I've got an Application which consists of 2 parts at the moment
A Viewer that receives data from a database using EF
A Service that manipulates data from the database at runtime.
The logic behind the scenes includes some projects such as repositories - data access is realized with a unit of work. The Viewer itself is a WPF-Form with an underlying ViewModel.
The ViewModel contains an ObservableCollection which is the datasource of my Viewer.
Now the question is - How am I able to retrieve the database-data every few minutes? I'm aware of the following two problems:
It's not the latest data my Repository is "loading" - does EF "smart" stuff and retrieves data from the local cache? If so, how can I force EF to load the data from the database?
Re-Setting the whole ObservableCollection or adding / removing entities from another thread / backgroundworker (with invokation) is not possible. How am I supposed to solve this?
I will add some of my code if needed but at the moment I don't think that this would help at all.
Edit:
public IEnumerable<Request> GetAllUnResolvedRequests() {
return AccessContext.Requests.Where(o => !o.IsResolved);
}
This piece of code won't get the latest data - I edit some rows manually (set IsResolved to true) but this method retrieves it nevertheless.
Edit2:
Edit3:
var requests = AccessContext.Requests.Where(o => o.Date >= fromDate && o.Date <= toDate).ToList();
foreach (var request in requests) {
AccessContext.Entry(request).Reload();
}
return requests;
Final Question:
The code above "solves" the problem - but in my opinion it's not clean. Is there another way?
When you access an entity on a database, the entity is cached (and tracked to track changes that your application does until you specify AsNoTracking).
This has some issues (for example, performance issues because the cache increases or you see an old version of entities that is your case).
For this reasons, when using EF you should work with Unit of work pattern (i.e. you should create a new context for every unit of work).
You can have a look to this Microsoft article to understand how implement Unit of work pattern.
http://www.asp.net/mvc/overview/older-versions/getting-started-with-ef-5-using-mvc-4/implementing-the-repository-and-unit-of-work-patterns-in-an-asp-net-mvc-application
In your case using Reload is not a good choice because the application is not scalable. For every reload you are doing a query to database. If you just need to return desired entities the best way is to create a new context.
public IEnumerable<Request> GetAllUnResolvedRequests()
{
return GetNewContext().Requests.Where(o => !o.IsResolved).ToList();
}
Here is what you can do.
You can define the Task (which keeps running on ThreadPool) that periodically checks the Database (consider that periodically making EF to reload data has its own cost).
And You can define SQL Dependency on your query so that when there is a change in data, you can notify the main thread for the same.

Migrating Db and Entity framework

I have to take data from an existing database and move it into a new database that has a new design. So the new database has other columns and tables than the old one.
So basically I need to read tables from the old database and put that data into the new structure, some data won't be used anymore and other data will be placed in other columns or tables etc.
My plan was to just read the data from the old database with basic queries like
Select * from mytable
and use Entity Framework to map the new database structure. Then I can basically do similar to this:
while (result.Read())
{
context.Customer.Add(new Customer
{
Description = (string) result["CustomerDescription"],
Address = (string) result["CuAdress"],
//and go on like this for all properties
});
}
context.saveChanges();
I think it is more convenient to do it like this to avoid writing massive INSERT-statements and so on, but is there any problems in doing like this? Is this considered bad for some reason that I don't understand. Poor performance or any other pitfalls? If anyone has any input on this it would be appreciated, so I don't start with this and it turns out to be a big no-no for some reason.
Something that you could perhaps also try, is merely to write a new DBContext class for the new target database.
Then simply write a console application with a static method which copies entities and properties from the one context to the other.
This will ensure that your referential integrity remains intact and saves you a lot of hassle in terms of having to write SQL code, since EF does all the heavy lifting for you in this regard.
If the dbContext contains a lot of entity dbsets I recommend that you use some sort of automapper.
But, this depends on the amount of data that you are trying to move. If we are talking terrabytes, I would rather suggest you do not take this approach.

Best practice to refresh an EF5 result from database

What is the best way to refresh data in Entity Framework 5? I've got an WPF application showing statistics from a database where data is changing all the time. Every 10 seconds the application is updating the result but the default behaviour for EF seems to be to cache the previous results. I would thus like a way to invalidate the previous results so a new set of data can be loaded.
The context of interest is defined in the following way:
public partial class MyEntities: DbContext
{
...
public DbSet<Stat> Stats { get; set; }
...
}
After some reading I was able to find a few approaches, but I have no idea of how efficient these ways are and if they come with downsides.
Create a new instance of the entities object
using (var db = new MyEntities())
{
var stats = from s in db.Stats ...
}
This works but feels inefficient because there are many other places where data is retrieved, and I don't want to reopen a new connection every time I need some data. Wouldn't it be more efficient to keep the connection open and do it another way?
Call refresh on the ObjectContext
var stats = from s in db.Stats ...
ObjectContext.Refresh(RefreshMode.StoreWins, stats );
This also assumes I'm extracting ObjectContext from the dbContext in this way:
private MyEntities db = null;
private ObjectContext ObjectContext
{
get
{
return ((IObjectContextAdapter)db).ObjectContext;
}
}
This is the solution I'm using as it is now. It seems simple. But I read somewhere that ObjectContext nowadays isn't directly accessible in DbContext because the EF team doesn't think that anyone would need it, and that you can do all things you need directly in DbContext. This makes me think that maybe this is not the best way to do it, or?
I know there is a reload method of dbContext.Entry but since I'm not reloading a single entity but rather retrieve a list of entities, I don't really know if this way will work. If I get 5 stat objects in the first query, save them in a list and do a reload on them when it's time to update, I might miss out others that have been added to the list on the database. Or have I completely misunderstood the reload method? Can I do a reload on a DbSetspecified in MyEntities?
There are a number of questions above but what I mainly want to know is what is the best practice in EF5 for asking the same query to the database over and over again? It might very well be something that I haven't discovered yet...
Actually, and even if it seems counter intuitive, the first option is the correct one, see this
DbContext are design to have short lifespans, hence their instantiation cost is quite low compared to the cost of reloading everything, it's mostly due to things like caching, and their data loading designs in general.
That's also why EF works so "naturally" well with ASP .NET MVC, since the DbContext is instantiated at each request.
That doesn't mean you have to create DbContext all over the place of course, in your context, using a DbContext per update operation (the one happening every 10secs) seems good enough, if during that operation you would need to delete a particular row, for example, you would pass the DbContext around, not create a new one.

How to mass insert/update in linq to sql?

How can I do these 2 scenarios.
Currently I am doing something like this
public class Repository
{
private LinqtoSqlContext dbcontext = new LinqtoSqlContext();
public void Update()
{
// find record
// update record
// save record ( dbcontext.submitChanges()
}
public void Insert()
{
// make a database table object ( ie ProductTable t = new ProductTable() { productname
="something"}
// insert record ( dbcontext.ProductTable.insertOnSubmit())
// dbcontext.submitChanges();
}
}
So now I am trying to load an XML file what has tons of records. First I validate the records one at a time. I then want to insert them into the database but instead of doing submitChanges() after each record I want to do a mass submit at the end.
So I have something like this
public class Repository
{
private LinqtoSqlContext dbcontext = new LinqtoSqlContext();
public void Update()
{
// find record
// update record
}
public void Insert()
{
// make a database table object ( ie ProductTable t = new ProductTable() { productname
="something"}
// insert record ( dbcontext.ProductTable.insertOnSubmit())
}
public void SaveToDb()
{
dbcontext.submitChanges();
}
}
Then in my service layer I would do like
for(int i = 0; i < 100; i++)
{
validate();
if(valid == true)
{
update();
insert()
}
}
SaveToDb();
So pretend my for loop is has a count for all the record found in the xml file. I first validate it. If valid then I have to update a table before I insert the record. I then insert the record.
After that I want to save everything in one go.
I am not sure if I can do a mass save when updating of if that has to be after every time or what.
But I thought it would work for sure for the insert one.
Nothing seems to crash and I am not sure how to check if the records are being added to the dbcontext.
The simple answer is: you do not. Linq2Sql is a lot of things - it is not a replacement for bulk upload / bulk copy. You will be a LOT more efficient using the ETL route:
Generate a flat file (csv etc.) with the new data
Load it into the database using bulk load mechanisms
If the data is updating etc. - load it into temporary tables and use the MERGE command to merge it into the main table.
Linq2Sql will by design always suck in mass insert scenarios. ORM's just are not ETL tools.
Linq2SQL (as has been noted) does not handle this well by default, but luckily there are some solutions out there. here's one i used for a website when i wanted to do some bulk deletes. It worked well for me and due to its use of extension methods it was basically indistinguishable from regular Lin2SQL methods.
I haven't really "released" this project yet, but it's a T4-based repository system that extends Linq To SQL and implements a bunch of batch operations (delete, update, create csv, etc.): http://code.google.com/p/grim-repo/. You can check out the source code and implement it however you see fit.
Also, this link has some great source code for batch operations: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx
And, also, I know it's tempting, but don't crap on the elderly. Try performing batch operations with DataAdapters/ADO.net: http://davidhayden.com/blog/dave/archive/2006/01/05/2665.aspx. It's faster, but inevitably hairier.
Finally, if you have an XML file, you can create a stored procedure that takes advantage of SQL server's built-in sproc, sp_xml_preparedocument. Check out how to use it here: http://msdn.microsoft.com/en-us/library/ms187367.aspx
Even when you add multiple records to the DataContext before calling SubmitChanges, LINQ2SQL will loop through and insert them one by one. You can verify this by implementing one of the partial methods on an entity class ("InsertMyObject(MyObject instance)"). It will be called for each pending row individually.
I don't see anything wrong with your plan -- you say it works, but you just don't know how to verify it? Can't you simply look in the database to check if the records got added?
Another way to see what records are pending in the DataContext and have not yet been added is to call GetChangeSet() on the data context and then refer to the "Inserts" property of the returned object to get a list of rows that will be inserted when SubmitChanges is called.

Categories

Resources