Updating large amounts of data in an EF Migration - c#

I have a SQL table with a hundred million rows or so in it, and its schema is managed by EF migrations.
I want to change the values in an enum linked to that table, so I need to update all the existing values in the DB to the new values, something like the below:
this.Sql("UPDATE MyTable SET MyEnum=0 WHERE MyEnum=-1");
This would be fine on a smaller table, but because of the size of the table it's not really appropriate to run such a large update in one go (I get connection timeouts, tempdb space issues, transaction log space issues, etc.). It would be much preferable to do this in batches, e.g.:
while (ctx.MyTable.Any(m => m.MyEnum == -1))
{
    this.Sql("UPDATE TOP (1000000) MyTable SET MyEnum=0 WHERE MyEnum=-1");
}
Unfortunately I can't work out a way to read from the table during a migration (e.g. how can you evaluate ctx.MyTable.Any(m => m.MyEnum == -1) while the migration is running?). Is there a way to do this so that I can batch the update in my EF migration?

My recommendation, based on knowing how slow EF would be for you here (even when doing it in batches), would be to create a stored procedure that does exactly what you want, and then have the migration simply call that stored procedure. It should, hopefully, run through quickly since everything would be done on the SQL side of things.
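If you'd rather keep everything inside the migration than maintain a separate stored procedure, one alternative is to hand the whole batching loop to SQL Server and let it loop on @@ROWCOUNT, so nothing ever has to be read back through the context. A minimal sketch, assuming EF6 code-based migrations against SQL Server (the migration class name and batch size are made up; tune the batch size to your environment):

using System.Data.Entity.Migrations;

public partial class FixMyEnumValues : DbMigration
{
    public override void Up()
    {
        // suppressTransaction: true runs this outside the migration's own transaction,
        // so each UPDATE batch commits on its own and the log/tempdb don't balloon.
        Sql(@"
            WHILE 1 = 1
            BEGIN
                UPDATE TOP (100000) MyTable
                SET MyEnum = 0
                WHERE MyEnum = -1;

                IF @@ROWCOUNT = 0 BREAK;
            END",
            suppressTransaction: true);
    }

    public override void Down()
    {
        // Intentionally empty: the original -1 values cannot be reconstructed.
    }
}

Because the statement runs outside the migration transaction, a failure partway through leaves some rows already converted, but re-running the migration is safe since the update is idempotent.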

Related

Fastest Way To Upsert Using Postgres and C#

I am writing an application in C# that will copy data from one Postgres table to another on a regular basis. I am using the Npgsql library.
I have run into the following issue: When there are thousands of rows to be copied (> 10k), the program runs very slowly.
I have tried:
For my first attempt, I pulled the entirety of the destination table, then compared the data I was inserting to the data that already existed. Then, I would write an insert or update statement depending on whether it already existed but had alterations, or whether it did not exist at all. This was the worst solution, as every individual statement had to be sent as a command.
Next, I tried putting an "on conflict" trigger on the actual table. This let me send all of the inserts as bulk INSERT INTO.... statements, and the table would take care of updates. This was significantly faster, but not fast enough.
I read about Postgres's COPY method, but it does not seem to suit my needs. It seems that COPY will do ONLY an insert, and NOT an upsert. Because I am modifying this table several times, some of the data will be new, but some will be old rows that need updating.
Has anyone come up with a fast way to UPSERT, provided that I need an option to EDIT a row, not just do a blanket mass INSERT of all of my data?
Please let me know if I can provide any other information
Thank you so much for your time
First of all, I assume the tables are on different databases, otherwise I would just do this all in DML.
I think copy is definitely your friend. There is no faster way to extract or load data, and then you can let the database do the heavy lifting.
On the source database:
copy source_table
to '/var/tmp/foo.csv' csv;
On the destination database:
truncate temp_table;

copy temp_table
  from '/var/tmp/foo.csv' csv;

insert into destination_table
select *
  from temp_table t
 where not exists (
         select null
           from destination_table d
          where t.id = d.id
       );

update destination_table d
   set field1 = t.field1,
       field2 = t.field2
  from temp_table t
 where d.id = t.id and
       (d.field1 is distinct from t.field1 or
        d.field2 is distinct from t.field2);
It would be great if you could do something like this, if the data is readily available. A couple of other comments:
The insert into uses an anti-join, and this is my favorite construct for inserting missing records.
On the update, it's important to specify the criteria for what you update -- don't update everything, only those records that have changed. This will make a big difference in performance. Hopefully there is a set number of fields you can use to determine whether a record has changed.
If there is a field that indicates the record has been updated (last_update_date or something similar), a slightly lazier and wonderful approach is to delete those records and let the anti-join insert re-insert them. This would remove the need for the update statement and would be much less code for tables with lots of columns.
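If both databases are reachable from the C# application, the same idea can also be driven through Npgsql: binary COPY into a staging table, then one set-based INSERT ... ON CONFLICT, which (unlike plain COPY) does handle the update case. A rough sketch, assuming Npgsql 4+ and PostgreSQL 9.5+; staging_table, destination_table and the (id, field1) columns are placeholders, and destination_table needs a unique constraint on id:

using System.Collections.Generic;
using Npgsql;
using NpgsqlTypes;

static class Upserter
{
    // rows: whatever shape your data really has; (Id, Field1) is just a stand-in.
    public static void Upsert(NpgsqlConnection conn, IEnumerable<(int Id, string Field1)> rows)
    {
        // 1. Clear the staging table, then COPY the new data in (fastest load path Npgsql offers).
        using (var truncate = new NpgsqlCommand("TRUNCATE staging_table", conn))
            truncate.ExecuteNonQuery();

        using (var writer = conn.BeginBinaryImport(
            "COPY staging_table (id, field1) FROM STDIN (FORMAT BINARY)"))
        {
            foreach (var row in rows)
            {
                writer.StartRow();
                writer.Write(row.Id, NpgsqlDbType.Integer);
                writer.Write(row.Field1, NpgsqlDbType.Text);
            }
            writer.Complete();
        }

        // 2. One set-based statement does the actual upsert on the server.
        const string upsert = @"
            INSERT INTO destination_table (id, field1)
            SELECT id, field1 FROM staging_table
            ON CONFLICT (id) DO UPDATE
                SET field1 = EXCLUDED.field1
                WHERE destination_table.field1 IS DISTINCT FROM EXCLUDED.field1;";

        using (var cmd = new NpgsqlCommand(upsert, conn))
            cmd.ExecuteNonQuery();
    }
}

The WHERE ... IS DISTINCT FROM clause keeps the same "only touch rows that actually changed" property as the two-statement version above.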

Is there a way to have Entity Framework Core partially save data?

I'm working on some bulk inserts with Entity Framework Core. To minimize round trips to the database, the new inserts are batched in groups of 100 before being added to the database context and saved using SaveChanges().
The current problem is that if any of the batch fail to insert because of, e.g., unique key violations on the table, the entire transaction is rolled back. In this scenario it would be ideal for it to simply discard the records that were unable to be inserted, and insert the rest.
I'm more than likely going to need to write a stored procedure for this, but is there any way to have Entity Framework Core skip over rows that fail to insert?
In your stored procedure, use a MERGE statement instead of an INSERT, and then only use the WHEN NOT MATCHED clause:
MERGE targetTable AS existing WITH (HOLDLOCK)
USING @tvp AS incoming
    ON incoming.PK = existing.PK
WHEN NOT MATCHED THEN
    INSERT (PK, Col1 /* ...remaining columns... */)
    VALUES (incoming.PK, incoming.Col1);
The records that match will be discarded. @tvp is the table-valued parameter that is being passed to the stored proc from your app code.
There are locking considerations when using the MERGE statement that may or may not apply to your scenario. It's worth reading up on concurrency and atomicity for it to make sure you cover the rest of your bases.
If you are going the stored procedure route, then you can declare TVPs. In C#, when filling the TVP fails, you will know in the catch block that the batch failed, and can then recurse over those 100 rows (a sketch follows below).
Recursive function
The function breaks the N rows into two halves of N/2 and fills the TVP again for each half. If the first half is OK it will be inserted and, say, the second half will fail; on whichever half fails, you simply call the recursive function again. It keeps your good records in the TVP and the failed records separate. You can call this recursive function up to a depth of X, where X is a number such as 5, 6 or 7. After that, you will be left with only the bad records.
If you need to know more about TVPs, you can see this.
Note: you cannot use parallel query execution for this approach.
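The same split-and-retry idea can be sketched directly against SaveChanges, without a stored procedure. This is only a sketch in EF Core terms, assuming MyDbContext and MyEntity are your own types; it splits all the way down to single rows rather than stopping at a fixed depth X, and uses a fresh context per attempt so a failed SaveChanges doesn't leave stale tracked entities behind:

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

static class BisectingInserter
{
    // Returns the rows that could not be inserted; everything else ends up in the database.
    public static List<MyEntity> InsertWithBisection(List<MyEntity> batch, Func<MyDbContext> contextFactory)
    {
        var rejected = new List<MyEntity>();
        try
        {
            using (var ctx = contextFactory())
            {
                ctx.AddRange(batch);
                ctx.SaveChanges();              // the whole batch went in
            }
        }
        catch (DbUpdateException)
        {
            if (batch.Count == 1)
            {
                rejected.Add(batch[0]);         // isolated a bad row; skip it
            }
            else
            {
                int mid = batch.Count / 2;      // split and retry each half
                rejected.AddRange(InsertWithBisection(batch.Take(mid).ToList(), contextFactory));
                rejected.AddRange(InsertWithBisection(batch.Skip(mid).ToList(), contextFactory));
            }
        }
        return rejected;
    }
}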

Fastest way to insert 1 million rows in SQL Server [duplicate]

This question already has answers here:
Insert 2 million rows into SQL Server quickly
I am writing a stored procedure to insert rows into a table. The problem is that in some operations we might want to insert more than 1 million rows, and we want to make it fast. Another thing is that one of the columns is NVARCHAR(MAX), and we might want to put an average of 1000 characters in it.
Firstly, I wrote a proc to insert row by row. Then I generated some random data to insert, with the NVARCHAR(MAX) column set to a string of 1000 characters, and used a loop to call the proc for each row. The performance is very bad: it takes 48 minutes if I log on to the database server and insert from SQL Server itself. If I use C# to connect to the server from my desktop (which is what we usually want to do), it takes more than 90 minutes.
Then I changed the proc to take a table type parameter as the input. I prepared the rows, put them in the table type parameter, and did the insert with the following command:
INSERT INTO tableA SELECT * from #tableTypeParameterB
I tried batch sizes of 1000 rows and 3000 rows (putting 1000-3000 rows in #tableTypeParameterB per insert). The performance is still bad: it takes about 3 minutes to insert 1 million rows if I run it on the SQL Server machine, and about 10 minutes if I use a C# program to connect from my desktop.
The tableA has a clustered index with 2 columns.
My target is to make the insert as fast as possible (my ideal target is within 1 minute). Is there any way to optimize it?
Just an update:
I tried the bulk copy insert which was suggested by some people below. I tried using SqlBulkCopy to insert 1000 rows and 10000 rows at a time. The performance is still 10 minutes to insert 1 million rows (every row has a column with 1000 characters). There is no performance improvement. Are there any other suggestions?
An update based on what the comments asked for:
The data is actually coming from the UI. The user will use the UI to bulk select, say, one million rows and change one column from an old value to a new value. That operation will be done in a separate procedure, but what we need to do here is make the mid-tier service get the old values and new values from the UI and insert them into the table. The old value and new value may be up to 4000 characters, and the average is 1000 characters. I think the long old/new value strings slow things down, because when I change the test data's old/new values to 20-50 characters the insert is very fast, whether I use SqlBulkCopy or a table type variable.
I think what you are looking for is Bulk Insert if you prefer using SQL.
Or there is also the ADO.NET for Batch Operations option, so you keep the logic in your C# application. This article is also very complete.
Update
Yes, I'm afraid BULK INSERT will only work with imported files (read from within the database server).
I have experience from a Java project where we needed to insert millions of rows (the data came from outside the application, by the way).
The database was Oracle, so of course we used Oracle's multi-row insert. It turned out that the Java batch update was much faster than Oracle's multi-valued insert (the so-called "bulk updates").
My suggestion is:
Compare the performance of the SQL Server multi-value insert (which you can run from inside your database, in a procedure if you like) with the ADO.NET batch insert.
If the data you are going to manipulate is coming from outside your application (i.e. it is not already in the database), I would say just go for the ADO.NET batch inserts. I think that is your case.
Note: Keep in mind that batch inserts usually operate with the same query. That is what makes them so fast.
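For reference, the ADO.NET batch-operations option mentioned above boils down to a SqlDataAdapter with UpdateBatchSize set, so many parameterized INSERTs go out per round trip. A rough sketch; tableA's OldValue/NewValue columns are guesses based on the question's description:

using System.Data;
using System.Data.SqlClient;

static class AdoBatchInsert
{
    // rows: a DataTable whose rows are in the Added state (freshly added via rows.Rows.Add(...)).
    public static void BatchInsert(string connectionString, DataTable rows)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            var insert = new SqlCommand(
                "INSERT INTO tableA (OldValue, NewValue) VALUES (@OldValue, @NewValue)", conn);
            insert.Parameters.Add("@OldValue", SqlDbType.NVarChar, -1, "OldValue");   // -1 = NVARCHAR(MAX)
            insert.Parameters.Add("@NewValue", SqlDbType.NVarChar, -1, "NewValue");
            insert.UpdatedRowSource = UpdateRowSource.None;   // required for command batching

            using (var adapter = new SqlDataAdapter())
            {
                adapter.InsertCommand = insert;
                adapter.UpdateBatchSize = 1000;               // rows per round trip; worth tuning
                adapter.Update(rows);
            }
        }
    }
}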
Calling a proc in a loop incurs many round trips to SQL Server.
Not sure what batching approach you used, but you should look into table-valued parameters: docs are here. You'll still want to batch the writes.
You'll also want to consider memory on your server. Batching (say 10K at a time) might be a bit slower but might keep memory pressure lower on your server since you're buffering and processing a set at a time.
Table-valued parameters provide an easy way to marshal multiple rows of data from a client application to SQL Server without requiring multiple round trips or special server-side logic for processing the data. You can use table-valued parameters to encapsulate rows of data in a client application and send the data to the server in a single parameterized command. The incoming data rows are stored in a table variable that can then be operated on by using Transact-SQL.
Another option is bulk insert. TVPs benefit from re-use, however, so it depends on your usage pattern. The first link has a note comparing the two:
Using table-valued parameters is comparable to other ways of using set-based variables; however, using table-valued parameters frequently can be faster for large data sets. Compared to bulk operations that have a greater startup cost than table-valued parameters, table-valued parameters perform well for inserting less than 1000 rows. Table-valued parameters that are reused benefit from temporary table caching. This table caching enables better scalability than equivalent BULK INSERT operations.
Another comparison here: Performance of bcp/BULK INSERT vs. Table-Valued Parameters
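For completeness, the client side of the TVP route looks roughly like this. The type name dbo.RowTableType and the procedure dbo.usp_InsertRows are invented; both must already exist on the server, and the DataTable's columns have to match the TVP type:

using System.Data;
using System.Data.SqlClient;

static class TvpInsert
{
    public static void InsertViaTvp(string connectionString, DataTable rows)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_InsertRows", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // Pass the DataTable as a structured (table-valued) parameter.
            var p = cmd.Parameters.AddWithValue("@Rows", rows);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.RowTableType";

            conn.Open();
            cmd.ExecuteNonQuery();   // the whole batch goes over in a single round trip
        }
    }
}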
Here is an example of what I've used before with SqlBulkCopy. Granted, I was only dealing with around 10,000 records, but it inserted them within a few seconds after the query ran. My field names were the same, so it was pretty easy; you might have to modify the DataTable field names. Hope this helps.
private void UpdateMemberRecords(Int32 memberId)
{
    // _sourceDb is an open SqlConnection to the source database.
    string sql = string.Format("select * from Member where mem_id > {0}", memberId);
    try
    {
        // Pull the source rows into a DataTable.
        DataTable dt = new DataTable();
        using (SqlDataAdapter da = new SqlDataAdapter(new SqlCommand(sql, _sourceDb)))
        {
            da.Fill(dt);
        }
        Console.WriteLine("Member Count: {0}", dt.Rows.Count);

        // Bulk copy the rows into the destination table.
        using (SqlBulkCopy sqlBulk = new SqlBulkCopy(ConfigurationManager.AppSettings["DestDb"], SqlBulkCopyOptions.KeepIdentity))
        {
            sqlBulk.BulkCopyTimeout = 600;
            sqlBulk.DestinationTableName = "Member";
            sqlBulk.WriteToServer(dt);
        }
    }
    catch (Exception)
    {
        throw;
    }
}
If you have SQL Server 2014, then the speed of In-Memory OLTP is amazing:
http://msdn.microsoft.com/en-au/library/dn133186.aspx
Depending on your end goal, it may be a good idea to look into Entity Framework (or similar). This abstracts out the SQL so that you don't really have to worry about it in your client application, which is how things should be.
Eventually, you could end up with something like this:
using (DatabaseContext db = new DatabaseContext())
{
    for (int i = 0; i < 1000000; i++)
    {
        db.Table.Add(new Row() { /* column data goes here */ });
    }
    db.SaveChanges();
}
The key part here (and it boils down to a lot of the other answers) is that Entity Framework handles building the actual insert statement and committing it to the database.
In the above code, nothing will actually be sent to the database until SaveChanges is called and then everything is sent.
I can't quite remember where I found it, but there is research around that suggests it is worthwhile to call SaveChanges every so often. From memory, I think every 1000 entries is a good choice for committing to the database. Committing every single entry, compared to every 100 entries, doesn't provide much performance benefit, and 10,000 takes it past the limit. Don't take my word for that though; the numbers could be wrong. You seem to have a good grasp on the testing side of things, so have a play around with it.
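As a sketch of that periodic-commit idea (using the same placeholder DatabaseContext/Row types as the block above, and assuming the data is already sitting in a List<Row> called rows; the batch size of 1000 is just a starting point to tune):

const int batchSize = 1000;

for (int start = 0; start < rows.Count; start += batchSize)
{
    // A fresh context per batch keeps the change tracker small and the commits incremental.
    using (var db = new DatabaseContext())
    {
        foreach (var row in rows.Skip(start).Take(batchSize))
        {
            db.Table.Add(row);
        }
        db.SaveChanges();   // one commit per 1000 rows rather than one for the whole million
    }
}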

Recommend usage of temp table or table variable in Entity Framework 4. Update Performance Entity framework

I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a sproc to take the Ids, but there could be 200,000 - 300,000 rows involved that need this flag set, so it's probably not the most efficient way. Using the IN statement has limitations with regard to the amount of data that can be passed, and performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a sproc to create a temp table, but this would not exist from the model's perspective.
Is there a way to dynamically add entities at run time? [Or is this approach just going to cause headaches?]
I'm making the assumption above though that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider?
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique - a Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.:
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round trips. EF is object-oriented: nice for lots of scenarios, but not this one.
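A rough sketch of the shape Marc describes; the table, column and flag names are invented, and the join/update is shown here as inline SQL where Marc would put it in a sproc:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class BatchFlagUpdater
{
    public static void FlagRows(string connectionString, IEnumerable<Guid> rowIds)
    {
        Guid batchId = Guid.NewGuid();

        // Build the (BatchId, RowId) pairs in memory.
        var table = new DataTable();
        table.Columns.Add("BatchId", typeof(Guid));
        table.Columns.Add("RowId", typeof(Guid));
        foreach (var id in rowIds)
            table.Rows.Add(batchId, id);

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Push all 300k pairs to the server in one bulk copy.
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.UpdateBatch" })
                bulk.WriteToServer(table);

            // One set-based join does the update, then the batch rows are removed.
            const string sql = @"
                UPDATE t SET t.MyFlag = 1
                FROM MainTable t
                JOIN dbo.UpdateBatch b ON b.RowId = t.Id
                WHERE b.BatchId = @batchId;

                DELETE FROM dbo.UpdateBatch WHERE BatchId = @batchId;";

            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@batchId", batchId);
                cmd.ExecuteNonQuery();
            }
        }
    }
}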
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to deal with an aim/guideline to keep within the Entity Framework while not utilizing sprocs, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId varchar columns.
This table was added to the EF model, so we did not use temporary tables.
On upload of these Ids the table is populated with them [inserts are quick enough using EF, we find].
We then use context.ExecuteStoreCommand to run SQL that joins the Item table to the main table and updates the bit field in the main table for records that exist for the batch Id created specifically for that session.
We finally clear this table for that batchId.
We have the performance, keeping within our no-sproc goal. [Which not all of us agree with :) but it's a democracy.]
Our exact requirements are a little more complex, but insofar as needing good update performance using the Entity Framework given our specific restrictions, it works fine.
Liam

how can I speed up insertion of many rows to a table via ADO.NET?

I have a table that has 5 columns: AcctId (int), Address1 (varchar), Address2 (varchar), Person1 (varchar), Person2 (varchar). I'm generating random data to insert into this table via a C# console application. I've tried doing this random data insert via SQL Server and decided it was not a good solution -- SQL is not good at generating random values on a per-row basis. Generating the random data -- 975k rows of it -- takes a minimal amount of time. It's in a List of custom objects.
I need to take this random data and update many rows in the database with the new random data. I tried updating the rows one at a time, which was very slow because of the repeated searching of the List object in code. So I think the best approach is to put all the randomized data into a table in the database, then update all the other tables that use this data, i.e. UPDATE t SET t.Address1=d.Address1 FROM Table1 t INNER JOIN RandomizedData d ON d.AcctId = t.Acct_ID. The database is very un-normalized, so this Acct data is sprinkled all over the place. I have no control over the normalization.
So, having decided to insert all of the randomized data into a single table, I set out to create insert scripts:
USE TheDatabase

INSERT tmp_RandomizedData
SELECT 1,'4392 EIGHTH AVE','','JENNIFER CARTER','BARBARA CARTER' UNION ALL
SELECT 2,'2168 MAIN ST','HNGR F','DANIEL HERNANDEZ','SUSAN MARTIN'
-- etc., another 98 times...
-- FYI, this is not real data!
I'm building this INSERT script in batches of 100. It's taking on average 175 ms to run each insert. Does this seem like a long time? It's going to take about 35 mins to run the whole insert.
The table doesn't have a primary key or any indexes. I was planning on adding those after all the data is inserted (thinking that that would be faster).
Is there a better way to do this?
The SqlBulkCopy class in .NET can blast records in pretty quickly. I used it to transfer data from an iSeries database to SQL tables very rapidly.
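A rough sketch of feeding the in-memory list straight into the staging table with SqlBulkCopy; RandomAcct is a stand-in for the custom object in your List, and its properties mirror the five columns from the question:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class RandomDataLoader
{
    public static void BulkLoad(string connectionString, List<RandomAcct> data)
    {
        // Shape a DataTable to match tmp_RandomizedData.
        var table = new DataTable();
        table.Columns.Add("AcctId", typeof(int));
        table.Columns.Add("Address1", typeof(string));
        table.Columns.Add("Address2", typeof(string));
        table.Columns.Add("Person1", typeof(string));
        table.Columns.Add("Person2", typeof(string));

        foreach (var r in data)
            table.Rows.Add(r.AcctId, r.Address1, r.Address2, r.Person1, r.Person2);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "tmp_RandomizedData";
            bulk.BatchSize = 50000;           // per the batch-size advice below
            bulk.WriteToServer(table);        // one call, no per-row INSERT statements
        }
    }
}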
Use BCP. You can use this article as a guide. It's for VB6 but the gist is exactly the same. The trick is to use the BULK INSERT command.
... Having read more of your question: you might also want to look at Red Gate's SQL Data Generator; it generates tons of data really, really fast.
Use larger batches, 50,000 to 75,000 rows. On SQL 2000 on hardware from 2000, the sweet spot for inserts was 50,000 rows. This was on a live production database, with indexes, during the day on a very large table.
Small batch sizes are better for inserts into a highly active table and where there is a high deadlock risk. Is anyone else using this table while you're doing inserts?
Is this a one time import? Let it run over night.
Finally, INSERT statements executed via ADO.NET aren't really an optimal ETL solution. SSIS, DTS (or any other ETL solution, such as Talend) would be more appropriate for heavy-duty data moving. On the other hand, if all you have is a hammer...
