How can I do a bulk insert into two tables at the same time? I have two tables that should both be affected by the bulk copy, because they depend on each other.
If a record fails while inserting into table 1, I want the insert rolled back and table 2 never updated. Likewise, if the insert into table 1 succeeds but the update of table 2 fails, table 1 should be rolled back.
Can this be done with bulk copy?
Edit
I should have mentioned that I am doing the bulk insert through C#.
It looks roughly like this, but this is an example I have been working from, so I am not sure whether I have to change it to use a stored procedure (I am not sure how that would look, or how the C# code would look).
private static void BatchBulkCopy()
{
// Get the DataTable
DataTable dtInsertRows = GetDataTable();
using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
{
sbc.DestinationTableName = "TBL_TEST_TEST";
// Number of records to be processed in one go
sbc.BatchSize = 500000;
// Map the source columns of the DataTable to the destination columns of the SQL Server table
// sbc.ColumnMappings.Add("ID", "ID");
sbc.ColumnMappings.Add("NAME", "NAME");
// Number of records after which client has to be notified about its status
sbc.NotifyAfter = dtInsertRows.Rows.Count;
// Event that gets fired when NotifyAfter number of records are processed.
sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);
// Finally write to server
sbc.WriteToServer(dtInsertRows);
sbc.Close();
}
}
No - the whole point of SqlBulkCopy is to get data into your database as fast as possible. It will just dump the data into a single table.
The normal use case will be to then inspect that table once it's imported, and begin to "split up" that data and store it into whatever place it needs to go - typically through a stored procedure (since the data already is on the server, and you want to distribute it to other tables - you don't want to pull all that data back down to the client, inspect it, and then send it back to the server one more time).
SqlBulkCopy only grabs a bunch of data and drops it into a table - very quickly so. It cannot split up data into multiple tables based on criteria or conditions.
You can run bulk inserts inside a user-defined transaction, so you could do something like this:
BEGIN TRANSACTION MyDataLoad
BEGIN TRY
BULK INSERT ...
BULK INSERT ...
COMMIT TRANSACTION MyDataLoad
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION
END CATCH
However, there may be other ways to accomplish what you want. Are the tables empty before you bulk insert into them? When you say the tables depend on each other, do you mean that there are foreign key constraints you want enforced?
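For the C# side, a rough counterpart of the T-SQL above might look like the sketch below. SqlBulkCopy has a constructor that takes an existing SqlConnection and SqlTransaction, so two WriteToServer calls can share one transaction. The class and method names, and the table names TBL_PARENT and TBL_CHILD, are placeholders and not taken from the question.

```csharp
using System.Data;
using System.Data.SqlClient;

static class TwoTableBulkLoad
{
    // Copies both DataTables inside one transaction: either both tables
    // receive their rows, or neither does.
    public static void Load(string connectionString, DataTable parentRows, DataTable childRows)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (SqlTransaction tx = conn.BeginTransaction())
            {
                try
                {
                    // Placeholder table names (assumption, not from the question).
                    BulkCopy(conn, tx, parentRows, "TBL_PARENT");
                    BulkCopy(conn, tx, childRows, "TBL_CHILD");
                    tx.Commit();
                }
                catch
                {
                    // Any failure in either copy undoes both.
                    tx.Rollback();
                    throw;
                }
            }
        }
    }

    private static void BulkCopy(SqlConnection conn, SqlTransaction tx, DataTable rows, string table)
    {
        // CheckConstraints asks the server to enforce constraints during the
        // copy; by default bulk copy does not check them.
        using (var sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.CheckConstraints, tx))
        {
            sbc.DestinationTableName = table;
            sbc.WriteToServer(rows);
        }
    }
}
```

Since bulk copy can only insert, an UPDATE on the second table would have to be an ordinary SqlCommand with its Transaction property set to the same tx.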
Related
We are using SqlBulkCopy.WriteToServer to bulk insert into SQL Server, and it works very well. However, it fails when a record already exists. We need to "ignore" the rows that already exist and insert the ones that do not.
SqlException: Violation of PRIMARY KEY constraint 'PK__Pharmacy__3214EC072C1E8537'.
Cannot insert duplicate key in object 'dbo.Pharmacy'. The duplicate key value is (797cba76-8bbd-4dbd-a360-4f8e8a6ef85b)
How can we use SqlBulkCopy.WriteToServer to insert rows only if they do not already exist, without the whole operation failing?
try
{
// Write from the source to the destination.
bulkCopy.WriteToServer(dt);
}
catch (Exception ex)
{
throw new Exception($"BulkInject error in {dt.TableName}", ex);
}
Update:
It's important to mention that this works well most of the time: about 98% of the time the bulk insert succeeds. Only about 2% of the time do some rows already exist, which causes the bulk insert to fail.
What we need: to "ignore" those rows if they exist.
What we do: transfer data from a source database to a destination database. It's not a full transfer; we transfer a subset of the source data. The destination database is NOT empty, it already contains data, so update is NOT an option. We need to insert only if the row does not exist.
There are around 30 tables that we bulk insert from the source to the destination database, so we have one generic function that does the field mapping, bulk inserting, and so on for all of these tables.
Again, what we need: we are using SqlBulkCopy.WriteToServer and we need to "ignore" rows if they exist. Thanks.
SqlBulkCopy, as the name suggests, is for copying (inserting) records in bulk; it cannot perform updates. This is where a table-valued parameter comes to the rescue: it lets us pass multiple records in a DataTable to a stored procedure, where we can do the processing.
SQL Server 2008 introduced the MERGE statement, which performs an INSERT when a record is not present in the table and an UPDATE when it is.
First, create a user-defined table type in SQL Server.
Then create a stored procedure like the following, which accepts the whole DataTable as a parameter, inserts the records that are not yet in the table, and updates the ones that already exist.
CREATE PROCEDURE [dbo].[Update_Pharmacy]
@tblCustomers CustomerType READONLY
AS
BEGIN
SET NOCOUNT ON;
MERGE INTO Customers c1
USING @tblCustomers c2
ON c1.CustomerId=c2.Id
WHEN MATCHED THEN
UPDATE SET c1.Name = c2.Name
,c1.Country = c2.Country
WHEN NOT MATCHED THEN
INSERT VALUES(c2.Id, c2.Name, c2.Country);
END
The performance will be lower than with the bulk copy utility, but it is an option I have used while loading a reporting database from a real-time database.
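On the C# side, passing the DataTable to that procedure could look roughly like this. It is a sketch under assumptions: the table type is taken to live in the dbo schema as dbo.CustomerType, and the DataTable columns are assumed to match the type's columns (Id, Name, Country) in order. The class and method names are made up for illustration.

```csharp
using System.Data;
using System.Data.SqlClient;

static class PharmacyUpsert
{
    // Sends all rows to the stored procedure in a single round trip.
    public static void Send(string connectionString, DataTable rows)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.Update_Pharmacy", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // The table-valued parameter of the procedure (T-SQL parameter
            // names start with @). Structured marks the value as a TVP.
            SqlParameter p = cmd.Parameters.AddWithValue("@tblCustomers", rows);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.CustomerType"; // assumed schema-qualified type name

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

If existing rows should be left alone rather than updated, the WHEN MATCHED branch can be dropped from the MERGE, which leaves only the insert of rows that are not yet present.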
I want to bulk insert from a CSV file into a MySQL database using C#. I'm using MySql.Data.MySqlClient for the connection. The CSV columns map to multiple tables, and those tables depend on primary key values. For example:
CSV(column & value): -
emp_name, address,country
-------------------------------
jhon,new york,usa
amanda,san diago,usa
Brad,london,uk
DB Schema(CountryTbl) & value
country_Id,Country_Name
1,usa
2,UK
3,Germany
DB Schema(EmployeeTbl)
Emp_Id(AutoIncrement),Emp_Name
DB Schema(AddressTbl)
Address_Id(AutoIncrement), Emp_Id,Address,countryid
Problem statement:
1> Read data from the CSV and look up the CountryId in "CountryTbl" for each employee.
2> Insert data into EmployeeTbl and AddressTbl with the CountryId.
Approach 1
Follow the steps in the problem statement above, but that will be a performance hit (row-by-row read and insert).
Approach 2
Use the bulk insert option, "MySqlBulkLoader", but that needs a CSV file to read from, and it looks like this option is not going to work for me.
Approach 3
Use a stored procedure for the upload. But I don't want to use a stored procedure.
Please suggest any other option or approach for doing the bulk upload.
Unless you have hundreds of thousands of rows to upload, bulk loading (your approach 2) probably is not worth the extra programming and debugging time it will cost. That's my opinion, for what it's worth (2x what you paid for it :)
Approaches 1 and 3 are more or less the same. The difference lies in whether you issue the queries from C# or from your stored procedure. You still have to work out the queries. So let's deal with 1.
The solutions to these sorts of problems depend on the make and model of the RDBMS. If you decide you want to migrate to SQL Server, you'll have to change this stuff.
Here's what you do. For each row of your employee csv ...
... Put a row into the employee tbl
INSERT INTO EmployeeTbl (Emp_Name) VALUES (@emp_name);
Notice this query uses the INSERT ... VALUES form of the insert query. When this query (or any insert query) runs, it drops the autoincremented Emp_Id value where a subsequent invocation of LAST_INSERT_ID() can get it.
... Put a row into the address table
INSERT INTO AddressTbl (Emp_Id,Address,countryid)
SELECT LAST_INSERT_ID() AS Emp_Id,
@address AS Address,
country_id AS countryid
FROM CountryTbl
WHERE Country_Name = @country;
Notice this second INSERT uses the INSERT ... SELECT form of the insert query. The SELECT part of all this generates one row of data with the column values to insert.
It uses LAST_INSERT_ID() to get Emp_Id,
it uses a constant provided by your C# program for @address, and
it looks up the countryid value from your pre-existing CountryTbl.
Notice, of course, that you must use the C# Parameters.AddWithValue() method to set the values of the @ parameters in these queries. Those values come from your CSV file.
Finally, wrap each thousand rows or so of your csv in a transaction, by preceding their INSERT statements with a START TRANSACTION; statement and ending them with a COMMIT; statement. That will get you a performance improvement, and if something goes wrong the entire transaction will get rolled back so you can start over.
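Put together in C#, the two queries and the transaction might look like the sketch below. It assumes the CSV has already been parsed into string arrays of (emp_name, address, country) and that the caller passes in one chunk of about a thousand rows per call. The class and method names are illustrative only.

```csharp
using System.Collections.Generic;
using MySql.Data.MySqlClient;

static class EmployeeCsvLoader
{
    // Inserts one chunk of parsed CSV rows inside a single transaction.
    public static void LoadChunk(string connectionString, IEnumerable<string[]> rows)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            using (MySqlTransaction tx = conn.BeginTransaction())
            using (var insertEmp = new MySqlCommand(
                "INSERT INTO EmployeeTbl (Emp_Name) VALUES (@emp_name);", conn, tx))
            using (var insertAddr = new MySqlCommand(
                "INSERT INTO AddressTbl (Emp_Id, Address, countryid) " +
                "SELECT LAST_INSERT_ID(), @address, country_Id " +
                "FROM CountryTbl WHERE Country_Name = @country;", conn, tx))
            {
                try
                {
                    foreach (string[] row in rows)
                    {
                        insertEmp.Parameters.Clear();
                        insertEmp.Parameters.AddWithValue("@emp_name", row[0]);
                        insertEmp.ExecuteNonQuery();

                        // LAST_INSERT_ID() picks up the Emp_Id generated just
                        // above. If the country is not in CountryTbl, the
                        // SELECT yields no row and no address is inserted.
                        insertAddr.Parameters.Clear();
                        insertAddr.Parameters.AddWithValue("@address", row[1]);
                        insertAddr.Parameters.AddWithValue("@country", row[2]);
                        insertAddr.ExecuteNonQuery();
                    }
                    tx.Commit();
                }
                catch
                {
                    // Undo the whole chunk so it can be retried.
                    tx.Rollback();
                    throw;
                }
            }
        }
    }
}
```

Both statements must run on the same connection, because LAST_INSERT_ID() is tracked per connection.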
I am writing a stored procedure to insert rows into a table. In some operations we may need to insert more than 1 million rows, and we want that to be fast. In addition, one of the columns is NVARCHAR(MAX), and we expect to put about 1000 characters in it on average.
First, I wrote a procedure that inserts row by row. I generated random test data, with the NVARCHAR(MAX) column holding a 1000-character string, and called the procedure in a loop to insert the rows. The performance is very bad: it takes 48 minutes when I run it while logged on to the database server itself. When I connect to the server from a C# program on my desktop (which is what we usually want to do), it takes more than 90 minutes.
Then I changed the procedure to take a table type parameter as input. I prepared the rows, put them in the table type parameter, and did the insert with the following command:
INSERT INTO tableA SELECT * from @tableTypeParameterB
I tried batch sizes of 1000 and 3000 rows (that is, 1000-3000 rows in @tableTypeParameterB per insert). The performance is still bad: it takes about 3 minutes to insert 1 million rows when I run it on the SQL Server machine, and about 10 minutes from a C# program on my desktop.
tableA has a clustered index on 2 columns.
My goal is to make the insert as fast as possible (my ideal target is under 1 minute). Is there any way to optimize it?
Just an update:
I tried the bulk copy insert that some people suggested below. I used SqlBulkCopy to insert 1000 rows and 10000 rows at a time. It still takes 10 minutes to insert 1 million rows (every row has a column with 1000 characters), so there is no performance improvement. Are there any other suggestions?
An update, as requested in the comments:
The data actually comes from the UI. The user will use the UI to bulk select, say, one million rows and change one column from an old value to a new value. That operation is done in a separate procedure. What we need here is for the mid-tier service to get the old value and new value from the UI and insert them into the table. The old and new values may be up to 4000 characters long, with an average of 1000. I think the long old/new strings are what slows things down, because when I change the test data to 20-50 characters, the insert is very fast whether I use SqlBulkCopy or a table type variable.
I think what you are looking for is Bulk Insert if you prefer using SQL.
Or there is also the ADO.NET for Batch Operations option, so you keep the logic in your C# application. This article is also very complete.
Update
Yes I'm afraid bulk insert will only work with imported files (from within the database).
I have an experience in a Java project where we needed to insert millions of rows (data came from outside the application btw).
Database was Oracle, so of course we used the multi-line insert of Oracle. It turned out that the Java batch update was much faster than the multi-valued insert of Oracle (so called "bulk updates").
My suggestion is:
Compare the performance of SQL Server's multi-value insert (which lets you read from inside your database, in a procedure if you like) with the ADO.NET batch insert.
If the data you are going to manipulate comes from outside your application (that is, it is not already in the database), I would just go for the ADO.NET batch inserts. I think that is your case.
Note: Keep in mind that batch inserts usually operate with the same query. That is what makes them so fast.
Calling a procedure in a loop incurs many round trips to SQL Server.
I'm not sure which batching approach you used, but you should look into table-valued parameters: Docs are here. You'll still want to write in batches.
You'll also want to consider memory on your server. Batching (say 10K at a time) might be a bit slower but might keep memory pressure lower on your server since you're buffering and processing a set at a time.
Table-valued parameters provide an easy way to marshal multiple rows
of data from a client application to SQL Server without requiring
multiple round trips or special server-side logic for processing the
data. You can use table-valued parameters to encapsulate rows of data
in a client application and send the data to the server in a single
parameterized command. The incoming data rows are stored in a table
variable that can then be operated on by using Transact-SQL.
Another option is bulk insert. TVPs benefit from re-use however so it depends on your usage pattern. The first link has a note about comparing:
Using table-valued parameters is comparable to other ways of using
set-based variables; however, using table-valued parameters frequently
can be faster for large data sets. Compared to bulk operations that
have a greater startup cost than table-valued parameters, table-valued
parameters perform well for inserting less than 1000 rows.
Table-valued parameters that are reused benefit from temporary table
caching. This table caching enables better scalability than equivalent
BULK INSERT operations.
Another comparison here: Performance of bcp/BULK INSERT vs. Table-Valued Parameters
Here is an example of what I've used before with SqlBulkCopy. Granted, I was only dealing with around 10,000 records, but it inserted them within a few seconds of the query running. My field names were the same, so it was pretty easy. You might have to modify the DataTable field names. Hope this helps.
private void UpdateMemberRecords(Int32 memberId)
{
string sql = string.Format("select * from Member where mem_id > {0}", memberId);
try {
DataTable dt = new DataTable();
using (SqlDataAdapter da = new SqlDataAdapter(new SqlCommand(sql, _sourceDb))) {
da.Fill(dt);
}
Console.WriteLine("Member Count: {0}", dt.Rows.Count);
using (SqlBulkCopy sqlBulk = new SqlBulkCopy(ConfigurationManager.AppSettings["DestDb"], SqlBulkCopyOptions.KeepIdentity)) {
sqlBulk.BulkCopyTimeout = 600;
sqlBulk.DestinationTableName = "Member";
sqlBulk.WriteToServer(dt);
}
} catch (Exception ex) {
throw;
}
}
If you have SQL Server 2014, the speed of In-Memory OLTP is amazing:
http://msdn.microsoft.com/en-au/library/dn133186.aspx
Depending on your end goal, it may be a good idea to look into Entity Framework (or similar). This abstracts out the SQL so that you don't really have to worry about it in your client application, which is how things should be.
Eventually, you could end up with something like this:
using (DatabaseContext db = new DatabaseContext())
{
for (int i = 0; i < 1000000; i++)
{
db.Table.Add(new Row(){ /* column data goes here */});
}
db.SaveChanges();
}
The key part here (and it boils down to a lot of the other answers) is that Entity Framework handles building the actual insert statement and committing it to the database.
In the above code, nothing will actually be sent to the database until SaveChanges is called and then everything is sent.
I can't quite remember where I found it, but there is research around that suggests it is worth while to call SaveChanges every so often. From memory, I think every 1000 entries is a good choice for committing to the database. Committing every entry, compared to every 100 entries, doesn't provide much performance benefit and 10000 takes it past the limit. Don't take my word for that though, the numbers could be wrong. You seem to have a good grasp on the testing side of things though, so have a play around with things.
In my SQL Server table I have the following columns defined:
stationid,dateofevent,itemname,sitename,clicks
A C# application populates this table by inserting data in a loop. The data comes from a remote machine; once the server (another C# application) receives it, it inserts it into SQL Server and sends an OK response back to the remote client. When the client receives the response, it archives the data into another table and deletes it from the actual table.
In case the client fails to archive, a stored procedure on the server side prevents duplicate records from being inserted.
Set @previousClickCount=( SELECT Clicks FROM [Analytics] as pc
where DATEADD(dd, 0, DATEDIFF(dd, 0, [DateOfEvent]))=@date
and ItemType=@type
and stationId = @stationId
and ItemName=@itemName
and SiteName=@siteName)
If @previousClickCount Is Null
Begin
-- Row for this item is not found in DB so inserting a new row
Insert into Analytics(StationId,DateOfEvent,WeekOfYear,MonthOfYear,Year,ItemType,ItemName,Clicks,SiteName)
VALUES(@stationId,@date,DATEPART(wk,@date),DATEPART(mm,@date),DATEPART(YYYY,@date),@type,@itemName,@clicks,@siteName)
End
Later we decided to move to bulk insert in the server-side code, so that we can avoid looping.
bulkCopy.DestinationTableName = SqlTableName;
bulkCopy.BatchSize = 1000;
bulkCopy.BulkCopyTimeout = 1000;
bulkCopy.WriteToServer(dataTable);
But if the client side fails to archive the data, the server will insert duplicate records. Is there any way to check for this during the bulk insert, or can we add a constraint such as: insert only if the item name is not already present for that particular date?
Regards
Sangeetha
There is absolutely a way to do checks with SqlBulkCopy: by not doing them there.
Do not insert into the final table (which is better anyway, since SqlBulkCopy's locking is seriously badly programmed) but into a temporary table (which you can create dynamically).
Then you can execute custom SQL to move the data into the final table, for example with a MERGE statement that avoids duplicates. This not only gives you much better multi-client behavior (as you avoid the really bad lock behavior of SqlBulkCopy on the real data table) but also allows all kinds of ETL work to happen on the uploaded data.
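As a sketch of that staging approach in C#: the code below bulk copies into a session-local temp table and then inserts only the rows that have no match in the real table. It uses INSERT ... WHERE NOT EXISTS in place of MERGE, purely to keep the SQL short. The column list is taken from the question's stored procedure; the dbo schema, the class and method names, and the assumption that the DataTable columns are in the same order as the staging columns are mine.

```csharp
using System.Data;
using System.Data.SqlClient;

static class StagedAnalyticsInsert
{
    public static void InsertMissing(string connectionString, DataTable rows)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Create an empty temp table shaped like the incoming data.
            // It lives only for this connection's session.
            Exec(conn,
                "SELECT TOP 0 StationId, DateOfEvent, ItemType, ItemName, SiteName, Clicks " +
                "INTO #Staging FROM dbo.Analytics;");

            // Bulk copy on the same connection, so #Staging is visible.
            // Without explicit mappings, columns are matched by position.
            using (var sbc = new SqlBulkCopy(conn))
            {
                sbc.DestinationTableName = "#Staging";
                sbc.WriteToServer(rows);
            }

            // Move only rows with no existing match for the same day.
            Exec(conn, @"
INSERT INTO dbo.Analytics
    (StationId, DateOfEvent, WeekOfYear, MonthOfYear, [Year],
     ItemType, ItemName, Clicks, SiteName)
SELECT s.StationId, s.DateOfEvent, DATEPART(wk, s.DateOfEvent),
       DATEPART(mm, s.DateOfEvent), DATEPART(yyyy, s.DateOfEvent),
       s.ItemType, s.ItemName, s.Clicks, s.SiteName
FROM #Staging s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.Analytics a
    WHERE a.StationId = s.StationId
      AND a.ItemType = s.ItemType
      AND a.ItemName = s.ItemName
      AND a.SiteName = s.SiteName
      AND CAST(a.DateOfEvent AS date) = CAST(s.DateOfEvent AS date));");
        }
    }

    private static void Exec(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}
```

This does not remove duplicates that occur within the uploaded batch itself; those would need a GROUP BY or similar in the SELECT.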
I am doing a conversion with SqlBulkCopy. I currently have an IList collection of classes, which I can convert to a DataTable for use with SqlBulkCopy.
The problem is that I can have 3 records with the same ID.
Let me explain. Here are 3 records:
ID Name Address
1 Scott London
1 Mark London
1 Manchester
Basically I need to insert them sequentially. I insert record 1 if it doesn't exist; for the next record, if the ID already exists, I need to update the existing record rather than insert a new one (notice the ID is still 1). So in the case of the second record, I replace both the Name and Address columns on ID 1.
Finally, for the 3rd record, notice that Name is missing but the ID is 1 and the Address is Manchester. So I need to update the record WITHOUT changing Name, while updating Address to Manchester. After the 3rd record, ID 1 would therefore be:
ID Name Address
1 Mark Manchester
Any ideas how i can do this? i am at a loss.
Thanks.
EDIT
OK, a little update. I will merge my records before using SqlBulkCopy. Is it possible to get a list of what succeeded and what failed, or is it a case of all or nothing? I presume there is no alternative to doing updates separately from SqlBulkCopy?
It would be ideal to be able to insert everything and have the rows that failed go into a temp table. Then I would only need to worry about correcting the rows in my failed table, since I would know the others are all OK.
Since you need to load that data into a DataTable anyway (unless you are writing a custom IDataReader), you should merge the records before giving them to SqlBulkCopy; for example (in pseudo code):
/* create empty data-table */
foreach(item in list) {
var row = /* try to get existing row from data-table based on item's id */
if(row == null) { row = /* create row from item and append it to data-table */ }
else { /* merge non-trivial properties of item into existing row */ }
}
then pass the DataTable to SqlBulkCopy once you have the desired data.
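A runnable version of that pseudo code might look like this. It is only a sketch: the Person class and the ID/Name/Address columns are modelled on the sample records in the question, and "non-trivial" is taken to mean "not null or empty".

```csharp
using System.Collections.Generic;
using System.Data;

class Person
{
    public int Id;
    public string Name;
    public string Address;
}

static class RecordMerger
{
    // Collapses records that share an ID into one DataTable row each.
    // Later records overwrite earlier values, except that null or empty
    // fields leave the existing value untouched.
    public static DataTable Merge(IEnumerable<Person> list)
    {
        var table = new DataTable();
        table.Columns.Add("ID", typeof(int));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Address", typeof(string));

        var rowsById = new Dictionary<int, DataRow>();
        foreach (Person p in list)
        {
            DataRow row;
            if (!rowsById.TryGetValue(p.Id, out row))
            {
                // First time this ID is seen: create and append a row.
                row = table.NewRow();
                row["ID"] = p.Id;
                table.Rows.Add(row);
                rowsById[p.Id] = row;
            }
            if (!string.IsNullOrEmpty(p.Name)) row["Name"] = p.Name;
            if (!string.IsNullOrEmpty(p.Address)) row["Address"] = p.Address;
        }
        return table;
    }
}
```

With the three sample records (1, Scott, London), (1, Mark, London) and (1, no name, Manchester), this yields a single row: 1, Mark, Manchester.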
Re the edit; in that scenario, I would upload to a staging table (just a regular table that has a schema like the uploaded data, but typically no foreign keys etc), then use regular TSQL to move the data into the transactional tables. In addition to full TSQL support this also allows better logging of operations. In particular, perhaps look at the OUTPUT clause of INSERT which can help complex bulk operations.
You can't do updates with bulk copy (bulk insert), only insert. Hence the name.
You need to fix the data before you insert them. If this means you have updates to pre-existing rows, you can't insert those as that will generate the key conflict.
You can bulk insert into a temporary table and then run the appropriate insert or update statements; or insert only the new rows and issue update statements for the rest; or delete the pre-existing rows after fetching them, fix the data, and reinsert.
But there's no way to persuade bulk copy to update an existing row.