I'm looking for any advice on what's the optimum way of inserting a large number of records into a database (SQL2000 onwards) based upon a collection of objects.
Currently my code looks something like the snippet below, and each record is inserted using a simple SQL INSERT command, opening and closing the database connection each time the function is called (I'm sure this must slow things down?).
The routine needs to be able to cope with routinely inserting up to 100,000 records, and I was wondering if there is a faster way (I'm sure there must be?). I've seen a few posts mentioning XML-based data and bulk copy routines - is this something I should consider, or can anyone provide any simple examples which I could build upon?
foreach (DictionaryEntry de in objectList)
{
    eRecord record = (eRecord)de.Value;
    if (!record.Deleted)
    {
        createDBRecord(record.Id,
                       record.Index,
                       record.Name,
                       record.Value);
    }
}
Thanks for any advice,
Paul.
Doing it that way will be relatively slow. You need to consider a bulk insert technique, either using BCP or BULK INSERT, or, if you are using .NET 2.0, the SqlBulkCopy class. Here's an example of using SqlBulkCopy: SqlBulkCopy - Copy Table Data Between SQL Servers at High Speeds
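As a rough sketch of how your dictionary loop could feed SqlBulkCopy instead (the destination table name, the column types, and the connection string below are assumptions you would need to adjust to your schema):
// Requires: using System.Collections; using System.Data; using System.Data.SqlClient;
private static void BulkInsertRecords(IDictionary objectList, string connectionString)
{
    // Stage the non-deleted records in a DataTable whose columns mirror the target table.
    DataTable table = new DataTable();
    table.Columns.Add("Id", typeof(int));        // assumed column types
    table.Columns.Add("Index", typeof(int));
    table.Columns.Add("Name", typeof(string));
    table.Columns.Add("Value", typeof(string));

    foreach (DictionaryEntry de in objectList)
    {
        eRecord record = (eRecord)de.Value;
        if (!record.Deleted)
        {
            table.Rows.Add(record.Id, record.Index, record.Name, record.Value);
        }
    }

    // One connection and one bulk operation, instead of one INSERT per record.
    using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.eRecords";   // assumed table name
        bulk.BatchSize = 10000;                       // send in chunks rather than all at once
        bulk.WriteToServer(table);
    }
}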
Here's a simple example for SQL Server 2000:
If you have a CSV file, csvtest.txt, in the following format:
1,John,Smith
2,Bob,Hope
3,Kate,Curie
4,Peter,Green
This SQL script will load the contents of the CSV file into a database table (if any row contains errors it will not be inserted, but the other rows will be):
USE myDB
GO
CREATE TABLE CSVTest
(
ID INT,
FirstName VARCHAR(60),
LastName VARCHAR(60)
)
GO
BULK INSERT
CSVTest
FROM 'c:\temp\csvtest.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
SELECT *
FROM CSVTest
GO
You could write out the dictionary contents into a CSV file and then BULK INSERT that file.
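A minimal sketch of that CSV step, assuming none of the values contain commas or line breaks (otherwise you'd need quoting/escaping) and that the column order matches the target table; the file path is just an example:
// Requires: using System.Collections; using System.IO;
using (StreamWriter writer = new StreamWriter(@"c:\temp\records.csv"))
{
    foreach (DictionaryEntry de in objectList)
    {
        eRecord record = (eRecord)de.Value;
        if (!record.Deleted)
        {
            // One comma-separated line per record, in the same column order as the table.
            writer.WriteLine("{0},{1},{2},{3}", record.Id, record.Index, record.Name, record.Value);
        }
    }
}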
See also: Using bcp and BULK INSERT
If you implement your own IDataReader, you can avoid writing to an intermediate file. See ADO.NET 2.0 Tutorial : SqlBulkCopy Revisited for Transferring Data at High Speeds
Related SO question: how to pass variables like arrays / datatable to SQL server?
I want to perform a bulk insert from CSV to a MySQL database using C#, and I'm using MySql.Data.MySqlClient for the connection. The CSV columns map to multiple tables and depend on a primary key value, for example:
CSV (columns & values):
emp_name, address,country
-------------------------------
jhon,new york,usa
amanda,san diago,usa
Brad,london,uk
DB Schema(CountryTbl) & value
country_Id,Country_Name
1,usa
2,UK
3,Germany
DB Schema(EmployeeTbl)
Emp_Id(AutoIncrement),Emp_Name
DB Schema(AddressTbl)
Address_Id(AutoIncrement), Emp_Id,Address,countryid
Problem statement:
1. Read the data from the CSV and get the CountryId from CountryTbl for the respective employee.
2. Insert the data into EmployeeTbl and AddressTbl with that CountryId.
Approach 1
Follow the problem-statement steps above, but that will be a performance hit (row-by-row read and insert).
Approach 2
Use "Bulk Insert" option "MySqlBulkLoader", but that needs csv files to read, and looks that this option is not going to work for me.
Approach 3
Use a stored procedure for the upload. But I don't want to use a stored procedure.
Please suggest any other option by which I can do the bulk upload, or suggest any other approach.
Unless you have hundreds of thousands of rows to upload, bulk loading (your approach 2) probably is not worth the extra programming and debugging time it will cost. That's my opinion, for what it's worth (2x what you paid for it :)
Approaches 1 and 3 are more or less the same. The difference lies in whether you issue the queries from C# or from your stored procedure. You still have to work out the queries. So let's deal with 1.
The solutions to these sorts of problems depend on make and model of RDBMS. If you decide you want to migrate to SQL Server, you'll have to change this stuff.
Here's what you do. For each row of your employee CSV ...
... put a row into the employee table:
INSERT INTO EmployeeTbl (Emp_Name) VALUES (@emp_name);
Notice this query uses the INSERT ... VALUES form of the insert query. When this query (or any insert query) runs, it leaves the autoincremented Emp_Id value where a subsequent invocation of LAST_INSERT_ID() can pick it up.
... put a row into the address table:
INSERT INTO AddressTbl (Emp_Id, Address, countryid)
SELECT LAST_INSERT_ID() AS Emp_Id,
       @address AS Address,
       country_Id AS countryid
FROM CountryTbl
WHERE Country_Name = @country;
Notice this second INSERT uses the INSERT ... SELECT form of the insert query. The SELECT part of all this generates one row of data with the column values to insert.
It uses LAST_INSERT_ID() to get Emp_Id,
it uses a constant provided by your C# program for @address, and
it looks up the countryid value from your pre-existing CountryTbl.
Notice, of course, that you must use the C# Parameters.AddWithValue() method to set the values of the @ parameters in these queries. Those values come from your CSV file.
Finally, wrap each thousand rows or so of your csv in a transaction, by preceding their INSERT statements with a START TRANSACTION; statement and ending them with a COMMIT; statement. That will get you a performance improvement, and if something goes wrong the entire transaction will get rolled back so you can start over.
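Put together, the C# side might look roughly like this. The table and column names come from the question; the connection string, the CsvRow holder type, and the batch size of 1000 are assumptions:
// Requires: using MySql.Data.MySqlClient;
// Sketch: one employee + one address insert per CSV row, committed in blocks of ~1000 rows.
using (MySqlConnection conn = new MySqlConnection(connectionString))
{
    conn.Open();
    MySqlTransaction tx = conn.BeginTransaction();
    int rowsInTx = 0;

    foreach (CsvRow row in csvRows)   // CsvRow is a hypothetical holder for emp_name/address/country
    {
        using (var emp = new MySqlCommand(
            "INSERT INTO EmployeeTbl (Emp_Name) VALUES (@emp_name)", conn, tx))
        {
            emp.Parameters.AddWithValue("@emp_name", row.EmpName);
            emp.ExecuteNonQuery();
        }

        using (var addr = new MySqlCommand(
            "INSERT INTO AddressTbl (Emp_Id, Address, countryid) " +
            "SELECT LAST_INSERT_ID(), @address, country_Id " +
            "FROM CountryTbl WHERE Country_Name = @country", conn, tx))
        {
            addr.Parameters.AddWithValue("@address", row.Address);
            addr.Parameters.AddWithValue("@country", row.Country);
            addr.ExecuteNonQuery();   // LAST_INSERT_ID() is per-connection, so it still holds here
        }

        if (++rowsInTx >= 1000)   // commit every thousand rows or so
        {
            tx.Commit();
            tx = conn.BeginTransaction();
            rowsInTx = 0;
        }
    }

    tx.Commit();
}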
I am writing a stored procedure to insert rows into a table. The problem is that in some operations we might want to insert more than 1 million rows, and we want to make it fast. Another thing is that one of the columns is NVARCHAR(MAX), and we might want to put an average of 1000 characters into it.
Firstly, I wrote a proc to insert row by row. Then I generated some random data for the insert, with the NVARCHAR(MAX) column set to a string of 1000 characters, and used a loop to call the proc to insert the rows. The performance is very bad: it takes 48 minutes if I log on to the database server and run the insert from SQL Server there. If I use C# to connect to the server from my desktop (which is what we usually want to do), it takes more than 90 minutes.
Then, I changed the proc to take a table type parameter as the input. I prepared the rows, put them in the table type parameter, and did the insert with the following command:
INSERT INTO tableA SELECT * FROM @tableTypeParameterB
I tried batch sizes of 1000 rows and 3000 rows (putting 1000-3000 rows in @tableTypeParameterB per insert). The performance is still bad. It takes about 3 minutes to insert 1 million rows if I run it on the SQL Server box, and about 10 minutes if I connect from my desktop with a C# program.
The tableA has a clustered index with 2 columns.
My target is to make the insert as fast as possible (my ideal target is within 1 minute). Is there any way to optimize it?
Just an update:
I tried the bulk copy insert which was suggested by some people below. I tried using SqlBulkCopy to insert 1000 rows and 10,000 rows at a time. The performance is still 10 minutes to insert 1 million rows (every row has a column with 1000 characters). There is no performance improvement. Are there any other suggestions?
An update based on what the comments asked for:
The data is actually coming from the UI. The user will use the UI to bulk-select, say, one million rows and change one column from an old value to a new value. This operation will be done in a separate procedure, but here what we need to do is make the mid-tier service get the old value and new value from the UI and insert them into the table. The old value and new value may be up to 4000 characters, and the average is 1000 characters. I think the long old/new value strings slow things down, because when I change the test data's old/new values to 20-50 characters the insert is very fast, whether I use SqlBulkCopy or a table type variable.
I think what you are looking for is Bulk Insert if you prefer using SQL.
Or there is also the ADO.NET for Batch Operations option, so you keep the logic in your C# application. This article is also very complete.
Update
Yes, I'm afraid BULK INSERT will only work with imported files (read from the database server's side).
I had an experience in a Java project where we needed to insert millions of rows (the data came from outside the application, btw).
The database was Oracle, so of course we used the multi-row insert of Oracle. It turned out that the Java batch update was much faster than Oracle's multi-valued insert (the so-called "bulk updates").
My suggestion is:
Compare the performance of the multi-value insert in SQL Server code (which you can then run from inside your database, in a procedure if you like) with the ADO.NET batch insert.
If the data you are going to manipulate is coming from outside your application (if it is not already in the database), I would say just go for the ADO.NET batch inserts. I think that's your case.
Note: Keep in mind that batch inserts usually operate with the same query. That is what makes them so fast.
Calling a proc in a loop incurs many round trips to SQL Server.
Not sure what batching approach you used, but you should look into table-valued parameters: docs are here. You'll still want to batch the writes.
You'll also want to consider memory on your server. Batching (say 10K at a time) might be a bit slower but might keep memory pressure lower on your server since you're buffering and processing a set at a time.
Table-valued parameters provide an easy way to marshal multiple rows of data from a client application to SQL Server without requiring multiple round trips or special server-side logic for processing the data. You can use table-valued parameters to encapsulate rows of data in a client application and send the data to the server in a single parameterized command. The incoming data rows are stored in a table variable that can then be operated on by using Transact-SQL.
Another option is bulk insert. TVPs benefit from re-use, however, so it depends on your usage pattern. The first link has a note comparing the two:
Using table-valued parameters is comparable to other ways of using set-based variables; however, using table-valued parameters frequently can be faster for large data sets. Compared to bulk operations that have a greater startup cost than table-valued parameters, table-valued parameters perform well for inserting less than 1000 rows. Table-valued parameters that are reused benefit from temporary table caching. This table caching enables better scalability than equivalent BULK INSERT operations.
Another comparison here: Performance of bcp/BULK INSERT vs. Table-Valued Parameters
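If you want to avoid materialising a DataTable on the client, a TVP can also be streamed from an IEnumerable of SqlDataRecord. A rough sketch; the table type dbo.MyRowType, its two columns, and the data collection are assumptions:
// Requires: using System.Collections.Generic; using System.Data;
// using System.Data.SqlClient; using Microsoft.SqlServer.Server;
static IEnumerable<SqlDataRecord> ToRecords(IEnumerable<KeyValuePair<int, string>> rows)
{
    // Column layout must match the user-defined table type on the server.
    SqlMetaData[] layout =
    {
        new SqlMetaData("Id", SqlDbType.Int),
        new SqlMetaData("Txt", SqlDbType.NVarChar, 1000)
    };

    foreach (var row in rows)
    {
        SqlDataRecord record = new SqlDataRecord(layout);
        record.SetInt32(0, row.Key);
        record.SetString(1, row.Value);
        yield return record;   // rows are produced lazily, nothing is buffered client-side
    }
}

// Usage: pass the enumerable as a structured parameter typed as the table type.
using (SqlCommand cmd = new SqlCommand("INSERT INTO tableA SELECT Id, Txt FROM @rows", connection))
{
    SqlParameter p = cmd.Parameters.AddWithValue("@rows", ToRecords(data));
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.MyRowType";   // assumed table type: (Id INT, Txt NVARCHAR(1000))
    cmd.ExecuteNonQuery();
}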
Here is an example of what I've used before with SqlBulkCopy. Granted, I was only dealing with around 10,000 records, but it inserted them within a few seconds of the query running. My field names were the same, so it was pretty easy. You might have to modify the DataTable field names. Hope this helps.
// Requires: using System; using System.Configuration; using System.Data;
// using System.Data.SqlClient;
private void UpdateMemberRecords(Int32 memberId)
{
    string sql = string.Format("select * from Member where mem_id > {0}", memberId);
    try {
        // Pull the source rows into a DataTable.
        DataTable dt = new DataTable();
        using (SqlDataAdapter da = new SqlDataAdapter(new SqlCommand(sql, _sourceDb))) {
            da.Fill(dt);
        }
        Console.WriteLine("Member Count: {0}", dt.Rows.Count);

        // Bulk-copy the rows into the destination table, keeping the identity values.
        using (SqlBulkCopy sqlBulk = new SqlBulkCopy(ConfigurationManager.AppSettings["DestDb"], SqlBulkCopyOptions.KeepIdentity)) {
            sqlBulk.BulkCopyTimeout = 600;
            sqlBulk.DestinationTableName = "Member";
            sqlBulk.WriteToServer(dt);
        }
    } catch (Exception ex) {
        throw;
    }
}
If you have SQL Server 2014, then the speed of In-Memory OLTP is amazing:
http://msdn.microsoft.com/en-au/library/dn133186.aspx
Depending on your end goal, it may be a good idea to look into Entity Framework (or similar). This abstracts out the SQL so that you don't really have to worry about it in your client application, which is how things should be.
Eventually, you could end up with something like this:
using (DatabaseContext db = new DatabaseContext())
{
    for (int i = 0; i < 1000000; i++)
    {
        db.Table.Add(new Row() { /* column data goes here */ });
    }
    db.SaveChanges();
}
The key part here (and it boils down to a lot of the other answers) is that Entity Framework handles building the actual insert statement and committing it to the database.
In the above code, nothing will actually be sent to the database until SaveChanges is called and then everything is sent.
I can't quite remember where I found it, but there is research around suggesting it is worthwhile to call SaveChanges every so often. From memory, I think every 1000 entries is a good choice for committing to the database. Committing every entry, or every 100 entries, doesn't provide much performance benefit, and 10,000 takes it past the limit. Don't take my word for that though; the numbers could be wrong. You seem to have a good grasp on the testing side of things, so have a play around with it.
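For what it's worth, a minimal sketch of that batching idea, assuming an EF6-style DbContext (the batch size of 1000 and the AutoDetectChangesEnabled toggle are things to test, not gospel):
using (DatabaseContext db = new DatabaseContext())
{
    // EF6: skip per-Add change detection while bulk-adding; re-enable afterwards if needed.
    db.Configuration.AutoDetectChangesEnabled = false;

    for (int i = 0; i < 1000000; i++)
    {
        db.Table.Add(new Row() { /* column data goes here */ });

        if (i % 1000 == 0)
        {
            db.SaveChanges();   // flush the batch so the change tracker doesn't grow unbounded
        }
    }

    db.SaveChanges();   // commit whatever is left in the final partial batch
}
Some people also recreate the context every few batches so the change tracker starts empty; worth measuring both ways.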
I must sync two tables across two databases, one of which is MSSQL and the other MySQL, and I have to do this through a Windows Service. I have currently created a Windows Service, and what I have added to it so far is: getting the data from the MySQL database and inserting it into a data adapter, from which I plan to move the data to the MSSQL database using an insert inside a transaction. Can you tell me what the best approach to this problem is, and whether what I'm doing right now is on the right track? This is the first time I'm doing such a thing.
Well, there are probably a couple of ways to solve this. You didn't mention the count and size of your tables or the desired synchronization frequency, so my answer might not be 100% relevant. Anyway, my proposal is:
Each table should have an additional column, SyncDate (TIMESTAMP), which is null by default.
The sync logic implemented in the Windows Service periodically checks each table for data to sync by executing a query like this (use a plain IDataReader for performance reasons):
SELECT * from TEST_TABLE where SyncDate is null
If the above statement returns data, then:
collect a package of rows to insert (for example 500 rows)
begin a transaction on the target database and execute the insert statements for the collected package, setting the SyncDate column to DateTime.Now (a C# sketch of one such package transfer follows this list)
when the insert is complete, execute a statement like this in the source db, with @syncDate set to the same DateTime.Now value:
UPDATE TEST_TABLE SET SyncDate = @syncDate WHERE ID_PRIMARY_KEY IN (IDENTIFIERS FROM PACKAGE)
commit the transaction
collect another package and repeat the algorithm until the DataReader stops providing data
The above algorithm should be applied to both databases.
Your database probably has foreign keys, so remember that dependent tables have to be synchronized in a valid order.
Wherever possible you can benefit from multithreading.
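A minimal sketch of one package transfer (MySQL source to SQL Server target here; the reverse direction is symmetric). The table and column names follow the example above; the row/holder types, the helper name, and the simple concatenated IN list are illustrative assumptions:
// Requires: using System; using System.Collections.Generic;
// using System.Data.SqlClient; using MySql.Data.MySqlClient;
static void FlushPackage(List<KeyValuePair<long, string>> package,
                         MySqlConnection sourceConn, SqlConnection targetConn)
{
    DateTime syncDate = DateTime.Now;

    // 1. Insert the whole package into the target inside one transaction.
    using (SqlTransaction tx = targetConn.BeginTransaction())
    {
        foreach (var row in package)
        {
            using (var insert = new SqlCommand(
                "INSERT INTO TEST_TABLE (ID_PRIMARY_KEY, SomeColumn, SyncDate) VALUES (@id, @val, @syncDate)",
                targetConn, tx))
            {
                insert.Parameters.AddWithValue("@id", row.Key);
                insert.Parameters.AddWithValue("@val", row.Value);
                insert.Parameters.AddWithValue("@syncDate", syncDate);
                insert.ExecuteNonQuery();
            }
        }
        tx.Commit();
    }

    // 2. Stamp SyncDate on the source rows that were transferred, so they are skipped next time.
    //    Note: if a DataReader is still open on sourceConn, use a second connection for this update.
    string ids = string.Join(",", package.ConvertAll(r => r.Key.ToString()));
    using (var mark = new MySqlCommand(
        "UPDATE TEST_TABLE SET SyncDate = @syncDate WHERE ID_PRIMARY_KEY IN (" + ids + ")", sourceConn))
    {
        mark.Parameters.AddWithValue("@syncDate", syncDate);
        mark.ExecuteNonQuery();
    }
}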
For this kind of job, SQL Server Integration Services is the way to go.
SSIS is made to extract, transform, and load data (the ETL process).
You design a workflow with each transformation step of your data, from MySQL to MSSQL.
You then create a job and configure its schedule, and it will be executed by SQL Server.
No need to develop something from scratch that is hard to maintain.
If you are already at the point where you are ready to move the MySQL data to SQL Server, I would suggest you put it in a separate table and issue a MERGE command to merge the data. Make sure that you flag the data that has been updated so you can then get it back and push it to MySQL:
MERGE TableFromMySQL AS TARGET USING TableFromSQL AS SOURCE
ON (TARGET.PrimaryKey = SOURCE.PrimaryKey)
WHEN MATCHED AND (TARGET.Field1 <> SOURCE.Field1
    OR TARGET.Field2 <> SOURCE.Field2
    -- .. put more field comparisons here that you want to sync
    )
THEN
    UPDATE
    SET TARGET.Field1 = SOURCE.Field1,
        TARGET.Field2 = SOURCE.Field2,
        -- flag the target row as updated
        TARGET.IsUpdated = 1,
        TARGET.LastUpdatedOn = GETDATE()
WHEN NOT MATCHED
THEN
    INSERT(PrimaryKey, Field1, Field2, IsUpdated, LastUpdatedOn)
    VALUES(SOURCE.PrimaryKey, SOURCE.Field1, SOURCE.Field2, 1, GETDATE());
At this point, you can query the MySQL-side table with IsUpdated = 1 and push all those values to the MySQL database. I hope this helps.
If you add the mySql database as a linked server (sp_addlinkedserver) then I think you could do the following in a stored proc on your MS SQL database:
INSERT INTO table1 (col1)
SELECT col1
FROM OPENQUERY(MySqlLinkedDb, 'select col1 from mySqlDb.table2')
If you have to do this through a Windows Service, I am curious how you are receiving the data from MySQL. What is initiating the call to the service, and in what form is the data passed? Are you getting just the changes, or a complete refresh of the data?
If you receive a DataTable or some other kind of IEnumerable and it's a complete refresh, for example, you could use SqlBulkCopy.WriteToServer(). That would be the fastest and most efficient.
Tom
From the C# end: for inserting 20 or more rows into the database, is it better to send multiple INSERT statements,
or to pile the INSERT statements up into a single string and execute that SQL string? Which is more efficient? What are the advantages and disadvantages?
E.g:
for(int i=0;i<21;i++)
{
//Insert command here
}
or
string qry="Insert into table1 () values "
for(int i=0;i<21;i++)
{
qry+="(values)";
}
If you are trying to perform a bulk insert operation, I would recommend you look at the SqlBulkCopy class, which is optimized for this specific operation. If, instead, you are just curious about the relative performance of the various methods of data insertion: in my experiments, table-valued parameters had very high performance compared to large numbers of ad-hoc insert statements, many of them 'batch inserts' like you mentioned.
Long story short, you're going to have to measure what's best for you using your data and schema. But I do recommend TVPs and the SqlBulkCopy class.
If you are sending a bunch of inserts, you should use a transaction, as that VASTLY increases the efficiency.
BEGIN TRANSACTION;
INSERT .... ;
INSERT .... ;
INSERT .... ;
INSERT .... ;
INSERT .... ;
COMMIT;
is far more efficient and noticeably faster than
INSERT .... ;
INSERT .... ;
INSERT .... ;
INSERT .... ;
INSERT .... ;
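From C#, a minimal sketch of the same idea with an explicit transaction and one reused parameterized command (the table, column, and parameter names, and the rows collection, are placeholders):
// Requires: using System.Data; using System.Data.SqlClient;
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (SqlTransaction tx = conn.BeginTransaction())
    using (SqlCommand cmd = new SqlCommand("INSERT INTO table1 (col1, col2) VALUES (@p1, @p2)", conn, tx))
    {
        cmd.Parameters.Add("@p1", SqlDbType.Int);
        cmd.Parameters.Add("@p2", SqlDbType.NVarChar, 100);

        foreach (var row in rows)   // rows is whatever collection you are inserting
        {
            cmd.Parameters["@p1"].Value = row.Col1;
            cmd.Parameters["@p2"].Value = row.Col2;
            cmd.ExecuteNonQuery();
        }

        tx.Commit();   // everything above is applied (or rolled back) as one unit
    }
}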
I use two methods (outside of Entity Framework). As stated, if you have a ton of records (> 5000), use SqlBulkCopy; nothing will be faster than that. It uses DataTables, so it's not the greatest when working with objects.
Another option, which we use with our business objects, is XML. Our business object has a serialize method which produces XML... then we have a stored procedure such as the following. It has a somewhat higher cost than just passing in strings, but you get SQL Server verifying that the syntax is correct, the column names are correct, etc. Plus SQL Server will cache the execution plan. Passing in strings is prone to typos and problems when making changes.
@xDoc XML

INSERT INTO dbo.Test (Id, Txt)
SELECT Data.Id, Data.Txt
FROM
    (SELECT
        X.Data.value('Id[1]', 'int') AS Id,
        X.Data.value('Txt[1]', 'tinyint') AS Txt
     FROM @xDoc.nodes('MyRootNode/MyRecord') AS X(Data)) AS Data;
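On the C# side, calling such a procedure is one round trip: serialize the objects, pass the XML as a single parameter. The procedure name dbo.InsertTestFromXml is illustrative:
// Requires: using System.Data; using System.Data.SqlClient;
string xml = myBusinessObject.Serialize();   // your existing serialize method

using (SqlCommand cmd = new SqlCommand("dbo.InsertTestFromXml", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@xDoc", SqlDbType.Xml).Value = xml;   // the whole batch goes up in one call
    cmd.ExecuteNonQuery();
}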
I have a table whose schema is very simple: an ID column as the unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal is, for 5000 inputs, to work out which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function which converts a string into a uniqueidentifier (GUID). My logic is that if there is an existing ID, then I treat the string as already contained in the table.
My question is: if I need to find out which of the 5000 input strings are already contained in the DB and which are not, what is the most efficient way?
BTW: my current implementation converts the string to a GUID using C# code, then invokes a stored procedure which queries whether the ID exists in the database and returns the result back to the C# code.
My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.
My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:
SELECT X.thecol
FROM X
JOIN ExistingTable AS E ON E.thecol = X.thecol
to get the ones that are present, and (if both sets are needed)
SELECT X.thecol
FROM X
LEFT JOIN ExistingTable AS E ON E.thecol = X.thecol
WHERE E.thecol IS NULL
to get the ones that are absent. Worth benchmarking, at least.
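A quick way to stage the 5000 values from C# is SqlBulkCopy into a temp table on the same open connection. The column name and the string-to-GUID helper below just follow the question; the rest is an assumption:
// Requires: using System; using System.Data; using System.Data.SqlClient;
// The temp table must be created, filled, and queried on the same open connection,
// since #X is scoped to that session.
DataTable staging = new DataTable();
staging.Columns.Add("thecol", typeof(Guid));
foreach (string input in inputs)
{
    staging.Rows.Add(ConvertToGuid(input));   // your existing string-to-GUID function
}

using (SqlCommand create = new SqlCommand("CREATE TABLE #X (thecol UNIQUEIDENTIFIER PRIMARY KEY)", connection))
{
    create.ExecuteNonQuery();
}

using (SqlBulkCopy bulk = new SqlBulkCopy(connection))
{
    bulk.DestinationTableName = "#X";
    bulk.WriteToServer(staging);
}

using (SqlCommand query = new SqlCommand(
    "SELECT X.thecol FROM #X AS X JOIN ExistingTable AS E ON E.thecol = X.thecol", connection))
using (SqlDataReader reader = query.ExecuteReader())
{
    while (reader.Read())
    {
        Guid existing = reader.GetGuid(0);   // these inputs are already in the table
    }
}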
Edit: as requested, here are some good docs and tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issues for and against them (but remember that if you're having performance problems you do want to benchmark alternatives, not just go on theoretical considerations!).
MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.
Step 1. Make sure you have a problem to solve. Five thousand inserts isn't a lot to insert one at a time in a lot of contexts.
Are you certain that the simplest way possible isn't sufficient? What performance issues have you measured so far?
What do you need to do with those entries that do or don't exist in your table??
Depending on what you need, maybe the new MERGE statement in SQL Server 2008 could fit your bill - update what's already there, insert new stuff, all wrapped neatly into a single SQL statement. Check it out!
http://blogs.conchango.com/davidportas/archive/2007/11/14/SQL-Server-2008-MERGE.aspx
http://www.sql-server-performance.com/articles/dba/SQL_Server_2008_MERGE_Statement_p1.aspx
http://blogs.msdn.com/brunoterkaly/archive/2008/11/12/sql-server-2008-merge-capability.aspx
Your statement would look something like this:
MERGE INTO
(your target table) AS t
USING
(your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN -- row does not exist in base table
....(do whatever you need to do)
WHEN MATCHED THEN -- row exists in base table
... (do whatever else you need to do)
;
To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL server using BULK INSERT:
BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)
Marc
PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:
In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement.
By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.
Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.
If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.
If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an INSERT statement for each of the strings you want to add to the table. Wrap each insert in a TRY...CATCH block. Send them all up to the server in one shot, like this:
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
If you have a unique index or primary key defined on your string GUID, then the duplicate inserts will fail. Checking ahead of time to see whether the record exists just duplicates work that SQL Server is going to do anyway.
If performance is really important, then consider downloading the 5000 GUIDs to your local station and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk importing to a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.
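If you do go that route, the local comparison itself is trivial with a HashSet. The table and column names, the inputs collection, and the ConvertToGuid helper are placeholders:
// Requires: using System; using System.Collections.Generic; using System.Data.SqlClient;
HashSet<Guid> existing = new HashSet<Guid>();

// Read the IDs that are already in the table once.
using (SqlCommand cmd = new SqlCommand("SELECT ID FROM dbo.YourTable", connection))
using (SqlDataReader reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        existing.Add(reader.GetGuid(0));
    }
}

// Then each of the 5000 inputs is a constant-time lookup.
foreach (string input in inputs)
{
    Guid id = ConvertToGuid(input);   // your existing string-to-GUID function
    bool alreadyInTable = existing.Contains(id);
    // ... act on alreadyInTable ...
}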
Since you are using SQL Server 2008, you could use table-valued parameters. They are a way to provide a table as a parameter to a stored procedure.
Using ADO.NET you can easily pre-populate a DataTable and pass it as a SqlParameter.
Steps you need to perform:
Create a custom Sql Type
CREATE TYPE MyType AS TABLE
(
UniqueId INT NOT NULL,
Column NVARCHAR(255) NOT NULL
)
Create a stored procedure which accepts the Type
CREATE PROCEDURE spInsertMyType
@Data MyType READONLY
AS
xxxx
Call using C#
SqlCommand insertCommand = new SqlCommand("spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;

SqlParameter tvpParam = insertCommand.Parameters.AddWithValue("@Data", dataReader);
tvpParam.SqlDbType = SqlDbType.Structured;
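If you don't have a data reader handy, a DataTable shaped like MyType works as the parameter value too; a minimal sketch (the items collection and its properties are placeholders):
// Build a DataTable whose columns match the MyType table type, then pass it as @Data.
DataTable data = new DataTable();
data.Columns.Add("UniqueId", typeof(int));
data.Columns.Add("Column", typeof(string));

foreach (var item in items)   // items is whatever collection you are sending
{
    data.Rows.Add(item.UniqueId, item.Text);
}

SqlParameter tableParam = insertCommand.Parameters.AddWithValue("@Data", data);
tableParam.SqlDbType = SqlDbType.Structured;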
Links: Table-valued Parameters in Sql 2008
Definitely do not do it one-by-one.
My preferred solution is to create a stored procedure with one parameter that takes XML in the following format:
<ROOT>
    <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/>
    <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001"/>
    ....
</ROOT>
Then, in the procedure, with an argument of type NVARCHAR(MAX), you convert it to XML, after which you use it as a table with a single column (let's call it @FilterTable). The stored procedure looks like:
CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON

    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)

    -- temporary table (must have it, because we cannot join on the XML directly)
    DECLARE @FilterTable TABLE (
        "ID" UNIQUEIDENTIFIER
    )

    -- insert into the table variable
    -- important: XML is case-sensitive
    INSERT @FilterTable
    SELECT x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM @x.nodes('/ROOT/MyObject') AS R(x)

    SELECT o.ID,
           SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM @FilterTable o
    LEFT JOIN dbo.MyTable t
        ON o.ID = t.ID
    GROUP BY o.ID
END
GO
You run it as:
EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'
And your results look like:
ID FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0