I have to parse a big XML file and import (insert/update) its data into various tables with foreign key constraints.
So my first thought was: I create a list of SQL insert/update statements and execute them all at once by using SqlCommand.ExecuteNonQuery().
Another method I found was shown by AMissico: Method, where I would execute the SQL commands one by one. No one complained, so I think it's also a viable practice.
Then I found out about SqlBulkCopy, but it seems I would have to create a DataTable with the data I want to upload - so, one SqlBulkCopy per table. For this I could create a DataSet.
I think every option supports SqlTransaction. It's approximately 100-20,000 records per table.
Which option would you prefer and why?
You say that the XML is already in the database. First, decide whether you want to process it in C# or in T-SQL.
C#: You'll have to send all data back and forth once, but C# is a far better language for complex logic. Depending on what you do it can be orders of magnitude faster.
T-SQL: No need to copy data to the client but you have to live with the capabilities and perf profile of T-SQL.
Depending on your case one might be far faster than the other (not clear which one).
If you want to compute in C#, use a single streaming SELECT to read the data and a single SqlBulkCopy to write it. If your writes are not insert-only, write to a temp table and execute as few DML statements as possible to update the target table(s) (maybe a single MERGE).
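A minimal sketch of that C# path, assuming made-up names (dbo.SourceTable, dbo.Target, a #Staging temp table) and a trivial two-column schema; any C#-side transformation would sit between the read and the write (for example via a DataTable or a custom IDataReader):

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkLoadAndMerge(string connectionString)
{
    using (var readConn = new SqlConnection(connectionString))
    using (var writeConn = new SqlConnection(connectionString))
    {
        readConn.Open();
        writeConn.Open();

        // The staging table lives on the session that does the bulk copy and the MERGE.
        new SqlCommand("CREATE TABLE #Staging (Id int PRIMARY KEY, Value nvarchar(200));", writeConn)
            .ExecuteNonQuery();

        var readCmd = new SqlCommand("SELECT Id, Value FROM dbo.SourceTable;", readConn);
        using (SqlDataReader reader = readCmd.ExecuteReader())
        using (var bulk = new SqlBulkCopy(writeConn) { DestinationTableName = "#Staging", BatchSize = 5000 })
        {
            bulk.WriteToServer(reader); // one streaming read, one bulk write
        }

        // As few DML statements as possible against the real target - here a single MERGE.
        new SqlCommand(@"
            MERGE dbo.Target AS t
            USING #Staging AS s ON t.Id = s.Id
            WHEN MATCHED THEN UPDATE SET t.Value = s.Value
            WHEN NOT MATCHED THEN INSERT (Id, Value) VALUES (s.Id, s.Value);", writeConn)
            .ExecuteNonQuery();
    }
}
```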
If you want to stay in T-SQL minimize the number of statements executed. Use set-based logic.
All of this is simplified/shortened. I left out many considerations because they would be too long for a Stack Overflow answer. Be aware that the best strategy depends on many factors. You can ask follow-up questions in the comments.
Don't do it from C# unless you have to; it's a huge overhead, and SQL can do it so much faster and better by itself.
Insert to table from XML file using INSERT INTO SELECT
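As a hedged sketch of that set-based approach: the client only ships the XML document, and a single INSERT ... SELECT shreds it on the server. The table name dbo.Target, the element/attribute names, and the column types are all assumptions here.

```csharp
using System.Data;
using System.Data.SqlClient;

static void ImportXml(string connectionString, string xml)
{
    // Shred the XML on the server with one set-based statement.
    const string sql = @"
        INSERT INTO dbo.Target (Id, Name)
        SELECT r.value('@Id',   'int'),
               r.value('@Name', 'nvarchar(100)')
        FROM @Doc.nodes('/Rows/Row') AS t(r);";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.Add("@Doc", SqlDbType.Xml).Value = xml; // the whole document in one parameter
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}
```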
Insert Query vs SqlBulkCopy - which is better performance-wise for inserting records from one database to another?
I know SqlBulkCopy is used for large numbers of records.
If there are fewer than 10 records, which one is better?
Please share your views.
As you are asking about fewer than 10 records, I would suggest you use a simple insert query.
But if you want to use SqlBulkCopy, then you should first know when to use it.
Read For Knowledge
BULK INSERT
The BULK INSERT command is the in-process method for bringing data from a text file into SQL Server. Because it runs in process with Sqlservr.exe, it is a very fast way to load data files into SQL Server.
IMO, in your case, using SqlBulkCopy is way overkill.
You can use a User-defined Type (UDT) as a table-valued parameter (TVP) passed to a stored procedure if you have less than 1,000 rows or so. After that threshold, the performance gain of SqlBulkCopy begins to outweigh its initial overhead. SqlBulkCopy works best with Heap Tables (tables without clustered indexes).
I found this article helpful for my own solution - https://www.sentryone.com/blog/sqlbulkcopy-vs-table-valued-parameters-bulk-loading-data-into-sql-server.
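For reference, a minimal sketch of the TVP route; the table type dbo.OrderRowType and the procedure dbo.ImportOrders are made-up names, and the DataTable's columns must match the table type's definition:

```csharp
using System.Data;
using System.Data.SqlClient;

static void SendRows(string connectionString, DataTable rows) // columns must match dbo.OrderRowType
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.ImportOrders", conn) { CommandType = CommandType.StoredProcedure })
    {
        var p = cmd.Parameters.AddWithValue("@Rows", rows);
        p.SqlDbType = SqlDbType.Structured;       // marks the parameter as a TVP
        p.TypeName = "dbo.OrderRowType";          // the user-defined table type on the server
        conn.Open();
        cmd.ExecuteNonQuery();                    // one round trip for the whole batch
    }
}
```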
Here's a link to a site with some benchmarks.
I've used SqlBulkCopy extensively in the past and it's very efficient. However, you may need to redesign your tables to be able to use it effectively. For example, when you use SqlBulkCopy you may want to use client side generated identifiers, so you will have to invent some scheme that allows you to generate stable, robust and predictable IDs. This is very different from your typical insert into table with auto generated identity column.
As an alternative to SqlBulkCopy you can use the method discussed in the link I provided, where you use a combination of UNION ALL and INSERT INTO statements; it has excellent performance as well. Although, as the dataset grows, I think SqlBulkCopy will be the faster option. A hybrid approach is probably warranted here, where you switch to SqlBulkCopy when the record count is high.
I reckon SqlBulkCopy will win out for larger data sets, but you need to understand that a SqlBulkCopy operation can be used to forgo some checks. This will of course speed things up even more, but it also allows you to violate conditions that you have imposed on your database.
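To illustrate that last point, here is a small hedged sketch: SqlBulkCopy skips check/foreign-key constraints and insert triggers by default, and you have to opt back in if you want them enforced during the load (dbo.TargetTable is a placeholder name):

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkLoadWithChecks(string connectionString, DataTable data)
{
    // Both options below are OFF by default, which is part of why bulk copy is so fast.
    var options = SqlBulkCopyOptions.CheckConstraints   // enforce CHECK/foreign-key constraints
                | SqlBulkCopyOptions.FireTriggers;      // run insert triggers on the target table

    using (var bulk = new SqlBulkCopy(connectionString, options))
    {
        bulk.DestinationTableName = "dbo.TargetTable";
        bulk.WriteToServer(data);
    }
}
```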
I have a very large number of rows (10 million) which I need to select out of a SQL Server table. I will go through each record and parse it (they are XML), and then write each one back to the database via a stored procedure.
The question I have is, what's the most efficient way to do this?
The way I am doing it currently is that I open two SqlConnections (one for reading, one for writing). The read connection uses a SqlDataReader which basically does a SELECT * FROM table, and I loop through the results. After I parse each record, I do an ExecuteNonQuery (using parameters) on the second connection.
Is there any suggestions to make this more efficient, or is this just the way to do it?
Thanks
It seems that you are writing rows one-by-one. That is the slowest possible model. Write bigger batches.
There is no need for two connections when you use MARS. Unfortunately, MARS forces a 14 byte row versioning tag in each written row. Might be totally acceptable, or not.
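One possible shape of the batching idea, swapping the per-row stored-procedure call for a SqlBulkCopy flush every N rows. All table and column names are invented, and ParseRecord stands in for the real XML parsing; with MARS enabled in the connection string a single connection could be used instead of two, subject to the caveat above.

```csharp
using System.Data;
using System.Data.SqlClient;

static void ProcessInBatches(string connectionString, int batchSize)
{
    var buffer = new DataTable();
    buffer.Columns.Add("Id", typeof(int));
    buffer.Columns.Add("ParsedValue", typeof(string));

    using (var readConn = new SqlConnection(connectionString))
    using (var writeConn = new SqlConnection(connectionString))
    {
        readConn.Open();
        writeConn.Open();

        var readCmd = new SqlCommand("SELECT Id, Payload FROM dbo.Source;", readConn);
        using (SqlDataReader reader = readCmd.ExecuteReader())
        using (var bulk = new SqlBulkCopy(writeConn) { DestinationTableName = "dbo.ParsedResults" })
        {
            while (reader.Read())
            {
                buffer.Rows.Add(reader.GetInt32(0), ParseRecord(reader[1].ToString()));
                if (buffer.Rows.Count >= batchSize)
                {
                    bulk.WriteToServer(buffer);   // one round trip per batch instead of per row
                    buffer.Clear();
                }
            }
            if (buffer.Rows.Count > 0)
                bulk.WriteToServer(buffer);       // flush the remainder
        }
    }
}

static string ParseRecord(string xml)
{
    return xml; // placeholder for the real per-record parsing
}
```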
I had a very similar situation and here is what I did:
I made two copies of the same database.
One is optimized for reading and the other is optimized for writing.
In the config, I kept two connection strings: ConnectionRead and ConnectionWrite.
Now in the data layer, when I have a read statement (SELECT...), I switch to the ConnectionRead connection string, and when writing I use the other one.
Now, since I have to keep both databases in sync, I am using SQL replication for this job.
I understand the implementation depends on many aspects, but the approach may help you.
I agree with Tim Schmelter's post - I did something very similar. I actually used a SQLCLR procedure which read the data from an XML column in a SQL table into an in-memory table using .NET (System.Data), then used the .NET System.Xml namespace to deserialize the XML, populated another in-memory table (in the shape of the destination table), and used SqlBulkCopy to populate that destination SQL table with the parsed attributes I needed.
SQL Server is engineered for set-based operations... If I'm ever shredding/iterating (row by row), I tend to use SQLCLR, as .NET is generally better at iterative/data-manipulative processing. An exception to my rule is when working with a little metadata for data-driven processes or cleanup routines, where I may use a cursor.
I have two tables where I need to extract data from one, do some modifications to that data and then write it to a different table.
I was wondering what is the most space/time efficient way to do this.
Is it better to read one record, modify it, and write the single record to the other table, looping over this, or is it better to read the whole thing, modify it, and then write all of it into the other table?
I will be writing this using C# and Linq.
The tables have different column headings and structure.
Indeed, the most performant way is to use a stored procedure or something (and then of course use batch/set operations).
If you have to choose C#, choose the option that has the fewest I/O-operations, since those are almost always the performance-breakers. That usually means: read everything in one go, modify, and then write it all in one go, but this is all very dependent on the amount of data you are modifying.
Most efficient way would be to do this entirely in the backend. Write a stored procedure (most likely no loops will be required; it should be a matter of INSERT/SELECT) and call that SP from your .NET code.
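A minimal sketch of that suggestion, with an invented procedure name (dbo.CopyAndTransform) called once from .NET; the set-based work lives inside the procedure:

```csharp
using System.Data;
using System.Data.SqlClient;

static void RunTransform(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.CopyAndTransform", conn) { CommandType = CommandType.StoredProcedure })
    {
        conn.Open();
        cmd.ExecuteNonQuery();
        // Inside the proc, the whole job is typically a single statement along the lines of:
        // INSERT INTO dbo.Destination (ColA, ColB)
        // SELECT UPPER(s.ColA), s.ColB * 2 FROM dbo.Source AS s;
    }
}
```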
The best way to do this is an ETL process or script.
If the modification requires any end-user UI activity before inserting into the table, then C# using LINQ is good. If the modification on each record is the same, then use ETL or SQL scripts to perform it.
A few guidelines:
For fetching data from / inserting into one table, use a stored procedure on the SQL side; that's better.
Using ADO.NET for fetching and inserting records is faster than LINQ.
For the same processing on multiple records:
For bulk processing of records, use table variables to fetch and iterate over each record.
A table variable is quite a bit faster than a temporary table, and it also helps in querying records.
Apply the modification (via a cursor or any other logic-based iteration over the records, or by editing column data), then insert the results through a bulk insert into the other table.
SQL Server is quite proficient at processing records at large scale. I would not suggest putting this business logic into the C#/LINQ-based app; try to process your business logic at SQL Server unless the end user needs to edit the records.
I have moved 1.8 million records from tables into newly structured database tables; it happened accurately in 11 minutes, but it took 27 minutes for the verification queries to confirm that everything was in the right place.
It might help you.
A big question is your amount of data. A .Net client cannot "write the whole thing" in one request. Inserts and updates happen row-by-row. It certainly makes sense to read the data in one request (or in batches if it is too big to process everything in-memory).
But if you have 100,000s or millions of rows, this process will take many minutes regardless. I would therefore reevaluate your assertion that "the manipulation in the middle requires it [C#]". There are probably ways around this, for example by creating some sort of control table in your database beforehand that you can then use in a query or stored procedure to apply the modifications. The difference in performance makes it worthwhile to be creative in this situation.
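One hedged reading of the control-table idea: compute the per-key modifications in C#, bulk-load them into a small helper table, then apply them with a single set-based UPDATE. The names dbo.ModControl, dbo.Target, and the columns are invented for the sketch.

```csharp
using System.Data;
using System.Data.SqlClient;

static void ApplyModifications(string connectionString, DataTable modifications) // columns: Id, NewValue
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.ModControl" })
        {
            bulk.WriteToServer(modifications);            // one bulk write instead of row-by-row updates
        }

        // One set-based statement applies all the modifications at once.
        new SqlCommand(@"
            UPDATE t SET t.Value = m.NewValue
            FROM dbo.Target AS t
            JOIN dbo.ModControl AS m ON m.Id = t.Id;", conn).ExecuteNonQuery();
    }
}
```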
I have a complete working mini example here:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
but the gist is this
Use an IDataReader to read data from the source database.
Perform some validations and calculations in the C# code.
The massaged data needs to be put into the destination database.
The trouble here is that "row by row" kills performance, but if you do ~everything in one hit, it may be too much.
So what I do is send down X rows at a time (think 1,000 as an example) via XML to a stored procedure, and bulk insert those X rows in one go.
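A rough sketch of that batching step, with assumed names (dbo.ImportBatch, the Rows/Row element shape) standing in for the real ones from the linked example:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

static void SendBatch(string connectionString, DataTable batch) // e.g. ~1,000 massaged rows: Id, Value
{
    var rowElements = new List<XElement>();
    foreach (DataRow r in batch.Rows)
    {
        rowElements.Add(new XElement("Row",
            new XAttribute("Id", r["Id"]),
            new XAttribute("Value", r["Value"])));
    }
    string xml = new XElement("Rows", rowElements).ToString();

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.ImportBatch", conn) { CommandType = CommandType.StoredProcedure })
    {
        cmd.Parameters.Add("@Batch", SqlDbType.Xml).Value = xml;
        conn.Open();
        cmd.ExecuteNonQuery(); // the proc shreds @Batch with .nodes() and does one INSERT ... SELECT
    }
}
```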
You may have poison records (or that don't pass validation). My model is "get as many that will work into the database, log and deal with the poison records later".
The code will log the xml that did not pass.
Not included in the demo, but an enhancement:
If the bulk insert (of 1,000) does not work, perhaps have a subroutine that passes them in one by one at that point, and log the handful that do not work.
The downloadable example is older, but has the skeleton.
Well, you can try to update column by column, so that you don't have to loop over every row, and your server trips can be reduced to twice the number of columns you have (one trip for fetching data, one for inserting).
You can get the primary keys while fetching records, use
a WHERE ... IN clause,
and insert those values into another table.
But you will have to elaborate more on your problem to get a satisfactory answer; the above approach will reduce server trips and looping.
Or you can use SqlDataAdapter.InsertCommand:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldataadapter.insertcommand.aspx
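If you go the SqlDataAdapter route, a hedged sketch might look like this (table, column, and parameter names are placeholders); UpdateBatchSize lets the adapter group several inserts per round trip:

```csharp
using System.Data;
using System.Data.SqlClient;

static void InsertWithAdapter(string connectionString, DataTable newRows) // rows added via newRows.Rows.Add(...)
{
    using (var conn = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter())
    {
        var insert = new SqlCommand("INSERT INTO dbo.Destination (Id, Name) VALUES (@Id, @Name);", conn);
        insert.Parameters.Add("@Id", SqlDbType.Int, 0, "Id");         // map parameter to source column
        insert.Parameters.Add("@Name", SqlDbType.NVarChar, 100, "Name");
        insert.UpdatedRowSource = UpdateRowSource.None;               // required when batching commands

        adapter.InsertCommand = insert;
        adapter.UpdateBatchSize = 100;                                // rows grouped per round trip

        conn.Open();
        adapter.Update(newRows);                                      // only rows in the Added state are inserted
    }
}
```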
Hope it works.
I have a lot of data which needs to be paired based on a few simple criteria. There is a time window (both records have a DateTime column), if one record is very close in time (within 5 seconds) to another then it is a potential match, the record which is the closest in time is considered a complete match. There are other fields which help narrow this down also.
I wrote a stored procedure which does this matching on the server before returning the
full, matched dataset to a C# application. My question is, would it be better to pull in the 1 million (x2) rows and deal with them in C#, or is sql server better suited to perform this matching? If Sql server is, then what is the fastest way of pairing data using datetime fields?
Right now I select all records from Table 1/Table 2 into temporary tables, iterate through each record in Table 1, look for a match in Table 2 and store the match (if one exists) in a temporary table, then I delete both records in their own temporary tables.
I had to rush this piece for a game I'm writing, so excuse the bad (very bad) procedure... It works, it's just horribly inefficient! The whole SP is available on pastebin: http://pastebin.com/qaieDsW7
I know the SP is written poorly, so saying "hey, dumbass... write it better" doesn't help! I'm looking for help in improving it, or help/advice on how I should do the whole thing differently! I have about 3/5 days to rewrite it, I can push that deadline back a bit, but I'd rather not if you guys can help me in time! :)
Thanks!
Ultimately, compiling your data on the database side is preferable 99% of the time, as it's designed for data crunching (through the use of indexes, relations, etc.). A lot of your code can be consolidated by the use of joins to compile the data in exactly the format you need. In fact, you can bypass almost all your temp tables entirely and just fill a master Event temp table.
The general pattern is this:
INSERT INTO #Events
SELECT <all interested columns>
FROM FireEvent
LEFT OUTER JOIN HitEvent ON <all join conditions for HitEvent>
This way you match all fire events to zero or more hit events. After our discussion in chat, you can even limit it to zero or one hit event by wrapping the query in a subquery, using a window function such as ROW_NUMBER() OVER (PARTITION BY HitEvent.EventID ORDER BY ...) AS HitRank, and adding WHERE HitRank = 1 to the outer query. This is ultimately what you ended up doing and got the results you were expecting (with a bit of work and learning in the process).
If the data is already in the database, that is where you should do the work. You should absolutely learn to display and read query plans using SQL Server Management Studio, and become able to notice and optimize away expensive operations like nested loops.
Your task probably does not require any use of temporary tables. Temporary tables tend to be efficient when they are relatively small and/or heavily reused, which is not your case.
I would advise you to try to optimize the stored procedure if is not running fast enough and not rewrite it in C#. Why would you want to transfer millions of rows out of SQL Server anyway?
Unfortunately I don't have a SQL Server installation, so I can't test your script, but I don't see any CREATE INDEX statements in there. If you didn't just skip them for brevity, then you should definitely analyze your queries and see which indexes are needed.
So the answer depends on several factors like resources available per client/server (Ram/CPU/Concurrent Users/Concurrent processes, etc.)
Here are some basic rules that will improve your performance regardless of what you use:
Loading a million rows into a C# program is not good practice, unless this is a standalone process with plenty of RAM.
Uniqueidentifiers will never outperform integers: Comparisons
Common Table Expressions are a good alternative for fast matching: How to use CTE
Finally, you have to consider the output. If there is constant reading and writing that affects the user interface, then you should manage that in memory (C#); otherwise, all CRUD operations should be kept inside the database.
I am inserting records into the database (100, 1,000, 10,000 and 100,000) using two methods
(it is a table with no primary key and no index):
using a for loop and inserting one by one
using a stored procedure
The times are, of course, better using the stored procedure.
My questions are: 1) if I use an index, will the operation go faster, and 2) is there any other way to do the insertion?
PS: I am using iBATIS as the ORM, if that makes any difference.
Check out SqlBulkCopy.
It's designed for fast insertion of bulk data. I've found it to be fastest when using the TableLock option and setting a BatchSize of around 10,000, but it's best to test the different scenarios with your own data.
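A minimal sketch of that configuration (the destination table name is a placeholder, and the numbers are only a starting point for your own benchmarks):

```csharp
using System.Data;
using System.Data.SqlClient;

static void FastInsert(string connectionString, DataTable data)
{
    using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
    {
        bulk.DestinationTableName = "dbo.TargetTable";
        bulk.BatchSize = 10000;          // rows sent per batch
        bulk.BulkCopyTimeout = 0;        // no timeout for large loads
        bulk.WriteToServer(data);
    }
}
```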
You may also find the following useful.
SQLBulkCopy Performance Analysis
No, I suspect that, if you use an index, it will actually go slower. That's because it has to update the index as well as inserting the data.
If you're reasonably certain that the data won't have duplicate keys, add the index after you've inserted all the rows. That way, it is built once rather than being added to and re-balanced on every insert.
That's a function of the DBMS. I know it's true for the one I use frequently (which is not SQL Server).
I know this is slightly off-topic, but it's a shame you're not using SQL Server 2008, as there's been a massive improvement in this area with the advent of the MERGE statement and user-defined table types (which allow you to pass in a 'table' of data to a stored procedure or statement so you can insert/update many records in one go).
For some more information, have a look at http://www.sql-server-helper.com/sql-server-2008/merge-statement-with-table-valued-parameters.aspx
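For illustration, a hedged sketch of combining a table-valued parameter with MERGE (SQL Server 2008+); dbo.ItemType, dbo.Items, and the columns are assumed names:

```csharp
using System.Data;
using System.Data.SqlClient;

static void UpsertItems(string connectionString, DataTable items) // columns must match dbo.ItemType
{
    const string sql = @"
        MERGE dbo.Items AS target
        USING @Items AS source ON target.Id = source.Id
        WHEN MATCHED THEN UPDATE SET target.Name = source.Name
        WHEN NOT MATCHED THEN INSERT (Id, Name) VALUES (source.Id, source.Name);";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        var p = cmd.Parameters.AddWithValue("@Items", items);
        p.SqlDbType = SqlDbType.Structured;   // table-valued parameter
        p.TypeName = "dbo.ItemType";          // the user-defined table type
        conn.Open();
        cmd.ExecuteNonQuery();                // inserts and updates many records in one go
    }
}
```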
This was already discussed: Insert data into SQL server with best performance.