BulkCopy from Stored Procedure - C#

I have tables A, B and C in the database. I have to put the result obtained from A and B into table C.
Currently, I have an SP that returns the result of A and B to the C# application. This result is copied into table C using "System.Data.SqlClient.SqlBulkCopy". The advantage is that during the insert using bulk copy, log files are not created.
I want to avoid this extra traffic by handling the insert in the SP itself. However, it should not use any log files. Is there any way to achieve this?
Please share your thoughts.
Volume Of Data: 150,000
Database : SQL Server 2005
The database is in the full recovery model; that cannot be changed. Is SELECT INTO useful in such a scenario?
EDIT: When I use System.Data.SqlClient.SqlBulkCopy, the operation completes in 3 minutes; with a normal insert it takes 30 minutes. This particular operation need not be recoverable; however, other operations in the database have to be recoverable, hence I cannot change the recovery model of the whole database.
Thanks
Lijo

You can use SELECT INTO with the BULK_LOGGED recovery model in order to minimise the number of records written to the transaction log, as described in Example B of the INTO Clause documentation (MSDN):
ALTER DATABASE AdventureWorks2008R2 SET RECOVERY BULK_LOGGED;
GO
-- Put your SELECT INTO statement here
GO
ALTER DATABASE AdventureWorks2008R2 SET RECOVERY FULL;
This is also required for bulk inserts if you wish to have minimal impact on the transaction log as described in Optimizing Bulk Import Performance (MSDN):
For a database under the full recovery model, all row-insert operations that are performed during bulk import are fully logged in the transaction log. For large data imports, this can cause the transaction log to fill rapidly. For bulk-import operations, minimal logging is more efficient than full logging and reduces the possibility that a bulk-import operation will fill the log space. To minimally log a bulk-import operation on a database that normally uses the full recovery model, you can first switch the database to the bulk-logged recovery model. After bulk importing the data, switch the recovery model back to the full recovery model.
(emphasis mine)
I.e. if you aren't already setting the database recovery model to BULK_LOGGED before performing a bulk insert, then you won't currently be getting the benefit of minimal transaction logging with bulk inserts either, and so the transaction log won't be the source of your slowdown. (The SqlBulkCopy class doesn't do this for you automatically.)
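To make that concrete, here is a minimal sketch of the pattern driven from C#; the database name (MyDb), the destination table (dbo.C) and the IDataReader returned by your SP are placeholders for your own objects, and the recovery-model switch is exactly the one shown above:

using System.Data;
using System.Data.SqlClient;

// Sketch only: switch the database to BULK_LOGGED around the bulk copy, then back to FULL.
// "MyDb" and "dbo.C" are placeholders; sourceReader is the IDataReader returned by your SP.
static void BulkCopyWithBulkLogged(string connectionString, IDataReader sourceReader)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        Exec(conn, "ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;");
        try
        {
            // TableLock helps meet the minimal-logging prerequisites
            // (the target should also be a heap or an empty table).
            using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
            {
                bulk.DestinationTableName = "dbo.C";
                bulk.BulkCopyTimeout = 0;   // no timeout; the copy takes a few minutes
                bulk.WriteToServer(sourceReader);
            }
        }
        finally
        {
            Exec(conn, "ALTER DATABASE MyDb SET RECOVERY FULL;");
        }
    }
}

static void Exec(SqlConnection conn, string sql)
{
    using (var cmd = new SqlCommand(sql, conn)) { cmd.ExecuteNonQuery(); }
}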

Maybe you can use SELECT INTO.
Take a look at http://msdn.microsoft.com/en-us/library/ms191244.aspx

Can you give an example of the processing your procedure does?
Typically, I would think a set-based insert of 150,000 rows (no linked servers or anything) would take almost no time on most installations.
How long does just selecting the 150,000 rows with a query take?
Are you using a cursor and loop instead of a single INSERT INTO C SELECT * FROM (some combination of A and B)?
Is there any blocking which is causing the operation to wait for other operations to complete?
If your database is in the full recovery model, it is going to log the operation fully - that's the point of that recovery model. The database has been told to use full recovery, and it will do so to ensure it can honour that guarantee.
Imagine if you told the database that a column needed to be unique but it didn't actually enforce that for you! It would be worth less than a comment on a post-it note that fell off a specification document!

In SQL Server 2008 you do not need to return the data to the client/application before proceeding with a minimally logged operation. You can do it within the stored procedure immediately following your query that produces the result to be inserted to Table C.
See INSERT (Transact-SQL), specifically "Using INSERT INTO…SELECT to Bulk Load Data with Minimal Logging".
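As a rough illustration of that pattern (the same statement can live inside your stored procedure), assuming placeholder table and column names and that the minimal-logging prerequisites are met:

using System.Data.SqlClient;

// Sketch only: a single set-based INSERT ... SELECT with the TABLOCK hint that the
// minimal-logging guidance relies on. Table and column names, and the join, are placeholders.
static void InsertIntoC(string connectionString)
{
    const string sql = @"
        INSERT INTO dbo.C WITH (TABLOCK) (Col1, Col2)
        SELECT a.Col1, b.Col2
        FROM dbo.A AS a
        JOIN dbo.B AS b ON b.AId = a.Id;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        conn.Open();
        cmd.CommandTimeout = 0;   // the load may run for several minutes
        cmd.ExecuteNonQuery();
    }
}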
[Edit]: Since you have expanded your question to say that you are using the FULL recovery model, you cannot benefit from minimally logged operations.
Instead you should concentrate your efforts on optimising your data insert process, rather than concern yourself with logging overhead.

Insert data into table C in parts, using INSERT INTO C SELECT * FROM AandB WHERE ID < SOMETHING. Or you can send the output of the A and B data as XML to a stored procedure and insert the data in bulk there.
Hope this helps.

Related

Concurrency And Locking Across Load Balanced Application

I am writing an application where users can create items with a start date and an end date and save them to a SQL database hosted in Microsoft SQL Server. The rule in the application is that only a single item can be active for a given time (no overlapping items). The application also needs to be load balanced, which means (as far as I know) traditional semaphores / locking won't work.
A few additional items:
The records are persisted into two tables (based on a business requirement).
Users are allowed to "insert" records in the middle of an existing record. Inserted records adjust the start & end dates of any pre-existing records to prevent overlapping items (if necessary).
Optimally we want to accomplish this using our ORM and .Net. We don't have as much leeway to make database schema changes but we can create transactions and do other kinds of SQL operations through our ORM.
Our goal is to prevent the following from happening:
Saves from multiple users resulting in overlapping items in either table (ex. users 1 & 2 query the database, see that there aren't overlapping records, and save at the same time)
Saves from multiple users resulting in a different state in each of the destination tables (ex. Two users "insert" records, and the action is interleaved between the two tables. Table A looks as though User 1 went first, and table B looks as though User 2 went first.)
My question is how could I lock or prevent multiple users from saving / inserting at the same time across load balanced servers.
Note: We are currently looking into using sp_getapplock as it seems like it would do what we want, if you have experience with this or feel like it would be a bad decision and want to elaborate that would be appreciated as well!
Edit: added additional info
There are at least a couple of options:
You can create a stored procedure which wraps the INSERT operation in a transaction:
BEGIN TRY
    BEGIN TRAN;
    -- select to see if there is an existing (overlapping) record
    -- then either:
    --   invalidate the previous record and insert the new record,
    -- or
    --   raise an error
    COMMIT TRAN;
END TRY
BEGIN CATCH
    ROLLBACK TRAN;
    -- surface the error to the caller
END CATCH
You can employ a last-in-wins strategy where you don't take a write-level lock, but rather a read-level pseudo-lock, essentially ignoring all records except the latest one.
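Since the question mentions sp_getapplock, here is a hedged sketch of calling it from C# so that the overlap check and the inserts into both tables run one at a time across all load-balanced servers; the resource name and the save delegate are placeholders for your own code:

using System;
using System.Data;
using System.Data.SqlClient;

// Sketch only: take an application lock inside a transaction so that only one server at a
// time can run the overlap check and the inserts. "ItemScheduleLock" is an arbitrary name,
// and the save delegate stands in for your own ORM/SQL code.
static void SaveItemWithAppLock(string connectionString, Action<SqlConnection, SqlTransaction> save)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tran = conn.BeginTransaction())
        {
            using (var cmd = new SqlCommand("sp_getapplock", conn, tran))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@Resource", "ItemScheduleLock");
                cmd.Parameters.AddWithValue("@LockMode", "Exclusive");
                cmd.Parameters.AddWithValue("@LockOwner", "Transaction");
                cmd.Parameters.AddWithValue("@LockTimeout", 30000);   // ms
                var result = cmd.Parameters.Add("@ReturnValue", SqlDbType.Int);
                result.Direction = ParameterDirection.ReturnValue;
                cmd.ExecuteNonQuery();

                if ((int)result.Value < 0)
                    throw new TimeoutException("Could not acquire the application lock.");
            }

            save(conn, tran);   // overlap check + inserts into both tables go here
            tran.Commit();      // committing (or rolling back) releases the lock
        }
    }
}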

bypassing Entity Framework to import / update a large amount of data

I have a fully working production site based on Entity Framework, and now I need to import a large amount of data into the database weekly.
The data comes in the form of text files, which I go through line by line, check against the database to see if the row exists, and if it does, update anything that has changed, or just insert it if not.
The problem I'm having is that it takes around 32 hours to run the full import process, and some of the files have to be manually split into smaller chunks to avoid memory issues seemingly caused by Entity Framework. I have managed to slow down the memory increase, but the last time I ran a file without splitting it, it ran for about 12 hours before running out of memory at somewhere over 1.5 GB.
So can someone suggest the best way of importing this data? I have heard of SqlBulkCopy but wasn't sure if it was the correct thing to use. Can anyone provide any examples, or suggest anything more appropriate? For instance, should I create a duplicate of the entity using standard .NET SQL commands, and possibly use a stored procedure?
Although SqlBulkCopy is handy from managed code, I reckon the fastest way is to do it in "pure" SQL - given that SqlBulkCopy doesn't easily do upserts, you would need to execute the MERGE part below anyway.
Assume that your text file is in CSV format, that it exists on the SQL Server as "C:\Data\TheFile.txt", and that line endings are normalised as CR-LF (\r\n).
And let's assume that the data is ID,Value1,Value2.
This SQL command will insert the data into a staging table TheFile_Staging, which has ID,Value1,Value2 columns with compatible data types, and then update the "real" table TheFile_Table (note: code below not tested!):
TRUNCATE TABLE TheFile_Staging;

BULK INSERT TheFile_Staging FROM 'C:\Data\TheFile.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\r\n', FIRSTROW = 2);
-- FIRSTROW=2 means skip row #1 - use this when the first row is a header.

MERGE TheFile_Table AS target
USING (SELECT ID, Value1, Value2 FROM TheFile_Staging) AS source
ON target.ID = source.ID
WHEN MATCHED THEN
    UPDATE SET target.Value1 = source.Value1, target.Value2 = source.Value2
WHEN NOT MATCHED THEN
    INSERT (ID, Value1, Value2) VALUES (source.ID, source.Value1, source.Value2);
You can create a stored procedure and set it to run on a schedule, or invoke it from code, etc. The only problem with this approach is that error handling in BULK INSERT is a bit of a mess - but as long as your incoming data is OK, it's quite fast.
Normally I'd add some kind of validation check in the WHERE clause of the USING() SELECT of the MERGE to only take the rows that are valid in terms of data.
It's probably also worth pointing out that the definition of the staging table should omit any non-null, primary key and identity constraints, so that the data can be read in without error, especially if there are empty fields here and there in your source data. I also normally prefer to pull in date/time data as plain nvarchar - this way you avoid incorrectly formatted dates causing import errors, and your MERGE statement can perform a CAST or CONVERT as needed, whilst at the same time ignoring and/or logging to an error table any invalid data it comes across.
Sadly you need to move away from Entity Framework in this kind of scenario; out of the box EF only does line-by-line inserts. You can do interesting things like this, or you can completely disregard EF and manually code the class that will do the bulk inserts using ADO.Net (SqlBulkCopy).
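If you do go the SqlBulkCopy-plus-staging route, the managed side might look roughly like this; dbo.MergeTheFile is an assumed stored procedure wrapping the MERGE shown earlier, and the DataTable shape (ID, Value1, Value2) follows the example above:

using System.Data;
using System.Data.SqlClient;

// Sketch only: bulk-load the parsed file into the staging table, then run the MERGE shown
// above (here assumed to be wrapped in a stored procedure named dbo.MergeTheFile).
static void ImportFile(string connectionString, DataTable rows)   // rows: ID, Value1, Value2
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.TheFile_Staging" })
        {
            bulk.BatchSize = 5000;
            bulk.WriteToServer(rows);
        }

        using (var merge = new SqlCommand("dbo.MergeTheFile", conn))
        {
            merge.CommandType = CommandType.StoredProcedure;
            merge.CommandTimeout = 0;
            merge.ExecuteNonQuery();
        }
    }
}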
Edit: you can also stick with the current approach if the performance is acceptable, but you will need to recreate the context periodically rather than using the same context for all records. I suspect that's the reason for the outrageous memory consumption.
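If you do keep EF, a rough sketch of that periodic context recreation, where MyDbContext stands for your existing context and ApplyLine is a placeholder for the per-line check-then-insert-or-update logic you already have:

using System.Collections.Generic;

// Sketch only: dispose and recreate the EF context every batch so the change tracker does
// not grow without bound. MyDbContext and ApplyLine are placeholders for your own code.
static void ImportLines(IEnumerable<string> lines)
{
    const int batchSize = 1000;
    var context = new MyDbContext();
    try
    {
        int count = 0;
        foreach (var line in lines)
        {
            ApplyLine(context, line);
            if (++count % batchSize == 0)
            {
                context.SaveChanges();
                context.Dispose();           // throw away the tracked entities
                context = new MyDbContext();
            }
        }
        context.SaveChanges();
    }
    finally
    {
        context.Dispose();
    }
}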

Efficient Update of Table from One SQL Server to Another, Same Table Structure

I have one database server, acting as the main SQL Server, containing a table to hold all the data. Other database servers (different instances of SQL Server) come in and out. When they come online, they need to download data from the main table (for a given time period); they then generate their own additional data in the same local SQL Server database table, and then want to update the main server with only the new data, using a C# program, through a scheduled service, every so often. Multiple additional servers could be generating data at the same time, although it's not going to be that many.
The main table will always be online. The additional non-main database table is not always online, and should not be an identical copy of main: first it will contain a subset of the main data, then it generates its own additional data in the local table and updates the main table every so often with its updates. There could be a decent number of rows generated and/or downloaded, so an efficient algorithm is needed to copy from the extra database to the main table.
What is the most efficient way to transfer this in C#? SqlBulkCopy doesn't look like it will work because I can't have duplicate entries in the main server, and it would fail the constraint checks since some entries already exist.
You could do it in the DB or in C#. In either case you will need to do something like Using FULL JOINs to Compare Datasets. You know that already.
The most important thing is to do it in transactions. If you have 100k rows, split them into 1,000 rows per transaction, or try to determine what number of rows per transaction is best for you.
Use Dapper. It's really fast.
If you have all your data in C#, use TVP to pass it to DB stored procedure. In stored procedure use MERGE to UPDATE/DELETE/INSERT data.
And last: in C#, use Dictionary<TKey, TValue> or something similar with O(1) access time.
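A rough sketch of the TVP call from C#; dbo.RowTvp and dbo.UpsertRows are hypothetical names for the table type and the MERGE-ing stored procedure:

using System.Data;
using System.Data.SqlClient;

// Sketch only: pass the new/changed rows as a table-valued parameter to a stored procedure
// that MERGEs them into the main table. dbo.RowTvp and dbo.UpsertRows are assumed names.
static void PushRows(string connectionString, DataTable rows)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.UpsertRows", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;

        var p = cmd.Parameters.AddWithValue("@Rows", rows);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.RowTvp";   // CREATE TYPE dbo.RowTvp AS TABLE (...) on the server

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}

The DataTable can be split into chunks (for example 1,000 rows per call, as suggested above) so that each call stays inside a small transaction.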
SqlBulkCopy is the fastest way of inserting data into a table from a C# program. I have used it to copy data between databases, and so far nothing beats it speed-wise. Here is a nice generic example: Generic bulk copy.
I would use an IsProcessed flag in the table on the main server and keep track of the main table's primary keys when you download data to the local DB server. Then you should be able to do a delete and update against the main server again.
Here's how I would do it:
Create a stored procedure on the main database which receives a user-defined table type parameter with the same structure as the main table.
It should do something like:
INSERT INTO yourtable SELECT * FROM @tablevar
Or you could use the MERGE statement for insert-or-update functionality.
In code (a Windows service), load all (or part) of the data from the secondary table and send it to the stored procedure as a table-valued parameter.
You could do it in bulks of 1,000 rows, and each time a bulk is applied you should mark it in the source table / source updater code.
Can you use linked servers for this? If yes it will make copying of data from and to main server much easier.
When copying data back to the main server I'd use IF EXISTS before each INSERT statement to additionally make sure there are no duplicates, and I'd encapsulate all the insert statements in a transaction so that if an error occurs the transaction is rolled back.
I also agree with the others on doing this in batches of 1,000 or so records, so that if something goes wrong you can limit the damage.

More than one insert operation in the same table at a time

I will first explain what I do, then I will specify where the problem is.
My application gets an XML file from an authenticated user (through a file uploader), then I map (I mean migrate) the data stored in the XML file to its equivalent in my database.
I get the data from the XML file through LINQ.
My first question:
Each element in the XML file has an equivalent entity in my database. What is the best and most performant way to insert more than one record into a specific table and guarantee that, if there is something wrong in the data, the whole operation is rolled back? Is there some example of how to do this? Do you have any suggestions concerning validating the XML data?
My second question:
In the first question, I talk about the INSERT operation. If the user changes some data in the XML file, then I want to update my database with the new data. How should I do this? Should I compare each record, or try to insert and, if that fails, then update the record?
Each element in the XML file has an equivalent entity in my database. What is the best and most performant way to insert more than one record into a specific table and guarantee that, if there is something wrong in the data, the whole operation is rolled back? Please, is there some sample of how to do this? And any suggestions concerning validating the XML data?
The simple answer here is: use a transaction. The point of transactions is to provide you with a mechanism whereby you can execute multiple commands, then either commit them as a single unit of work or roll them back completely so that the database is left in a state as if your operations had never taken place.
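For example, using the provider-agnostic ADO.NET interfaces (so the same shape also applies to the Informix provider), a minimal sketch of wrapping all the inserts in one transaction; the SQL text, the "?" parameter markers and the MyItem type are placeholders:

using System.Collections.Generic;
using System.Data;

// Sketch only: every insert runs inside one transaction, and any failure rolls them all back.
// The SQL text, the "?" parameter markers and MyItem are placeholders.
class MyItem { public string Name; public string Value; }

static void ImportItems(IDbConnection conn, IEnumerable<MyItem> items)
{
    conn.Open();
    using (IDbTransaction tran = conn.BeginTransaction())
    {
        try
        {
            foreach (var item in items)
            {
                using (IDbCommand cmd = conn.CreateCommand())
                {
                    cmd.Transaction = tran;
                    cmd.CommandText = "INSERT INTO my_table (name, value) VALUES (?, ?)";

                    var p1 = cmd.CreateParameter(); p1.Value = item.Name; cmd.Parameters.Add(p1);
                    var p2 = cmd.CreateParameter(); p2.Value = item.Value; cmd.Parameters.Add(p2);

                    cmd.ExecuteNonQuery();
                }
            }
            tran.Commit();   // all inserts become visible together
        }
        catch
        {
            tran.Rollback(); // the database is left as if nothing had happened
            throw;
        }
    }
}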
In the first question I talk about the INSERT operation. If the user changes some data in the XML file then I want to update my database with the new data. How do I do this? Should I compare each record, or try to insert and, if that fails, then update the record?
The try-and-fall-back pattern (attempt the insert and, if it fails, retry as an update) is not a desirable one if it can easily be avoided. Your SQL should either use a statement that is designed to conditionally insert or update depending on existing data (such as the SQL Server MERGE command; I don't have Informix experience so I can't speak to what it supports or whether MERGE is ANSI SQL), or you should do this conditional logic yourself within the SQL.
Use an Informix stored procedure for this.
This will allow you to include exception handling for dealing with bad data.
You can then load all your entities into a temp table first, say t_work; if there is a data issue then drop the table and raise an exception.
BEGIN -- start exception handling
    ON EXCEPTION SET esql, eisam
        DROP TABLE t_work;
        RAISE EXCEPTION esql, eisam; -- rethrow the exception
    END EXCEPTION

    -- << Your logic here >>
END
See here for more details: http://www.pacs.tju.edu/informix/answers/english/docs/dbdk/infoshelf/sqlt/14.toc.html#540217

Keeping a history of data changes in database

Every change to a row of data in the database should save the previous row data in some kind of history, so the user can roll back to a previous state of that row. Is there any good practice for that approach? I tried DataContract serialization and deserialization of the data objects, but it becomes a little messy with complex objects.
So to be more clear:
I am using NHibernate for data access and want to stay away from database dependency (for testing I am using SQL Server 2005).
My intention is to provide a data history so that the user can always roll back to some previous version.
An example of usage would be the following:
I have a news article
Somebody makes some changes to that article
The main editor sees that the article has some typos
He decides to roll back to the previous valid version (until the newest version is corrected)
I hope I gave you valid info.
Tables that store changes when the main table changes are called audit tables. You can do this multiple ways:
In the database using triggers: I would recommend this approach because then there is no way that data can change without a record being made. You have to account for 3 types of changes when you do this: Add, Delete, Update. Therefore you need trigger functionality that will work on all three.
Also remember that a transaction can modify multiple records at the same time, so you should work with the full set of modified records, not just the last record (a mistake many people only realise belatedly).
Control will not be returned to the calling program until the trigger execution is completed. So you should keep the code as light and as fast as possible.
In the middle layer using code: This approach will let you save changes to a different database and possibly take some load off the database. However, a SQL programmer running an UPDATE statement will completely bypass your middle layer and you will not have an audit trail.
Structure of the Audit Table
You will have the following columns:
Autonumber PK, TimeStamp, ActionType + All columns from your original table
and I have done this in the following ways in the past:
Table Structure:
Autonumber PK, TimeStamp, ActionType, TableName, OriginalTableStructureColumns
This structure will mean that you create one audit table per data table saved. The data save and reconstruction is fairly easy to do. I would recommend this approach.
Name Value Pair:
Autonumber PK, TimeStamp, ActionType, TableName, PKColumns, ColumnName, OldValue, NewValue
This structure will let you save any table, but you will have to create name value pairs for each column in your trigger. This is very generic, but expensive. You will also need to write some views to recreate the actual rows by unpivoting the data. This gets to be tedious and is not generally the method followed.
Microsoft have introduced new auditing capabilities into SQL Server 2008. Here's an article describing some of the capabilities and design goals which might help in whichever approach you choose.
MSDN - Auditing in SQL Server 2008
You can use triggers for that.
Here is one example.
AutoAudit is a SQL Server (2005, 2008)
Code-Gen utility that creates Audit
Trail Triggers with:
* Created, Modified, and RowVersion (incrementing INT) columns to table
* view to reconstruct deleted rows
* UDF to reconstruct Row History
* Schema Audit Trigger to track schema changes
* Re-code-gens triggers when Alter Table changes the table
http://autoaudit.codeplex.com/
Saving serialized data always gets messy in the end, you're right to stay away from that. The best thing to do is to create a parallel "version" table with the same columns as your main table.
For instance, if you have a table named "book", with columns "id", "name", "author", you could add a table named "book_version" with columns "id", "name", "author", "version_date", "version_user"
Each time you insert or update a record on table "book", your application will also insert into "book_version".
Depending on your database system and the way you access the database from your application, you may be able to completely automate this (cf. the Versionable plugin in Doctrine).
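A hedged sketch of the application-level part, using the book/book_version tables from the example and plain ADO.NET (with NHibernate you would typically hook the same logic into an event listener or interceptor); the connection string and column values are assumed:

using System.Data.SqlClient;

// Sketch only: whenever a book is updated, copy the new state into book_version in the same
// transaction. Table and column names follow the example above; GETUTCDATE() supplies the
// version timestamp.
static void UpdateBook(string connectionString, int id, string name, string author, string user)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tran = conn.BeginTransaction())
        {
            const string sql = @"
                UPDATE book SET name = @name, author = @author WHERE id = @id;

                INSERT INTO book_version (id, name, author, version_date, version_user)
                VALUES (@id, @name, @author, GETUTCDATE(), @user);";

            using (var cmd = new SqlCommand(sql, conn, tran))
            {
                cmd.Parameters.AddWithValue("@id", id);
                cmd.Parameters.AddWithValue("@name", name);
                cmd.Parameters.AddWithValue("@author", author);
                cmd.Parameters.AddWithValue("@user", user);
                cmd.ExecuteNonQuery();
            }

            tran.Commit();
        }
    }
}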
One way is to use a DB which supports this natively, like HBase. I wouldn't normally suggest "Change your DB server to get this one feature," but since you don't specify a DB server in your question I'm presuming you mean this as open-ended, and native support in the server is one of the best implementations of this feature.
What database system are you using? If you're using an ACID (atomicity, consistency, isolation, durability) compliant database, can't you just use the inbuilt rollback facility to go back to a previous transaction?
I solved this problem very nicely by using NHibernate.Envers.
For those interested, read this:
http://nhforge.org/blogs/nhibernate/archive/2010/07/05/nhibernate-auditing-v3-poor-man-s-envers.aspx
