Batch Updates using DataAdapter

Batch Updates using DataAdapter - c#

I have a situation where I have a bunch of SQL Update commands that all need to be executed. I know that DataSets can do batch updates, but the only way I've been able to accomplish it is to load the whole table into a dataset first. What if I want to only update a subset of the records in a table?

The easiest way out here is load table with the required rows (rows you wan to update). Double check that the RowState is "Inserted".
Assign the InsertCommand property of the adapter with your stored procedure (wrapped in an SqlCommand) that does the "update", this tweak will ensure that all the rows present in the table gets updated.
The fundamental here being: The DataAdapter runs the UpdateCommand on rows for which state is Updated, runs InsertCommand for rows which state is Inserted and finally DeleteCommand for the rows for which state is Deleted.

EDITED: Based on your comment, I'd recommend using the Bulk Copy method to go to a staging table first. Then you can do a single update on your real table based on the staging table.
=========
One way is to build the SQL Command yourself; however, I'd recommend reading about the SQL Injection possibilities to protect yourself. Depending on your situation and your platform there are other options.
For example if you are dealing with a lot of data, then you could do a bulk import into a holding table, which you would then issue a single update command off of. I've also had good success passing records in as XML (I found in my environment that I needed at least 50 rows to offset the cost of loading the DOM, and I knew that the scalability issues were not a factor in my case).
Some other things I've seen was people package the updates into a binary field, for example one binary field for each column, then they have a function which unpacks the binary field. Again you will want to test with your environment.
With that said, have you verified that simply calling a single update command or stored procedure for each update you need is not sufficent? You might not even need to batch the commands up.
Josh
Watch out for Premature Optimization it can be killer!

Related

Azure SQL handling table locks for huge inserts

In my application, I have a windows service (hosted as quartz on azure web) which runs after a particular time and reads a file and inserts data. The data can be of any length so the type in DB is "text". All records are inserted in one table.
Problem is that the service might run in parallel and try insert the records in table at the same time. Since the data might be huge and I want to have performance also, I want to make the service run in parallel. I am using EF 6.0 and LINQ. Is there a possible way to not lock the table and insert huge data.
Note: Bulk insert might not work as the data type to insert have 'text' as well.

I am assuming that there is one column of type "text" in your table and nothing else. Based on the above assumptions, I would think that there would not be fundamental semantic problems if you want to run this in parallel.
There are a couple of ways of doing this:
use SqlBulkCopy class with row batches. This will copy batches of rows and load them into the table. This can be done in parallel.
use a temporary staging table for each service, and then run INSERT INTO SELECT for each staging table. The INSERT portion will be in serial because it will take an X lock on the table, but this may be faster than loading one row by another.
Neither of these methods uses EF 6.0 and LINQ, but they will get the job done.
Hope this helps!

bypassing Entity Framework to import / update a large amount of data

i have a fully working production site based on entity framework and now i need to import a large amount of data weekly into the database.
the data comes in the form of text files which i go through line by line, check against the database to see if it exists and if it does update anything that has changed or just insert it if not.
the problem im having is that it takes around 32 hours to run the full import process and some of the files have to be manually split into smaller chunks to avoid memory issues seemingly caused by entity framework. i have managed to slow down the memory increase but the last time i ran a file without splitting it, it ran for about 12 hours before running out of memory at somewhere over 1.5gb.
so can someone suggest to me the best way of importing this data, i have heard of sqlbulkcopy but wasnt sure if it was the correct thing to use. can anyone provide any examples? or suggest anything more appropriate. for instance, should i create a duplicate of the entity using standard .net sql commands and possibly use a stored procedure

Although SqlBulkCopy is handy from managed code,I reckon the fastest way is to do it is in "pure" sql -- given that SqlBulkCopy doesn't easily do upserts, you would need to execute the MERGE part below anyway
Assuming that your text file is in csv format, and it exists on the SQL Server as "C:\Data\TheFile.txt", and that line endings are normalised as CR-LF (\r\n)
And let's assume that the data is ID,Value1,Value2
this SQL command will insert into a staging table TheFile_Staging which has ID,Value,Value2 columns with compatible data types, and then update the "real" table TheFile_Table (note: code below not tested!)
truncate table TheFile_Staging
BULK INSERT TheFile_Staging FROM'C:\Data\TheFile.txt'
WITH (fieldterminator=',', rowTerminator='\r\n',FirstRow=2)
//FirstRow=2 means skip Row#1 - use this when 1st row is a header.
MERGE TheFile_Table as target
USING (SELECT ID,Value1,Value2 from TheFile_Staging) as source
on target.ID = source.ID
WHEN MATCHED THEN
UPDATE SET target.Value1=source.Value1, target.Value2=source.target2
WHEN NOT MATCHED THEN
INSERT (id,Value1,Value2) VALUES (source.Id,source.Value1,source.Value2);
You can create a stored procedure and set it to run or invoke from code, etc. The only problem with this approach is error handling bulk insert is a bit of a mess - but as long as your data coming in is ok then it's as quite fast.
Normally I'd add some kind of validation check in the WHERE clause us the USING() select of the MERGE to only take the rows that are valid in terms of data.
It's probably also worth pointing out that the definition of the staging table should omit any non-null, primary key and identity constraints, in order that the data can be read in without error esp. if there are empty fields here and there in your source data; and I also normally prefer to pull in date/time data as a plain nvarchar - this way you avoid incorrectly formatted dates causing import errors and your MERGE statement can perform a CAST or CONVERT as needed whilst at the same time ignoring and/or logging to an error table any invalid data it comes across.

Sadly you need to move away from Entity Framework in this kind of scenario; out of the box EF only does line-by-line inserts. You can do interesting things like this, or you can completely disregard EF and manually code the class that will do the bulk inserts using ADO.Net (SqlBulkCopy).
Edit: you can also keep with the current approach if the performance is acceptable, but you will need to recreate the context periodically, not use the same context for all records. I suspect that's the reason for the outrageous memory consumption.

Efficient Update of Table from One SQL Server to Another, Same Table Structure

I have one database server, acting as the main SQL Server, containing a Table to hold all data. Other database servers come in and out (different instances of SQL Server). When they come online, they need to download data from main Table (for a given time period), they then generate their own additional data to the same local SQL Server database table, and then want to update the main server with only new data, using a C# program, through a scheduled service, every so often. Multiple additional servers could be generating data at the same time, although it's not going to be that many.
Main table will always be online. The additional non-main database table is not always online, and should not be an identical copy of main, first it will contain a subset of the main data, then it generates its own additional data to the local table and updates main table every so often with its updates. There could be a decent amount of number of rows generated and/or downloaded. so an efficient algorithm is needed to copy from the extra database to the main table.
What is the most efficient way to transfer this in C#? SqlBulkCopy doesn't look like it will work because I can't have duplicate entries in main server, and it would fail if checking constraints since some entries already exist.

You could do it in DB or in C#. In all cases you must do something like Using FULL JOINs to Compare Datasets. You know that already.
Most important thing is to do it in transaction. If you have 100k rows split it to 1000 rows per transaction. Or try to determine what combination of rows per transaction is best for you.
Use Dapper. It's really fast.
If you have all your data in C#, use TVP to pass it to DB stored procedure. In stored procedure use MERGE to UPDATE/DELETE/INSERT data.
And last. In C# use Dictionary<Tkey, TValue> or something different with O(1) access time.

SQLBulkCopy is the fastest way for inserting data into a table from a C# program. I have used it to copy data between databases and so far nothing beats it speed wise. Here is a nice generic example: Generic bulk copy.
I would use a IsProcessed flag in the table of the main server and keep track of the main table's primary keys when you download data to the local db server. Then you should be able to do a delete and update to the main server again.

Here's how i would do it:
Create a stored procedure on the main table database which receives a user defined table variable with the same structure as the main table.
it should do something like -
INSERT INTO yourtable (SELECT * FROM tablevar)
OR you could use the MERGE statement for the Insert-or-Update functionality.
In code, (a windows service) load all (or a part of) the data from the secondery table and send it to the stored procedure as a table variable.
You could do it in bulks of 1000's and each time a bulk is updated you should mark it in the source table / source updater code.

Can you use linked servers for this? If yes it will make copying of data from and to main server much easier.
When copying data back to the main server I’d use IF EXISTS before each INSERT statement to additionally make sure there are no duplicates and encapsulate all insert statements into transaction so that if an error occurs transaction is rolled back.
I also agree with others on doing this in batches on 1000 or so records so that if something goes wrong you can limit the damage.

Keeping a history of data changes in database

Every change of data in some row in database should save the previous row data in some kind of history so user can rollback to previous row data state. Is there any good practice for that approach? Tried with DataContract and serializing and deserializing data objects but it becomes little messy with complex objects.
So to be more clear:
I am using NHibernate for data access and want to stay out off database dependency (For testing using SQL server 2005)
What is my intention is to provide data history so every time user can rollback to some previous versions.
An example of usage would be the following:
I have a news article
Somebody make some changes to that article
Main editor see that this news has some typos
It decides to rollback to previous valid version (until the newest version is corrected)
I hope I gave you valid info.

Tables that store changes when the main table changes are called audit tables. You can do this multiple ways:
In the database using triggers: I would recommend this approach because then there is no way that data can change without a record being made. You have to account for 3 types of changes when you do this: Add, Delete, Update. Therefore you need trigger functionality that will work on all three.
Also remember that a transaction can modify multiple records at the same time, so you should work with the full set of modified records, not just the last record (as most people belatedly realize they did).
Control will not be returned to the calling program until the trigger execution is completed. So you should keep the code as light and as fast as possible.
In the middle layer using code: This approach will let you save changes to a different database and possibly take some load off the database. However, a SQL programmer running an UPDATE statement will completely bypass your middle layer and you will not have an audit trail.
Structure of the Audit Table
You will have the following columns:
Autonumber PK, TimeStamp, ActionType + All columns from your original table
and I have done this in the following ways in the past:
Table Structure:
Autonumber PK, TimeStamp, ActionType, TableName, OriginalTableStructureColumns
This structure will mean that you create one audit table per data table saved. The data save and reconstruction is fairly easy to do. I would recommend this approach.
Name Value Pair:
Autonumber PK, TimeStamp, ActionType, TableName, PKColumns, ColumnName, OldValue, NewValue
This structure will let you save any table, but you will have to create name value pairs for each column in your trigger. This is very generic, but expensive. You will also need to write some views to recreate the actual rows by unpivoting the data. This gets to be tedious and is not generally the method followed.

Microsoft have introduced new auditing capabilities into SQL Server 2008. Here's an article describing some of the capabilities and design goals which might help in whichever approach you choose.
MSDN - Auditing in SQL Server 2008

You can use triggers for that.
Here is one example.
AutoAudit is a SQL Server (2005, 2008)
Code-Gen utility that creates Audit
Trail Triggers with:
* Created, Modified, and RowVerwsion (incrementing INT) columns to table
* view to reconstruct deleted rows
* UDF to reconstruct Row History
* Schema Audit Trigger to track schema changes
* Re-code-gens triggers when Alter Table changes the table
http://autoaudit.codeplex.com/

Saving serialized data always gets messy in the end, you're right to stay away from that. The best thing to do is to create a parallel "version" table with the same columns as your main table.
For instance, if you have a table named "book", with columns "id", "name", "author", you could add a table named "book_version" with columns "id", "name", "author", "version_date", "version_user"
Each time you insert or update a record on table "book", your application will also insert into "book_version".
Depending on your database system and the way you database access from your application, you may be able to completely automate this (cfr the Versionable plugin in Doctrine)

One way is to use a DB which supports this natively, like HBase. I wouldn't normally suggest "Change your DB server to get this one feature," but since you don't specify a DB server in your question I'm presuming you mean this as open-ended, and native support in the server is one of the best implementations of this feature.

What database system are you using? If you're using an ACID (atomicity, consistency, isolation, durability) compliant database, can't you just use the inbuilt rollback facility to go back to a previous transaction?

I solved this problem very nice by using NHibernate.Enverse
For those intersted read this:
http://nhforge.org/blogs/nhibernate/archive/2010/07/05/nhibernate-auditing-v3-poor-man-s-envers.aspx

Maintain a local copy of a table from an external database table, ADO.NET

We have built an application which needs a local copy of a table from another database. I would like to write an ado.net routine which will keep the local table in sync with the master. Using .net 2.0, C# and ADO.NET.
Please note I really have no control over the master table which is in a third party, mission critical app I don't wish to mess with.
For example Here is the master data table:
ProjectCodeId Varchar(20) [PK]
ProjectCode Varchar(20)
ProjectDescrip Varchar(50)
OtherUneededField int
OtherUneededField2 int
The local table we need to keep in sync...
ProjectCodeId Varchar(20) [PK]
ProjectCode Varchar(20)
ProjectDescrip Varchar(50)
Perhaps a better approach to this question is what have you done in the past to this type of problem? What has worked best for you or should be avoided at all costs?
My goal with this question is to determine a good way to handle this. So often I am combining data from two or more disjointed data sources. I haven't included database platforms for this reason, it really shouldn't matter. In this current situation both databases are MSSQL, but I prefer the solution not use linked databases or DTS, etc.
Sure, truncating the local table and refilling it each time from the master is an option, but with thousands of rows I don't think this is very efficient. Do you?

EDIT: First, recognize that what you are doing is hand-rolled replication and replication is never simple.
You need to track and apply all of the CRUD state changes. That said, ADO.NET can do this.
To track changes to the source you can use Query Notification with your source database. This requires special permission against the database so the owner of the source database will need to take action to enable this solution. I haven't used this technique myself, but here is a description of it.
See "Query Notifications in SQL Server (ADO.NET)"
Query notifications were introduced in
Microsoft SQL Server 2005 and the
System.Data.SqlClient namespace in
ADO.NET 2.0. Built upon the Service
Broker infrastructure, query
notifications allow applications to be
notified when data has changed. This
feature is particularly useful for
applications that provide a cache of
information from a database, such as a
Web application, and need to be
notified when the source data is
changed.
To apply changes from the source db table you need to retrieve the data from the target db table, apply the changes to the target rows and post the changes back to the target db.
To apply the changes you can either
1) Delete and reinsert all of the rows (simple), or
2) Merge row-by-row changes (hard).
Delete and reinsert is self explanatory, so I won't go into detail on that.
For row-by-row change tracking here is an approach. (I am assuming here that Query Notification doesn't give you row-by-row change information, so you have to calculate it.)
You need to determine which rows were modified and identify inserted and deleted rows. Create a DataView with a sort for each table to get a Find method you can use to lookup matching rows by ID.
Identify modified rows by using a datetime/timestamp column, or by comparing all field values. Copy modified values to the target row.
Identify added and deleted rows by looping over the respective table DataViews and using the Find method of the other DataView to identify rows that do not appear in the first table. Insert or delete rows from the target table as required. (The Delete method doesn't remove the row but marks it for deletion by the TableAdapter Update.)
Good luck!
+tom

I would push in the direction where the application that is inserting the data would insert into one db/table then the other in the same function. Make the application do the work, the db will be pushed already.

Some questions - what db platform? how are you using the data?
I'm going to assume you're just using this data as a lookup... and as you have no timestamp and no ability modify the existing table, i'd just blow away the local copy periodically and pull it down from the master table again.
Unless you've got a hell of a lot of data the overhead for this should be pretty small.
If you need to synch back to the master table, you'll need to do something a bit more exotic.

Can you use SQL replication? This would be preferable to writing code to do it no?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.