Should I use Table Lock while inserting data using Multithreaded SqlBulkCopy - c#

I am inserting data into a table using SqlBulkCopy from parallel threads. Various links on the internet say TableLock is a good option to use with SqlBulkCopy.
The data is huge and arrives continuously.
Does RowLock give better performance with multithreading? I am confused: with a table lock enabled, the time one thread takes to complete its insert would seem to make the other threads wait, whereas row locking should not affect other inserts.

You should use a table lock; SqlBulkCopy has its own table lock that is concurrent for multiple bulk copy insert operations (the bulk update (BU) lock). Parallel threads will help up until you saturate network I/O. A rule of thumb (from my personal testing) is to use n consumers, where n is 2x the number of cores on the server. Experiment yourself. You most definitely do not want a row lock, as that defeats a lot of the optimizations that the bulk operation provides.
Your best bets for improving performance are:
Keep your batch size as large as possible. You may need to increase the timeout of the operation to prevent failures on long runs (set the timeout to zero for no limit). To do the entire set in one go, set the batch size to zero. If you have a very large data set, look at streaming directly from the source to bulk copy.
Drop any indices on the target table, then defragment and create the index when the load is done. Combined with a table lock, this should minimize the log overhead and improve performance. You also can't use TABLOCK and concurrent operations on a table with a clustered index.

Related

Are there any best practices on upper limits how frequently to run an INSERT?

An application has the requirement to store data in a simple DB (likely a single table, fewer than 10 fields). Data is acquired by polling a remote service at about 10 Hz, and we might get a handful of rows each poll.
The application and DB will be on servers used by other products so while it doesn't need to be super-performant, it mustn't cause serious degradation to other applications or hog resources.
How frequently is it advisable to make SQL INSERTs? Is tens or hundreds of INSERTs per second reasonable or is it considered preferable to batch, maybe once per second or every several seconds? Are there any common practices? DBs aren't my main area so the overhead of individual calls (c#) code->DB (MSSQL) isn't something I know anything about.
This is rather a question of your needs; you have to consider several options:
single-row inserts (these generate overhead)
bulk insert (depending on your needs, this may not be applicable)
If you execute single-row inserts there will be overhead, but you also have better control over, for example, unique keys.
If you execute bulk inserts there is less control over data quality.
A common implementation I use is:
bulk insert into a temporary table
process the data from the temporary table into the main table in chunks of 1000 records (test this yourself, since it depends on server performance and usage, but my past tests gave the best results between 300 and 3,000 records)
I would recommend the following approaches, expressed in pseudocode:
First approach:
For each remote service polling
    start transaction
    if the number of rows < bulk copy threshold
        send all inserts in single batch
    otherwise
        bulk copy rows
    end if
    commit transaction
end for
The bulk copy threshold is the number of rows above which it is more efficient to use bulk copy than a batch of inserts.
To bulk copy rows there is a specialized class, SqlBulkCopy, which is the most efficient way to insert many rows.
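The routing decision in the first approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `insert_batch` and `bulk_copy` are hypothetical callables standing in for the two write paths (each is assumed to open and commit its own transaction), and the default threshold of 100 is a placeholder, not a measured value.

```python
from typing import Callable, Sequence

Row = tuple


def store_poll_result(
    rows: Sequence[Row],
    insert_batch: Callable[[Sequence[Row]], None],  # hypothetical: batched INSERTs
    bulk_copy: Callable[[Sequence[Row]], None],     # hypothetical: bulk copy path
    bulk_copy_threshold: int = 100,                 # assumed value; measure your own
) -> str:
    """Send one poll's rows down the cheaper write path and report which was used."""
    if not rows:
        return "skipped"  # nothing to write for this poll
    if len(rows) < bulk_copy_threshold:
        insert_batch(rows)
        return "insert_batch"
    bulk_copy(rows)
    return "bulk_copy"
```

Returning the chosen path makes it easy to log how often each branch is taken while tuning the threshold.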
If the first approach cannot handle the many rows that are coming, you can use the following second approach.
loop
    while (batch.rows < batch size) and (time elapsed since last bulk copy < batch interval)
        add rows to batch from web service
    end while
    start transaction
    bulk copy batch
    commit transaction
    clear batch
end loop
batch size and batch interval are numbers that you need to adjust.
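A minimal Python sketch of the second approach follows. The names `poll`, `bulk_copy`, `max_batches` and `clock` are assumptions introduced for illustration: `poll` stands in for one call to the remote service, `bulk_copy` for the transactional bulk write, `max_batches` replaces the endless loop so the example terminates, and `clock` is injectable so the timing logic can be tested.

```python
import time
from typing import Callable, Iterable, List


def run_batches(
    poll: Callable[[], Iterable[tuple]],        # hypothetical: one remote poll
    bulk_copy: Callable[[List[tuple]], None],   # hypothetical: transactional bulk write
    batch_size: int,
    batch_interval: float,
    max_batches: int,                           # assumption: bounds the outer loop
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Collect rows until the batch is full or the interval expires, then write."""
    for _ in range(max_batches):
        batch: List[tuple] = []
        started = clock()  # taken right after the previous bulk copy
        while len(batch) < batch_size and clock() - started < batch_interval:
            # A poll can return several rows, so a batch may overshoot batch_size.
            batch.extend(poll())
        if batch:
            bulk_copy(batch)
```

Whichever limit is hit first ends the inner loop, so a quiet source still flushes after `batch_interval` and a busy one flushes as soon as `batch_size` rows are collected.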
The first approach gives you the lowest latency: as soon as the data is fetched from the remote web service it is inserted into SQL Server. But it may not be able to keep up with the volume of incoming rows.
The second approach has greater latency; however, it can handle many more rows.
The second approach could be improved. Instead of waiting for the batch to fill or the interval to expire before sending the data to SQL Server, you can send the rows to SqlBulkCopy continuously. This can be achieved by implementing an IDataReader over the remote service polling. The DataReader would have an internal buffer (a queue) filled from the last call to the remote web service. When the interval has elapsed or the batch size has been reached, and there are no more rows in the queue, DataReader.Read() returns false. If there are no more rows in the queue but neither the interval has elapsed nor the batch size been reached, the DataReader calls the remote web service again to fill the internal queue.
This refinement has some advantages:
It consumes much less memory. You don't need to keep the entire batch in memory, only the internal queue, which holds just a few rows.
It sends rows to SQL Server as they arrive. You don't need to wait for the batch to fill before sending it.
It can handle more rows per second.
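The reader logic described above can be modelled with a Python generator, where the generator ending plays the role of DataReader.Read() returning false. This is a sketch of the control flow only, not an IDataReader implementation; `poll` is a hypothetical stand-in for one call to the remote service and `clock` is injectable for testing.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, Iterator


def stream_rows(
    poll: Callable[[], Iterable[tuple]],  # hypothetical: one remote poll
    batch_size: int,
    batch_interval: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[tuple]:
    """Yield rows one at a time, refilling a small internal queue on demand."""
    queue: Deque[tuple] = deque()
    started = clock()
    sent = 0
    while True:
        if queue:
            # Rows already fetched are always drained before stopping.
            yield queue.popleft()
            sent += 1
            continue
        # Queue is empty: stop if either limit is reached, otherwise poll again.
        if sent >= batch_size or clock() - started >= batch_interval:
            return
        # Assumes poll() blocks or paces itself; an instantly empty poll would spin.
        queue.extend(poll())
```

Only the rows from the most recent poll are held in memory, which is the memory saving the refinement is after.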
In general, running an insert ten times is more expensive than running one insert for 10 rows. There are basically two reasons:
There is overhead for each transaction. 10 transactions have 10x overhead.
There is overhead for running a query. 10x queries have 10x overhead.
(The second is mitigated if the query plan is cached, but even so, looking up the plan in the cache is overhead.)
That said, SQL Server on reasonable hardware should be able to handle tens and probably hundreds of inserts per second. If that meets your application requirements, then there is no need to change anything.
If it doesn't, you can start working on optimizations, such as caching inserts on the application side so you can do larger volume inserts. No doubt, there are tools/libraries on the application side that can help with this.
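As an illustration of caching inserts on the application side, here is a small Python sketch that buffers rows and writes them in one transaction per flush. It uses the standard-library `sqlite3` module purely as a runnable stand-in for SQL Server; the `readings` table, its columns, and the `InsertBuffer` class are hypothetical names invented for the example.

```python
import sqlite3


class InsertBuffer:
    """Collect rows in memory and write them in one transaction per flush."""

    def __init__(self, conn: sqlite3.Connection, flush_at: int) -> None:
        self.conn = conn
        self.flush_at = flush_at
        self.pending: list = []

    def add(self, row: tuple) -> None:
        self.pending.append(row)
        if len(self.pending) >= self.flush_at:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on an exception
            self.conn.executemany(
                "INSERT INTO readings (sensor, value) VALUES (?, ?)", self.pending
            )
        self.pending.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
buf = InsertBuffer(conn, flush_at=10)
for i in range(25):
    buf.add((f"s{i % 3}", float(i)))
buf.flush()  # write the remaining partial batch
```

Here 25 rows cost three transactions instead of 25, which is the per-transaction overhead saving described above.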

EF.NET Core: Multiple insert streams within one transaction

I have a lot of rows (300k+) to upsert into a SQL Server database in the shortest possible time, so the idea was to partition the data and use async to pump it into SQL in parallel, X threads at a time, 100 rows per context, with the context being recycled to minimize tracking overhead. However, that means more than one connection is used in parallel, so CommittableTransaction/TransactionScope would use a distributed transaction, and the parallel transaction enlistment then throws the infamous "This platform does not support distributed transactions." exception.
I do need the ability to commit/rollback the entire set of upserts. It's part of the batch upload process, and any error should roll back the changes to the previously working/stable condition, application-wise.
What are my options? Short of using one connection and no parallelization?
Note: Problem is not so simple as a batch of insert commands, if that was the case, I would just generate inserts and run them on server as query or indeed use SqlBulkCopy. About half of them are updates, half are inserts where new keys are generated by SQL Server which need to be obtained and re-keyed on child objects which would be inserted next, rows are spread over about a dozen tables in a 3-level hierarchy.
Nope. Totally wrong approach. Do NOT use EF for that - bulk insert ETL is not what Object Relational Mappers are made for and a lot of design decisions are not productive for that. You would also not use a small car instead of a truck to transport 20 tons of goods.
300k rows are trivial if you use the SqlBulkCopy API in some form.

update sql server rows, while reading the same table

I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first column is of type nchar(24). It is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with it. The result of my calculations is two strings, and these two strings are the data I want to insert into the two empty columns.
My question is what is the fastest way to read the information from the first column of the table and update the second and third column.
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
With billions of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL Server is on the local machine, throughput is nothing we have to worry about. One calculation takes about 0.02154 seconds, and I have a total of 2,809,475,760 rows; this is about 280 GB of data.
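As a rough sanity check on these figures, the total compute time can be estimated directly. The sketch assumes the calculations run one after another on a single thread and ignores all read and write time; `days_with_workers` further assumes perfect scaling across workers, which real workloads rarely achieve.

```python
rows = 2_809_475_760
seconds_per_calculation = 0.02154

total_seconds = rows * seconds_per_calculation
total_days = total_seconds / 86_400  # seconds per day


def days_with_workers(workers: int) -> float:
    """Idealised estimate: assumes the work divides evenly with no overhead."""
    return total_days / workers


print(f"single-threaded: {total_days:.0f} days")
print(f"8 workers (ideal): {days_with_workers(8):.0f} days")
```

On those assumptions the calculation alone comes to roughly 700 days single-threaded, so the computation itself, not only the database access, is worth budgeting for.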
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while occasionally writing at the same time. Or you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that, in principle. Only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here. Snapshot isolation causes the reader to neither block nor be blocked.
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet. In fact, premature optimization is typically something you should avoid: just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple how-to and no one-size-fits-all solution here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the case is massively different from being on 10Gb fiber.
The computations: how expensive are they? Just c=a+b, or heavy processing of other text files?
These are just a couple of questions raised in response. There is a lot more involved that we are not aware of, which we would need to know to answer correctly.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some time for calculations instead of waiting.
EDIT in reply to comment
If calculations take 20 ms, the biggest problem is I/O. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation so reading is not hampered by write locks, and update in batches. My guess is that the reader stays ahead of the writer without much trouble; reading in batches adds complexity without gaining much.
Find the sweet spot for the batch size by experimenting.

Database table insert locks from a multi threaded application

I have a process that is running multi threaded.
Process has a thread safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in SQL). If one insert fails, all three fail. Note that the scope of the transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one DB connection per thread; I'm mainly concerned with the lock levels on the tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency between rows in a table, except that the auto identity needs to be incremented. If incrementing the identity takes a table-level lock, then I suppose other threads will have to wait. The inserts may or may not be fast. If threads are going to have to wait, does it make sense to use multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent so it does not serialize access. Inserts may block each other if they insert the same key into a unique index (one of them will also hit a duplicate key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables there also shouldn't be conflicts as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially of foreign key constraints may introduce blocking and possible deadlocks. Without an exact schema definition it is impossible to tell whether or not you are susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really really really high end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing the load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement actually contradicts itself. Foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity? I don't see that anywhere in the documentation, and I just tested an insert with (rowlock) on a table with an identity column and it works.
To minimize locking take a rowlock. For all the stored procedures update the tables in the same order.
You have inserts into three table taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and getting 100 / second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your insert and if you can sort before inserting it will make a huge difference. Review the need for any other indexes. If you must have other indexes then monitor the fragmentation and defragment.
Related but not the same: I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was optimal at 4 parallel processes when starting with empty tables, but after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so that the clustered PK index follows insert order, and dropped any other index that did not yield at least a 50% select bump. For the nightly insert I first drop (disable) the indexes and use just two threads: one to parse and one to insert. Then I recreate the indexes at the end of the load. I got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often, I think, indexes are added for small select benefits without considering the hit to insert and update. The select benefit is also often overvalued, because you build the index and compare, and that fresh index has no fragmentation.
Heavy-duty DBMSs like MSSQL are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not is always hard to guess without knowing anything about your system. It's not hard to try it out though, so you can find out yourself. If I were to guess, I would say it won't help you much if all your threads are going to do is insert rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads at any given moment than it has CPU cores.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can automatically take as given. The level of parallelism and its effect on processing speed depend on lots of factors that are very processing-dependent, such as whether there is I/O involved or whether each thread does in-memory processing only. This is, I think, one of the reasons Microsoft offers task schedulers in the TPL, and generally treats concurrency in this library as something to be set at runtime.
I think your safest bet is to run test queries/processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how they handle identity columns, though).

Parallel processing of database queue

There is a small system that uses a database table as a queue on MSSQL 2005. Several applications write to this table, and one application reads and processes the entries in a FIFO manner.
I have to make it a little more advanced so that it can become a distributed system in which several processing applications can run. The result should be that 2-10 processing applications can run without interfering with each other.
My idea is to extend the queue table with a column showing that a process is already working on a row. The processing application will first update the table with its identifier, and then ask for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from queue where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
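The claim-then-read pattern can be tried out end to end with a small Python sketch. It uses the standard-library `sqlite3` module as a runnable stand-in, so the SQL differs from SQL Server (SQLite has no `UPDATE TOP`, hence the `LIMIT` subquery), and a single connection does not reproduce the locking behaviour of concurrent consumers. The `id` and `payload` columns and the worker names are assumptions added for the example.

```python
import sqlite3


def claim_batch(conn: sqlite3.Connection, worker_id: str, limit: int) -> list:
    """Mark up to `limit` unclaimed rows for this worker, then return its rows."""
    with conn:  # both statements run in one transaction
        conn.execute(
            """
            UPDATE queue SET processing = ?
            WHERE id IN (
                SELECT id FROM queue
                WHERE processing IS NULL
                ORDER BY id          -- oldest first, to keep FIFO order
                LIMIT ?
            )
            """,
            (worker_id, limit),
        )
        return conn.execute(
            "SELECT id, payload FROM queue WHERE processing = ? ORDER BY id",
            (worker_id,),
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT, processing TEXT)"
)
conn.executemany(
    "INSERT INTO queue (payload) VALUES (?)", [(f"job-{i}",) for i in range(25)]
)
conn.commit()

first = claim_batch(conn, "worker-a", 10)
second = claim_batch(conn, "worker-b", 10)
```

Each worker only ever sees rows stamped with its own identifier, so the two batches cannot overlap as long as the update and the select stay inside the same transaction.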
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: in MSSQL the locking is row based, but once a certain number of rows are locked, the lock is escalated to the whole table, so the second application cannot access it until the first application commits its transaction. How big can the selection (top x) be so that only row locks are taken and the whole table is not locked?
This will work, but you'll probably find you'll run into blocking or deadlocks where multiple processes try and read/update the same data. I wrote a procedure to do exactly this for one of our systems which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the row/ table will only be locked while the update and select operations take place, so I doubt the row vs table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, and so will also be interested in any more expert answers
In regards to your question about locking. You can use a locking hint to force it to lock only rows
update mytable with (rowlock) set x=y where a=b
Biggest problem with this approach is that you increase the number of 'updates' to the table. Try this with just one process consuming (update + delete) and others inserting data in the table and you will find that at around a million records, it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.
