I have a multi-threaded process.
The process has a thread-safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent by the thread to a stored procedure that inserts data into 3 tables in a transaction (in SQL). If one insert fails, all three fail. Note that the scope of the transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one DB connection per thread; I'm mainly concerned with the lock levels on the tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency between rows per table, except that the auto-identity needs to be incremented. If incrementing the identity requires a table-level lock, then I suppose the other threads will have to wait. The inserts may or may not be fast. If threads are going to have to wait, does it make sense to do multithreading at all?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
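To make the shape of the work concrete, here is a minimal sketch of the kind of per-item stored procedure I mean (the names Master, ChildA, ChildB and SomeValue are made up purely for illustration, and THROW requires SQL Server 2012 or later):

create procedure dbo.InsertItem
    @SomeValue nvarchar(100)
as
begin
    set nocount on;
    declare @MasterId int;

    begin try
        begin transaction;

        -- parent row; capture the identity value for the child rows
        insert into dbo.Master (SomeValue) values (@SomeValue);
        set @MasterId = scope_identity();

        -- child rows, foreign-keyed to the parent
        insert into dbo.ChildA (MasterId) values (@MasterId);
        insert into dbo.ChildB (MasterId) values (@MasterId);

        commit transaction;
    end try
    begin catch
        -- if any insert fails, all three are rolled back
        if @@trancount > 0 rollback transaction;
        throw;
    end catch
end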
In SQL Server, multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent, so it does not serialize access. Inserts may block each other if they insert the same key into a unique index (one of them will also hit a duplicate key violation if both attempt to commit). There is also a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables, there shouldn't be conflicts either, as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially the foreign key constraints may introduce blocking and possibly deadlocks. Without an exact schema definition it is impossible to tell whether you are or are not susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention and can potentially cause blocking and deadlocks.
Really, really high-end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot-spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement actually contradicts itself. Foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity? I don't see that in any of the documentation, and I just tested an insert with (rowlock) on a table with an identity column and it works.
To minimize locking, take a row lock. In all the stored procedures, update the tables in the same order.
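For example, the hint goes directly on the insert statement (dbo.Master and SomeValue are just placeholder names here):

insert into dbo.Master with (rowlock) (SomeValue)
values (N'example');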
You have inserts into three tables taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and they run at about 100 per second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your inserts, and if you can sort before inserting, it will make a huge difference. Review the need for any other indexes. If you must have other indexes, then monitor the fragmentation and defragment.
Related but not the same: I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was fastest at 4 parallel processes starting with empty tables, but the problem was that after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the clustered PK index was on insert order, and dropped any other index that did not yield at least a 50% select improvement. On the nightly insert I first drop (disable) the indexes and use just two threads: one to parse and one to insert. Then I recreate the indexes at the end of the load. I got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often indexes are added for small select benefits without considering the cost to inserts and updates. Also, the select benefit is often overvalued, because you build the index and compare while that fresh index has no fragmentation.
Heavy-duty DBMSs like MS SQL Server are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not is always hard to guess without knowing anything about your system. It's not hard to try it out, though, so you can find that out yourself. If I had to guess, I would say it won't help you much if all your threads are going to do is insert rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads than the CPU cores it has at any given moment.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can take as given/correct automatically. The level of parallelism and its effect on processing speed depend on many factors, which are very processing-dependent, such as whether there is I/O involved, or whether each thread is supposed to do in-memory processing only. This is, I think, one of the reasons Microsoft offers the task schedulers in their TPL framework, and generally treats the concurrency in this library as something that is supposed to be set at runtime.
I think your safest bet is to run test queries/processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how they handle identity columns, though).
There is something that worries me about my application. I have a SQL query that does a bunch of inserts into the database across various tables. I timed how long it takes to complete the process, and it takes about 1.5 seconds. At this point I'm not even done developing the query; I still have more inserts to add. So I fully expect this process to take even longer, perhaps up to 3 seconds.
Now, it is important that all of this data be consistent and finish either completely or not at all. So what I'm wondering is: is it OK for a transaction to take that long? Doesn't it lock up the tables, so selects, inserts, updates, etc. cannot run until the transaction is finished? My concern is that if this query is run frequently it could lock up the entire application, so that certain parts of it become either incredibly slow or unusable. With a low user base I doubt this would be an issue, but if my application should gain some traction, this query could potentially run a lot.
Should I be concerned about this, or am I missing something about how the database actually behaves? I'm using a SQL Server 2014 database.
To note, I timed this by starting a C# Stopwatch immediately before the transaction starts and stopping it right after the changes are committed, so it's about as accurate as can be.
You're right to be concerned about this, as a transaction will lock the rows it has written until the transaction commits, which can certainly cause problems such as deadlocks and temporary blocking that slow the system response. But there are various factors that determine the potential impact.
For example, you probably largely don't need to worry if your users only update and query their own data, and your tables have indexing to support both the read and the write query criteria. That way each user's row locking will largely not affect the other users, depending on how you write your code, of course.
If your users share data, and you want to be able to support efficient searching across multiple users' data even with multiple concurrent updates, for example, then you may need to do more.
Some general concepts:
-- Ensure your transactions write to tables in the same order
-- Keep your transactions as short as possible by preparing the data to be written as much as possible before starting the transaction.
-- If this is a new system (and even if not new), definitely consider enabling Snapshot Isolation and/or Read Committed Snapshot Isolation on the database. SI will (when explicitly set on the session) allow your read queries not to be blocked by concurrent writes. RCSI will allow all your read queries by default not to be blocked by concurrent writes. But read this to understand both the benefits and gotchas of both isolation levels: https://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
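For reference, a sketch of how the two options are enabled at the database level (MyDb is a placeholder; the first statement needs exclusive access, or the WITH ROLLBACK IMMEDIATE option shown here):

-- Read Committed Snapshot Isolation: readers see the last committed version
-- instead of blocking on writers; no per-session change is needed.
alter database MyDb set read_committed_snapshot on with rollback immediate;

-- Snapshot Isolation: must additionally be requested per session to take effect.
alter database MyDb set allow_snapshot_isolation on;

-- in a session that wants snapshot semantics:
set transaction isolation level snapshot;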
I think it depends on your code: how effectively you use loops, your select queries, and the other statements.
I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first column is of type nchar(24). It is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with it. The results of my calculation are two strings, and these two strings are the data I want to write into the two empty columns.
My question is: what is the fastest way to read the information from the first column of the table and update the second and third columns?
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As this involves billions of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL server is on the local machine, throughput is nothing we have to worry about. One calculation takes about 0.02154 seconds, and I have a total of 2,809,475,760 rows, which is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all the rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while writing occasionally at the same time. Or you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that in principle; only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here, since snapshot isolation causes the reader to neither block nor be blocked.
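A sketch of the bulk-insert-then-update approach, assuming the nchar(24) column can serve as a key (table and column names below are hypothetical; the client bulk-loads each batch of computed values into the staging table before running the update):

-- staging table holding the computed values for one batch
create table dbo.UpdateBatch
(
    KeyCol nchar(24) not null primary key,
    Col2   nchar(24) not null,
    Col3   nchar(24) not null
);

-- after the client bulk-inserts a batch into dbo.UpdateBatch,
-- apply the whole batch in a single set-based statement
update t
set    t.Col2 = b.Col2,
       t.Col3 = b.Col3
from   dbo.BigTable as t
join   dbo.UpdateBatch as b on b.KeyCol = t.KeyCol;

-- clear the staging table for the next batch
truncate table dbo.UpdateBatch;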
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; prematurely optimizing is typically something you should avoid. Just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple how and a one-solution-fits all here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the case is massively different than on 10 Gb fiber.
The computations, how expensive are they? Just c = a + b, or heavy processing of other text files?
Just a couple of questions raised in response. As such, there is a lot more involved that we are not aware of, so it is hard to answer correctly.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some of the time for calculations instead of waiting.
EDIT in reply to comment
If calculations take 20 ms, the biggest problem is I/O. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation, so the read is not hampered by write locks, and update in batches. My guess is that the reader stays ahead of the writer without much trouble; reading in batches adds complexity without gaining much.
Find the sweet spot for the right batch size by experimenting.
I have a system that I am trying to build that matches users up in real time. Based on specific criteria, the users are matched up 1-to-1. I have the following database tables (kind of like a chat-roulette type system):
Pool
+UserId
+Gender
+City
+Solo (bool)
Matched
+UserId
+PartnerId
When a user enters a certain page, they are added to the Pool table with Solo set to true. The system then searches for another user by querying the Pool table and returns results where Solo is true (meaning they do not have a partner) and Gender and City are equal to whatever is queried. If a match is returned, both users are put in the Matched table and their Solo columns in the Pool table are set to false. If they break the connection, they are deleted from the Matched table and their Solo columns are changed back to true. I am having trouble architecting this in a way that is thread-safe and handles concurrency. Here are some questions that I got stuck on:
-What if 2 users query the Pool table at the same time and both get back the same "solo" user? How do I prevent this?
-What if 1 user queries the Pool before another user's Solo column gets changed, so that user is returned in the result set, but he is technically not solo?
-What other concurrency/thread-safety issues do I face? Is there a better way than this?
A very easy way to solve this is to wrap your algorithm in a transaction, set the isolation level to serializable and retry the whole business operation in case of a deadlock. This should solve all of your concerns in the question.
Making your application deadlock resistant is not easy in such a complex case. I understand you are just getting started with locking and concurrency in the database. This solution, although not perfect, is likely to be enough.
If you require more sophistication you probably need to do some research around pessimistic locking.
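A rough sketch of what the claim-a-partner step could look like inside such a transaction (the @UserId, @Gender and @City parameters are made up; the UPDLOCK hint is an optional extra that lowers, but does not remove, the chance of deadlocks, so the caller still retries on error 1205):

set transaction isolation level serializable;
begin transaction;

declare @PartnerId int;

-- pick one candidate and lock the row so a concurrent matcher cannot pick it too
select top (1) @PartnerId = UserId
from dbo.Pool with (updlock, rowlock)
where Solo = 1 and Gender = @Gender and City = @City and UserId <> @UserId;

if @PartnerId is not null
begin
    insert into dbo.Matched (UserId, PartnerId) values (@UserId, @PartnerId);
    update dbo.Pool set Solo = 0 where UserId in (@UserId, @PartnerId);
end

commit transaction;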
I think part of the issue is that the "Solo" field is redundant; it simply indicates whether there is a valid entry in the "Matched" table or not. I'd recommend removing it and instead just joining the "Pool" table to the "Matched" table, so you don't have to worry about issues with keeping the two in sync.
Other concurrency issues can be solved by using a concurrency tracking field. See Handling Concurrency with the Entity Framework in an ASP.NET MVC Application.
Also, just my opinion, but you may want to consider having an "audit" type of table for saving the deleted entries, or switch to using a temporal table for "Matches", rather than simply deleting the entries. That information could be very useful in the future, for tweaking your matching algorithm and such.
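If you go the temporal-table route (SQL Server 2016 or later; the column types below are assumptions), a sketch might look like this, with deleted pairings kept automatically in the history table:

create table dbo.Matched
(
    UserId    int not null primary key,
    PartnerId int not null,
    ValidFrom datetime2 generated always as row start not null,
    ValidTo   datetime2 generated always as row end not null,
    period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.MatchedHistory));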
Queueing is a good idea. If each request for a match were queued, there would be no contention or deadlocks when querying the pool.
I am inserting data into a table using SqlBulkCopy from parallel threads. Various links on the internet say that TableLock is a good option to use with SqlBulkCopy.
Data is huge and continuous.
Does RowLock give better performance with multithreading? I am confused: the time taken by one thread to complete its insert will make the other threads wait if a table lock is taken, whereas row locking should not have that effect on the inserts.
You should use a table lock; SqlBulkCopy has its own table lock that is concurrent for multiple bulk copy insert operations (the bulk update (BU) lock). Parallel threads will help up until you saturate network I/O. A rule of thumb (from my personal testing) is to use n consumers, where n is 2x the number of cores on the server. Experiment yourself. You most definitely do not want a row lock, as that defeats a lot of the optimizations that the bulk operation provides.
Your best bets for improving performance are:
- Keep your batch size as large as possible. You may need to increase the timeout of the operation to prevent failures on long runs (set the timeout to zero for no limit). To do the entire set in one go, set the batch size to zero. If you have a very large data set, look at streaming directly from the source to the bulk copy.
- Drop any indexes on the target table, then defragment and recreate the indexes when you're done with the load. Combined with a table lock this should minimize the log overhead and improve performance. Note that you can't combine TABLOCK with concurrent bulk operations on a table that has a clustered index.
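For the index part, a sketch in T-SQL (the index and table names are placeholders; the SqlBulkCopy load with the TableLock option runs from the client in between the two statements, and only nonclustered indexes should be disabled this way):

-- before the load: disable nonclustered indexes so they are not maintained row by row
alter index IX_Target_SomeColumn on dbo.TargetTable disable;

-- ... run the SqlBulkCopy load(s) here ...

-- after the load: rebuild to bring the index back online, freshly built and unfragmented
alter index IX_Target_SomeColumn on dbo.TargetTable rebuild;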
There is a small system where a database table serves as a queue, on MSSQL 2005. Several applications write to this table, and one application reads and processes the rows in a FIFO manner.
I have to make it a little bit more advanced so I can create a distributed system in which several processing applications can run. The result should be that 2-10 processing applications can run, and that they do not interfere with each other while working.
My idea is to extend the queue table with a column showing that a process is already working on a row. The processing application will first update the table with its identifier and then ask for the updated records.
So something like this:
begin transaction
update top (10) queue set processing = 'myid' where processing is null
select * from queue where processing = 'myid'
commit transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: in MSSQL the locking is row-based, but after a certain number of rows are locked, the lock is escalated to the whole table, so the second application cannot access it until the first application releases its transaction. How big can the selection (top x) be so that it only creates row locks and does not lock the whole table?
This will work, but you'll probably find you'll run into blocking or deadlocks where multiple processes try and read/update the same data. I wrote a procedure to do exactly this for one of our systems which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
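The usual shape of such a dequeue is a single statement with locking hints (this is a generic pattern rather than the exact procedure from the link; READPAST lets concurrent workers skip rows that another worker has already locked, so they don't block on each other):

declare @WorkerId varchar(50);
set @WorkerId = 'myid';

update top (10) q
set q.processing = @WorkerId
output inserted.*
from queue as q with (rowlock, updlock, readpast)
where q.processing is null;

Because claiming the rows and reading them back happen in one atomic statement, the explicit transaction around the separate update and select is no longer needed.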
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the rows/table will only be locked while the update and select operations take place, so I doubt the row vs. table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, and so will also be interested in any more expert answers
In regards to your question about locking: you can use a locking hint to force it to lock only rows:
update mytable with (rowlock) set x=y where a=b
The biggest problem with this approach is that you increase the number of updates to the table. Try this with just one process consuming (update + delete) and others inserting data into the table, and you will find that at around a million records it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.