There is small system, where a database table as queue on MSSQL 2005. Several applications are writing to this table, and one application is reading and processing in a FIFO manner.
I have to make it a little bit more advanced to be able to create a distributed system, where several processing application can run. The result should be that 2-10 processing application should be able to run and they should not interfere each other during work.
My idea is to extend the queue table with a row showing that a process is already working on it. The processing application will first update the table with it's idetifyer, and then asks for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from processing where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: In MSSQL the locking is row based, but after an amount of rows are locked, the lock is extended to the whole table. So the second application cannot access it, until the first application does not release the transaction. How big can be the selection (top x) in order to not lock the whole table, only create row locks?
This will work, but you'll probably find you'll run into blocking or deadlocks where multiple processes try and read/update the same data. I wrote a procedure to do exactly this for one of our systems which uses some interesting locking semantics to ensure this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the row/ table will only be locked while the update and select operations take place, so I doubt the row vs table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, and so will also be interested in any more expert answers
In regards to your question about locking. You can use a locking hint to force it to lock only rows
update mytable with (rowlock) set x=y where a=b
Biggest problem with this approach is that you increase the number of 'updates' to the table. Try this with just one process consuming (update + delete) and others inserting data in the table and you will find that at around a million records, it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver processing data to other consumers.
Related
I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns, the first column is of type nchar(24). It is filled with billion of rows. The other two columns are from the same type, but they are null (empty) at this moment.
I need to read the data from the first column, with this information I do some calculations. The result of my calculations are two strings, this two strings are the data I want to insert into the two empty columns.
My question is what is the fastest way to read the information from the first column of the table and update the second and third column.
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As it comes to billion of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation canĀ“t be expressed in SQL.
As the SQL server is on the local machine, the througput is nothing we have to be worried about. One calculation take about 0.02154 seconds, I have a total number of 2.809.475.760 rows this is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all rows to be updated in memory at the same time you probably need to look into MARS to be able to perform streaming reads while writing occasionally at the same time. Or, you can do it with two connections. Be careful to not deadlock across connections. SQL Server cannot detect that by principle. Only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here. Snapshot isolation causes reader to not block or be blocked.
Linq is pretty efficient from my experiences. I wouldn't worry too much about optimizing your code yet. In fact that is typically something you should avoid is prematurely optimizing your code, just get it to work first then refactor as needed. As a side note, I once tested a stored procedure against a Linq query, and Linq won (to my amazement)
There is no simple how and a one-solution-fits all here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network. If your behind a POTS dial-in link the case is massively different when on 10Gb fiber.
The computations? How expensive are they? Just c=a+b or heavy processing of other text files.
Just a couple of questions raised in response. As such there is a lot more involved that we are not aware of to answer correctly.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using a async pattern can free up some of the time for calculations instead of waiting.
EDIT in reply to comment
If calculations take 20ms biggest problem is IO. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation so it's not hampered by write locks and update in batches. My guess is that the reader stays ahead of the writer without much trouble, reading in batches adds complexity without gaining much.
Find the sweet spot for the right batchsize by experimenting.
I have process A that runs every 5 minutes and needs to write something in table "EventLog". This works all day long but at night there is another process B starting that needs to delete a lot of old data from this table. The table has millions of rows (blobs included) and many related tables (deletion by cascade) so process B runs up to ~45 minutes. While process B is running I get a lot of deadlock warnings for process A and I want to get rid of these.
The easy option would be "Don't run process A when process B is running" but there must be a better approach. I am using EntityFramework 6 and TransactionScope in both processes. I didn't find out how to set priority or something like that on my processes. Is this possible?
EDIT:
I forgot to say that I am already using one delete transaction per record, not one transaction for all records. Inside loop I create new DBContext and TransactionScope, so each record has its own transaction. My problem is that deleting a record still takes some time because of the related BLOBs and data in other related tables (lets say about 5 sec. per row). I still get deadlock situations when deleting process (B) crosses with inserting process (A).
Transactions don't have priority. Deadlock victims are chosen by the database, most commonly using things like "work required to roll it back". One way of avoiding a deadlock is to ensure that you block rather than deadlock, by accessing tables in the same order, and by taking locks at the eventual level (for example, taking an UPDLOCK when reading the data, to avoid two queries getting read locks, then one trying to escalate to a write lock). Ultimately, though, this is a tricky area - and something that takes 45M to complete (please tell me that isn't a single transaction!) is always going to cause problems.
Rework process B to not delete it all at once but in smaller batches that never take more than 1 minute. Run those in a loop until all to be deleted is done.
The best way to tackle reading from a single table in SQL Server using multiple threads and make sure not reading the same record twice in different thread using c#
Thank you for your help in advance
Are you trying to read records from the table in parallel to speed up retreiving the data or are you just worried about data corruption with threads accessing the same data?
Database Management Systems like MsSQL handle concurrency extremely well so thread safety in that respect is not something you would have to be concerned with in your code if you have mutiple threads reading the same table.
If you want to read data in parallel without any overlapping you could run a SQL command with paging, and just have each thread fetch a different page. You could have say 20 threads all read 20 different pages at once and it would be guaranteed that they are not reading the same rows. Then you can concatenate the data. The greater the page size the more performance boost you would get from creating the thread.
efficient way to implement paging
Assuming a dependency on SQL Server, you could possibly looking at the SQL Server Service Broker features to provide queuing for you. One thing to keep in mind with that is that currently SQL Server Service Broker isn't available on SQL Azure, so if you had plans on moving onto the Azure cloud that could be a problem.
Anyway - with SQL Server Service Broker the concurrent access is managed at the database engine layer. Another way of doing it is having one thread that reads the database and then dispatches threads with the message as the input. That is slightly easier than trying to use transactions in the database to ensure that messages aren't read twice.
Like I said though, SQL Server Service Broker is probably the way to go. Or a proper external queuing mechanism.
Solution 1:
I am assuming that you are attempting to process or extract data from a large table. If I were assigned this task I would first look at paging . If you are trying to split work among threads that is. So Thread 1 handles pages 0 to 10, Thread 2 handles pages 11 to 20, etc... or you could batch rows using the actual rownumber. So in your stored proc you would do this;
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY <ordering>) AS [row_number],
x, y, z
FROM
table
WHERE
<search-clauses>
) SELECT
*
FROM
result_set
WHERE
[row_number] BETWEEN #IN_Thread_Row_Start AND #IN_Thread_Row_End;
Another choice which would be more efficient is if you have a natural key, or a darn good surrogate, is to page using that and have the thread pass in the key parameters rather than the records it is interested in ( or page numbers ).
Immediate concerns with this solution would be:
ROW_NUMBER performance
CTE Performance (I believe they are stored in memory)
So if this was my problem to resolve I would look at paging via a key.
Solution 2:
The second solution would be to mark the rows as they are processing, virtually locking them, that is if you have data writer permission. So your table would have a field called Processed or Locked, as the rows are selected by your thread, they are updated as Locked = 1;
Then your select from other threads selects only rows that aren't locked. When your process is done and all rows are processed you could reset the lock.
Hard to say what will perform best w.o some trials... GL
This question is super old but still very relevant and I spent a lot of time finding this solution so i thought id post it for anyone else who happens along this. This is very common when using a sql table as a queue rather than msmq.
The solution (after a lot of investigation) is simple and can be tested by opening 2 tabs in ssms with each tab running its own transaction to simulate multiple processes/threads hitting the same table.
The quick answer is this: the key to this is using updlock and readpast hints on your selects.
To illustrate the reads working without duplication check out this simple example.
--on tab 1 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
--on tab 2 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
You will notice that the first selected record is locked and does not get duplicated by the select statement firing on the second tab/process.
Now in the real world you wouldnt just execute a select on your table like the simple example above. You would update your records as "isprocessing=1" or something similar if you are using your table as a queue. The above code just demonstrates that this allows concurrent reads without duplication.
So in the real world (if you are using your table as a queue and processing this queue with multiple services for instance) you would execute your select in a subquery to an update statement most likely.
Something like this.
begin tran
update table_queue set processing= 1 where myId in
(
SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
)
commit tran
You may also combine yoru update statement with an output keyword so you have a list of all ids that are now locked (processing=1) so you can work with them.
if you are processing data using a table as queue this will ensure you will not duplicate records in your select statements without any need for paging or anything else.
This solution is being tested in an enterprise level application where we experienced a lot of duplication in our select statements when being monitored by many services running on many different boxes.
I have a process that is running multi threaded.
Process has a thread safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in sql). If one insert fails, all three fails. Note that the scope of transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one db connection per thread, i'm mainly concerned with the lock levels of tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency of rows per table, except the auto identiy needs to be incremented. If it is a table level lock to increment the identity, then I suppose other threads will have to wait. The inserts may or may not be fast sometimes. If it is going to have to wait, does it make sense to do multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent so it does not serialize access. Inserts may block each other if they insert the same key in an unique index (one of them will also hit a duplicate key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables also there shouldn't be conflicts as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and specially the foreign keys constraints may introduce blocking and possible deadlocks. W/o an exact schema definition is impossible to tell wether you are or are not susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really really really high end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing the load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement is actually contradicting itself. Foreign keys implies reads, since they must be validated during writes.
What makes you think it has to be a table level lock if there is an identity. I don't see that in any of the documentation and I just tested an insert with (rowlock) on a table with an identity column and it works.
To minimize locking take a rowlock. For all the stored procedures update the tables in the same order.
You have inserts into three table taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and getting 100 / second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your insert and if you can sort before inserting it will make a huge difference. Review the need for any other indexes. If you must have other indexes then monitor the fragmentation and defragment.
Related but not the same. I have a dataloader that must parse some data and then load millions of rows a night but not in a transaction. It optimized at 4 parallel process starting with empty tables but the problem was after two hours of loading throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the PK clustered index was on insert order. Dropped any other index that did not yield at least a 50% select bump. On the nightly insert first drop (disable) the indexes and use just two threads. One thread to parse and one to insert. Then I recreate the index at the end of the load. Got 100:1 improvement over 4 threads hammering the indexes. Yes you have a different problem but review your tables. Too often I think indexes are added for small select benefits without considering the hit to insert and update. Also select benefit is often over valued as you build the index and compare and that fresh index has no fragmentation.
Heavy-duty DBMSs like mssql are generally very, very good with handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your TI level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I dont think you need to worry about deadlocks.
Whether it makes sense or not - its always hard to guess that without knowing anything about your system. Its not hard to try it out though, so you can find that out yourself. If I was to guess, I would say it wont help you much if all your threads are gonna be doing is insert rows in a round-robin fashion.
The other threads will wait anyway, your pc cant really execute more threads than the cpu cores you have at every given moment.
You wrote you want to use multi threading to speed up the processing. Im not sure this is something you can take as given/correct automaticly. The level of parallelism and its effects on speed of processing depends on lots of factors, which are very processing - dependant, such as whether theres an IO involved, for example, or if each thread is supposed to do in memory processing only. This is, i think, one of the reasons microsoft offer the task schedulers in their tpl framework, and generally treat the concurency in this library as something that is supposed to be set at runtime.
I think your safest bet is to run test queries / processes to see exactly what happens (though, of course, it still wont be 100% accurate). You can also check out the optimisitc concurrency features of sql server, which allow lock - free work (im not sure how it handles identity columns though)
I am writing an application that logs status updates (GPS locations) from devices to a database. The updates occur at a set interval for each device, which is currently every 3 seconds. I'm using a simple table in SQL Server 08 for storing each update.
I've noticed that running the inserts is an area of slow down in my application. Its not a severe slow down, but noticable. Naturally, I'd like to write to the database in as an efficient way as possible. I have an idea to improve the performance and am looking for input and advice to see if it will help:
The status updates come in from an asynchronous Socket thread. In my current implementation, the database insert call is executed from this thread. I'm thinking I can create a queue for holding update data that the Socket thread can quickly add its update to and then go on its merry way. There would then be a separate thread whose sole responsibility would be checking the update queue and inserting the updates into the database.
Basically this whole process rests on the assumption that writing to the database from one location with a bunch of data all at once is more efficient than writing one row of data at a random time. Is my assumption correct, or way off base? Also, on the SQL side, is there a command to tell it to write a bunch of rows at once that would improve write performance?
This is how the database is being written to:
I'm using LinqToSQL in C#, so for each insert, I first create a DataContext instance. From the DataContext object I then call a stored procedure which inserts the location update.
The table is indexed by datetime, for the time of the update.
Have a look at the SqlBulkCopy class - this allows you to use BCP to insert chunks of data very quickly.
Also, make sure your indexes are efficient. If you have a clustered index on anything that does not increase sequentially (integer, date) then you will suffer performance slowdowns as the pages are filled up.
Have you looked MSMQ ( Microsoft Message Queuing (MSMQ)) ? That seems to me an option to take a look.
Yes, inserting in batches will typically be faster than separate inserts given your description. Each insert will require a connection to be set up and packets to be transferred. If you have a single small insert that takes one packet and you issue three of those, but you alternatively have three inserts that are small enough that they can all fit in one packet then it will help.
Quantifying it is difficult just based on your description - you'll need to do testing for that. For example, if you are keeping a dedicated connection open at all times anyway, as hova suggests, then you might see less of an impact.
Another area you might want to take a look at is whether you are setting up and tearing down a connection for each insert. That alone might make a performance improvement, negating the need for batching.
You'll also want to have as few indexes on the table as possible.
It sounds like a good idea. Why not give it a shot and see how it performs?
On the SQL side you'd want to have a look at making sure you are using parameterized queries.
Also batching your INSERT statements will certainly increase the performance.
Connection management is also key, of course that depends on how the application is built and whether it depends on a connection being there.
Are you not afraid to loose data while are you collecting data to batch copy?
I'm writing application doing the same. At start I will have to write data from 3,5k GPS devices. One device should send data each minute but it can send faster. Destination number of devices is 10,5k.
I'm wondering about inserting performance too. For now I'm saving received data to db on every packet using pure ADO.NET ICommand and stored procedure. On my test serwer (Xeon 3,4GHz and one 1TB hard disk - normal desktop ;) it takes for now 1ms or less.
#GRIMUS - should I wondering if there will be more devices?