I have two servers that run the same query, checking for specific values in a single shared DB. If the query finds the values, it will alter them. At the same time, the other server might run the same query, and there will be some kind of conflict when both try to alter the same information.
Question: How can I best ensure that the servers won't run their query at the same time, and guarantee that they won't run into conflicts?
Databases take care of this for you automatically. They use locks to make sure only one query accesses specific data at a time. These locks don't have to apply to whole tables; depending on the query and transaction type, row-level locks are also possible. When you have two queries that should be grouped together, such as your select and update, transactions make sure the locks from the first query are not released until both queries have finished.
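For example, a minimal sketch of that select-then-update pattern in one transaction (the dbo.Items table and Status column are hypothetical, just to illustrate):

BEGIN TRANSACTION;

-- UPDLOCK/HOLDLOCK keep the selected rows locked until the transaction ends,
-- so the second server blocks here instead of reading the same rows
SELECT Id, Status
FROM dbo.Items WITH (UPDLOCK, HOLDLOCK)
WHERE Status = 'pending';

-- alter the values found by the select; the locks are only released at COMMIT
UPDATE dbo.Items
SET Status = 'claimed'
WHERE Status = 'pending';

COMMIT TRANSACTION;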
Generally, databases are meant to serve queries (and release their locks) quickly, so two queries that arrive at about the same time are processed in sequence with little to no observable delay to the end user. Locks can cause problems for queries that need to lock a lot of data, that need to run for a long time, or when two transactions start by locking unrelated data but later each needs data locked by the other. That last case is called a deadlock.
Problems with locks can be controlled by adjusting transaction isolation levels. However, it's usually a mistake to go messing with isolation levels. Most of the time the defaults will do what you need, and changing isolation levels without fully understanding what you're doing can make the situation worse, as well as allow queries to return stale or wrong data.
Transactions and isolation levels are your friends here. You need to set the isolation level so that the two servers' queries won't interfere with each other.
Refer to https://msdn.microsoft.com/en-gb/library/ms173763.aspx for guidance on the level you need to set.
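For example, a minimal sketch of setting a session-level isolation level in T-SQL (SERIALIZABLE is just one possible choice; pick the level from the linked documentation that fits your case):

-- applies to the current connection/session only
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;
-- run the select that checks for the values, then the update that alters them
COMMIT TRANSACTION;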
You could add an extra column to the table, e.g. server_id, and write the queries as:
select * from your_table where server_id = 1 --for the first server
select * from your_table where server_id = 2 --for the second server
I have a lot of rows (300k+) to upsert into a SQL Server database in the shortest possible period of time, so the idea was to use parallelization: partition the data and use async to pump it into SQL, X threads at a time, 100 rows per context, with the context being recycled to minimize tracking overhead. However, that means more than one connection is used in parallel, and thus CommittableTransaction/TransactionScope would escalate to a distributed transaction, which causes the parallelized transaction enlistment to throw the infamous "This platform does not support distributed transactions." exception.
I do need the ability to commit/rollback the entire set of upserts. It's part of the batch upload process, and any error should roll back the changes to the previously working/stable condition, application-wise.
What are my options, short of using one connection and no parallelization?
Note: the problem is not as simple as a batch of insert commands; if that were the case, I would just generate the inserts and run them on the server as a query, or indeed use SqlBulkCopy. About half of them are updates and half are inserts where new keys are generated by SQL Server, which need to be obtained and re-keyed onto the child objects that are inserted next; the rows are spread over about a dozen tables in a 3-level hierarchy.
Nope. Totally wrong approach. Do NOT use EF for that - bulk insert ETL is not what object-relational mappers are made for, and a lot of their design decisions are not productive for it. You would not use a small car instead of a truck to transport 20 tons of goods either.
300k rows are trivial if you use the SqlBulkCopy API in some form.
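As a rough sketch of how that usually looks for a mixed insert/update load (all table and column names below are made up for illustration): bulk copy the batch into a staging table with SqlBulkCopy, then upsert into the real table with MERGE and capture the identity values SQL Server generates, so the child rows can be re-keyed before their own load:

-- 1) SqlBulkCopy the batch from the client into a staging table (dbo.ParentStaging).
-- 2) Upsert into the real table and capture the generated identity values:
DECLARE @KeyMap TABLE (StagingId int, ParentId int);

MERGE dbo.Parent AS t
USING dbo.ParentStaging AS s
    ON t.NaturalKey = s.NaturalKey
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name
WHEN NOT MATCHED THEN
    INSERT (NaturalKey, Name) VALUES (s.NaturalKey, s.Name)
OUTPUT s.StagingId, inserted.ParentId INTO @KeyMap (StagingId, ParentId);

-- 3) Join @KeyMap back to the staged child rows to assign their foreign keys,
--    then repeat the pattern one level down the hierarchy.

The whole thing can run on a single connection inside a single transaction, so an error rolls back the entire batch without any distributed transaction.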
I have Web API service that matches users according to some conditions.
A user must not get more than one match in a specified interval.
To create a match I'm using an SQL query (the database is SQL Server):
SELECT TOP 1 *
FROM table_name
WHERE last_match >= DATEADD(hour, -3, GETDATE()) and **some_other_conditions**
ORDER BY last_match desc;
As there are multiple calls to this service some_other_conditions might be identical for different users and the only thing that will differentiate the result is the last_match column.
The problem is that while user A is making this call, the last_match column is not updated yet, and so other users might get the same match.
How can I avoid or solve this issue?
You probably can do that in SQL Server, although the database is definitely not the best place to implement such things. The query needs to be tweaked a bit by adding several hints, such as:
SELECT TOP 1 *
FROM table_name with (xlock, holdlock, rowlock, readpast)
WHERE last_match >= DATEADD(hour, -3, GETDATE()) and **some_other_conditions**
ORDER BY last_match desc;
The first three hints ensure that the row selected by one connection will remain inaccessible to others until its transaction is committed. The readpast hint allows the query to skip rows locked by other transactions and keep going, looking for a suitable match, without waiting for the locks to be released.
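Roughly, the whole unit of work might look like this (the id column is hypothetical, and updating last_match is assumed to be what marks the match as taken):

BEGIN TRANSACTION;

DECLARE @matchId int;

-- lock a candidate row; readpast skips rows already locked by other callers
SELECT TOP 1 @matchId = id
FROM table_name with (xlock, holdlock, rowlock, readpast)
WHERE last_match >= DATEADD(hour, -3, GETDATE()) and **some_other_conditions**
ORDER BY last_match desc;

-- record the match so the row no longer qualifies for other users
UPDATE table_name SET last_match = GETDATE() WHERE id = @matchId;

COMMIT TRANSACTION;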
A couple of caveats that you need to keep in mind:
The entire unit of work has to be performed within a single client-side transaction, using the same database connection. Make sure that you do not have more than 1 connection per application thread, and that connections are not shared between threads;
The readpast hint allows the query to skip row-level locks, but it doesn't skip page locks. You will have to somehow make sure that no page locks, let alone partition or table locks, are placed on this table.
If you have any appreciable number of simultaneous connections, you can easily exhaust the lock buffer, which is not configurable in SQL Server. Thankfully, the Database Engine can grow it dynamically when necessary, but you can still run into unexpected situations where excessive lock contention results in a noticeable performance drop.
If your database has RCSI enabled, you will need to add the readcommittedlock hint to your query. Without it, the locks won't mean anything, as other connections will be able to read "previous versions" of selected matches (which, in this case, would contain the same data as the row selected by another thread).
If you are going to pursue this approach, make sure to carry out a full-scale stress test before rolling it into production.
There is something that worries me about my application. I have a SQL query that does a bunch of inserts into the database across various tables. I timed how long it takes to complete the process: about 1.5 seconds. At this point I'm not even done developing the query; I still have more inserts to add. So I fully expect this process to take even longer, perhaps up to 3 seconds.
Now, it is important that all of this data be consistent and finish either completely or not at all. So what I'm wondering is: is it OK for a transaction to take that long? Doesn't it lock up the tables, so selects, inserts, updates, etc. cannot run until the transaction is finished? My concern is that if this query is run frequently, it could lock up the entire application so that certain parts of it become either incredibly slow or unusable. With a low user base, I doubt this would be an issue, but if my application gains some traction, this query could potentially run a lot.
Should I be concerned about this, or am I missing something and the database won't act the way I'm thinking? I'm using a SQL Server 2014 database.
To note, I timed this using the C# Stopwatch object, started immediately before the transaction begins and stopped right after the changes are committed. So it's about as accurate as can be.
You're right to be concerned about this: a transaction will lock the rows it has written until the transaction commits, which can certainly cause problems such as deadlocks and temporary blocking that slows the system's response. But there are various factors that determine the potential impact.
For example, you probably largely don't need to worry if your users are only updating and querying their own data, and your tables have indexing to support both read and write query criteria. That way each user's row locking will largely not affect the other users--depending on how you write your code of course.
If your users share data, and you want to be able to support efficient searching across multiple user's data even with multiple concurrent updates for example, then you may need to do more.
Some general concepts:
-- Ensure your transactions write to tables in the same order
-- Keep your transactions as short as possible by preparing the data to be written as much as possible before starting the transaction.
-- If this is a new system (and even if not new), definitely consider enabling Snapshot Isolation and/or Read Committed Snapshot Isolation on the database. SI will (when explicitly set on the session) allow your read queries not to be blocked by concurrent writes. RCSI will, by default, allow all your read queries not to be blocked by concurrent writes. But read this to understand both the benefits and gotchas of both isolation levels: https://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
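For reference, a minimal sketch of turning these on (replace YourDb with your database name; switching READ_COMMITTED_SNAPSHOT on requires that no other connections are active in the database):

-- allow sessions to explicitly request SNAPSHOT isolation
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- make READ COMMITTED use row versioning by default for the whole database (RCSI)
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;

-- with SI enabled, a session opts in explicitly:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;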
I think it depends on your code: how effectively you use loops, your select queries, and the other statements.
What is the best way to read from a single table in SQL Server using multiple threads, making sure the same record is not read twice in different threads, using C#?
Thank you in advance for your help.
Are you trying to read records from the table in parallel to speed up retrieving the data, or are you just worried about data corruption with threads accessing the same data?
Database management systems like MS SQL Server handle concurrency extremely well, so thread safety in that respect is not something you would have to be concerned with in your code if you have multiple threads reading the same table.
If you want to read data in parallel without any overlap, you could run a SQL command with paging and have each thread fetch a different page. You could have, say, 20 threads each read a different page at once, and it would be guaranteed that they are not reading the same rows. Then you can concatenate the data. The greater the page size, the more of a performance boost you get from creating each thread.
See also: efficient way to implement paging
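As a rough illustration of the paging idea above (SQL Server 2012+ OFFSET/FETCH syntax; the table, key column and page size are made up, and each thread would pass its own page number):

DECLARE @PageNumber int = 3;    -- each thread gets a different page number
DECLARE @PageSize   int = 500;

SELECT Id, Payload
FROM dbo.WorkItems
ORDER BY Id                                 -- a stable ordering is required for paging
OFFSET (@PageNumber * @PageSize) ROWS
FETCH NEXT @PageSize ROWS ONLY;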
Assuming a dependency on SQL Server, you could look at the SQL Server Service Broker features to provide queuing for you. One thing to keep in mind is that SQL Server Service Broker currently isn't available on SQL Azure, so if you had plans to move onto the Azure cloud, that could be a problem.
Anyway - with SQL Server Service Broker the concurrent access is managed at the database engine layer. Another way of doing it is to have one thread that reads the database and then dispatches worker threads with the message as their input. That is slightly easier than trying to use transactions in the database to ensure that messages aren't read twice.
Like I said though, SQL Server Service Broker is probably the way to go. Or a proper external queuing mechanism.
Solution 1:
I am assuming that you are attempting to process or extract data from a large table. If I were assigned this task I would first look at paging - that is, if you are trying to split work among threads. So Thread 1 handles pages 0 to 10, Thread 2 handles pages 11 to 20, etc., or you could batch rows using the actual row number. So in your stored proc you would do this:
WITH result_set AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY <ordering>) AS [row_number],
        x, y, z
    FROM
        table
    WHERE
        <search-clauses>
)
SELECT *
FROM result_set
WHERE [row_number] BETWEEN @IN_Thread_Row_Start AND @IN_Thread_Row_End;
Another, more efficient choice, if you have a natural key or a darn good surrogate, is to page using that key and have each thread pass in key parameters rather than page numbers.
Immediate concerns with this solution would be:
ROW_NUMBER performance
CTE Performance (I believe they are stored in memory)
So if this was my problem to resolve I would look at paging via a key.
Solution 2:
The second solution would be to mark the rows as they are being processed, virtually locking them (that is, if you have data writer permission). So your table would have a field called Processed or Locked, and as the rows are selected by your thread, they are updated with Locked = 1.
Then the selects from your other threads pick only rows that aren't locked. When your process is done and all rows are processed, you can reset the lock.
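A minimal sketch of that claim-then-process pattern, assuming a hypothetical WorkQueue table with Id and Locked columns:

-- atomically claim up to 10 unclaimed rows and return their ids to the caller
UPDATE TOP (10) dbo.WorkQueue
SET Locked = 1
OUTPUT inserted.Id
WHERE Locked = 0;

-- when processing is finished (or abandoned), reset the flag for those ids:
-- UPDATE dbo.WorkQueue SET Locked = 0 WHERE Id IN (...);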
Hard to say what will perform best without some trials... GL
This question is super old but still very relevant, and I spent a lot of time finding this solution, so I thought I'd post it for anyone else who happens along it. This is very common when using a SQL table as a queue rather than MSMQ.
The solution (after a lot of investigation) is simple and can be tested by opening 2 tabs in SSMS, with each tab running its own transaction to simulate multiple processes/threads hitting the same table.
The quick answer is this: the key to this is using updlock and readpast hints on your selects.
To illustrate the reads working without duplication check out this simple example.
--on tab 1 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
--on tab 2 in ssms
begin tran
SELECT TOP 1 ordno FROM table_queue WITH (updlock, readpast)
You will notice that the first selected record is locked and does not get duplicated by the select statement firing on the second tab/process.
Now in the real world you wouldn't just execute a select on your table like the simple example above. You would update your records as "isprocessing=1" or something similar if you are using your table as a queue. The above code just demonstrates that this allows concurrent reads without duplication.
So in the real world (if you are using your table as a queue and processing it with multiple services, for instance) you would most likely execute your select in a subquery to an update statement.
Something like this.
begin tran
update table_queue set processing = 1 where myId in
(
SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
)
commit tran
You may also combine your update statement with the OUTPUT clause so you get a list of all the ids that are now locked (processing = 1) and can work with them.
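For example, the same claim query with OUTPUT added, assuming a processing flag that starts at 0 for unclaimed rows:

begin tran
update table_queue set processing = 1
output inserted.myId        -- returns the ids that were just claimed
where myId in
(
    SELECT TOP 50 myId FROM table_queue WITH (updlock, readpast)
    WHERE processing = 0    -- only consider rows that haven't been claimed yet
)
commit tran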
If you are processing data using a table as a queue, this will ensure you do not get duplicate records in your select statements, without any need for paging or anything else.
This solution is being tested in an enterprise-level application where we experienced a lot of duplication in our select statements when the queue was being monitored by many services running on many different boxes.
I have a process that is running multi threaded.
Process has a thread safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in sql). If one insert fails, all three fails. Note that the scope of transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one DB connection per thread; I'm mainly concerned with the lock levels on the tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency between rows per table, except that the auto identity needs to be incremented. If it is a table-level lock to increment the identity, then I suppose the other threads will have to wait. The inserts may or may not be fast. If they are going to have to wait, does it make sense to do multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent, so it does not serialize access. Inserts may block each other if they insert the same key into a unique index (one of them will also hit a duplicate key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables, there shouldn't be conflicts either, as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially the foreign key constraints may introduce blocking and possible deadlocks. Without an exact schema definition it is impossible to tell whether you are or are not susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really really really high end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing the load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement actually contradicts itself: foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity column? I don't see that in any of the documentation, and I just tested an insert with (rowlock) on a table with an identity column and it works.
To minimize locking take a rowlock. For all the stored procedures update the tables in the same order.
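For what it's worth, the hint goes on the target table of the insert, e.g. (table and columns are hypothetical):

INSERT INTO dbo.Orders WITH (ROWLOCK) (CustomerId, Amount)
VALUES (42, 19.99);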
You have inserts into three tables taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and I'm getting 100 per second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your insert and if you can sort before inserting it will make a huge difference. Review the need for any other indexes. If you must have other indexes then monitor the fragmentation and defragment.
Related but not the same: I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was optimized at 4 parallel processes starting with empty tables, but the problem was that after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the clustered PK index matched insert order, and dropped any other index that did not yield at least a 50% select improvement. On the nightly insert I first drop (disable) the indexes and use just two threads, one to parse and one to insert, then recreate the indexes at the end of the load. That got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often indexes are added for small select benefits without considering the hit to inserts and updates. Also, the select benefit is often overvalued, because you build the index and compare while that fresh index has no fragmentation.
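The disable/rebuild step around that nightly load looks roughly like this (index and table names are placeholders):

-- before the load: disable nonclustered indexes so inserts only maintain the clustered index
ALTER INDEX IX_SomeNonclustered ON dbo.BigTable DISABLE;

-- ... run the two-thread parse/insert load here ...

-- after the load: rebuild, which also re-enables the disabled indexes
ALTER INDEX ALL ON dbo.BigTable REBUILD;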
Heavy-duty DBMSs like MSSQL are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not is always hard to guess without knowing anything about your system. It's not hard to try it out, though, so you can find that out yourself. If I had to guess, I would say it won't help you much if all your threads are going to be doing is inserting rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads than the number of CPU cores it has at any given moment.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can take as given/correct automatically. The level of parallelism and its effect on processing speed depends on lots of factors, which are very processing-dependent, such as whether there is I/O involved, or whether each thread is supposed to do in-memory processing only. This is, I think, one of the reasons Microsoft offers the task schedulers in its TPL framework, and generally treats the concurrency in that library as something that is supposed to be set at runtime.
I think your safest bet is to run test queries/processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how it handles identity columns though).