I am getting a strange problem. I have a table that has a primary key that is generated in an INSTEAD OF trigger. It's probably the worst way to implement a primary key, but they basically fetch the maximum, increment it by 1, and use this value as the key. This happens in the instead of trigger.
I have a .Net app that starts a RepeatableRead transaction, and I insert a record into this table. This works fine as long as I don't try and do more than one insert simultaneously. If I do, I receive a PK violation error. It seems to me that the inserts both get fired, the trigger fetches the same last number for each transaction, increments it by 1, and then tries to insert both records with the same new 'identifier'.
I've spoken to the DBA that set this trigger up, and he is of the opinion that RepeatableRead should stop this from happening as it apparently locks the table for reads and other transactions will wait for the locks to be released. So, in essence, my transactions will occur in serial.
So, the question is, why would I be getting a PK violation if RepeatableRead works the way the DBA described it?
RepeatableRead won't fix that, because it allows for phantom inserts, which is exactly what you have here. The second insert is wrong because it doesn't "see" the previous insert, and you have the behaviour described.
You can fix it with serializable isolation, or by doing separate transactions (the former will increase contention, the latter reduce it, but the latter may not be possible for you).
Really, the fix is to replace the instead of trigger with a proper IDENTITY column (this can be done on an existing table, though there are difficulties, especially if replication is being used), or failing that, to use a better algorithm to set the new value.
First, let me express my disgust at the approach taken to generate the primary key. As for REPEATABLE READ isolation, it only protects rows that have already been read from being updated or deleted. Other transactions can still read the same maximum value, and there is no protection against inserts either. That is the problem with your implementation.
Ideally, I urge you to restructure the primary key generation. If that is not possible, the only thing left is to use SERIALIZABLE isolation, which protects against inserts as well. However, depending on how and when you determine the next key value, you might not be able to solve it either way.
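As a rough illustration of the "better algorithm" idea mentioned above, here is a minimal sketch in Python using SQLite from the standard library, purely so it can be run anywhere. It is not SQL Server code, and the `key_counter` table and `next_key` function are hypothetical names. The point it tries to show is that incrementing first and reading afterwards, inside one transaction, avoids the "two readers see the same MAX" race. On SQL Server an IDENTITY column remains the simpler choice.

```python
import sqlite3

def next_key(conn: sqlite3.Connection, name: str) -> int:
    """Reserve and return the next key for `name` (hypothetical helper)."""
    with conn:  # commits on success, rolls back on error
        # Write first: the UPDATE takes the write lock, so a second caller
        # cannot read the old value and add 1 to it at the same time.
        cur = conn.execute(
            "UPDATE key_counter SET last_value = last_value + 1 WHERE name = ?",
            (name,),
        )
        if cur.rowcount != 1:
            raise KeyError(name)
        row = conn.execute(
            "SELECT last_value FROM key_counter WHERE name = ?", (name,)
        ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE key_counter (name TEXT PRIMARY KEY, last_value INTEGER NOT NULL)"
)
conn.execute("INSERT INTO key_counter VALUES ('orders', 41)")
conn.commit()

keys = [next_key(conn, "orders") for _ in range(3)]
print(keys)
```

The design choice here is simply "modify, then read" instead of "read, then modify"; whether a counter table is acceptable at all depends on how much contention the table sees.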
Related
I have a WebApi Async controller method that calls another async method that first does a null check to see if a record exists, and if it doesn't add it to database. Problem is if I have say 3 requests come in at the same time all the null checks happen at once in various threads (i'm assuming) and I will get 2 duplicate entries. For example:
public async void DoSomething()
{
    var record = {query that returns record or null}
    if (record == null)
    {
        AddNewRecordToDatabase();
    }
}
... This seems like a very common thing and maybe I'm missing something, but how do I prevent this from happening? I have to purposely try to get it to create duplicates of course, but it is a requirement to not allow it to do so.
Thanks in advance,
Lee
I would solve this by putting unique constraints in the data layer. Assuming your data source is SQL, you can put a unique constraint across the columns you are querying by in "query that returns record or null" and it will prevent these duplicates. The problem with using a lock or a mutex is that it doesn't scale across multiple instances of the service. You should be able to deploy many instances of your service (to different machines), have any of those instances handle requests, and still have consistent behavior. A mutex or lock isn't going to protect you from this concurrency issue in that situation.
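A minimal sketch of the unique-constraint approach, written in Python against SQLite only so that it is self-contained (the original question is about C# and SQL). The `records` table, the `email` column, and `add_if_missing` are assumptions made for illustration. The idea is to skip the separate null check, attempt the insert, and treat a constraint violation as "someone else already added it".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint is what actually prevents duplicates, regardless of
# how many application instances are inserting.
conn.execute("CREATE TABLE records (email TEXT NOT NULL UNIQUE, name TEXT)")

def add_if_missing(conn, email, name):
    """Insert the row; return False if the row already exists."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO records (email, name) VALUES (?, ?)", (email, name)
            )
        return True
    except sqlite3.IntegrityError:
        # Another request won the race; nothing more to do.
        return False

# Three "simultaneous" requests for the same record.
results = [add_if_missing(conn, "lee@example.com", "Lee") for _ in range(3)]
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(results, count)
```

With SQL Server and ADO.NET the same pattern would catch the driver's duplicate-key exception instead of `sqlite3.IntegrityError`.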
I prevent this from happening with async calls by calling a stored procedure instead.
The stored procedure then makes the check, via an "On duplicate key detection" or a similar query for an MSSQL db.
That way, it's merely the order of the async calls that gets to determine which is a create, and which is not.
There are several answers to this, depending on the details and what your team is comfortable with.
The best and most performant answer is to modify your C# code so that instead of calling a CRUD database operation it calls a stored procedure that you write. The stored procedure would check for existence and insert or update only as needed. The specifics are completely under your control, since you write the code.
If you want to stick with ordinary CRUD operations, you can force the database to serialize the requests one after the other by wrapping them in a transaction and using a strict transaction isolation level. On SQL Server you'd want to use serializable. This will prevent any transaction from altering the state of the table in the short time between the part where you check for existence and when you insert the record. See this article for a list of transaction isolation levels and how to apply them in C# code. If you do this there is a risk of deadlock, so you'll need to catch those specific errors and either retry or ignore them.
If your only need is to ensure uniqueness, and the new record has a natural (not surrogate) key, you can add a uniqueness constraint on the key, which will prevent the second insert from succeeding. This solution doesn't work so well with surrogate keys; it doesn't really solve the problem, it just relocates it to the surrogate key generation process. But if you have a decent natural key, this is very easy to implement.
I have a system that I am trying to build that matches users up in real time. Based on specific criteria, the users are matched up 1 to 1. I have the following database tables (kind of like a chat-roulette type system):
Pool
+UserId
+Gender
+City
+Solo (bool)
Matched
+UserId
+PartnerId
When a user enters a certain page, they are added to the Pool table with Solo set to true. The system then searches for another user by querying the Pool table and returns the results where Solo is true (meaning they do not have a partner) and Gender and City are also equal to whatever they query. If a match is returned, both users are put in the Matched table and both of their Solo columns in the Pool table are set to false. If they break the connection, they are deleted from the Matched table and their Solo columns change back to true. I am having trouble architecting this so that it is thread-safe and handles concurrency. Here are some questions that I got stuck on:
-What if 2 users query the Pool table at the same time and both get back the same "solo" user? How do I prevent this?
-What if 1 user queries the Pool before a user's Solo column gets changed, so that user is returned in the result set even though he is technically not solo?
-What other concurrency/thread-safety issues do I face? Is there a better way than this?
A very easy way to solve this is to wrap your algorithm in a transaction, set the isolation level to serializable and retry the whole business operation in case of a deadlock. This should solve all of your concerns in the question.
Making your application deadlock resistant is not easy in such a complex case. I understand you are just getting started with locking and concurrency in the database. This solution, although not perfect, is likely to be enough.
If you require more sophistication you probably need to do some research around pessimistic locking.
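The "retry the whole business operation in case of a deadlock" advice can be sketched roughly as follows, in Python. `DeadlockError` is a hypothetical stand-in for whatever exception your driver raises (SQL Server reports the deadlock victim as error 1205), and `match_users` is a fake operation that fails twice so the retry path can be exercised without a database.

```python
import random
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock-victim error."""

def run_with_retry(operation, attempts=3, base_delay=0.01):
    """Run `operation`; if it is chosen as a deadlock victim, run it again."""
    for attempt in range(1, attempts + 1):
        try:
            # The operation must contain the WHOLE transaction, so that a
            # retry starts from a clean state.
            return operation()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Small randomized back-off so the competing transactions do not
            # collide again in lockstep.
            time.sleep(base_delay * attempt * random.random())

calls = {"n": 0}

def match_users():
    """Fake business operation: deadlocks twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("chosen as deadlock victim")
    return ("alice", "bob")

result = run_with_retry(match_users)
print(result, calls["n"])
```

Only deadlock errors are retried here; any other exception propagates immediately, which is usually what you want.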
I think part of the issue is that the "Solo" field is redundant; it simply indicates whether there is a valid entry in the "Matched" table or not. I'd recommend removing it and instead just joining the "Pool" table to the "Matched" table, so you don't have to worry about issues with keeping the two in sync.
Other concurrency issues can be solved by using a concurrency tracking field. See Handling Concurrency with the Entity Framework in an ASP.NET MVC Application.
Also, just my opinion, but you may want to consider having an "audit" type of table for saving the deleted entries, or switch to using a temporal table for "Matches", rather than simply deleting the entries. That information could be very useful in the future, for tweaking your matching algorithm and such.
Queueing is a good idea. If each request for a match was queued there would be no contention or dead-lock to query the pool.
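A minimal sketch of the queueing idea, in Python with in-memory structures instead of the Pool and Matched tables. Everything here (`match_requests`, `waiting`, `matches`, and matching on identical gender and city) is an assumption for illustration. Because a single worker consumes the queue, no two requests can ever claim the same solo user.

```python
import queue
import threading

match_requests = queue.Queue()
waiting = {}   # (gender, city) -> user_id currently waiting with those criteria
matches = []   # (user_id, partner_id) pairs; only the worker touches these

def matcher():
    """Single consumer: handles one match request at a time."""
    while True:
        item = match_requests.get()
        if item is None:          # sentinel: stop the worker
            break
        user_id, gender, city = item
        partner = waiting.pop((gender, city), None)
        if partner is None:
            waiting[(gender, city)] = user_id    # nobody available yet
        else:
            matches.append((partner, user_id))   # pair them up

worker = threading.Thread(target=matcher)
worker.start()

# Any number of web threads could be putting requests on the queue.
for req in [(1, "F", "Oslo"), (2, "F", "Oslo"), (3, "M", "Oslo"), (4, "F", "Oslo")]:
    match_requests.put(req)
match_requests.put(None)
worker.join()

print(matches, waiting)
```

The trade-off is that matching throughput is bounded by one consumer, and with several application servers the queue would have to live somewhere shared.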
I have a process that is running multi threaded.
Process has a thread safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in SQL). If one insert fails, all three fail. Note that the scope of the transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one db connection per thread; I'm mainly concerned with the lock levels of tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency of rows per table, except that the auto identity needs to be incremented. If it takes a table-level lock to increment the identity, then I suppose other threads will have to wait. The inserts may or may not be fast. If they are going to have to wait, does it make sense to do multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent so it does not serialize access. Inserts may block each other if they insert the same key in a unique index (one of them will also hit a duplicate key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables there also shouldn't be conflicts as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially foreign key constraints may introduce blocking and possible deadlocks. Without an exact schema definition it is impossible to tell whether you are or are not susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really really really high end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing the load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement actually contradicts itself. Foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity? I don't see that in any of the documentation, and I just tested an insert with (rowlock) on a table with an identity column and it works.
To minimize locking take a rowlock. For all the stored procedures update the tables in the same order.
You have inserts into three tables taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and I am getting 100 / second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your insert and if you can sort before inserting it will make a huge difference. Review the need for any other indexes. If you must have other indexes then monitor the fragmentation and defragment.
Related but not the same. I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was optimal at 4 parallel processes starting with empty tables, but the problem was that after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the PK clustered index was on insert order. I dropped any other index that did not yield at least a 50% select bump. On the nightly insert I first drop (disable) the indexes and use just two threads: one thread to parse and one to insert. Then I recreate the indexes at the end of the load. I got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often I think indexes are added for small select benefits without considering the hit to insert and update. Also, the select benefit is often overvalued because you build the index and compare right away, and that fresh index has no fragmentation.
Heavy-duty DBMSs like MSSQL are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not: it's always hard to guess that without knowing anything about your system. It's not hard to try it out though, so you can find that out yourself. If I were to guess, I would say it won't help you much if all your threads are going to do is insert rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads than it has CPU cores at any given moment.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can automatically take as given. The level of parallelism and its effect on processing speed depend on lots of factors that are very processing-dependent, such as whether there is I/O involved or whether each thread is supposed to do in-memory processing only. This is, I think, one of the reasons Microsoft offers the task schedulers in their TPL framework, and generally treats the concurrency in this library as something that is supposed to be set at runtime.
I think your safest bet is to run test queries / processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how it handles identity columns though).
Suppose I have to insert a record into table A, and one of the fields in that table references a record in another table B. How can I make sure that, until I commit the insert statement, the record in table B referenced by the record being inserted into table A is not tampered with?
I am thinking of including both tables in a transaction and locking all the records involved in the transaction, but that may hurt concurrency, so I need your recommendation.
Thank you,
Note that even with a transaction, you'll need to get the isolation level right. The most paranoid (and hence the most accurate) is "serializable", which takes out locks (even range locks) when you read data, so that other spids can't play with it.
If you want to make your changes to the two tables become a single atomic action then they should be performed in a single transaction. It's relatively simple in .NET: you just need to use the BeginTransaction method on a SqlConnection to create a new transaction, and then you make your SqlCommands etc. work against the transaction rather than the connection. You can also do it using a TransactionScope, but you may have issues with MSDTC.
I wouldn't be too concerned about concurrency issues when using transactions. I would shy away from trying to deal with locking issues yourself; I would start with just making updates atomic and maintaining data integrity.
You don't need a transaction for this, just a foreign key relationship. A relationship from table A, field B_FK, referencing table B's primary key will prevent the creation of the table A row if the corresponding table B row does not exist.
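To make the "single atomic action" point concrete, here is a minimal sketch in Python with SQLite (not ADO.NET; the `a` and `b` tables, the `status` and `note` columns, and `insert_a_and_update_b` are assumptions for illustration). Both statements run in one transaction, so if the second fails, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER NOT NULL, note TEXT NOT NULL);
    INSERT INTO b VALUES (1, 'open');
""")

def insert_a_and_update_b(conn, b_id, note):
    """Change both tables in one transaction: all or nothing."""
    with conn:  # commit if both statements succeed, roll back otherwise
        conn.execute("UPDATE b SET status = 'referenced' WHERE id = ?", (b_id,))
        conn.execute("INSERT INTO a (b_id, note) VALUES (?, ?)", (b_id, note))

# First attempt violates NOT NULL on a.note, so the UPDATE on b is undone too.
try:
    insert_a_and_update_b(conn, 1, None)
except sqlite3.IntegrityError:
    pass
status_after_failure = conn.execute("SELECT status FROM b WHERE id = 1").fetchone()[0]

# Second attempt succeeds and both changes become visible together.
insert_a_and_update_b(conn, 1, "first")
status_after_success = conn.execute("SELECT status FROM b WHERE id = 1").fetchone()[0]
rows_in_a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(status_after_failure, status_after_success, rows_in_a)
```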
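A minimal sketch of the foreign key behaviour, in Python with SQLite (an assumption made for runnability; the `a`/`b` tables and the `try_sql` helper are illustrative). It shows the constraint rejecting both an insert that references a missing row in B and a delete of a row in B that is still referenced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, "
    "b_fk INTEGER NOT NULL REFERENCES b(id))"
)
conn.execute("INSERT INTO b (id) VALUES (1)")
conn.execute("INSERT INTO a (b_fk) VALUES (1)")   # valid: b row 1 exists
conn.commit()

def try_sql(sql):
    """Run one statement; report whether the database accepted it."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

orphan_insert = try_sql("INSERT INTO a (b_fk) VALUES (99)")  # no such b row
parent_delete = try_sql("DELETE FROM b WHERE id = 1")        # still referenced by a
print(orphan_insert, parent_delete)
```

Note that the constraint guards existence only; it does not stop another session from modifying other columns of the referenced row.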
If by "tampered with" you mean deleted, then John's right, a foreign-key relationship is probably what you want. But, if you mean modified, then a transaction is the only way to go. Yes, it'll mean that you have a potential bottleneck, but there's no way to avoid that if you want your operation to be "atomic". To avoid any noticeable performance degradation, you'll want to keep the lifetime of the transaction to the bare minimum.
Since you're using C# (and presumably ADO.NET), you could use the transaction features built into the framework. However, it's better to have the transaction handled by the database server since this means that the transaction can be started, completed & committed in a single request (see above re transaction lifetime).
I have an ASP.NET C# business webapp that is used internally. One issue we are running into as we've grown is that the original design did not account for concurrency checking - so now multiple users are accessing the same data and overwriting other users' changes. So my question is - for webapps do people usually use a pessimistic or optimistic concurrency system? What drives the preference to use one over another and what are some of the design considerations to take into account?
I'm currently leaning towards an optimistic concurrency check since it seems more forgiving, but I'm concerned about the potential for multiple changes being made that would be in contradiction to each other.
Thanks!
Optimistic locking.
Pessimistic is harder to implement and will give problems in a web environment. What action will release the lock, closing the browser? Leaving the session to time out? What about if they then do save their changes?
You don't specify which database you are using. MS SQL Server has a timestamp datatype. It has nothing to do with time though. It is merely a number that will get changed each time the row gets updated. You don't have to do anything to make sure it gets changed, you just need to check it. You can achieve something similar by using a last-modified date/time as @KM suggests. But this means you have to remember to change it each time you update the row. If you use datetime you need to use a data type with sufficient precision to ensure that you can't end up with the value not changing when it should. For example, someone saves a row, then someone reads it, then another save happens but leaves the modified date/time unchanged. I would use timestamp unless there was a requirement to track the last modified date on records.
To check it you can do as @KM suggests and include it in the update statement's where clause. Or you can begin a transaction, check the timestamp, do the update if all is well, then commit the transaction; if not, return a failure code or error.
Holding transactions open (as suggested by @le dorfier) is similar to pessimistic locking, but the amount of data locked may be more than a row. Depending on the RDBMS, locks may be taken at the page level rather than the row level. You will also run into the same issues as with pessimistic locking.
You mention in your question that you are worried about conflicting updates. That is surely what the locking will prevent. Both optimistic and pessimistic locking will, when properly implemented, prevent exactly that.
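The version check in the update's where clause can be sketched as follows, in Python with SQLite. This is an illustration under assumptions: the `docs` table and the `load`/`save` helpers are made up, and the `version` column is incremented by hand here, whereas SQL Server's timestamp type changes on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO docs VALUES (1, 'original', 1)")
conn.commit()

def load(conn, doc_id):
    """Return (body, version) as the user saw it when opening the form."""
    return conn.execute(
        "SELECT body, version FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()

def save(conn, doc_id, new_body, loaded_version):
    """Update only if nobody changed the row since it was loaded."""
    with conn:
        cur = conn.execute(
            "UPDATE docs SET body = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_body, doc_id, loaded_version),
        )
    return cur.rowcount == 1   # 0 rows means someone else saved first

# Two users load the same row, then both try to save.
_, v_alice = load(conn, 1)
_, v_bob = load(conn, 1)
alice_saved = save(conn, 1, "alice's edit", v_alice)
bob_saved = save(conn, 1, "bob's edit", v_bob)
final_body = load(conn, 1)[0]
print(alice_saved, bob_saved, final_body)
```

When `save` returns False, the application would typically reload the row and tell the user that someone else has already changed it.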
I agree with the first answer above, we try to use optimistic locking when the chance of collisions is fairly low. This can be easily implemented with a LastModifiedDate column or incrementing a Version column. If you are unsure about frequency of collisions, log occurrences somewhere so you can keep an eye on them. If your records are always in "edit" mode, having separate "view" and "edit" modes could help reduce collisions (assuming you reload data when entering edit mode).
If collisions are still high, pessimistic locking is more difficult to implement in web apps, but definitely possible. We have had good success with "leasing" records (locking with a timeout)... similar to that 2 minute warning you get when you buy tickets on TicketMaster. When a user goes into edit mode, we put a record into the "lock" table with a timeout of N minutes. Other users will see a message if they try to edit a record with an active lock. You could also implement a keep-alive for long forms by renewing the lease on any postback of the page, or even with an ajax timer. There is also no reason why you couldn't back this up with a standard optimistic lock mentioned above.
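A minimal sketch of the "leasing" idea described above, in Python with SQLite. The `record_locks` table, the `acquire_lease` function, and the 120-second default are assumptions for illustration, and the current time is passed in explicitly so the behaviour is easy to follow. Re-acquiring a lease you already hold acts as the keep-alive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record_locks ("
    "record_id INTEGER PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
)

def acquire_lease(conn, record_id, owner, now, ttl_seconds=120):
    """Take or renew the edit lease; return False if someone else holds it."""
    try:
        with conn:
            # Clear the slot if the old lease has expired or is our own.
            conn.execute(
                "DELETE FROM record_locks "
                "WHERE record_id = ? AND (expires_at <= ? OR owner = ?)",
                (record_id, now, owner),
            )
            # The primary key makes this fail if another owner's lease remains.
            conn.execute(
                "INSERT INTO record_locks VALUES (?, ?, ?)",
                (record_id, owner, now + ttl_seconds),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ann_first = acquire_lease(conn, 7, "ann", now=0)      # ann starts editing
ben_early = acquire_lease(conn, 7, "ben", now=60)     # lease still active
ann_renew = acquire_lease(conn, 7, "ann", now=100)    # keep-alive, now expires at 220
ben_retry = acquire_lease(conn, 7, "ben", now=130)    # still within renewed lease
ben_late = acquire_lease(conn, 7, "ben", now=221)     # lease has expired
print(ann_first, ben_early, ann_renew, ben_retry, ben_late)
```

In a real application `now` would come from a single trusted clock (ideally the database server's), so that web servers with skewed clocks do not disagree about expiry.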
Many apps will need a combination of both approaches.
Here's a simple solution for many people working on the same records.
When you load the data, get the last changed date; we use LastChgDate on our tables.
When you save (update) the data, add "AND LastChgDate=previouslyLoadedLastChgDate" to the where clause. If the row count is 0 after the update, raise an error such as "someone else has already saved this data" and roll back everything; otherwise the data is saved.
I generally apply the above logic to header tables only and not to the detail tables, since they are all updated in one transaction.
I assume you're experiencing the 'lost update' problem.
To counter this as a rule of thumb I use pessimistic locking when the chances of a collision are high (or transactions are short lived) and optimistic locking when the chances of a collision are low (or transactions are long lived, or your business rules encompass multiple transactions).
You really need to see what applies to your situation and make a judgment call.