I have a WebApi async controller method that calls another async method, which first does a null check to see if a record exists and, if it doesn't, adds it to the database. The problem is that if, say, 3 requests come in at the same time, all the null checks happen at once on various threads (I'm assuming) and I end up with 2 duplicate entries. For example:
public async Task DoSomething() // async Task rather than async void, so callers can await it
{
    // QueryRecordAsync stands in for the {query that returns record or null}
    var record = await QueryRecordAsync();
    if (record == null)
    {
        AddNewRecordToDatabase();
    }
}
... This seems like a very common situation, and maybe I'm missing something, but how do I prevent it from happening? I have to deliberately try to get it to create duplicates, of course, but it is a requirement that the system never allow them.
Thanks in advance,
Lee
I would solve this by putting unique constraints in the data layer. Assuming your data source is SQL, you can put a unique constraint across the columns you are querying by with "query that returns record or null", and it will prevent these duplicates. The problem with using a lock or a mutex is that it doesn't scale across multiple instances of the service. You should be able to deploy many instances of your service (to different machines), have any of those instances handle requests, and still have consistent behavior. A mutex or lock isn't going to protect you from this concurrency issue in this situation.
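With the constraint in place, the insert can simply be attempted and the duplicate-key error treated as "the record already exists". A minimal sketch, assuming SQL Server via System.Data.SqlClient; AddNewRecordToDatabaseAsync stands in for the question's insert, assumed awaitable:

using System.Data.SqlClient;
using System.Threading.Tasks;

public async Task AddIfMissingAsync()
{
    try
    {
        // The insert from the question; the unique constraint is the real guard.
        await AddNewRecordToDatabaseAsync();
    }
    catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
    {
        // 2627 = unique constraint violation, 2601 = duplicate key in a unique index.
        // Another request inserted the record first; treat it as "already exists".
    }
}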
I prevent this from happening with async calls by calling a stored procedure instead.
The stored procedure then makes the check itself, via "on duplicate key" detection or a similar construct for an MSSQL db.
That way, it's merely the order of the async calls that determines which call is a create and which is not.
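The calling side then reduces to a plain stored-procedure invocation. A sketch, where dbo.EnsureRecordExists, @Name, connectionString, and name are all placeholders for your own names:

using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.EnsureRecordExists", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@Name", name);
    await conn.OpenAsync();
    await cmd.ExecuteNonQueryAsync(); // the proc checks and inserts atomically
}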
There are several answers to this, depending on the details and what your team is comfortable with.
The best and most performant answer is to modify your C# code so that instead of calling a CRUD database operation it calls a stored procedure that you write. The stored procedure would check for existence and insert or update only as needed. The specifics are completely under your control, since you write the code.
If you want to stick with ordinary CRUD operations, you can force the database to serialize the requests one after the other by wrapping them in a transaction and using a strict transaction isolation level. On SQL Server you'd want to use serializable. This will prevent any transaction from altering the state of the table in the short time between the part where you check for existence and when you insert the record. See this article for a list of transaction isolation levels and how to apply them in C# code. If you do this there is a risk of deadlock, so you'll need to catch and swallow those specific errors.
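A rough sketch of that pattern with System.Transactions, assuming SQL Server; QueryRecordAsync and AddNewRecordToDatabaseAsync stand in for the question's check and insert, and the retry limit is arbitrary:

using System.Data.SqlClient;
using System.Transactions; // note: this IsolationLevel is System.Transactions', not System.Data's

var options = new TransactionOptions { IsolationLevel = IsolationLevel.Serializable };
for (var attempt = 0; ; attempt++)
{
    try
    {
        using (var scope = new TransactionScope(TransactionScopeOption.Required, options,
                                                TransactionScopeAsyncFlowOption.Enabled))
        {
            var record = await QueryRecordAsync();   // the question's existence check
            if (record == null)
                await AddNewRecordToDatabaseAsync(); // the question's insert
            scope.Complete();
        }
        break;
    }
    catch (SqlException ex) when (ex.Number == 1205 && attempt < 3)
    {
        // 1205 = chosen as deadlock victim; swallow and retry the whole transaction.
    }
}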
If your only need is to ensure uniqueness, and the new record has a natural (not surrogate) key, you can add a uniqueness constraint on the key, which will prevent the second insert from succeeding. This solution doesn't work so well with surrogate keys; it doesn't really solve the problem, it just relocates it to the surrogate key generation process. But if you have a decent natural key, this is very easy to implement.
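Adding such a constraint is a one-time DDL change, along these lines (the table, columns, and constraint name are invented for illustration):

using System.Data.SqlClient;

// One-time DDL, e.g. in a migration.
const string ddl =
    @"ALTER TABLE dbo.Records
      ADD CONSTRAINT UX_Records_NaturalKey UNIQUE (CustomerId, OrderDate);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(ddl, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery(); // a second concurrent insert now fails instead of duplicating
}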
Related
I am trying to build a system that matches users up in real time. Based on specific criteria, the users are matched up 1-to-1. I have the following database tables (it's kind of like a chat-roulette type system):
Pool
+UserId
+Gender
+City
+Solo (bool)
Matched
+UserId
+PartnerId
When a user enters a certain page, they are added to the Pool table with Solo set to true. The system then searches for another user by querying the Pool table, returning rows where Solo is true (meaning they do not have a partner) and Gender and City equal whatever was requested. If a match is returned, both users are put in the Matched table and their Solo columns in the Pool table are set to false. If they break the connection, they are deleted from the Matched table and their Solo columns change back to true. I am having trouble architecting this so that it is thread-safe and handles concurrency. Here are some questions that I got stuck on:
-What if 2 users query the Pool table at the same time and both get back the same "solo" user? How do I prevent this?
-What if 1 user queries the Pool before another user's Solo column gets changed, so that user is returned in the result set even though he is technically no longer solo?
-What other concurrency/thread-safety issues do I face? Is there a better way than this?
A very easy way to solve this is to wrap your algorithm in a transaction, set the isolation level to serializable and retry the whole business operation in case of a deadlock. This should solve all of your concerns in the question.
Making your application deadlock resistant is not easy in such a complex case. I understand you are just getting started with locking and concurrency in the database. This solution, although not perfect, is likely to be enough.
If you require more sophistication, you probably need to do some research around pessimistic locking.
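For the pessimistic route, one common SQL Server idiom is to claim the candidate row with lock hints inside the transaction, so two concurrent searchers cannot both "win" the same solo user. A hypothetical sketch against the question's schema (the query text and parameters are invented):

// Run inside the transaction: UPDLOCK takes an update lock on the row we read,
// and HOLDLOCK keeps it until commit, so a concurrent matcher blocks on this row
// instead of also selecting it.
const string claimPartnerSql =
    @"SELECT TOP (1) UserId
      FROM dbo.Pool WITH (UPDLOCK, HOLDLOCK)
      WHERE Solo = 1 AND Gender = @gender AND City = @city AND UserId <> @userId;";

The same transaction then inserts the pair into Matched and sets both Solo flags to false before committing.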
I think part of the issue is that the "Solo" field is redundant; it simply indicates whether there is a valid entry in the "Matched" table or not. I'd recommend removing it and instead just joining the "Pool" table to the "Matched" table, so you don't have to worry about issues with keeping the two in sync.
Other concurrency issues can be solved by using a concurrency tracking field. See Handling Concurrency with the Entity Framework in an ASP.NET MVC Application.
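In Entity Framework, that tracking field is typically a rowversion column marked as a concurrency token. A sketch (the entity class is invented for illustration):

using System.ComponentModel.DataAnnotations;

// Hypothetical EF entity for the Pool table with a concurrency token.
public class PoolEntry
{
    public int UserId { get; set; }
    public string Gender { get; set; }
    public string City { get; set; }
    public bool Solo { get; set; }

    [Timestamp] // maps to a SQL Server rowversion; EF appends it to the UPDATE's WHERE clause
    public byte[] RowVersion { get; set; }
}

A stale update (someone else matched that user first) then surfaces as a DbUpdateConcurrencyException, which you can catch, re-query, and retry.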
Also, just my opinion, but you may want to consider having an "audit" type of table for saving the deleted entries, or switching to a temporal table for "Matched", rather than simply deleting the entries. That information could be very useful in the future for tweaking your matching algorithm and such.
Queueing is a good idea. If each request for a match were queued, there would be no contention or deadlock when querying the pool.
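A minimal in-process sketch of that idea, with one consumer draining match requests so the matching itself is single-threaded (the types and TryMatch are invented; a real deployment would more likely use a durable queue):

using System.Collections.Concurrent;
using System.Threading.Tasks;

// All match requests funnel into one collection...
var matchRequests = new BlockingCollection<int>(); // user ids awaiting a match

// ...and a single consumer owns the matching, so no two requests race on the Pool.
Task.Run(() =>
{
    foreach (var userId in matchRequests.GetConsumingEnumerable())
    {
        TryMatch(userId); // hypothetical: runs the pool query and pairing for one user
    }
});

// Request handlers just enqueue and return:
matchRequests.Add(42);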
I am running into a strange problem. I have a table whose primary key is generated in an INSTEAD OF trigger. It's probably the worst way to implement a primary key: the trigger fetches the current maximum, increments it by 1, and uses that value as the key.
I have a .NET app that starts a RepeatableRead transaction, then inserts a record into this table. This works fine as long as I don't try to do more than one insert simultaneously. If I do, I receive a PK violation error. It seems to me that both inserts fire, the trigger fetches the same last number in each transaction, increments it by 1, and then tries to insert both records with the same new 'identifier'.
I've spoken to the DBA who set this trigger up, and he is of the opinion that RepeatableRead should stop this from happening, as it apparently locks the table for reads and other transactions will wait for the locks to be released. So, in essence, my transactions would occur serially.
So, the question is, why would I be getting a PK violation if RepeatableRead works the way the DBA described it?
RepeatableRead won't fix that, because it allows phantom inserts, which is exactly what you have here. The second transaction's read doesn't "see" the first one's insert, and you get the behaviour described.
You can fix it with serializable isolation, or by doing separate transactions (the former will increase contention, the latter will reduce it, but the latter may not be possible for you).
Really, the fix is to replace the INSTEAD OF trigger with a proper identity column (this can be done on an existing table, though there are difficulties, especially if replication is being used), or failing that, to use a better algorithm to set the new value.
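With an identity column, the database generates the key atomically, and the OUTPUT clause returns it in the same statement. A sketch (the table, column, and open connection conn are invented):

using System.Data.SqlClient;

// Assumes the key column was redefined as e.g. WidgetId INT IDENTITY(1,1) PRIMARY KEY.
const string insertSql =
    @"INSERT INTO dbo.Widgets (Name)
      OUTPUT INSERTED.WidgetId
      VALUES (@name);";

using (var cmd = new SqlCommand(insertSql, conn))
{
    cmd.Parameters.AddWithValue("@name", name);
    var newId = (int)cmd.ExecuteScalar(); // key generated atomically; no read-then-increment race
}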
First, let me express my disgust at the approach taken to generate the primary key. As for REPEATABLE READ isolation, it only protects the rows you have already read from being updated; it offers no protection against other transactions inserting new rows, and that is the problem with your implementation.
Ideally, I urge you to restructure the primary key generation, but if that is not possible, the only thing left is to use SERIALIZABLE isolation, which protects against inserts (phantoms) as well. However, depending on how and when you determine the next key value, you might not be able to solve it that way either.
I'm working on an online sales web site. I'm using C# 4.0 and SQL Server 2008, and I want to control and prevent users from simultaneously inserting into a table like dbo.orders... How can I do that?
Inserts will not be a problem, but updates can be. The term that you need to research is database concurrency. There are four basic models you can implement, each with its own pros and cons. Some are better suited to certain situations, and there are hundreds of articles on the web on this subject.
Are you trying to solve this in C# code or in SQL? Because in SQL it's simple: if you add BEGIN TRAN at the beginning of the stored procedure and COMMIT at the end, the locks taken inside the transaction will block conflicting statements from other connections, effectively serializing the requests (note that this blocking only happens if the statements actually take conflicting locks; see the sketch below). So if there are two inserts, they will be executed one after another. One thing to remember is that this is a blocking operation, i.e. the second insert won't start until the first one is finished (regardless of whether it succeeded).
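Sketched as a procedure body (every name here is invented for illustration), with lock hints added so the check-then-insert really is atomic:

// Hypothetical procedure: the UPDLOCK/HOLDLOCK hints make the existence check
// take and hold a lock, so a second caller blocks until the first commits,
// then sees the row and skips the insert.
const string createProc = @"
CREATE PROCEDURE dbo.EnsureOrder @OrderNo INT
AS
BEGIN
    BEGIN TRAN;
    IF NOT EXISTS (SELECT 1 FROM dbo.Orders WITH (UPDLOCK, HOLDLOCK)
                   WHERE OrderNo = @OrderNo)
        INSERT INTO dbo.Orders (OrderNo) VALUES (@OrderNo);
    COMMIT;
END";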
In your Add method you can use locking with the lock keyword, which will allow only one thread in at a time.
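For example (note this only serializes threads within a single process; it does not coordinate multiple servers):

private static readonly object InsertGate = new object();

public void Add(Order order) // hypothetical Add method and Order type
{
    lock (InsertGate) // only one thread in this process enters at a time
    {
        InsertOrder(order); // hypothetical database insert
    }
}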
I am using Cache in a web service method like this:
var pblDataList = (List<blabla>)HttpContext.Current.Cache.Get("pblDataList");
if (pblDataList == null)
{
var PBLData = dc.ExecuteQuery<blabla>(@"SELECT blabla");
pblDataList = PBLData.ToList();
HttpContext.Current.Cache.Add("pblDataList", pblDataList, null,
DateTime.Now.Add(new TimeSpan(0, 0, 15)),
Cache.NoSlidingExpiration, CacheItemPriority.Normal, null);
}
But I wonder, is this code thread-safe? The web service method is called by multiple requesters. And more than one requester may attempt to retrieve data and add it to the Cache at the same time while the cache is empty.
The query takes 5 to 8 seconds. Would introducing a lock statement around this code prevent any possible conflicts? (I know that multiple queries can run simultaneously, but I want to be sure that only one query is running at a time.)
The cache object is thread-safe, but HttpContext.Current will not be available from background threads. This may or may not apply to you here; it's not obvious from your code snippet whether or not you are actually using background threads, but in case you are now, or decide to at some point in the future, you should keep this in mind.
If there's any chance that you'll need to access the cache from a background thread, then use HttpRuntime.Cache instead.
In addition, although individual operations on the cache are thread-safe, sequential lookup/store operations are obviously not atomic. Whether or not you need them to be atomic depends on your particular application. If it could be a serious problem for the same query to run multiple times, i.e. if it would produce more load than your database is able to handle, or if it would be a problem for a request to return data that is immediately overwritten in the cache, then you would likely want to place a lock around the entire block of code.
However, in most cases you would really want to profile first and see whether or not this is actually a problem. Most web applications/services don't concern themselves with this aspect of caching because they are stateless and it doesn't matter if the cache gets overwritten.
You are correct. The retrieving and adding operations are not being treated as an atomic transaction. If you need to prevent the query from running multiple times, you'll need to use a lock.
(Normally this wouldn't be much of a problem, but in the case of a long running query it can be useful to relieve strain on the database.)
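A typical shape for that lock, double-checking the cache inside it so only the first thread runs the query. This sketch uses HttpRuntime.Cache, as suggested above, and keeps the question's 15-second expiration:

private static readonly object CacheGate = new object();

var pblDataList = (List<blabla>)HttpRuntime.Cache.Get("pblDataList");
if (pblDataList == null)
{
    lock (CacheGate)
    {
        // Re-check inside the lock: another thread may have filled the cache while we waited.
        pblDataList = (List<blabla>)HttpRuntime.Cache.Get("pblDataList");
        if (pblDataList == null)
        {
            pblDataList = dc.ExecuteQuery<blabla>(@"SELECT blabla").ToList();
            HttpRuntime.Cache.Add("pblDataList", pblDataList, null,
                DateTime.Now.Add(new TimeSpan(0, 0, 15)),
                Cache.NoSlidingExpiration, CacheItemPriority.Normal, null);
        }
    }
}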
I believe the Add should be thread-safe - i.e. it won't error if Add gets called twice with the same key, but obviously the query might execute twice.
Another question, however, is whether the data itself is thread-safe. There is no guarantee that each List<blabla> is isolated - it depends on the cache provider. The in-memory cache provider stores the objects directly, so there is a risk of collisions if any of the threads edit the data (add/remove/swap items in the list, or change properties of one of the items). However, with a serializing provider you should be fine. Of course, this then demands that blabla is serializable...
This seems like perhaps a naive question, but I got into a discussion with a co-worker in which I argued that there is no real need for a cache to be thread-safe/synchronized, since I would assume it does not matter who is putting in a value: the value for a given key should be "constant" (in that it ultimately comes from the same source). If the values can change readily, then the cache itself does not seem all that useful (in that if you care whether the value is "currently correct", you should go to the original source).
The main reason I see to make at least the GET synchronized is when a cache miss is very expensive and you don't want multiple threads each going out to fetch the value and put it back in the cache. Even then, you'd need something that actually blocks all consumers during the read-fetch-put cycle.
Anyhow, my working assumption is that a hash is by its very nature thread-safe, because for any {key, value} combination the value is either null or something for which it doesn't matter who got there "first" to write it.
Question is: Is this a reasonable assumption?
Update: The real scope of my question is around very simple id->value style caches (or {parameters}->{calculated value} ones) where, no matter who writes to the cache, the value will be the same, and we are just trying to avoid "re-calculating"/going back to the database. The actual object graph isn't relevant, and the cache is generally long-lived.
For most implementations of a hash, you'd need to synchronize. What if the hash table needs to be expanded/rehashed? What if two threads are trying to add something to the hash table where the keys are different, but the hashes collide? They could both be modifying the same slot in the hash table in different ways at the same time. Assuming you're using a hash table to implement your cache (which you imply in your question) I suggest reading a little about the details of how hash tables are implemented if you're not already familiar with this.
Writes aren't always atomic. You must either use atomic data types or provide some synchronization (RCU, locks etc.). No shared data is thread-safe per se. Or make this go away by sticking to lock-free algorithms (that is, where possible and feasible).
As long as the cost of acquiring and releasing a lock is less than the cost of recreating the object (from a file or database or whatever), all accesses to a cache should indeed be synchronized. If it's not, you don't really need a cache at all. :)
If you want to avoid data corruption, you must synchronize. This is especially true when the cache contains multiple tables that must be updated atomically. Imagine you have a database for a DMV (department of motor vehicles). When you add a new person to the database, that person will have records for auto registrations, records for tickets received, records for home address, and perhaps other contact information. If you don't update these tables atomically, in the database and in the cache, then any client pulling data out of the cache may get inconsistent data.
Yes, any one piece of data may be constant, but databases very commonly hold data that, if not updated together and atomically, can cause database clients to get incorrect, incomplete, or inconsistent results.
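In code, "update together and atomically" means one lock (or one transaction) spanning every structure involved. A rough sketch (Person, Ticket, and the maps are invented for illustration):

using System.Collections.Generic;

private static readonly object DmvCacheGate = new object();
private static readonly Dictionary<int, Person> PeopleById = new Dictionary<int, Person>();
private static readonly Dictionary<int, List<Ticket>> TicketsByPersonId = new Dictionary<int, List<Ticket>>();

public static void AddPerson(Person person, List<Ticket> tickets)
{
    // Readers must take the same lock, so they never observe one map
    // updated without the other.
    lock (DmvCacheGate)
    {
        PeopleById[person.Id] = person;
        TicketsByPersonId[person.Id] = tickets;
    }
}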
If you are using Java 5 or above, you can use a ConcurrentHashMap. This supports multiple readers and writers in a thread-safe manner.
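For readers on .NET rather than Java, the closest counterpart is ConcurrentDictionary; its GetOrAdd covers the basic lookup-or-compute cycle, though the value factory may run more than once under contention (only one result is kept):

using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<string, decimal>();

// Thread-safe lookup-or-compute. ComputeValue (hypothetical) may be invoked
// concurrently for the same key, but only one returned value is stored.
var value = cache.GetOrAdd("some-key", key => ComputeValue(key));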