I am trying to build a system that matches users up in real time. Based on specific criteria, users are matched one to one. I have the following database tables (it is a chat-roulette type of system):
Pool
+UserId
+Gender
+City
+Solo (bool)
Matched
+UserId
+PartnerId
When a user enters a certain page, they are added to the Pool table with Solo set to true. The system then searches for another user by querying the Pool table and returning the rows where Solo is true (meaning they do not have a partner) and Gender and City match the query. If a match is returned, both users are put in the Matched table and both of their Solo columns in the Pool table are set to false. If they break the connection, they are deleted from the Matched table and their Solo columns change back to true. I am having trouble architecting this so that it is thread-safe and handles concurrency. Here are some questions that I got stuck on:
-What if 2 users query the Pool table at the same time and both get back the same "solo" user? How do I prevent this?
-What if a user queries the Pool before another user's Solo column gets changed, so that user is returned in the result set even though he is technically no longer solo?
-What other concurrency/thread-safety issues do I face? Is there a better way than this?
A very easy way to solve this is to wrap your algorithm in a transaction, set the isolation level to serializable and retry the whole business operation in case of a deadlock. This should solve all of your concerns in the question.
Making your application deadlock resistant is not easy in such a complex case. I understand you are just getting started with locking and concurrency in the database. This solution, although not perfect, is likely to be enough.
If you require more sophistication you probably need to do some research around pessimistic locking.
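The transaction-plus-retry idea from this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 instead of the asker's database, and SQLite's `BEGIN IMMEDIATE` (which takes the write lock up front, so writers run one at a time) stands in for a serializable transaction. The helper names `connect`, `create_schema`, `_match_once` and `try_match` are made up for the example.

```python
import sqlite3
import time


def connect(path):
    # isolation_level=None: autocommit mode, so BEGIN/COMMIT below are explicit.
    return sqlite3.connect(path, timeout=5.0, isolation_level=None)


def create_schema(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS Pool ("
                 "UserId INTEGER PRIMARY KEY, Gender TEXT, City TEXT, Solo INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS Matched ("
                 "UserId INTEGER PRIMARY KEY, PartnerId INTEGER)")


def _match_once(conn, user_id, gender, city):
    # Assumption: BEGIN IMMEDIATE stands in for a serializable transaction.
    conn.execute("BEGIN IMMEDIATE")
    try:
        me = conn.execute("SELECT Solo FROM Pool WHERE UserId = ?",
                          (user_id,)).fetchone()
        candidate = conn.execute(
            "SELECT UserId FROM Pool WHERE Solo = 1 AND Gender = ? AND City = ? "
            "AND UserId <> ? ORDER BY UserId LIMIT 1",
            (gender, city, user_id)).fetchone()
        partner = None
        if me is not None and me[0] == 1 and candidate is not None:
            partner = candidate[0]
            # The read and both writes happen inside one transaction.
            conn.execute("UPDATE Pool SET Solo = 0 WHERE UserId IN (?, ?)",
                         (user_id, partner))
            conn.execute("INSERT INTO Matched VALUES (?, ?)", (user_id, partner))
            conn.execute("INSERT INTO Matched VALUES (?, ?)", (partner, user_id))
        conn.execute("COMMIT")
        return partner
    except Exception:
        conn.execute("ROLLBACK")
        raise


def try_match(conn, user_id, gender, city, retries=3):
    # Retry the whole business operation if the database reports a lock conflict.
    for attempt in range(retries):
        try:
            return _match_once(conn, user_id, gender, city)
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(0.05 * (attempt + 1))
```

Because the check for a solo partner and the update of both rows sit in the same transaction, a second caller cannot see the partner as solo once the first caller has committed. On another database the lock-conflict exception and the isolation-level syntax will differ.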
I think part of the issue is that the "Solo" field is redundant; it simply indicates whether there is a valid entry in the "Matched" table or not. I'd recommend removing it and instead just joining the "Pool" table to the "Matched" table, so you don't have to worry about issues with keeping the two in sync.
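As a rough illustration of dropping the Solo column, the "solo" users can be derived with a left join. This sketch uses Python's sqlite3 with made-up sample rows, and it assumes the Matched table holds one row per user (so a pair is stored as two rows); `find_solo` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pool (UserId INTEGER PRIMARY KEY, Gender TEXT, City TEXT);
CREATE TABLE Matched (UserId INTEGER PRIMARY KEY, PartnerId INTEGER);
INSERT INTO Pool VALUES (1, 'F', 'Oslo'), (2, 'F', 'Oslo'), (3, 'F', 'Oslo');
INSERT INTO Matched VALUES (1, 2), (2, 1);
""")


def find_solo(conn, gender, city):
    # A user is solo exactly when there is no Matched row for them.
    rows = conn.execute(
        """
        SELECT p.UserId
        FROM Pool p
        LEFT JOIN Matched m ON m.UserId = p.UserId
        WHERE m.UserId IS NULL AND p.Gender = ? AND p.City = ?
        ORDER BY p.UserId
        """,
        (gender, city),
    ).fetchall()
    return [row[0] for row in rows]
```

With only one place recording who is matched, there is no second flag that can drift out of sync.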
Other concurrency issues can be solved by using a concurrency tracking field. See Handling Concurrency with the Entity Framework in an ASP.NET MVC Application.
Also, just my opinion, but you may want to consider having an "audit" type of table for saving the deleted entries, or switch to using a temporal table for "Matches", rather than simply deleting the entries. That information could be very useful in the future, for tweaking your matching algorithm and such.
Queueing is a good idea. If each request for a match were queued and handled by a single consumer, there would be no contention or deadlock when querying the pool.
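A minimal in-process sketch of the queueing idea, assuming a single matcher thread and the simplification that two users match when they ask for the same (gender, city) pair. The names `run_matcher`, `requests` and `matches` are invented for the example, and a real system would likely use a durable queue rather than an in-memory one.

```python
import queue
import threading


def run_matcher(requests, matches):
    """Single consumer: pairs up users one request at a time."""
    waiting = {}  # (gender, city) -> user_id currently waiting for a partner
    while True:
        item = requests.get()
        if item is None:  # sentinel: stop the matcher
            break
        user_id, gender, city = item
        key = (gender, city)
        partner = waiting.pop(key, None)
        if partner is None:
            waiting[key] = user_id  # nobody waiting yet, so this user waits
        else:
            matches.append((partner, user_id))


if __name__ == "__main__":
    requests = queue.Queue()
    matches = []
    worker = threading.Thread(target=run_matcher, args=(requests, matches))
    worker.start()
    for uid in range(4):
        requests.put((uid, "F", "Oslo"))
    requests.put(None)
    worker.join()
    print(matches)
```

Only the matcher thread touches `waiting`, so two requests can never claim the same partner; producers only ever call `requests.put`.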
The problem is this: when a service receives messages from several other services and wants to apply those changes to a table, can the simultaneous changes cause a problem?
To be more precise, when a service receives two different messages from two different queues and wants to apply both sets of changes to the database, applying them at the same time will probably cause a problem.
Suppose one message contains updated user information and a message from another queue relates to a different case, and both sets of changes are to be applied to Mongo (assuming they arrive at the same time or very close together). If the database is changing the author information, the information in the term collection must be updated at the same time or a few moments later.
The table information for this service is as follows:
To deal with Concurrency Conflict, this usually comes in two flavors:
Pessimistic concurrency control
Pessimistic, or negative, concurrency control is when a record is locked at the time the user begins his or her edit process. In this concurrency mode, the record remains locked for the duration of the edit. The primary advantage is that no other user is able to get a lock on the record for updating, effectively informing any requesting user that they cannot update the record because it is in use.
There are several drawbacks to pessimistic concurrency control. If the user goes for a coffee break, the record remains locked, denying anyone else the ability to update the record, even if it has been untouched by the initial requestor. Also, in order to maintain record locks, a persistent connection to the database server is required. Since web applications can have hundreds or thousands of simultaneous users, a persistent connection to the database cannot be maintained without having tremendous resources on the database server. Moreover, some database tools are licensed based on the number of concurrent connections. As such, applications that use pessimistic concurrency would require additional licenses for use.
Optimistic concurrency control
Optimistic concurrency means we allow concurrency conflicts to happen, while (wanting to) believe that they will not. If one happens anyway, we react to it in some manner. It is supported in Entity Framework: you have concurrency exceptions to handle, you can add a column of row version type (or timestamp) to the database table, and so on. It's probably a good moment to stop and come back to the subject in a separate post!
Frameworks such as Entity Framework have optimistic concurrency control built in (although it may be turned off). It’s instructive to quickly see how it works. Basically there are three steps:
Get an entity from the DB and disconnect.
Edit in memory.
Update the db with changes using a special update clause. Something like: "Update this row WHERE the current values are the same as the original values".
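The three steps above can be sketched with a version column. This is a simplified illustration, not what Entity Framework or a MongoDB driver does internally: it uses Python's sqlite3, and the `Documents` table, its `Version` column, and the `make_db`, `load` and `save` helpers are all assumptions made for the example.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Documents ("
                 "Id INTEGER PRIMARY KEY, Title TEXT, Version INTEGER NOT NULL)")
    conn.execute("INSERT INTO Documents VALUES (1, 'draft', 1)")
    conn.commit()
    return conn


def load(conn, doc_id):
    # Step 1: read the row together with the version it had at read time.
    return conn.execute("SELECT Title, Version FROM Documents WHERE Id = ?",
                        (doc_id,)).fetchone()


def save(conn, doc_id, new_title, original_version):
    # Step 3: update only if nobody changed the row since it was loaded.
    cur = conn.execute(
        "UPDATE Documents SET Title = ?, Version = Version + 1 "
        "WHERE Id = ? AND Version = ?",
        (new_title, doc_id, original_version))
    conn.commit()
    return cur.rowcount == 1  # False means a concurrency conflict
```

If two editors load the same version, the first `save` succeeds and bumps the version, and the second `save` updates zero rows, so the caller can reload and retry or report the conflict.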
Here are some useful articles to help you with optimistic concurrency control:
OPTIMISTIC CONCURRENCY IN MONGODB USING .NET AND C#
Document-Level Optimistic Concurrency in MongoDB
I use transactions for concurrent updates: query the record by ID before performing the update.
I'm considering implementing sql database caching using the following scheme:
In the ASP.NET web application I want a continuously running thread that checks a table, say dbStatus, to see whether the field dbDirty has been set to true. If so, the local in-memory cache is updated by querying a view in which all needed tables are present.
When any of the tables in the view is updated, a trigger on that table fires and sets dbStatus.dbDirty to true. This means I have to add an insert/update/delete trigger on each of those tables.
One of the reasons I want to implement such a caching scheme is that the same database is used in a Winform version of this application.
My question: is this a viable approach?
Many thanks in advance for helping me with this one, Paul
This is a viable approach.
The main problem you need to be aware of is that ASP.NET worker processes can exit at any time for many reasons (deployment, recycle, reboot, bluescreen, bug, ...). This means that your code must tolerate being aborted (in fact just disappearing) at any time.
Also, consider that two instances of your app can run at the same time during worker recycling, or if you run multiple servers for HA.
Also, cross-request state in a web app requires you to correctly synchronize your actions. This sounds like you might need to solve some race conditions.
Besides that this approach works.
Consider incrementing a version number instead of a boolean. That makes it easier to avoid synchronization issues such as lost updates because there is no need to reset the flag. There is only one writer. That's easier than multiple writers.
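A small sketch of the version-number idea, written in Python for brevity. The `VersionedCache` class and its two callables are hypothetical: `read_version` would read the counter that the triggers increment, and `load_data` would query the view.

```python
class VersionedCache:
    """Reloads cached data whenever the database version counter changes."""

    def __init__(self, read_version, load_data):
        self._read_version = read_version  # e.g. SELECT the version from dbStatus
        self._load_data = load_data        # e.g. SELECT everything from the view
        self._seen_version = None
        self.data = None

    def refresh_if_stale(self):
        # Read the version BEFORE loading. If a write lands in between, the
        # next poll sees a newer version and reloads once more, which is safe.
        current = self._read_version()
        if current == self._seen_version:
            return False
        self.data = self._load_data()
        self._seen_version = current
        return True
```

The cache never writes to the status table, so there is no flag to reset and no window in which a reset could swallow a concurrent update.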
I currently have a method which reads data to determine if an update is needed, and then pushes the update to the database (dependency injected). The method is hit very hard, and I found concurrency related bugs, namely, multiple updates since several threads read the data before the first update.
I solved this using a lock, and it works quite nicely. How can I use a TransactionScope to do the same thing instead? Can I? Will it block another thread as a lock would? Further, can I 'lock' on a specific 'id' as I am doing with a lock (I keep a Dictionary that stores an object to lock on for each id)?
I am using Entity Framework 5, though it's hidden by a repository and unit of work pattern.
Application-level locking may not be a solution for this problem. First of all, you usually need to lock only a single record or a range of records. Next, you may later need to lock other modifications as well and end up with quite complex code.
This situation is usually handled with either optimistic or pessimistic concurrency.
Optimistic concurrency - you add a database-generated column (databases usually have a special type for this, such as timestamp or rowversion). The database automatically updates that column every time you update the record. If you configure this column as a row version, EF includes it in the WHERE condition of the update, so the executed update searches for the record with the given key and row version. If the record is found, it is updated. If it is not found, either no record with that key exists or someone else has updated the record since the current process loaded its data; you get an exception and can refresh the data and try to save your changes again. This mode is useful for records that are not updated very often. In your case it may just cause additional trouble.
Pessimistic concurrency - this mode uses database locking instead. When you query the record, you lock it for update so that no one else can lock it for update or update it directly. Unfortunately, this mode currently has no direct support in EF and you must execute it through raw SQL. I wrote an article explaining pessimistic concurrency and its usage with EF. Even pessimistic concurrency may not be a good solution for a database under heavy load.
If you are really building a solution where many concurrent processes try to update the same data all the time, you may end up redesigning it, because there is no reliable, high-performing solution based on locking or rerunning failed updates.
I have a process that is running multi threaded.
Process has a thread safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in SQL). If one insert fails, all three fail. Note that the scope of the transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one db connection per thread; I'm mainly concerned with the lock levels on the tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency between rows in a table, except that the auto identity needs to be incremented. If incrementing the identity takes a table-level lock, then I suppose the other threads will have to wait. The inserts may or may not be fast. If threads are going to have to wait, does it make sense to do multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server, multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent, so it does not serialize access. Inserts may block each other if they insert the same key in a unique index (one of them will also hit a duplicate key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions, see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables, there also shouldn't be conflicts as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially of foreign key constraints may introduce blocking and possible deadlocks. Without an exact schema definition it is impossible to tell whether or not you are susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really really really high end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot spot symptoms, see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing the load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Async programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement actually contradicts itself. Foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity? I don't see that in any of the documentation, and I just tested an insert with (rowlock) on a table with an identity column and it works.
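To make the point about foreign keys concrete, here is a tiny sketch using Python's sqlite3 as a stand-in for SQL Server. The `Parent` and `Child` tables and the `make_db` and `insert_child` helpers are made up for the example. The insert into the child table can only succeed or fail correctly if the engine looks up the parent row, which is the implicit read.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE Parent (Id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE Child ("
                 "Id INTEGER PRIMARY KEY, "
                 "ParentId INTEGER NOT NULL REFERENCES Parent(Id))")
    conn.execute("INSERT INTO Parent VALUES (1)")
    conn.commit()
    return conn


def insert_child(conn, parent_id):
    # The engine must check that the parent row exists before accepting the insert.
    try:
        with conn:
            conn.execute("INSERT INTO Child (ParentId) VALUES (?)", (parent_id,))
        return True
    except sqlite3.IntegrityError:
        return False
```

On a server database, that lookup is what can take locks on the parent table while other transactions are writing to it.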
To minimize locking take a rowlock. For all the stored procedures update the tables in the same order.
You have inserts into three tables taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and I am getting 100 per second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your insert and if you can sort before inserting it will make a huge difference. Review the need for any other indexes. If you must have other indexes then monitor the fragmentation and defragment.
Related but not the same: I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was optimized at 4 parallel processes starting with empty tables, but the problem was that after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the clustered PK index was in insert order and dropped any other index that did not yield at least a 50% select bump. On the nightly insert, I first drop (disable) the indexes and use just two threads: one to parse and one to insert. Then I recreate the indexes at the end of the load. I got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often, I think, indexes are added for small select benefits without considering the hit to inserts and updates. The select benefit is also often overvalued, because you build the index and compare, and that fresh index has no fragmentation.
Heavy-duty DBMSs like MSSQL are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not is always hard to guess without knowing anything about your system. It's not hard to try it out, though, so you can find out yourself. If I were to guess, I would say it won't help you much if all your threads are going to do is insert rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads than it has CPU cores at any given moment.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can automatically take as given. The level of parallelism and its effect on processing speed depend on lots of factors that are very specific to the processing, such as whether there is I/O involved or whether each thread only does in-memory processing. This is, I think, one of the reasons Microsoft offers the task schedulers in its TPL framework and generally treats the concurrency in this library as something that is supposed to be set at runtime.
I think your safest bet is to run test queries/processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how they handle identity columns, though).
I have an ASP.NET C# business webapp that is used internally. One issue we are running into as we've grown is that the original design did not account for concurrency checking, so now multiple users are accessing the same data and overwriting other users' changes. So my question is: for webapps, do people usually use a pessimistic or an optimistic concurrency system? What drives the preference for one over the other, and what are some of the design considerations to take into account?
I'm currently leaning towards an optimistic concurrency check since it seems more forgiving, but I'm concerned about the potential for multiple changes being made that would be in contradiction to each other.
Thanks!
Optimistic locking.
Pessimistic locking is harder to implement and will cause problems in a web environment. What action will release the lock: closing the browser? Leaving the session to time out? And what if they then do save their changes?
You don't specify which database you are using. MS SQL Server has a timestamp data type. It has nothing to do with time, though. It is merely a number that gets changed each time the row is updated. You don't have to do anything to make sure it gets changed; you just need to check it. You can achieve something similar by using a last-modified date/time, as @KM suggests, but this means you have to remember to change it each time you update the row. If you use a datetime, you need a data type with sufficient precision to ensure that the value cannot stay the same when it should change. For example, someone saves a row, then someone reads it, then another save happens but leaves the modified date/time unchanged. I would use timestamp unless there was a requirement to track the last-modified date on records.
To check it, you can do as @KM suggests and include it in the WHERE clause of the update statement. Or you can begin a transaction, check the timestamp, do the update if all is well, and then commit the transaction; if not, return a failure code or error.
Holding transactions open (as suggested by @le dorfier) is similar to pessimistic locking, but the amount of data locked may be more than a row. Most RDBMSs lock at the page level by default. You will also run into the same issues as with pessimistic locking.
You mention in your question that you are worried about conflicting updates. That is surely what the locking will prevent. Both optimistic and pessimistic locking will, when properly implemented, prevent exactly that.
I agree with the first answer above, we try to use optimistic locking when the chance of collisions is fairly low. This can be easily implemented with a LastModifiedDate column or incrementing a Version column. If you are unsure about frequency of collisions, log occurrences somewhere so you can keep an eye on them. If your records are always in "edit" mode, having separate "view" and "edit" modes could help reduce collisions (assuming you reload data when entering edit mode).
If collisions are still high, pessimistic locking is more difficult to implement in web apps, but definitely possible. We have had good success with "leasing" records (locking with a timeout)... similar to that 2 minute warning you get when you buy tickets on TicketMaster. When a user goes into edit mode, we put a record into the "lock" table with a timeout of N minutes. Other users will see a message if they try to edit a record with an active lock. You could also implement a keep-alive for long forms by renewing the lease on any postback of the page, or even with an ajax timer. There is also no reason why you couldn't back this up with a standard optimistic lock mentioned above.
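The "leasing" approach described here might look roughly like the sketch below. It is written with Python's sqlite3 for illustration only; the `RecordLock` table, its columns, and the `make_db` and `acquire_lease` helpers are assumptions, and the current time is passed in explicitly so the expiry logic is easy to follow.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RecordLock ("
                 "RecordId INTEGER PRIMARY KEY, "
                 "Owner TEXT NOT NULL, "
                 "ExpiresAt REAL NOT NULL)")
    return conn


def acquire_lease(conn, record_id, owner, now, minutes=2):
    """Take or renew a lease; returns False if someone else holds a live one."""
    expires_at = now + minutes * 60
    with conn:  # commits on success, rolls back if an exception escapes
        # Throw away a lease whose timeout has passed.
        conn.execute("DELETE FROM RecordLock WHERE RecordId = ? AND ExpiresAt <= ?",
                     (record_id, now))
        # Keep-alive: the current owner just extends the lease.
        cur = conn.execute(
            "UPDATE RecordLock SET ExpiresAt = ? WHERE RecordId = ? AND Owner = ?",
            (expires_at, record_id, owner))
        if cur.rowcount == 1:
            return True
        try:
            conn.execute("INSERT INTO RecordLock VALUES (?, ?, ?)",
                         (record_id, owner, expires_at))
            return True
        except sqlite3.IntegrityError:
            return False  # another user holds an active lease on this record
```

Calling `acquire_lease` again from a postback or an ajax timer renews the lease, and the primary key on `RecordId` is what stops two users from holding a lease on the same record at once.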
Many apps will need a combination of both approaches.
Here's a simple solution for many people working on the same records.
When you load the data, get the last changed date; we use LastChgDate on our tables.
When you save (update) the data, add "AND LastChgDate=previouslyLoadedLastChgDate" to the WHERE clause. If the row count of the update is 0, raise an error saying that "someone else has already saved this data" and roll back everything; otherwise the data is saved.
I generally do the above logic on header tables only and not on the details tables, since they are all in one transaction.
I assume you're experiencing the 'lost update' problem.
To counter this as a rule of thumb I use pessimistic locking when the chances of a collision are high (or transactions are short lived) and optimistic locking when the chances of a collision are low (or transactions are long lived, or your business rules encompass multiple transactions).
You really need to see what applies to your situation and make a judgment call.