I'm dealing with a fairly large-scale C# application which occasionally hits a SQL Server deadlock. I can't figure out what's causing it. My bandaid solution for now is:
(1) Catch the SqlException.
(2) See if the error code is 1205 (i.e. deadlock).
(3) If it is, Sleep() for a couple of seconds and retry the entire transaction. You can assume the previously failed transaction was successfully rolled back.
This works. The problem occurs only once or twice per week so the performance penalty is trivial. It still bugs me that it's occurring though. How do I figure out why it is occurring?
I know a deadlock occurs when two or more SQL Server threads are contending for the same resources. I obviously know which one of my transactions LOSES that battle. It's always the same query. But I'd like to know which which transaction is WINNING the battle. Maybe it's the same code block. Maybe not. I have no way to tell. Is there some special tool I should use to find the deadlock's source?
More info: The losing transaction isn't doing anything particularly exotic; just two large deletes via ExecuteNonQuery() followed by two large bulk inserts using the SqlBulkCopy class -- all in the same SqlTransaction. Both READER_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION are turned on. There are no humans making ad-hoc queries against the database. My application is the only user.
Again, this works flawlessly 99.99%+ of the time... the deadlock is quite rare. It manifests only once or twice per week.
Related
I am using the latest version of Entity Framework on my application (but I don't think EF is the issue here, just stating what ORM we are using) and have this multi-tenant architecture. I was doing some stress tests, built in C#, wherein it creates X-number of tasks that runs in parallel to do some stuff. At some point at the beginning of the whole process, it will create a new database for each task (each tenant in this case) and then continues to process the bulk of the operation. But on some tasks, it throws 2 SQL Exceptions on that exact part of my code where it tries to create a new database.
Exception #1:
Could not obtain exclusive lock on database 'model'. Retry the
operation later. CREATE DATABASE failed. Some file names listed could
not be created. Check related errors.
Exception #2:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding.
It's either of those two and throws on the same line of my code (when EF creates the database). Apparently in SQL Server, when creating a database it does it one at a time and locks the 'model' database (see here) thus some tasks that are waiting throws a timeout or that lock on 'model' error.
Those tests were done on our development SQL Server 2014 instance (12.0.4213) and if I execute, say, 100 parallel tasks there will bound to be an error thrown on some tasks or sometimes even nearly half the tasks I executed.
BUT here's the most disturbing part in all these, when testing it on my other SQL server instance (12.0.2000), which I have installed locally on my PC, no such error throws and completely finishes all the tasks I executed (even 1000 tasks in parallel!).
Solutions I've tried so far but didn't work:
Changed the timeout of the Object context in EF to infinite
Tried adding a longer or infinite timeout on the connection string
Tried adding a Retry strategy on EF and made it longer and run more often
Currently, trying to install Virtual machine with a similar environment to our Dev server (uses Windows Server 2014 R2) and test on specific version of SQL Server to try to see if the versions have anything to do with it (yeah, I'm that desperate :))
Anyway, here is a simple C# console application you can download and try to replicate the issue. This test app will execute N-number of tasks you input and simply creates a database and does cleanup right afterwards.
2 observations:
Since the underlying issue has something to do with concurrency, and access to a "resource" which at a key point only allows a single, but not a concurrent, accessor, it's unsurprising that you might be getting differing results on two different machines when executing highly concurrent scenarios under load. Further, SQL Server Engine differences might be involved. All of this is just par for the course for trying to figure out and debug concurrency issues, especially with an engine involved that has its own very strong notions of concurrency.
Rather than going against the grain of the situation by trying to make something work or fully explain a situation, when things are empirically not working, why not change approach by designing for cleaner handling of the problem?
One option: acknowledge the reality of SQL Server's need to have a exclusive lock on model db by regulating access via some kind of concurrency synchronization mechanism--a System.Threading.Monitor sounds about right for what is happening here and it would allow you to control what happens when there is a timeout, with a timeout of your choosing. This will help prevent the kind of locked up type scenario that may be happening on the SQL Server end, which would be an explanation for the current "timeouts" symptom (although stress load might be the sole explanation).
Another option: See if you can design in such a way that you don't need to synchronize at all. Get to a point where you never request more than one database create simultaneously. Some kind of queue of the create requests--and the queue is guaranteed to be serviced by, say, only one thread--with requesting tasks doing async/await patterns on the result of the creates.
Either way, you are going to have situations where this slows down to a crawl under stress testing, with super stressed loads causing failure. The key questions are:
Can your design handle some multiple of the likely worst case load and still show acceptable performance?
If failure does occur, is your response to the failure "controlled" in a way that you have designed for.
Probably you have different LockTimeoutSeconds and QueryTimeoutSeconds set on the development and local instances for SSDT (DacFx Deploy), which is deploying the databases.
For example LockTimeoutSeconds is used to set lock_timeout. If you have a small number here, this is the reason for
Could not obtain exclusive lock on database 'model'. Retry the operation later. CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
You can use the query below to identify what timeout is set by SSDT
select session_id, lock_timeout, * from sys.dm_exec_sessions where login_name = 'username'
To increase the default timeout, find the identifier of the user, which is deploying the database here
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList
Then find the following registry key
HKEY_USERS\your user identifier\Microsoft\VisualStudio\your version\SQLDB\Database
and change the values for LockTimeoutSeconds and QueryTimeoutSeconds
I have developed an Hangfire application using MVC running in IIS, and it is working absolutely fine, till I saw the size of my SQL Server log file, which grew whopping 40 GB overnight!!
As per information from our DBA, there was an long running transaction, with the following SQL statement (I have 2 hangfire queues in place)-
(#queues1 nvarchar(4000),#queues2 nvarchar(4000),#timeout float)
delete top (1) from [HangFire].JobQueue with (readpast, updlock, rowlock)
output DELETED.Id, DELETED.JobId, DELETED.Queue
where (FetchedAt is null or FetchedAt < DATEADD(second, #timeout, GETUTCDATE()))
and Queue in (#queues1,#queues2)
On exploring the Hangfire library, I found that it is used for dequeuing the jobs, and doing a very simple task that should not take any significant time.
I couldn't found anything that would have caused this error. transactions are used correctly with using statements and object are Disposed in event of exception.
As suggested in some posts, I have checked the recovery mode of my database and verified that it is simple.
I have manually killed the hanged transaction to reclaim the log file space, but it come up again after few hours. I am observing it continuously.
What could be the reason for such behavior? and how it can be prevented?
The issue seems to be intermittent, and it could be of extremely high risk to be deployed on production :(
Starting from Hangfire 1.5.0, Hangfire.SqlServer implementation wraps the whole processing of a background job with a transaction. Previous implementation used invisibility timeout to provide at least once processing guarantee without requiring a transaction, in case of an unexpected process shutdown.
I've implemented a new model for queue processing, because there were a lot of confusion for new users, especially ones who just installed Hangfire and played with it under a debugging session. There were a lot of questions like "Why my job is still under processing state?". I've considered there may be problems with transaction log growth, but I didn't know this may happen even with Simple Recovery Model (please see this answer to learn why).
It looks like there should be a switch, what queue model to use, based on transactions (by default) or based on invisibility timeout. But this feature will be available in 1.6 only and I don't know any ETAs yet.
Currently, you can use Hangfire.SqlServer.MSMQ or any other non-RDBMS queue implementations (please see the Extensions page). Separate database for Hangfire may also help, especially if your application changes a lot of data.
I have a simple stored procedure in T-SQL that is instant when run from SQL Server Management Studio, and has a simple execution plan. It's used in a C# web front-end, where it is usually quick, but occasionally seems to get itself into a state where it sits there and times-out. It then does this consistently from any web-server. The only way to fix it that I’ve found is to drop and recreate it. It only happens with a single stored procedure, out of a couple of hundred similar procedures that are used in the application.
I’m looking for an answer that’s better than making a service to test it every n minutes and dropping and recreating on timeout.
As pointed out by other responses, the reasons could be many, varying from execution plan, to the actual SP code, to anything else. However, in my past experience, I faced a similar problem due to 'parameter sniffing'. Google for it and take a look, it might help. Basically, you should use local variables in your SP instead of the parameters passed in.
Not sure if my situation is too uncommon to be useful to others (It involved use of table variables inside the stored proc). But here is the story anyways.
I was working on an issue where a stored proc would take 10 seconds in most cases, but 3-4 minutes every now and then. After a little digging around, I found a pattern in the issue :
This being a stored proc that takes in a start date and and an end date, if I ran this for a year's worth of data (which is what people normally do), it ran in 10 sec. However when the query plan cache was cleared out, and if someone ran it for a day (uncommon use case), all further calls for a 1-year range would take 3-4 minutes, until I did a DBCC FREEPROCCACHE
The following 2 things were what fixed the problem
My first suspect was Parameter sniffing. Fixed it immediately using the local variable approach This, however, improved performance only by a small percentage (<10%).
In a clutching-at-straws approach, I changed the table variables that the original developer had used in this stored proc, to temp tables. This was what fixed the issue finally. Now that I know that this was the problem, I am doing some reading online, and have come across a few links such as
http://www.sqlbadpractices.com/using-table-variable-for-large-table-vs-temporary-table/
which seem to correspond with the issue I am seeing.
Happy coding!!
It's hard to say for sure without seeing SP code.
Some suggestions.
SQL server by default reuses execution plan for stored procedure. The plan is generated upon the first execution. That may cause a problem. For example, for the first time you provide input with very high selectivity, and SQL Server generates the plan keeping that in mind. Next time you pass low selectivity input, but SP reuses the old plan, causing very slow execution.
Having different execution paths in SP causes the same problem.
Try creating this procedure WITH RECOMPILE option to prevent caching.
Hope that helps.
Run SQL Profiler and execute it from the web site until it happens again. When it pauses / times out check to see what is happening on the SQL server itself.
There are lots of possibilities here depending on what the s'proc actually does. For example, if it is inserting records then you may have issues where the database server needs to expand the database and/or log file size to accept new data. If it's happening on the log file and you have slow drives or are nearing the max of your drive space then it could timeout.
If it's a select, then those tables might be locked for a period of time due to other inserts happening... Or it might be reusing a bad execution plan.
The drop / recreate dance is may only be delaying the execution to the point that the SQL server can catch up or it might be causing a recompile.
My original thought was that it was an index but on further reflection, I don't think that dropping and recreating the stored prod would help.
It most probably your cached execution plan that is causing this.
Try using DBCC FREEPROCCACHE to clean your cache the next time this happens. Read more here http://msdn.microsoft.com/en-us/library/ms174283.aspx
Even this is a reactive step - it does not really solve the issue.
I suggest you execute the procedure in SSMS and check out the actual Execution Plan and figure out what is causing the delay. (in the Menu, go to [View] and then [Include Actual Execution Plan])
Let me just suggest that this might be unrelated to the procedure itself, but to the actual operation you are trying to do on the database.
I'm no MS SQL expert, but I would'n be surprised that it behaves similarly to Oracle when two concurrent transactions try to delete the same row: the transaction that first reaches the deletion locks the row and the second transaction is then blocked until the first one either commits or rolls back. If that was attempted from your procedure it might appear as "stuck" (until the "locking" transaction is finished).
Do you have any long-running transactions that might lock rows that your procedure is accessing?
I understand how editing rows can cause concurrency issues, but concurrency issues being caused by selecting rows is something I do not understand.
If a query selects data from a database, how can a concurrency issue arise? Is it if there is a change made to the data I'm selecting, things will blow up?
In any case, if there is a concurrency issue caused by a select query, what is the best way to handle it? This is what I have in mind, but I wouldn't be surprised at all if it were wrong.
try
{
var SelectQuery =
from a DB.Table
where a.Value == 1
select new {Result = a};
}
catch
{
//retry query??
}
In this case your select operation essentially amounts to a read / query. Even read only operations can cause concurrency issues in an application.
The simplest example is when the object being read from has thread affinity and the read occurs from a different thread. This can cause a race since the data is being accessed in an improper way.
The best way to handle a concurrency issue is to simply avoid it. If you have 2 threads playing with the same peice of data using a lock to serialize access to the data is probably the best approach. Although a definitive solution requires a bit more detail.
Can you explain what is happening here and why the race is occurring? Do other threads modify this object while you are reading it?
When your query is run, a SQL query will be generated to correspond to your query. If other threads (or anything else) are attempting to modify the tables involved in your query, the database server will generally detect this and take care of the logic necessary to keep this from causing any real problems. It may take a little longer for your query to execute if it keeps bumping heads with update statements, but the only real problem would be if the system detects that some combination of running transactions is actually causing a deadlock. In this case, it will kill one of those connections. I believe this would only happen if your statements are attempting to update database values--not from selects.
A more important point to make, looking at your example, is that the code that you put in the try/catch block doesn't actually do any querying. It just builds an expression tree. The SQL query will not actually be run until you do something that causes this expression tree to be evaluated, like calling SelectQuery.ToList().
Keep in mind that there are a number of things that can "go wrong" when you're trying to query a database. Maybe somebody's doing massive updates of the data you're trying to select, and your connection times out before finishing the query. Maybe a cable gets bumped, or a random bit of cosmic radiation causes a bit somewhere to get lost. Then again, maybe your query has something wrong with it, or maybe the database context you're using is not synchronized to the database schema. Some of the things that could go wrong are only intermittent, and you could just try again like your question suggests. Other things might be longer-lasting, and will keep recurring. For these latter cases, if you try to repeat your action until you stop getting errors, your thread may hang there for a very long time.
So when you're deciding how to handle database connection problems, pay attention to how often you expect each type of problem to occur. I have seen code that attempts to run a transaction three times before giving up, like this. But when it comes to everyday queries, this sort of thing happens so rarely that I personally would just allow the exception to trickle up to where the user interface can say "There was an unexpected error. Please try again. If the problem persists, contact your administrator." Or something like that.
I'm writing an application in C# which accesses a SQL Server 2005 database. The application is quite database intensive, and even if I try to optimize all access, set up proper indexes and so on I expect that I will get deadlocks sooner or later. I know why database deadlocks occur, but I doubt I'll be able to release the software without deadlocks occuring at some time. The application is using Entity Framework for database access.
Are there any good pattern for handling SQLExceptions (deadlocked) in the C# client code - for example to re-run the statement batch after x milliseconds?
To clarify; I'm not looking for a method on how to avoid deadlocks in the first place (isolation levels, indexes, order of statements etc) but rather how to handle them when they actually occur.
I posted a code sample to handle exactly this a while back, but SO seemed to lose my account in the interim so I can't find it now I'm afraid and don't have the code I used here.
Short answer - wrap the thing in a try..catch. If you catch an error which looks like a deadlock, sleep for a short random time and increment a retry the counter. If you get another error or the retry counter clears your threshold, throw the error back up to the calling routine.
(And if you can, try to bung this in a general routine and run most/all of your DB access through it so you're handling deadlocks program-wide.)
EDIT: Ah, teach me not to use Google! The previous code sample I and others gave is at How to get efficient Sql Server deadlock handling in C# with ADO?
Here is the approach we took in the last application framework I worked on. When we detected a deadlock, we simply reran the transaction. We did this up to 5 times. If after 5 times it failed, we would throw an exception. I don't recall a time that the second attempt ever failed. We would know because we were logging all activity in the backend code. So we knew any time a deadlock occurred, and we knew if it failed more than 5 times. This approach worked well for us.
Randy