Patterns for handling a SQL deadlock in C#?

I'm writing an application in C# which accesses a SQL Server 2005 database. The application is quite database intensive, and even if I try to optimize all access, set up proper indexes and so on, I expect that I will get deadlocks sooner or later. I know why database deadlocks occur, but I doubt I'll be able to release the software without deadlocks occurring at some point. The application is using Entity Framework for database access.
Are there any good patterns for handling SqlExceptions (deadlock victim errors) in the C# client code - for example, to re-run the statement batch after x milliseconds?
To clarify: I'm not looking for ways to avoid deadlocks in the first place (isolation levels, indexes, order of statements, etc.) but rather for how to handle them when they actually occur.

I posted a code sample to handle exactly this a while back, but SO seems to have lost my account in the interim, so I can't find it now, I'm afraid, and I don't have the code I used here.
Short answer - wrap the thing in a try..catch. If you catch an error which looks like a deadlock, sleep for a short random time and increment a retry counter. If you get any other error, or the retry counter exceeds your threshold, throw the error back up to the calling routine.
(And if you can, try to bung this in a general routine and run most/all of your DB access through it so you're handling deadlocks program-wide.)
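A minimal sketch of that general routine, assuming plain ADO.NET and that SQL Server reports deadlock victims as error number 1205 (the method name and retry limits here are illustrative, not a definitive implementation):

using System;
using System.Data.SqlClient;
using System.Threading;

static T ExecuteWithDeadlockRetry<T>(Func<T> operation, int maxRetries = 5)
{
    var random = new Random();
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return operation();
        }
        catch (SqlException ex) when (ex.Number == 1205 && attempt <= maxRetries)
        {
            // Deadlock victim: back off for a short random interval, then retry.
            Thread.Sleep(random.Next(100, 500) * attempt);
        }
        // Any other error, or a deadlock after maxRetries attempts, fails the
        // filter above and propagates to the calling routine.
    }
}

Note that the delegate should re-run the whole unit of work (open the connection, start the transaction, run the batch), not just the last statement, since the deadlock victim's transaction has already been rolled back.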
EDIT: Ah, teach me not to use Google! The previous code sample I and others gave is at How to get efficient Sql Server deadlock handling in C# with ADO?

Here is the approach we took in the last application framework I worked on. When we detected a deadlock, we simply reran the transaction. We did this up to 5 times. If after 5 times it failed, we would throw an exception. I don't recall a time that the second attempt ever failed. We would know because we were logging all activity in the backend code. So we knew any time a deadlock occurred, and we knew if it failed more than 5 times. This approach worked well for us.
Randy

Related

How to decide what exceptions are worth retrying when reading and writing to MongoDB (C# driver)?

Looking at the official documentation, it seems that there are basically three types of errors thrown by the MongoDB C# driver:
errors thrown when the driver is not able to properly select or connect to a server to issue the query against. These errors lead to a TimeoutException
errors thrown when the driver has successfully selected a server to run the query against, but the server goes down while the query is being executed. These errors manifest themselves as MongoConnectionException
errors thrown during a write operation. These errors lead to MongoWriteException or MongoBulkWriteException, depending on the type of write operation being performed.
I'm trying to make my software using MongoDB a bit more resilient to transient errors, so I want to find out which exceptions are worth retrying.
The problem is not implementing a solid retry policy (I usually employ Polly .NET for that), but instead understanding when the retry makes sense.
I think that retrying on exceptions of type TimeoutException doesn't make sense, because the driver itself waits for a few seconds before timing out an operation (the default is 30 seconds, but you can change that via the connection string options). Retrying an operation that has already waited 30 seconds to time out is probably a waste of time: for instance, if you implement 3 retries with 1 second of waiting time between them, it can take up to 93 seconds to fail an operation (30 + 30 + 30 + 1 + 1 + 1). That is a huge amount of time.
As documented here, retrying on MongoConnectionException is only safe when performing idempotent operations. From my point of view, it makes sense to always retry on this kind of error, provided that the operation being performed is idempotent.
The hard bit in deciding a good retry strategy for writes is what to do when you get an exception of type MongoWriteException or MongoBulkWriteException.
Regarding exceptions of type MongoWriteException, it is probably worth retrying all exceptions having a ServerErrorCategory other than DuplicateKey. As documented here, you can detect duplicate key errors by using the Category property of the MongoWriteException.WriteError object.
Retrying duplicate key errors probably doesn't make sense, because you will just get them again (a duplicate key violation is not a transient error).
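As a sketch, that check might look like the following, assuming the official MongoDB.Driver package and an IMongoCollection<T> named collection (the document name is illustrative):

try
{
    await collection.InsertOneAsync(document);
}
catch (MongoWriteException ex) when (ex.WriteError.Category == ServerErrorCategory.DuplicateKey)
{
    // Not transient: retrying would hit the same unique index violation,
    // so handle it as a business-level conflict instead of retrying.
}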
I have no idea how to handle errors of type MongoBulkWriteException safely. In that case you are inserting multiple documents into MongoDB, and it is entirely possible that only some of them have failed while the others have been successfully written. So retrying the exact same bulk insert operation could lead to writing the same document twice (bulk writes are not idempotent in nature). How can I handle this scenario?
Do you have any suggestions?
Do you know of any working example or reference regarding retrying queries on MongoDB with the C# driver?
Retry
Let's start with the basics of Retry.
There are situations where your requested operation relies on a resource which might not be reachable at a certain point in time. In other words, there can be a temporal issue which will vanish sooner or later. This sort of issue causes transient failures. With retries you can overcome these problems by attempting to redo the same operation at a specific moment in the future. To be able to use this mechanism, the following criteria should be met:
The potentially introduced observable impact is acceptable
The operation can be redone without any irreversible side effect
The introduced complexity is negligible compared to the promised reliability
Let’s review them one by one:
The word failure indicates that the effect is observable by the requester as well, for example via higher latency or reduced throughput. If the “penalty“ (delay or reduced performance) is unacceptable, then retry is not an option for you.
This requirement is also known as an idempotent operation: if I call the action with the same input several times, it will produce the exact same result. In other words, the operation acts as if it depends only on its parameters and nothing else influences the result (such as other objects' state).
Even though this condition is one of the most crucial, it is the one that is almost always forgotten. As always there are trade-offs (if I introduce Z then it will increase X but it might decrease Y), and we should be fully aware of them; otherwise they will give us some unwanted surprises at the least expected time.
Mongo Exception
Let's continue with the exceptions that MongoDB's C# client can throw.
I haven't used MongoDB in the last couple of years, so this knowledge may be outdated. But I hope the essence has not changed since.
I would also encourage you to introduce detection logic first (catch and log) before you try to mitigate the problem (for example with retry). This will give you information about the frequency and number of occurrences. It will also give you insight into the nature of the problems.
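A sketch of such detection-first logic, assuming an IMongoCollection<T> named collection and a Microsoft.Extensions.Logging logger (both illustrative):

try
{
    return await collection.Find(filter).ToListAsync();
}
catch (MongoException ex)
{
    // Detection first: record what failed and how often, then rethrow
    // unchanged so current behaviour is not altered yet.
    logger.LogWarning(ex, "MongoDB operation failed");
    throw;
}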
MongoConnectionException with a SocketException as Inner
When:
There is a server selection problem
The connection has timed out
The chosen server is unavailable
Retry:
If the problem is due to a network issue, then it might be useful to retry (see the sketch after this section)
If the root cause is misconfiguration, then retry won't help
Log:
ConnectionId and Message
ToJson might be useful as well
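A sketch of that retry with Polly (which the question already mentions), assuming the wrapped operation is idempotent; the retry count and backoff values are illustrative:

// Retry only connection failures backed by a socket error, i.e. likely
// network issues rather than misconfiguration.
var retryPolicy = Policy
    .Handle<MongoConnectionException>(ex => ex.InnerException is SocketException)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Reads are naturally idempotent, so they are safe to wrap like this.
var documents = await retryPolicy.ExecuteAsync(() => collection.Find(filter).ToListAsync());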
MongoWriteException or MongoWriteConcernException
When:
There was a persistence problem
Retry:
It depends: if you perform a create operation and the server can detect duplicates (DuplicateKeyError), then it is better to try to write the record multiple times than to have one failed write attempt
Most of the time updates are not idempotent, but if you use some sort of record versioning then you can retry and let the optimistic locking check fail if another writer got there first (see the sketch after this section)
Deletion can be implemented in an idempotent way. This is true for both soft and hard deletes.
Log:
WriteError, WriteConcernError and Message
In case of MongoWriteConcernException: Code, Command and Result
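For the update case, here is a sketch of the record-versioning idea mentioned above, assuming the document carries a Version field (the Order type and its fields are illustrative):

// The filter matches only the version we originally read; a concurrent
// writer bumps the version, so our replace then matches nothing instead
// of silently overwriting.
var filter = Builders<Order>.Filter.Where(o => o.Id == order.Id && o.Version == order.Version);
order.Version++;
var result = await collection.ReplaceOneAsync(filter, order);
if (result.ModifiedCount == 0)
{
    // Lost the race (or an earlier attempt already succeeded): reload the
    // document and decide whether the change still needs to be applied.
}

Because the write is guarded by the version check, retrying it after a MongoConnectionException cannot apply the same change twice.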

A write transaction is already opened by this thread in RavenDB4

I am attempting to update the RavenDB storage for Hangfire to RavenDB 4, and I sometimes receive the following exception:
Raven.Client.Exceptions.RavenException: 'System.InvalidOperationException: A write transaction is already opened by this thread
I checked for unclosed sessions, but all sessions except one use using, and the last one is specific because it is part of a class that acts like a transaction builder and is disposed on commit. I was unable to find what operations might take longer in the background or what else could cause it.
I'd appreciate a little help with narrowing down what could cause this, because I have absolutely no idea and documentation didn't help much.
After upgrading to a nightly version of RavenDB 4 instead of RavenDB 4.0.0-rc-40025 (after Ayende Rahien suggested it should be a server issue), I never got this exception again. I scheduled thousands of jobs before posting this as an answer, to make sure it was a server-side issue.
Before the upgrade, I got the exception pretty much every time I scheduled many jobs.

Parallel execution of CREATE DATABASE statements results in an error, but not on a separate SQL Server instance

I am using the latest version of Entity Framework in my application (but I don't think EF is the issue here, just stating which ORM we are using), which has a multi-tenant architecture. I was doing some stress tests, built in C#, wherein it creates an X number of tasks that run in parallel to do some stuff. At the beginning of the whole process, it creates a new database for each task (each tenant in this case) and then continues to process the bulk of the operation. But on some tasks, it throws two SQL exceptions on the exact part of my code where it tries to create a new database.
Exception #1:
Could not obtain exclusive lock on database 'model'. Retry the operation later. CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
Exception #2:
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
It's either of those two, and it throws on the same line of my code (where EF creates the database). Apparently in SQL Server, when creating a database, it does so one at a time and locks the 'model' database (see here), thus some of the waiting tasks throw the timeout or the lock-on-'model' error.
Those tests were done on our development SQL Server 2014 instance (12.0.4213), and if I execute, say, 100 parallel tasks, an error is bound to be thrown on some tasks, sometimes even on nearly half the tasks I executed.
BUT here's the most disturbing part of all this: when testing on my other SQL Server instance (12.0.2000), which I have installed locally on my PC, no such error is thrown, and it completely finishes all the tasks I execute (even 1000 tasks in parallel!).
Solutions I've tried so far but didn't work:
Changed the timeout of the Object context in EF to infinite
Tried adding a longer or infinite timeout on the connection string
Tried adding a Retry strategy on EF and made it longer and run more often
Currently, trying to install a virtual machine with an environment similar to our dev server (which uses Windows Server 2012 R2) and test on specific versions of SQL Server to try to see if the versions have anything to do with it (yeah, I'm that desperate :))
Anyway, here is a simple C# console application you can download and try to replicate the issue. This test app will execute N-number of tasks you input and simply creates a database and does cleanup right afterwards.
2 observations:
Since the underlying issue has something to do with concurrency, and access to a "resource" which at a key point only allows a single, but not a concurrent, accessor, it's unsurprising that you might be getting differing results on two different machines when executing highly concurrent scenarios under load. Further, SQL Server Engine differences might be involved. All of this is just par for the course for trying to figure out and debug concurrency issues, especially with an engine involved that has its own very strong notions of concurrency.
Rather than going against the grain by trying to force something to work, or to fully explain a situation, when things are empirically not working, why not change approach and design for cleaner handling of the problem?
One option: acknowledge the reality of SQL Server's need to take an exclusive lock on the model db by regulating access via some kind of concurrency synchronization mechanism--a System.Threading.Monitor sounds about right for what is happening here, and it would allow you to control what happens when there is a timeout, with a timeout of your choosing. This will help prevent the kind of locked-up scenario that may be happening on the SQL Server end, which would be an explanation for the current "timeouts" symptom (although stress load might be the sole explanation).
Another option: see if you can design in such a way that you don't need to synchronize at all. Get to a point where you never request more than one database create simultaneously. Some kind of queue of the create requests--with the queue guaranteed to be serviced by, say, only one thread--and with requesting tasks doing async/await patterns on the results of the creates.
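A sketch combining both ideas: a single process-wide gate that requesting tasks await, so only one CREATE DATABASE is in flight at a time (SemaphoreSlim rather than Monitor so the wait can be asynchronous and bounded; all names here are illustrative):

using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

static class DatabaseProvisioner
{
    static readonly SemaphoreSlim CreateDbGate = new SemaphoreSlim(1, 1);

    public static async Task CreateTenantDatabaseAsync(string connectionString, string dbName)
    {
        // Only one task at a time gets to take SQL Server's lock on 'model'.
        if (!await CreateDbGate.WaitAsync(TimeSpan.FromMinutes(2)))
            throw new TimeoutException("Timed out waiting to create database " + dbName);
        try
        {
            // dbName is assumed validated; don't concatenate untrusted input.
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("CREATE DATABASE [" + dbName + "]", connection))
            {
                await connection.OpenAsync();
                await command.ExecuteNonQueryAsync();
            }
        }
        finally
        {
            CreateDbGate.Release();
        }
    }
}

Note that this only serializes creates within a single process; if several application instances hit the same SQL Server, they will still contend for the 'model' lock.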
Either way, you are going to have situations where this slows down to a crawl under stress testing, with super stressed loads causing failure. The key questions are:
Can your design handle some multiple of the likely worst case load and still show acceptable performance?
If failure does occur, is your response to the failure "controlled", in a way that you have designed for?
Probably you have different LockTimeoutSeconds and QueryTimeoutSeconds set on the development and local instances for SSDT (DacFx Deploy), which is deploying the databases.
For example, LockTimeoutSeconds is used to set lock_timeout. If you have a small value here, this is the reason for:
Could not obtain exclusive lock on database 'model'. Retry the operation later. CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
You can use the query below to identify what timeout is set by SSDT
select session_id, lock_timeout, * from sys.dm_exec_sessions where login_name = 'username'
To increase the default timeout, find the identifier of the user, which is deploying the database here
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList
Then find the following registry key
HKEY_USERS\your user identifier\Microsoft\VisualStudio\your version\SQLDB\Database
and change the values for LockTimeoutSeconds and QueryTimeoutSeconds

How do I find the source of a SqlException (1205) Deadlock?

I'm dealing with a fairly large-scale C# application which occasionally hits a SQL Server deadlock. I can't figure out what's causing it. My bandaid solution for now is:
(1) Catch the SqlException.
(2) See if the error code is 1205 (i.e. deadlock).
(3) If it is, Sleep() for a couple of seconds and retry the entire transaction. You can assume the previously failed transaction was successfully rolled back.
This works. The problem occurs only once or twice per week so the performance penalty is trivial. It still bugs me that it's occurring though. How do I figure out why it is occurring?
I know a deadlock occurs when two or more SQL Server threads are contending for the same resources. I obviously know which one of my transactions LOSES that battle. It's always the same query. But I'd like to know which transaction is WINNING the battle. Maybe it's the same code block. Maybe not. I have no way to tell. Is there some special tool I should use to find the deadlock's source?
More info: The losing transaction isn't doing anything particularly exotic; just two large deletes via ExecuteNonQuery() followed by two large bulk inserts using the SqlBulkCopy class -- all in the same SqlTransaction. Both READ_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION are turned on. There are no humans making ad-hoc queries against the database. My application is the only user.
Again, this works flawlessly 99.99%+ of the time... the deadlock is quite rare. It manifests only once or twice per week.

How do I address cases when a transaction completes in the database but a connectivity issue causes an exception?

In my Azure web service I have code that invokes a stored procedure in SQL Azure. Sometimes the stored procedure completes but the connection is broken afterwards, and the caller gets a SqlException claiming that "Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding."
The caller will then reopen the connection and try to rerun the same code. The problem is that the code first checks that the database table stores "the right state", and since the abovementioned stored procedure has already been run, the database state has changed, so the check fails and an exception is thrown.
So the problem is that the calling code relies on the assumption that "no exceptions" equals "database change okay", i.e. that if there was an exception then the database has not changed. In this case the exception is caused by temporary connectivity problems occurring after the database change, so the assumption turns out to be wrong.
What's the typical way to address such cases?
Use a DTC and a remote transaction; then at least you are handling this properly. You must promote the transaction from "local" to "distributed". This has issues in itself, but there is hardly another way to do it properly.
Alternatively, you can reprogram so that you handle this situation properly in code.
I'd recommend using the Enterprise Library Transient Fault Handling Block. We have started incorporating it in our Web Roles.
The only way to solve the problem entirely is to make the transactions idempotent, which is a fancy way of saying that no matter how many times you run the procedure the final state is the correct one. Once you do that then it doesn't matter if the first transaction (or the second, third, fourth, etc) failed. You just keep trying until it works, and then you know you've got the right state.
How exactly you achieve idempotence is situational. In many cases you can use guard clauses to check if you're already in the desired state, but other times you might need to do something more complex.
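A sketch of the guard-clause idea, assuming the desired state is detectable from the data itself (the table, column, and procedure names are illustrative):

// Re-running this is safe: the guard checks whether the change has
// already been applied before invoking the procedure again.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    @"IF NOT EXISTS (SELECT 1 FROM Orders WHERE Id = @id AND Status = 'Shipped')
          EXEC MarkOrderShipped @id;", connection))
{
    command.Parameters.AddWithValue("@id", orderId);
    await connection.OpenAsync();
    await command.ExecuteNonQueryAsync();
}

With this shape, the caller can simply retry on a timeout: if the first attempt actually committed, the guard makes the second attempt a no-op.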
