I am attempting to update the RavenDB storage provider for Hangfire to RavenDB 4, and I sometimes receive the following exception:
Raven.Client.Exceptions.RavenException: 'System.InvalidOperationException: A write transaction is already opened by this thread
I checked for unclosed sessions, but all sessions except one are wrapped in using blocks, and the last one is special because it is part of a class that acts like a transaction builder and is disposed on commit. I was unable to find any operations that might run longer in the background or anything else that could cause this.
I'd appreciate a little help narrowing down what could cause this, because I have absolutely no idea and the documentation didn't help much.
After upgrading from RavenDB 4.0.0-rc-40025 to a nightly build of RavenDB 4 (after Ayende Rahien suggested it should be a server issue), I never got this exception again. I scheduled thousands of jobs before posting this as an answer, to make sure it was a server-side issue.
Before the upgrade I got the exception pretty much every time I scheduled many jobs.
Related
I am using the latest version of Entity Framework in my application (I don't think EF is the issue here; I'm just stating which ORM we are using) with a multi-tenant architecture. I was doing some stress tests, built in C#, in which X number of tasks run in parallel. At the beginning of the whole process, each task (each tenant in this case) creates a new database and then continues with the bulk of the operation. But on some tasks, it throws one of two SqlExceptions on the exact line of my code where it tries to create the new database.
Exception #1:
Could not obtain exclusive lock on database 'model'. Retry the
operation later. CREATE DATABASE failed. Some file names listed could
not be created. Check related errors.
Exception #2:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding.
It's always one of those two, thrown on the same line of my code (where EF creates the database). Apparently SQL Server creates databases one at a time, taking a lock on the 'model' database while doing so (see here), so some of the waiting tasks throw the timeout or the lock-on-'model' error.
Those tests were run against our development SQL Server 2014 instance (12.0.4213), and if I execute, say, 100 parallel tasks, some of them are bound to throw an error, sometimes nearly half of them.
BUT here's the most disturbing part of all this: when testing against my other SQL Server instance (12.0.2000), which I have installed locally on my PC, no such error is thrown and all the tasks I execute complete (even 1000 tasks in parallel!).
Solutions I've tried so far, none of which worked:
Changed the timeout of the ObjectContext in EF to infinite
Tried adding a longer or infinite timeout in the connection string
Tried adding a retry strategy to EF, with longer waits and more attempts
Currently, trying to install a virtual machine with an environment similar to our dev server (which uses Windows Server 2014 R2) and test on specific versions of SQL Server to see if the version has anything to do with it (yeah, I'm that desperate :))
Anyway, here is a simple C# console application you can download to try to replicate the issue. The test app executes the N tasks you specify; each simply creates a database and cleans up right afterwards.
2 observations:
Since the underlying issue involves concurrent access to a resource that, at a key point, allows only a single accessor rather than concurrent ones, it's unsurprising that you get differing results on two different machines when executing highly concurrent scenarios under load. Differences between SQL Server engine versions might also be involved. All of this is par for the course when trying to figure out and debug concurrency issues, especially with an engine involved that has its own very strong notions of concurrency.
Rather than going against the grain of the situation by trying to force something to work, or to fully explain behavior that is empirically unreliable, why not change approach and design for cleaner handling of the problem?
One option: acknowledge the reality of SQL Server's need to take an exclusive lock on the model database by regulating access through some kind of concurrency-synchronization mechanism. A System.Threading.Monitor sounds about right for what is happening here, and it would let you control what happens on a timeout, with a timeout of your choosing. This will help prevent the kind of locked-up scenario that may be happening on the SQL Server end, which would explain the current "timeouts" symptom (although stress load alone might be the explanation).
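A minimal sketch of that idea, assuming a single process (Monitor does not coordinate across processes) and an illustrative 60-second timeout; the createDatabase delegate stands in for your own EF create call:

using System;
using System.Threading;

static class DatabaseCreationGate
{
    private static readonly object Gate = new object();

    public static void RunSerialized(Action createDatabase, TimeSpan timeout)
    {
        // Only one thread at a time may issue CREATE DATABASE; the rest
        // wait up to `timeout`, then fail in a controlled way.
        if (!Monitor.TryEnter(Gate, timeout))
            throw new TimeoutException("Timed out waiting for the database-creation lock.");
        try
        {
            createDatabase();
        }
        finally
        {
            Monitor.Exit(Gate);
        }
    }
}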
Another option: see if you can design things so that you don't need to synchronize at all. Get to a point where you never request more than one database creation simultaneously: some kind of queue of create requests, guaranteed to be serviced by, say, a single thread, with the requesting tasks awaiting the result of each create.
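For example, something like this single-consumer queue (entirely illustrative; the type name and the createDatabase delegate are placeholders for your own code):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

sealed class DatabaseCreationQueue
{
    private readonly BlockingCollection<(string Name, TaskCompletionSource<bool> Done)> _queue
        = new BlockingCollection<(string, TaskCompletionSource<bool>)>();

    public DatabaseCreationQueue(Action<string> createDatabase)
    {
        // One long-running consumer guarantees the CREATE DATABASE
        // statements are issued strictly one at a time.
        Task.Factory.StartNew(() =>
        {
            foreach (var item in _queue.GetConsumingEnumerable())
            {
                try
                {
                    createDatabase(item.Name);
                    item.Done.SetResult(true);
                }
                catch (Exception ex)
                {
                    item.Done.SetException(ex);
                }
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Requesting tasks enqueue a create request and await the outcome;
    // no two creates ever overlap.
    public Task EnqueueAsync(string databaseName)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Add((databaseName, tcs));
        return tcs.Task;
    }
}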
Either way, you are going to have situations where this slows down to a crawl under stress testing, with super stressed loads causing failure. The key questions are:
Can your design handle some multiple of the likely worst case load and still show acceptable performance?
If failure does occur, is your response to the failure "controlled", in a way that you have designed for?
You probably have different LockTimeoutSeconds and QueryTimeoutSeconds values set for SSDT (DacFx deploy), which is what deploys the databases, on the development and local instances.
For example, LockTimeoutSeconds is used to set lock_timeout. If you have a small value there, that is the reason for:
Could not obtain exclusive lock on database 'model'. Retry the operation later. CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
You can use the query below to identify what timeout is set by SSDT
select session_id, lock_timeout, * from sys.dm_exec_sessions where login_name = 'username'
To increase the default timeout, first find the identifier of the user who is deploying the database here:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList
Then find the following registry key
HKEY_USERS\your user identifier\Software\Microsoft\VisualStudio\your version\SQLDB\Database
and change the values for LockTimeoutSeconds and QueryTimeoutSeconds
I have developed a Hangfire application using MVC, running in IIS, and it was working absolutely fine until I saw the size of my SQL Server log file, which grew to a whopping 40 GB overnight!
As per information from our DBA, there was a long-running transaction with the following SQL statement (I have two Hangfire queues in place):
(@queues1 nvarchar(4000), @queues2 nvarchar(4000), @timeout float)
delete top (1) from [HangFire].JobQueue with (readpast, updlock, rowlock)
output DELETED.Id, DELETED.JobId, DELETED.Queue
where (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))
and Queue in (@queues1, @queues2)
On exploring the Hangfire library, I found that this statement is used for dequeuing jobs, a very simple task that should not take any significant time.
I couldn't find anything that would have caused this. Transactions are used correctly with using statements, and objects are disposed in the event of an exception.
As suggested in some posts, I have checked the recovery model of my database and verified that it is Simple.
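For reference, the recovery model can be checked with a standard catalog query ('HangfireDb' is a placeholder for the actual database name):

select name, recovery_model_desc from sys.databases where name = 'HangfireDb'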
I manually killed the hung transaction to reclaim the log file space, but it came back again after a few hours. I am monitoring it continuously.
What could be the reason for this behavior, and how can it be prevented?
The issue seems to be intermittent, and it would be extremely risky to deploy this to production :(
Starting from Hangfire 1.5.0, the Hangfire.SqlServer implementation wraps the entire processing of a background job in a transaction. The previous implementation used an invisibility timeout to provide an at-least-once processing guarantee without requiring a transaction, in case of an unexpected process shutdown.
I implemented the new model for queue processing because there was a lot of confusion among new users, especially ones who had just installed Hangfire and played with it under a debugging session. There were a lot of questions like "Why my job is still under processing state?". I considered that there might be problems with transaction log growth, but I didn't know this could happen even with the Simple recovery model (please see this answer to learn why).
It looks like there should be a switch to choose the queue-processing model: transaction-based (the default) or based on an invisibility timeout. But this feature will only be available in 1.6, and I can't give an ETA yet.
Currently, you can use Hangfire.SqlServer.MSMQ or any other non-RDBMS queue implementation (please see the Extensions page). A separate database for Hangfire may also help, especially if your application changes a lot of data.
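If you go the MSMQ route, the setup looks roughly like this (a sketch based on the Hangfire.SqlServer.Msmq package; the connection string and queue path are illustrative):

using Hangfire;
using Hangfire.SqlServer;

static class HangfireConfig
{
    // Called once at application startup (e.g. in OWIN startup or Global.asax).
    public static void Configure()
    {
        GlobalConfiguration.Configuration
            .UseSqlServerStorage(@"Server=.;Database=HangfireDb;Integrated Security=SSPI;")
            .UseMsmqQueues(@".\Private$\hangfire-{0}", "default", "critical");
    }
}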
I'm dealing with a fairly large-scale C# application which occasionally hits a SQL Server deadlock. I can't figure out what's causing it. My bandaid solution for now is:
(1) Catch the SqlException.
(2) See if the error code is 1205 (i.e. deadlock).
(3) If it is, Sleep() for a couple of seconds and retry the entire transaction. You can assume the previously failed transaction was successfully rolled back.
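In code, the band-aid looks roughly like this (a minimal sketch; the class name, delegate, and 2-second delay are illustrative):

using System;
using System.Data.SqlClient;
using System.Threading;

static class DeadlockBandaid
{
    // Steps (1)-(3) above; runTransaction stands in for the real transaction body.
    public static void Execute(Action runTransaction)
    {
        try
        {
            runTransaction();
        }
        catch (SqlException ex) when (ex.Number == 1205) // chosen as deadlock victim
        {
            // SQL Server has already rolled the failed transaction back,
            // so it is safe to sleep briefly and retry the whole thing.
            Thread.Sleep(TimeSpan.FromSeconds(2));
            runTransaction();
        }
    }
}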
This works. The problem occurs only once or twice per week so the performance penalty is trivial. It still bugs me that it's occurring though. How do I figure out why it is occurring?
I know a deadlock occurs when two or more SQL Server threads contend for the same resources. I obviously know which one of my transactions LOSES that battle; it's always the same query. But I'd like to know which transaction is WINNING the battle. Maybe it's the same code block, maybe not. I have no way to tell. Is there some special tool I should use to find the deadlock's source?
More info: the losing transaction isn't doing anything particularly exotic; just two large deletes via ExecuteNonQuery() followed by two large bulk inserts using the SqlBulkCopy class, all in the same SqlTransaction. Both READ_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION are turned on. There are no humans making ad-hoc queries against the database; my application is the only user.
Again, this works flawlessly 99.99%+ of the time... the deadlock is quite rare. It manifests only once or twice per week.
Greetings
I stumbled onto a problem today that seems sort of impossible to me, but it's happening... I'm calling some database code in C# that looks something like this:
using (var tran = MyDataLayer.Transaction())
{
    MyDataLayer.ExecSproc(new SprocTheFirst(arg1, arg2));
    MyDataLayer.CallSomethingThatEventuallyDoesLinqToSql(arg1, argEtc);
    tran.Commit();
}
I've simplified this a bit for posting, but what's going on is that MyDataLayer.Transaction() makes a TransactionScope with the IsolationLevel set to Snapshot and TransactionScopeOption set to Required. This code gets called hundreds of times a day and almost always works perfectly. However, after reviewing some data I discovered there are a handful of records created by SprocTheFirst but no corresponding data from CallSomethingThatEventuallyDoesLinqToSql.
The only way records should exist in the tables I'm looking at is via SprocTheFirst, and it's only ever called in this one function, so if it was called and succeeded then I would expect CallSomethingThatEventuallyDoesLinqToSql to be called and succeed too, because it's all in the same TransactionScope. It's theoretically possible that some other dev mucked around in the DB, but I don't think they have. We also log all exceptions, and I can find nothing unusual happening around the time the records from SprocTheFirst were created.
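For reference, a wrapper like MyDataLayer.Transaction() presumably looks something like this (my reconstruction from the description, not the actual code):

using System.Transactions;

static class MyDataLayer
{
    // Reconstruction: a TransactionScope with Snapshot isolation and
    // TransactionScopeOption.Required, as described above.
    public static TransactionScope Transaction()
    {
        var options = new TransactionOptions
        {
            IsolationLevel = IsolationLevel.Snapshot
        };
        return new TransactionScope(TransactionScopeOption.Required, options);
    }
}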
So, is it possible that a transaction, or more properly a declarative TransactionScope, with Snapshot isolation level can fail somehow and only partially commit?
We have spotted the same issue. I have recreated it here: https://github.com/DavidBetteridge/MSMQStressTest
For us, the issue appears when reading from the queue rather than writing to it. Our solution was to change the isolation level of the first read in the subscriber to Serializable.
No, but the Snapshot isolation level isn't the same as Serializable.
Snapshotted row versions are stored in tempdb until the transaction commits, so some other transaction can read the old data just fine.
At least that's how I understood your problem. If not, please provide more info, like a graph of the timeline or something similar.
Can you verify that CallSomethingThatEventuallyDoesLinqToSql is using the same connection as the first call? Does the second call read data that the first call wrote to the db? If it were unable to "see" that data, would that cause the second call to skip a few steps and not do its job?
Just because you have it wrapped in a .NET transaction doesn't mean the data as seen in the db is the same between connections. You could, for instance, have connections to two different databases and want to roll back both if one failed, or write data to a DB and post a message to MSMQ, where a failed MSMQ operation would roll back the DB operation too. The .NET transaction takes care of this multi-technology coordination for you.
I do remember a problem in early versions of ADO.NET (maybe .NET 3.0) where the pooled-connection code would allocate a new db connection rather than use the current one when a .NET-level TransactionScope was used. I believe it was fully fixed in 3.5 (I may have my versions wrong; it might be 3.5 and 3.5.1). It could also be caused by MyDataLayer and how it allocates connections.
Use SQL Profiler to trace these operations and make sure the work is being done on the same spid.
It sounds like your connection may not be enlisted in the transaction. When do you create your connection object? If it is created before the TransactionScope, it will not be enlisted in the transaction.
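To illustrate the ordering rule (a sketch; the connection string and class name are placeholders):

using System;
using System.Data.SqlClient;
using System.Transactions;

class EnlistmentDemo
{
    static void Main()
    {
        // Hypothetical connection string, for illustration only.
        const string cs = @"Server=.;Database=MyDb;Integrated Security=SSPI;";

        // A connection opened BEFORE the scope does not auto-enlist.
        var early = new SqlConnection(cs);
        early.Open();

        using (var scope = new TransactionScope())
        {
            // `early` is NOT part of this transaction unless you call
            // early.EnlistTransaction(Transaction.Current) explicitly.

            // A connection opened INSIDE the scope auto-enlists.
            using (var inside = new SqlConnection(cs))
            {
                inside.Open(); // enlisted in the ambient transaction
                // ... commands on `inside` participate in the transaction ...
            }

            scope.Complete();
        }

        early.Dispose();
    }
}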
I'm writing an application in C# which accesses a SQL Server 2005 database. The application is quite database-intensive, and even if I try to optimize all access, set up proper indexes and so on, I expect I will get deadlocks sooner or later. I know why database deadlocks occur, but I doubt I'll be able to release the software without deadlocks occurring at some point. The application is using Entity Framework for database access.
Are there any good patterns for handling SqlExceptions (deadlocks) in the C# client code, for example re-running the statement batch after x milliseconds?
To clarify: I'm not looking for a method of avoiding deadlocks in the first place (isolation levels, indexes, order of statements, etc.) but rather for how to handle them when they actually occur.
I posted a code sample to handle exactly this a while back, but SO seems to have lost my account in the interim, so I'm afraid I can't find it now and don't have the code I used here.
Short answer: wrap the thing in a try..catch. If you catch an error which looks like a deadlock, sleep for a short random time and increment a retry counter. If you get a different error, or the retry counter exceeds your threshold, throw the error back up to the calling routine.
(And if you can, try to bung this in a general routine and run most/all of your DB access through it so you're handling deadlocks program-wide.)
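Something along these lines (a sketch of the pattern just described; error number 1205 is SQL Server's deadlock-victim code, and the retry threshold and delay bounds are arbitrary):

using System;
using System.Data.SqlClient;
using System.Threading;

static class Db
{
    private static readonly Random Rng = new Random();

    // Run `work`, retrying automatically whenever this session is chosen
    // as a deadlock victim. Route most/all DB access through this method
    // to get program-wide handling.
    public static T WithDeadlockRetry<T>(Func<T> work, int maxRetries = 4)
    {
        for (int retries = 0; ; retries++)
        {
            try
            {
                return work();
            }
            catch (SqlException ex) when (ex.Number == 1205 && retries < maxRetries)
            {
                // Short random sleep so the competing transactions don't
                // immediately collide again.
                Thread.Sleep(Rng.Next(100, 1000));
            }
        }
    }
}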
EDIT: Ah, teach me not to use Google! The previous code sample I and others gave is at How to get efficient Sql Server deadlock handling in C# with ADO?
Here is the approach we took in the last application framework I worked on. When we detected a deadlock, we simply reran the transaction, up to 5 times; if it still failed after that, we threw an exception. I don't recall a time when even the second attempt failed. We would know, because we logged all activity in the backend code, so we knew any time a deadlock occurred and whether it failed more than 5 times. This approach worked well for us.
Randy