SQL Server log file grew 40GB with Hangfire - c#

I have developed a Hangfire application using MVC running in IIS, and it was working absolutely fine, until I saw the size of my SQL Server log file, which grew a whopping 40 GB overnight!!
As per information from our DBA, there was a long-running transaction with the following SQL statement (I have 2 Hangfire queues in place):
(@queues1 nvarchar(4000), @queues2 nvarchar(4000), @timeout float)
delete top (1) from [HangFire].JobQueue with (readpast, updlock, rowlock)
output DELETED.Id, DELETED.JobId, DELETED.Queue
where (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))
and Queue in (@queues1, @queues2)
On exploring the Hangfire library, I found that this statement is used for dequeuing jobs and performs a very simple task that should not take any significant time.
I couldn't find anything that would have caused this error. Transactions are used correctly with using statements, and objects are disposed in the event of an exception.
As suggested in some posts, I have checked the recovery mode of my database and verified that it is simple.
I have manually killed the hung transaction to reclaim the log file space, but it comes back again after a few hours. I am observing it continuously.
What could be the reason for such behavior, and how can it be prevented?
The issue seems to be intermittent, and it would be extremely risky to deploy this to production :(

Starting from Hangfire 1.5.0, the Hangfire.SqlServer implementation wraps the whole processing of a background job in a transaction. The previous implementation used an invisibility timeout to provide an at-least-once processing guarantee, without requiring a transaction, in case of an unexpected process shutdown.
I implemented the new model for queue processing because there was a lot of confusion for new users, especially ones who had just installed Hangfire and were playing with it under a debugging session. There were a lot of questions like "Why is my job still in the Processing state?". I considered that there might be problems with transaction log growth, but I didn't know this could happen even with the Simple recovery model (please see this answer to learn why).
It looks like there should be a switch for which queue model to use: based on transactions (by default) or based on an invisibility timeout. But this feature will only be available in 1.6, and I don't have any ETA yet.
Currently, you can use Hangfire.SqlServer.MSMQ or any other non-RDBMS queue implementation (please see the Extensions page). A separate database for Hangfire may also help, especially if your application changes a lot of data.
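For example, switching the queues to MSMQ is mostly a configuration change. A minimal sketch, assuming the Hangfire.SqlServer.Msmq package and private MSMQ queues created for each Hangfire queue (the connection string name, queue path pattern and queue names below are placeholders):

using Hangfire;

public class HangfireConfig
{
    public static void Configure()
    {
        // Job storage stays in SQL Server, but the queues move to MSMQ, so
        // dequeuing no longer holds a long-running SQL transaction.
        GlobalConfiguration.Configuration
            .UseSqlServerStorage("HangfireConnection")
            .UseMsmqQueues(@".\Private$\hangfire-{0}", "default", "critical");
    }
}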

Related

StackExchange.Redis - Unexplainable time-out exception issue

We are experiencing issues in our .NET Core 3.1 integration with Azure Redis Cache.
The exception thrown is:
An unhandled exception has occurred while executing the request.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync, ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Yes, I have already read the article, and we are using the StackExchange.Redis NuGet package, latest version available. Steps we already took were:
Set the minimum threadpool count with several values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
Increase the Redis timeout value from the default 5 seconds to 15 seconds (going any higher will not solve it, I think, to be honest, as you will read a bit further :)); a sketch of one way to set this follows right after this list
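For reference, a minimal sketch of raising the timeouts explicitly via StackExchange.Redis configuration (the wrapper method and the 15000 ms values are illustrative; the same settings can also be appended to the connection string as syncTimeout=15000,asyncTimeout=15000):

using StackExchange.Redis;

public static class RedisConnectionFactory
{
    public static ConnectionMultiplexer Connect(string connectionString)
    {
        // Parse the existing connection string and bump both timeouts from the
        // default 5000 ms; the IDistributedCache path uses the async operations.
        var options = ConfigurationOptions.Parse(connectionString);
        options.SyncTimeout = 15000;   // milliseconds
        options.AsyncTimeout = 15000;  // milliseconds
        return ConnectionMultiplexer.Connect(options);
    }
}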
What is the setup, you ask?
A .NET Core 3.1 REST API running on the latest IIS with a 3 worker threads setting, on a 4-core Windows server with 16 GB of RAM (we don't see any extremes in the monitoring regarding CPU or memory)
Connected to Azure Redis Cache, currently running a Basic C5 with high network bandwidth and 23 GB of memory (it was a lower tier before, so we tried scaling this one up)
Pushing requests to an Azure Service Bus at the end (no problems there)
A batch process is running and processing tens of thousands of API calls (across several APIs), of which the one mentioned above is crashing against the Redis cache with the timeout exception. The other APIs are running correctly and not timing out, but they are currently connecting to a different Redis cache (just to isolate this API's behavior).
All APIs and batch programs use a custom NuGet package that contains the cache implementation, so we are sure it can't be an implementation issue in that one API; it is all shared code.
How do we use the cache? Well, via dependency injection we inject ISharedCacheStore, which is just our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, which is the implementation using Redis (ISharedCacheStore exists for future use of other caching mechanisms).
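For illustration only, a rough sketch of what such a wrapper can look like; the member names below are assumptions, not the actual package code:

using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Hypothetical shape of the shared cache abstraction: async-only, on top of IDistributedCache.
public interface ISharedCacheStore
{
    Task<T> GetAsync<T>(string key);
    Task SetAsync<T>(string key, T value, DistributedCacheEntryOptions options);
}

public class SharedCacheStore : ISharedCacheStore
{
    private readonly IDistributedCache _cache;

    public SharedCacheStore(IDistributedCache cache) => _cache = cache;

    public async Task<T> GetAsync<T>(string key)
    {
        // Values are stored as JSON strings via the IDistributedCache string extensions.
        var json = await _cache.GetStringAsync(key);
        return json == null ? default : JsonSerializer.Deserialize<T>(json);
    }

    public Task SetAsync<T>(string key, T value, DistributedCacheEntryOptions options) =>
        _cache.SetStringAsync(key, JsonSerializer.Serialize(value), options);
}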
We use Microsoft.Extensions.Caching.StackExchangeRedis, version 3.1.5, and the registration in Startup is:
services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
    .AddStackExchangeRedisCache(s =>
    {
        s.Configuration = connectionString;
    })
    .AddTransient<IRedisCache, RedisCache>()
    .AddTransient<ISharedCacheStore, SharedCacheStore>();
We are out of ideas, to be honest. We don't see an issue with the Redis Cache instance in Azure, as it is nowhere near its limit when we get the timeouts. Server load hit about 80% on the lower pricing plan, and on the current, higher plan it didn't even reach 10%.
According to Insights, we had about 4000 cache hits per minute on the run we did, causing the roughly 10% server load.
UPDATE: It is worth mentioning that the batch and the API are currently running in an on-premises environment rather than in the cloud. A move to the cloud is planned in the upcoming months.
This also applies to the other APIs connecting to Redis Cache that are NOT having any issue.
Comparison:
Another Azure Redis cache is getting 45K hits a minute without any issue whatsoever (from on-premises).
This one is hitting the timeout mark without even reaching 10K hits per minute.
There are a couple of possible things here:
I don't know what that EVAL is doing; it could be that the Lua being executed is causing a blockage; the only way to know for sure would be to look at SLOWLOG, but I don't know whether this is exposed on Azure redis
It could be that your payloads are saturating the available bandwidth - I don't know what you are transferring
It could simply be a network/socket stall/break; they happen, especially with cloud - and the (relatively) high latency makes this especially painful
We want to enable a new optional pooled (rather than multiplexed) model; this would in theory (the proof-of-concept worked great) avoid large backlogs, which means even if a socket fails: only that one call is impacted, rather than causing a cascade of failure; the limiting factor on this one is our time (and also, this needs to be balanced with any licensing implications from the redis provider; is there an upper bound on concurrent connections, for example)
It could simply be a bug in the library code; if so, we're not seeing it here, but we don't use the same setup as you; we do what we can, but it is very hard to diagnose problems that we don't see, that only arise in someone else's at-cost setup that we can't readily replicate; plus ultimately: this isn't our day job :(
I don't think there's a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios, that take a lot of investigation. And simply: the Azure folks don't pay for our time.
So, we found the issue.
The issue sits within the registration of our classes, which is AddTransient, as shown in the original code above.
When altering this to AddScoped, performance is a lot faster. We are even wondering whether it can be a singleton.
The weird thing is that AddTransient should only increase 'connected clients', which it does, as a matter of fact, but it has a bigger impact on the number of requests that can be handled as well, since we never reached the max connections limit during processing.
.AddScoped<IRedisCache, RedisCache>()
.AddScoped<ISharedCacheStore, SharedCacheStore>();
With this code instead of AddTransient, we did 220,000 operations in a 4-5 minute period without any issues, whereas with the old code we didn't even reach 40,000 operations, because of timeout exceptions.
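For completeness, a sketch of the singleton variant mentioned above. AddStackExchangeRedisCache registers IDistributedCache as a singleton, so registering stateless wrappers on top of it as singletons should be safe too; that is an assumption about our own wrapper code, not a statement about the Microsoft package:

using Microsoft.Extensions.DependencyInjection;

public static class CacheRegistration
{
    public static IServiceCollection AddSharedCache(this IServiceCollection services, string connectionString)
    {
        return services
            .AddStackExchangeRedisCache(s => s.Configuration = connectionString)
            // Stateless wrappers over a singleton IDistributedCache, so singleton
            // (or scoped) lifetimes avoid re-creating them on every resolve.
            .AddSingleton<IRedisCache, RedisCache>()
            .AddSingleton<ISharedCacheStore, SharedCacheStore>();
    }
}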

Parallel execution of CREATE DATABASE statements result to an error but not on separate SQL Server instance

I am using the latest version of Entity Framework in my application (but I don't think EF is the issue here, just stating which ORM we are using) and have a multi-tenant architecture. I was doing some stress tests, built in C#, where X number of tasks run in parallel to do some work. At the beginning of the whole process, a new database is created for each task (each tenant in this case), and then the bulk of the operation is processed. But on some tasks, it throws one of two SQL exceptions on the exact part of my code where it tries to create a new database.
Exception #1:
Could not obtain exclusive lock on database 'model'. Retry the
operation later. CREATE DATABASE failed. Some file names listed could
not be created. Check related errors.
Exception #2:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding.
It's either of those two, and it throws on the same line of my code (where EF creates the database). Apparently, SQL Server creates databases one at a time and locks the 'model' database while doing so (see here), so some of the waiting tasks throw a timeout or the lock-on-'model' error.
Those tests were done on our development SQL Server 2014 instance (12.0.4213), and if I execute, say, 100 parallel tasks, errors are bound to be thrown on some tasks, sometimes even on nearly half the tasks I executed.
BUT here's the most disturbing part in all this: when testing on my other SQL Server instance (12.0.2000), which I have installed locally on my PC, no such error is thrown and all the tasks I executed complete (even 1000 tasks in parallel!).
Solutions I've tried so far but didn't work:
Changed the timeout of the Object context in EF to infinite
Tried adding a longer or infinite timeout on the connection string
Tried adding a Retry strategy on EF and made it longer and run more often
Currently, I'm trying to install a virtual machine with an environment similar to our dev server (which uses Windows Server 2014 R2) and test on specific versions of SQL Server to see if the versions have anything to do with it (yeah, I'm that desperate :))
Anyway, here is a simple C# console application you can download to try to replicate the issue. This test app will execute the number of tasks you input; each task simply creates a database and does cleanup right afterwards.
2 observations:
Since the underlying issue has something to do with concurrency, and access to a "resource" which at a key point only allows a single, but not a concurrent, accessor, it's unsurprising that you might be getting differing results on two different machines when executing highly concurrent scenarios under load. Further, SQL Server Engine differences might be involved. All of this is just par for the course for trying to figure out and debug concurrency issues, especially with an engine involved that has its own very strong notions of concurrency.
Rather than going against the grain by trying to force something to work, or to fully explain a situation that is empirically not working, why not change approach and design for cleaner handling of the problem?
One option: acknowledge the reality of SQL Server's need to hold an exclusive lock on the model db by regulating access via some kind of concurrency synchronization mechanism. A System.Threading.Monitor sounds about right for what is happening here, and it would allow you to control what happens when there is a timeout, with a timeout of your choosing. This will help prevent the kind of locked-up scenario that may be happening on the SQL Server end, which would be an explanation for the current "timeouts" symptom (although stress load might be the sole explanation).
Another option: See if you can design in such a way that you don't need to synchronize at all. Get to a point where you never request more than one database create simultaneously. Some kind of queue of the create requests--and the queue is guaranteed to be serviced by, say, only one thread--with requesting tasks doing async/await patterns on the result of the creates.
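As a sketch of that second option (illustrative names only, not the actual test app): funnel all database creations through a single gate so that only one CREATE DATABASE is ever in flight, with callers awaiting their turn.

using System;
using System.Threading;
using System.Threading.Tasks;

public static class TenantDatabaseCreator
{
    // One creation at a time, per process. If the stress test spans multiple
    // processes or machines, a distributed lock (or a single dedicated worker
    // servicing a queue) would be needed instead.
    private static readonly SemaphoreSlim CreateGate = new SemaphoreSlim(1, 1);

    public static async Task CreateAsync(string tenantName, Func<Task> createDatabase, TimeSpan timeout)
    {
        if (!await CreateGate.WaitAsync(timeout))
            throw new TimeoutException($"Timed out waiting to create the database for '{tenantName}'.");

        try
        {
            // The EF database-create call goes here; it now runs serially and
            // never competes for the exclusive lock on 'model'.
            await createDatabase();
        }
        finally
        {
            CreateGate.Release();
        }
    }
}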
Either way, you are going to have situations where this slows down to a crawl under stress testing, with super stressed loads causing failure. The key questions are:
Can your design handle some multiple of the likely worst case load and still show acceptable performance?
If failure does occur, is your response to the failure "controlled" in a way that you have designed for?
You probably have different LockTimeoutSeconds and QueryTimeoutSeconds values set for SSDT (DacFx deploy), which is deploying the databases, on the development and local instances.
For example, LockTimeoutSeconds is used to set lock_timeout. If you have a small value here, that is the reason for:
Could not obtain exclusive lock on database 'model'. Retry the operation later. CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
You can use the query below to identify what timeout is set by SSDT
select session_id, lock_timeout, * from sys.dm_exec_sessions where login_name = 'username'
To increase the default timeout, first find the identifier of the user who is deploying the database under
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList
Then find the following registry key
HKEY_USERS\your user identifier\Microsoft\VisualStudio\your version\SQLDB\Database
and change the values for LockTimeoutSeconds and QueryTimeoutSeconds.

Multi-server n-tier synchronized timing and performance metrics?

[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows: Load balanced front end web servers (IIS) running an MVC application (C#). A home-grown service bus, implemented with MSMQ running in domain-integration mode. Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus. Back end SQL Server 2012 database, mirrored and replicated.
All servers have high spec hardware, running Windows Server 2012, latest releases, latest windows update. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
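To illustrate the idea (a simplified sketch, not the actual message contract), each stop appends something like the following to the message:

using System;
using System.Collections.Generic;

// Simplified sketch of the per-hop timing data carried inside the message.
public class HopTiming
{
    public string ServerName { get; set; }
    public string Stage { get; set; }            // e.g. "MVC enqueue", "worker dequeue"
    public DateTime UtcTimestamp { get; set; }   // only comparable across servers if clocks are in sync
    public long LocalElapsedMs { get; set; }     // Stopwatch-based duration of the work done on this server
}

public class DiagnosticEnvelope
{
    public Guid CorrelationId { get; set; }
    public List<HopTiming> Hops { get; set; } = new List<HopTiming>();
}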
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurred because the software tested whether a queue existed before opening it, so it was essentially looking up the queue twice, which is fairly expensive. With that removed, the issue has gone away.
Try using the Performance Monitor that is part of Windows itself. Create a Data Collector Set on each of the servers and select the metrics you want to monitor; something like Request Execution Time would be a good one to monitor.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.

How to disable automatic retry for DevForce requests

We have noticed that when a DevForce request times out, it is automatically retried. This behavior has also been mentioned on the forums here. In that forum post, the suggested solution is to increase the timeout to try to avoid the problem altogether. For us, that is not really a possible solution. There are some operations that we know will timeout and increasing the timeout is not an acceptable solution.
Worse, if the call is a stored procedure query or an InvokeServerMethod call, it is very possible that the call is not idempotent, so retrying it is not safe and could very likely end up doing more harm than good. We've started running into cases like that in our app, and it is causing major pains. A simple example: we call a stored procedure that creates a copy of an item. If the copy takes too long, it keeps getting retried, but that just means we have 3 copy operations all going in parallel. The end result is that the end user gets an error (because the 3rd retry still times out), but there will (eventually) be three copies of the item (the stored procedure will eventually finish; the retry logic doesn't seem to cancel the previous requests, and I'm not even sure such cancelling is possible). And that is one of the more benign examples; in other cases, the retried operations can cause even worse problems.
I see from the 6.1.6 release notes that DevForce no longer performs automatic retry for saves. I'd really like to see that behavior extended to StoredProcedureQueries and InvokeServerMethods. For normal EntityQuery operations (and probably even Connect/Disconnect calls), I'm fine with the retry. If this isn't something that can be changed in the core of DevForce, is there a way to make it configurable or provide some custom way for us to inject code that controls this?
The auto retry behavior for communication failures is configurable in the 7.2.4 release now available. See the release notes for usage information.

How to prevent NHibernate long-running process from locking up web site?

I have an NHibernate MVC application that is using ReadCommitted Isolation.
On the site, there is a certain process that the user can initiate which, depending on the input, may take several minutes. Because the session is per request, it is open for that entire time.
But while that runs, no other user can access the site (they can try, but their request won't go through until the long-running operation is finished).
What's more, I also need a console app that performs this same long-running function while connecting to the same database, and it causes the same issue.
I'm not sure what part of my setup is wrong, any feedback would be appreciated.
NHibernate is set up with fluent configuration and StructureMap.
Isolation level is set as ReadCommitted.
The session factory lifecycle is HybridLifeCycle (which on the web should be Session per request, but on the win console app would be ThreadLocal)
It sounds like your requests are waiting on database locks. Your options are really:
Break the long running process into a series of smaller transactions.
Use ReadUncommitted isolation level most of the time (this is appropriate in a lot of use cases).
Judicious use of the Snapshot isolation level (assuming you're using MS SQL 2005 or later); a short sketch follows below.
(N.B. I'm assuming the long-running function does a lot of reads/writes and the requests being blocked are primarily doing reads.)
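As a sketch of the isolation-level options above (this assumes SQL Server with snapshot isolation enabled on the database, and is not the poster's actual code):

using System.Collections.Generic;
using System.Data;
using NHibernate;

public class ReportQueries
{
    private readonly ISessionFactory _sessionFactory;

    public ReportQueries(ISessionFactory sessionFactory) => _sessionFactory = sessionFactory;

    // Run the read-heavy request work under Snapshot (or ReadUncommitted) so it
    // reads row versions instead of waiting on locks held by the long-running
    // process. Snapshot requires ALLOW_SNAPSHOT_ISOLATION ON for the database.
    public IList<TEntity> LoadAll<TEntity>() where TEntity : class
    {
        using (var session = _sessionFactory.OpenSession())
        using (var tx = session.BeginTransaction(IsolationLevel.Snapshot))
        {
            var result = session.QueryOver<TEntity>().List();
            tx.Commit();
            return result;
        }
    }
}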
As has been suggested, breaking your process down into multiple smaller transactions will probably be the solution.
I would suggest looking at something like Rhino Service Bus or NServiceBus (my preference is Rhino Service Bus; I personally find it much simpler to work with). What that allows you to do is separate the functionality into small chunks while maintaining the transactional nature. Essentially, with a service bus you send a message to initiate a piece of work; the piece of work is enlisted in a distributed transaction along with the receipt of the message, so if something goes wrong, the message will not just disappear and leave your system in a potentially inconsistent state.
Depending on what you need to do, you could send an initial message to start the processing, and then after each step send a new message to initiate the next step. This can really help to break the transactions down into much smaller pieces of work (and simplify the code). The two service buses I mentioned (there is also MassTransit) also have things like retries and error handling built in, so that if something goes wrong, the message ends up in an error queue and you can investigate what went wrong, hopefully fix it, and reprocess the message, thus ensuring your system remains consistent.
Of course whether this is necessary depends on the requirements of your system :)
Another, more complex solution would be:
You build a background robot application which runs on one of the machines.
This background worker robot can receive "worker jobs" (the ones initiated by the user).
The robot then processes the jobs step by step in the background.
Pitfalls are:
- you have to program this robot to be very stable
- you need to monitor the robot somehow
Sure, this involves more work; on the flip side, you will have the option to integrate more job types, enabling your system to process different things in the background.
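A minimal sketch of such a robot (illustrative only; a real implementation needs persistence, error handling and the monitoring mentioned above):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical job description handed off by the web site or console app.
public class WorkerJob
{
    public Guid Id { get; set; }
    public string Payload { get; set; }
}

public class BackgroundRobot
{
    private readonly BlockingCollection<WorkerJob> _jobs = new BlockingCollection<WorkerJob>();

    // The web site enqueues work instead of doing it inline in the request.
    public void Enqueue(WorkerJob job) => _jobs.Add(job);

    // Long-running loop, typically hosted in a Windows service.
    public Task Start(CancellationToken token) => Task.Run(() =>
    {
        foreach (var job in _jobs.GetConsumingEnumerable(token))
        {
            try
            {
                // Process the job step by step, each step in its own small
                // transaction, so web requests are never blocked for minutes.
            }
            catch (Exception)
            {
                // Log and decide: retry, dead-letter, or alert.
            }
        }
    }, token);
}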
I think the design of your application / SQL statements has a problem. Unless you are Facebook, I don't think any process should take this much time. It is better to review your design and check where the bottlenecks are, instead of trying to keep this long-running process going.
Also, sometimes an ORM is not a good fit for every scenario. Did you try using a stored procedure?
