I'm the developer in charge of a C# web application running on IIS 10.0 and I use the FluentScheduler library to schedule my jobs.
The job runs a database query and then generates some files. Recently it has been failing because it runs for too long and Windows kills the thread (this only happens on specific days with a large influx of data to be processed).
After optimizing the database access, the thread gets killed much less often, but it still happens occasionally.
The problem is that after its thread is killed (and the exception is logged), the job stops running at its scheduled time.
How can I make sure the job keeps running even if this exception does happen?
My code below:
Schedule(new GenerateFiles()).NonReentrant().ToRunOnceAt(DateTime.Now.AddMinutes(10)).AndEvery(30).Seconds();
The 10-minute delay on the first run is there because we cache some information from the database to improve the application's performance, and this job uses data from that cache. I don't know how to make the job begin only after the cache is ready, so I added this delay.
Other exceptions caught and logged in my jobs do not cause this issue. It's only when the thread runs for too long and Windows kills it that the job stops being scheduled.
Edit: Adding the line on which the application fails (at least that's what the stack trace tells me).
The entire job is quite extensive and I can't really post it here.
foreach (var datumToGenerate in context.GenerateData.Include(f => f.Datum))
{
    var datum = datumToGenerate.Datum;
    if (!datum.Generated)
    {
        output.Add(datum);
        i++;
        if (i == 100) return output;
    }
}
As you can see, I lowered the number of entries processed at a time to 100, but even with 50 or fewer I still get the error eventually, because the GenerateData table is quite large even though its entries are deleted after being processed.
Edit2: The code, in fact, fails on any random part of the class. It runs fine for around 10 minutes and then it just crashes. Am I simply screwed??
My colleague and I found the answer to the issue.
It had nothing to do with the code.
An IIS application pool has a default idle timeout of 20 minutes ("Idle Time-out" in the pool's advanced settings). We disabled the timeout by setting its value to 0, and the exception never occurred again.
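For reference, the same setting can also be changed from code rather than through IIS Manager. The following is only a sketch: it assumes a reference to the Microsoft.Web.Administration assembly, a process running with administrator rights on the IIS machine, and a pool name that you supply yourself (the `AppPoolConfig` class and `DisableIdleTimeout` method are names made up for this example).

```csharp
using System;
using Microsoft.Web.Administration; // assumption: this assembly is referenced by the project

public static class AppPoolConfig
{
    // Hypothetical helper: sets the pool's idle timeout to 0, which disables it.
    public static void DisableIdleTimeout(string poolName)
    {
        using (var serverManager = new ServerManager())
        {
            ApplicationPool pool = serverManager.ApplicationPools[poolName];
            if (pool == null)
                throw new ArgumentException("No application pool named " + poolName);

            // TimeSpan.Zero corresponds to entering 0 for "Idle Time-out (minutes)".
            pool.ProcessModel.IdleTimeout = TimeSpan.Zero;

            // Nothing is written to the IIS configuration until CommitChanges is called.
            serverManager.CommitChanges();
        }
    }
}
```

Keep in mind that the pool can still be recycled for other reasons (for example the regular recycling interval), so a long-running job should not rely on this setting alone.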
I have developed a Hangfire application using MVC running in IIS, and it was working absolutely fine until I saw the size of my SQL Server log file, which had grown a whopping 40 GB overnight!
According to our DBA, there was a long-running transaction with the following SQL statement (I have 2 Hangfire queues in place):
(@queues1 nvarchar(4000),@queues2 nvarchar(4000),@timeout float)
delete top (1) from [HangFire].JobQueue with (readpast, updlock, rowlock)
output DELETED.Id, DELETED.JobId, DELETED.Queue
where (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))
and Queue in (@queues1,@queues2)
On exploring the Hangfire library, I found that this statement is used for dequeuing jobs, a very simple task that should not take any significant time.
I couldn't find anything that would have caused this error. Transactions are used correctly with using statements, and objects are disposed in the event of an exception.
As suggested in some posts, I have checked the recovery model of my database and verified that it is Simple.
I have manually killed the hung transaction to reclaim the log file space, but it comes back after a few hours. I am observing it continuously.
What could be the reason for such behavior, and how can it be prevented?
The issue seems to be intermittent, and deploying this to production could be extremely risky :(
Starting from Hangfire 1.5.0, the Hangfire.SqlServer implementation wraps the whole processing of a background job in a transaction. The previous implementation used an invisibility timeout to provide an at-least-once processing guarantee without requiring a transaction, in case of an unexpected process shutdown.
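To illustrate what an invisibility timeout means in general, here is a small in-memory sketch. It is not Hangfire's code: `InvisibilityQueue` and all of its members are hypothetical names, and the only thing it borrows from the question is the rule in the SQL above (a job can be fetched if `FetchedAt` is null or older than the timeout).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory queue, for illustration only.
public sealed class InvisibilityQueue
{
    private sealed class Entry
    {
        public string JobId;
        public DateTime? FetchedAt; // null means "never fetched"
    }

    private readonly List<Entry> _entries = new List<Entry>();
    private readonly TimeSpan _timeout;
    private readonly object _lock = new object();

    public InvisibilityQueue(TimeSpan invisibilityTimeout)
    {
        _timeout = invisibilityTimeout;
    }

    public void Enqueue(string jobId)
    {
        lock (_lock) { _entries.Add(new Entry { JobId = jobId }); }
    }

    // Returns a job that was never fetched, or whose fetch is older than
    // the timeout (its worker is presumed dead). Returns null if none.
    public string Dequeue(DateTime utcNow)
    {
        lock (_lock)
        {
            foreach (var e in _entries)
            {
                if (e.FetchedAt == null || e.FetchedAt < utcNow - _timeout)
                {
                    e.FetchedAt = utcNow; // hide the job from other workers
                    return e.JobId;
                }
            }
            return null;
        }
    }

    // Called by a worker after the job has completed successfully.
    public void Acknowledge(string jobId)
    {
        lock (_lock) { _entries.RemoveAll(e => e.JobId == jobId); }
    }
}
```

In this model no transaction stays open while a job runs. The price is that a job whose worker died only becomes available again after the timeout has elapsed, so until then it still looks as if it were being processed.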
I implemented a new model for queue processing because there was a lot of confusion among new users, especially ones who had just installed Hangfire and played with it under a debugging session. There were a lot of questions like "Why my job is still under processing state?". I had considered that there might be problems with transaction log growth, but I didn't know this could happen even with the Simple Recovery Model (please see this answer to learn why).
It looks like there should be a switch to choose which queue model to use: based on transactions (the default) or based on an invisibility timeout. But this feature will be available in 1.6 only, and I don't have an ETA yet.
Currently, you can use Hangfire.SqlServer.MSMQ or any other non-RDBMS queue implementation (please see the Extensions page). A separate database for Hangfire may also help, especially if your application changes a lot of data.
We have a homegrown application built on a WCF service with a NetTcpBinding. It works fine most of the time, but from time to time it starts creating .NET threads as fast as it can. The CLR dies quickly, but the process keeps creating threads until it runs out of memory or CPU, over 10,000 at times. Of course, the application is dead long before it gets there. We recycle the Windows service it runs as and it goes back to normal until the next time.
We have Dynatrace and it catches the increase in thread creation, but the CLR crashes very early in the process and we cannot get a thread dump of the problem. I will try to catch it earlier, but so far, no luck.
What other tools could we leave running all the time without impacting the application, that would help us determine why this is happening?
It may be many days between these events, it might occur during the night when no one is monitoring it, so automated info collection would be useful. But we will look into anything.
Also, any idea what might be causing this in a general way?
Thanks for any help you could provide.
I have a website doing some things that I've never seen before. My server is Win 2003 w/ IIS6, and I'm using C# and .NET 4.0.
The site is a real-estate website that stores the data directly in my db. The site will run great for a little while and then just die. What I mean is you'll try to view a property's details and it will take the site 2-3 minutes to load, if it loads at all. If I simply resave the web.config file and reupload it to restart the app, it runs just fine for a little while and then dies again. This continues over and over. I've gone to the local copy while the live site has "died", and the local copy will run just fine and then die after a while as well. The time it takes varies from 5 to 30 minutes; I believe it has something to do with the number of requests.
Does anyone have any clue as to what might be happening? The only data query on the page pulls the main data, using the LINQ query below:
public Listing GetListingByMLNumber(string MLNumber)
{
    try
    {
        DatabaseDataContext db = new DatabaseDataContext();
        var item = (from a in db.Listings
                    where a.ML_.ToLower() == MLNumber.ToLower()
                    select a).FirstOrDefault();
        return item;
    }
    catch (Exception ex)
    {
        Message = ex.Message;
        return null;
    }
}
Not closing the database context stands out as the obvious error in the code you provided. Wrap it in a using statement to be sure it gets disposed correctly.
As long as the context lives, you will hold on to a SQL connection, which is a limited resource. You will also waste memory by change-tracking the entities you returned. Given your code, the context should be garbage collected at some point, but it might still be the problem (and whether or not this is the problem, you should dispose your database contexts).
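A minimal sketch of the using pattern, so the effect can be seen without a database. `TrackingContext` is a hypothetical stand-in for the question's `DatabaseDataContext`: it only counts how many contexts are still open, and `FindListing` returns a string instead of a real `Listing`.

```csharp
using System;

// Hypothetical stand-in for DatabaseDataContext; it only tracks disposal.
public sealed class TrackingContext : IDisposable
{
    public static int OpenCount; // contexts created but not yet disposed
    private bool _disposed;

    public TrackingContext() { OpenCount++; }

    public string FindListing(string mlNumber)
    {
        if (mlNumber == null) throw new ArgumentNullException(nameof(mlNumber));
        return "listing:" + mlNumber.ToLowerInvariant();
    }

    public void Dispose()
    {
        if (_disposed) return; // Dispose must be safe to call more than once
        _disposed = true;
        OpenCount--;
    }
}

public static class ListingLookup
{
    public static string GetListingByMLNumber(string mlNumber)
    {
        // Dispose() runs when control leaves the using block,
        // whether by return or because an exception was thrown.
        using (var db = new TrackingContext())
        {
            return db.FindListing(mlNumber);
        }
    }
}
```

With a real LINQ to SQL context, be aware that deferred-loaded properties of the returned entity can usually no longer be loaded once the context has been disposed, so load everything the page needs inside the using block.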
Try load testing locally to see if you can reproduce the problem. If you can, then use the debugger to figure out the problem. If not, you probably need to add logging to narrow down the problem.
You could also look at the IIS process to see if it uses absurd amounts of memory, handles, etc. Also check the IIS settings for performance and application pool recycling, as suggested in another answer here.
I would take a look at the application pool settings to see how the worker processes are being recycled, and I would also look under the Performance tab in IIS to see if there's a bandwidth threshold specified.
If you ever hit this type of problem again, you should add DebugDiag/ADPlus and WinDBG to your diagnostic toolbelt.
When your application hangs again or takes an exceedingly long time to respond to requests, grab a dump of the worker process using DebugDiag or ADPlus. Load it into WinDBG, load SOS (Son of Strike), which is a WinDBG extension for managed code debugging, and start digging around.
Tess Ferrandez has a great set of tutorials and labs on how to use these tools effectively:
.NET Debugging Demos - Information and setup instructions
These tools have gotten me out of a pickle several times, and it's well worth spending the time familiarising yourself with them.
I've recently deployed an MVC application to an IIS6 web server. One strange behaviour I've been seeing is that load times randomly blow up to 30+ seconds and then return to normal. Our tests have shown this occurring on multiple connections at the same time. Once the wait has passed, the site becomes responsive again. It's completely random when this occurs, but it happens about once every 15 minutes or so.
My first thought was the application was being restarted by the web server for some reason, but I determined this wasn't the case because the process recycling is set very infrequently, and I placed some logging in the application startup.
It's also nothing to do with the database connection. This slowdown happens simply by moving between static pages too. I've watched the database with a SQL profiler, and nothing is hitting it when these slowdowns occur.
Finally, I've placed entry and exit logging on my controller actions, and the slowdown always happens outside of the controller. The time between entry and exit for a controller action is always appropriately short.
Does anyone have any ideas of what could be causing this? I've tried running it locally on IIS7 and I haven't had the issue. I can only think it's something to do with our hosting provider.
Is this running on a dedicated server? If not, it might be your hosting provider.
From what you have said, it sounds to me like the server is maxing out its CPU every 15 minutes for some reason. It could be something in the code hitting an infinite loop. Have you looked in the event log for any crashes or errors from the application?
Run the web app under a profiler (e.g. JetBrains) and dump out the results after one of these 30-second lockups occurs. The profiler output should make locating the bottleneck fairly easy, as it will pinpoint the exact API call that is consuming the time or blocking other threads.
At a guess, it could be memory pressure causing items to be dumped from the cache, or garbage collection, although 30 seconds sounds a little excessive for this.
Calling a WCF-published orchestration from a C# program usually gives a sub-second response time. However, on some occasions it can take 20-50 seconds between the call in the C# program and the first trace message from the orchestration. The C# code that calls the WCF service runs under HIS/HIP (Host Integration Services/CICS Host-Initiated Processing).
Almost every time I restart the HIS/HIP service, we get a very slow response time, and thus a timeout in CICS. I'm also afraid it might happen during the day if things "go cold"; in other words, maybe things are being cached. Even first-time JIT compiles shouldn't take 20-50 seconds, should they? The other thing that seems strange is that the slow part seems to be the loading of the orchestration, which runs under the BizTalk service, not under the HIP service that I cycled.
The fear is that when we go live, the first user in the morning (or after a "cold spell") will get the timeout. The second time they try it after the timeout, it is always fast.
I've done a few tests by restarting each of the following:
1) BizTalk services
2) IIS
3) HIS/HIP Transaction Integrator (HIP Service)
Restarting any one of them tends to cause about a 20 second delay.
Restarting all 3 is like the kiss of death: about a 60 second delay before the first trace appears from the orchestration.
The HIP program always gives its first trace quickly, even when the HIP service is restarted. I'm not sure why restarting HIP slows down the start of the orchestration.
Thanks,
Neal Walters
I have seen this kind of behavior with the MQSeries adapter as well. After a period of inactivity, the COM+ components which enable communication with MQSeries shut down.
What we did was use a 10-minute timer that forced some sort of keep-alive message. I don't know if you have a non-destructive call which can be sent, or if you can build one into the system just for this purpose.
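A rough sketch of such a keep-alive timer in C#. The `KeepAlive` class is a made-up name, and the ping itself is passed in as a delegate because what counts as a non-destructive call depends entirely on your system.

```csharp
using System;
using System.Threading;

// Hypothetical helper: invokes a caller-supplied ping at a fixed interval.
public sealed class KeepAlive : IDisposable
{
    private readonly Timer _timer;
    private readonly Action _ping; // the non-destructive call, supplied by the caller
    private int _pingCount;

    // Number of pings that completed without throwing.
    public int PingCount => Volatile.Read(ref _pingCount);

    public KeepAlive(Action ping, TimeSpan interval)
    {
        _ping = ping ?? throw new ArgumentNullException(nameof(ping));
        // First tick after one interval, then repeating at the same interval.
        _timer = new Timer(OnTick, null, interval, interval);
    }

    private void OnTick(object state)
    {
        try
        {
            _ping();
            Interlocked.Increment(ref _pingCount);
        }
        catch (Exception)
        {
            // A failed ping is ignored here so that an unhandled exception
            // on the timer thread cannot bring down the process.
            // Real code would log it.
        }
    }

    public void Dispose() => _timer.Dispose();
}
```

It would be created once at service start-up, for example with `TimeSpan.FromMinutes(10)`, and disposed at shutdown. The interval just needs to be shorter than whatever idle period triggers the unload.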
I had the same problem with a BizTalk flow that needs to respond within 2 seconds: when it had been unused for some time, reloading the DLL into the cache caused a timeout.
We found a solution in Microsoft's Orchestration Engine Configuration documentation, which explains how to avoid unloading of the DLLs:
By using the options SecondsIdleBeforeShutdown and SecondsEmptyBeforeShutdown from AppDomainSpecs and assigning them to the desired DLLs in the ExactAssignmentRules or PatternAssignmentRules sections, you can keep your DLLs permanently loaded, and maybe avoid the timeout in the caller application.
Take into account that if you restart the BizTalk host, the dll will be loaded again.