We have a new ASP.NET website running on a pair of load balanced Azure VMs. The website is fairly simple and uses Kentico CMS. Twice in the 24 hours since going live the application pool on both web servers has suddenly stopped (within 5-10 minutes of each other) causing 503: Service unavailable errors.
Looking at Windows system logs I see the error which caused the problem:
Application pool '[[NAME]]' is being automatically disabled due to a
series of failures in the process(es) serving that application pool.
Leading up to this are a series of warnings:
A process serving application pool '[[NAME]]' suffered a fatal
communication error with the Windows Process Activation Service. The
process id was '[[PROCESS ID]]'. The data field contains the error
number.
Evidently this is IIS's rapid-fail protection kicking in. What's not clear is how to find the cause of this "fatal communication error".
After some web searching I've installed the Debug Diagnostics Tool which has helped me identify that in every case the relevant process was the IIS worker process (w3wp.exe). This tool is new to me and unfortunately the only time the problem occurred since I installed it, no dumps were generated. However, its logs contain a lot of messages like this:
First chance exception - 0xe0434352 caused by thread with System ID:
[[ID]]
The frustrating thing is that I don't know what steps to take to replicate the error conditions. It never occurred in UAT in a very similar environment, even under load test. Here are some facts about my setup:
ASP.NET version = 4.5.2
Application pool running with identity set to a domain account with modify permission on the website directory
Application set with max one worker process
Any advice much appreciated.
* UPDATE 1 *
I now have a DebugDiag dump generated by the "fatal communication error" warning event. The dump summary reads:
Dump Summary
------------
Process Name: w3wp.exe : C:\Windows\SysWOW64\inetsrv\w3wp.exe
Process Architecture: x86
Exception Code: 0xC00000FD
Exception Information: The thread used up its stack.
Heap Information: Present
In the end I tracked this down to a bug in my code. Under very edge-case circumstances the CMS was returning an empty Guid instead of an actual ID which was causing a stack overflow in a recursive method.
The 0xC00000FD exception code I posted above is actually a stack overflow exception, so once I knew that and downloaded the Debug Diagnostics dump file I was able to replicate the crash scenario locally. That tool, by the way, is incredibly powerful and was able to demonstrate the exact conditions of the crash.
All I can say to people who arrive here with a similar issue is: firstly, don't assume the issue is not with your code! And secondly, use Debug Diagnostics.
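To illustrate the kind of guard that prevents this class of crash, here is a minimal sketch. It is not the original code: `AncestorWalker`, `GetAncestors`, the `getParentId` delegate and the depth limit of 100 are all hypothetical names and values chosen for the example. The idea is that a recursive walk over IDs stops on `Guid.Empty` and refuses to recurse past a fixed depth, so bad data produces an ordinary exception. That matters because a `StackOverflowException` cannot be caught in .NET and terminates the worker process.

```csharp
using System;
using System.Collections.Generic;

public static class AncestorWalker
{
    // Arbitrary limit for this sketch; pick one that fits your content tree.
    private const int MaxDepth = 100;

    // getParentId is a hypothetical lookup that returns the parent's ID,
    // or Guid.Empty when there is no parent.
    public static List<Guid> GetAncestors(Guid id, Func<Guid, Guid> getParentId)
    {
        var result = new List<Guid>();
        Walk(id, getParentId, result, 0);
        return result;
    }

    private static void Walk(Guid id, Func<Guid, Guid> getParentId,
                             List<Guid> result, int depth)
    {
        // Treat an empty Guid as "no node" instead of recursing on it.
        if (id == Guid.Empty)
        {
            return;
        }

        // Fail with a catchable exception long before the stack runs out.
        if (depth >= MaxDepth)
        {
            throw new InvalidOperationException(
                "Recursion depth limit reached; possible cycle at " + id);
        }

        result.Add(id);
        Walk(getParentId(id), getParentId, result, depth + 1);
    }
}
```

With a guard like this, a lookup that keeps returning the same ID ends in an `InvalidOperationException` that normal error handling can log, rather than taking down w3wp.exe.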
First of all, what are the regular recycle interval and the overlapped recycle setting for your app pool in IIS? If these incidents occur when recycling is scheduled and overlapping is disabled, this behavior is to be expected. Even with overlapping enabled, I'd guess that it is connected to automatic recycling of the app pool, since both instances are affected at roughly the same time and it happens twice a day. Recycling can also cause the warning you mentioned to be logged. (Here you might find how to disable logging of this warning in case it is caused by automatic recycling.)
If that leads nowhere, you can find more details about the warning event here:
IIS Application Pool Availability
And about the Debug Diagnostics tool here:
How to use the Debug Diagnostics tool to troubleshoot an IIS process that stops unexpectedly
I have a problem I've been fighting for a week now. I have a WCF service running in IIS 8.5 on Windows Server 2012 R2 and a Windows service client that makes one or two requests every 30 seconds. At some point (usually within two hours of the service running) one of the requests causes the service's app pool process (separate from the other app pools) to start consuming CPU. The IIS worker process section shows that this request never ends and is hanging in the ServiceModel-4 module in the AuthenticateRequest state (i.e. most likely it is in an infinite loop somewhere). Later another such request joins the first one, until there are four of them, staying forever and causing 100% CPU usage (there are 4 logical processors on the machine). What I did to investigate and fix this problem:
Used WCF tracing and custom logging to determine where the problem is. WCF tracing actually shows that all the requests made to the server completed successfully in milliseconds (!), while WCF tracing on the client side of course shows a timeout for the same requests. Custom logging also shows that the service code reaches the return of the requested operation. The method returns two simple DTO objects, so there is no likely serialization issue, and there are no endpoint behaviors or any other custom code executing before the reply is sent from the service (except the method code, which, as I mentioned, returns successfully).
Used IIS Failed Request Tracing, which shows the request reaching ServiceModel-4 and not continuing, with the following information:
ModuleName : ServiceModel-4.0
Notification: AUTHENTICATE_REQUEST
HttpStatus: 500
HttpReason: Internal Server Error
HttpSubStatus: 0
ErrorCode: The operation completed successfully (0x0)
Used DebugDiag to trace requests running longer than 10 minutes and saw the threads that had been running for a long time. The stack trace is as follows:
or as follows:
I've seen that these are called from the IIS process. Since these are .NET functions, I first suspected a corrupted .NET installation; moreover, both .NET 4.5 and .NET 4 were installed on the server (I don't know exactly how that could happen). So:
I uninstalled .NET 4, turned off the .NET 4.5 features under Windows features, restarted, then turned them back on and restarted again, without success.
After that I reinstalled IIS the same way (from Windows features). Again no success.
I don't have any more ideas.
It seems I have found the answer (though I haven't used dotTrace or other tools). A generic Dictionary was being accessed from multiple threads. This seems to be a known problem:
https://blogs.msdn.microsoft.com/tess/2009/12/21/high-cpu-in-net-app-using-a-static-generic-dictionary/
https://blogs.msdn.microsoft.com/asiatech/2009/05/11/asp-net-application-100-cpu-caused-by-system-collections-generic-dictionary/
Actually, I noticed this problem at the beginning of my research but ruled it out because I couldn't reproduce it (probably because I hadn't tested the dictionary in an IIS app; I did get various exceptions, but not 100% CPU), and mainly because all the logs showed that the code accessing the dictionary had completed. The stack trace above also has nothing to do with the dictionary.
However, I think the problem happened during serialization of this dictionary (which is a data contract), which would explain the logged information.
I still cannot explain exactly how this happens. If anyone can explain it, I think it would be useful knowledge for everyone.
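For anyone hitting the same thing: `Dictionary<TKey, TValue>` is only safe for concurrent reads while nothing modifies it, and the linked posts describe unsynchronized writes leaving a thread spinning inside the collection. A minimal sketch of one common fix follows, assuming the shared data can live in a `ConcurrentDictionary` (available since .NET 4). The `LookupCache` class, its key and value types, and the value factory are hypothetical and only there to make the example complete.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class LookupCache
{
    // Shared across request threads; ConcurrentDictionary synchronizes
    // reads and writes internally.
    private static readonly ConcurrentDictionary<string, int> Items =
        new ConcurrentDictionary<string, int>();

    // Adds the value if the key is missing, otherwise returns the existing one.
    // The factory (key length) is a placeholder for the real computation.
    public static int GetOrAdd(string key)
    {
        return Items.GetOrAdd(key, k => k.Length);
    }

    // Returns a point-in-time copy that is safe to enumerate or serialize
    // while other threads keep writing to the cache.
    public static KeyValuePair<string, int>[] Snapshot()
    {
        return Items.ToArray();
    }
}
```

If the dictionary has to stay a plain `Dictionary`, the alternative is to take the same lock around every read, write and enumeration, including the point where it is handed to the serializer.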
After upgrading from Sitecore 8.1 to 8.2 in a split environment (CD and CMS), I'm seeing a few performance issues. The CMS works fine, although the thread count is around 200 locally, whereas the CD server just freezes, consuming all the memory right after the site starts. No error is shown in the log either.
Any idea what might be wrong ?
Please check whether any error is being logged continuously in the log files on the CD server.
Check your redirects, if you have any, and see whether they are causing problems. We found this to be an issue in our instance in the past.
You can also watch the Sitecore performance counters for further diagnosis.
For high CPU usage, I think you need to stop some of the agents, which are among the more memory-consuming processes. Try stopping this cleanup agent in the sitecore.config file by setting its interval to zero:
<agent type="Sitecore.Tasks.CleanupAgent" method="Run" interval="06:00:00">
A possible reason may be the size of the Sitecore event queue.
Check the record count from all the Event queues especially the web database:
SELECT count(*) FROM [EventQueue]
If the count is high, for example 100K, you need to clean it up for better performance. The event queue works best when it holds at most a few thousand records.
See:
Publish Queue, History and Event Queue too big
sitecore-event-queue-how-to-clean-it-and-why
We had a similar issue where CPU usage was spiking but we could not find any error in the Sitecore log files. The web traffic was also normal, as observed from the IIS logs. This is how we resolved the issue.
In the IIS user interface, for the application pool where the site is hosted, check the currently running IIS worker processes. There you can see which part of the ASP.NET pipeline each request is in and which HTTP module it is currently executing.
Now check whether any requests are getting stuck at any stage. If multiple requests for the same URL are getting stuck, it means that some module used by that URL is hanging or going into an infinite loop. You can then investigate the modules used by this URL and find the actual issue.
The ASP application running on the SQL Server machine is causing the IIS server to stop very frequently. The cause shown in the error log is:
"A significant part of sql server process memory has been paged out. This may result in a performance degradation."
Is there any tool which can identify the fault in the web application?
No. You might be able to play with some settings to get your apps to not crash but in the end, if you have reached your bandwidth cap, you are stuck.
There might not actually be any fault in the web application. Both IIS and SQL Server eat a lot of memory. Source, SQL Server eats ram for lunch
There might not be anything wrong, you might just be running too much on one machine. You will have to provide an actual error or problem. Because right now, our only answer can be to leverage the admin tools, and get more memory.
I have found the cause of my problem. For each URL redirection I used the syntax Response.Redirect("/NewPage.aspx"); which ends the response by calling Response.End and so aborts the current thread with a ThreadAbortException on every redirect. The fix was: Response.Redirect("/NewPage.aspx", false); This sends the redirect without aborting the thread (code after the call keeps running, so return from the handler right after it). That saved a lot of memory used by each process!
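A minimal sketch of the two-argument form in a Web Forms code-behind is shown below. The `OldPage` class name is hypothetical. As documented for ASP.NET, the single-argument `Response.Redirect(url)` behaves like passing `true` for `endResponse`, which calls `Response.End` and raises a `ThreadAbortException`. Passing `false` skips that, and `CompleteRequest` is the usual companion call that tells the pipeline to skip straight to the end of the request.

```csharp
using System;
using System.Web.UI;

public class OldPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Send the 302 without aborting the current thread.
        Response.Redirect("/NewPage.aspx", false);

        // Skip the remaining pipeline events for this request.
        Context.ApplicationInstance.CompleteRequest();

        // Code after this point still runs, so return immediately.
        return;
    }
}
```

Note that later page lifecycle events on the same page still fire after `Page_Load` returns, so any expensive work in them may need its own check.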
In our ASP.Net application we usually try to handle all our exceptions by catching them in relevant places to give the end user useful error messages, but some exceptions are impossible for us to catch due the place they are thrown.
This is an issue to our server setup since we want to keep the IIS Rapid Fail Protection working as intended, and all errors to be written to our custom error log. So to avoid unexpected resets of the server and flooding our error log, I have added some code in Global.asax.cs to suppress certain kinds of errors. At the moment we are looking at two kinds of HttpExceptions thrown by the IIS itself, to prevent too long URLs (based on the maxUrlLength setting), and to prevent faulty WebResource or ScriptResource requests. These are impossible for us to prevent due to some webcrawlers generating them.
What I'm interested in knowing, and have found difficult to find information on anywhere, is:
Can the referenced HttpExceptions even potentially cause the Rapid Fail Protection to restart the server? I'm told that any uncaught exception can cause it, but it seems illogical to me that this kind of exception should be able to cause it.
If I call Server.ClearError() in the Application_Error() event, is that enough to suppress errors that could cause a rapid fail protection restart? Or is it already too late at this point, since we're already in the process of responding to an unhandled exception?
The Rapid-Fail Protection (here, RFP) feature is meant to protect the system from application pools and worker processes that are not starting properly or are failing often. These issues could be caused by your application(s) or an IIS worker process. The official (albeit old) list of causes can be found here.
Not directly. If the logic that is attempting to handle the error fails, the worker process could crash. This would trigger RFP. Usually, this will not happen because IIS will try to handle an exception in Application_Error.
If your application has gracefully handled the exception in Application_Error, then it stops there. Your exception was "unhandled" at the application level, but IIS was able to handle it (usually serving the "yellow screen of death"). Therefore, the worker process is still healthy and RFP will not be triggered.
I have seen an IIS worker process crash under the following conditions:
A recursive call never terminates and overflows the stack.
Insufficient system resources to process a request (out of memory or memory limit reached).
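The `Application_Error` handling discussed above can be sketched roughly as follows. This is an illustration and not the asker's code: the `Global` class layout is the usual Global.asax.cs shape, and treating HTTP codes 400 and 404 as the "crawler noise" cases is an assumption you would need to verify against the exceptions actually seen in your logs.

```csharp
using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        Exception error = Server.GetLastError();
        var httpError = error as HttpException;

        // Assumption: over-long URLs and invalid WebResource/ScriptResource
        // requests surface as HttpException with a 400 or 404 status code.
        if (httpError != null)
        {
            int code = httpError.GetHttpCode();
            if (code == 400 || code == 404)
            {
                // Mark the error as handled and answer with the same status,
                // without writing to the custom error log.
                Server.ClearError();
                Response.StatusCode = code;
                return;
            }
        }

        // Everything else falls through to the normal logging path.
    }
}
```

Either way the exception has already been caught by the ASP.NET pipeline by the time this handler runs, so clearing it only affects the response and logging, not whether the worker process survives.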
I am using .NET 2.0 and my site seems to reach a deadlocked state from time to time. It stops working until I recycle the application pool or change something in the web.config file. I think a deadlock is causing this issue.
I am wondering if there is any tool to debug/check the site to find the code that could be causing the deadlock.
Right now I have had to set the recycling interval to 10 minutes, which is really bad, but it is the only way to work around the problem. There is a lot of code on the site and I need to find the problem. If I use a DoS attack tool, can I find the page or code block that is causing this issue? If so, what is the best tool to test with?
Cheers!
EDIT
I checked the Event Logs and found the following warning. I don't know whether it is the issue; I will keep digging.
Exception information:
Exception type: HttpException
Exception message: Request timed out.
Check the event log
Turn on Health Monitoring
If you use 'Failed Request Tracing', it will produce a nice output that tells you what is causing the error, down to the module level. This gives you the first step toward finding where it's breaking down.
Have a read of this article on iis.net → Troubleshooting Failed Requests Using Tracing in IIS 7
I would attach Visual Studio to IIS and break into the debugger when a deadlock occurs. You can then inspect the call stacks of the running threads.
Code Project has a nice article on how to do IIS remote debugging.
Of course, you can just as well set up a test machine with a local IIS and a local Visual Studio .NET and do this without the need for remote debugging.