I have a fairly busy site which does around 10m views a month.
One of my app pools seemed to jam up for a few hours and I'm looking for ideas on how to troubleshoot it. I suspect that it somehow ran out of threads, but I'm not sure how to determine this retroactively. Here's what I know:
The site never went 'down', but around 90% of requests started timing out.
I can see a high number of "HttpException - Request timed out." entries in the log during the outage.
I can't find any SQL errors or code errors that would have caused the timeouts.
The timeouts seem to have been site wide on all pages.
There was one page with a bug on it which would have caused errors on that specific page.
The site had to be restarted.
The site is ASP.NET C# 3.5 WebForms.
Possibilities:
Thread depletion: My thought is that the page causing the error may have somehow tied up all of the available threads.
Global code error: Another possibility is that one of my static classes has an undiscovered bug in it somewhere. This is unlikely, as this has never happened before and I can't find any log errors for these classes, but it is a possibility.
UPDATE
I've managed to trace the issue now while it's occurring. The pages are being loaded normally but for some reason WebResource.axd and ScriptResource.axd are both taking a minute to load. In the performance counters I can see ASP.NET Requests Queued spikes at this point.
The first thing I'd try is Sam Saffron's CPU analyzer tool, which should give an indication of whether there is something common that is happening too much / for too long. In part because it doesn't involve any changes; you just run it on the server.
After that, there are various other debugging tools available; we've found that some very ghetto approaches can be insanely effective at seeing where time is spent (of course, it'll only work on the 10% of successful results).
You can of course just open the server profiling tools and drag in various .NET / IIS counters, which may help you spot some things.
Between these options, you should be covered for:
code dropping into a black hole and never coming out (typically threading related)
code running, but too slowly (typically data access related)
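The "ghetto" timing approach mentioned above can be as simple as a logging HTTP module. This is an illustrative sketch, not a specific recommendation: the class name, the one-second threshold, and the use of Trace.WriteLine are all placeholders you'd swap for your own logging.

```csharp
// Illustrative sketch: an IHttpModule that stamps each request on entry and
// logs slow ones on exit. Register it in web.config under <httpModules>
// (classic pipeline) or <system.webServer>/<modules> (integrated pipeline).
using System;
using System.Diagnostics;
using System.Web;

public class RequestTimingModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (s, e) =>
            app.Context.Items["requestTimer"] = Stopwatch.StartNew();

        app.EndRequest += (s, e) =>
        {
            var timer = app.Context.Items["requestTimer"] as Stopwatch;
            if (timer != null && timer.ElapsedMilliseconds > 1000)
            {
                // Log anything slower than a second; swap in your own logger.
                Trace.WriteLine(string.Format("SLOW {0} ms: {1}",
                    timer.ElapsedMilliseconds, app.Context.Request.RawUrl));
            }
        };
    }

    public void Dispose() { }
}
```

Because it hooks BeginRequest/EndRequest, this also times handlers like WebResource.axd and ScriptResource.axd, which is exactly where the update above saw the minute-long loads.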
Apologies, this is not a short question:
Background
I have a B1 Azure Website, and for the life of me, cannot get exceptions with callstacks.
The WebAPI is hosted side-by-side with the website in the same solution, which I hear is unusual. Almost all configuration has been done through the solution, I believe. Most everything in the portal is probably default settings from a brand new site.
I will be the first to admit, I am a novice at Azure. I have previously hosted some exceedingly simple ASP websites (mostly pre-.NET) in the past. I have found the Azure Portal to be overwhelming, to say the least. Hence why I am here!
The main place I look for exceptions is in Application Insights, under the Failures > Exceptions tab. However, while it usually (not always...) shows that there were 500s, the vast majority of the time it will show no callstack.
Situations
The few times it does catch a callstack, it's your normal bots poking at random directories... not the crippling exception I need to debug immediately. I recall hearing that Azure will use "AI to determine which callstacks to keep" or something market-y like that, but I can't find any settings regarding it. Even if that market-speak is true, why is it recording callstacks for daily bot attempts, but not for the rare application-crippling exception?
A month or so ago, I attempted to debug the live website via Visual Studio, but I got an error saying that Internet Explorer could not be found. Given that it's the year 2018 and Microsoft has moved on to Edge, I don't know why it wants Internet Explorer at all. I did find a response to this saying to hack the registry and reinstall Internet Explorer, but that seemed like overkill at the time.
Viewing Azure errors through Visual Studio's embedded Azure portal thing seems to show very similar data as the Azure portal does. No callstacks to be found.
Many years ago, a classic alert was set up for Http Server Errors, which still triggers to this day. It does not trigger on HttpExceptions from bots poking at the site, but it does for important 500s, and that's good. What is interesting is that it is the most reliable way to hear about errors, besides user reports. Too bad they don't have callstacks...
Last night, we encountered an exception, presumably in the view, of a page. We got e-mails from the classic alert, as expected, but the Failures section does not show any failures at all. In the past, we'd see 500s, but no callstack. It would seem that last night's errors were not detected by anything but the classic alert and the user. I don't know if it is because last night's error was unique, or if we now mysteriously get even less information out of Azure.
Attempted Solutions
Over the years, I have followed a myriad of guides, ranging from flipping switches in the portal itself, to FTPing and looking at the raw logs (which apparently are not really about your application, as much as Microsoft hosting it). If I got a penny for every time I read a guide that said, "Simply click on the Exceptions tab to see your callstacks" I'd be rich :-P.
A month ago, I got so desperate I implemented Application_Error in the HttpApplication class for the application, and implemented ExceptionLogger for WebAPI, to manually log all exceptions to text files. Unfortunately, while this helped me fix one error, subsequent exceptions have not appeared there either. Just like Application Insights, mostly bots poking at non-existent directories show up in these logs.
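The ExceptionLogger approach described above can be sketched roughly as follows. This is an assumption about the shape of that implementation, not the author's actual code; the log path is a placeholder, and the logger would be registered in WebApiConfig.Register via config.Services.Add(typeof(IExceptionLogger), new FileExceptionLogger()).

```csharp
// Hypothetical sketch of a file-based Web API exception logger.
// ExceptionLogger lives in System.Web.Http.ExceptionHandling (Web API 2).
using System.IO;
using System.Web.Http.ExceptionHandling;

public class FileExceptionLogger : ExceptionLogger
{
    private static readonly object _sync = new object();

    public override void Log(ExceptionLoggerContext context)
    {
        lock (_sync)
        {
            // context.Exception carries the full stack trace, even when
            // Application Insights never surfaces it.
            File.AppendAllText(@"D:\home\LogFiles\exceptions.txt",
                context.Exception.ToString() + "\n\n");
        }
    }
}
```

Note that an ExceptionLogger only sees exceptions that escape Web API controllers; MVC view errors (like the one described below) take the Application_Error path instead, which may explain why some exceptions never appear in either log.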
A week ago, I got desperate enough that I wrote a janky "unit test" (ha!), that'd pull a copy of production data down and test it locally, which is absolutely bonkers.
I have spoken to other architect-level ASP.NET engineers that use the Azure portal with varying frequency, and they could not come up with any suggestions. We looked at the web.configs; there is one in the root and one in the Views folder. We played with the customErrors setting, but obviously we can't turn it off in production because that would display the raw errors to the user. That being said, I wouldn't mind having real error messages appear to certain users. How would one accomplish that? If I were to guess, the issue is hidden in those web.configs, simply because they're ancient and so many hands have touched them.
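For reference, the built-in middle ground here is customErrors with mode="RemoteOnly", which shows full error pages only to requests made from the server itself (e.g. browsing over an RDP session) while remote users get the friendly page. A sketch of the relevant root web.config fragment (the redirect path is a placeholder):

```xml
<!-- Full errors for local requests, friendly page for everyone else. -->
<system.web>
  <customErrors mode="RemoteOnly" defaultRedirect="~/Error" />
</system.web>
```

There is no built-in per-user toggle; showing real errors to specific signed-in users would require custom code in an error handler rather than a config switch.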
Conclusion
I need a 100% bullet-proof way to get exceptions and their callstacks from ASP.NET hosted on Azure. Otherwise, it's nearly impossible to solve edge cases that appear unexpectedly in production. I don't recall this being a problem in my days before Azure.
I am certain an expert out there will have this solved in mere minutes, but, for now, I'm completely stumped. Thank you for your time!
A couple of things to try and check for:
Make sure that your Application Insights NuGet packages are up to date. I've had metrics quit working over the last couple of years, or new metrics show up on the AppInsights blade that I wasn't collecting. Upgrading to the latest NuGet packages did the trick.
Are you catching exceptions within your web app and then returning a HTTP 500 response explicitly? If so, you won't see a stack trace. Stack traces are captured after bubbling all the way up through your controller method unhandled.
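If you do need to catch exceptions and translate them into 500s yourself, you can still report them to Application Insights explicitly. This is a hypothetical sketch: the controller and LoadWidget are made up, and TelemetryClient assumes the Microsoft.ApplicationInsights NuGet package is installed.

```csharp
// Hypothetical Web API controller: catch, report the exception (with its
// stack trace) to Application Insights, then return the 500 yourself.
using System;
using System.Net;
using System.Net.Http;
using System.Web.Http;
using Microsoft.ApplicationInsights;

public class WidgetsController : ApiController
{
    private static readonly TelemetryClient _telemetry = new TelemetryClient();

    public HttpResponseMessage Get(int id)
    {
        try
        {
            return Request.CreateResponse(HttpStatusCode.OK, LoadWidget(id));
        }
        catch (Exception ex)
        {
            // TrackException preserves the callstack in the Exceptions blade.
            _telemetry.TrackException(ex);
            return Request.CreateResponse(HttpStatusCode.InternalServerError);
        }
    }

    private string LoadWidget(int id) { /* placeholder */ return "widget"; }
}
```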
I am developing a web API in ASP.NET that does some image processing. Essentially the user application will make GET requests with a few arguments, e.g. an image, quality, text to draw on it, size, etc.
My concern is that I do not know exactly how fast these requests are going to come in. If I spam refresh on a GET request for long enough, I see the memory starting to slowly increase until it hits 1 GB and then finally throws an OutOfMemoryException. Strangely enough, sometimes before hitting the OOM, I get an ArgumentException (even though I am using a valid request that works otherwise).
My questions are broad and as follows:
1) Is there a good tool to test this sort of mass request? I'd like to be able to spam my server so I can consistently analyze and troubleshoot any problems that arise. I haven't found anything and have just been clicking / pressing Enter in the browser manually.
2) Is there a tool you'd recommend to analyze what specific parts of my program are causing this memory issue? If the Diagnostic Tools in VS are good enough, can you offer some guidance as to what I should be looking for? E.g. investigating the call stack, memory profiling, etc.
3) Perhaps none of the above questions are even necessary if this one can be answered: can this sort of request flood be prevented? Maybe my API can ensure that requests are only processed at a rate that can be handled (at the expense of user image load time). I know that catching the exceptions alone isn't going to be enough, so is there something that ASP.NET provides for this sort of mass-request prevention?
Thanks for taking the time to read, any answers are appreciated.
[I'm not sure whether to post this on Stack Overflow or Server Fault, but since this is a C# development project, I'll stick with Stack Overflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows: Load balanced front end web servers (IIS) running an MVC application (C#). A home-grown service bus, implemented with MSMQ running in domain-integration mode. Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus. Back end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware, running Windows Server 2012, latest releases, latest Windows updates. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub-5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurs when the software tests whether a queue exists before opening it, so it was essentially looking up the queue twice, which is fairly expensive. With the redundant check removed, the issue has gone away.
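One way to keep a safety check without paying for it on every open is to cache the existence test per queue path. This is an illustrative sketch of that idea, not the actual fix: the class and method names are made up, and it assumes MessageQueue.Exists is the expensive call being avoided.

```csharp
// Hypothetical: check MessageQueue.Exists once per path, then reuse the
// cached answer, so every subsequent open skips the expensive lookup.
using System.Collections.Concurrent;
using System.Messaging;

public static class QueueOpener
{
    private static readonly ConcurrentDictionary<string, bool> _verified =
        new ConcurrentDictionary<string, bool>();

    public static MessageQueue Open(string path)
    {
        // Only pay the cost of Exists()/Create() the first time we see
        // this path; later calls hit the dictionary instead.
        _verified.GetOrAdd(path, p =>
        {
            if (!MessageQueue.Exists(p))
                MessageQueue.Create(p);
            return true;
        });
        return new MessageQueue(path);
    }
}
```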
What you should try is the Performance Monitor that's part of Windows itself: create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to watch.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.
This is a very general question, so I won't be providing any code as my project is fairly large.
I have an ASP.NET project which I've been maintaining and adding to for a year now. There are about 30 pages in total, each mostly having a couple of GridViews and SqlDataSources, and usually not more than 10-15 methods in the codebehind. There is also a fairly hefty LINQ-to-SQL .dbml file, with around 40-50 tables.
The application takes about 30-40 seconds to compile, which I suppose isn't too bad - but the main issue is that when deployed, it's slow at loading pages compared to other applications on the same server and app pool - it can take around 10 seconds to load a simple page. I'm very confident the issue isn't isolated to any specific page - it seems more of a global application issue.
I'm just wondering if there are any settings in the web.config etc I can use to help speed things up? Or just general tips on common 'mistakes' or issues developers encounter than can cause this. My application is close to completion, and the speed issues are really tainting the customer's view of it.
As a first step, find out the source of the problem: either the application side or the database side.
Application side:
Start by enabling trace for slow pages and check the size of the ViewState; a large ViewState sometimes causes slow page loads.
Database side:
Use SQL Profiler to see exactly what is taking a long time to complete.
Useful links:
How to: Enable Tracing for an ASP.NET Application
Improve ASP.NET Performance By Disabling ViewState And Setting Session As ReadOnly
How to Identify Slow Running Queries with SQL Profiler
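Enabling trace is a web.config change; the sketch below shows the usual settings (the attribute values are just sensible starting points, not requirements). With localOnly="true", the collected traces are viewable at /trace.axd from the server only, and each trace's Control Tree section lists the ViewState size of every control on the page.

```xml
<!-- Page-level tracing; browse /trace.axd locally to inspect requests. -->
<system.web>
  <trace enabled="true" localOnly="true" requestLimit="50" pageOutput="false" />
</system.web>
```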
Most common oversight probably: don't forget to turn off debugging in your web.config before deploying.
<compilation debug="false" targetFramework="4.0" />
A few others:
Don't enable session state or viewstate where you don't use it
Use output caching where possible, consider a caching layer in general i.e. for database queries (memcached, redis, etc.)
Minify and combine CSS
Minify JavaScript
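For the output caching suggestion, since this is a WebForms app, the simplest form is a page-level directive. This is an illustrative fragment; the duration and vary settings are assumptions to tune for your pages:

```aspx
<%-- Illustrative: cache this page's rendered output for 60 seconds,
     keeping a separate cached copy per query-string combination. --%>
<%@ OutputCache Duration="60" VaryByParam="*" %>
```

Only do this on pages whose content doesn't depend on session state or the logged-in user, or you'll serve one user's cached page to another.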
What to do now:
Look at page load in Firebug or Chrome developer tools. Check to make sure you're not sending a huge payload over the wire.
Turn on trace output to see how the server is spending its time.
Check network throughput to your server.
How to avoid this in the future:
Remember that speed is a feature. If your app is slow as a dog, customers can't help but think it sucks. This means you want to run your app on "production" hardware as soon as you can and deploy regularly so that you catch performance problems as they're introduced. It's no fun to have an almost-done app that takes 10 seconds to deliver a page. Hopefully, you get lucky and can fix most of this with config. If you're unlucky, you might have some serious refactoring to do.
For example, if you've used ViewState pretending it was magic and free, you may have to rework some of that dependency.
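Reworking that dependency often starts with simply switching ViewState off where it isn't needed. An illustrative fragment (the control ID is made up):

```aspx
<%-- Illustrative: disable ViewState per control for anything you re-bind
     on every request anyway; EnableViewState="false" in the @Page
     directive does the same for a whole page. --%>
<asp:GridView ID="ResultsGrid" runat="server" EnableViewState="false" />
```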
Keep perf on a short leash. Your app will be snappy, and people will think you are awesome.
I've recently deployed an MVC application to an IIS6 web server. One strange behaviour I've been seeing is that load times will randomly blow out to 30 sec+ and then return to normal. Our tests have shown this occurring on multiple connections at the same time. Once the wait has passed, the site becomes responsive again. It's completely random when this will occur, but it will probably happen about once every 15 minutes or so.
My first thought was the application was being restarted by the web server for some reason, but I determined this wasn't the case because the process recycling is set very infrequently, and I placed some logging in the application startup.
It's also nothing to do with the database connection. This slowdown happens even when moving between static pages. I've watched the database with SQL Profiler, and nothing is hitting it when these slowdowns occur.
Finally, I've placed entry and exit logging on my controller actions; the slowdown always happens outside of the controller. The entry and exit times for a controller action are always appropriately fast.
Does anyone have any ideas of what could be causing this? I've tried running it locally on IIS7 and I haven't had the issue. I can only think it's something to do with our hosting provider.
Is this running on a dedicated server? If not, it might be your hosting provider.
It sounds to me, from what you have said, that the server is maxing its CPU every 15 minutes for some reason. It could be something in the code hitting an infinite loop. Have you had a look in the event log for any crashes / errors from the application?
Run the web app under a profiler (e.g. JetBrains dotTrace) and dump out the results after one of these 30-second lockups occurs. The profiler output should make locating the bottleneck fairly obvious, as it will pinpoint the exact API call which is consuming the time / blocking other threads.
At a guess it could be memory pressure causing items being dumped from cache or garbage collection, although 30 seconds sounds a little excessive for this.