I have what is essentially a flashcard web app, hosted on the free tier of Azure and coded in ASP.NET (C#). It is used by a small number of people (40 or so). As you can see in the graph below, the CPU Time was steady for a while and then started climbing steadily around April 1. The problem is that I am now hitting Azure's 60-minute CPU Time per day limit, which shuts my app down when it reaches that quota.
I am unaware of ANY changes, either in the code or in the website's configuration, at any point in the period shown on this chart.
Quick note: the large spikes are expected, and I don't believe they're related to the issue. Long story short, each one was the day of a competition, when the app gets used significantly more than usual. This happens every couple of weeks. I don't believe it's related because a spike has NEVER been followed by a steady increase before. So the spikes are normal; the gradual increase is not.
I have restarted the web service many times. I have redeployed the code. I have turned off many features in the C# code that might drive up CPU Time. I checked the website's request count, and it is actually LOWER after that first spike than before it. Even during periods with no requests (or something small, like fewer than 5 requests per hour), the CPU Time stays high. So this has nothing to do with request count, or with something like undisposed threads (which would be cleared by a web service restart anyway).
One last thing: I have also deployed this EXACT same code to another Azure website, which I have used for years as the test site. The test site does NOT have this issue. It connects to the same data and everything; the only difference is that it isn't the site the users hit, so its request count is much lower, and its CPU Time does not gradually increase. This leads me to believe the problem is not in my C#/ASP.NET code.
My theory is that some configuration in Azure is causing this, but I don't know what. I didn't change anything around the time the CPU Time started increasing, but I don't see what else it could be. Any ideas would be greatly appreciated; I've been racking my brain over this for weeks, and it's taking my production site down for hours every day.
EDIT: Also, the CPU Usage is NOT high during this time. So while the CPU is supposedly busy for long stretches, it never approaches 100% usage at any given moment. So this is NOT an issue of high CPU usage either.
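One diagnostic I can still add (a minimal sketch; the class name and 60-second interval are arbitrary) is to log how much processor time the process itself accumulates per interval. If these deltas stay near zero while the portal's CPU Time keeps climbing, the time is being burned outside my application code:

    using System;
    using System.Diagnostics;
    using System.Threading;

    public static class CpuTimeLogger
    {
        private static Timer _timer;     // field reference keeps the timer alive
        private static TimeSpan _last;

        // Call once from Application_Start in Global.asax.
        public static void Start()
        {
            _last = Process.GetCurrentProcess().TotalProcessorTime;
            _timer = new Timer(_ =>
            {
                TimeSpan now = Process.GetCurrentProcess().TotalProcessorTime;
                Trace.WriteLine(string.Format("CPU time used in last 60s: {0:F1}s",
                                              (now - _last).TotalSeconds));
                _last = now;
            }, null, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(60));
        }
    }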
Related
[I'm not sure whether to post this on Stack Overflow or Server Fault, but since this is a C# development project, I'll stick with Stack Overflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it in our development environment; it's a sporadic problem on our production servers only.
The architecture is as follows:
Load-balanced front-end web servers (IIS) running an MVC application (C#).
A home-grown service bus, implemented with MSMQ running in domain-integration mode.
Five 'worker pool' servers, running our Windows service, which responds to requests placed on the bus.
A back-end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware and run Windows Server 2012, latest releases, latest Windows updates. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (send an MSMQ message) and await the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
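To make the pattern concrete, here is a minimal sketch of that request/reply exchange using System.Messaging; the queue paths, string payload, and 30-second timeout are placeholders rather than our actual code:

    using System;
    using System.Messaging;

    public class BusClient
    {
        // Controller side: put a request on the bus and block for the correlated reply.
        public string SendAndAwaitReply(string request)
        {
            using (var requestQueue = new MessageQueue(@".\private$\bus.requests"))
            using (var replyQueue = new MessageQueue(@".\private$\bus.replies"))
            {
                replyQueue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

                var msg = new Message(request) { ResponseQueue = replyQueue };
                requestQueue.Send(msg);          // msg.Id is populated by Send

                // A worker replies to msg.ResponseQueue with
                // reply.CorrelationId = msg.Id, so this blocks until *our*
                // reply arrives rather than someone else's.
                using (var reply = replyQueue.ReceiveByCorrelationId(
                           msg.Id, TimeSpan.FromSeconds(30)))
                {
                    return (string)reply.Body;
                }
            }
        }
    }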
It's a nice architecture to work with, with respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool, and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated, we do have these moments where performance is a problem, and it's proving difficult to track down the point(s) in the architecture where the bottleneck is.
What we have attempted to do is send a request down the bus and round-trip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then, when the MVC app receives the reply, we can dump the timestamps and metrics and try to determine which part of the process is causing the issue.
However, we soon realised that we cannot rely on Windows time as an accurate measure, because many of our processes run at the 5-100 ms level and a message can pass through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
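A sketch of a partial workaround (with illustrative type names; this is not what we have in production): record only per-hop processing durations with Stopwatch, which is monotonic and local to each machine, so no cross-server clock agreement is needed. The trade-off is that time spent in transit between hops is not captured:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // Illustrative record carried inside the bus message.
    [Serializable]
    public class HopTiming
    {
        public string Server;       // machine that handled the hop
        public double ElapsedMs;    // processing time on that machine only
    }

    public class Worker
    {
        public void HandleMessage(List<HopTiming> timings /*, payload */)
        {
            var sw = Stopwatch.StartNew();   // monotonic; unaffected by clock drift or resync
            try
            {
                // ... the actual work: SQL queries, business logic, etc. ...
            }
            finally
            {
                timings.Add(new HopTiming
                {
                    Server = Environment.MachineName,
                    ElapsedMs = sw.Elapsed.TotalMilliseconds
                });
            }
        }
    }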
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub-5 ms resolution I need. Comments or experience?
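Before building the scheduled job, a rough SNTP probe like this (server name, timeout, and field offsets per RFC 2030; all illustrative, not production code) can measure how far each server's clock actually sits from an NTP source. Note that a public pool server over the internet is itself typically only accurate to a few milliseconds, so treat the result as indicative:

    using System;
    using System.Net.Sockets;

    class NtpCheck
    {
        static void Main()
        {
            var data = new byte[48];
            data[0] = 0x1B;                  // LI=0, VN=3, Mode=3 (client request)

            using (var socket = new Socket(AddressFamily.InterNetwork,
                                           SocketType.Dgram, ProtocolType.Udp))
            {
                socket.ReceiveTimeout = 3000;
                socket.Connect("pool.ntp.org", 123);
                socket.Send(data);
                socket.Receive(data);
            }

            // Transmit timestamp: seconds since 1900-01-01, big-endian, at offset 40.
            uint seconds  = (uint)((data[40] << 24) | (data[41] << 16) | (data[42] << 8) | data[43]);
            uint fraction = (uint)((data[44] << 24) | (data[45] << 16) | (data[46] << 8) | data[47]);
            double ms = seconds * 1000.0 + fraction * 1000.0 / 0x100000000L;
            DateTime ntpUtc = new DateTime(1900, 1, 1, 0, 0, 0, DateTimeKind.Utc)
                                  .AddMilliseconds(ms);

            Console.WriteLine("NTP-local offset: {0:F1} ms",
                              (ntpUtc - DateTime.UtcNow).TotalMilliseconds);
        }
    }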
Update 2
FWIW, we've found the cause of the performance issue. It occurs when the software tests whether a queue exists before opening it, which means it effectively looks the queue up twice, and that existence check is fairly expensive. With the redundant check removed, the issue has gone away.
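For anyone who hits the same thing, a minimal sketch of the shape of the fix (class name and caching policy are illustrative): make the expensive MessageQueue.Exists call at most once per queue path instead of on every open.

    using System.Collections.Concurrent;
    using System.Messaging;

    public static class QueueCache
    {
        // Remember paths we've already verified so MessageQueue.Exists
        // (the expensive call) runs at most once per path, not per open.
        private static readonly ConcurrentDictionary<string, bool> Verified =
            new ConcurrentDictionary<string, bool>();

        public static MessageQueue Open(string path)
        {
            Verified.GetOrAdd(path, p =>
            {
                if (!MessageQueue.Exists(p))
                    MessageQueue.Create(p);
                return true;
            });
            return new MessageQueue(path);
        }
    }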
What you should try is the Performance Monitor that's part of Windows itself. Create a Data Collector Set on each of the servers and select the metrics you want to monitor; something like Request Execution Time would be a good one to watch.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
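If you'd rather collect the same numbers from code, the standard ASP.NET counters can be read with System.Diagnostics.PerformanceCounter. The category and counter names below are the usual ones, but verify them in perfmon on your servers:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class CounterPoller
    {
        static void Main()
        {
            var execTime = new PerformanceCounter("ASP.NET", "Request Execution Time", true);
            var queued   = new PerformanceCounter("ASP.NET", "Requests Queued", true);

            while (true)   // poll every 5 seconds; adjust to taste
            {
                Console.WriteLine("{0:T}  exec time = {1} ms, queued = {2}",
                    DateTime.Now, execTime.NextValue(), queued.NextValue());
                Thread.Sleep(5000);
            }
        }
    }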
Hopefully this will give you a start on troubleshooting the problem.
I'm cross-posting this from the 51Degrees forums, as it hasn't gotten much traction there.
I went ahead and implemented the latest NuGet package version of 51Degrees into a site we manage here at work. (2.19.1.4) We are attempting to bring in house the management of the mobile views for this site (it's currently done by a third party). So the only functionality we are interested in is the detection. We disabled the redirect functionality by commenting out the redirect element in the config and we modified the logging level to Fatal (the log is in the App_Data folder).
To our understanding, those were the only changes needed. And this worked. We could switch our layout view between desktop and mobile based on the information 51degrees was providing.
While testing and promoting through DEV and QA we noted increased memory consumption in the app pool, but nothing that we were overly worried about. The app pool at standard traffic levels consumes roughly 230 MB of memory in PROD. It will spike to 300 MB during peak times, so nothing too worrisome, especially considering we do a fair amount of InProc caching.
As of Sunday we promoted 51Degrees Lite into PROD, but disabled the mobile views (we did this in QA as well). We wanted to see how it would perform in PROD and what kind of impact it would have on the server in a live environment. Just to reiterate: QA revealed increased memory use, but we could not replicate PROD loads and variances.
PROD revealed some concerns. Memory consumption in the app pool on one of the two front ends grew slowly throughout the day, peaking at day's end at 560 MB at 11 PM. The other peaked at 490 MB.
We confirmed the problem was isolated to 51degrees by removing it from the site, recycling, and monitoring for another day. App pool memory never exceeded 300MB.
We also ran the app pool through SciTech's memory profiler to confirm. The results showed 51Degrees consuming the majority of the additional memory above the expected baseline. (We can run these tests again in a QA environment if needed; the numbers will be lower, but they will paint a picture.)
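(A crude alternative to a full profiler for watching this kind of growth, with illustrative names and interval: log private bytes alongside the managed heap size. If both climb together, the extra memory is managed data, which is where a detection dataset would live; if only private bytes climb, it's native.)

    using System;
    using System.Diagnostics;
    using System.Threading;

    public static class MemoryLogger
    {
        private static Timer _timer;   // field reference keeps the timer alive

        // Call once at startup; logs every 5 minutes.
        public static void Start()
        {
            _timer = new Timer(_ =>
            {
                var p = Process.GetCurrentProcess();
                Trace.WriteLine(string.Format("{0:T} private = {1:N0} MB, managed = {2:N0} MB",
                    DateTime.Now,
                    p.PrivateMemorySize64 / (1024 * 1024),
                    GC.GetTotalMemory(false) / (1024 * 1024)));
            }, null, TimeSpan.Zero, TimeSpan.FromMinutes(5));
        }
    }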
So the questions:
1) What would account for this large memory consumption? While a 500-600 MB app pool isn't the end of the world, having our mobile detection solution more than double our app pool size is worrisome. (Our site isn't the heaviest-traffic site, but it does receive a fairly decent number of requests.)
2) Are there any settings we can apply to prevent or reduce the memory consumption? Ideally, we'd like to limit 51Degrees to just the memory needed to load the product and monitor incoming requests.
Thanks for any feedback.
We have been dealing with a troublesome issue with WCF for quite some time, and it has gotten to a point that we are desperate to find a fix.
I have a WCF service that is hit frequently, approximately 50 requests/second at times. Average execution times are 10-20 seconds, but they may go up to 45 seconds.
The problem is that, at random (we have not been able to reproduce it in the 2 months since it started), the requests visible in the IIS worker process just keep accumulating and never complete, their elapsed times increasing indefinitely until we have to recycle the app pool.
We have run performance metrics on the DB, and on the code using CPU and memory analyzers, and have confirmed that there are no bottlenecks in our code that stand out.
I have the service throttling config values set to the following:
    <serviceThrottling maxConcurrentInstances="800"
                       maxConcurrentCalls="800"
                       maxConcurrentSessions="800" />
I am enabling WCF tracing right now to see if anything turns up, but it's hard because we are unable to reproduce the problem.
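In the meantime, one cheap sanity check (a sketch with an arbitrary class name): dump the throttling values the host actually picked up after construction, in case the serviceThrottling behavior isn't being applied at all (for example, because of a behaviorConfiguration name mismatch):

    using System;
    using System.ServiceModel;
    using System.ServiceModel.Description;

    public static class ThrottleCheck
    {
        // Call with a freshly constructed host, e.g. from a custom ServiceHostFactory.
        public static void Dump(ServiceHost host)
        {
            var throttle = host.Description.Behaviors.Find<ServiceThrottlingBehavior>();
            if (throttle == null)
                Console.WriteLine("No serviceThrottling behavior applied; WCF defaults are in effect.");
            else
                Console.WriteLine("Calls={0} Sessions={1} Instances={2}",
                    throttle.MaxConcurrentCalls,
                    throttle.MaxConcurrentSessions,
                    throttle.MaxConcurrentInstances);
        }
    }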
I've been using SignalR on a project for the last couple of weeks, and it's been performing great. I even did a stress test with Crank yesterday and got 1000 users with no real delay.
I need to move on to the next stage of testing today, so I decided to move it to IIS 7.5.
After moving it over and doing a quick touch test, I decided to do another stress test. This time I only got to 10 users before the website was pretty much dead.
Does anyone know why this would happen? I've followed all the information on the SignalR performance tuning page, and it's made zero difference.
Can anyone help?
In some cases the maximum concurrent requests can be capped at ~10 (the old default). This was changed in later .NET releases to default to 5000. Judging by what's happening on your machine, I'd assume your default is still (somehow) ~10.
I know you said you looked over the SignalR performance tuning piece, but make sure your configuration is properly set up per the Maximum Concurrent Requests Per CPU section at https://github.com/SignalR/SignalR/wiki/Performance. It's tempting to skip that section on the assumption that 5k concurrent requests is plenty, but in earlier releases the default was very low.
You can also check out: http://blogs.msdn.com/b/tmarq/archive/2007/07/21/asp-net-thread-usage-on-iis-7-0-and-6-0.aspx for more info regarding IIS concurrent request usages, particularly the 7th paragraph.
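For IIS 7+ in integrated mode, that blog post describes raising the limit in aspnet.config (under %windir%\Microsoft.NET\Framework64\v4.0.30319\ for 64-bit .NET 4 pools; Framework instead of Framework64 for 32-bit). The values below mirror the later defaults, so double-check them against your .NET version:

    <configuration>
      <system.web>
        <applicationPool maxConcurrentRequestsPerCPU="5000"
                         maxConcurrentThreadsPerCPU="0"
                         requestQueueLimit="5000" />
      </system.web>
    </configuration>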
I have a fairly busy site which does around 10M views a month.
One of my app pools seemed to jam up for a few hours, and I'm looking for ideas on how to troubleshoot it. I suspect that it somehow ran out of threads, but I'm not sure how to determine this retroactively. Here's what I know:
The site never went 'down', but around 90% of requests started timing out.
I can see a high number of "HttpException - Request timed out." entries in the log during the outage.
I can't find any SQL errors or code errors that would have caused the timeouts.
The timeouts seem to have been site wide on all pages.
There was one page with a bug on it which would have caused errors on that specific page.
The site had to be restarted.
The site is ASP.NET C# 3.5 WebForms.
Possibilities:
Thread depletion: My thought is that the page causing the errors may have somehow started tying up all the available threads (see the sketch after this list).
Global code error: Another possibility is that one of my static classes has an undiscovered bug in it somewhere. This is unlikely, as this has never happened before and I can't find any log errors for these classes, but it is a possibility.
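To confirm or rule out the thread-depletion theory the next time it happens, a snapshot like this could be logged from a timer or a lightweight health-check page (a sketch; the class name is arbitrary):

    using System.Diagnostics;
    using System.Threading;

    public static class ThreadPoolSnapshot
    {
        public static void Log()
        {
            int worker, io, maxWorker, maxIo;
            ThreadPool.GetAvailableThreads(out worker, out io);
            ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
            Trace.WriteLine(string.Format(
                "Worker threads in use: {0}/{1}, IOCP threads in use: {2}/{3}",
                maxWorker - worker, maxWorker, maxIo - io, maxIo));
        }
    }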
UPDATE
I've managed to trace the issue now while it's occurring. The pages are being loaded normally but for some reason WebResource.axd and ScriptResource.axd are both taking a minute to load. In the performance counters I can see ASP.NET Requests Queued spikes at this point.
The first thing I'd try is Sam Saffron's CPU analyzer tool, which should give an indication of whether something common is happening too often or taking too long. In part because it doesn't involve any changes: just run it on the server.
After that, there are various other debugging tools available; we've found that some very ghetto approaches can be insanely effective at seeing where time is spent (of course, it'll only work on the 10% of successful results).
You can of course just open the server profiling tools and drag in various .NET / IIS counters, which may help you spot some things.
Between these three options, you should be covered for:
code dropping into a black hole and never coming out (typically threading related)
code running, but too slowly (typically data access related)