WCF Requests Hanging Infinitely in IIS until Recycled - C#

We have been dealing with a troublesome issue with WCF for quite some time, and it has gotten to the point where we are desperate to find a fix.
I have a WCF service that is hit frequently, sometimes at approximately 50 requests/second. Average execution times are 10-20 seconds, but they may go up to 45 seconds.
The problem is that, at random times (we have not been able to reproduce it in the two months since it started), the requests shown for the worker process in IIS just keep piling up: they never complete, and their elapsed time grows indefinitely until we have to recycle the app pool.
We have run performance metrics on the DB and analyzed the code with CPU and memory profilers, and have confirmed that there are no bottlenecks in our code that stand out.
I have the service throttling config values set to the following:
<serviceThrottling maxConcurrentInstances="800"
                   maxConcurrentCalls="800"
                   maxConcurrentSessions="800" />
I am enabling WCF tracing right now to see if anything turns up, but it's hard because we are unable to reproduce the problem on demand.

Related

Azure Web App Service has Steady CPU Time Increase

I have what is essentially a flashcard web app, hosted on the free tier of Azure and written in ASP.NET (C#). It is used by a small number of people (40 or so). As you can see in the graph below, the CPU Time was steady for a while and then started steadily increasing around April 1. The problem is that I am now reaching Azure's 60-minute CPU Time per day limit, causing my app to shut down when it hits that quota.
I am unaware of ANY changes, either in the code or in the website's configuration, that happened at any point in the period shown on this chart.
Quick note: The large spikes are expected, and I don't believe they're related to the issue. Long story short, it was the day of a competition where the app was used significantly more than usual. This happens every couple weeks during each competition. I don't believe it's related because it has NEVER been followed by a steady increase shortly after. So the spike is normal; the gradual increase is not.
I have restarted the web service many times. I have redeployed the code. I have turned off many features in the C# code that might increase the CPU Time. I checked the website's request count, and it is actually LOWER after that first spike than before it. Even during periods with no requests (or something small like <5 requests per hour), the CPU Time is still high. So this has nothing to do with request count, or with something like undisposed threads (which would be cleared by a web service restart anyway).
One last thing: I have also deployed this EXACT same code to another Azure website, which I have used for years as the test website. The test website does NOT have this issue. The test website connects to the same data and everything. The only difference is that it's not what other users use, so the request count is much lower, and it does not gradually increase. This leads me to believe it is not an issue in my C#/ASP.NET code.
My theory is that there is some configuration in Azure that is causing this, but I don't know what. I didn't change anything around the time the CPU Time started increasing, but I don't see what else it could be. Any ideas would be greatly appreciated, as I've been racking my brain for weeks on this, and it's causing my production site to go down for hours every day.
EDIT: Also, the CPU Usage is NOT high during this time. While the CPU is supposedly busy for long stretches, it never approaches 100% at any given moment. So this is also NOT an issue of high CPU usage.

WCF NetTcpBinding application creating many thousands of .NET threads

We have a homegrown application that hosts a WCF service over NetTcpBinding. It works fine most of the time, but from time to time it starts creating .NET threads as fast as it can. The CLR dies quickly, yet the process keeps creating threads until it runs out of memory or CPU; at times over 10,000 of them. Of course, the application is effectively dead long before it gets there. We recycle the Windows service it runs as and it goes back to normal until the next time.
We have Dynatrace, and it catches the increase in thread creation, but the CLR crashes very early in the process and we cannot get a thread dump of the problem. I will try to catch it earlier, but so far, no luck.
What other tools could we leave running all the time without impacting the application, that would help us determine why this is happening?
It may be many days between these events, and they might occur during the night when no one is monitoring, so automated info collection would be useful. But we will look into anything.
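For what it's worth, the sort of thing we are imagining is a small watchdog console app left running next to the service, roughly like the sketch below (the process name, thread threshold, dump path, and use of Sysinternals procdump are placeholders, not something we actually run today):

// Watchdog sketch: polls the service process's thread count and captures a
// full memory dump with procdump the first time it crosses a threshold.
using System;
using System.Diagnostics;
using System.Threading;

class ThreadCountWatchdog
{
    static void Main()
    {
        const string targetProcessName = "OurWcfHostService"; // placeholder: the Windows service's process name
        const int threadThreshold = 500;                       // placeholder: normal thread count is far below this
        bool dumpTaken = false;

        while (true)
        {
            foreach (var process in Process.GetProcessesByName(targetProcessName))
            {
                int threadCount = process.Threads.Count;
                Console.WriteLine("{0:o} pid={1} threads={2}", DateTime.Now, process.Id, threadCount);

                if (threadCount > threadThreshold && !dumpTaken)
                {
                    // Capture a full memory dump once, then leave the process alone.
                    // Assumes procdump.exe is on the PATH.
                    Process.Start("procdump.exe", "-ma " + process.Id + " c:\\dumps\\highthreads.dmp");
                    dumpTaken = true;
                }
            }
            Thread.Sleep(TimeSpan.FromSeconds(15));
        }
    }
}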
Also, any idea what might be causing this in a general way?
Thanks for any help you could provide.

Multi-server n-tier synchronized timing and performance metrics?

[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it in our development environment - it's a sporadic problem that occurs on our production servers only.
The architecture is as follows:
- Load-balanced front-end web servers (IIS) running an MVC application (C#).
- A home-grown service bus, implemented with MSMQ running in domain-integration mode.
- Five 'worker pool' servers, running our Windows service, which responds to requests placed on the bus.
- A back-end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware and run Windows Server 2012 with the latest releases and the latest Windows updates. Everything is bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
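Conceptually, the correlation works like the raw MSMQ sketch below (our bus wraps this; the queue paths and payload type are placeholders):

using System;
using System.Messaging;

public static class BusClient
{
    public static Message SendAndWaitForReply(object payload)
    {
        using (var requestQueue = new MessageQueue(@".\private$\busRequests"))  // placeholder path
        using (var replyQueue = new MessageQueue(@".\private$\busReplies"))     // placeholder path
        {
            var request = new Message(payload) { ResponseQueue = replyQueue };
            requestQueue.Send(request);

            // The worker sets reply.CorrelationId = request.Id before replying,
            // so this receive picks up exactly the response to this request.
            // (Formatter setup and error handling omitted for brevity.)
            return replyQueue.ReceiveByCorrelationId(request.Id, TimeSpan.FromSeconds(30));
        }
    }
}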
It's a nice architecture to work with in terms of the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool, and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated, we do have these moments where performance is a problem, and it's proving difficult to track down at which point(s) in the architecture the bottleneck lies.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
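To give a flavour, each hop appends a record along the lines of the sketch below (field names are simplified for illustration):

using System;

// One of these is appended to the message at each stop on the route.
public class HopMetric
{
    public string ServerName { get; set; }      // Environment.MachineName of the hop
    public string Stage { get; set; }           // e.g. "MvcSend", "WorkerDequeue", "SqlDone", "ReplySend"
    public DateTime UtcTimestamp { get; set; }  // wall-clock time - only meaningful if the servers' clocks agree
    public long ElapsedMs { get; set; }         // Stopwatch-measured time spent inside this stage
}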
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurred because the software tested whether a queue existed before opening it, so it was essentially looking up the queue twice, which is fairly expensive. With that sorted, the issue has gone away.
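For anyone curious, the offending pattern was essentially check-then-open on every send, something like the sketch below (the queue path is a placeholder, and MessageQueue.Exists stands in for whatever existence check our bus was doing):

using System.Messaging;

public static class BusSender
{
    private const string QueuePath = @".\private$\workerRequests"; // placeholder

    // Before: existence check on every send - the expensive part.
    public static void SendWithCheck(object request)
    {
        if (MessageQueue.Exists(QueuePath))
        {
            using (var queue = new MessageQueue(QueuePath))
            {
                queue.Send(request);
            }
        }
    }

    // After: the queue is created once at deployment time, so just open it and send.
    public static void Send(object request)
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Send(request);
        }
    }
}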
Try using the Performance Monitor that's part of Windows itself. You can create a Data Collector Set on each of the servers and select the metrics you want to monitor; something like Request Execution Time would be a good one to watch.
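If you want to spot-check that counter from code rather than through perfmon, something like this should work (category and counter names as they appear in perfmon; verify them on your servers):

using System;
using System.Diagnostics;
using System.Threading;

class RequestExecutionTimeProbe
{
    static void Main()
    {
        // The aggregate "ASP.NET" category reports the execution time of the
        // most recently completed request, in milliseconds.
        using (var counter = new PerformanceCounter("ASP.NET", "Request Execution Time"))
        {
            for (int i = 0; i < 20; i++)
            {
                Console.WriteLine("{0:T}  Request Execution Time = {1} ms", DateTime.Now, counter.NextValue());
                Thread.Sleep(5000);
            }
        }
    }
}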
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.

IIS6, ASP.NET MVC 1 and random slowdowns

I've recently deployed an MVC application to an IIS6 web server. One strange behaviour I've been seeing is that load times will randomly blow up to 30+ seconds and then return to normal. Our tests have shown this occurring on multiple connections at the same time. Once the wait has passed, the site becomes responsive again. It's completely random when this will occur, but it probably happens about once every 15 minutes or so.
My first thought was the application was being restarted by the web server for some reason, but I determined this wasn't the case because the process recycling is set very infrequently, and I placed some logging in the application startup.
It's also nothing to do with the database connection. This slowdown happens even when simply moving between static pages. I've watched the database with a SQL profiler, and nothing is hitting it when these slowdowns occur.
Finally, I've placed entry and exit logging on my controller actions, and the slowdown always happens outside of the controller. The entry and exit times for a controller action are always appropriately fast.
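For reference, the entry/exit logging is just a simple action filter along these lines (simplified here to write to Trace):

using System.Diagnostics;
using System.Web.Mvc;

public class TimingFilterAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        filterContext.HttpContext.Items["actionTimer"] = Stopwatch.StartNew();
        Trace.WriteLine("Entering " + filterContext.ActionDescriptor.ActionName);
    }

    public override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        var timer = (Stopwatch)filterContext.HttpContext.Items["actionTimer"];
        timer.Stop();
        Trace.WriteLine("Exiting " + filterContext.ActionDescriptor.ActionName +
                        " after " + timer.ElapsedMilliseconds + " ms");
    }
}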
Does anyone have any ideas of what could be causing this? I've tried running it locally on IIS7 and I haven't had the issue. I can only think it's something to do with our hosting provider.
Is this running on a dedicated server? If not, it might be your hosting provider.
It sounds to me, from what you have said, that every 15 minutes or so the server is maxing its CPU for some reason. It could be something in the code hitting an infinite loop. Have you had a look in the event log for any crashes or errors from the application?
Run the web app under a profiler (e.g. JetBrains dotTrace) and dump out the results after one of these 30-second lockups occurs. The profiler output should make locating the bottleneck fairly obvious, as it will pinpoint the exact API call that is consuming the time and blocking other threads.
At a guess, it could be memory pressure causing items to be dumped from the cache, or garbage collection, although 30 seconds sounds a little excessive for either.

BizTalk low latency - response time issue calling a WCF-published orchestration from C#

Calling a WCF-published orchestration from a C# program usually gives sub-second response time. However, on some occasions, it can take 20-50 seconds between the call in the C# program and the first trace message from the orchestration. The C# program that calls the WCF service runs under HIS/HIP (Host Integration Services / CICS Host-Initiated Processing).
Almost every time I restart the HIS/HIP service, we get a very slow response time, and thus a timeout in CICS. I'm also afraid it might happen during the day if things "go cold" - in other words, maybe things are being cached. Even first-time JIT compiles shouldn't take 20-50 seconds, should they? The other thing that seems strange is that the slowness appears to be in the loading of the orchestration, which runs under the BizTalk service, not the HIS/HIP service, which is what I recycled.
The fear is that when we go live, the first user in the morning (or after a "cold spell") will get the timeout. The second time they try it after the timeout, it is always fast.
I've done a few tests by restarting each of the following:
1) BizTalk services
2) IIS
3) HIS/HIP Transaction Integrator (HIP Service)
Restarting any one of them tends to cause about a 20-second delay.
Restarting all three is like the kiss of death - about a 60-second delay before the first trace appears from the orchestration.
The HIP program always produces its first trace quickly, even when the HIP service is restarted. I'm not sure why restarting HIP slows down the startup of the orchestration.
Thanks,
Neal Walters
I have seen this kind of behavior with the MQSeries adapter as well. After a period of inactivity, the COM+ components that enable communication with MQSeries shut down.
What we had was a 10-minute timer that forced some sort of keep-alive message. I don't know whether you have a non-destructive call that can be sent, or whether you can build one into the system just for this purpose.
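In our case it was nothing more elaborate than a timer in a small host process, roughly like the sketch below (the proxy type and the Ping operation are placeholders for whatever non-destructive call your system exposes):

using System;
using System.Threading;

public class KeepAlive : IDisposable
{
    private readonly Timer _timer;

    public KeepAlive()
    {
        // Fire every 10 minutes so the adapter's COM+ components never sit idle long enough to unload.
        _timer = new Timer(_ => Ping(), null, TimeSpan.Zero, TimeSpan.FromMinutes(10));
    }

    private static void Ping()
    {
        try
        {
            using (var client = new OrchestrationServiceClient())  // placeholder: generated WCF proxy
            {
                client.Ping();                                      // placeholder: harmless no-op operation
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Keep-alive call failed: " + ex.Message);
        }
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}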
I had the same problem with a BizTalk flow that needs to respond within 2 seconds: when it had been unused for some time, reloading the DLL into the cache caused a timeout.
We found a solution in Microsoft's Orchestration Engine Configuration documentation, which explains how to avoid unloading of the DLLs:
Using the SecondsIdleBeforeShutdown and SecondsEmptyBeforeShutdown options from AppDomainSpecs, and assigning the desired DLLs in the ExactAssignmentRules or PatternAssignmentRules sections, you can keep your DLLs permanently loaded, and perhaps avoid the need for the keep-alive caller application.
Take into account that if you restart the BizTalk host, the DLL will have to be loaded again.
