We have a homegrown application that exposes a WCF service over NetTcpBinding. It works fine most of the time, but from time to time it starts creating .NET threads as fast as it can. The CLR dies quickly, yet the process keeps creating threads until it runs out of memory or CPU, over 10,000 at times. Of course, the application was dead long before it gets to that point. We recycle the Windows service it runs as and it goes back to normal until the next time.
We have Dynatrace, and it catches the increase in thread creation, but the CLR crashes so early in the episode that we cannot get a thread dump of the problem. I will try to catch it earlier, but so far, no luck.
What other tools could we leave running all the time without impacting the application, that would help us determine why this is happening?
It may be many days between these events, and they might occur during the night when no one is monitoring, so automated information collection would be useful. But we will look into anything.
Also, any idea what might be causing this in a general way?
Thanks for any help you could provide.
I understand that an application domain forms "an isolation boundary for security, versioning, reliability, and unloading of managed code", but so does a process.
Can someone please help me understand the practical benefits of an application domain?
I assumed an app domain provides a container to load one version of an assembly, but recently I discovered that multiple versions of a strong-named assembly can be loaded into a single app domain.
My concept of an application domain is still not clear, and I am struggling to understand why this concept was introduced when the concept of a process already exists.
Thank you.
I can't tell if you are talking in general or specifically about .NET's AppDomain.
I am going to assume you mean .NET's AppDomain, and explain why it can be really useful when you need isolation inside a single process.
For instance:
Say you are dealing with a library that has certain worker classes, you have no choice but to use those workers, and you can't modify the code. It's your job to build a Windows Service that manages said workers, makes sure they all stay up and running, and runs them in parallel.
Easy enough, right? Well, you hoped. It turns out your worker library is prone to throwing exceptions, uses a static configuration, and is generally just a real PITA.
You could try to launch each worker in its own process, but to monitor them you'll need to implement named pipes or try to thoughtfully parse the STDIN and STDOUT of each process.
What else can you do? Well, AppDomain actually solves this. I can spawn an AppDomain for each worker and give each its own configuration. They can't screw each other up by changing static properties because they are isolated, and on top of that, if the library bombs out and I fail to catch the exception, it doesn't bother the workers in the other domains. And during all of this, I can still communicate with those workers easily.
Sadly, I have had to do this before
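Roughly, the shape of that setup is sketched below. WorkerProxy and the worker{N}.config file names are stand-ins for whatever the real library and deployment give you; the point is one AppDomain, one configuration file, and one proxy per worker.

    using System;

    // Stand-in for the real worker; it must derive from MarshalByRefObject (or be
    // wrapped in a type that does) so calls can cross the AppDomain boundary.
    public class WorkerProxy : MarshalByRefObject
    {
        public void Run(string jobName)
        {
            // Call into the problematic library here. Unhandled exceptions and
            // static-state changes stay inside this AppDomain.
            Console.WriteLine("Running {0} in {1}", jobName, AppDomain.CurrentDomain.FriendlyName);
        }
    }

    public static class WorkerHost
    {
        public static void Main()
        {
            for (int i = 0; i < 3; i++)
            {
                // Give each worker its own domain and its own config file.
                var setup = new AppDomainSetup
                {
                    ApplicationBase = AppDomain.CurrentDomain.BaseDirectory,
                    ConfigurationFile = string.Format("worker{0}.config", i) // assumed file name
                };
                AppDomain domain = AppDomain.CreateDomain("Worker" + i, null, setup);

                // Create the worker inside the new domain and get a remoting proxy back.
                var worker = (WorkerProxy)domain.CreateInstanceAndUnwrap(
                    typeof(WorkerProxy).Assembly.FullName,
                    typeof(WorkerProxy).FullName);
                worker.Run("job-" + i);

                // If a worker misbehaves, its whole domain can be torn down
                // without touching the rest of the process.
                AppDomain.Unload(domain);
            }
        }
    }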
EDIT: Started to write this as a comment response, but got too large
Individual processes can work great in many scenarios; however, there are times when they become a pain. I am not saying one should always use an AppDomain over another process. I think it's uncommon that you would need either a separate process or an AppDomain, but once you need it, you'll definitely know.
The main problem I see with processes in the scenario I've given above is that processes have their own downfalls that are easier to mitigate with an AppDomain.
A process can go rogue, become unresponsive, and crash or be killed at any point.
If you're managing processes, you need to keep track of the process ID and monitor the status of it. IPCs are great, but it does take time to get proper communication going back and forth as needed.
As an example, let's say your process just dies. Depending on the mechanism you chose for monitoring, maybe only the communication thread died, or perhaps the work finished and you still show it as "processing". What do you do?
Now what happens when you have 20 processes and your management app dies? You don't have any real information; all you have is 20 instances of "myprocess.exe", and you may now have to start parsing the command-line arguments they were started with to see which workers you actually have. Obviously with AppDomains all 20 would have died too, but did you really gain anything with the processes? You still have to code the ability to recover, except now you also have to code all of the recovery for your processes instead of just firing the workers back up.
As with anything in programming, there's 1,000 different ways to achieve the same goal. It's up to you to decide which solution you feel is most appropriate.
Some practical benefits of using app domains:
Multiple app domains can run in a single process, and you can stop an individual app domain without stopping the entire process. This alone drastically increases server scalability.
The app domain life cycle is managed programmatically by runtime hosts (you can override it as well). For processes and threads, you have to manage the life cycle explicitly: initialization, execution, termination, and inter-process/multithreaded communication are complex, which is why it is easier to defer all that to the CLR.
Source: https://learn.microsoft.com/en-us/dotnet/framework/app-domains/application-domains
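As a trivial illustration of the first point, here is a minimal console sketch; it relies on AppDomain.Unload, so it is .NET Framework only (the API is not supported on .NET Core/.NET 5+).

    using System;

    public static class Program
    {
        public static void Main()
        {
            // Two isolated domains inside one process.
            AppDomain first = AppDomain.CreateDomain("First");
            AppDomain second = AppDomain.CreateDomain("Second");

            // Tear one of them down; the process and the other domain keep running.
            AppDomain.Unload(second);

            Console.WriteLine("Process still alive, '{0}' is still loaded.", first.FriendlyName);
        }
    }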
[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows:
Load-balanced front-end web servers (IIS) running an MVC application (C#).
A home-grown service bus, implemented with MSMQ running in domain-integration mode.
Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus.
A back-end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware and run Windows Server 2012 with the latest releases and the latest Windows updates. Everything is bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with, with respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated, we do have these moments where performance is a problem, and it is proving difficult to track down at which point(s) in the architecture the bottleneck lies.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
However, we soon realised that we cannot rely on the Windows clock as an accurate measure, because many of our processing steps are down at the 5-100 ms level and a message can pass through five servers (and back again). We cannot synchronise the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
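To give an idea, the kind of instrumentation described above might look like the sketch below (the type and stage names are invented for illustration). The per-machine Stopwatch numbers are trustworthy; it is only the cross-machine comparison of wall-clock timestamps that is not, which is exactly the problem.

    using System;
    using System.Collections.Generic;

    // Hypothetical instrumentation payload carried inside each bus message.
    [Serializable]
    public class HopMetric
    {
        public string Server;         // machine that handled this hop
        public string Stage;          // e.g. "Dequeued", "SqlDone", "ReplySent"
        public DateTime UtcTimestamp; // wall-clock time on that machine (not comparable across servers at this resolution)
        public long ElapsedMs;        // time spent on this machine, from a local Stopwatch
    }

    [Serializable]
    public class InstrumentedRequest
    {
        public Guid CorrelationId;
        public List<HopMetric> Hops = new List<HopMetric>();

        public void RecordHop(string stage, long elapsedMs)
        {
            Hops.Add(new HopMetric
            {
                Server = Environment.MachineName,
                Stage = stage,
                UtcTimestamp = DateTime.UtcNow,
                ElapsedMs = elapsedMs
            });
        }
    }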
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurred when the software tested whether a queue had been created before opening it, so it was essentially looking up the queue twice, which is fairly expensive. With that redundant check gone, the issue has gone away.
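For anyone who hits the same thing, the difference looks roughly like this (the queue path is invented; MessageQueue.Exists performs its own lookup and is a comparatively expensive call):

    using System.Messaging;

    public static class QueueHelper
    {
        private const string Path = @".\private$\workerRequests"; // assumed queue path

        // The slow pattern: MessageQueue.Exists resolves the queue itself, so the
        // queue is effectively looked up twice on every open.
        public static MessageQueue OpenSlow()
        {
            if (!MessageQueue.Exists(Path))
            {
                MessageQueue.Create(Path);
            }
            return new MessageQueue(Path);
        }

        // Cheaper: create the queue once at deployment or service start-up and
        // simply open it afterwards.
        public static MessageQueue OpenFast()
        {
            return new MessageQueue(Path);
        }
    }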
What you should try is the Performance Monitor that's part of Windows itself. You can create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to watch.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.
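If you would rather capture the same kind of data from code instead of (or alongside) a Data Collector Set, a minimal sampler might look like the sketch below. The counters used here are standard Windows ones; swap in the ASP.NET counters you care about.

    using System;
    using System.Diagnostics;
    using System.Threading;

    public static class CounterSampler
    {
        public static void Main()
        {
            var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
            var mem = new PerformanceCounter("Memory", "Available MBytes");

            while (true)
            {
                // The first NextValue() of a rate counter returns 0, so sample in a loop.
                Console.WriteLine("{0:u}  CPU {1,6:F1}%  Free {2,6:F0} MB",
                    DateTime.UtcNow, cpu.NextValue(), mem.NextValue());
                Thread.Sleep(TimeSpan.FromSeconds(15));
            }
        }
    }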
I hate asking questions like this - they're so undefined... and undefinable, but here goes.
Background:
I've got a DLL that is the guts of an application that runs as a timed process. My timer receives a configuration value for the interval at which it runs and a delegate to run when the interval elapses. I've got another DLL that contains the process logic that I inject.
I created two applications, one Windows Service and one Console Application. Each application reads its own configuration file and loads the same libraries, pushing the configured timer interval and delegate into my timed-process class.
Problem:
Yesterday, and for the last n weeks, everything was working fine in our production environment using the Windows Service. Today, the Windows Service runs for around 20-30 minutes and then hangs (with a timer interval of 30 seconds), but the Console Application runs without issue and has for the past 4 hours. Detailed logging doesn't indicate any failure. It's as if the Windows Service just...dies quietly, without stopping.
Given that my Windows Service and Console Application are doing exactly the same thing, I can only think that something is causing the Windows Service process to hang, but I have no idea what that could be. I've checked the configuration files and they're identical; I even copied and pasted the contents of one into the other just to be sure. No dice.
Can anyone make suggestions as to what might cause a Windows Service to hang, when a counterpart Console Application using the same base libraries doesn't; or can anyone point me in the direction of tools that would allow me to diagnose what could be causing this issue?
Thanks for everyone's help - still digging.
You need to figure out what changed on the production server. At first, the IT guys responsible will swear that nothing changed, but you have to be persistent. I've seen this happen so often I've lost count. Software doesn't spoil. Period. The change must have been to the environment.
Difference in execution: you have two apps running the same code. The most likely difference (and culprit) is that the service is running with a different set of security credentials than your console app and might fall victim to security vagaries. Check on that first. Which Windows account is running the service? What is its role and scope? Is there any third-party security software running on the server, perhaps killing errant apps? Do you have to register your service with a third-party security service? Is your .NET assembly properly signed? Are your .NET assemblies properly registered and configured on the server? Last but not least, don't forget that a debugger user, which you most likely are, gets away with a lot more than many other account types.
Another thought: since timing seems to be part of the issue, check the scheduled tasks on the machine. Perhaps there's a process set to go off every 30 minutes that is interfering with your own.
You can debug a Windows service by running it interactively within Visual Studio. This may help you to isolate the problem by setting (perhaps conditional) breakpoints.
Alternatively, you can use the Visual Studio "Attach to process" dialog window to find the service process and attach to it with the "Debug CLR" option enabled. Again this allows you to set breakpoints as needed.
Are you using any assertions? If an assertion fires without being re-directed to write to a log file, your service will hang. If the code throws an unhandled exception, perhaps because of a memory leak, then your service process will crash. If you set the Service Control Manager (SCM) to restart your process in the event of a crash, you should be able to see that the service has been restarted. As you have identical code running in both environments, these two situations don't seem likely. But remember that your service is being hosted by the SCM, which means a very different environment to the one in which your console app is running.
I often use a "heartbeat", where each active thread in the service sends a regular (say every 30 seconds) message to a local MSMQ. This enables manual or automated monitoring, and should give you some clues when these heartbeat messages stop arriving.
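A rough sketch of that heartbeat idea (the queue path, interval and message format are just examples):

    using System;
    using System.Messaging;
    using System.Threading;

    // Each active thread in the service owns one of these; it drops a small message
    // on a local private queue every 30 seconds so an external monitor can notice
    // when the messages stop arriving.
    public class Heartbeat
    {
        private const string QueuePath = @".\private$\serviceHeartbeat";
        private readonly Timer _timer;
        private readonly string _threadName;

        public Heartbeat(string threadName)
        {
            _threadName = threadName;
            if (!MessageQueue.Exists(QueuePath))
            {
                MessageQueue.Create(QueuePath);
            }
            _timer = new Timer(Send, null, TimeSpan.Zero, TimeSpan.FromSeconds(30));
        }

        private void Send(object state)
        {
            using (var queue = new MessageQueue(QueuePath))
            {
                queue.Send(string.Format("{0} alive at {1:u}", _threadName, DateTime.UtcNow));
            }
        }
    }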
Another possibility is some sort of permissions problem, because the service is probably running under a different local/domain user than the console.
After the hang, can you use the SCM to stop the service? If you can't, then there is probably some sort of thread deadlock problem. After the service appears to hang, you can go to a command line and type sc queryex servicename. This should give you the current STATE of the service.
I would probably put in some file logging just to see how far the program is getting. It may give you a better idea of what is looping/hanging/deadlocked/crashing.
You can try these techniques:
Logging: start logging the flow of the code in the service. Make this configurable so you don't have a deluge of output after you are done. You should log function names, parameters, and timestamps.
Attach a debugger: locally or remotely attach a debugger (with the code) to the running service and set appropriate breakpoints, which can be based on the data gathered from logging.
PerfMon: run this utility and gather information about the machine the service is running on for any additional clues (high CPU spikes, I/O spikes, excessive paging, etc.).
Microsoft provides a good resource on debugging a Windows Service. That essentially sounds like what you'd have to do, given that your question is so generic. With that said, have any changes been made to the system over the last few days that could adversely affect the service? Have you made any updates to the code that change the way the service works?
Again, I think you're going to have to do some serious debugging to find your problem.
What type of timer are you using in the Windows service? I've seen numerous people on SO have problems with timers and Windows services. Here is a good tutorial just to make sure you are setting it up correctly and using the right type of timer. Hope that helps.
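For reference, the usual safe pattern in a service is a System.Timers.Timer (not a Forms timer, which needs a message pump) with the Elapsed handler wrapped in a try/catch, because that timer silently swallows exceptions thrown from the handler. A sketch, with the real work and logging left as placeholders:

    using System;
    using System.ServiceProcess;
    using System.Timers;

    public class TimedService : ServiceBase
    {
        private Timer _timer;

        protected override void OnStart(string[] args)
        {
            _timer = new Timer(30000); // 30-second interval, as in the question
            _timer.Elapsed += OnElapsed;
            _timer.AutoReset = true;
            _timer.Start();
        }

        private void OnElapsed(object sender, ElapsedEventArgs e)
        {
            try
            {
                // Invoke the injected delegate / do the real work here.
            }
            catch (Exception ex)
            {
                // Log it; otherwise the failure disappears without a trace.
                System.Diagnostics.Trace.TraceError(ex.ToString());
            }
        }

        protected override void OnStop()
        {
            _timer.Stop();
            _timer.Dispose();
        }
    }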
Another potential problem, in reference to psasik's answer, is if your application relies on something that is only available when it is run in an interactive user session.
Running as a service puts you in a different session (Session 0, if I recall correctly), which in my experience can cause issues if you are trying to determine the state of something that can only be seen from an interactive user session.
Smells like a threading issue to me. Is there any threading or async work being done at all? One crucial question is: does the service hang on the same line of code or in the same method every time? Use your logging to find out the last thing that happens before a hang, and if you can pin it down, post the problem code.
One other tool you may consider is a good profiler. If it is .NET code, I believe RedGate ANTS can monitor it and give you a good picture of any thread-deadlock scenarios.
I've been experiencing a high degree of flicker and UI lag in a small application I've developed to test a component that I've written for one of our applications. Because the flicker and lag were taking place during idle time (when there should, seriously, be nothing going on), I decided to do some investigating. I noticed a few threads in the Threads window that I wasn't aware of (not entirely unexpected), but what caught my eye was that one of the threads was set to Highest priority. This thread exists at the time Main() is called, even before any of my code executes. I've discovered that this thread appears to be present in every .NET application I write, even console applications.
Being the daring soul that I am, I decided to freeze the thread and see what happened. The flickering did indeed stop, but I experienced some oddness when it came to doing database interaction (I'm using SQL CE 3.5 SP1). My thought was that this might be the thread that the database is actually running on, but considering it's started at the time the application loads (before any references to the DB) and is present in other, non-database applications, I'm inclined to believe this isn't the case.
Because this thread (like a few others) shows up with no data in the Location column and no Call Stack listed if I switch to it in the debugger while paused, I tried matching the StartAddress property through GetCurrentProcess().Threads for the corresponding thread, but it falls outside all of the currently loaded modules address ranges.
Does anyone have any idea what this thread is, or how I might find out?
Edit
After doing some digging, it looks like the StartAddress is in kernel32.dll (based upon nearby memory contents). This leads me to think that this is just the standard system function used to start the thread, according to this page, which basically puts me back at square one as far as determining where this thread actually comes from. This is further confirmed by the fact that ALL of the threads in this list have the same value for StartAddress, leading me to ask exactly what the purpose is...?
Edit 2
Process Explorer led me to an actually meaningful start address. It looks like it's mscorwks.dll!CreateApplicationContext+0xbbef. This DLL is in %WINDOWS%\Microsoft.NET\Framework\v2.0.50, so it is clearly a runtime assembly. I'm still not sure why:
it's Highest priority
it appears to be causing hiccups in my application
You could try using Sysinternals Process Explorer; it lets you dig in pretty deep. Right-click on the process to access Properties, then open the "Threads" tab. In there, you can see the thread's stack and module.
EDIT:
After asking around some, it seems that your "Highest" priority thread is the Finalizer thread that runs due to a garbage collection. I still don't have a good reason as to why it would constantly keep running. Maybe you have some funky object lifetime behavior going on in your process?
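If it does turn out to be finalizer pressure, the usual culprit is finalizable objects that never get disposed, so work keeps piling up on the finalizer thread. The standard dispose pattern keeps objects off that queue; a generic example with a hypothetical type (not from the poster's code):

    using System;

    public class NativeHandleOwner : IDisposable
    {
        private IntPtr _handle = new IntPtr(42); // stand-in for a real native handle
        private bool _disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this); // nothing left for the finalizer thread to do
        }

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed) return;
            // Release _handle here.
            _handle = IntPtr.Zero;
            _disposed = true;
        }

        ~NativeHandleOwner()
        {
            // Runs on the finalizer thread only if Dispose was never called.
            Dispose(false);
        }
    }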
I'm not sure what this is, but if you turn on unmanaged debugging, and set up Visual Studio with the Windows symbol server, you might get some more clues.
Might be the Garbage Collector thread. I noticed it too when I was once investigating a finalizer-related bug. Perhaps your system memory is low and the GC is trying to collect all the time? This was the case in the previously mentioned bug too. I couldn't reproduce it on my machine, but a co-worker of mine had a machine with less RAM where it would reappear like clockwork.
I know this is probably the canonical "It depends..." question but I'd appreciate any pointers as to where to start looking.
I have a client/server app talking over Ethernet. On one computer I run the server and a client, and on another just a client. One runs Vista and one runs XP. After an uptime of about 3 weeks the entire computer freezes and nothing works: not the mouse, not the keyboard, nothing; the only option is to power off. Every ten seconds the server sends a ping message to see if the clients are alive; other than that, just a few small messages go back and forth every day.
I'm trying to find out if it's me causing it or something else. I've started a monitoring session, and after a few days I thought I'd check for strange increases in memory use, but beyond that I have very few ideas.
Some thoughts to consider:
You know the computer doesn't respond, but that doesn't mean it's hung. Does it respond to a ping?
Maybe the disk activity light is on all the time?
You say "no keyboard" - do you mean no caps lock or num lock lights?
Although the .NET application may be the only one you're running at the time, that does not imply it is the cause of the problem. Some background job could be doing it.
For example, I notice that Retrospect backup, when it is creating a snapshot, freezes the entire system for 10-15 minutes. I mean, no caps lock, the clock in the task bar doesn't update, no CTRL-ALT-DEL, can't type into an "Answer" text box in SO, nothing. It had nothing to do with what I was doing at the time, which was answering a question on SO.
After it came back, SO asked if I was a human. My feelings were hurt. ;-)
You could attach a kernel debugger to the OS. That way you should be able to inspect the state of the OS and your process even if the OS is completely unresponsive. (Unfortunately, it's a lot harder than just hitting "break" in VS. I suggest reading John Robbins's "Debugging Applications for .NET and Windows" before trying that.)
You could also try to create memory dumps of your application at regular intervals. You might have to do a little scripting for that, though. (Usually you'd create a dump with a keystroke, using a tool like userdump or adplus, but if the OS is not responding to keystrokes, that won't work.) That way, you know what state your process was in during or shortly before a hang.
This page: http://blogs.msdn.com/debuggingtoolbox/default.aspx is a good starting point for scripting WinDbg. (If you don't know what to do with a memory dump, I'd again suggest John Robbins's excellent book on debugging!)
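If scripting userdump or adplus turns out to be awkward, another option is to have the process write its own minidump on a timer via the documented dbghelp MiniDumpWriteDump export. A rough sketch (the dump path and schedule are up to you, and dumping a live process from inside itself is best-effort; a small helper process is more robust):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Runtime.InteropServices;

    public static class DumpWriter
    {
        // Documented export from dbghelp.dll; dump type 2 is MiniDumpWithFullMemory.
        [DllImport("dbghelp.dll", SetLastError = true)]
        private static extern bool MiniDumpWriteDump(
            IntPtr hProcess, uint processId, IntPtr hFile, int dumpType,
            IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

        public static void WriteDump(string path)
        {
            using (var process = Process.GetCurrentProcess())
            using (var file = new FileStream(path, FileMode.Create, FileAccess.Write))
            {
                MiniDumpWriteDump(process.Handle, (uint)process.Id,
                    file.SafeFileHandle.DangerousGetHandle(),
                    2 /* MiniDumpWithFullMemory */, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
            }
        }
    }

    // e.g. from a timer:
    // DumpWriter.WriteDump(string.Format(@"c:\dumps\app_{0:yyyyMMdd_HHmmss}.dmp", DateTime.UtcNow));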
Other than that, I can only think of standard debugging tricks: does the problem occur on every PC? Does it happen if there are no client requests? Does it happen sooner if there are more client requests? Does it happen sooner if there is less available physical memory? Try removing parts of your application (maybe on a separate server for testing) and see if the problem still occurs, and so on. Try running it in a VM so you can see if it uses the CPU, harddisk, or network during those "hangs".
This isn't going to be the answer, but I'd advise starting by checking your OS event logs and running perfmon to keep track of memory, CPU usage, etc.
Which computer freezes, the server or client? And what OSes are they running respectively?
As Daniel L noted, tight polling loops can really kill the CPU. If you can, change your code to use event handlers; it's a much more robust solution.
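A generic illustration of the difference (file watching stands in here for whatever the code is actually waiting on):

    using System;
    using System.IO;
    using System.Threading;

    public static class WatcherExample
    {
        public static void Main()
        {
            // Tight polling burns a CPU core checking for work that isn't there:
            // while (true)
            // {
            //     if (File.Exists(@"c:\inbox\work.txt")) { /* process it */ }
            // }

            // Event-driven: the thread sleeps until the OS signals that something changed.
            using (var watcher = new FileSystemWatcher(@"c:\inbox", "*.txt"))
            {
                watcher.Created += (s, e) => Console.WriteLine("Process {0}", e.FullPath);
                watcher.EnableRaisingEvents = true;
                Thread.Sleep(Timeout.Infinite); // keep the process alive
            }
        }
    }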
Finally, are you certain there's not a hardware problem on the freezing computer?