I have created a WPF service for tracking a user session. While tracking the user session I also want to track service crash events. For that I have been checking the Windows event log and identifying the error. But I am confused: it shows an error there saying that it failed to process a session change.
Is that a service crash? Is there a specific exception code for a service crash?
Can anyone suggest relevant articles or pointers for identifying a system crash?
Not a crash, you are just seeing the .NET framework's ServiceBase class doing its job. In a few specific cases it will catch an exception and create an entry in the application event log. It does so in the code that causes the OnStart(), OnStop(), and similar methods to run.
Looks like the service's OnSessionChange() method fell over, just a bog-standard file locking error. In all likelihood the service code is a bit clumsy, it needs to open that file in its Main() method so nobody else can mess with it. Probably wasn't tested really well, OnSessionChange() does not fire very often. And certainly little reason to try to log anything, but who knows.
This should not otherwise affect the service process, the service control manager doesn't give much of a hoot if the OnSessionChange notification fails. Nothing it can do about it. So you are seeing this mostly because you started looking, services do tend to misbehave without anybody noticing. It just isn't very visible that they do. Do make sure it wasn't your code that put a lock on the Log.txt file. If it was, then you'll have to open the file with FileShare.ReadWrite to prevent the service from falling over.
I have a service which can automatically update itself. It does so by downloading and running the installer/updater, which is another executable. That executable stops the service with the ServiceController class, makes sure it is stopped using WaitForStatus(ServiceControllerStatus.Stopped), and then copies the relevant files. Those files include the service's main assembly and its dependencies.
Sometimes, the installation works as expected, but sometimes, I get an IOException telling me that it cannot access one of the service's assemblies because it is being used by another process (presumably the service which hasn't completely shut down). To remedy this I tried adding a fairly large delay of 1000ms after the WaitForStatus call, before starting to copy the files, but the IOException still gets thrown (or not) at random, i.e. sometimes the update is successful and sometimes it isn't.
I then tried adding a call to Environment.Exit() at the end of the ServiceBase.OnStop implementation of my service, and the update seems to work all the time now. However, I can tell this is not good practice since when I try stopping my service from the SCM, it stops, but gives the error Service process closed unexpectedly.
So what is the best way to do what I am trying to do? I could increase the delay, but it seems to me that 1000ms should be ample time for the service to properly shut down and release its lock on its assemblies. Perhaps I am doing something else incorrectly.
I'll write about what I did to solve this problem. When a file can't be copied because of that exception, the process enters a loop in which it waits 1000ms and tries to copy that file again. It does so up to 5 times, and if it still can't copy the file after the fifth attempt, the installer fails. In practice, from the log information I am receiving, it can take up to 3 seconds for a service process to shut down properly. I think this is the best solution for my problem.
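The retry loop described above can be sketched as follows. This is a minimal, language-neutral illustration in Python rather than the poster's actual installer code; the function name copy_with_retry and its parameters are hypothetical, and the defaults (5 attempts, 1 second apart) are taken from the description.

```python
import shutil
import time


def copy_with_retry(src, dst, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Copy src to dst, retrying while the copy raises OSError.

    Returns the number of attempts used; re-raises the last error if
    every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            shutil.copyfile(src, dst)
            return attempt
        except OSError:
            if attempt == attempts:
                raise  # give up: the caller (the installer) should fail
            sleep(delay_seconds)  # give the old process time to exit
```

The sleep function is a parameter only so that the delay can be replaced in tests. A locked file on Windows surfaces in Python as PermissionError, which is a subclass of OSError, so it is covered by the same handler.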
We're developing a video game that has, like any application, bugs that can on occasion cause hard crashes. Unfortunately a number of the crashes we've cataloged so far are out of our control in terms of being able to solve them or work around them, due to the closed source middleware we're using (Unity 3D).
Whilst we can hope and wait for the middleware developer to fix the problem, we'd like to see if it's possible to at least make the crashes more informative and user friendly. For example, one of the rare crashes our users can have is that certain AV products cause some kind of thread context race condition and cause the game to explode. We'd like to be able to detect the crash and error signature, and provide the user with a link to our wiki or forums on how to resolve it (if possible).
Is it possible to write a lightweight watchdog process or parent process that can respond to crash events on the Windows platforms?
Collecting crash dumps outside the crashing process is essential. You never know whether your unhandled exception handler is affected or not. But there are other options:
Enable WER LocalDumps and write a watchdog (FileSystemWatcher) for the directory where the dumps are stored.
Configure AeDebug and attach your own debugger at the time of the crash.
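As a rough illustration of the first option, the sketch below polls a dump directory for new .dmp files. It is written in Python and uses plain polling instead of .NET's FileSystemWatcher; the function name find_new_dumps is hypothetical, and the directory is assumed to be whatever folder WER LocalDumps has been configured to write to.

```python
import os


def find_new_dumps(dump_dir, already_seen):
    """Return sorted paths of .dmp files in dump_dir not yet in already_seen.

    already_seen is a set that is updated in place, so repeated calls
    report each dump only once.
    """
    new_paths = []
    for name in sorted(os.listdir(dump_dir)):
        path = os.path.join(dump_dir, name)
        if name.lower().endswith(".dmp") and path not in already_seen:
            already_seen.add(path)
            new_paths.append(path)
    return new_paths
```

A watchdog would call this every few seconds with the same set and react to each returned path, for example by showing the user a help link. A dump file may still be in the process of being written when it first appears, so a real watchdog would probably wait until the file size stops changing before reading it.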
Suppose I have an unhandled exception (or a known serious, unrecoverable error). The scariest situation is a security breach, but it could apply to anything that means my state is so badly hosed I can't expect to continue safely.
What do I do?
In a traditional application, the usual technique is to end my process as soon as possible. I'm calling Process.Exit, TerminateProcess, die, or whatever other tool the environment has that means "END. NOW". Eric Lippert's post expresses the reasoning for this attitude well.
In a production ASP.NET application running on IIS, it's not so simple. I can certainly end the current process and cough an error to the event log or wherever. That's essentially what happens with any unhandled exception. But the next time a request comes in, IIS is just going to spin up a new worker process. If my fatal error was a transient problem that's great.
But if my problem persists past the lifetime of my process, the new one won't be any better. It could even be compounded by the initialization code or a reattempt. Plus, if IIS is running multiple worker processes within the same application pool, even killing my process doesn't kill the application. Logically speaking all those other workers may be hosed too and just not know it yet.
So far I've only come up with two options.
End the process and hope for the best. Knowing that the app will just be restarted, this is pretty much the same as "catch(Exception) {}". Hardly satisfying.
"Reaching out" to tell IIS to disable the app, stop IIS, the machine, etc. This seems like a brutal hack. Moreover I'd guess it's likely to require elevated security credentials. During termination of a possibly-compromised process seems like a poor time to have those.
What I can think of is the following:
You can use the application pool's advanced setting in IIS named "Rapid-Fail Protection": set the Failure Interval as long as you like and set Maximum Failures to 1, then throw the exception so that IIS concludes the application pool can't work correctly and either sends Service Unavailable back to the client or resets the connection (depending on your settings). For more detail, see: Failure Settings for an Application Pool. However, you need to be very careful not to overdo it: the application must handle all other exceptions properly so that only the ones meant to terminate the application are ever seen by IIS; otherwise a single user click could bring down your site.
Another solution is to do it in your own code: record the error in some way, such as creating a file named SystemCrashed, then terminate the application. In Application_Start, check whether the file exists and, if it does, do nothing but terminate the application. It works something like a lock. This needs more code but may be safer than the IIS settings, since there is little risk of overkill as long as you remember to remove the lock.
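The marker-file idea can be sketched roughly like this. The example is in Python purely for illustration (in ASP.NET the check would sit in the application's startup handler); the three function names are hypothetical, and the marker path is whatever location the application chooses for a file such as SystemCrashed.

```python
import os


def record_fatal_error(marker_path, reason):
    """Write the marker file just before the process terminates."""
    with open(marker_path, "w", encoding="utf-8") as f:
        f.write(reason)


def may_start(marker_path):
    """Return True if no marker exists, i.e. startup may proceed."""
    return not os.path.exists(marker_path)


def clear_fatal_error(marker_path):
    """Remove the marker once an operator has fixed the underlying problem."""
    if os.path.exists(marker_path):
        os.remove(marker_path)
```

At startup the application would call may_start and refuse to initialize when it returns False. Clearing the marker is deliberately a separate, manual step, so that a restart by itself never re-enables a possibly compromised application.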
I am currently writing a Windows service that runs entirely in the background and does something every day. My idea is that the service should be very stable, so if something goes wrong it should not stop but try again the next day and, of course, log the exception. Can you suggest any best practices for making truly stable Windows services?
I have read Scott Hanselman's article on exception handling best practices, where he writes that there are only a few cases when you should swallow an exception. I think that a Windows service is one of those few cases, but I would be happy to get some confirmation on that.
'Swallowing' an exception is different to 'abandoning a specific task without stopping the entire process'.
In our windows service, we catch exceptions, log their details, then gracefully degrade that task and wait for the next task. We can then use the log to troubleshoot the error while the server is still running.
The question you should be asking is whether your Windows service should be fault tolerant. Remember that any unhandled exception will bring the service down, which results in its immediate unavailability. How do you think your service should behave? Should it try to continue servicing whatever it needs to? Should it be terminated?
Actually, if you have an unexpected exception that is passed all the way to the top level of your service, you should not continue processing; log it and propagate it. If you truly need a "reliable" service, then you'll need a "watchdog" that restarts the original service when it exits.
Note that modern operating systems act as a watchdog, so you don't need a watchdog service in most cases (check out the "Recovery" tab under your Service properties). Historically, critical services would have a second "watchdog" service whose sole purpose is to restart the real service if it fails.
It sounds like your design may be able to make use of the scheduler; just let Windows take care of the "once a day" part and just have your service do the task a single time. If it fails, fine; Windows is responsible for starting it again the next day.
One final note: this level of reliability in a service is rarely needed. In commercial code, I've only seen it used in a couple of antivirus programs and a network filtering program (that had to be running or else all network communication would fail). I've done a couple "watchdog" programs myself, but these were for customers like auto companies who would lose tons of money when their assembly line systems went down. In addition to the software watchdog, these systems also had redundant power supplies, RAIDed hot-swappable hard drives, and a complete duplicate of the entire system for use as an automatic failover.
Just saying: you may want to reconsider how much you really need to increase reliability (keeping in mind that 100% reliability is impossible; it can only be approached, at exponential cost).
In my opinion, you should establish a strong distinction between unrecoverable and recoverable exceptions, i.e., exceptions that prevent the continuation of your service (if your "static" data structures are corrupted) and exceptions that only cause the current operation to fail. To make the distinction clear you may need separate exception class hierarchies.
This distinction should go along with a strong distinction between the structures of the "supervisor" part of the service (the one that schedules the periodic action) and the part of the service that actually does such periodic action. In case of a recoverable exception, you could abort the running operation and completely reset this last part, obviously logging all the details of the exception to the system event log; on the other hand, if you got an unrecoverable error (supervisor's structures in an inconsistent state and SEH exceptions, of course) you should just log your error and exit, since continuing running in an inconsistent state is much more dangerous than not running at all.
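A minimal sketch of this split, written in Python for illustration: the class names RecoverableError and UnrecoverableError and the function run_once are hypothetical, and the supervisor here is reduced to a single call that either abandons the current operation or lets the error end the process.

```python
import logging

log = logging.getLogger("service")


class RecoverableError(Exception):
    """Only the current operation failed; the supervisor state is intact."""


class UnrecoverableError(Exception):
    """Shared state is suspect; the process should log and exit."""


def run_once(task):
    """Run one periodic operation under the supervisor.

    Returns True on success and False if the operation was abandoned.
    UnrecoverableError (and anything unexpected) is logged and re-raised
    so that the process ends instead of running in an inconsistent state.
    """
    try:
        task()
        return True
    except RecoverableError:
        log.exception("operation failed; will try again at the next interval")
        return False
    except Exception:
        log.exception("unrecoverable error; shutting down")
        raise
```

Treating every exception that is not explicitly marked recoverable as fatal is a design choice in this sketch: it keeps the list of "safe to continue" cases short and explicit, at the cost of more restarts while that list is still being discovered.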
Swallowing exceptions is rarely a good idea and as Scott says in his article, there really are only a few valid cases where it might be the best option.
My advice would be, firstly, to know what exceptions you're catching and catch those specifically. It'll be more useful to you in the future if you know what you're catching rather than the generic (Exception e).
Once you've caught the exception then, as you stated above, write it to a logging service, perhaps email the details to the maintainer of the code, or even fire off another event that sets up a retry of the code, with a limit on the number of attempts before a new message is issued to the code maintainer.
By catching specific exceptions you can do specific things about them. You can also catch the general exception to ensure that exceptions you really didn't expect don't cause a complete system failure.
Once you know about exceptions you weren't aware of before, these can then be refactored into the next release with a more ideal way of handling them.
Like so many things in software development rarely does "one size fit all". If you deem it appropriate to swallow the exception with the intention of retrying at a later date then that's perfectly reasonable. What really does matter is that you clean up after yourself, log and determine a reasonable retry policy before notifying someone.
The Exception Handling Block of the Enterprise Library may prove useful as you can modify your exception policy within config without changing the code.
A service should never stop. There are two classes of errors, errors in the Service itself, and errors in data provided to the service. Data Errors should be reported but not ignored. These two goals can be accomplished by having the service log errors, by providing a way to transmit error information to the user, and by having the service retry the failure after the user (or programmer in the case of an error in the service) has corrected what caused the service to fail (obviously the service will have to be stopped, re-installed, and re-started if a program error is corrected).
I hate asking questions like this - they're so undefined... and undefinable, but here goes.
Background:
I've got a DLL that is the guts of an application that is a timed process. My timer receives a configuration for the interval at which it runs and a delegate that should be run when the interval elapses. I've got another DLL that contains the process that I inject.
I created two applications, one Windows Service and one Console Application. Each of the applications read their own configuration file and load the same libraries pushing the configured timer interval and delegate into my timed process class.
Problem:
Yesterday and for the last n weeks, everything was working fine in our production environment using the Windows Service. Today, the Windows Service will run for a period of around 20-30 minutes and then hangs (with a timer interval of 30 seconds), but the console application runs without issue and has for the past 4 hours. Detailed logging doesn't indicate any failure. It's as if the Windows Service just...dies quietly - without stopping.
Given that my Windows Service and Console Applications are doing the exact same thing, I can only think that there is something that is causing the Windows Service process to hang - but I have no idea what could be causing that. I've checked the configuration files, and they're both identical - I even copied and pasted the contents of one into the other just to be sure. No dice.
Can anyone make suggestions as to what might cause a Windows Service to hang, when a counterpart Console Application using the same base libraries doesn't; or can anyone point me in the direction of tools that would allow me to diagnose what could be causing this issue?
Thanks for everyone's help - still digging.
You need to figure out what changed on the production server. At first, the IT guys responsible will swear that nothing changed, but you have to be persistent. I've seen this happen so often I've lost count. Software doesn't spoil. Period. The change must have been to the environment.
Difference in execution: You have two apps running the same code. The most likely difference (and culprit) is that the service is running with a different set of security credentials than your console app and might fall victim to security vagaries. Check on that first. Which Windows account is running the service? What is its role and scope? Is there any 3rd party security software running on the server and perhaps killing errant apps? Do you have to register your service with a 3rd party security service? Is your .NET assembly properly signed? Are your .NET assemblies properly registered and configured on the server? Last but not least, don't forget that a debugger user, which you most likely are, gets away with a lot more stuff than many other account types.
Another thought: Since timing seems to be part of the issues, check the scheduled tasks on the machine. Perhaps there's a process that is set to go off every 30 minutes that is interfering with your own.
You can debug a Windows service by running it interactively within Visual Studio. This may help you to isolate the problem by setting (perhaps conditional) breakpoints.
Alternatively, you can use the Visual Studio "Attach to process" dialog window to find the service process and attach to it with the "Debug CLR" option enabled. Again this allows you to set breakpoints as needed.
Are you using any assertions? If an assertion fires without being re-directed to write to a log file, your service will hang. If the code throws an unhandled exception, perhaps because of a memory leak, then your service process will crash. If you set the Service Control Manager (SCM) to restart your process in the event of a crash, you should be able to see that the service has been restarted. As you have identical code running in both environments, these two situations don't seem likely. But remember that your service is being hosted by the SCM, which means a very different environment to the one in which your console app is running.
I often use a "heartbeat", where each active thread in the service sends a regular (say every 30 seconds) message to a local MSMQ. This enables manual or automated monitoring, and should give you some clues when these heartbeat messages stop arriving.
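The heartbeat idea can be sketched as below. This Python example swaps the local MSMQ queue for an in-memory dictionary to stay self-contained, so it only shows the bookkeeping; in a real setup the monitor would have to live outside the service process, otherwise it hangs along with it. The class name HeartbeatMonitor and the 90-second timeout are assumptions.

```python
import threading
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat time per worker name."""

    def __init__(self, timeout_seconds=90.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}
        self._lock = threading.Lock()  # beat() is called from worker threads

    def beat(self, worker_name):
        """Called by a worker (say every 30 seconds) to report it is alive."""
        with self._lock:
            self._last_seen[worker_name] = self._clock()

    def stale_workers(self):
        """Return the sorted names of workers whose heartbeat is overdue."""
        with self._lock:
            now = self._clock()
            return sorted(
                name for name, seen in self._last_seen.items()
                if now - seen > self._timeout
            )
```

Each worker thread calls beat with its own name from inside its main loop; whatever does the monitoring calls stale_workers periodically. The name of the first worker to go stale, together with the time of its last beat, narrows down where and when the hang began.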
Another possibility is some sort of permissions problem, because the service is probably running under a different local/domain user than the console application.
After the hang, can you use the SCM to stop the service? If you can't, then there is probably some sort of thread deadlock problem. After the service appears to hang, you can go to a command-line and type sc queryex servicename. This should give you the current STATE of the service.
I would probably put in some file logging just to see how far the program is getting. It may give you a better idea of what is looping/hanging/deadlocked/crashing.
You can try these techniques
Logging: start logging the flow of the code in the service. Make this configurable through a parameter so you don't have a deluge after you are done. You should log all function names, parameters, and timestamps.
Attach Debugger: locally or remotely attach a debugger with the code to the running service, and set appropriate breakpoints (these can be based on the data gathered from logging).
PerfMon: run this utility and gather information about the machine that the service is running on for any additional clues (high CPU spikes, IO spikes, excessive paging, etc.).
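For the logging technique, a switchable trace might look like the following sketch. It is in Python for illustration only; the decorator name traced and the logger name "service.trace" are hypothetical. The log level plays the role of the parameter that turns the deluge on and off.

```python
import functools
import logging

log = logging.getLogger("service.trace")


def traced(func):
    """Log the function name and arguments on every call at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        log.debug("leave %s", func.__name__)
        return result
    return wrapper
```

Timestamps come from the logging configuration rather than from the decorator, for example a handler format containing %(asctime)s. Raising the logger's level from DEBUG to INFO silences the trace without touching the decorated code. If a hang occurs, an "enter" line with no matching "leave" line points at the call that never returned.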
Microsoft provides a good resource on debugging a Windows Service. That essentially sounds like what you'd have to do given that your question is so generic. With that said, have any changes been made to the system over the last few days that could adversely affect the service? Have you made any updates to the code that change the way the service might possibly work?
Again, I think you're going to have to do some serious debugging to find your problem.
What type of timer are you using in the Windows service? I've seen numerous people on SO have problems with timers and Windows services. Here is a good tutorial just to make sure you are setting it up correctly and using the right type of timer. Hope that helps.
Another potential problem in reference to psasik's answer is if your application is relying on something only available when being run in User Mode.
Running in service mode runs in (is it desktop0?) which can cause some issues in my experience if you are trying to determine states of something that can only be seen in user mode.
Smells like a threading issue to me. Is there any threading or async work being done at all? One crucial question is "does the service hang on the same line of code or same method every time?" Use your logging to find out the last thing that happens before a hang, and if so, post the problem code.
One other tool you may consider is a good profiler. If it is .NET code, I believe RedGate ANTS can monitor it and give you a good picture of any threadlock scenarios.