Windows NT Service shutdown issues - C#

I have developed middleware that provides RPC functionality to multiple client applications on multiple platforms within our organization. The middleware is written in C# and runs as a Windows NT Service. It handles things like file access to network shares, database access, etc. The middleware is hosted on two high end systems running Windows Server 2008.
When one of our server administrators reboots the machine, primarily to apply Windows Updates, there are serious problems with how the system behaves with regard to my NT Service. My service is designed to immediately stop listening for new connections, immediately start refusing new requests on existing connections, and otherwise shut down as rapidly as possible when it receives an OnStop or OnShutdown request from the SCM. Still, to maintain system integrity, operations that are currently in progress are allowed to continue for a reasonable time. Usually the service shuts down within 30 seconds (when the service is manually stopped, for example). However, when the system is instructed to restart, my service immediately loses access to network drives and UNC paths, causing data integrity problems for any open files and partial writes to those locations. My service does list Workstation (and thus the SMB Redirector) as a dependency, so I would think that my service would need to be stopped prior to Workstation/Redirector being stopped if Windows were honoring those dependencies.
Basically, my application is forced to crash and burn, failing remote procedure calls and eventually being forced to terminate by the operating system after a timeout period has elapsed (seems to be on the order of 20-30 seconds).
Unlike a Windows application, my Windows NT Service doesn't seem to have any power to stop a system shutdown in progress, delay the system shutdown, or even just get the opportunity to save out any pending network-share writes before being forcibly disconnected and shut down. How is an NT Service developer supposed to have any kind of application integrity in this environment? Why do Forms applications get every opportunity to finish their business prior to shutdown, while services seem to get no such benefit?
I have tried:
Calling SetProcessShutdownParameters via P/Invoke (see the sketch after this list) to try to notify my application of the shutdown sooner, to avoid the Redirector shutting down before I do.
Calling ServiceBase.RequestAdditionalTime with a value less than or equal to the two minute limit.
Tweaking the WaitToKillServiceTimeout registry value.
Everything I can think of to make my service shut down faster.
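For reference, here is a minimal P/Invoke declaration for the SetProcessShutdownParameters attempt above. This is only a sketch of the call the question mentions (and, as noted, it did not solve the problem):

```csharp
using System.Runtime.InteropServices;

static class ShutdownInterop
{
    // SetProcessShutdownParameters(dwLevel, dwFlags): higher levels are
    // notified earlier during shutdown; 0x3FF is the top of the range
    // available to applications.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool SetProcessShutdownParameters(uint dwLevel, uint dwFlags);
}

// e.g. ShutdownInterop.SetProcessShutdownParameters(0x3FF, 0);
```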
But in the end, I still get ~30 seconds of problematic time in which my service doesn't even seem to have been notified of an OnShutdown event yet, but requests are failing because the Redirector is no longer servicing my network-share requests.
How is this issue meant to be resolved? What can I do to delay or stop the shutdown, or at least be allowed to shut down my active tasks without the Redirector services disappearing out from under me? I can understand what Microsoft is trying to do to prevent services from dragging their feet and slowing shutdowns, but that seems like a great goal for Windows client operating systems, not for servers. I don't want my servers to shut down fast; I want operational integrity and graceful shutdowns.
Thanks in advance for any help you can provide.
UPDATE: I've done the exact timings. In a test shutdown, I got shutdown notification at 23:55:58 and noticed losing network share connectivity at 23:56:02. So within four seconds, I've lost the ability to save out any active state.

This question: https://serverfault.com/questions/34427/windows-service-dependencies on ServerFault should answer yours. It links to this article: http://blogs.technet.com/askperf/archive/2008/02/04/ws2008-service-shutdown-and-crash-handling.aspx, which should help you get pre-shutdown notification and control service shutdown ordering.
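To make that concrete, here is a minimal sketch of one commonly cited way to opt in to the pre-shutdown control from managed code. ServiceBase (at least in .NET Framework 2.0-3.5) does not expose SERVICE_ACCEPT_PRESHUTDOWN, so this pokes a private field via reflection and handles the control code in OnCustomCommand. Treat it as a fragile workaround to verify against your framework version, not an official API:

```csharp
using System.Reflection;
using System.ServiceProcess;

public class MyMiddlewareService : ServiceBase
{
    // Values from the Win32 service API (winsvc.h).
    private const int SERVICE_ACCEPT_PRESHUTDOWN = 0x100;
    private const int SERVICE_CONTROL_PRESHUTDOWN = 0x0F;

    public MyMiddlewareService()
    {
        // ServiceBase keeps its accepted-controls mask in a private field
        // named "acceptedCommands"; this may change between framework versions.
        FieldInfo acceptedCommands = typeof(ServiceBase).GetField(
            "acceptedCommands", BindingFlags.Instance | BindingFlags.NonPublic);
        if (acceptedCommands != null)
        {
            int value = (int)acceptedCommands.GetValue(this);
            acceptedCommands.SetValue(this, value | SERVICE_ACCEPT_PRESHUTDOWN);
        }
    }

    protected override void OnCustomCommand(int command)
    {
        // ServiceBase routes unrecognized control codes here, so the
        // pre-shutdown control arrives before the normal shutdown sequence.
        if (command == SERVICE_CONTROL_PRESHUTDOWN)
        {
            // Flush pending writes and release network-share handles here,
            // before the Workstation/Redirector services are stopped.
        }
        else
        {
            base.OnCustomCommand(command);
        }
    }
}
```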

Related

Heartbeat activity for Windows Service

I have many Windows Services (written in C# 4.0) that at various intervals connect to a database and perform various complex tasks. Some of these tasks only run every X hours. However, the server support team would like to know whether each Windows Service is actually running, as there can be a big interval between tasks. They would essentially like a heartbeat for each Windows Service: every 5 minutes the service would do something that could be monitored by Microsoft System Center Operations Manager (SCOM). Any solution must be easily monitored by SCOM, as the server support team relies on it.
What is the best way to achieve this type of functionality? I thought of using performance counters and having SCOM listen for those, but I'm not sure if that is the best use of perf counters.
UPDATE (totally forgot to include this): Currently each service writes this activity to a database, but SCOM is not very good at monitoring database records and differentiating which service is doing what, etc.
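For what it's worth, a rough sketch of the performance-counter idea mentioned in the question might look like the following. The category and counter names are made up here, and SCOM would need a rule watching that counter:

```csharp
using System.Diagnostics;

static class HeartbeatCounter
{
    // Hypothetical names; use whatever your SCOM rule is configured to watch.
    private const string Category = "My Windows Services";
    private const string Counter = "Heartbeat";

    // Run once with sufficient rights (e.g. from the service installer).
    public static void EnsureCounterExists()
    {
        if (!PerformanceCounterCategory.Exists(Category))
        {
            PerformanceCounterCategory.Create(
                Category, "Heartbeat counters for in-house services",
                PerformanceCounterCategoryType.SingleInstance,
                Counter, "Incremented every 5 minutes while the service runs");
        }
    }

    // Call from a 5-minute timer inside the service.
    public static void Beat()
    {
        using (var counter = new PerformanceCounter(Category, Counter, false))
        {
            counter.Increment();
        }
    }
}
```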
If your case is limited to a service health check, I am pretty sure this can be done with Windows tools such as Windows Performance Monitor. Collecting that data should be an easier task than creating heartbeats in a usually disconnected environment.
There are plenty of server and service monitoring tools. A few of them are open source, which you could choose as an initial step.
What about simply writing to a DB or event log?
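If you go the event-log route, a minimal heartbeat could be as simple as the sketch below. It assumes the event source has already been registered (for example by the service installer) and that SCOM is watching for the entry; the source name and event ID are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class EventLogHeartbeat
{
    private const string Source = "MyWindowsService";   // hypothetical, pre-registered source
    private Timer _timer;

    public void Start()
    {
        // Write an Information entry every 5 minutes for SCOM to pick up.
        _timer = new Timer(
            _ => EventLog.WriteEntry(Source, "heartbeat",
                                     EventLogEntryType.Information, 1000),
            null, TimeSpan.Zero, TimeSpan.FromMinutes(5));
    }

    public void Stop()
    {
        if (_timer != null) _timer.Dispose();
    }
}
```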
You could add either SNMP or WMI instrumentation to your services - this incidentally will allow you to raise more than just heartbeat messages - for example you could also raise events when you encounter transient faults that require an operation to be retried.
There are plenty of SNMP libraries out there and WMI support is built into the .NET framework.
Just FYI, I'm researching this question because the way I do it now is that the service application logs a heartbeat message every X seconds when it polls the DB for a new job. At midnight, all heartbeat messages over a week old are deleted from the log. I do have an application scheduled at 7 AM that checks when the last heartbeat message was sent, but it looks like I'm going to have to run it constantly to check periodically ... and notify Solar Winds to send an alert if there is a problem.

What happens to other users if the .NET worker process crashes?

My knowledge of how requests are handled by the ASP.NET worker process is woefully inadequate. I'm hoping some of the experts out there can fill me in.
If I crash the worker process with a System.OutOfMemoryException, what would the user experience be for other users who were being served by the same process? Would they get a blank screen? 503 error?
I'm going to attempt to test this scenario with some other folks in our lab, but I thought I would float this out there. I will update with our results.
UPDATE: Our results varied. If we artificially induced an OOM exception (for example by loading larger and larger PDFs into memory), other threads being served by that worker process would "hang" temporarily and then complete, while others seemingly would never return. Thank you for your responses.
W3WP.exe is the process
IIS runs all web apps in a generic worker process - w3wp.exe. Whether you write in ASP.NET, or ISAPI, or some other framework, the process that serves the web request is w3wp.exe. In the ASP.NET case, w3wp.exe loads the ASP.NET JIT-compiled DLLs and services the requests through them. In other cases, it works differently. But the key point is, w3wp.exe is the process. This model started in IIS6.0 and continues in IIS7.0.
Unexpected Failures
If the W3WP.exe fails unexpectedly, for any reason, all transactions it was handling will likely get 500 errors (Server error). IIS will start a new worker process in its place (MS calls this "Health Monitoring"), which means the web app will continue to run. Users that did not have a request being served by the failing process at the time of failure, will be unaware of any of this.
The HTTP 500 error that a client receives in this case will be indistinguishable from a 500 error that the client receives in the case of an application error, say an uncaught exception in your ASP.NET application code.
For those requests that were in the failing process, there's no way to recover them. They will result in 500 errors at the browser. A 503 Server Busy results from IIS actively refusing the connection due to a threshold on the number of connections. A 503 does not result from an application failure, so you shouldn't expect to see 503 for in-flight transactions in the out-of-memory-and-crash scenario. On a heavily loaded system, you may see 503's as the process-crash-and-restart happens, as a secondary effect. If this is really what you're seeing, you need a larger margin of safety to handle the load in the single-error condition.
The Request Queue
IIS has a hand-off approach for requests. As they arrive on the network layer (Http.sys), they are placed in a queue, to be picked up by a worker process. Any requests waiting in the IIS queue to be handled by a WP will continue unaffected, though they might see a slight temporary increase in latency (service time) due to resource contention, since one fewer process is running on the server. Wait time in this queue is generally very very short, on a system that is configured properly.
It is when this queue is full that you will see 503 errors.
Auto restart of W3WP.exe
IIS has an auto-restart (or "nanny") facility, through which it restarts worker processes after they have exceeded configured thresholds, such as memory size, number of requests, or time-of-running. In all those cases, IIS will quiesce and restart worker processes when the configured threshold is reached. These pro-active restarts normally do not result in any disruption of requests. When IIS decides that a restart of a worker process is necessary, it prevents any new requests from arriving at that to-be-quiesced WP. Existing requests are drained: any in-flight transactions in that WP are allowed to complete normally. When all requests in the WP complete, then the WP dies and IIS starts a new one in its place. This new process then immediately begins picking up new requests from the dispatch queue. This is all transparent to users or browsers.
I say normally because it's possible that the worker process has become truly sick at the same time as the threshold has been reached. In that case the w3wp.exe may not respond to IIS within the configured "quiesce" timeout, and thus IIS has to eventually kill the process even though it hasn't reported that all of its in-flight requests have completed. This should be exceedingly rare, because it's two distinct exceptional conditions, but it happens. In this case, the in-flight requests will once again, get 500 errors.
Web gardens
Also - IIS allows multiple worker processes on a single server. MS calls this a "web garden", a play on words from "web farm". If you have a web garden set up, then transactions being served by w3wp.exe instances other than the failing one, will continue unaffected. "Unaffected" presumes though, that the out-of-memory error is localized, and not a system-wide problem.
Bottom Line
The bottom line is that there is no substitute for your own testing. The configuration options are pretty broad - from restart thresholds to web gardens and so on. Also the failure modes tend to be pretty complex and varied, whether it's memory, timeout, too busy, and so on. You'll want to understand what to expect.
ps: this Q&A really belongs on serverfault.com !!
references:
http://blogs.iis.net/thomad/archive/2008/05/07/the-iis-process-model-features.aspx
A new worker process will be started and the user would not know anything happened, unless the app pool shuts down completely via rapid-fail protection (http://technet.microsoft.com/en-us/library/cc779127(WS.10).aspx).
If it's an out-of-memory situation, IIS usually just recycles the app pool.
As the other answers say, in most cases everything just restarts, and most users who did not have a pending request at the time will not notice much more than a delay.
However, if your application uses session variables with In-Proc session state, all session variables for all users will be lost when the app pool restarts. This may or may not have a negative effect on the users, depending on what you're doing with the session variables. You can avoid this by switching to StateServer or SQL Server session storage.

WCF communication between 2 servers crashes after IIS7 process recycle

I am kind of stumped with this one, and was hoping I could find some answers here.
Basically, I have an ASP.NET application that is running across 2 servers. Server A has all of the business logic/data access exposed as web services, and Server B has the website which talks to those services (via WCF, with net.tcp binding).
The problem occurs a few seconds after a recycle of my app pool is initiated by IIS on Server A. The recycle happens after the allotted time (using the default of 29 hours set in IIS).
In the server log (of Server A):
A worker process with process id of '####' serving application pool 'AppPoolName' has requested a recycle because the worker process reached its allowed processing time limit.
I believe that this is normal behavior. The problem is that a few seconds later, I get this exception on Server B:
This channel can no longer be used to send messages as the output session was auto-closed due to a server-initiated shutdown. Either disable auto-close by setting the DispatchRuntime.AutomaticInputSessionShutdown to false, or consider modifying the shutdown protocol with the remote server.
This doesn't happen on every recycle; I assume that it happens when someone is hitting the site with a request WHILE the recycle happens.
Furthermore, my application is down until I intervene; this exception continues to occur every time a subsequent request is made to the page. I intervene by editing the web.config (adding a space or something benign to the end of the file) and saving it; I assume that causes my application to recompile and brings the services back up. I have also experimented with running a batch file that does this for me every time the exception happens ;)
Now, I could barely find any information on this exception, and I've been looking for a while. Most of the information I did find pertains to WCF settings that I am not using.
I already read up on "DispatchRuntime.AutomaticInputSessionShutdown" and I don't think it pertains to this situation. This particular property refers to the service shutting down automatically in response to behavior on the client side, which is not what is happening here. Here, the service is shutdown because of IIS.
I did read this, which went through some sort of workaround to bring the service back up automatically, but I am really looking to understand what is going on here, not to hack around it!
I have started playing around with the settings in IIS7, specifically turning on/off Overlapped Recycling and increasing the process startup/shutdown times. I am wondering whether it is safe to turn off recycling completely (I believe if I put 0 for the recycling time interval?) But again, I want to know what's going on!
Anyway, if you need more information, let me know. Thanks in advance!
This is probably related to how you open and close WCF connections.
If you open a proxy when your app starts and then keep using it, a break in the connection, caused by a restart on the server side, results in an error on the client side, since the server-side instance the proxy was talking to is no longer there.
When you restart the client side (by changing the web.config), new proxies are created against a server that is running.
The way to fix this is to make sure that you close a WCF connection after you use it.
http://www.codeguru.com/csharp/.net/net_wcf/article.php/c15941/
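A common way to do that (a sketch, not taken from the question's code) is the close-on-success / abort-on-failure pattern, creating the channel per call instead of caching one proxy for the lifetime of the web app. IBusinessService and the endpoint name here are placeholders:

```csharp
using System;
using System.ServiceModel;

[ServiceContract]
public interface IBusinessService
{
    [OperationContract]
    void DoWork();
}

static class ServiceClient
{
    // "netTcpBusinessEndpoint" stands in for the real client endpoint name in config.
    public static void Call(Action<IBusinessService> work)
    {
        var factory = new ChannelFactory<IBusinessService>("netTcpBusinessEndpoint");
        IBusinessService proxy = factory.CreateChannel();
        try
        {
            work(proxy);
            ((IClientChannel)proxy).Close();   // graceful close on success
            factory.Close();
        }
        catch (CommunicationException)
        {
            ((IClientChannel)proxy).Abort();   // channel is faulted; don't Close()
            factory.Abort();
            throw;
        }
        catch (TimeoutException)
        {
            ((IClientChannel)proxy).Abort();
            factory.Abort();
            throw;
        }
    }
}
```

Usage would then be something like ServiceClient.Call(p => p.DoWork()); so every request gets a fresh channel against whichever worker process is currently serving the app pool.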
You should also make sure that you're using the correct SessionMode for your Web Service. I remember having similar trouble with some of my Services until I sorted out the correct mode. This is especially true when you're mixing this with any other authentication mode that is not "None".
This link might have some pointers.
http://msdn.microsoft.com/en-us/library/ms731193.aspx
My suggestion is to simply stop using IIS to host your services. Unless there is something you really need from IIS, I would recommend just writing a standard Windows Service to host your WCF endpoints.
If you can't do that, then by all means turn off recycling. AppPool recycling is mainly there because web developers write crappy code. I know that sounds rather blunt, but if you have enough sense to write code that doesn't leak then there is no reason to have IIS constantly restart your program.
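If you do go the self-hosting route, a minimal sketch looks like the following; the contract, implementation, and base address are placeholders, and in practice the endpoints would usually come from app.config:

```csharp
using System;
using System.ServiceModel;
using System.ServiceProcess;

[ServiceContract]
public interface IBusinessService
{
    [OperationContract]
    void DoWork();
}

public class BusinessService : IBusinessService
{
    public void DoWork() { /* business logic / data access here */ }
}

public class WcfHostService : ServiceBase
{
    private ServiceHost _host;

    protected override void OnStart(string[] args)
    {
        // Hypothetical base address; bindings/endpoints can also be declared in config.
        _host = new ServiceHost(typeof(BusinessService),
            new Uri("net.tcp://localhost:8523/BusinessService"));
        _host.AddServiceEndpoint(typeof(IBusinessService),
            new NetTcpBinding(), string.Empty);
        _host.Open();
    }

    protected override void OnStop()
    {
        if (_host != null)
        {
            _host.Close();
            _host = null;
        }
    }
}
```

With no application pool involved there is no periodic recycle, so the channel on Server B is not torn down underneath the clients.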

Check if windows service has been forcefully shut down or crashed

I have a Windows service written in C# (.NET Framework 3.5) and would like to know the best way to check whether the previous shutdown of the service was regular.
Upon starting the service, there should be a check whether the last shutdown was regular (via the stop button in the Services management console) or whether somebody just killed the process (or it crashed for some reason not directly linked to the service itself).
I thought about writing encrypted XML to the hard drive upon starting the service, and then editing it with some values when the service is being stopped. That way, the next time I start the service, I could check the XML and see whether the values were edited correctly during shutdown; if they were not, I'd know the process was killed or it crashed.
This way seems too unreliable and not a good practice. What do you suggest?
Clarification:
What the service does is sit on a server and listen for connections from client machines. Once a connection has been established, it talks to a remote database via web services and determines whether the client has the right to connect (and therefore to use the application that is the caller). One aspect of the protection is a concurrency check: if I have a limit of 5 workstations, I keep the TcpClient connections alive from the Windows service to, let's say, 5 workstations, and the sixth one cannot connect.
If I kill the service process and restart it, the connections are gone and I have 5 "licensed" apps running on workstations, and now there are 5 free connection slots to be taken by 5 more.
I also can't see anything bad about using a file. You could even use this file to log some more information.
E.g. you could attach to the AppDomain's UnhandledException event and try to log that exception.
Or you could evaluate how long your service has been running/not running (parsing a log file for that task is a little bit harder).
Of course, this is not an excuse for not using log files.
I went with this in the end:
The service used to check up on the connected workstations to see if they're alive, but now I've built in a periodic check from all the workstations as well (they connect through a common router DLL where I've built in the check). Every 10 seconds the connection is verified, and if there is none, the client will try to reconnect in 15 seconds. That will succeed if there was just a temporary network problem, but will fail if the service was shut down forcefully (since all of its Tcp objects will be lost).
I would suggest using the EventLog. Add a log event when the service starts or stops, and read through the event logs to detect anomalies.
Here's a basic sample from CodeProject.
Here's a walkthrough from MSDN on how to create/delete/read event logs and entries.
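As an alternative to writing your own entries, the Service Control Manager already logs service state changes to the System log; an unexpected termination is commonly recorded as event 7034 ("terminated unexpectedly"), though you should verify the ID on your own systems. A rough sketch of checking for that at startup:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class CrashDetector
{
    // Returns true if the Service Control Manager logged an unexpected
    // termination of the named service since the given time.
    public static bool TerminatedUnexpectedlySince(string serviceDisplayName, DateTime since)
    {
        using (var systemLog = new EventLog("System"))
        {
            return systemLog.Entries
                .Cast<EventLogEntry>()
                .Any(e => e.TimeGenerated >= since
                       && e.Source == "Service Control Manager"
                       && (e.InstanceId & 0xFFFF) == 7034   // "terminated unexpectedly"
                       && e.Message.Contains(serviceDisplayName));
        }
    }
}
```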
Unless the service is running some sort of security system where you need tamper proofing, I don't see why using a file is a bad solution.
Personally I think an encrypted XML file is overkill; a simple text file should be enough.
I think you are on the right track. I'm not sure why you want to edit the values; just use the file (or a registry key) as a marker to indicate that the service was started and is running. During a graceful shutdown, remove the marker. You then just need to look for the existence of the marker to know whether you were shut down gracefully or crashed.
If you are finding that the file isn't created reliably, then make sure you are closing and flushing and disposing of the file object rather than relying on the garbage collector.
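A minimal sketch of that marker-file approach (the path is a placeholder; pick somewhere the service account can write):

```csharp
using System;
using System.IO;
using System.ServiceProcess;

public class LicensingService : ServiceBase
{
    // Hypothetical marker location next to the service binaries.
    private static readonly string Marker =
        Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "service.running");

    protected override void OnStart(string[] args)
    {
        // If the marker survived, the previous shutdown was not graceful.
        bool crashedLastTime = File.Exists(Marker);
        File.WriteAllText(Marker, DateTime.UtcNow.ToString("o"));

        if (crashedLastTime)
        {
            // Clean up stale state (e.g. orphaned license slots) here.
        }
    }

    protected override void OnStop()
    {
        if (File.Exists(Marker))
            File.Delete(Marker);
    }
}
```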
--- EDIT following clarification ---
So the requirement is for a licensing system, and not simply to determine whether the service was shut down gracefully. I'm guessing the desire is for the 'licenses' to be cleared on a graceful shutdown and restored following a crash; the scenarios are interchangeable.
I would probably use a database backing store, with suitable security, to hold the license keys at the server. As each client connects and requests a license, it is provided with a key that has to be presented with each communication from the client. The server obviously verifies that the presented key is valid for the current session. Should the server be gracefully shut down, it can clear the key table; if it crashes, the keys would still be present and can be honoured. That's probably the simplest approach I can think of that's secure.
If there's yet more to the story then let us know.

Best way to check if server is reachable in .NET?

I have created a timeclock application in C# that connects to a web service on our server in order to clock employees in/out. The application resides in the system tray and clocks users out if they shut down/suspend their machines or if they are idle for more than three hours, in which case it clocks them out at the time of last activity.
My issue is that when a user brings his machine back up from a sleep state (which fires the SystemEvents.PowerModeChanged event), the application attempts to clock the employee back in, but the network connection isn't fully initialized at that point and the web-service call times out.
An obvious solution, albeit a hack, would be to put a delay on the clock-in, but this wouldn't necessarily fix the problem across the board. What I am looking to do is a sort of "relentless" clock-in, where the application waits until it can see the server before it actually attempts to clock in.
What is the best method to determine if a connection to a web service can be made?
The best way is going to be to actually try to make the connection and catch the errors. You can ping the machine, but that will only tell you if the machine is running and on the network, which doesn't necessarily reflect on whether the webservice is running and available.
When handling the event, put your connection code into a method that will loop through until success, catching errors and retrying.
Even a delay wouldn't be perfect as depending on the individual systems and other applications running it can take varying times for the network connection to be re-established.
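A minimal sketch of that retry loop, assuming the real generated web-service call is passed in (the delegate, attempt cap, and exception type are placeholders to adjust for your proxy):

```csharp
using System;
using System.Net;
using System.Threading;

static class RelentlessClockIn
{
    // Keep attempting the clock-in until it succeeds or the cap is reached.
    public static bool TryClockIn(Action clockIn, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                clockIn();            // e.g. () => timeclockService.ClockIn(employeeId)
                return true;
            }
            catch (WebException)      // network not ready yet after resume
            {
                if (attempt == maxAttempts)
                    return false;
                Thread.Sleep(delay);
            }
        }
        return false;
    }
}
```

Run this on a background thread or timer so the blocking Sleep doesn't freeze the tray application's UI.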
Implement a queue where you post messages and have a thread periodically try to flush the in-memory queue to the web service.
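A sketch of that queue-and-flush idea; the string payload and the send delegate are placeholders for whatever the real clock event and web-service call look like:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class ClockEventQueue : IDisposable
{
    private readonly Queue<string> _pending = new Queue<string>();
    private readonly object _sync = new object();
    private readonly Action<string> _send;
    private readonly Timer _flushTimer;

    public ClockEventQueue(Action<string> sendToWebService)
    {
        _send = sendToWebService;
        // Try to drain the queue every 30 seconds.
        _flushTimer = new Timer(_ => Flush(), null,
            TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
    }

    public void Enqueue(string clockEvent)
    {
        lock (_sync) _pending.Enqueue(clockEvent);
    }

    private void Flush()
    {
        lock (_sync)
        {
            while (_pending.Count > 0)
            {
                try
                {
                    _send(_pending.Peek());
                    _pending.Dequeue();   // only remove after a successful send
                }
                catch (Exception)
                {
                    break;                // server unreachable; retry on the next tick
                }
            }
        }
    }

    public void Dispose()
    {
        _flushTimer.Dispose();
    }
}
```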
If the problem is latency in re-establishing the network service, Ping is the solution; it's like ringing the doorbell to see if anyone is home.
If ping succeeds, then try calling the web service, catching exceptions appropriately (I think both SocketException and SoapException can occur depending on readiness/responsiveness).
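A sketch of the ping-first check, keeping in mind the caveat in the next comment that ICMP may be blocked even when the web service itself is reachable:

```csharp
using System.Net.NetworkInformation;

static class Reachability
{
    public static bool RespondsToPing(string host, int timeoutMs)
    {
        using (var ping = new Ping())
        {
            try
            {
                PingReply reply = ping.Send(host, timeoutMs);
                return reply != null && reply.Status == IPStatus.Success;
            }
            catch (PingException)
            {
                return false;   // e.g. name resolution failed
            }
        }
    }
}
```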
Ping can be disabled even though the web service port is open. I wouldn't use this method...
