Design considerations for high-reliability service

I am writing a C# Windows service which will perform some background processing - basically it is a consumer for a work queue.
It needs to not go down (stop processing new items), and if it does go down I need to be notified.
What are some design guidelines and considerations for a) ensuring that such a service is as reliable as possible, and b) sending out a notification if something does go wrong? I have considered, for instance, creating a watcher thread whose only job is to make sure the worker thread is still processing jobs.

There are a number of things that you can do here to help improve the reliability, as well as gauge that you have a solution that is going to meet your needs.
Testing
First and foremost, your testing process needs to be very solid. Test for the "unexpected" situations, such as loss of network connection, and observe what actually happens in each case. Notification on failure can be a bit of a "mixed bag": for example, you can't e-mail yourself if you don't have a network connection available.
Proper Code Design
In addition to setting up valid test scenarios, make your code as bulletproof as possible. Since you are creating a Windows service, be sure that you capture, log, and deal with every error you can, because if an unhandled error bubbles up to the OS, your service will go down.
Monitoring
Consider putting monitoring in place. In my day job we use two types of monitoring. First, errors are reported to the Windows Event Log in some cases, and Microsoft MOM is used to notify us of any issues in the environment. Second, a scheduled job checks every X minutes that the critical service is in a "Started" state; if it isn't, the job restarts it. Not elegant, but it works.
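To make the "catch everything" and "watcher" ideas a bit more concrete, here is a minimal sketch. The names Heartbeat, Worker, and RunJobSafely are made up for illustration, and the log delegate stands in for whatever logging you actually use (Event Log, file, etc.). The worker records a heartbeat after every job, and a watcher thread or scheduled check can ask whether the worker has gone quiet for too long.

```csharp
using System;
using System.Threading;

// Hypothetical helper: remembers when the worker last completed a job.
public class Heartbeat
{
    private long _lastBeatTicks = DateTime.UtcNow.Ticks;

    // Called by the worker each time a job finishes (successfully or not).
    public void Beat()
    {
        Interlocked.Exchange(ref _lastBeatTicks, DateTime.UtcNow.Ticks);
    }

    // Called by the watcher: true if no beat arrived within maxSilence.
    public bool IsStalled(DateTime utcNow, TimeSpan maxSilence)
    {
        long last = Interlocked.Read(ref _lastBeatTicks);
        return utcNow.Ticks - last > maxSilence.Ticks;
    }
}

public static class Worker
{
    // Runs one job. Any exception is logged instead of escaping to the OS.
    // Returns true on success, false if the job threw.
    public static bool RunJobSafely(Action job, Heartbeat heartbeat, Action<string> log)
    {
        try
        {
            job();
            return true;
        }
        catch (Exception ex)
        {
            log("Job failed: " + ex.Message);
            return false;
        }
        finally
        {
            // The loop is still alive even if this particular job failed.
            heartbeat.Beat();
        }
    }
}
```

A watcher would call IsStalled on a timer and raise an alert (or restart the worker) when it returns true. Note that a job that hangs never reaches the finally block, which is exactly the case the watcher is there to detect.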

I think MOM, SolarWinds, or whatever other monitoring application your system administrator already uses to monitor the machine on which the service is deployed should watch the service and take proper action (send email, ring phones :)

Related

C# What exactly is application domain?

I understand that an application domain forms:
an isolation boundary for security,
versioning,
reliability,
and unloading of managed code,
but so does a process
Can someone please help me understand the practical benefits of an application domain?
I assumed an app domain provides a container that can load only one version of an assembly, but recently I discovered that multiple versions of a strong-named assembly can be loaded into one app domain.
My concept of application domain is still not clear. And I am struggling to understand why this concept was implemented when the concept of process is present.
Thank you.

I can't tell if you are talking in general or specifically .NET's AppDomain.
I am going to assume you mean .NET's AppDomain, and explain why it can be really useful when you need that isolation inside of a single process.
For instance:
Say you are dealing with a library that has certain worker classes, and you have no choice but to use those workers and can't modify the code. It's your job to build a Windows Service that manages said workers, makes sure they all stay up and running, and runs them in parallel.
Easy enough right? Well, you hoped. It turns out your worker library is prone to throwing exceptions, uses a static configuration, and is generally just a real PITA.
You could try to launch them in their own processes, but to monitor them you'll need to implement named pipes or try to carefully parse the STDIN and STDOUT of each process.
What else can you do? Well, AppDomain actually solves this. I can spawn an AppDomain for each worker and give each one its own configuration. They can't screw each other up by changing static properties because they are isolated, and on top of that, if the library bombs out and I failed to catch the exception, it doesn't bother the other workers in their own domains. And during all of this, I can still communicate with those workers easily.
Sadly, I have had to do this before.
EDIT: Started to write this as a comment response, but got too large
Individual processes can work great in many scenarios, however, there are just times where they can become a pain. I am not saying one should use an AppDomain over another process. I think it's uncommon you would need a separate process or AppDomain, but once you need it, you'll definitely know.
The main problem I see with processes in the scenario I've given above is that they have their own drawbacks, which are easier to mitigate with an AppDomain.
A process can go rogue, become unresponsive, and crash or be killed at any point.
If you're managing processes, you need to keep track of each process ID and monitor its status. IPC is great, but it does take time to get proper communication going back and forth as needed.
As an example, let's say your process just dies. What do you do? Depending on the mechanism you chose to monitor it, maybe the communication thread died, or perhaps the work finished and you still show it as "processing". What do you do?
Now what happens when you have 20 processes and your management app dies? You don't have any real information. All you have is 20 "myprocess.exe" instances, and you may now have to start parsing the command line arguments they were started with to see which workers you actually have. Obviously with an AppDomain all 20 would have died too, but did you really gain anything with the processes? You still have to code the ability to recover, but now you also have to code all of the recovery for your processes instead of just firing the workers back up.
As with anything in programming, there are 1,000 different ways to achieve the same goal. It's up to you to decide which solution you feel is most appropriate.

Some practical benefits of using app domains:
Multiple app domains can run in a single process. You can also stop an individual app domain without stopping the entire process. This alone drastically increases server scalability.
Managing the app domain life cycle is done programmatically by runtime hosts (you can override it as well). For processes and threads, you have to manage their life cycle explicitly. Initialization, execution, termination, and inter-process/multithread communication are complex, which is why it's easier to defer that to CLR management.
Source: https://learn.microsoft.com/en-us/dotnet/framework/app-domains/application-domains

Preventing a bottleneck in device communication

I've got quite an abstract question. I'm working on a project that requires constant device communication. I'm integrating multiple devices onto an external processing unit with a touchpanel to execute certain methods. I.e. the "start videocall" button on the touchpanel activates a relay, turns a display-device, camera-device and microphone-device on, etc.
On the flipside, I'm also trying to monitor these devices. What status do they currently have? Are they enabled/disabled? What input is the display device currently on?
So far, I've come up with two solutions to prevent a bottleneck in the communication where I'm constantly polling (i.e. every two to five seconds, to keep an accurate and up-to-date status) the on-state and input-state of the display-device.
Make use of threading so I can enqueue the different commands and execute them async. By also reading the response async, all communication should be nicely spaced out, but I'd have a very "busy" communication line, taking its toll on the processing unit.
With the help of events, have the display-device notify the processor of its changed status. This would take a lot of stress off of the communication line, but I feel like this is very easily disrupted. If the device doesn't raise its events correctly (or the events are missed), the monitored state does not correspond with the actual state.
I'm curious if there are other ways of going about this issue. As of now, I'm leaning towards the second one because it stresses the processing unit a whole lot less. I just feel like I should be building in a lot of safeguards to prevent an inaccurate representation of the actual device-states.
The project runs in C# on .NET 3.5.

Polling works, but it isn't fun or optimal. Reactive is best, but as you've mentioned there may be a hiccup in ensuring you're still listening to the device and not just standing by for nothing. In this situation it makes sense to combine both approaches. Poll when you're waiting or haven't heard from the device in a while, and go back to listening (skipping the polling) once a poll returns good info.
That said, you shouldn't worry about taxing the unit too much with polling on various threads. This sounds like a purpose-built device, so as long as you're not running it hot or stressing it to the max all the time, using its resources is perfectly fine.
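As a rough illustration of the "listen, but poll when it has been quiet too long" idea, something like the following could work. DeviceStateMonitor and its members are names invented for this sketch, and how you actually receive events or send a poll depends on your device API. It sticks to language features available on .NET 3.5.

```csharp
using System;

// Hypothetical sketch: tracks when the device was last heard from and
// decides whether a fallback poll is due.
public class DeviceStateMonitor
{
    private readonly TimeSpan _maxSilence;
    private DateTime _lastHeardUtc;

    public string LastKnownState { get; private set; }

    public DeviceStateMonitor(TimeSpan maxSilence, DateTime startUtc)
    {
        _maxSilence = maxSilence;
        _lastHeardUtc = startUtc;
    }

    // Call this from the device's event handler and from a successful poll reply.
    public void Report(string state, DateTime utcNow)
    {
        LastKnownState = state;
        _lastHeardUtc = utcNow;
    }

    // Call this from a timer: poll only when the device has been silent too long.
    public bool ShouldPoll(DateTime utcNow)
    {
        return utcNow - _lastHeardUtc >= _maxSilence;
    }
}
```

A timer tick would call ShouldPoll and only send a status request when it returns true, so a device that raises its events reliably is rarely polled, while a silent one gets checked at least once per maxSilence interval. If the monitor is touched from several threads, you would want a lock around the two fields.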

Multi-server n-tier synchronized timing and performance metrics?

[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows:
Load-balanced front-end web servers (IIS) running an MVC application (C#).
A home-grown service bus, implemented with MSMQ running in domain-integration mode.
Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus.
Back-end SQL Server 2012 database, mirrored and replicated.
All servers have high spec hardware, running Windows Server 2012, latest releases, latest windows update. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, We've found the cause of the performance issue. It occurs when the software tests if a queue has been created before it opens it. So it was essentially looking up the queue twice, which is fairly expensive. So the issue has gone away.

What you should try is using the Performance Monitor that's part of Windows itself. What you can do is create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to monitor for.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.

How to prevent NHibernate long-running process from locking up web site?

I have an NHibernate MVC application that is using ReadCommitted Isolation.
On the site, there is a certain process that the user could initiate, and depending on the input, may take several minutes. This is because the session is per request and is open that entire time.
But while that runs, no other user can access the site (they can try, but their request won't go through until the long-running process has finished).
What's more, I also have a need to have a console app that also performs this long running function while connecting to the same database. It is causing the same issue.
I'm not sure what part of my setup is wrong, any feedback would be appreciated.
NHibernate is set up with fluent configuration and StructureMap.
Isolation level is set as ReadCommitted.
The session factory lifecycle is HybridLifeCycle (which on the web should be Session per request, but on the win console app would be ThreadLocal)

It sounds like your requests are waiting on database locks. Your options are really:
Break the long running process into a series of smaller transactions.
Use ReadUncommitted isolation level most of the time (this is appropriate in a lot of use cases).
Judicious use of Snapshot isolation level (Assuming you're using MS-SQL 2005 or later).
(N.B. I'm assuming the long-running function does a lot of reads/writes and the requests being blocked are primarily doing reads.)
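For the first option, a minimal sketch of the batching part might look like this. BatchProcessor and its method names are invented for illustration, and the delegate is where you would open a session and transaction, do the work for that batch, and commit. Nothing here is NHibernate-specific.

```csharp
using System;
using System.Collections.Generic;

public static class BatchProcessor
{
    // Splits 'items' into consecutive batches of at most 'batchSize' elements.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var current = new List<T>(batchSize);
        foreach (T item in items)
        {
            current.Add(item);
            if (current.Count == batchSize)
            {
                yield return current;
                current = new List<T>(batchSize);
            }
        }
        if (current.Count > 0) yield return current;
    }

    // Calls 'processInOwnTransaction' once per batch and returns the batch count.
    // The delegate is expected to begin and commit its own short transaction.
    public static int ProcessInBatches<T>(
        IEnumerable<T> items, int batchSize, Action<List<T>> processInOwnTransaction)
    {
        int batches = 0;
        foreach (List<T> batch in Batch(items, batchSize))
        {
            processInOwnTransaction(batch);
            batches++;
        }
        return batches;
    }
}
```

Because each batch commits separately, locks are held only for the duration of one batch rather than for several minutes. The trade-off is that a failure halfway through leaves earlier batches committed, so the process needs to be safe to resume.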

As has been suggested, breaking your process down into multiple smaller transactions will probably be the solution.
I would suggest looking at something like Rhino Service Bus or NServiceBus (my preference is Rhino Service Bus - I find it much simpler to work with personally). What that allows you to do is separate the functionality down into small chunks, but maintain the transactional nature. Essentially with a service bus, you send a message to initiate a piece of work, the piece of work will be enlisted in a distributed transaction along with receiving the message, so if something goes wrong, the message will not just disappear, leaving your system in a potentially inconsistent state.
Depending on what you need to do, you could send an initial message to start the processing, and then after each step, send a new message to initiate the next step. This can really help to break down the transactions into much smaller pieces of work (and simplify the code). The two service buses I mentioned (there is also Mass Transit), also have things like retries built in, and error handling, so that if something goes wrong, the message ends up in an error queue and you can investigate what went wrong, hopefully fix it, and reprocess the message, thus ensuring your system remains consistent.
Of course whether this is necessary depends on the requirements of your system :)

Another, but more complex, solution would be:
You build a background robot application which runs on one of the machines.
This background worker robot can receive "worker jobs" (the ones initiated by the user).
Then the robot processes the jobs step by step in the background.
Pitfalls are:
- you have to program this robot to be very stable
- you need to watch the robot somehow
Sure, this involves more work. On the flip side, you will have the option to integrate more job types, enabling your system to process different things in the background.

I think the design of your application or SQL statements has a problem. Unless you are Facebook, I don't think any process should take this long. It is better to review your design and find where the bottlenecks are, instead of trying to keep this long-running process going.
Also, sometimes an ORM is not the right fit for every scenario. Did you try using stored procedures (SP)?

Is this a good time to use multithreading in ASP.NET MVC and how is it implemented?

I want a certain action request to trigger a set of e-mail notifications. The user does something, and it sends the emails. However I do not want the user to wait for page response until the system generates and sends the e-mails. Should I use multithreading for this? Will this even work in ASP.NET MVC? I want the user to get a page response back and the system just finish sending the e-mails at its own pace. Not even sure if this is possible or what the code would look like. (PS: Please don't offer me an alternative solution for sending e-mails, don't have time for that kind of reconfiguration.)

SmtpClient.SendAsync is probably a better bet than manual threading, though multi-threading will work fine with the usual caveats.
http://msdn.microsoft.com/en-us/library/x5x13z6h.aspx
As other people have pointed out, success/failure cannot be indicated deterministically when the page returns before the send is actually complete.
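Since the page can't report the outcome, the least you can do is log it when the send completes. Here is a minimal sketch, assuming one SmtpClient per message. MailNotifier, DescribeFailure, and SendInBackground are names made up for this example, and log stands in for your real logger. How SendAsync interacts with a particular ASP.NET request pipeline is not covered here.

```csharp
using System;
using System.ComponentModel;
using System.Net.Mail;

public static class MailNotifier
{
    // Turns the completion args into a log line; returns null on success.
    public static string DescribeFailure(AsyncCompletedEventArgs e)
    {
        if (e.Cancelled) return "Send cancelled: " + e.UserState;
        if (e.Error != null) return "Send failed: " + e.UserState + ": " + e.Error.Message;
        return null;
    }

    // Starts the send and returns immediately. Use a fresh SmtpClient per
    // message: a client allows only one SendAsync in flight at a time, and
    // this method adds a new handler on every call.
    public static void SendInBackground(SmtpClient client, MailMessage message, Action<string> log)
    {
        client.SendCompleted += (sender, e) =>
        {
            string failure = DescribeFailure(e);
            if (failure != null) log(failure);
            message.Dispose();
        };

        // The second argument is a user token that comes back as e.UserState.
        client.SendAsync(message, message.Subject);
    }
}
```

This doesn't make the result visible to the user, but it does mean a broken mail server shows up in your logs instead of failing silently.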
A couple of observations when using asynchronous operations:
1) They will come back to bite you in some way or another. It's a risk versus benefit discussion. I like the SendAsync() method I proposed because it means forms can return instantly even if the email server takes a few seconds to respond. However, because it doesn't throw an exception, you can have a broken form and not even know it.
Of course unit testing should address this initially, but what if the production configuration file gets changed to point to a broken mail server? You won't know it, you won't see it in your logs, you only discover it when someone asks you why you never responded to the form they filled out. I speak from experience on this one. There are ways around this, but in practicality, async is always more work to test, debug, and maintain.
2) Threading in ASP.Net works in some situations if you understand the ThreadPool, app domain refreshes, locking, etc. I find that it is most useful for executing several operations at once to increase performance where the end result is deterministic, i.e. the application waits for all threads to complete. This way, you gain the performance benefits while still having a clear indication of results.
3) Threading/Async operations do not increase performance, only perceived performance. There may be some edge cases where that is not true (such as processor optimizations), but it's a good rule of thumb. Improperly used, threading can hurt performance or introduce instability.
The better scenario is out of process execution. For enterprise applications, I often move things out of the ASP.Net thread pool and into an execution service.
See this SO thread: Designing an asynchronous task library for ASP.NET

I know you are not looking for alternatives, but using a MessageQueue (such as MSMQ) could be a good solution for this problem in the future. Using multithreading in asp.net is normally discouraged, but in your current situation I don't see why you shouldn't. It is definitely possible, but beware of the pitfalls related to multithreading (stolen here):
• There is a runtime overhead associated with creating and destroying threads. When your application creates and destroys threads frequently, this overhead affects the overall application performance.
• Having too many threads running at the same time decreases the performance of your entire system. This is because your system is attempting to give each thread a time slot to operate inside.
• You should design your application well when you are going to use multithreading, or otherwise your application will be difficult to maintain and extend.
• You should be careful when you implement a multithreading application, because threading bugs are difficult to debug and resolve.

At the risk of violating your no-alternative-solution prime directive, I suggest that you write the email requests to a SQL Server table and use SQL Server's Database Mail feature. You could also write a Windows service that monitors the table and sends emails, logging successes and failures in another table that you view through a separate ASP.Net page.

You can probably use ThreadPool.QueueUserWorkItem.
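A minimal sketch of what that could look like follows. BackgroundWork and Queue are names invented for this example, and the work delegate would contain your e-mail sending code.

```csharp
using System;
using System.Threading;

public static class BackgroundWork
{
    // Queues 'work' on the thread pool and returns immediately.
    // Exceptions are handed to 'onError' rather than left unhandled,
    // because an unhandled exception on a pool thread terminates the process.
    public static void Queue(Action work, Action<Exception> onError)
    {
        ThreadPool.QueueUserWorkItem(delegate(object state)
        {
            try
            {
                work();
            }
            catch (Exception ex)
            {
                onError(ex);
            }
        });
    }
}
```

In a controller action you would call BackgroundWork.Queue with the send logic and a logging callback, then return the view right away. The try/catch matters here, since without it a single failed send could take down the whole application.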

Yes, this is an appropriate time to use multi-threading.
One thing to look out for, though: how will you tell the user when the email sending ultimately fails? Not blocking the user is a good step towards improving your UI, but it must not give a false sense of success when the send actually fails at a later time.

Don't know if any of the above links mentioned it, but don't forget to keep an eye on request timeout values. The queued items will still need to complete within that time period.
