I have a C# service application which interacts with a database. It was recently migrated from .NET 2.0 to .NET 4.0 so there are plenty of new tools we could use.
I'm looking for pointers to programming approaches or tools/libraries to handle defining tasks, configuring which tasks they depend on, queueing, prioritizing, cancelling, etc.
There are various types of services:
Data (for retrieving and updating)
Calculation (populate some table with the results of a calculation on the data)
Reporting
These services often depend on one another and are triggered on demand; e.g., a Reporting task will probably contain code such as:
    if (IsSomeDependentCalculationRequired())
        PerformDependentCalculation(); // which may trigger further calculations
    GenerateRequestedReport();
Also, any Data modification is likely to set the Required flag on some of the Calculation or Reporting services, (so the report could be out of date before it's finished generating). The tasks vary in length from a few seconds to a couple of minutes and are performed within transactions.
This has worked OK up until now, but it is not scaling well. There are fundamental design problems and I am looking to rewrite this part of the code. For instance, if two users request the same report at similar times, the dependent tasks will be executed twice. Also, there's currently no way to cancel a task in progress, and the dependent tasks are hard to maintain.
I'm NOT looking for suggestions on how to implement a fix. Rather I'm looking for pointers to what tools/libraries I would be using for this sort of requirement if I were starting in .NET 4 from scratch. Would this be a good candidate for Windows Workflow? Is this what Futures are for? Are there any other libraries I should look at or books or blog posts I should read?
Edit: What about Rx Reactive Extensions?
I don't think your requirements fit into any of the built-in stuff. Your requirements are too specific for that.
I'd recommend that you build a task queueing infrastructure around a SQL database. Your tasks are pretty long-running (seconds) so you don't need particularly high throughput in the task scheduler. This means you won't encounter performance hurdles. It will actually be a pretty manageable task from a programming perspective.
You should probably build a Windows service or some other process that continuously polls the database for new tasks or requests. This service can then enforce arbitrary rules on the requested tasks. For example, it can detect that a reporting task is already running and not schedule a new computation.
My main point is that your requirements are so specific that you need to use C# code to encode them. You cannot make an existing tool fit your needs. You need the Turing completeness of a programming language to do this yourself.
Edit: You should probably separate a task-request from a task-execution. This allows multiple parties to request a refresh of some reports while only one actual computation is running. Once this single computation is completed, all task-requests are marked as completed. When a request is cancelled, the execution does not need to be cancelled. Only when the last request is cancelled is the task-execution cancelled as well.
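As a rough illustration of the request/execution split, here is a minimal in-memory sketch. The class name `SharedExecutions`, the string key, and the `work` delegate are all made up for this example; a real implementation would persist requests in the database as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: many requests share one execution per task key.
public sealed class SharedExecutions
{
    private sealed class Execution
    {
        public Task Task;
        public CancellationTokenSource Cts;
        public int RequestCount;
    }

    private readonly object _gate = new object();
    private readonly Dictionary<string, Execution> _running =
        new Dictionary<string, Execution>();

    // Registers a request; starts the work only if no execution exists for the key.
    public Task Request(string key, Action<CancellationToken> work)
    {
        lock (_gate)
        {
            Execution exec;
            if (!_running.TryGetValue(key, out exec))
            {
                var created = new Execution { Cts = new CancellationTokenSource() };
                created.Task = Task.Factory.StartNew(
                    () => work(created.Cts.Token), created.Cts.Token);
                // Forget the execution once it ends, whatever the outcome.
                created.Task.ContinueWith(_ =>
                {
                    lock (_gate)
                    {
                        Execution current;
                        if (_running.TryGetValue(key, out current) && current == created)
                            _running.Remove(key);
                    }
                });
                _running[key] = exec = created;
            }
            exec.RequestCount++;
            return exec.Task;
        }
    }

    // Withdraws one request; cancels the execution only when no requests remain.
    public void CancelRequest(string key)
    {
        lock (_gate)
        {
            Execution exec;
            if (_running.TryGetValue(key, out exec) && --exec.RequestCount == 0)
            {
                exec.Cts.Cancel();
                _running.Remove(key);
            }
        }
    }
}
```

Two callers that request the same key while the work is running get the same `Task` back, so the computation runs once. The work delegate is expected to observe the token cooperatively.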
Edit 2: I don't think workflows are the solution. Workflows usually operate separately from each other. But you don't want that. You want to have rules which span multiple tasks/workflows. You would be working against the system with a workflow based model.
Edit 3: A few words about the TPL (Task Parallel Library). You mentioned it ("Futures"). If you want some inspiration on how tasks could work together, how dependencies could be created and how tasks could be composed, look at the Task Parallel Library (in particular the Task and TaskFactory classes). You will find some nice design patterns there because it is very well designed. Here is how you model a sequence of tasks: you call Task.ContinueWith, which registers a continuation function as a new task. Here is how you model dependencies: TaskFactory.ContinueWhenAll(Task[], ...) creates a continuation task that only runs when all its input tasks are completed.
BUT: The TPL itself is probably not well suited for you because its tasks cannot be persisted to disk. When you reboot your server or deploy new code, the process is stopped and all in-flight tasks are lost. This is likely to be unacceptable. Please just use the TPL as inspiration. Learn from it what a "task/future" is and how they can be composed. Then implement your own form of tasks.
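To make the two composition patterns concrete, here is a small sketch that should work on .NET 4.0, using `Task.ContinueWith` for a sequence and `TaskFactory.ContinueWhenAll` for a dependency on several tasks. The class name and the arithmetic standing in for "calculations" are invented for the example.

```csharp
using System;
using System.Threading.Tasks;

public static class TplComposition
{
    // Runs two independent calculations, then a report step that depends on both.
    public static Task<string> BuildReport()
    {
        Task<int> calcA = Task.Factory.StartNew(() => 2 + 3);
        Task<int> calcB = Task.Factory.StartNew(() => 4 * 5);

        // Dependency: this continuation starts only after calcA and calcB complete.
        Task<int> total = Task.Factory.ContinueWhenAll(
            new[] { calcA, calcB },
            (Task<int>[] done) => done[0].Result + done[1].Result);

        // Sequence: the report step runs after the total is available.
        return total.ContinueWith(t => "Total: " + t.Result);
    }
}
```

Calling `TplComposition.BuildReport().Result` blocks until the whole chain has finished and yields the string "Total: 25".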
Does this help?
I would try to use the state machine package Stateless to model the workflow. Using a package will provide a consistent way to advance the state of the workflow across the various services. Each of your services would hold an internal state machine implementation and expose methods for advancing it. Stateless will be responsible for triggering actions based on the state of the workflow, and it forces you to explicitly set up the various states that it can be in. This will be particularly useful for maintenance, and it will probably help you understand the domain better.
If you want to solve this fundamental problem properly and in a scalable way, you should probably look at the SOA architectural style.
Your services will receive commands and generate events, which you can handle in order to react to things that happen in your system.
And, yes, there are tools for it. For example NServiceBus is a wonderful tool to build SOA systems.
You can use a SQL Server Agent job to run SQL queries at timed intervals. Beyond that, it looks like you will have to write the application yourself: a long-running program that checks the time and does the work. I don't think there are clear-cut tools out there for what you are trying to do. Write a C# application or a WCF service; the data automation can be done in SQL itself.
If I understand you correctly, you want to cache the generated reports and avoid doing the work again. As other commenters have pointed out, this can be solved elegantly with multiple Producer/Consumer queues and some caches.
First you enqueue your report request. Based on the report generation parameters, you can check the cache first to see whether a previously generated report is already available and simply return that one. If the report becomes obsolete due to changes in the database, you need to take care that the cache is invalidated in a reliable manner.
Now, if the report has not been generated yet, you need to schedule it for generation. The report scheduler needs to check whether the same report is already being generated. If so, register an event to notify you when it is completed and return the report once it is finished. Make sure that you do not access the data via the caching layer, since that could produce races (the report is generated, the data is changed, and the finished report is immediately discarded by the cache, leaving nothing for you to return).
Or, if you do want to avoid returning outdated reports, you can let the caching layer become your main data provider, which will keep producing reports until one is generated that was not outdated by the time it finished. But be aware that if you have constant changes in your database you might enter an endless loop here, constantly generating invalid reports, if the report generation time is longer than the average time between two changes to your database.
As you can see, you have plenty of options here without even talking about .NET, the TPL, or SQL Server. First you need to set your goals for how fast, scalable, and reliable your system should be; then you need to choose the appropriate architecture and design, as described above, for your particular problem domain. I cannot do it for you because I do not have your full domain know-how about what is acceptable and what is not.
The tricky part is the handover part between different queues with the proper reliability and correctness guarantees. Depending on your specific report generation needs you can put this logic into the cloud or use a single thread by putting all work into the proper queues and work on them concurrently or one by one or something in between.
The TPL and SQL Server can certainly help there, but they are only tools. If they are used wrongly because of insufficient experience with one or the other, it might turn out that a different approach (such as using only in-memory queues and persisting reports in the file system) is better suited to your problem.
From my current understanding I would not misuse SQL Server as a cache, but if you want a database I would use something like RavenDB or RaptorDB, which look stable and are much more lightweight than a full-blown SQL Server.
But if you already have a SQL server running then go ahead and use it.
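The "keep generating until a report is not outdated" option above can be bounded so that it cannot loop forever. The following is only a sketch: `getVersion` and `generate` are hypothetical callbacks, and the idea of a data version counter that changes on every modification is an assumption, not something from the original question.

```csharp
using System;

public static class FreshReport
{
    // getVersion: hypothetical callback returning a counter that changes whenever the data changes.
    // generate:   hypothetical callback that produces the report.
    // Returns true if a report was produced while the data version stayed unchanged.
    public static bool TryGenerate(
        Func<long> getVersion, Func<string> generate, int maxAttempts, out string report)
    {
        report = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            long before = getVersion();
            report = generate();          // keep the latest result even if it turns out stale
            if (getVersion() == before)
                return true;              // no change happened during generation
        }
        return false;                     // data changed during every attempt
    }
}
```

When the method returns false, the caller still holds the most recent (stale) report and can decide whether to return it with a warning or to fail the request.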
I am not sure if I understood you correctly, but you might want to have a look at JAMS Scheduler: http://www.jamsscheduler.com/. It's non-free, but a very good system for scheduling dependent tasks and reporting. I have used it with success at my previous company. It's written in .NET and there is a .NET API for it, so you can write your own apps that communicate with JAMS. They also have very good support and are eager to implement new features.
Related
I am fairly new to asynchronous programming so I need help.
What I need to do is, create a windows service that constantly checks the database for menu updates (insert/updates), tables updates (insert/updates), menu category updates (insert/updates) and so on and if any change is detected the service will then need to POST those said changes to separate APIs one by one. Keeping in mind that the service will be used for just this purpose and the database that I need to check for updates is SQL Server.
So, how do I approach this scenario efficiently? Do I create new Tasks (System.Threading.Tasks) or new Threads (System.Threading.Thread) for each piece, such as UpdateMenu (which checks for menu updates and uploads them to the API), UpdateTable, UpdateDishes and so on? And how do I go about posting to the API: do I create a new Task for each and every API call? I want the application to be as efficient as possible, picking up the changes and posting them to the API as soon as possible.
Thanks in advance.
It seems that you are worried about the overhead of the mechanism that you are going to use, in order to fetch data from the database and post these data to APIs. You are thinking that maybe Threads are fast and Tasks are slower, or vice versa. In fact choosing between these two mechanisms is likely to have no measurable impact to your service's demand for CPU, memory or other system resources.
What is likely to be impactful, is the pattern of communication of your service with the database and the APIs. For example if your threads/tasks are not coordinated with each other, and query the database all at the same time, the database might be slow to respond, and might consume larger amounts of memory while preparing the response. That's not because your threads/tasks are slow. It's because your service is querying the database with a pattern that makes it harder for the database to respond. The same might be true for the pattern of communication with the APIs. If your workers are not coordinated, the network connectivity might become a bottleneck, or the remote machines that host the APIs might suffer.
So my advice is to focus on the usability factor of the mechanisms, and not on their supposed difference in performance. If you are comfortable and familiar with threads, and know nothing about tasks, use threads. If you are familiar with both threads and tasks, use tasks because they are generally easier to use. You'd better invest your time to optimize the communication pattern between your service and its dependencies, than for doing benchmarks trying to find the best between mechanisms that for all intents and purposes are equally efficient.
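One way to coordinate the workers so they do not hit the APIs all at once is to cap the number of concurrent calls. Below is a minimal sketch using `SemaphoreSlim`; the `postAsync` delegate is a placeholder for whatever actually sends a change to an API, and the concurrency limit is a number you would tune yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledPoster
{
    // postAsync is a hypothetical delegate that sends one change to an API.
    public static async Task PostAllAsync<T>(
        IEnumerable<T> changes, Func<T, Task> postAsync, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            List<Task> pending = changes.Select(async change =>
            {
                await gate.WaitAsync();           // wait for a free slot
                try { await postAsync(change); }
                finally { gate.Release(); }       // free the slot even on failure
            }).ToList();

            await Task.WhenAll(pending);          // surfaces the first failure, if any
        }
    }
}
```

The same gate could be shared between the menu, table, and dish updaters so that the total load on the remote machines stays bounded regardless of how many kinds of change are detected.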
We have created a dotnet core web api project which is using SQL Server database. Now, we are planning to deploy this project to Microsoft Azure.
As part of the deployment, we are also considering enabling the autoscaling option (horizontal scaling).
Before we do that, we have some questions we would like to clarify.
Do we need to add any additional code to our application to allow autoscaling to work properly?
"Properly" in the sense that there can be more than one instance of the application running because of horizontal scaling. We are using a database; if more than one instance is running, will it cause race conditions (i.e., two resources accessing the same data at the same time)? Can we add a transaction (or use locking) in our code to avoid these kinds of scenarios?
I would like to know whether there are any best practices to follow when implementing this kind of application.
Thank you and waiting for your answers!
Consider the following points when designing an autoscaling strategy:
The system must be designed to be horizontally scalable. Avoid making
assumptions about instance affinity; do not design solutions that
require that the code is always running in a specific instance of a
process. When scaling a cloud service or web site horizontally, do
not assume that a series of requests from the same source will always
be routed to the same instance. For the same reason, design services
to be stateless to avoid requiring a series of requests from an
application to always be routed to the same instance of a service.
When designing a service that reads messages from a queue and
processes them, do not make any assumptions about which instance of
the service handles a specific message because autoscaling could
start additional instances of a service as the queue length grows.
The Competing Consumers pattern describes how to handle this
scenario.
If the solution implements a long-running task, design this task to
support both scaling out and scaling in. Without due care, such a
task could prevent an instance of a process from being shutdown
cleanly when the system scales in, or it could lose data if the
process is forcibly terminated. Ideally, refactor a long-running task
and break up the processing that it performs into smaller, discrete
chunks. The Pipes and Filters pattern provides an example of how you
can achieve this. Alternatively, you can implement a checkpoint
mechanism that records state information about the task at regular
intervals, and save this state in durable storage that can be
accessed by any instance of the process running the task. In this
way, if the process is shutdown, the work that it was performing can
be resumed from the last checkpoint by using another instance.
For more information, follow the doc : https://github.com/Huachao/azure-content/blob/master/articles/best-practices-auto-scaling.md
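The checkpoint idea from the quoted guidance can be sketched as follows. This is only an illustration under simplifying assumptions: a local file path stands in for durable storage that every instance can reach, and `maxItems` exists purely to simulate an instance being shut down part-way through.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CheckpointedJob
{
    // Processes items starting at the last saved position and records progress after each item.
    // checkpointPath stands in for durable storage shared by all instances (an assumption).
    // Returns the index of the next unprocessed item.
    public static int Run(
        IList<string> items, string checkpointPath, Action<string> process, int maxItems)
    {
        int next = File.Exists(checkpointPath)
            ? int.Parse(File.ReadAllText(checkpointPath))
            : 0;

        int processed = 0;
        while (next < items.Count && processed < maxItems)
        {
            process(items[next]);   // should be idempotent: a crash here repeats this item
            next++;
            processed++;
            File.WriteAllText(checkpointPath, next.ToString()); // checkpoint
        }
        return next;
    }
}
```

Because the checkpoint is written after the item is processed, an item can be processed twice if the process dies between the two steps, so the work itself should tolerate repeats.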
Regarding this:
"Properly" in the sense that there can be more than one instance of the application running because of horizontal scaling. We are using a database; if more than one instance is running, will it cause race conditions (i.e., two resources accessing the same data at the same time)? Can we add a transaction (or use locking) in our code to avoid these kinds of scenarios?
Please keep in mind that, even if the app is running on a single machine, requests will still be handled concurrently. This means that even on a single machine 2 requests can cause the same entry in the database to be updated. So the above questions about race conditions apply to single instance web apps as well.
Try to avoid locking: the whole point of (horizontal) scaling is to gain performance benefits. By using locks you effectively remove these benefits, as only one process at a time can use the locked resource.
Other points of considerations are:
If you are using an in-memory cache you might want to swap it out for a distributed cache.
The guidance at the MS docs
I have an NHibernate MVC application that is using ReadCommitted Isolation.
On the site, there is a certain process that the user could initiate, and depending on the input, may take several minutes. This is because the session is per request and is open that entire time.
But while that runs, no other user can access the site (they can try, but their request won't go through unless the long-running thing is finished)
What's more, I also have a need to have a console app that also performs this long running function while connecting to the same database. It is causing the same issue.
I'm not sure what part of my setup is wrong, any feedback would be appreciated.
NHibernate is set up with fluent configuration and StructureMap.
Isolation level is set as ReadCommitted.
The session factory lifecycle is HybridLifeCycle (which on the web should be Session per request, but on the win console app would be ThreadLocal)
It sounds like your requests are waiting on database locks. Your options are really:
Break the long running process into a series of smaller transactions.
Use ReadUncommitted isolation level most of the time (this is appropriate in a lot of use cases).
Judicious use of Snapshot isolation level (Assuming you're using MS-SQL 2005 or later).
(N.B. I'm assuming the long-running function does a lot of reads/writes and the requests being blocked are primarily doing reads.)
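The first option, splitting the long-running process into a series of smaller transactions, can be sketched generically. The `commitBatch` delegate below is hypothetical; in an NHibernate application it would be the place where a short transaction is opened, the batch is saved, and the transaction is committed.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Calls commitBatch once per batch; each call is expected to open and
    // commit its own short transaction. Returns the number of batches.
    public static int ProcessInBatches<T>(
        IEnumerable<T> items, int batchSize, Action<IList<T>> commitBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        int batches = 0;
        var batch = new List<T>(batchSize);
        foreach (T item in items)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                commitBatch(batch);
                batches++;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)   // final partial batch
        {
            commitBatch(batch);
            batches++;
        }
        return batches;
    }
}
```

Locks are then held only for the duration of one batch, so other requests get a chance to run between batches. The trade-off is that the overall process is no longer atomic.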
As has been suggested, breaking your process down into multiple smaller transactions will probably be the solution.
I would suggest looking at something like Rhino Service Bus or NServiceBus (my preference is Rhino Service Bus - I find it much simpler to work with personally). What that allows you to do is separate the functionality down into small chunks, but maintain the transactional nature. Essentially with a service bus, you send a message to initiate a piece of work, the piece of work will be enlisted in a distributed transaction along with receiving the message, so if something goes wrong, the message will not just disappear, leaving your system in a potentially inconsistent state.
Depending on what you need to do, you could send an initial message to start the processing, and then after each step, send a new message to initiate the next step. This can really help to break down the transactions into much smaller pieces of work (and simplify the code). The two service buses I mentioned (there is also Mass Transit), also have things like retries built in, and error handling, so that if something goes wrong, the message ends up in an error queue and you can investigate what went wrong, hopefully fix it, and reprocess the message, thus ensuring your system remains consistent.
Of course whether this is necessary depends on the requirements of your system :)
Another, but more complex solution would be:
You build a background robot application which runs on one of the machines
this background worker robot can receive "worker jobs" (the ones initiated by the user)
then, the robot processes the jobs step by step in the background
Pitfalls are:
- you have to program this robot to be very stable
- you need to monitor the robot somehow
Sure, this involves more work. On the flip side, you will have the option to integrate more job types, enabling your system to process different things in the background.
I think the design of your application or its SQL statements has a problem. Unless you are Facebook, I don't think any process should take this long. It is better to review your design and find where the bottlenecks are, instead of trying to keep this long-running process going.
Also, sometimes an ORM is not the right fit for every scenario. Did you try using a stored procedure (SP)?
Question:
Is there a way to force the Task Parallel Library to run multiple tasks simultaneously? Even if it means making the whole process run slower with all the added context switching on each core?
Background:
I'm fairly new to multithreading, so I could use some assistance. My initial research hasn't turned up much, but I also doubt I know what exactly to search for. Perhaps someone more experienced with multithreading can help me better understand TPL and/or find a better solution.
Our company is planning on deploying a piece of software to all users' machines that will connect to a central server a few times a day, and synchronize some files and MS Access data back to the user's machine. We would like to load-test this concept first and see how the Access DB holds up to lots of simultaneous connections.
I've been tasked with writing a .NET application that behaves like the client app (connecting & syncing with a network location), but does this on multiple threads simultaneously.
I've been getting familiar with the Task Parallel Library (TPL), as this seems like the best (newest) way to handle multithreading and get return values back from each thread easily. However, as I understand it, the TPL decides how to run each "task" for the fastest execution possible, splitting the work among the available cores. So let's say I want to run 30 sync jobs on a 2-core machine... the TPL would run 15 on each core, sequentially. This would mean my load test would only be hitting the Access DB with at most 2 connections at the same time. I want to hit the database with lots of simultaneous connections.
You can force the TPL to do this by specifying TaskCreationOptions.LongRunning. According to Reflector (not according to the docs, though) this always creates a new thread. I consider relying on this safe for production use.
Normal tasks will not do, because they don't guarantee that all tasks execute at the same time. Raising the thread pool minimum (ThreadPool.SetMinThreads) is a horrible solution (for production) because you are changing a process-global setting to solve a local problem. And even then, you are not guaranteed success.
Of course, you can also start threads. Tasks are more convenient though because of error handling. Nothing wrong with using threads for this use case.
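A possible shape for such a load test, using `TaskCreationOptions.LongRunning` as suggested above: each simulated client is started as a long-running task, and a `Barrier` holds them all back until every one of them is actually running. The `LoadTest` class and the `syncJob` delegate are invented for this sketch. The documentation describes `LongRunning` only as a hint to the scheduler, so the one-thread-per-task behaviour is what the default scheduler appears to do rather than a documented guarantee.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LoadTest
{
    // Starts one long-running task per simulated client and releases them all
    // at the same moment. syncJob is a placeholder for the real sync logic.
    public static void Run(int clients, Action<int> syncJob)
    {
        var tasks = new Task[clients];
        using (var allStarted = new Barrier(clients))
        {
            for (int i = 0; i < clients; i++)
            {
                int id = i; // copy for the closure
                tasks[i] = Task.Factory.StartNew(() =>
                {
                    allStarted.SignalAndWait(); // blocks until every client is running
                    syncJob(id);
                }, TaskCreationOptions.LongRunning);
            }
            Task.WaitAll(tasks); // rethrows failures as an AggregateException
        }
    }
}
```

No task can pass the barrier until all of them have reached it, so once `Run` returns you know that all clients were alive at the same time when the jobs began.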
Based on your comment, I think you should reconsider using Access in the first place. It doesn't scale well and has problems once the database grows to a certain size. Especially if this is simply served off some file share on your network.
You can try and simulate load from your single machine but I don't think that would be very representative of what you are trying to accomplish.
Have you considered using SQL Server Express? It's basically a de-tuned version of the full-blown SQL Server which might suit your needs better.
I want a certain action request to trigger a set of e-mail notifications. The user does something, and it sends the e-mails. However, I do not want the user to wait for the page response until the system generates and sends the e-mails. Should I use multithreading for this? Will this even work in ASP.NET MVC? I want the user to get a page response back and the system to just finish sending the e-mails at its own pace. I'm not even sure if this is possible or what the code would look like. (PS: Please don't offer me an alternative solution for sending e-mails; I don't have time for that kind of reconfiguration.)
SmtpClient.SendAsync is probably a better bet than manual threading, though multi-threading will work fine with the usual caveats.
http://msdn.microsoft.com/en-us/library/x5x13z6h.aspx
As other people have pointed out, success/failure cannot be indicated deterministically when the page returns before the send is actually complete.
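Since the outcome is only known after the page has returned, it helps to at least record it. Here is a minimal sketch around `SmtpClient.SendAsync` and its `SendCompleted` event; the `AsyncMail` class, the `host` parameter, and the `log` callback are placeholders for this example.

```csharp
using System;
using System.ComponentModel;
using System.Net.Mail;

public static class AsyncMail
{
    // Turns the completion arguments into a log line so failures are not silently lost.
    public static string Describe(AsyncCompletedEventArgs e)
    {
        if (e.Cancelled) return "cancelled: " + e.UserState;
        if (e.Error != null) return "failed: " + e.UserState + " (" + e.Error.Message + ")";
        return "sent: " + e.UserState;
    }

    // host and log are placeholders; the call returns before the send has finished.
    public static void Send(string host, MailMessage message, Action<string> log)
    {
        var client = new SmtpClient(host);
        client.SendCompleted += (sender, e) =>
        {
            log(Describe(e));      // runs when the send finishes, fails, or is cancelled
            message.Dispose();
            client.Dispose();
        };
        client.SendAsync(message, message.Subject); // second argument is the user token
    }
}
```

The handler is the only place where a delivery error becomes visible, so whatever `log` does should end up somewhere a person will actually look.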
A couple of observations when using asynchronous operations:
1) They will come back to bite you in some way or another. It's a risk versus benefit discussion. I like the SendAsync() method I proposed because it means forms can return instantly even if the email server takes a few seconds to respond. However, because it doesn't throw an exception, you can have a broken form and not even know it.
Of course unit testing should address this initially, but what if the production configuration file gets changed to point to a broken mail server? You won't know it, you won't see it in your logs, you only discover it when someone asks you why you never responded to the form they filled out. I speak from experience on this one. There are ways around this, but in practicality, async is always more work to test, debug, and maintain.
2) Threading in ASP.Net works in some situations if you understand the ThreadPool, app domain refreshes, locking, etc. I find that it is most useful for executing several operations at once to increase performance where the end result is deterministic, i.e. the application waits for all threads to complete. This way, you gain the performance benefits while still having a clear indication of results.
3) Threading/Async operations do not increase performance, only perceived performance. There may be some edge cases where that is not true (such as processor optimizations), but it's a good rule of thumb. Improperly used, threading can hurt performance or introduce instability.
The better scenario is out of process execution. For enterprise applications, I often move things out of the ASP.Net thread pool and into an execution service.
See this SO thread: Designing an asynchronous task library for ASP.NET
I know you are not looking for alternatives, but using a message queue (such as MSMQ) could be a good solution for this problem in the future. Using multithreading in ASP.NET is normally discouraged, but in your current situation I don't see why you shouldn't. It is definitely possible, but beware of the pitfalls related to multithreading (stolen here):
•There is a runtime overhead associated with creating and destroying threads. When your application creates and destroys threads frequently, this overhead affects the overall application performance.
•Having too many threads running at the same time decreases the performance of your entire system. This is because your system is attempting to give each thread a time slot to operate inside.
•You should design your application well when you are going to use multithreading, or otherwise your application will be difficult to maintain and extend.
•You should be careful when you implement a multithreading application, because threading bugs are difficult to debug and resolve.
At the risk of violating your no-alternative-solution prime directive, I suggest that you write the email requests to a SQL Server table and use SQL Server's Database Mail feature. You could also write a Windows service that monitors the table and sends emails, logging successes and failures in another table that you view through a separate ASP.Net page.
You probably can use ThreadPool.QueueUserWorkItem
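A small sketch of that approach, with the work wrapped in a try/catch so that a failure in the background does not go unnoticed. The `BackgroundWork` class and the `onError` callback are made up for this example.

```csharp
using System;
using System.Threading;

public static class BackgroundWork
{
    // Queues the work on the thread pool and reports any exception
    // to onError instead of letting it escape on the pool thread.
    public static void Queue(Action work, Action<Exception> onError)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try { work(); }
            catch (Exception ex) { onError(ex); }
        });
    }
}
```

A controller action could call `BackgroundWork.Queue(...)` with the e-mail sending code and return its view immediately; the `onError` callback is where a failed send would be logged.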
Yes this is an appropriate time to use multi-threading.
One thing to look out for, though, is how you will tell the user when the e-mail sending ultimately fails. Not blocking the user is a good step towards improving your UI, but it must not give a false sense of success when the send ultimately fails at a later time.
I don't know if any of the above links mentioned it, but don't forget to keep an eye on request timeout values; the queued items will still need to complete within that time period.