I am using Disruptor-net in a C# application. I'm having some trouble understanding how to do async operations in the disruptor pattern.
Assuming I have a few event handlers, and the last one in the chain hands a message off to my business logic processors, how do I handle async operations inside of my business logic processor? When my business logic needs to do some database insert, does it hand a message off to my output disruptor, which does the insert, then publishes a new message on my input disruptor with all the state to continue the transaction?
In addition, within my output disruptor, would I use Tasks? I'm 99.9% sure I'd want to use Tasks so I don't have a ton of event handlers blocking on async operations. How does that fit in with the disruptor pattern, then? It seems kind of weird to just do something like this in my EventHandler...
void OnEvent(MyEvent evt, long sequence, bool endOfBatch)
{
    db.InsertAsync(evt).ContinueWith(task => inputDisruptor.Publish(task));
}
The Disruptor has the following features:
Dedicated threads, which can be pinned / shielded / prioritized for better performance.
Explicit queues, which can be monitored and generate backpressure.
In-order message processing.
No heap allocations, which can help reduce GC pauses, or even remove them if your own code does not generate heap allocations.
Your code sample does not really follow the Disruptor philosophy:
Task.ContinueWith runs asynchronously by default, so the continuation will use thread-pool threads.
Because you are using the thread-pool, you have no guarantee on the continuation execution order. Even if you use TaskContinuationOptions.ExecuteSynchronously, you have no guarantee that InsertAsync will invoke the continuations in-order.
You are creating an implicit queue with all the pending insert operations. This queue is hidden and does not generate backpressure.
I will put aside the fact that your code is generating heap allocations. You will not benefit from the "no GC pauses" effect but it is probably very acceptable for your use-case.
Also, please note that batching is crucial to support high-throughput for IO operations. You should really use the Disruptor batches in your event handler.
I will simplify the problem to 3 event handlers:
PreInsertEventHandler: pre-insert logic (not shown here)
InsertEventHandler: insert logic
PostInsertEventHandler: post-insert logic
Of course, I am assuming that the post-insert logic must be run only after insert completion.
If your goal is to save the events in InsertEventHandler and to block until completion before processing the event in the next handler, you should probably just wait in InsertEventHandler.
InsertEventHandler:
private readonly List<(MyEvent evt, Task task)> _pendingInserts = new List<(MyEvent, Task)>();
private readonly TimeSpan _insertTimeout = TimeSpan.FromSeconds(30); // pick a timeout that suits your DB

void OnEvent(MyEvent evt, long sequence, bool endOfBatch)
{
    // Start the insert but do not wait yet; collect the pending tasks.
    _pendingInserts.Add((evt, task: db.InsertAsync(evt)));
    if (endOfBatch)
    {
        // Wait once per Disruptor batch instead of once per event.
        var insertSucceeded = Task.WaitAll(_pendingInserts.Select(x => x.task).ToArray(), _insertTimeout);
        foreach (var (pendingEvent, _) in _pendingInserts)
        {
            pendingEvent.InsertSucceeded = insertSucceeded;
        }
        _pendingInserts.Clear();
    }
}
Of course, if your DB API exposes a bulk-insert method, it might be better to add the events in a list and to save them all at the end of the batch.
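For illustration, a minimal sketch of that bulk-insert variant, assuming a hypothetical db.BulkInsertAsync(List<MyEvent>) method and the same _insertTimeout as above:

private readonly List<MyEvent> _batch = new List<MyEvent>();

void OnEvent(MyEvent evt, long sequence, bool endOfBatch)
{
    _batch.Add(evt);
    if (endOfBatch)
    {
        // One round-trip to the database for the whole Disruptor batch.
        // BulkInsertAsync is hypothetical; substitute your DB API's bulk method.
        var insertSucceeded = db.BulkInsertAsync(_batch).Wait(_insertTimeout);
        foreach (var pendingEvent in _batch)
        {
            pendingEvent.InsertSucceeded = insertSucceeded;
        }
        _batch.Clear();
    }
}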
There are many other options, like waiting in PostInsertEventHandler, or queueing the insert results in another Disruptor, each coming with its own pros and cons. A SO answer might not be the best place to discuss and analyze all of them.
Related
I've been working on a system which deals with a large queue of inserts to a SQL database. The data for these inserts is fetched from a series of APIs, making the overall operation a little time-consuming and a bit heavy due to complex deserialization. To make the overall process more efficient, I came up with the idea of encapsulating the data processing and the insert operation for each API call into a single Task, pushing each Task into a ConcurrentQueue, and monitoring them later for either completion or failure. To do so, I developed a wrapper around the Task type with an assignable Id which belongs to its corresponding data. I have implemented this monitoring as follows:
while (processes.TryDequeue(out TaskInfo taskInfo))
{
    if (!taskInfo.Task.IsCompleted)
    {
        processes.Enqueue(taskInfo);
        continue;
    }

    if (taskInfo.Task.IsCompletedSuccessfully)
    {
        Console.WriteLine("{0} Completed.", taskInfo.ReferenceId);
    }
    else
    {
        Console.WriteLine("{0} Failed With {1}.", taskInfo.ReferenceId, taskInfo.Task.Exception.Message);
    }
}
As you see, I do not await the task; instead I check its completion status, and if it is not yet completed I enqueue it again. The reason I did that is because I believed that, by doing so, I could skip waiting on a long-running task by moving it to the end of the collection, so that I could move on to the next task and make the monitoring process a bit faster.
I would like to know if what I have done is a bad approach, particularly in comparison with the WhenAll method built into the Task type. I'm also unsure whether what I did is a proper usage of the ConcurrentQueue type.
Since you are working that queue with only a single thread, you can use a regular queue.
ConcurrentQueue is helpful when you access the queue from multiple threads, e.g. from the threads running the tasks inside your queue :)
While reading your code I wondered whether you have heard of the TPL: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/task-parallel-library-tpl
or PLINQ:
https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/parallel-linq-plinq
These may help you out with your task(s) ;)
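For comparison with your polling loop, here is a minimal sketch of the Task.WhenAll approach mentioned in the question, assuming your TaskInfo wrapper exposes ReferenceId and Task properties:

async Task MonitorAsync(IReadOnlyList<TaskInfo> processes)
{
    try
    {
        // A single await replaces the polling loop; it completes when
        // every task has finished, one way or another.
        await Task.WhenAll(processes.Select(p => p.Task));
    }
    catch
    {
        // Ignore here; failures are reported per-task below.
    }

    foreach (var taskInfo in processes)
    {
        if (taskInfo.Task.IsCompletedSuccessfully)
            Console.WriteLine("{0} Completed.", taskInfo.ReferenceId);
        else
            Console.WriteLine("{0} Failed With {1}.", taskInfo.ReferenceId,
                taskInfo.Task.Exception?.GetBaseException().Message ?? "canceled");
    }
}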
I have tried to simplify my issue with the sample code here. I have a producer thread constantly pumping in data, and I am trying to batch it with a time delay between batches so that the UI has time to render it. But the result is not as expected: the producer and consumer seem to be on the same thread.
I don't want the batch buffer to sleep on the thread that is producing. I tried SubscribeOn, but it did not help much. What am I doing wrong here? How do I get this to print different thread IDs on the producer and consumer threads?
static void Main(string[] args)
{
    var stream = new ReplaySubject<int>();

    Task.Factory.StartNew(() =>
    {
        int seed = 1;
        while (true)
        {
            Console.WriteLine("Thread {0} Producing {1}",
                Thread.CurrentThread.ManagedThreadId, seed);
            stream.OnNext(seed);
            seed++;
            Thread.Sleep(TimeSpan.FromMilliseconds(500));
        }
    });

    stream.Buffer(5)
        .Do(x =>
        {
            Console.WriteLine("Thread {0} sleeping to create time gap between batches",
                Thread.CurrentThread.ManagedThreadId);
            Thread.Sleep(TimeSpan.FromSeconds(2));
        })
        .SubscribeOn(NewThreadScheduler.Default)
        .Subscribe(items =>
        {
            foreach (var item in items)
            {
                Console.WriteLine("Thread {0} Consuming {1}",
                    Thread.CurrentThread.ManagedThreadId, item);
            }
        });

    Console.Read();
}
Understanding the difference between ObserveOn and SubscribeOn is key here. See - ObserveOn and SubscribeOn - where the work is being done for an in depth explanation of these.
Also, you absolutely don't want to use a Thread.Sleep in your Rx. Or anywhere. Ever. Do is almost as evil, but Thread.Sleep is almost always totally evil. Buffer has several overloads you want to use instead - these include a time-based overload and an overload that accepts a count limit and a time limit, returning a buffer when either of these is reached. A time-based buffering will introduce the necessary concurrency between producer and consumer - that is, deliver the buffer to its subscriber on a separate thread from the producer.
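For example, a minimal sketch of the consumer side using the count-plus-time overload; the two-second window and batch size of five are carried over from your code, and ObserveOn moves consumption off the producer's thread, per the explanation above:

stream.Buffer(TimeSpan.FromSeconds(2), 5)
      .ObserveOn(NewThreadScheduler.Default)
      .Subscribe(items =>
      {
          foreach (var item in items)
          {
              Console.WriteLine("Thread {0} Consuming {1}",
                  Thread.CurrentThread.ManagedThreadId, item);
          }
      });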
Also see these questions and answers which have good discussions on keeping consumers responsive (in the context of WPF here, but the points are generally applicable).
Process lots of small tasks and keep the UI responsive
Buffer data from database cursor while keeping UI responsive
The last question above specifically uses the time-based buffer overload. As I said, using Buffer or ObserveOn in your call chain will allow you to add concurrency between producer and consumer. You still need to take care that the processing of a buffer is still fast enough that you don't get a queue building up on the buffer subscriber.
If queues do build up, you'll need to think about means of applying backpressure, dropping updates and/or conflating the updates. This is a big topic, too broad for in-depth discussion here - but basically you either:
Drop events. There have been many ways discussed to tackle this in Rx. I currently like Ignore incoming stream updates if last callback hasn't finished yet, but also see With Rx, how do I ignore all-except-the-latest value when my Subscribe method is running, and there are many other discussions of this.
Signal the producer out of band to tell it to slow down or send conflated updates, or
You introduce an operator that does in-stream conflation - like a smarter Buffer that could compress events to, for example, only include the latest price on a stock item etc. You can author operators that are sensitive to the time that OnNext invocations take to process, for example.
See if proper buffering helps first, then think about throttling/conflating events at the source (a UI can only show so much information anyway) - then consider smarter conflation, as this can get quite complex. https://github.com/AdaptiveConsulting/ReactiveTrader is a good example of a project using some advanced conflation techniques.
Although the other answers are correct, I'd like to identify your actual problem as perhaps a misunderstanding of the behavior of Rx. Putting the producer to sleep blocks subsequent calls to OnNext and it seems as though you're assuming Rx automatically calls OnNext concurrently, but in fact it doesn't for very good reasons. Actually, Rx has a contract that requires serialized notifications.
See §§4.2, 6.7 in the Rx Design Guidelines for details.
Ultimately, it looks as though you're trying to implement the BufferIntrospective operator from Rxx. This operator allows you to pass in a concurrency-introducing scheduler, similar to ObserveOn, to create a concurrency boundary between a producer and a consumer. BufferIntrospective is a dynamic backpressure strategy that pushes out heterogeneously-sized batches based on the changing latencies of an observer. While the observer is processing the current batch, the operator buffers all incoming concurrent notifications. To accomplish this, the operator takes advantage of the fact that OnNext is a blocking call (per the §4.2 contract) and for that reason this operator should be applied as close to the edge of the query as possible, generally immediately before you call Subscribe.
As James described, you could call it a "smart buffering" strategy itself, or see it as the baseline for implementing such a strategy; e.g., I've also defined a SampleIntrospective operator that drops all but the last notification in each batch.
ObserveOn is probably what you want. It takes a SynchronizationContext as an argument, that should be the SynchronizationContext of your UI. If you don't know how to get it, see Using SynchronizationContext for sending events back to the UI for WinForms or WPF
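A short sketch of that, assuming this code runs on the UI thread so that SynchronizationContext.Current is the UI context (RenderItems is a placeholder for your actual UI update):

var uiContext = SynchronizationContext.Current;

stream.Buffer(TimeSpan.FromSeconds(2))
      .ObserveOn(uiContext)
      .Subscribe(items => RenderItems(items)); // runs on the UI thread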
I'm just beginning to learn C# threading and concurrent collections, and am not sure of the proper terminology to pose my question, so I'll describe briefly what I'm trying to do. My grasp of the subject is rudimentary at best at this point. Is my approach below even feasible as I've envisioned it?
I have 100,000 urls in a Concurrent collection that must be tested--is the link still good? I have another concurrent collection, initially empty, that will contain the subset of urls that an async request determines to have been moved (400, 404, etc errors).
I want to spawn as many of these async requests concurrently as my PC and our bandwidth will allow, and was going to start at 20 async-web-request-tasks per second and work my way up from there.
Would it work if a single async task handled both things: it would make the async request and then add the url to the BadUrls collection if it encountered a 4xx error? A new instance of that task would be spawned every 50ms:
class TestArgs
{
    ConcurrentBag<UrlInfo> myCollection { get; set; }
    System.Uri currentUrl { get; set; }
}

ConcurrentQueue<UrlInfo> Urls = new ConcurrentQueue<UrlInfo>();
// populate the Urls queue
<snip>

// initialize the bad urls collection
ConcurrentBag<UrlInfo> BadUrls = new ConcurrentBag<UrlInfo>();

// timer fires every 50ms, whereupon a new args object is created
// and the timer callback spawns a new task; an autoEvent would
// reset the timer and dispose of it when the queue was empty
void SpawnNewUrlTask()
{
    // if queue is empty then reset the timer
    // otherwise:
    TestArgs args = new TestArgs
    {
        myCollection = BadUrls,
        currentUrl = getNextUrl() // take an item from the queue
    };
    Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
}

public async Task asyncWebRequestAndConcurrentCollectionUpdater(TestArgs args)
{
    // make the async web request
    // add the url to the bad collection if appropriate.
}
Feasible? Way off?
The approach seems fine, but there are some issues with the specific code you've shown.
But before I get to that, there have been suggestions in the comments that Task Parallelism is the way to go. I think that's misguided. There's a common misconception that if you want to have lots of work going on in parallel, you necessarily need lots of threads. That's only true if the work is compute-bound. But the work you're doing will be IO bound - this code is going to spend the vast majority of its time waiting for responses. It will do very little computation. So in practice, even if it only used a single thread, your initial target of 20 requests per second doesn't seem like a workload that would cause a single CPU core to break into a sweat.
In short, a single thread can handle very high levels of concurrent IO. You only need multiple threads if you need parallel execution of code, and that doesn't look likely to be the case here, because there's so little work for the CPU in this particular job.
(This misconception predates await and async by years. In fact, it predates the TPL - see http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless for a .NET 1.1 era illustration of how you can handle thousands of concurrent requests with a tiny number of threads. The underlying principles still apply today because Windows networking IO still basically works the same way.)
Not that there's anything particularly wrong with using multiple threads here, I'm just pointing out that it's a bit of a distraction.
Anyway, back to your code. This line is problematic:
Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
While you've not given us all your code, I can't see how that will compile. The overloads of StartNew that accept two arguments require the first to be either an Action, an Action<object>, a Func<TResult>, or a Func<object, TResult>. In other words, it has to be a method that either takes no arguments or accepts a single argument of type object (and which may or may not return a value). Your asyncWebRequestAndConcurrentCollectionUpdater takes an argument of type TestArgs.
But the fact that it doesn't compile isn't the main problem. That's easily fixed. (E.g., change it to Task.Factory.StartNew(() => asyncWebRequestAndConcurrentCollectionUpdater(args));) The real issue is what you're doing is a bit weird: you're using Task.StartNew to invoke a method that already returns a Task.
Task.StartNew is a handy way to take a synchronous method (i.e., one that doesn't return a Task) and run it in a non-blocking way. (It'll run on the thread pool.) But if you've got a method that already returns a Task, then you didn't really need to use Task.StartNew. The weirdness becomes more apparent if we look at what Task.StartNew returns (once you've fixed the compilation error):
Task<Task> t = Task.Factory.StartNew(
() => asyncWebRequestAndConcurrentCollectionUpdater(args));
That Task<Task> reveals what's happening. You've decided to wrap a method that was already asynchronous with a mechanism that is normally used to make non-asynchronous methods asynchronous. And so you've now got a Task that produces a Task.
One of the slightly surprising upshots of this is that if you were to wait for the task returned by StartNew to complete, the underlying work would not necessarily be done:
t.Wait(); // doesn't wait for asyncWebRequestAndConcurrentCollectionUpdater to finish!
All that will actually do is wait for asyncWebRequestAndConcurrentCollectionUpdater to return a Task. And since asyncWebRequestAndConcurrentCollectionUpdater is already an async method, it will return a task more or less immediately. (Specifically, it'll return a task the moment it performs an await that does not complete immediately.)
If you want to wait for the work you've kicked off to finish, you'll need to do this:
t.Result.Wait();
or, potentially more efficiently, this:
t.Unwrap().Wait();
That says: get me the Task that my async method returned, and then wait for that. This may not be usefully different from this much simpler code:
Task t = asyncWebRequestAndConcurrentCollectionUpdater("foo");
... maybe queue up some other tasks ...
t.Wait();
You may not have gained anything useful by introducing Task.Factory.StartNew.
I say "may" because there's an important qualification: it depends on the context in which you start the work. C# generates code which, by default, attempts to ensure that when an async method continues after an await, it does so in the same context in which the await was initially performed. E.g., if you're in a WPF app and you await while on the UI thread, when the code continues it will arrange to do so on the UI thread. (You can disable this with ConfigureAwait.)
So if you're in a situation in which the context is essentially serialized (either because it's single-threaded, as will be the case in a GUI app, or because it uses something resembling a rental model, e.g. the context of a particular ASP.NET request), it may actually be useful to kick an async task off via Task.Factory.StartNew because it enables you to escape the original context. However, you've just made your life harder - tracking your tasks to completion is somewhat more complex. And you might have been able to achieve the same effect simply by using ConfigureAwait inside your async method.
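For illustration, here is a sketch of that ConfigureAwait alternative; HttpClient and the UrlInfo.Url property are assumptions for the example, not part of the question's code:

private static readonly HttpClient Client = new HttpClient();

public async Task CheckUrlAsync(UrlInfo urlInfo)
{
    // ConfigureAwait(false): don't resume on the captured context;
    // the continuation may run on a thread-pool thread instead.
    var response = await Client.GetAsync(urlInfo.Url).ConfigureAwait(false);
    if ((int)response.StatusCode >= 400)
    {
        BadUrls.Add(urlInfo);
    }
}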
And it may not matter anyway - if you're only attempting to manage 20 requests a second, the minimal amount of CPU effort required to do that means that you can probably manage it entirely adequately on one thread. (Also, if this is a console app, the default context will come into play, which uses the thread pool, so your tasks will be able to run multithreaded in any case.)
But to get back to your question, it seems entirely reasonable to me to have a single async method that picks a url off the queue, makes the request, examines the response, and if necessary, adds an entry to the bad url collection. And kicking the things off from a timer also seems reasonable - that will throttle the rate at which connections are attempted without getting bogged down with slow responses (e.g., if a load of requests end up attempting to talk to servers that are offline). It might be necessary to introduce a cap for the maximum number of requests in flight if you hit some pathological case where you end up with tens of thousands of URLs in a row all pointing to a server that isn't responding. (On a related note, you'll need to make sure that you're not going to hit any per-client connection limits with whichever HTTP API you're using - that might end up throttling the effective throughput.)
You will need to add some sort of completion handling - just kicking off asynchronous operations and not doing anything to handle the results is bad practice, because you can end up with exceptions that have nowhere to go. (In .NET 4.0, these used to terminate your process, but as of .NET 4.5, by default an unhandled exception from an asynchronous operation will simply be ignored!) And if you end up deciding that it is worth launching via Task.Factory.StartNew remember that you've ended up with an extra layer of wrapping, so you'll need to do something like myTask.Unwrap().ContinueWith(...) to handle it correctly.
Of course you can. Concurrent collections are called 'concurrent' because they can be used... concurrently by multiple threads, with some guarantees about their behaviour.
A ConcurrentQueue will ensure that each element inserted in it is extracted exactly once (concurrent threads will never extract the same item by mistake, and once the queue is empty, all the items have been extracted by a thread).
EDIT: the only thing that could go wrong is that 50ms is not enough to complete the request, and so more and more tasks accumulate in the task queue. If that happens, your memory could fill up, but the thing would work anyway. So yes, it is feasible.
Anyway, I would like to underline the fact that a task is not a thread. Even if you create 100 tasks, the framework will decide how many of them will be actually executed concurrently.
If you want to have more control on the level of parallelism, you should use asynchronous requests.
In your comments, you wrote "async web request", but I can't understand if you wrote async just because it's on a different thread or because you intend to use the async API.
If you were using the async API, I'd expect to see some handler attached to the completion event, but I couldn't see it, so I assumed you're using synchronous requests issued from an asynchronous task.
If you're using asynchronous requests, then it's pointless to use tasks, just use the timer to issue the async requests, since they are already asynchronous.
When I say "asynchronous request" I'm referring to methods like WebRequest.GetResponseAsync and WebRequest.BeginGetResponse.
EDIT2: if you want to use asynchronous requests, then you can just make requests from the timer handler. The BeginGetResponse method takes two arguments. The first one is a callback procedure that will be called to report the status of the request; you can pass the same procedure for all the requests. The second one is a user-provided object which stores state about the request; you can use this argument to differentiate among different requests. You can even do it without the timer. Something like:
private readonly int desiredConcurrency = 20;

struct RequestData
{
    public UrlInfo url;
    public HttpWebRequest request;
}

/// Handles the completion of an asynchronous request.
/// When a request has been completed,
/// tries to issue a new request to another url.
private void AsyncRequestHandler(IAsyncResult ar)
{
    if (ar.IsCompleted)
    {
        RequestData data = (RequestData)ar.AsyncState;
        HttpWebResponse resp = (HttpWebResponse)data.request.EndGetResponse(ar);
        if (resp.StatusCode != HttpStatusCode.OK)
        {
            BadUrls.Add(data.url);
        }
        // A request has been completed, try to start a new one
        TryIssueRequest();
    }
}

/// If urls is not empty, dequeues a url from it
/// and issues a new request to the extracted url.
private bool TryIssueRequest()
{
    RequestData rd;
    if (urls.TryDequeue(out rd.url))
    {
        rd.request = CreateRequestTo(rd.url); // TODO implement
        rd.request.BeginGetResponse(AsyncRequestHandler, rd);
        return true;
    }
    else
    {
        return false;
    }
}

// Called by a button handler, or something like that
void StartTheRequests()
{
    for (int requestCount = 0; requestCount < desiredConcurrency; ++requestCount)
    {
        if (!TryIssueRequest()) break;
    }
}
I am trying to better understand the Async and the Parallel options I have in C#. In the snippets below, I have included the 5 approaches I come across most. But I am not sure which to choose - or better yet, what criteria to consider when choosing:
Method 1: Task
(see http://msdn.microsoft.com/en-us/library/dd321439.aspx)
Calling StartNew is functionally equivalent to creating a Task using one of its constructors and then calling Start to schedule it for execution. However, unless creation and scheduling must be separated, StartNew is the recommended approach for both simplicity and performance.
TaskFactory's StartNew method should be the preferred mechanism for creating and scheduling computational tasks, but for scenarios where creation and scheduling must be separated, the constructors may be used, and the task's Start method may then be used to schedule the task for execution at a later time.
// using System.Threading.Tasks.Task.Factory
void Do_1()
{
    var _List = GetList();
    _List.ForEach(i => Task.Factory.StartNew(() => DoSomething(i)));
}
Method 2: QueueUserWorkItem
(see http://msdn.microsoft.com/en-us/library/system.threading.threadpool.getmaxthreads.aspx)
You can queue as many thread pool requests as system memory allows. If there are more requests than thread pool threads, the additional requests remain queued until thread pool threads become available.
You can place data required by the queued method in the instance fields of the class in which the method is defined, or you can use the QueueUserWorkItem(WaitCallback, Object) overload that accepts an object containing the necessary data.
// using System.Threading.ThreadPool
void Do_2()
{
    var _List = GetList();
    var _Action = new WaitCallback((o) => { DoSomething(o); });
    _List.ForEach(x => ThreadPool.QueueUserWorkItem(_Action, x));
}
Method 3: Parallel.ForEach
(see: http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx)
The Parallel class provides library-based data parallel replacements for common operations such as for loops, for each loops, and execution of a set of statements.
The body delegate is invoked once for each element in the source enumerable. It is provided with the current element as a parameter.
// using System.Threading.Tasks.Parallel
void Do_3()
{
    var _List = GetList();
    var _Action = new Action<object>((o) => { DoSomething(o); });
    Parallel.ForEach(_List, _Action);
}
Method 4: IAsync.BeginInvoke
(see: http://msdn.microsoft.com/en-us/library/cc190824.aspx)
BeginInvoke is asynchronous; therefore, control returns immediately to the calling object after it is called.
// using IAsync.BeginInvoke()
void Do_4()
{
    var _List = GetList();
    var _Action = new Action<object>((o) => { DoSomething(o); });
    _List.ForEach(x => _Action.BeginInvoke(x, null, null));
}
Method 5: BackgroundWorker
(see: http://msdn.microsoft.com/en-us/library/system.componentmodel.backgroundworker.aspx)
To set up for a background operation, add an event handler for the DoWork event. Call your time-consuming operation in this event handler. To start the operation, call RunWorkerAsync. To receive notifications of progress updates, handle the ProgressChanged event. To receive a notification when the operation is completed, handle the RunWorkerCompleted event.
// using System.ComponentModel.BackgroundWorker
void Do_5()
{
    var _List = GetList();
    using (BackgroundWorker _Worker = new BackgroundWorker())
    {
        _Worker.DoWork += (s, arg) =>
        {
            arg.Result = arg.Argument;
            DoSomething(arg.Argument);
        };
        _Worker.RunWorkerCompleted += (s, arg) =>
        {
            _List.Remove(arg.Result);
            if (_List.Any())
                _Worker.RunWorkerAsync(_List[0]);
        };
        if (_List.Any())
            _Worker.RunWorkerAsync(_List[0]);
    }
}
I suppose the obvious criteria would be:
Is any better than the other for performance?
Is any better than the other for error handling?
Is any better than the other for monitoring/feedback?
But, how do you choose?
Thanks in advance for your insights.
Going to take these in an arbitrary order:
BackgroundWorker (#5)
I like to use BackgroundWorker when I'm doing things with a UI. The advantage it has is that the progress and completion events fire on the UI thread, which means you don't get nasty exceptions when you try to change UI elements. It also has a nice built-in way of reporting progress. One disadvantage of this mode is that if you have blocking calls (like web requests) in your work, you'll have a thread sitting around doing nothing while the work is happening. This is probably not a problem if you think you'll only have a handful of them, though.
IAsyncResult/Begin/End (APM, #4)
This is a widespread and powerful but difficult model to use. Error handling is troublesome since you need to re-catch exceptions on the End call, and uncaught exceptions won't necessarily make it back to any relevant pieces of code that can handle it. This has the danger of permanently hanging requests in ASP.NET or just having errors mysteriously disappear in other applications. You also have to be vigilant about the CompletedSynchronously property. If you don't track and report this properly, the program can hang and leak resources. The flip side of this is that if you're running inside the context of another APM, you have to make sure that any async methods you call also report this value. That means doing another APM call or using a Task and casting it to an IAsyncResult to get at its CompletedSynchronously property.
There's also a lot of overhead in the signatures: you have to support an arbitrary state object to pass through, and you have to make your own IAsyncResult implementation if you're writing an async method that supports polling and wait handles (even if you're only using the callback). By the way, you should only be using the callback here. When you use the wait handle or poll IsCompleted, you're wasting a thread while the operation is pending.
Event-based Asynchronous Pattern (EAP)
One that was not on your list but I'll mention for the sake of completeness. It's a little bit friendlier than the APM. There are events instead of callbacks and there's less junk hanging onto the method signatures. Error handling is a little easier since it's saved and available in the callback rather than re-thrown. CompletedSynchronously is also not part of the API.
Tasks (#1)
Tasks are another friendly async API. Error handling is straightforward: the exception is always there for inspection on the callback and nobody cares about CompletedSynchronously. You can do dependencies and it's a great way to handle execution of multiple async tasks. You can even wrap APM or EAP async methods in them. Another good thing about using tasks is your code doesn't care how the operation is implemented. It may block on a thread or be totally asynchronous, but the consuming code doesn't care about this. You can also mix APM and EAP operations easily with Tasks.
Parallel.For methods (#3)
These are additional helpers on top of Tasks. They can do some of the work to create tasks for you and make your code more readable, if your async tasks are suited to run in a loop.
ThreadPool.QueueUserWorkItem (#2)
This is a low-level utility that's actually used by ASP.NET for all requests. It doesn't have any built-in error handling like tasks so you have to catch everything and pipe it back up to your app if you want to know about it. It's suitable for CPU-intensive work but you don't want to put any blocking calls on it, such as a synchronous web request. That's because as long as it runs, it's using up a thread.
async / await Keywords
New in .NET 4.5, these keywords let you write async code without explicit callbacks. You can await a Task, and any code below the await will run after that async operation completes, without consuming a thread while waiting.
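A minimal sketch, using an HTTP download as the asynchronous operation (the URL parameter is arbitrary):

static async Task<int> GetPageLengthAsync(string url)
{
    using (var client = new HttpClient())
    {
        // The calling thread is released here; execution resumes
        // when the download completes - no explicit callback needed.
        string body = await client.GetStringAsync(url);
        return body.Length;
    }
}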
Your first, third and fourth examples use the ThreadPool implicitly, because by default Tasks are scheduled on the ThreadPool and the TPL extensions use the ThreadPool as well; the API simply hides some of the complexity (see here and here). BackgroundWorkers are part of the ComponentModel namespace because they are meant for use in UI scenarios.
Reactive Extensions is another, newer library for handling asynchronous programming, especially when it comes to composition of asynchronous events and methods.
It's not part of the framework itself; however, it's developed by Microsoft (it started out in DevLabs). It's available for both .NET 3.5 and .NET 4.0, and is essentially a collection of extension methods on the IObservable<T> interface introduced in .NET 4.0.
There are a lot of examples and tutorials on their main site, and I strongly recommend checking some of them out. The pattern might seem a bit odd at first (at least for .NET programmers), but it's well worth it, even if it's just to grasp the new concept.
The real strength of Reactive Extensions (Rx.NET) is when you need to compose multiple asynchronous sources and events. All operators are designed with this in mind and handle the ugly parts of asynchrony for you.
Main site: http://msdn.microsoft.com/en-us/data/gg577609
Beginner's guide: http://msdn.microsoft.com/en-us/data/gg577611
Examples: http://rxwiki.wikidot.com/101samples
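To give a flavour of the pattern, a tiny example (requires the Rx assemblies; the even-tick filter is arbitrary):

IDisposable subscription = Observable
    .Interval(TimeSpan.FromSeconds(1)) // ticks 0, 1, 2, ... once per second
    .Where(tick => tick % 2 == 0)      // keep only even ticks
    .Subscribe(tick => Console.WriteLine("Even tick: {0}", tick));

// Disposing the subscription stops the notifications.
// subscription.Dispose();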
That said, the best async pattern probably depends on what situation you're in. Some are better (simpler) for simpler stuff and some are more extensible and easier to handle when it comes to more complex scenarios. I cannot speak for all the ones you're mentioning though.
The last one is the best for criteria 2 and 3, at least. It has built-in methods/properties for this.
The other variants are almost the same, just different versions/convenient wrappers.
Greetings.
I'm trying to implement some multithreaded code in an application. The purpose of this code is to validate items that the database gives it. Validation can take quite a while (a few hundred ms to a few seconds), so this process needs to be forked off into its own thread for each item.
The database may give it 20 or 30 items a second in the beginning, but that begins to decline rapidly, eventually reaching about 65K items over 24 hours, at which point the application exits.
I'd like it if anyone more knowledgeable could take a peek at my code and see if there's any obvious problems. No one I work with knows multithreading, so I'm really just on my own, on this one.
Here's the code. It's kinda long but should be pretty clear. Let me know if you have any feedback or advice. Thanks!
public class ItemValidationService
{
    /// <summary>
    /// The object to lock on in this class, for multithreading purposes.
    /// </summary>
    private static object locker = new object();

    /// <summary>Items that have been validated.</summary>
    private HashSet<int> validatedItems = new HashSet<int>();

    /// <summary>Items that are currently being validated.</summary>
    private HashSet<int> validatingItems = new HashSet<int>();

    /// <summary>Remove an item from the index if its links are bad.</summary>
    /// <param name="id">The ID of the item.</param>
    public void ValidateItem(int id)
    {
        lock (locker)
        {
            if (!this.validatedItems.Contains(id) &&
                !this.validatingItems.Contains(id))
            {
                ThreadPool.QueueUserWorkItem(sender =>
                {
                    this.Validate(id);
                });
            }
        }
    } // method

    private void Validate(int itemId)
    {
        lock (locker)
        {
            this.validatingItems.Add(itemId);
        }

        // *********************************************
        // Time-consuming routine to validate an item...
        // *********************************************

        lock (locker)
        {
            this.validatingItems.Remove(itemId);
            this.validatedItems.Add(itemId);
        }
    } // method
} // class
The thread pool is a convenient choice if you have lightweight, sporadic processing that isn't time-sensitive. However, I recall reading on MSDN that it's not appropriate for large-scale processing of this nature.
I used it for something quite similar to this and regret it. I took a worker-thread approach in subsequent apps and am much happier with the level of control I have.
My favorite pattern in the worker-thread model is to create a master thread which holds a queue of task items, then fork a bunch of workers that pop items off that queue to process. I use a blocking queue so that when there are no items to process, the workers just block until something is pushed onto the queue. In this model, the master thread produces work items from some source (db, etc.) and the worker threads consume them.
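A minimal sketch of this pattern, assuming BlockingCollection<T> (the framework's blocking queue since .NET 4) is available; WorkItem, Process and LoadItemsFromSource are placeholders:

var queue = new BlockingCollection<WorkItem>(boundedCapacity: 100);

// Workers: block until the master pushes something onto the queue.
var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => new Thread(() =>
    {
        foreach (var item in queue.GetConsumingEnumerable())
            Process(item);
    }))
    .ToList();
workers.ForEach(w => w.Start());

// Master: produce work items from some source (db, etc.).
foreach (var item in LoadItemsFromSource())
    queue.Add(item); // blocks when the queue is full, giving backpressure

queue.CompleteAdding(); // lets GetConsumingEnumerable end so workers exit
workers.ForEach(w => w.Join());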
I second the idea of using a blocking queue and worker threads. Here is a blocking queue implementation that I've used in the past with good results:
https://www.codeproject.com/Articles/8018/Bounded-Blocking-Queue-One-Lock
What's involved in your validation logic? If it's mainly CPU-bound, then I would create no more than one worker thread per processor/core on the box. This will tell you the number of processors:
Environment.ProcessorCount
If your validation involves I/O such as File Access or database access then you could use a few more threads than the number of processors.
Be careful, QueueUserWorkItem might fail
There is a possible logic error in the code posted with the question, depending on where the item id in ValidateItem(int id) comes from. Why? Because although you correctly lock your validatingItems and validatedItems sets before queuing a work item, you do not add the item to the validatingItems set until the new thread spins up. That means there could be a time gap where another thread calls ValidateItem(id) with the same id (unless this is running on a single main thread).
I would add the item to the validatingItems set just before queuing the work item, inside the lock.
Edit: also, QueueUserWorkItem() returns a bool, so you should use the return value to make sure the item was queued, and THEN add it to the validatingItems set.
ThreadPool may not be optimal for jamming so much at once into it. You may want to research the upper limits of its capabilities and/or roll your own.
Also, there is a race condition that exists in your code, if you expect no duplicate validations. The call to
this.validatingItems.Add(itemId);
needs to happen in the main thread (ValidateItem), not in the thread pool thread (the Validate method). This call should occur a line before the queuing of the work item to the pool.
A worse bug is found by not checking the return of QueueUserWorkItem. Queuing can fail, and why it doesn't throw an exception is a mystery to us all. If it returns false, you need to remove the item that was added to the validatingItems list, and handle the error (throw an exception, probably).
I would be concerned about performance here. You indicated that the database may give it 20-30 items per second and an item could take up to a few seconds to be validated. That could be quite a large number of threads -- using your metrics, worst case 60-90 threads! I think you need to reconsider the design here. Michael mentioned a nice pattern. The use of the queue really helps keep things under control and organized. A semaphore could also be employed to control the number of threads created -- i.e. you could have a maximum number of threads allowed, but under smaller loads you wouldn't necessarily have to create the maximum number if fewer ended up getting the job done -- i.e. your own pool size could be dynamic with a cap.
When using the thread pool, I also find it more difficult to monitor the execution of threads from the pool as they perform the work. So, unless it's fire and forget, I am in favor of more controlled execution. I know you mentioned that your app exits after the 65K items are all completed. How are you monitoring your threads to determine if they have completed their work -- i.e. that all queued workers are done? Are you monitoring the status of all items in the HashSets? I think by queuing your items up and having your own worker threads consume off that queue, you can gain more control. Albeit, this can come at the cost of more overhead in terms of signaling between threads to indicate when all items have been queued, allowing them to exit.
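For what it's worth, a sketch of the semaphore idea mentioned above; the cap of 10 concurrent validations is an arbitrary assumption, and Validate(id) is the existing time-consuming routine:

private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10, 10);

public void ValidateItemThrottled(int id)
{
    Throttle.Wait(); // blocks the producer once 10 validations are in flight
    ThreadPool.QueueUserWorkItem(_ =>
    {
        try
        {
            Validate(id);
        }
        finally
        {
            Throttle.Release();
        }
    });
}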
You could also try using the CCR - Concurrency and Coordination Runtime. It's buried inside Microsoft Robotics Studio, but provides an excellent API for doing this sort of thing.
You'd just need to create a "Port" (essentially a queue), hook up a receiver (method that gets called when something is posted to it), and then post work items to it. The CCR handles the queue and the worker thread to run it on.
Here's a video on Channel9 about the CCR.
It's very high-performance and is even being used for non-Robotics stuff (Myspace.com uses it behind the scenes for their content-delivery network).
I would recommend looking into MSDN: Task Parallel Library - DataFlow. You can find examples of implementing Producer-Consumer there; in your case, the database producing items to validate is the producer, and the validation routine becomes the consumer.
Also recommend using ConcurrentDictionary<TKey, TValue> as a "Concurrent" hash set where you just populate the keys with no values :). You can potentially make your code lock-free.
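A minimal sketch combining both suggestions - an ActionBlock from TPL Dataflow (the System.Threading.Tasks.Dataflow package) as the consumer, and a ConcurrentDictionary<int, byte> standing in for a concurrent hash set; Validate and LoadIdsFromDatabase are placeholders, and the degree of parallelism is an assumption:

var seen = new ConcurrentDictionary<int, byte>();

var validator = new ActionBlock<int>(id =>
    {
        if (seen.TryAdd(id, 0)) // atomically "claim" the id - no lock needed
            Validate(id);
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

foreach (var id in LoadIdsFromDatabase()) // the database side is the producer
    validator.Post(id);

validator.Complete();        // signal that no more items will arrive
validator.Completion.Wait(); // block until all queued items are processed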