I am trying to better understand the Async and the Parallel options I have in C#. In the snippets below, I have included the 5 approaches I come across most. But I am not sure which to choose - or better yet, what criteria to consider when choosing:
Method 1: Task
(see http://msdn.microsoft.com/en-us/library/dd321439.aspx)
Calling StartNew is functionally equivalent to creating a Task using one of its constructors and then calling Start to schedule it for execution. However, unless creation and scheduling must be separated, StartNew is the recommended approach for both simplicity and performance.
TaskFactory's StartNew method should be the preferred mechanism for creating and scheduling computational tasks, but for scenarios where creation and scheduling must be separated, the constructors may be used, and the task's Start method may then be used to schedule the task for execution at a later time.
// using System.Threading.Tasks.Task.Factory
void Do_1()
{
var _List = GetList();
_List.ForEach(i => Task.Factory.StartNew(_ => { DoSomething(i); }));
}
Method 2: QueueUserWorkItem
(see http://msdn.microsoft.com/en-us/library/system.threading.threadpool.getmaxthreads.aspx)
You can queue as many thread pool requests as system memory allows. If there are more requests than thread pool threads, the additional requests remain queued until thread pool threads become available.
You can place data required by the queued method in the instance fields of the class in which the method is defined, or you can use the QueueUserWorkItem(WaitCallback, Object) overload that accepts an object containing the necessary data.
// using System.Threading.ThreadPool
void Do_2()
{
var _List = GetList();
var _Action = new WaitCallback((o) => { DoSomething(o); });
_List.ForEach(x => ThreadPool.QueueUserWorkItem(_Action));
}
Method 3: Parallel.Foreach
(see: http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx)
The Parallel class provides library-based data parallel replacements for common operations such as for loops, for each loops, and execution of a set of statements.
The body delegate is invoked once for each element in the source enumerable. It is provided with the current element as a parameter.
// using System.Threading.Tasks.Parallel
void Do_3()
{
var _List = GetList();
var _Action = new Action<object>((o) => { DoSomething(o); });
Parallel.ForEach(_List, _Action);
}
Method 4: IAsync.BeginInvoke
(see: http://msdn.microsoft.com/en-us/library/cc190824.aspx)
BeginInvoke is asynchronous; therefore, control returns immediately to the calling object after it is called.
// using IAsync.BeginInvoke()
void Do_4()
{
var _List = GetList();
var _Action = new Action<object>((o) => { DoSomething(o); });
_List.ForEach(x => _Action.BeginInvoke(x, null, null));
}
Method 5: BackgroundWorker
(see: http://msdn.microsoft.com/en-us/library/system.componentmodel.backgroundworker.aspx)
To set up for a background operation, add an event handler for the DoWork event. Call your time-consuming operation in this event handler. To start the operation, call RunWorkerAsync. To receive notifications of progress updates, handle the ProgressChanged event. To receive a notification when the operation is completed, handle the RunWorkerCompleted event.
// using System.ComponentModel.BackgroundWorker
void Do_5()
{
var _List = GetList();
using (BackgroundWorker _Worker = new BackgroundWorker())
{
_Worker.DoWork += (s, arg) =>
{
arg.Result = arg.Argument;
DoSomething(arg.Argument);
};
_Worker.RunWorkerCompleted += (s, arg) =>
{
_List.Remove(arg.Result);
if (_List.Any())
_Worker.RunWorkerAsync(_List[0]);
};
if (_List.Any())
_Worker.RunWorkerAsync(_List[0]);
}
}
I suppose the obvious critieria would be:
Is any better than the other for performance?
Is any better than the other for error handling?
Is any better than the other for monitoring/feedback?
But, how do you choose?
Thanks in advance for your insights.
Going to take these in an arbitrary order:
BackgroundWorker (#5)
I like to use BackgroundWorker when I'm doing things with a UI. The advantage that it has is having the progress and completion events fire on the UI thread which means you don't get nasty exceptions when you try to change UI elements. It also has a nice built-in way of reporting progress. One disadvantage that this mode has is that if you have blocking calls (like web requests) in your work, you'll have a thread sitting around doing nothing while the work is happening. This is probably not a problem if you only think you'll have a handful of them though.
IAsyncResult/Begin/End (APM, #4)
This is a widespread and powerful but difficult model to use. Error handling is troublesome since you need to re-catch exceptions on the End call, and uncaught exceptions won't necessarily make it back to any relevant pieces of code that can handle it. This has the danger of permanently hanging requests in ASP.NET or just having errors mysteriously disappear in other applications. You also have to be vigilant about the CompletedSynchronously property. If you don't track and report this properly, the program can hang and leak resources. The flip side of this is that if you're running inside the context of another APM, you have to make sure that any async methods you call also report this value. That means doing another APM call or using a Task and casting it to an IAsyncResult to get at its CompletedSynchronously property.
There's also a lot of overhead in the signatures: You have to support an arbitrary object to pass through, make your own IAsyncResult implementation if you're writing an async method that supports polling and wait handles (even if you're only using the callback). By the way, you should only be using callback here. When you use the wait handle or poll IsCompleted, you're wasting a thread while the operation is pending.
Event-based Asynchronous Pattern (EAP)
One that was not on your list but I'll mention for the sake of completeness. It's a little bit friendlier than the APM. There are events instead of callbacks and there's less junk hanging onto the method signatures. Error handling is a little easier since it's saved and available in the callback rather than re-thrown. CompletedSynchronously is also not part of the API.
Tasks (#1)
Tasks are another friendly async API. Error handling is straightforward: the exception is always there for inspection on the callback and nobody cares about CompletedSynchronously. You can do dependencies and it's a great way to handle execution of multiple async tasks. You can even wrap APM or EAP (one type you missed) async methods in them. Another good thing about using tasks is your code doesn't care how the operation is implemented. It may block on a thread or be totally asynchronous but the consuming code doesn't care about this. You can also mix APM and EAP operations easily with Tasks.
Parallel.For methods (#3)
These are additional helpers on top of Tasks. They can do some of the work to create tasks for you and make your code more readable, if your async tasks are suited to run in a loop.
ThreadPool.QueueUserWorkItem (#2)
This is a low-level utility that's actually used by ASP.NET for all requests. It doesn't have any built-in error handling like tasks so you have to catch everything and pipe it back up to your app if you want to know about it. It's suitable for CPU-intensive work but you don't want to put any blocking calls on it, such as a synchronous web request. That's because as long as it runs, it's using up a thread.
async / await Keywords
New in .NET 4.5, these keywords let you write async code without explicit callbacks. You can await on a Task and any code below it will wait for that async operation to complete, without consuming a thread.
Your first, third and forth examples use the ThreadPool implicitly because by default Tasks are scheduled on the ThreadPool and the TPL extensions use the ThreadPool as well, the API simply hides some of the complexity see here and here. BackgroundWorkers are part of the ComponentModel namespace because they are meant for use in UI scenarios.
Reactive extensions is another upcoming library for handling asynchronous programming, especially when it comes to composition of asynchronous events and methods.
It's not native, however it's developed by Ms labs. It's available both for .NET 3.5 and .NET 4.0 and is essentially a collection of extension methods on the .NET 4.0 introduced IObservable<T> interface.
There are a lot of examples and tutorials on their main site, and I strongly recommend checking some of them out. The pattern might seem a bit odd at first (at least for .NET programmers), but well worth it, even if it's just grasping the new concept.
The real strength of reactive extensions (Rx.NET) is when you need to compose multiple asynchronous sources and events. All operators are designed with this in mind and handles the ugly parts of asynchrony for you.
Main site: http://msdn.microsoft.com/en-us/data/gg577609
Beginner's guide: http://msdn.microsoft.com/en-us/data/gg577611
Examples: http://rxwiki.wikidot.com/101samples
That said, the best async pattern probably depends on what situation you're in. Some are better (simpler) for simpler stuff and some are more extensible and easier to handle when it comes to more complex scenarios. I cannot speak for all the ones you're mentioning though.
The last one is the best for 2,3 at least. It has built-in methods/properties for this.
Other variants are almost the same, just different versions/convinient wrappers
Related
We are using this code snippet from StackOverflow to produce a Task that completes as soon as the first of a collection of tasks completes successfully. Due to the non-linear nature of its execution, async/await is not really viable, and so this code uses ContinueWith() instead. It doesn't specify a TaskScheduler, though, which a number of sources have mentioned can be dangerous because it uses TaskScheduler.Current when most developers usually expect TaskScheduler.Default behavior from continuations.
The prevailing wisdom appears to be that you should always pass an explicit TaskScheduler into ContinueWith. However, I haven't seen a clear explanation of when different TaskSchedulers would be most appropriate.
What is a specific example of a case where it would be best to pass TaskScheduler.Current into ContinueWith(), as opposed to TaskScheduler.Default? Are there rules of thumb to follow when making this decision?
For context, here's the code snippet I'm referring to:
public static Task<T> FirstSuccessfulTask<T>(IEnumerable<Task<T>> tasks)
{
var taskList = tasks.ToList();
var tcs = new TaskCompletionSource<T>();
int remainingTasks = taskList.Count;
foreach(var task in taskList)
{
task.ContinueWith(t =>
if(task.Status == TaskStatus.RanToCompletion)
tcs.TrySetResult(t.Result));
else
if(Interlocked.Decrement(ref remainingTasks) == 0)
tcs.SetException(new AggregateException(
tasks.SelectMany(t => t.Exception.InnerExceptions));
}
return tcs.Task;
}
Probably you need to choose a task scheduler that is appropriate for actions that an executing delegate instance performs.
Consider following examples:
Task ContinueWithUnknownAction(Task task, Action<Task> actionOfTheUnknownNature)
{
// We know nothing about what the action do, so we decide to respect environment
// in which current function is called
return task.ContinueWith(actionOfTheUnknownNature, TaskScheduler.Current);
}
int count;
Task ContinueWithKnownAction(Task task)
{
// We fully control a continuation action and we know that it can be safely
// executed by thread pool thread.
return task.ContinueWith(t => Interlocked.Increment(ref count), TaskScheduler.Default);
}
Func<int> cpuHeavyCalculation = () => 0;
Action<Task> printCalculationResultToUI = task => { };
void OnUserAction()
{
// Assert that SynchronizationContext.Current is not null.
// We know that continuation will modify an UI, and it can be safely executed
// only on an UI thread.
Task.Run(cpuHeavyCalculation)
.ContinueWith(printCalculationResultToUI, TaskScheduler.FromCurrentSynchronizationContext());
}
Your FirstSuccessfulTask() probably is the example where you can use TaskScheduler.Default, because the continuation delegate instance can be safely executed on a thread pool.
You can also use custom task scheduler to implement custom scheduling logic in your library. For example see Scheduler page on Orleans framework website.
For more information check:
It's All About the SynchronizationContext article by Stephen Cleary
TaskScheduler, threads and deadlocks article by Cosmin Lazar
StartNew is Dangerous article by Stephen Cleary
I'll have to rant a bit, this is getting way too many programmers into trouble. Every programming aid that was designed to make threading look easy creates five new problems that programmers have no chance to debug.
BackgroundWorker was the first one, a modest and sensible attempt to hide the complications. But nobody realizes that the worker runs on the threadpool so should never occupy itself with I/O. Everybody gets that wrong, not many ever notice. And forgetting to check e.Error in the RunWorkerCompleted event, hiding exceptions in threaded code is a universal problem with the wrappers.
The async/await pattern is the latest, it makes it really look easy. But it composes extraordinarily poorly, async turtles all the way down until you get to Main(). They had to fix that eventually in C# version 7.2 because everybody got stuck on it. But not fixing the drastic ConfigureAwait() problem in a library. It is completely biased towards library authors knowing what they are doing, notable is that a lot of them work for Microsoft and tinker with WinRT.
The Task class bridged the gap between the two, its design goal was to make it very composable. Good plan, they could not predict how programmers were going to use it. But also a liability, inspiring programmers to ContinueWith() up a storm to glue tasks together. Even when it doesn't make sense to do so because those tasks merely run sequentially. Notable is that they even added an optimization to ensure that the continuation runs on the same thread to avoid the context switch overhead. Good plan, but creating the undebuggable problem that this web site is named for.
So yes, the advice you saw was a good one. Task is useful to deal with asynchronicity. A common problem that you have to deal with when services move into the "cloud" and latency gets to be a detail you can no longer ignore. If you ContinueWith() that kind code then you invariably care about the specific thread that executes the continuation. Provided by TaskScheduler, low odds that it isn't the one provided by FromCurrentSynchronizationContext(). Which is how async/await happened.
If current task is a child task, then using TaskScheduler.Current will mean the scheduler will be that which the task it is in, is scheduled to; and if not inside another task, TaskScheduler.Current will be TaskScheduler.Default and thus use the ThreadPool.
If you use TaskScheduler.Default, then it will always go to the ThreadPool.
The only reason you would use TaskScheduler.Current:
To avoid the default scheduler issue, you should always pass an
explicit TaskScheduler to Task.ContinueWith and Task.Factory.StartNew.
From Stephen Cleary's post ContinueWith is Dangerous, Too.
There's further explanation here from Stephen Toub on his MSDN blog.
I most certainly don't think I am capable of providing bullet proof answer but I will give my five cents.
What is a specific example of a case where it would be best to pass TaskScheduler.Current into ContinueWith(), as opposed to TaskScheduler.Default?
Imagine you are working on some web api that webserver naturally makes multithreaded. So you need to compromise your parallelism because you don't want to use all the resources of your webserver, but at the same time you want to speed up your processing time, so you decide to make custom task scheduler with lowered concurrency level because why not.
Now your api needs to query some database and order the results, but these results are millions so you decide to do it via Merge Sort(divide and conquer), then you need all your child tasks of this algorithm to be complient with your custom task scheduler (TaskScheduler.Current) because otherwise you will end up taking all the resources for the algorithm and your webserver thread pool will starve.
When to use TaskScheduler.Current, TaskScheduler.Default, TaskScheduler.FromCurrentSynchronizationContext(), or some other TaskScheduler
TaskScheduler.FromCurrentSynchronizationContext() - Specific for WPF,
Forms applications UI thread context, you use this basically when you
want to get back to the UI thread after being offloaded some work to
non-UI thread
example taken from here
private void button_Click(…)
{
… // #1 on the UI thread
Task.Factory.StartNew(() =>
{
… // #2 long-running work, so offloaded to non-UI thread
}).ContinueWith(t =>
{
… // #3 back on the UI thread
}, TaskScheduler.FromCurrentSynchronizationContext());
}
TaskScheduler.Default - Almost all the time when you don't have any specific requirements, edge cases to collate with.
TaskScheduler.Current - I think I've given one generic example above, but in general it should be used when you have either custom scheduler or you explicitly passed TaskScheduler.FromCurrentSynchronizationContext() to TaskFactory or Task.StartNew method and later you use continuation tasks or inner tasks (so pretty damn rare imo).
I am using Disruptor-net in a C# application. I'm having some trouble understanding how to do async operations in the disruptor pattern.
Assuming I have a few event handlers, and the last one in the chain hands a message off to my business logic processors, how do I handle async operations inside of my business logic processor? When my business logic needs to do some database insert, does it hand a message off to my output disruptor, which does the insert, then publishes a new message on my input disruptor with all the state to continue the transaction?
In addition, within my output disruptor, would I use Tasks? I'm 99.9% sure I'd want to use tasks so I don't have a ton of event handlers blocking on async operations. How does that fit in with the disruptor pattern then? Seems kind of weird to just do something like this in my EventHandler..
void OnEvent(MyEvent evt, long sequence, bool endOfBatch)
{
db.InsertAsync(evt).ContinueWith(task => inputDisruptor.Publish(task));
}
The Disruptor has the following features:
Dedicated threads, which can be pinned / shielded / prioritized for better performance.
Explicit queues, which can be monitored and generate backpressure.
In-order message processing.
No heap allocations, which can help reduce GC pauses, or even remove them if your own code does not generate heap allocations.
Your code sample does not really follow the Disruptor philosophy:
Task.ContinueWith runs asynchronously by default, so the continuation will use thread-pool threads.
Because you are using the thread-pool, you have no guarantee on the continuation execution order. Even if you use TaskContinuationOptions.ExecuteSynchronously, you have no guarantee that InsertAsync will invoke the continuations in-order.
You are creating an implicit queue with all the pending insert operations. This queue is hidden and does not generate backpressure.
I will put aside the fact that your code is generating heap allocations. You will not benefit from the "no GC pauses" effect but it is probably very acceptable for your use-case.
Also, please note that batching is crucial to support high-throughput for IO operations. You should really use the Disruptor batches in your event handler.
I will simplify the problem to 3 event handlers:
PreInsertEventHandler: pre-insert logic (not shown here)
InsertEventHandler: insert logic
PostInsertEventHandler: post-insert logic
Of course, I am assuming that the post-insert logic must be run only after insert completion.
If your goal is to save the events in InsertEventHandler and to block until completion before processing the event in the next handler, you should probably just wait in InsertEventHandler.
InsertEventHandler:
void OnEvent(MyEvent evt, long sequence, bool endOfBatch)
{
_pendingInserts.Add((evt, task: db.InsertAsync(evt)));
if (endOfBatch)
{
var insertSucceeded = Task.WaitAll(_pendingInserts.Select(x => x.task).ToArray(), _insertTimeout);
foreach (var (pendingEvent, _) in _pendingInserts)
{
pendingEvent.InsertSucceeded = insertSucceeded;
}
_pendingInserts.Clear();
}
}
Of course, if your DB API exposes a bulk-insert method, it might be better to add the events in a list and to save them all at the end of the batch.
There are many other options, like waiting in PostInsertEventHandler, or queueing the insert results in another Disruptor, each coming with its own pros and cons. A SO answer might not be the best place to discuss and analyze all of them.
I'm struggling to get my head around why the following test does not work:
[Fact]
public void repro()
{
var scheduler = new TestScheduler();
var count = 0;
// this observable is a simplification of the system under test
// I've just included it directly in the test for clarity
// in reality it is NOT accessible from the test code - it is
// an implementation detail of the system under test
// but by passing in a TestScheduler to the sut, the test code
// can theoretically control the execution of the pipeline
// but per this question, that doesn't work when using FromAsync
Observable
.Return(1)
.Select(i => Observable.FromAsync(Whatever))
.Concat()
.ObserveOn(scheduler)
.Subscribe(_ => Interlocked.Increment(ref count));
Assert.Equal(0, count);
// this call initiates the observable pipeline, but does not
// wait until the entire pipeline has been executed before
// returning control to the caller
// the question is: why? Rx knows I'm instigating an async task
// as part of the pipeline (that's the point of the FromAsync
// method), so why can't it still treat the pipeline atomically
// when I call Start() on the scheduler?
scheduler.Start();
// count is still zero at this point
Assert.Equal(1, count);
}
private async Task<Unit> Whatever()
{
await Task.Delay(100);
return Unit.Default;
}
What I'm trying to do is run some asynchronous code (represented above by Whatever()) whenever an observable ticks. Importantly, I want those calls to be queued. More importantly, I want to be able to control the execution of the pipeline by using the TestScheduler.
It seems like the call to scheduler.Start() is instigating the execution of Whatever() but it isn't waiting until it completes. If I change Whatever() so that it is synchronous:
private async Task<Unit> Whatever()
{
//await Task.Delay(100);
return Unit.Default;
}
then the test passes, but of course that defeats the purpose of what I'm trying to achieve. I could imagine there being a StartAsync() method on the TestScheduler that I could await, but that does not exist.
Can anyone tell me whether there's a way for me to instigate the execution of the reactive pipeline and wait for its completion even when it contains asynchronous calls?
Let me boil down your question to its essentials:
Is there a way, using the TestScheduler, to execute a reactive pipeline and wait for its completion even when it contains asynchronous calls?
I should warn you up front, there is no quick and easy answer here, no convenient "trick" that can be deployed.
Asynchronous Calls and Schedulers
To answer this question I think we need to clarify some points. The term "asynchronous call" in the question above seems to be used specifically to refer to methods with a Task or Task<T> signature - i.e. methods that use the Task Parallel Library (TPL) to run asynchronously.
This is important to note because Reactive Extensions (Rx) takes a different approach to handling asynchronous operations.
In Rx the introduction of concurrency is managed via a scheduler, a type implementing the IScheduler interface. Any operation that introduces concurrency should make a available a scheduler parameter so that the caller can decide an appropriate scheduler. The core library slavishly adheres to this principle. So, for example, Delay allows specification of a scheduler but Where does not.
As you can see from the source, IScheduler provides a number of Schedule overloads. Operations requiring concurrency use these to schedule execution of work. Exactly how that work is executed is deferred completely to the scheduler. This is the power of the scheduler abstraction.
Rx operations introducing concurrency generally provide overloads that allow the scheduler to be omitted, and in that case select a sensible default. This is important to note, because if you want your code to be testable via the use of TestScheduler you must use a TestScheduler for all operations that introduce concurrency. A rogue method that doesn't allow this, could scupper your testing efforts.
TPL Scheduling Abstraction
The TPL has it's own abstraction to handle concurrency: The TaskScheduler. The idea is very similar. You can read about it here..
There are two very important differences between the two abstractions:
Rx schedulers have a first class representation of their own notion of time - the Now property. TPL schedulers do not.
The use of custom schedulers in the TPL is much less prevalent, and there is no equivalent best practice of providing overloads for providing specific TaskSchedulers to a method introducing concurrency (returning a Task or Task<T>). The vast majority of Task returning methods assume use of the default TaskScheduler and give you no choice about where work is run.
Motivation for TestScheduler
The motivation to use a TestScheduler is generally two-fold:
To remove the need to "wait" for operations by speeding up time.
To check that events occurred at expected points in time.
The way this works depends entirely on the fact that schedulers have their own notion of time. Every time an operation is scheduled via an IScheduler, we specify when it must execute - either as soon as possible, or at a specific time in the future. The scheduler then queues work for execution and will execute it when the specified time (according to the scheduler itself) is reached.
When you call Start on the TestScheduler, it works by emptying the queue of all operations with execution times at or before its current notion of Now - and then advancing its clock to the next scheduled work time and repeating until its queue is empty.
This allows neat tricks like being able to test that an operation will never result in an event! If using real time this would be a challenging task, but with virtual time it's easy - once the scheduler queue is completely empty, then the TestScheduler concludes that no further events will ever happen - since if nothing is left on its queue, there is nothing there to schedule further tasks. In fact Start returns at this precisely this point. For this to work, clearly all concurrent operations to be measured must be scheduled on the TestScheduler.
A custom operator that carelessly makes its own choice of scheduler without allowing that choice to be overriden, or an operation that uses its own form of concurrency without a notion of time (such as TPL based calls) will make it difficult, if not impossible, to control execution via a TestScheduler.
If you have an asynchronous operation run by other means, judicious use of the AdvanceTo and AdvanceBy methods of the TestScheduler can allow you to coordinate with that foreign source of concurrency - but the extent to which this is achievable depends on the control afforded by that foreign source.
In the case of the TPL, you do know when a task is done - which does allow the use of waits and timeouts in tests, as ugly as these can be. Through the use of TaskCompleteSources(TCS) you can mock tasks and use AdvanceTo to hit specific points and complete TCSs, but there is no one simple approach here. Often you just have to resort to ugly waits and timeouts because you don't have sufficient control over foreign concurrency.
Rx is generally free-threaded and tries to avoid introducing concurrency wherever possible. Conversely, it's quite possible that different operations within an Rx call chain will need different types of scheduler abstraction. It's not always possible to simulate a call chain with a single test scheduler. Certainly, I have had cause to use multiple TestSchedulers to simulate some complex scenarios - e.g. chains that use the DispatcherScheduler and TaskScheduler sometimes need complex coordination that means you can't simply serialize their operations on to one TestScheduler.
Some projects I have worked on have mandated the use of Rx for all concurrency specifically to avoid these problems. That is not always feasible, and even in these cases, some use of TPL is generally inevitable.
One particular pain point
One particular pain point of Rx that leaves many testers scratching their heads, is the fact that the TPL -> Rx family of conversions introduce concurrency. e.g. ToObservable, SelectMany's overload accepting Task<T> etc. don't provide overloads with a scheduler and insidiously force you off the TestScheduler thread, even if mocking with TCS. For all the pain this causes in testing alone, I consider this a bug. You can read all about this here - dig through and you'll find Dave Sexton's proposed fix, which provides an overload for specifying a scheduler, and is under consideration for inclusion. You may want to look into that pull request.
A Potential Workaround
If you can edit your code to use it, the following helper method might be of use. It converts a task to an observable that will run on the TestScheduler and complete at the correct virtual time.
It schedules work on the TestScheduler that is responsible for collecting the task result - at the virtual time we state the task should complete. The work itself blocks until the task result is available - allowing the TPL task to run for however long it takes, or until a real amount of specified time has passed in which case a TimeoutException is thrown.
The effect of blocking the work means that the TestScheduler won't advance its virtual time past the expected virtual completion time of the task until the task has actually completed. This way, the rest of the Rx chain can run in full-speed virtual time and we only wait on the TPL task, pausing the rest of the chain at the task completion virtual time whilst this happens.
Crucially, other concurrent Rx operations scheduled to run in between the start virtual time of the Task based operation and the stated end virtual time of the Task are not blocked and their virtual completion time will be unaffected.
So set duration to the length of virtual time you want the task to appear to have taken. The result will then be collected at whatever the virtual time is when the task is started, plus the duration specified.
Set timeout to the actual time you will allow the task to take. If it takes longer, a timeout exception is thrown:
public static IObservable<T> ToTestScheduledObseravble<T>(
this Task<T> task,
TestScheduler scheduler,
TimeSpan duration,
TimeSpan? timeout = null)
{
timeout = timeout ?? TimeSpan.FromSeconds(100);
var subject = Subject.Synchronize(new AsyncSubject<T>(), scheduler);
scheduler.Schedule<Task<T>>(task, duration,
(s, t) => {
if (!task.Wait(timeout.Value))
{
subject.OnError(
new TimeoutException(
"Task duration too long"));
}
else
{
switch (task.Status)
{
case TaskStatus.RanToCompletion:
subject.OnNext(task.Result);
subject.OnCompleted();
break;
case TaskStatus.Faulted:
subject.OnError(task.Exception.InnerException);
break;
case TaskStatus.Canceled:
subject.OnError(new TaskCanceledException(task));
break;
}
}
return Disposable.Empty;
});
return subject.AsObservable();
}
Usage in your code would be like this, and your assert will pass:
Observable
.Return(1)
.Select(i => Whatever().ToTestScheduledObseravble(
scheduler, TimeSpan.FromSeconds(1)))
.Concat()
.Subscribe(_ => Interlocked.Increment(ref count));
Conclusion
In summary, you haven't missed any convenient trick. You need to think about how Rx works, and how the TPL works and decide whether:
You avoid mixing TPL and Rx
You mock your interface between TPL and Rx (using TCS or similar), so you test each independently
You live with ugly waits and timeouts and abandon the TestScheduler altogether
You mix ugly waits and timeouts with TestScheduler to bring some modicum of control over your tests.
Noseratio's more elegant Rx way of writing this test. You can await observables to get their last value. Combine with Count() and it becomes trivial.
Note that the TestScheduler isn't serving any purpose in this example.
[Fact]
public async Task repro()
{
var scheduler = new TestScheduler();
var countObs = Observable
.Return(1)
.Select(i => Observable.FromAsync(Whatever))
.Concat()
//.ObserveOn(scheduler) // serves no purpose in this test
.Count();
Assert.Equal(0, count);
//scheduler.Start(); // serves no purpose in this test.
var count = await countObs;
Assert.Equal(1, count);
}
As James mentions above, you cant mix concurrency models like you are. You remove the concurrency from Rx by using the TestScheduler, but never actually introduce concurrency via Rx. You do however introduce concurrency with the TPL (i.e. Task.Delay(100). Here will will actually run asynchronously on a task pool thread. So your synchronous tests will complete before the task has completed.
You could change to something like this
[Fact]
public void repro()
{
var scheduler = new TestScheduler();
var count = 0;
// this observable is a simplification of the system under test
// I've just included it directly in the test for clarity
// in reality it is NOT accessible from the test code - it is
// an implementation detail of the system under test
// but by passing in a TestScheduler to the sut, the test code
// can theoretically control the execution of the pipeline
// but per this question, that doesn't work when using FromAsync
Observable
.Return(1)
.Select(_ => Observable.FromAsync(()=>Whatever(scheduler)))
.Concat()
.ObserveOn(scheduler)
.Subscribe(_ => Interlocked.Increment(ref count));
Assert.Equal(0, count);
// this call initiates the observable pipeline, but does not
// wait until the entire pipeline has been executed before
// returning control to the caller
// the question is: why? Rx knows I'm instigating an async task
// as part of the pipeline (that's the point of the FromAsync
// method), so why can't it still treat the pipeline atomically
// when I call Start() on the scheduler?
scheduler.Start();
// count is still zero at this point
Assert.Equal(1, count);
}
private async Task<Unit> Whatever(IScheduler scheduler)
{
return await Observable.Timer(TimeSpan.FromMilliseconds(100), scheduler).Select(_=>Unit.Default).ToTask();
}
Alternatively, you need to put the Whatever method behind an interface that you can mock out for testing. In which case you would just have your Stub/Mock/Double return the code from above i.e. return await Observable.Timer(TimeSpan.FromMilliseconds(100), scheduler).Select(_=>Unit.Default).ToTask();
I'm just beginning to learn C# threading and concurrent collections, and am not sure of the proper terminology to pose my question, so I'll describe briefly what I'm trying to do. My grasp of the subject is rudimentary at best at this point. Is my approach below even feasible as I've envisioned it?
I have 100,000 urls in a Concurrent collection that must be tested--is the link still good? I have another concurrent collection, initially empty, that will contain the subset of urls that an async request determines to have been moved (400, 404, etc errors).
I want to spawn as many of these async requests concurrently as my PC and our bandwidth will allow, and was going to start at 20 async-web-request-tasks per second and work my way up from there.
Would it work if a single async task handled both things: it would make the async request and then add the url to the BadUrls collection if it encountered a 4xx error? A new instance of that task would be spawned every 50ms:
class TestArgs args {
ConcurrentBag<UrlInfo> myCollection { get; set; }
System.Uri currentUrl { get; set; }
}
ConcurrentQueue<UrlInfo> Urls = new ConncurrentQueue<UrlInfo>();
// populate the Urls queue
<snip>
// initialize the bad urls collection
ConcurrentBag<UrlInfo> BadUrls = new ConcurrentBag<UrlInfo>();
// timer fires every 50ms, whereupon a new args object is created
// and the timer callback spawns a new task; an autoEvent would
// reset the timer and dispose of it when the queue was empty
void SpawnNewUrlTask(){
// if queue is empty then reset the timer
// otherwise:
TestArgs args = {
myCollection = BadUrls,
currentUrl = getNextUrl() // take an item from the queue
};
Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
}
public async Task asyncWebRequestAndConcurrentCollectionUpdater(TestArgs args)
{
//make the async web request
// add the url to the bad collection if appropriate.
}
Feasible? Way off?
The approach seems fine, but there are some issues with the specific code you've shown.
But before I get to that, there have been suggestions in the comments that Task Parallelism is the way to go. I think that's misguided. There's a common misconception that if you want to have lots of work going on in parallel, you necessarily need lots of threads. That's only true if the work is compute-bound. But the work you're doing will be IO bound - this code is going to spend the vast majority of its time waiting for responses. It will do very little computation. So in practice, even if it only used a single thread, your initial target of 20 requests per second doesn't seem like a workload that would cause a single CPU core to break into a sweat.
In short, a single thread can handle very high levels of concurrent IO. You only need multiple threads if you need parallel execution of code, and that doesn't look likely to be the case here, because there's so little work for the CPU in this particular job.
(This misconception predates await and async by years. In fact, it predates the TPL - see http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless for a .NET 1.1 era illustration of how you can handle thousands of concurrent requests with a tiny number of threads. The underlying principles still apply today because Windows networking IO still basically works the same way.)
Not that there's anything particularly wrong with using multiple threads here, I'm just pointing out that it's a bit of a distraction.
Anyway, back to your code. This line is problematic:
Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
While you've not given us all your code, I can't see how that will be able to compile. The overloads of StartNew that accept two arguments require the first to be either an Action, an Action<object>, a Func<TResult>, or a Func<object,TResult>. In other words, it has to be a method that either takes no arguments, or accepts a single argument of type object (and which may or may not return a value). Your 'asyncWebRequestAndConcurrentCollectionUpdater' takes an argument of type TestArgs.
But the fact that it doesn't compile isn't the main problem. That's easily fixed. (E.g., change it to Task.Factory.StartNew(() => asyncWebRequestAndConcurrentCollectionUpdater(args));) The real issue is what you're doing is a bit weird: you're using Task.StartNew to invoke a method that already returns a Task.
Task.StartNew is a handy way to take a synchronous method (i.e., one that doesn't return a Task) and run it in a non-blocking way. (It'll run on the thread pool.) But if you've got a method that already returns a Task, then you didn't really need to use Task.StartNew. The weirdness becomes more apparent if we look at what Task.StartNew returns (once you've fixed the compilation error):
Task<Task> t = Task.Factory.StartNew(
() => asyncWebRequestAndConcurrentCollectionUpdater(args));
That Task<Task> reveals what's happening. You've decided to wrap a method that was already asynchronous with a mechanism that is normally used to make non-asynchronous methods asynchronous. And so you've now got a Task that produces a Task.
One of the slightly surprising upshots of this is that if you were to wait for the task returned by StartNew to complete, the underlying work would not necessarily be done:
t.Wait(); // doesn't wait for asyncWebRequestAndConcurrentCollectionUpdater to finish!
All that will actually do is wait for asyncWebRequestAndConcurrentCollectionUpdater to return a Task. And since asyncWebRequestAndConcurrentCollectionUpdater is already an async method, it will return a task more or less immediately. (Specifically, it'll return a task the moment it performs an await that does not complete immediately.)
If you want to wait for the work you've kicked off to finish, you'll need to do this:
t.Result.Wait();
or, potentially more efficiently, this:
t.Unwrap().Wait();
That says: get me the Task that my async method returned, and then wait for that. This may not be usefully different from this much simpler code:
Task t = asyncWebRequestAndConcurrentCollectionUpdater("foo");
... maybe queue up some other tasks ...
t.Wait();
You may not have gained anything useful by introducing `Task.Factory.StartNew'.
I say "may" because there's an important qualification: it depends on the context in which you start the work. C# generates code which, by default, attempts to ensure that when an async method continues after an await, it does so in the same context in which the await was initially performed. E.g., if you're in a WPF app and you await while on the UI thread, when the code continues it will arrange to do so on the UI thread. (You can disable this with ConfigureAwait.)
So if you're in a situation in which the context is essentially serialized (either because it's single-threaded, as will be the case in a GUI app, or because it uses something resembling a rental model, e.g. the context of an particular ASP.NET request), it may actually be useful to kick an async task off via Task.Factory.StartNew because it enables you to escape the original context. However, you just made your life harder - tracking your tasks to completion is somewhat more complex. And you might have been able to achieve the same effect simply by using ConfigureAwait inside your async method.
And it may not matter anyway - if you're only attempting to manage 20 requests a second, the minimal amount of CPU effort required to do that means that you can probably manage it entirely adequately on one thread. (Also, if this is a console app, the default context will come into play, which uses the thread pool, so your tasks will be able to run multithreaded in any case.)
But to get back to your question, it seems entirely reasonable to me to have a single async method that picks a url off the queue, makes the request, examines the response, and if necessary, adds an entry to the bad url collection. And kicking the things off from a timer also seems reasonable - that will throttle the rate at which connections are attempted without getting bogged down with slow responses (e.g., if a load of requests end up attempting to talk to servers that are offline). It might be necessary to introduce a cap for the maximum number of requests in flight if you hit some pathological case where you end up with tens of thousands of URLs in a row all pointing to a server that isn't responding. (On a related note, you'll need to make sure that you're not going to hit any per-client connection limits with whichever HTTP API you're using - that might end up throttling the effective throughput.)
You will need to add some sort of completion handling - just kicking off asynchronous operations and not doing anything to handle the results is bad practice, because you can end up with exceptions that have nowhere to go. (In .NET 4.0, these used to terminate your process, but as of .NET 4.5, by default an unhandled exception from an asynchronous operation will simply be ignored!) And if you end up deciding that it is worth launching via Task.Factory.StartNew remember that you've ended up with an extra layer of wrapping, so you'll need to do something like myTask.Unwrap().ContinueWith(...) to handle it correctly.
Of course you can. Concurrent collections are called 'concurrent' because they can be used... concurrently by multiple threads, with some warranties about their behaviour.
A ConcurrentQueue will ensure that each element inserted in it is extracted exactly once (concurrent threads will never extract the same item by mistake, and once the queue is empty, all the items have been extracted by a thread).
EDIT: the only thing that could go wrong is that 50ms is not enough to complete the request, and so more and more tasks cumulate in the task queue. If that happens, your memory could get filled, but the thing would work anyway. So yes, it is feasible.
Anyway, I would like to underline the fact that a task is not a thread. Even if you create 100 tasks, the framework will decide how many of them will be actually executed concurrently.
If you want to have more control on the level of parallelism, you should use asynchronous requests.
In your comments, you wrote "async web request", but I can't understand if you wrote async just because it's on a different thread or because you intend to use the async API.
If you were using the async API, I'd expect to see some handler attached to the completion event, but I couldn't see it, so I assumed you're using synchronous requests issued from an asynchronous task.
If you're using asynchronous requests, then it's pointless to use tasks, just use the timer to issue the async requests, since they are already asynchronous.
When I say "asynchronous request" I'm referring to methods like WebRequest.GetResponseAsync and WebRequest.BeginGetResponse.
EDIT2: if you want to use asynchronous requests, then you can just make requests from the timer handler. The BeginGetResponse method takes two arguments. The first one is a callback procedure, that will be called to report the status of the request. You can pass the same procedure for all the requests. The second one is an user-provided object, which will store status about the request, you can use this argument to differentiate among different requests. You can even do it without the timer. Something like:
private readonly int desiredConcurrency = 20;
struct RequestData
{
public UrlInfo url;
public HttpWebRequest request;
}
/// Handles the completion of an asynchronous request
/// When a request has been completed,
/// tries to issue a new request to another url.
private void AsyncRequestHandler(IAsyncResult ar)
{
if (ar.IsCompleted)
{
RequestData data = (RequestData)ar.AsyncState;
HttpWebResponse resp = data.request.EndGetResponse(ar);
if (resp.StatusCode != 200)
{
BadUrls.Add(data.url);
}
//A request has been completed, try to start a new one
TryIssueRequest();
}
}
/// If urls is not empty, dequeues a url from it
/// and issues a new request to the extracted url.
private bool TryIssueRequest()
{
RequestData rd;
if (urls.TryDequeue(out rd.url))
{
rd.request = CreateRequestTo(rd.url); //TODO implement
rd.request.BeginGetResponse(AsyncRequestHandler, rd);
return true;
}
else
{
return false;
}
}
//Called by a button handler, or something like that
void StartTheRequests()
{
for (int requestCount = 0; requestCount < desiredConcurrency; ++requestCount)
{
if (!TryIssueRequest()) break;
}
}
I have a library that is a complicated arbiter of many network connections. Each method of it's primary object takes a delegate, which is called when the network responds to a given request.
I want to translate my library to use the new .NET 4.5 "async/await" pattern; this would require me to return a "Task" object, which would signal to the user that the asynchronous part of the call is complete. Creating this object requires a function for the task to represent - As far as my understanding, it is essentially a lightweight thread.
This doesn't really fit the design of my library - I would like the task to behave more like an event, and directly signal to the user that their request has completed, rather then representing a function. Is this possible? Should i avoid abusing the "async/await" pattern in this way?
I don't know if I'm wording this very well, I hope you understand my meaning. Thank you for any help.
As far as my understanding, it is essentially a lightweight thread.
No, that's not really true. I can be true, under certain circumstances, but that's just one usage of Task. You can start a thread by passing it a delegate, and having it execute it (usually asynchronously, possibly synchronously, and by default using the thread pool).
Another way of using threads is through the use of a TaskCompletionSource. When you do that the task is (potentially) not creating any threads, using the thread pool, or anything along those lines. One common usage of this model is converting an event-based API to a Task-based API:
Let's assume, just because it's a common example, that we want to have a Task that will be completed when a From that we have is closed. There is already a FormClosed event that fires when that event occurs:
public static Task WhenClosed(this Form form)
{
var tcs = new TaskCompletionSource<object>();
form.FormClosing += (_, args) =>
{
tcs.SetResult(null);
};
return tcs.Task;
}
We create a TaskCompletionSource, add a handler to the event in question, in that handler we signal the completion of the task, and the TaskCompletionSource provides us with a Task to return to the caller. This Task will not have resulted in any new threads being created, it won't use the thread pool, or anything like that.
You can have a Task/Event based model using this construct that appears quite asynchronous, but only using a single thread to do all work (the UI thread).
In general, anytime you want to have a Task that represents something other than the execution of a function you'll want to consider using a TaskCompletionSource. It's usually the appropriate conceptual way to approach the problem, other than possibly using one of the existing TPL methods, such as WhenAll, WhenAny, etc.
Should I avoid abusing the "async/await" pattern in this way?
No, because it's not abuse. It's a perfectly appropriate use of Task constructs, as well as async/await. Consider, for example, the code that you can write using the helper method that I have above:
private async void button1_Click(object sender, EventArgs e)
{
Form2 popup = new Form2();
this.Hide();
popup.Show();
await popup.WhenClosed();
this.Show();
}
This code now works just like it reads; create a new form, hide myself, show the popup, wait until the popup is closed, and then show myself again. However, since it's not a blocking wait the UI thread isn't blocked. We also don't need to bother with events; adding handlers, dealing with multiple contexts; moving our logic around, or any of it. (All of that happens, it's just hidden from us.)
I want to translate my library to use the new .NET 4.5 "async/await" pattern; this would require me to return a "Task" object, which would signal to the user that the asynchronous part of the call is complete.
Well, not really - you can return anything which implements the awaitable pattern - but a Task is the simplest way of doing this.
This doesn't really fit the design of my library - I would like the task to behave more like an event, and directly signal to the user that their request has completed, rather then representing a function.
You can call Task.ContinueWith to act as a "handler" to execute when the task completes. Indeed, that's what TaskAwaiter does under the hood.
Your question isn't terribly clear to be honest, but if you really want to create a Task which you can then force to completion whenever you like, I suspect you just want TaskCompletionSource<TResult> - you call the SetResult, SetCanceled or SetException methods to indicate the appropriate kind of completion. (Or you can call the TrySet... versions.) Use the Task property to return a task to whatever needs it.