What is the exact difference between Task.ContinueWith and ActionBlock.LinkTo? - c#

I am new to TPL Dataflow (ActionBlock, TransformBlock, etc.). I used to use Task.ContinueWith() to build a pipeline when I needed one, and I recently started working with TPL Dataflow and its blocks.
But I am a bit confused about the exact difference between the two. Could you please advise me on when to use which?

These are two separate methods that have similar behavior but really don't relate to one another. ContinueWith schedules a continuation for a Task. With async/await you should not really need to use ContinueWith since the async/await keywords already schedule the remainder of your method as continuation. For example the two methods AsyncAwait and Continuation produce the same result.
public async Task AsyncAwait()
{
    await DoAsync();
    DoSomethingElse();
}
public async Task Continuation()
{
    await DoAsync().ContinueWith(_ => DoSomethingElse());
}
public Task DoAsync() => Task.Delay(TimeSpan.FromSeconds(1));
public void DoSomethingElse()
{
    //More Work
}
LinkTo, on the other hand, creates a disposable link between two TPL Dataflow blocks. That link can be configured in a number of ways; see DataflowLinkOptions. One of the most common configuration items is PropagateCompletion. As you can hopefully see, a dataflow link can be much more than a simple continuation. You can pass completion, add a predicate to filter data, or even link blocks into a complex structure like a mesh or feedback loop. Dataflow links also allow you to set up "backpressure" to throttle a flow: if the downstream block becomes overloaded and its input buffer fills, the upstream blocks can pause processing. The complete behavior of a dataflow link is not easily implemented by hand with continuations.
public ITargetBlock<int> BuildPipeline()
{
    var block1 = new TransformBlock<int, int>(x => x);
    var block2 = new ActionBlock<int>(x => Console.WriteLine(x));
    block1.LinkTo(block2, new DataflowLinkOptions() { PropagateCompletion = true });
    return block1;
}
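To make the filtering and backpressure points concrete, here is a minimal sketch rather than a definitive implementation. The class and method names (LinkDemo, RunAsync) are hypothetical, and it assumes the System.Threading.Tasks.Dataflow library is available. A bounded source feeds a slow, bounded ActionBlock through a predicate link, so SendAsync waits whenever the buffers are full.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LinkDemo
{
    // Hypothetical helper: sends 0..count-1 through a bounded source and
    // returns the values that the filtered target actually processed.
    public static async Task<List<int>> RunAsync(int count)
    {
        var seen = new List<int>();

        // Bounded source: SendAsync waits asynchronously while the buffer is full.
        var source = new BufferBlock<int>(
            new DataflowBlockOptions { BoundedCapacity = 4 });

        // Slow consumer with a small bound, so pressure builds up quickly.
        var evens = new ActionBlock<int>(async x =>
        {
            await Task.Delay(10); // simulated work
            seen.Add(x);          // safe here: default MaxDegreeOfParallelism is 1
        }, new ExecutionDataflowBlockOptions { BoundedCapacity = 2 });

        // Filtered link that also propagates completion downstream.
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        source.LinkTo(evens, linkOptions, x => x % 2 == 0);

        // Discard odd values. The explicit predicate matters: without it, an even
        // value postponed by the busy 'evens' block would be offered to this
        // target next and silently dropped.
        source.LinkTo(DataflowBlock.NullTarget<int>(), x => x % 2 != 0);

        for (var i = 0; i < count; i++)
            await source.SendAsync(i);

        source.Complete();
        await evens.Completion;
        return seen;
    }

    public static async Task Main()
    {
        var seen = await RunAsync(10);
        Console.WriteLine(string.Join(", ", seen)); // 0, 2, 4, 6, 8
    }
}
```

Reproducing the same bounded, filtered hand-off with ContinueWith would mean writing the queueing and throttling by hand.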
Unless you're doing complex linking you should always prefer the use of async/await over raw continuations. async/await makes the code easier to write, understand and maintain. LinkTo only applies to dataflow blocks and should be viewed as something separate from continuations and used to construct dataflow networks.

Related

Parallelizing execution with Task.Run

I am trying to improve the performance of some code that performs a shopping function by calling a number of different vendors. The 3rd party vendor call is async, and the results are processed to generate a result. The structure of the code is as follows.
public async Task<List<ShopResult>> DoShopping(IEnumerable<Vendor> vendors)
{
    var res = vendors.Select(async s => await DoShopAndProcessResultAsync(s));
    await Task.WhenAll(res); ....
}
Since DoShopAndProcessResultAsync is both IO bound and CPU bound, and each vendor iteration is independent, I think Task.Run can be used to do something like the following.
public async Task<List<ShopResult>> DoShopping(IEnumerable<Vendor> vendors)
{
    var res = vendors.Select(s => Task.Run(() => DoShopAndProcessResultAsync(s)));
    await Task.WhenAll(res); ...
}
Using Task.Run this way gives a performance gain, and I can see from the order of execution of the calls that multiple threads are involved. It runs without any issue locally on my machine.
However, it is a tasks-of-tasks scenario, and I am wondering whether there are any pitfalls or whether it is deadlock prone in a high-traffic prod environment.
What are your opinions on the approach of using Task.Run to parallelize async calls?
Tasks are .NET's low-level building blocks. .NET almost always has a better high-level abstraction for specific concurrency paradigms.
To paraphrase Rob Pike (slides): concurrency is not parallelism is not asynchronous execution. What you ask for is concurrent execution with a specific degree of parallelism. .NET already offers high-level classes that can do that, without resorting to low-level task handling.
At the end, I explain why these distinctions matter and how they're implemented using different .NET classes or libraries.
Dataflow blocks
At the highest level, the Dataflow classes allow creating a pipeline of processing blocks similar to a PowerShell or Bash pipeline, where each block can use one or more tasks to process input. Dataflow blocks preserve message order, ensuring results are emitted in the order the input messages were received.
You'll often see combinations of blocks called meshes, not pipelines. Dataflow grew out of the Microsoft Robotics Framework and can be used to create a network of independent processing blocks. Most programmers just use it to build a pipeline of steps though.
In your case, you could use a TransformBlock to execute DoShopAndProcessResultAsync and feed the output either to another processing block or to a BufferBlock you can read after processing all results. You could even split Shop and Process into separate blocks, each with its own logic and degree of parallelism.
E.g.
var buffer = new BufferBlock<ShopResult>();
var blockOptions = new ExecutionDataflowBlockOptions {
    MaxDegreeOfParallelism = 3,
    // BoundedCapacity also counts the items being processed, so it must be
    // at least MaxDegreeOfParallelism for three vendors to run at once
    BoundedCapacity = 3
};
var shop = new TransformBlock<Vendor, ShopResult>(DoShopAndProcessResultAsync,
    blockOptions);
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
shop.LinkTo(buffer, linkOptions);
foreach (var v in vendors)
{
    // Waits asynchronously while the block is at capacity
    await shop.SendAsync(v);
}
shop.Complete();
await shop.Completion;
buffer.TryReceiveAll(out IList<ShopResult> results);
You can use two separate blocks to shop and process:
var shop = new TransformBlock<Vendor, ShopResponse>(DoShopAsync, shopOptions);
var process = new TransformBlock<ShopResponse, ShopResult>(DoProcessAsync, processOptions);
shop.LinkTo(process, linkOptions);
process.LinkTo(results, linkOptions);
foreach (var v in vendors)
{
    await shop.SendAsync(v);
}
shop.Complete();
await process.Completion;
In this case we await the completion of the last block in the chain before reading the results.
Instead of reading from a buffer block, we could use an ActionBlock at the end to do whatever we want with the results, e.g. store them in a database. The results can be batched using a BatchBlock to reduce the number of storage operations.
...
var batch=new BatchBlock<ShopResult>(100);
var store=new ActionBlock<ShopResult[]>(DoStoreAsync);
shop.LinkTo(process,linkOptions);
process.LinkTo(batch,linkOptions);
batch.LinkTo(store,linkOptions);
...
shop.Complete();
await store.Completion;
Why do names matter
Tasks are the lowest-level building blocks used to implement multiple paradigms. In other languages you'd see them described as Futures or Promises (e.g. JavaScript).
Parallelism in .NET means executing CPU-bound computations over a lot of data using all available cores. Parallel.ForEach will partition the input data into roughly as many partitions as there are cores and use one worker task per partition. PLINQ goes one step further, allowing the use of LINQ operators to specify the computation and letting PLINQ use algorithms optimized for parallel execution to map, filter, sort, group and collect results. That's why Parallel.ForEach can't be used for async work at all.
Concurrency means executing multiple independent and often IO-bound jobs. At the lowest level you can use Tasks, but Dataflow, Rx.NET, Channels, IAsyncEnumerable etc. allow the use of high-level patterns like CSP/pipelines, event stream processing etc.
Asynchronous execution means you don't have to block while waiting for I/O-bound work to complete.
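As an illustration of the "parallelism" definition above, here is a small sketch (the class and method names are made up for this example) that runs the same CPU-bound computation once with Parallel.ForEach and once with PLINQ. Both calls block the caller until every item is processed, which is fine for CPU-bound work and is the reason they don't fit async I/O.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CpuBoundDemo
{
    // Deliberately naive trial division, standing in for CPU-bound work.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (var d = 2; (long)d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Parallel.ForEach partitions the range across worker tasks; the shared
    // counter has to be updated atomically.
    public static int CountWithParallelForEach(int max)
    {
        var count = 0;
        Parallel.ForEach(Enumerable.Range(0, max + 1), n =>
        {
            if (IsPrime(n)) Interlocked.Increment(ref count);
        });
        return count;
    }

    // PLINQ expresses the same computation declaratively and handles the
    // aggregation itself.
    public static int CountWithPlinq(int max) =>
        Enumerable.Range(0, max + 1).AsParallel().Count(IsPrime);

    public static void Main()
    {
        Console.WriteLine(CountWithParallelForEach(10_000)); // 1229
        Console.WriteLine(CountWithPlinq(10_000));           // 1229
    }
}
```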
What is alarming about the Task.Run approach in your question is that it depletes the ThreadPool of available worker threads in an uncontrolled manner. It doesn't offer any configuration option that would allow you to reduce the parallelism of each individual request in favor of preserving the scalability of the whole service. That's something that might bite you in the long run.
Ideally you would like to control both the parallelism and the concurrency, and control them independently. For example you might want to limit the maximum concurrency of the I/O-bound work to 10, and the maximum parallelism of the CPU-bound work to 2. Regarding the former you could take a look at this question: How to limit the amount of concurrent async I/O operations?
Regarding the latter, you could use a TaskScheduler with limited concurrency. The ConcurrentExclusiveSchedulerPair is a handy class for this purpose. Here is an example of how you could rewrite your DoShopping method in a way that limits the ThreadPool usage to two threads at maximum (per request), without limiting at all the concurrency of the I/O-bound work:
public async Task<ShopResult[]> DoShopping(IEnumerable<Vendor> vendors)
{
    var scheduler = new ConcurrentExclusiveSchedulerPair(
        TaskScheduler.Default, maxConcurrencyLevel: 2).ConcurrentScheduler;
    var tasks = vendors.Select(vendor =>
    {
        return Task.Factory.StartNew(() => DoShopAndProcessResultAsync(vendor),
            default, TaskCreationOptions.DenyChildAttach, scheduler).Unwrap();
    });
    return await Task.WhenAll(tasks);
}
Important: In order for this to work, the DoShopAndProcessResultAsync method should be implemented internally without .ConfigureAwait(false) at the await points. Otherwise the continuations after the await will not run on our preferred scheduler, and the goal of limiting the ThreadPool utilization will be defeated.
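For the other half of the advice above (capping the number of concurrent I/O-bound operations), one common approach is a SemaphoreSlim used as an async gate. The following is only a sketch under that assumption. Throttle and RunThrottledAsync are hypothetical names, and the operation passed in Main merely simulates an I/O call with Task.Delay.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Hypothetical helper: starts one task per item, but lets at most
    // maxConcurrency of them be inside 'operation' at any moment.
    public static async Task<TResult[]> RunThrottledAsync<T, TResult>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task<TResult>> operation)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();   // waits without blocking a thread
            try
            {
                return await operation(item);
            }
            finally
            {
                gate.Release();       // always free the slot, even on failure
            }
        }).ToArray();                 // materialize so every task is created once

        // Results come back in the same order as the input items.
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        var results = await RunThrottledAsync(Enumerable.Range(1, 5), 2, async n =>
        {
            await Task.Delay(20);     // simulated I/O
            return n * 2;
        });
        Console.WriteLine(string.Join(", ", results)); // 2, 4, 6, 8, 10
    }
}
```

Note that this limits only how many operations are in flight. It does not limit how many ThreadPool threads their continuations use, which is what the scheduler above is for.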
My personal preference though would be to use instead the new (.NET 6) Parallel.ForEachAsync API. Apart from making it easy to control the concurrency through the MaxDegreeOfParallelism option, it also comes with a better behavior in case of exceptions. Instead of launching invariably all the async operations, it stops launching new operations as soon as a previously launched operation has failed. This can make a big difference in the responsiveness of your service, in case for example that all individual async operations are failing with a timeout exception. You can find here a synopsis of the main differences between the Parallel.ForEachAsync and the Task.WhenAll APIs.
Unfortunately the Parallel.ForEachAsync has the disadvantage that it doesn't return the results of the async operations. Which means that you have to collect the results manually as a side-effect of each async operation. I've posted here a ForEachAsync variant that returns results, that combines the best aspects of the Parallel.ForEachAsync and the Task.WhenAll APIs. You could use it like this:
public async Task<ShopResult[]> DoShopping(IEnumerable<Vendor> vendors)
{
    var scheduler = new ConcurrentExclusiveSchedulerPair(
        TaskScheduler.Default, maxConcurrencyLevel: 2).ConcurrentScheduler;
    ParallelOptions options = new() { MaxDegreeOfParallelism = 10 };
    return await ForEachAsync(vendors, options, async (vendor, ct) =>
    {
        return await Task.Factory.StartNew(() => DoShopAndProcessResultAsync(vendor),
            ct, TaskCreationOptions.DenyChildAttach, scheduler).Unwrap();
    });
}
Note: In my initial answer (revision 1) I had suggested erroneously to pass the scheduler through the ParallelOptions.TaskScheduler property. I just found out that this doesn't work as I expected. The ParallelOptions class has an internal property EffectiveMaxConcurrencyLevel that represents the minimum of the MaxDegreeOfParallelism and the TaskScheduler.MaximumConcurrencyLevel. The implementation of the Parallel.ForEachAsync method uses this property, instead of reading directly the MaxDegreeOfParallelism. So the MaxDegreeOfParallelism, by being larger than the MaximumConcurrencyLevel, was effectively ignored.
You've probably also noticed by now that the names of these two settings are confusing. We use the MaximumConcurrencyLevel in order to control the number of threads (aka the parallelization), and we use the MaxDegreeOfParallelism in order to control the amount of concurrent async operations (aka the concurrency). The reason for this confusing terminology can be traced to the historic origins of these APIs. The ParallelOptions class was introduced before the async-await era, and the designers of the new Parallel.ForEachAsync API aimed at making it compatible with the older non-asynchronous members of the Parallel class.
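If you only have the built-in Parallel.ForEachAsync (.NET 6 or later), collecting the results "manually as a side-effect" can look roughly like the sketch below. It is not the ForEachAsync variant referred to above. ForEachAsyncDemo and CollectAsync are hypothetical names, and each iteration writes to its own slot of a pre-sized array so that no locking is needed and the input order is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ForEachAsyncDemo
{
    // Hypothetical helper built on the standard Parallel.ForEachAsync API.
    public static async Task<TResult[]> CollectAsync<T, TResult>(
        IReadOnlyList<T> items, int maxDegreeOfParallelism,
        Func<T, CancellationToken, Task<TResult>> operation)
    {
        var results = new TResult[items.Count];
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        };

        // Iterate over indices so each iteration owns exactly one array slot.
        await Parallel.ForEachAsync(Enumerable.Range(0, items.Count), options,
            async (i, ct) =>
            {
                results[i] = await operation(items[i], ct);
            });

        return results;
    }

    public static async Task Main()
    {
        var doubled = await CollectAsync(new[] { 1, 2, 3, 4, 5 }, 2, async (n, ct) =>
        {
            await Task.Delay(10, ct); // simulated async work
            return n * 2;
        });
        Console.WriteLine(string.Join(", ", doubled)); // 2, 4, 6, 8, 10
    }
}
```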

Parallel.ForEach with async lambda waiting for all iterations to complete

Recently I have seen several SO threads related to Parallel.ForEach mixed with async lambdas, but all the proposed answers were some kind of workaround.
Is there any way I could write:
List<int> list = new List<int>();
Parallel.ForEach(arrayValues, async (item) =>
{
    var x = await LongRunningIoOperationAsync(item);
    list.Add(x);
});
How can I ensure that list will contain all items from all iterations executed within the lambdas in each iteration?
How does Parallel.ForEach generally work with async lambdas? If it hits an await, will it hand over its thread to the next iteration?
I assume the ParallelLoopResult IsCompleted field is not the proper one, as it will return true when all iterations are executed, no matter whether their actual lambda jobs are finished or not?
Recently I have seen several SO threads related to Parallel.ForEach mixed with async lambdas, but all the proposed answers were some kind of workaround.
Well, that's because Parallel doesn't work with async. And from a different perspective, why would you want to mix them in the first place? They do opposite things. Parallel is all about adding threads and async is all about giving up threads. If you want to do asynchronous work concurrently, then use Task.WhenAll. That's the correct tool for the job; Parallel is not.
That said, it sounds like you want to use the wrong tool, so here's how you do it...
How can I ensure that list will contain all items from all iterations executed within the lambdas in each iteration?
You'll need to have some kind of a signal that some code can block on until the processing is done, e.g., CountdownEvent or Monitor. On a side note, you'll need to protect access to the non-thread-safe List<T> as well.
How does Parallel.ForEach generally work with async lambdas? If it hits an await, will it hand over its thread to the next iteration?
Since Parallel doesn't understand async lambdas, when the first await yields (returns) to its caller, Parallel will assume that iteration of the loop is complete.
I assume ParallelLoopResult IsCompleted field is not proper one, as it will return true when all iterations are executed, no matter if their actual lambda jobs are finished or not?
Correct. As far as Parallel knows, it can only "see" the method to the first await that returns to its caller. So it doesn't know when the async lambda is complete. It also will assume iterations are complete too early, which throws partitioning off.
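The effect described above can be observed with a small experiment. This is a sketch with hypothetical names (AsyncVoidPitfall, RunAsync). The lambda is compiled as async void, so Parallel.ForEach returns as soon as each iteration reaches its first await. The first number printed depends on timing, so no output is claimed for it.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncVoidPitfall
{
    // Returns how many lambdas had finished right after the loop returned,
    // and how many had finished after waiting long enough for all of them.
    public static async Task<(int Early, int Late)> RunAsync()
    {
        var completed = 0;

        // The async lambda converts to Action<int>, i.e. async void:
        // Parallel.ForEach has no way to observe its completion.
        Parallel.ForEach(Enumerable.Range(0, 10), async _ =>
        {
            await Task.Delay(500);
            Interlocked.Increment(ref completed);
        });

        var early = Volatile.Read(ref completed); // loop is "done" already

        await Task.Delay(2000);                   // crude wait, for the demo only
        var late = Volatile.Read(ref completed);

        return (early, late);
    }

    public static async Task Main()
    {
        var (early, late) = await RunAsync();
        Console.WriteLine($"Finished when the loop returned: {early} of 10");
        Console.WriteLine($"Finished two seconds later: {late} of 10");
    }
}
```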
You don't need Parallel.For/ForEach here you just need to await a list of tasks.
Background
In short, you need to be very careful with async lambdas, and with whether you are passing them to an Action or a Func<Task>.
Your problem arises because Parallel.For / ForEach is not suited to the async and await pattern or to IO-bound tasks. They are suited to CPU-bound workloads, which means they essentially take Action parameters and let the task scheduler create the tasks for you.
If you want to run multiple async tasks at the same time, use Task.WhenAll or a TPL Dataflow block (or something similar), which can deal effectively with both CPU-bound and IO-bound workloads. Said more directly, they can deal with tasks, which is what an async method is.
Unless you need to do more inside your lambda (which you haven't shown), just use a Select and WhenAll:
var tasks = items.Select(LongRunningIoOperationAsync);
var results = await Task.WhenAll(tasks); // here is your list of int
If you do need more, you can still use await:
var tasks = items.Select(async (item) =>
{
    var x = await LongRunningIoOperationAsync(item);
    // do other stuff
    return x;
});
var results = await Task.WhenAll(tasks);
Note: If you need the extended functionality of Parallel.ForEach (namely the options to control max concurrency), there are several approaches; however, Rx or Dataflow might be the most succinct.
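A Dataflow version of the bounded-concurrency case from the note above might look like the following sketch. BoundedConcurrency and RunAsync are hypothetical names, it assumes the System.Threading.Tasks.Dataflow library is available, and the operation in Main only simulates I/O.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedConcurrency
{
    // Hypothetical helper: runs 'operation' for every item with at most
    // maxConcurrency invocations in flight, returning results in input order.
    public static async Task<List<TResult>> RunAsync<T, TResult>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task<TResult>> operation)
    {
        var worker = new TransformBlock<T, TResult>(operation,
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxConcurrency
            });

        // Collect outputs one at a time; TransformBlock emits them in input
        // order by default, so a plain List is enough here.
        var results = new List<TResult>();
        var collector = new ActionBlock<TResult>(r => results.Add(r));
        worker.LinkTo(collector, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var item in items)
            worker.Post(item);   // the block is unbounded, so Post always accepts

        worker.Complete();
        await collector.Completion;
        return results;
    }

    public static async Task Main()
    {
        var results = await RunAsync(Enumerable.Range(1, 5), 2, async n =>
        {
            await Task.Delay(20); // simulated I/O
            return n * n;
        });
        Console.WriteLine(string.Join(", ", results)); // 1, 4, 9, 16, 25
    }
}
```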

C# making concurrent executing asynchronous

I'm currently trying to improve my understanding of Multithreading and the TPL in particular.
A lot of the constructs make complete sense and I can see how they improve scalability / execution speed.
I know that for asynchronous calls that don't tie up a thread (like I/O bound calls), Task.WhenAll would be the perfect fit.
One thing I am wondering about, though, is the best practice for making CPU-bound work that I want to run in parallel asynchronous.
To make code run in parallel the obvious choice would be the Parallel class.
As an example, say I have an array of data I want to perform some number crunching on:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
Parallel.ForEach(arr, (s) =>
{
    SomeReallyLongRunningMethod(s);
});
This would run in parallel (if the analyser decides that parallel is faster than synchronous), but it would also block the thread.
Now the first thing that came to my mind was simply wrapping it all in Task.Run() ala:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
await Task.Run(() => Parallel.ForEach(arr, (s) =>
{
    SomeReallyLongRunningMethod(s);
}));
Another option would be to either have a separate Task-returning method or inline it and use Task.WhenAll like so:
static async Task SomeReallyLongRunningMethodAsync(string s)
{
    await Task.Run(() =>
    {
        //work...
    });
}
// ...
await Task.WhenAll(arr.Select(s => SomeReallyLongRunningMethodAsync(s)));
The way I understand it is that option 1 creates a whole Task that will, for the life of it, tie up a thread to just sit there and wait until the Parallel.ForEach finishes.
Option 2 uses Task.WhenAll (for which I don't know whether it ties up a thread or not) to await all Tasks, but the Tasks had to be created manually. Some of my resources (especially MS ExamRef 70-483) have explicitly advised against manually creating Tasks for CPU-bound work, as the Parallel class is supposed to be used for it.
Now I'm left wondering about the best performing version / best practice for the problem of wanting parallel execution that can be awaited.
I hope some more experienced programmer can shed some light on this for me!
You really should use Microsoft's Reactive Framework for this. It's the perfect solution. You can do this:
string[] arr = { "SomeData", "SomeMoreData", "SomeOtherData" };
var query =
    from s in arr.ToObservable()
    from r in Observable.Start(() => SomeReallyLongRunningMethod(s))
    select new { s, r };
IDisposable subscription =
    query
        .Subscribe(x =>
        {
            /* Do something with each `x.s` and `x.r` */
            /* Values arrive as soon as they are computed */
        }, () =>
        {
            /* All Done Now */
        });
This assumes that the signature of SomeReallyLongRunningMethod is int SomeReallyLongRunningMethod(string input), but it is easy to cope with something else.
It all runs in parallel on multiple threads.
If you need to marshal back to the UI thread you can do that with an .ObserveOn just prior to the .Subscribe call.
If you want to stop the computation early you can call subscription.Dispose().
Option 1 is the way to go, as the thread-pool thread used for the task will also be used by the parallel for loop. A similar question is answered here.

Asynchronously invoking a parallel for loop

I would like to do something like this:
public async Task MyMethod()
{
    // Do some preparation
    await Parallel.ForEachAsync(0, count, i => { /* Do some work */ });
    // do some finalization
}
However, I did not find an elegant way of doing so. I thought of two ways, but they are sub-optimal:
The only thing I thought about is manually partitioning the range, creating tasks, and then using Task.WhenAll.
Using the following code Task.Factory.StartNew(() => Parallel.For(...));.
The problem is that it "wastes" a thread on the asynchronous task.
Using TPL Dataflow's ActionBlock, and posting the integers one by one. The drawback is that it does not partition the range in a smart way like Parallel.For does, and works on each iteration one by one.
Manually using a Partitioner with Partitioner.Create, but it is less elegant. I want the framework to do intelligent partitioning for me.
You have a regular synchronous parallel loop that you'd like to invoke asynchronously (presumably to move it off the UI thread).
You can do this the same way you'd move any other CPU-bound work off the UI thread: using Task.Run:
public async Task MyMethod()
{
    // Do some preparation
    await Task.Run(() => Parallel.For(0, count, i => { /* Do some work */ }));
    // do some finalization
}
There is no thread "wasted" because Parallel.For (like Parallel.ForEach) will use the calling thread as one of its worker threads.
(This is recipe 7.4 "Async Wrappers for Parallel Code" in my book).
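If you also want explicit control over the chunking mentioned in the question, the same Task.Run wrapper can be combined with Partitioner.Create, which splits an integer range into sub-ranges. The following is a sketch with hypothetical names (RangeSum, SumAsync), where summing the indices stands in for real work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangeSum
{
    // Hypothetical example: sums 0..count-1 in parallel without blocking the
    // caller. Task.Run moves the blocking Parallel.ForEach call onto the pool.
    // Note: Partitioner.Create(0, count) requires count > 0.
    public static Task<long> SumAsync(int count) => Task.Run(() =>
    {
        long total = 0;

        // Each iteration receives a whole sub-range (a Tuple<int, int> with an
        // inclusive start and exclusive end), so per-item overhead stays low.
        Parallel.ForEach(Partitioner.Create(0, count), range =>
        {
            long local = 0;
            for (var i = range.Item1; i < range.Item2; i++)
                local += i;                    // stand-in for real work

            Interlocked.Add(ref total, local); // one atomic update per sub-range
        });

        return total;
    });

    public static async Task Main()
    {
        Console.WriteLine(await SumAsync(1_000_000)); // 499999500000
    }
}
```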

ConfigureAwait for IObservable<T>

I'm experiencing a deadlock when I use blocking code with Task.Wait(), waiting an async method that inside awaits an Rx LINQ query.
This is an example:
public void BlockingCode()
{
    this.ExecuteAsync().Wait();
}
public async Task ExecuteAsync()
{
    await this.service.GetFooAsync().ConfigureAwait(false);
    // This is the Rx query that doesn't support ConfigureAwait
    await this.service.Receiver
        .FirstOrDefaultAsync(x => x == "foo")
        .Timeout(TimeSpan.FromSeconds(1));
}
So, my question is if there is any equivalent for ConfigureAwait on awaitable IObservable to ensure that the continuation is not resumed on the same SynchronizationContext.
You have to comprehend what "awaiting an Observable" means. Check out this. So, semantically, your code
await this.service.Receiver
    .FirstOrDefaultAsync(x => x == "foo")
    .Timeout(TimeSpan.FromSeconds(1));
is equivalent to
await this.service.Receiver
    .FirstOrDefaultAsync(x => x == "foo")
    .Timeout(TimeSpan.FromSeconds(1))
    .LastAsync()
    .ToTask();
(Note that there is some redundancy here in calling both FirstOrDefaultAsync and LastAsync, but that's no problem.)
So there you got your task (it can take an additional CancellationToken if available). You may now use ConfigureAwait.
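Put together, the conversion described above might look like this sketch. It assumes the System.Reactive NuGet package; RxAwaitDemo and FirstFooAsync are hypothetical names, and a fixed array stands in for the question's service.Receiver.

```csharp
using System;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading.Tasks;

public static class RxAwaitDemo
{
    // Converts the query to a Task explicitly, so ConfigureAwait(false) can be
    // applied and the continuation is not posted back to the captured context.
    public static async Task<string> FirstFooAsync(IObservable<string> receiver)
    {
        return await receiver
            .FirstOrDefaultAsync(x => x == "foo")
            .Timeout(TimeSpan.FromSeconds(1))
            .ToTask()
            .ConfigureAwait(false);
    }

    public static async Task Main()
    {
        // Stand-in for service.Receiver from the question.
        var receiver = new[] { "bar", "foo", "baz" }.ToObservable();
        Console.WriteLine(await FirstFooAsync(receiver)); // foo
    }
}
```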
ConfigureAwait is not directly related to awaiters themselves, rather it is a functionality of the TPL to configure how Task should complete. It's problematic because this TPL method doesn't return a new Task, so you can't compose it with a conversion to an observable.
Rx itself is basically free-threaded. You can control the threads used during subscription and events with far finer control than Tasks - see here for more on this: ObserveOn and SubscribeOn - where the work is being done
It's hard to fix your code because you don't provide a small but complete working example. However, the built-in functions in Rx won't ever try to marshal to a particular thread unless you specifically tell them to with one of the above operators.
If you combine an operator like Observable.FromAsync to convert the Task to an observable, you can use Observable.SubscribeOn(Scheduler.Default) to start the Task off the current SynchronizationContext.
Here's a gist to play with (designed for LINQPad, runs with nuget package rx-main): https://gist.github.com/james-world/82c3cc39babab7870f6d
