Multiple async/await calls in a foreach loop iteration - C#

I am trying to wrap my head around how to handle multiple async/await calls in a foreach loop. I have around 20,000 rows of data that are processed by the foreach loop. Roughly my code is:
foreach (var item in data)
{
    if (ConditionA(item))
    {
        if (ConditionAB(item))
        {
            await CreateThingViaAPICall(item);
        }
        else
        {
            var result = await GetExistingRecord(item);
            var result2 = await GetOtherExistingRecord(result);
            var result3 = await GetOtherExistingRecord(result2);
            // Do processing
            ...
            await CreateThingViaAPICall();
        }
    }
    // ... and so on
}
I've seen many posts saying the best way to use async in a loop is to build a list of tasks and then use Task.WhenAll. In my case I have Tasks that depend on each other as part of each iteration. How do I build up a list of tasks to execute in this case?

It's easiest if you break the processing of an individual item into a separate (async) method:
private async Task ProcessItemAsync(Item item)
{
    if (ConditionA(item))
    {
        if (ConditionAB(item))
        {
            await CreateThingViaAPICall(item);
        }
        else
        {
            var result = await GetExistingRecord(item);
            var result2 = await GetOtherExistingRecord(result);
            var result3 = await GetOtherExistingRecord(result2);
            // Do processing
            ...
            await CreateThingViaAPICall();
        }
    }
    // ... and so on
}
Then process your collection like so:
var tasks = data.Select(ProcessItemAsync);
await Task.WhenAll(tasks);
This effectively wraps the multiple dependent Tasks required to process a single item into one Task, allowing those steps to happen sequentially while items of the collection itself are processed concurrently.
With tens of thousands of items, you may, for a variety of reasons, find that you need to throttle the number of Tasks running concurrently. Have a look at TPL Dataflow for this type of scenario. See here for an example.
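As an illustration of that throttling suggestion, here is a minimal sketch using TPL Dataflow's ActionBlock. This assumes the System.Threading.Tasks.Dataflow NuGet package is referenced, and the ProcessItemAsync body is only a stand-in for the real per-item work:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class ThrottleDemo
{
    // Hypothetical per-item work standing in for the real ProcessItemAsync.
    static async Task ProcessItemAsync(int item)
    {
        await Task.Delay(10); // simulate an API call
    }

    static async Task Main()
    {
        // Process at most 8 items concurrently instead of all of them at once.
        var block = new ActionBlock<int>(
            ProcessItemAsync,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        for (int i = 0; i < 100; i++)
            block.Post(i);

        block.Complete();       // signal that no more items will be posted
        await block.Completion; // wait for all posted items to finish
        Console.WriteLine("done");
    }
}
```

Post feeds items into the block's bounded worker pool; calling Complete and then awaiting Completion is the standard shutdown handshake.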

If I'm not mistaken, the recommended way to use async/await in a foreach is to build a list of Tasks first, then call Task.WhenAll.
You're partly mistaken.
If you have multiple tasks that don't depend on each other, then it is indeed generally a very good idea to pass them all to WhenAll so that they can be scheduled together, giving better throughput.
If, however, each task depends on the result of the previous one, then this approach isn't viable. Instead you should just await them within a foreach.
Indeed, this will work fine in any case; it's just suboptimal to have tasks wait on each other if they don't have to.
The ability to await tasks in a foreach is in fact one of the biggest gains that async/await has given us. Most code that uses await can be re-written to use ContinueWith quite easily, if less elegantly, but loops were trickier and if the actual end of the loop was only found by examining the results of the tasks themselves, trickier again.
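To make the distinction concrete, here is a small sketch (StepAsync is a stand-in for real async work) showing both shapes: independent calls fanned out with Task.WhenAll, and dependent calls awaited one by one in a loop:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class DependencyDemo
{
    // Stand-in async operation; the real work would be an API or DB call.
    static async Task<int> StepAsync(int input)
    {
        await Task.Delay(1);
        return input + 1;
    }

    static async Task Main()
    {
        // Independent: all five calls can run concurrently.
        int[] independent = await Task.WhenAll(
            Enumerable.Range(0, 5).Select(StepAsync));
        Console.WriteLine(independent.Sum()); // 1+2+3+4+5 = 15

        // Dependent: each call needs the previous result, so await in sequence.
        int value = 0;
        for (int i = 0; i < 5; i++)
            value = await StepAsync(value);
        Console.WriteLine(value); // 0 -> 1 -> 2 -> 3 -> 4 -> 5
    }
}
```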

Related

Parallel.ForEach with async lambda waiting forall iterations to complete

Recently I have seen several SO threads about Parallel.ForEach mixed with async lambdas, but all the proposed answers were some kind of workaround.
Is there any way how could I write:
List<int> list = new List<int>();
Parallel.ForEach(arrayValues, async (item) =>
{
    var x = await LongRunningIoOperationAsync(item);
    list.Add(x);
});
How can I ensure that list will contain all items from all iterations executed within the lambdas in each iteration?
How will Parallel.ForEach generally work with async lambdas? If it hits an await, will it hand over its thread to the next iteration?
I assume the ParallelLoopResult.IsCompleted field is not the proper one, as it will return true when all iterations have executed, regardless of whether their actual lambda jobs are finished?
Recently I have seen several SO threads about Parallel.ForEach mixed with async lambdas, but all the proposed answers were some kind of workaround.
Well, that's because Parallel doesn't work with async. And from a different perspective, why would you want to mix them in the first place? They do opposite things. Parallel is all about adding threads and async is all about giving up threads. If you want to do asynchronous work concurrently, then use Task.WhenAll. That's the correct tool for the job; Parallel is not.
That said, it sounds like you want to use the wrong tool, so here's how you do it...
How can I ensure that list will contain all items from all iterations executed within the lambdas in each iteration?
You'll need to have some kind of a signal that some code can block on until the processing is done, e.g., CountdownEvent or Monitor. On a side note, you'll need to protect access to the non-thread-safe List<T> as well.
How will Parallel.ForEach generally work with async lambdas? If it hits an await, will it hand over its thread to the next iteration?
Since Parallel doesn't understand async lambdas, when the first await yields (returns) to its caller, Parallel will assume that iteration of the loop is complete.
I assume the ParallelLoopResult.IsCompleted field is not the proper one, as it will return true when all iterations have executed, regardless of whether their actual lambda jobs are finished?
Correct. As far as Parallel knows, it can only "see" the method up to the first await that returns to its caller, so it doesn't know when the async lambda is complete. It also assumes iterations are complete too early, which throws its partitioning off.
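The early-completion behavior described above is easy to demonstrate. In this sketch (the delay is a stand-in for real async work), Parallel.ForEach returns before any of the async-void lambdas have finished:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncVoidLambdaDemo
{
    static int _completed;

    static void Main()
    {
        // The async lambda compiles to async void here, so Parallel.ForEach
        // considers each iteration done at the first await.
        Parallel.ForEach(new[] { 1, 2, 3, 4, 5 }, async item =>
        {
            await Task.Delay(500);
            Interlocked.Increment(ref _completed);
        });

        // ForEach has already returned, but none of the delays have elapsed,
        // so none of the lambdas have actually finished their work.
        Console.WriteLine(_completed); // almost certainly prints 0
    }
}
```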
You don't need Parallel.For/ForEach here; you just need to await a list of tasks.
Background
In short, you need to be very careful with async lambdas and aware of whether you are passing them as an Action or a Func<Task>.
Your problem arises because Parallel.For / ForEach is not suited to the async/await pattern or IO-bound tasks; it is suited to CPU-bound workloads. It essentially takes Action parameters and lets the task scheduler create the tasks for you.
If you want to run multiple async tasks at the same time, use Task.WhenAll or a TPL Dataflow block (or something similar), which can deal effectively with both CPU-bound and IO-bound workloads; said more directly, they can deal with tasks, which is what an async method is.
Unless you need to do more inside your lambda (which you haven't shown), just use a Select and WhenAll:
var tasks = items.Select(LongRunningIoOperationAsync);
var results = await Task.WhenAll(tasks); // here is your list of int
If you do need to do more, you can still use await inside the lambda:
var tasks = items.Select(async (item) =>
{
    var x = await LongRunningIoOperationAsync(item);
    // do other stuff
    return x;
});
var results = await Task.WhenAll(tasks);
var results = await Task.WhenAll(tasks);
Note: If you need the extended functionality of Parallel.ForEach (namely the options to control max concurrency), there are several approaches; however, Rx or Dataflow might be the most succinct.
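As one sketch of capping concurrency without extra packages, a SemaphoreSlim can gate how many operations are in flight at once (LongRunningIoOperationAsync here is a stand-in for the real IO-bound call):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledWhenAllDemo
{
    // Stand-in for the question's LongRunningIoOperationAsync.
    static async Task<int> LongRunningIoOperationAsync(int item)
    {
        await Task.Delay(10);
        return item * 2;
    }

    static async Task Main()
    {
        using var gate = new SemaphoreSlim(4); // at most 4 operations in flight

        var tasks = Enumerable.Range(1, 20).Select(async item =>
        {
            await gate.WaitAsync();
            try { return await LongRunningIoOperationAsync(item); }
            finally { gate.Release(); }
        });

        int[] results = await Task.WhenAll(tasks);
        Console.WriteLine(results.Sum()); // 2 * (1 + 2 + ... + 20) = 420
    }
}
```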

What's the difference between Task.WhenAll() and foreach(var task in tasks)

After a few hours of struggle I found a bug in my app. I considered the two functions below to have identical behavior, but it turned out they don't.
Can anyone tell me what's really going on under the hood, and why they behave in a different way?
public async Task MyFunction1(IEnumerable<Task> tasks)
{
    await Task.WhenAll(tasks);
    Console.WriteLine("all done"); // happens AFTER all tasks are finished
}

public async Task MyFunction2(IEnumerable<Task> tasks)
{
    foreach (var task in tasks)
    {
        await task;
    }
    Console.WriteLine("all done"); // happens BEFORE all tasks are finished
}
They'll function identically if all tasks complete successfully.
If you use WhenAll and any of the tasks fail, the returned task still won't complete until all of the tasks have finished, and it'll represent an AggregateException that wraps the errors from all of the failed tasks.
If you await each one individually, the method will fault as soon as it reaches a task that failed, and the exception will represent that one error, not any of the others.
The two also differ in that WhenAll will materialize the entire IEnumerable right at the start, before attaching continuations to the items. If the IEnumerable represents a collection of already existing and started tasks, this isn't relevant; but if the act of iterating the enumerable creates and/or starts tasks, then materializing the sequence at the start would run them all in parallel, while awaiting each before fetching the next would execute them sequentially. Below is an IEnumerable you could pass in that behaves as described here:
public static IEnumerable<Task> TaskGeneratorSequence()
{
    for (int i = 0; i < 10; i++)
        yield return Task.Delay(TimeSpan.FromSeconds(2));
}
Likely the most important functional difference is that Task.WhenAll can introduce concurrency when your tasks perform truly asynchronous operations, for example, IO. This may or may not be what you want depending on your situation.
For example, if your tasks are querying the database using the same EF DbContext, the next query would fire as soon as the first one is "in flight" which causes EF to blow up as it doesn't support multiple simultaneous queries using the same context.
That's because you're not awaiting each asynchronous operation individually. You're awaiting a task that represents the completion of all of those asynchronous operations. They can also be completed in any order.
However when you await each one individually in a foreach, you only fire the next task when the current one completes, preventing concurrency and ensuring serial execution.
A simple example demonstrating this behavior:
async Task Main()
{
    var tasks = new[] { 1, 2, 3, 4, 5 }.Select(i => OperationAsync(i));
    foreach (var t in tasks)
    {
        await t;
    }
    await Task.WhenAll(tasks);
}

static Random _rand = new Random();

public async Task OperationAsync(int number)
{
    // simulate an asynchronous operation
    // taking anywhere between 100 and 3000 milliseconds
    await Task.Delay(_rand.Next(100, 3000));
    Console.WriteLine(number);
}
You'll see that no matter how long OperationAsync takes, with foreach you always get 1, 2, 3, 4, 5 printed. But with Task.WhenAll they are executed concurrently and printed in their completion order.

WhenAll vs WaitAll in parallel

I'm trying to understand how WaitAll and WhenAll work, and I have the following problem. There are two possible ways to get a result from a method:
return Task.WhenAll(tasks).Result.SelectMany(r=> r);
return tasks.Select(t => t.Result).SelectMany(r => r).ToArray();
If I understand correctly, the second case is like calling WaitAll on tasks and fetching the results after that.
It looks like the second case has much better performance. I know that the proper usage of WhenAll is with the await keyword, but still, I'm wondering why there is such a big difference in performance between these lines.
After analyzing the flow of the system I think I've figured out how to model the problem in a simple test application (the test code is based on I3arnon's answer):
public static void Test()
{
    var tasks = Enumerable.Range(1, 1000).Select(n => Task.Run(() => Compute(n)));
    var baseTasks = new Task[100];

    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        baseTasks[i] = Task.Run(() =>
        {
            tasks.Select(t => t.Result).SelectMany(r => r).ToList();
        });
    }
    Task.WaitAll(baseTasks);
    Console.WriteLine("Select - {0}", stopwatch.Elapsed);

    baseTasks = new Task[100];
    stopwatch.Restart();
    for (int i = 0; i < 100; i++)
    {
        baseTasks[i] = Task.Run(() =>
        {
            Task.WhenAll(tasks).Result.SelectMany(result => result).ToList();
        });
    }
    Task.WaitAll(baseTasks);
    Console.WriteLine("Task.WhenAll - {0}", stopwatch.Elapsed);
}
It looks like the problem is in starting tasks from other tasks (or in a Parallel loop). In that case, WhenAll results in much worse performance. Why is that?
You are starting tasks inside a Parallel.ForEach loop, which you should avoid. The whole point of Parallel.ForEach is to parallelize many small but intensive computations across the available CPU cores, and starting a task is not an intensive computation. Rather, it creates a task object and stores it on a queue if the task pool is saturated, which it quickly will be with 1000 tasks being started. So now Parallel.ForEach competes with the task pool for compute resources.
In the first loop, which is quite slow, it seems that the scheduling is suboptimal and very little CPU is used, probably because of the Task.WhenAll inside the Parallel.ForEach. If you change the Parallel.ForEach to a normal for loop you will see a speedup.
But if your code really is as simple as a Compute function without any state carried forward between iterations, you can get rid of the tasks and simply use Parallel.For to maximize performance:
Parallel.For(0, 100, (i, s) =>
{
    Enumerable.Range(1, 1000).Select(n => Compute(n)).SelectMany(r => r).ToList();
});
As to why Task.WhenAll performs much worse, you should realize that this code
tasks.Select(t => t.Result).SelectMany(r => r).ToList();
will not run the tasks in parallel. ToList basically wraps the iteration in a foreach loop, and the body of the loop creates a task and then waits for it to complete, because you retrieve the Task.Result property. So each iteration of the loop creates a task and then waits for it to finish; the 1000 tasks execute one after the other, and there is very little overhead in handling them. This means that you do not need the tasks at all, which is also what I suggested above.
On the other hand, the code
Task.WhenAll(tasks).Result.SelectMany(result => result).ToList();
will start all the tasks and try to execute them concurrently, and because the task pool is unable to execute 1000 tasks in parallel, most of these tasks are queued before they are executed. This creates a big management and task-switching overhead, which explains the bad performance.
With regard to the final question you added: If the only purpose of the outer task is to start the inner tasks then the outer task has no useful purpose but if the outer tasks are there to perform some kind of coordination of the inner tasks then it might make sense (perhaps you want to combine Task.WhenAny with Task.WhenAll). Without more context it is hard to answer. However, your question seems to be about performance and starting 100,000 tasks may add considerable overhead.
Parallel.ForEach is a good choice if you want to perform 100,000 independent computations like you do in your example. Tasks are very good for executing concurrent activities involving "slow" calls to other systems where you want to wait for and combine results and also handle errors. For massive parallelism they are probably not the best choice.
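One more wrinkle in the original test worth calling out: tasks is a lazy LINQ query, so every enumeration creates and starts a fresh batch of tasks, meaning the two timed sections never even operate on the same task objects. A minimal sketch of that pitfall:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class LazyEnumerationDemo
{
    static int _started;

    static void Main()
    {
        // Lazy: no tasks exist until the sequence is enumerated.
        var tasks = Enumerable.Range(0, 10)
            .Select(_ => Task.Run(() => Interlocked.Increment(ref _started)));

        Task.WaitAll(tasks.ToArray()); // first enumeration: 10 tasks started
        Task.WaitAll(tasks.ToArray()); // second enumeration: 10 MORE tasks started

        Console.WriteLine(_started); // 20, not 10
    }
}
```

Materializing the query once up front with ToList() or ToArray() avoids this, and is what you want whenever the same set of tasks is consumed more than once.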
Your test is way too complicated, so I've made my own. Here's a simple test that incorporates your Compute method:
public static void Test()
{
    var tasks = Enumerable.Repeat(int.MaxValue, 10000).Select(n => Task.Run(() => Compute(n)));

    var stopwatch = Stopwatch.StartNew();
    Task.WhenAll(tasks).Result.SelectMany(result => result).ToList();
    Console.WriteLine("Task.WhenAll - {0}", stopwatch.Elapsed);

    stopwatch.Restart();
    tasks.Select(t => t.Result).SelectMany(r => r).ToList();
    Console.WriteLine("Select - {0}", stopwatch.Elapsed);
}
private static List<int> Compute(int seed)
{
    var results = new List<int>();
    for (int i = 0; i < 5000; i++)
    {
        results.Add(seed * i);
    }
    return results;
}
Output:
Task.WhenAll - 00:00:01.2894227
Select - 00:00:01.7114142
However if I use Enumerable.Repeat(int.MaxValue, 100) the output is:
Task.WhenAll - 00:00:00.0205375
Select - 00:00:00.0178089
Basically, the difference between the options is whether you block once or block for each element. Blocking once is better when there are many elements, but for a few, blocking on each one could be better.
Since there isn't really a big difference, since you care about performance only when dealing with many items, and since logically you want to proceed when all the tasks have completed, I recommend using Task.WhenAll.

How to call an async method from within a loop without awaiting?

Consider this piece of code, where there is some work being done within a for loop, and then a recursive call to process sub-items. I wanted to convert DoSomething(item) and GetItems(id) to async methods, but if I await them here, the loop will wait for each iteration to finish before moving on, essentially losing the benefit of parallel processing. How could I improve the performance of this method? Is it possible to do it using async/await?
public void DoWork(string id)
{
    var items = GetItems(id); // takes time
    if (items == null)
        return;

    Parallel.ForEach(items, item =>
    {
        DoSomething(item); // takes time
        DoWork(item.subItemId);
    });
}
Instead of using Parallel.ForEach to loop over the items you can create a sequence of tasks and then use Task.WhenAll to wait for them all to complete. As your code also involves recursion it gets slightly more complicated and you need to combine DoSomething and DoWork into a single method which I have aptly named DoIt:
async Task DoWork(string id)
{
    var items = GetItems(id);
    if (items == null)
        return;
    var tasks = items.Select(DoIt);
    await Task.WhenAll(tasks);
}

async Task DoIt(Item item)
{
    await DoSomething(item);
    await DoWork(item.subItemId);
}
Mixing Parallel.ForEach and async/await is a bad idea. Parallel.ForEach allows your code to execute in parallel, and for compute-intensive but parallelizable algorithms it gives the best performance. async/await, however, allows your code to execute concurrently and, for instance, reuse threads that would otherwise be blocked on IO operations.
Simplified, Parallel.ForEach sets up as many threads as you have CPU cores and then partitions the items you are iterating across these threads. So Parallel.ForEach should be used once, at the bottom of your call stack, where it fans the work out to multiple threads and waits for them to complete. Calling Parallel.ForEach recursively inside each of those threads is just crazy and will not improve performance at all.

run a method multiple times simultaneously in c#

I have a method that returns XML elements, but that method takes some time to finish and return a value.
What I have now is
foreach (var t in s)
{
    r.Add(method(t));
}
but this only runs the next statement after the previous one finishes. How can I make the calls run simultaneously?
You should be able to use tasks for this:
// first start a task for each element in s, and add the tasks to the tasks collection
var tasks = new List<Task>();
foreach (var t in s)
{
    tasks.Add(Task.Factory.StartNew(method(t)));
}

// then block until all tasks have completed
Task.WaitAll(tasks);

// then add the result of all the tasks to r in a thread-safe fashion
foreach (var task in tasks)
{
    r.Add(task.Result);
}
EDIT
There are some problems with the code above. See the code below for a working version. Here I have also rewritten the loops to use LINQ for readability (and, in the case of the first loop, to avoid the problems caused by the closure over t inside the lambda expression).
var tasks = s.Select(t => Task<int>.Factory.StartNew(() => method(t))).ToArray();

// then block until all tasks have completed
Task.WaitAll(tasks);

// then add the result of all the tasks to r in a thread-safe fashion
r = tasks.Select(task => task.Result).ToList();
You can use Parallel.ForEach, which will use multiple threads to do the execution in parallel. You have to make sure that all the code called is thread-safe and can be executed in parallel.
Parallel.ForEach(s, t => r.Add(method(t)));
From what I'm seeing, you are updating a shared collection inside the loop. This means that if you execute the loop in parallel, a data race will occur: multiple threads will try to update the non-synchronized collection (assuming r is a List<T> or something like it) at the same time, leaving it in an inconsistent state.
To execute correctly in parallel, you will need to wrap that section of code inside a lock statement:
object locker = new object();
Parallel.ForEach(s, t =>
{
    lock (locker)
    {
        r.Add(method(t));
    }
});
However, this will make the execution effectively serial, because each thread needs to acquire the lock and two threads cannot hold it at the same time.
The better solution would be to have a local list for each thread, add the partial results to that list, and then merge the lists when all threads have finished. Probably @Øyvind Knobloch-Bråthen's second solution is the best one, assuming method(t) is the real CPU hog in this case.
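That per-thread list-then-merge pattern is built into Parallel.ForEach via its localInit/localFinally overload; here is a sketch (Method is a stand-in for the question's method(t)):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class LocalListDemo
{
    // Stand-in for the CPU-bound method(t) from the question.
    static int Method(int t) => t * t;

    static void Main()
    {
        var r = new List<int>();

        Parallel.ForEach(
            Enumerable.Range(1, 100),
            () => new List<int>(),        // localInit: one private list per worker
            (t, state, local) =>          // body: no locking on the hot path
            {
                local.Add(Method(t));
                return local;
            },
            local =>                      // localFinally: merge each list once, under the lock
            {
                lock (r) r.AddRange(local);
            });

        var sum = 0;
        foreach (var v in r) sum += v;
        Console.WriteLine(r.Count + " " + sum); // 100 items; sum of squares 1..100 = 338350
    }
}
```

The lock is taken only once per worker thread rather than once per item, so contention stays negligible while the results still end up merged into a single list.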
Modification to the correct answer for this question:
change
tasks.Add(Task.Factory.StartNew(method(t)));
to
tasks.Add(Task.Factory.StartNew(() => { method(t); }));
