TPL Dataflow with Rx: Observer misses messages - c#

In the sample coding below, the source and IObservable and also part of a pipeline. The sink at the end of the pipeline gets all the messages. But the observer gets only the first and last messages.
public class Test
{
public static async Task Run()
{
var source = new BroadcastBlock<int>(x => x);
var transform = new TransformBlock<int, int>(x => x * 2);
var sink = new ActionBlock<int>(x => Console.WriteLine($"from ActionBlock: {x}"));
source.LinkTo(transform);
transform.LinkTo(sink);
var observable = source.AsObservable();
IDisposable subscription = observable.Subscribe(x => Console.WriteLine($"from observer: {x}"));
await source.SendAsync(1);
await source.SendAsync(2);
await source.SendAsync(3);
await source.SendAsync(4);
await source.SendAsync(5);
await source.SendAsync(6);
subscription.Dispose();
}
}
from observer: 1
from ActionBlock: 2
from ActionBlock: 4
from ActionBlock: 6
from ActionBlock: 8
from ActionBlock: 10
from ActionBlock: 12
from observer: 6

From a relevant GitHub issue regarding similar problems with the WriteOnceBlock<T>:
at present WriteOnceBlock composability with AsObservable isn't perfect.
(Stephen Toub commented on Jun 30, 2022)
If I a was you I wouldn't mess with TPL Dataflow and Rx interop. It seems that these two technologies have not been rigorously tested together, and there are bugs lurking in there. Bugs that currently no one cares about, because no one actually uses these two technologies together to do anything serious.

Related

BlockCollection alternatives?

We are using BlockCollection to implement producer-consumer pattern in a real-time application, i.e.
BlockingCollection<T> collection = new BlockingCollection<T>();
CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
// Starting up consumer
Task.Run(() => consumer(this.cancellationTokenSource.Token));
…
void Producer(T item)
{
collection.Add(item);
}
…
void consumer()
{
while (true)
{
var item = this.blockingCollection.Take(token);
process (item);
}
}
To be sure, this is a very simplified version of the actual production code.
Sometimes when the application is under heavy load, we observe that the consuming part is lagging behind the producing part. Since the application logic is very complex, it involves interaction with other applications over network, as well as with SQL databases. Delays could be occurring in many places; they could occur in the calls to process(), which might in principle explain why the consuming part can be slow.
All the above considerations aside, is there something inherent in using BlockingCollection, which could explain this phenomenon? Are there more efficient options in .Net to realise producer-consumer pattern?
First of all, BlockingCollection isn't the best choice for producer/consumer scenarios. There are at least two better options (Dataflow, Channels) and the choice depends on the actual application scenario - which is missing from the question.
It's also possible to create a producer/consumer pipeline without a buffer, by using async streams and IAsyncEnmerable.
Async Streams
In this case, the producer can be an async iterator. The consumer will receive the IAsyncEnumerable and iterate over it until it completes. It could also produce its own IAsyncEnumerable output, which can be passed to the next method in the pipeline:
The producer can be :
public static async IAsyncEnumerable<Message> ProducerAsync(CancellationToken token)
{
while(!token.IsCancellationRequested)
{
var msg=await Task.Run(()=>SomeHeavyWork());
yield return msg;
}
}
And the consumer :
async Task ConsumeAsync(IAsyncEnumerable<Message> source)
{
await foreach(var msg in source)
{
await consumeMessage(msg);
}
}
There's no buffering in this case, and the producer can't emit a new message until the consumer consumes the current one. The consumer can be parallelized with Parallel.ForEachAsync. Finally, the System.Linq.Async provides LINQ operations to async streams, allowing us to write eg :
List<OtherMsg> results=await ProducerAsync(cts.Token)
.Select(msg=>consumeAndReturn(msg))
.ToListAsync();
Dataflow - ActionBlock
Dataflow blocks can be used to construct entire processing pipelines, with each block receiving a message (data) from the previous one, processing it and passing it to the next block. Most blocks have input and where appropriate output buffers. Each block uses a single worker task but can be configured to use more. The application code doesn't have to handle the tasks though.
In the simplest case, a single ActionBlock can process messages posted to it by one or more producers, acting as a consumer:
async Task ConsumeAsync<Message>(Message message)
{
//Do something with the message
}
...
ExecutionDataflowBlockOptions _options= new () {
MaxDegreeOfParallelism=4,
BoundedCapacity=5
};
ActionBlock<Message> _block=new ActionBlock(ConsumeAsync,_options);
async Task ProduceAsync(CancellationToken token)
{
while(!token.IsCancellationRequested)
{
var msg=await produceNewMessageAsync();
await _block.SendAsync(msg);
}
_block.Complete();
await _block.Completion;
}
In this example the block uses 4 worker tasks and will block if more than 5 items are waiting in its input buffer, beyond those currently being processed.
BufferBlock as a producer/consumer queue
A BufferBlock is an inactive block that's used as a buffer by other blocks. It can be used as an asynchronous producer/consumer collection as shown in How to: Implement a producer-consumer dataflow pattern. In this case, the code needs to receive messages explicitly. Threading is up to the developer. :
static void Produce(ITargetBlock<byte[]> target)
{
var rand = new Random();
for (int i = 0; i < 100; ++ i)
{
var buffer = new byte[1024];
rand.NextBytes(buffer);
target.Post(buffer);
}
target.Complete();
}
static async Task<int> ConsumeAsync(ISourceBlock<byte[]> source)
{
int bytesProcessed = 0;
while (await source.OutputAvailableAsync())
{
byte[] data = await source.ReceiveAsync();
bytesProcessed += data.Length;
}
return bytesProcessed;
}
static async Task Main()
{
var buffer = new BufferBlock<byte[]>();
var consumerTask = ConsumeAsync(buffer);
Produce(buffer);
var bytesProcessed = await consumerTask;
Console.WriteLine($"Processed {bytesProcessed:#,#} bytes.");
}
Parallelized consumer
In .NET 6 the consumer can be simplified by using await foreach and ReceiveAllAsync :
static async Task<int> ConsumeAsync(IReceivableSourceBlock<byte[]> source)
{
int bytesProcessed = 0;
await foreach(var data in source.ReceiveAllAsync())
{
bytesProcessed += data.Length;
}
return bytesProcessed;
}
And processed concurrently using Parallel.ForEachAsync :
static async Task ConsumeAsync(IReceivableSourceBlock<byte[]> source)
{
var msgs=source.ReceiveAllAsync();
await Parallel.ForEachAsync(msgs,
new ParallelOptions { MaxDegreeOfParallelism = 4},
msg=>ConsumeMsgAsync(msg));
}
By default Parallel.ForeachAsync will use as many worker tasks as there are cores
Channels
Channels are similar to Go's channels. They are built specifically for producer/consumer scenarios and allow creating pipelines at a lower level than the Dataflow library. If the Dataflow library was built today, it would be built on top of Channels.
A channel can't be accessed directly, only through its Reader or Writer interfaces. This is intentional, and allows easy pipelining of methods. A very common pattern is for a producer method to create an channel it owns and return only a ChannelReader. Consuming methods accept that reader as input. This way, the producer can control the channel's lifetime without worrying whether other producers are writing to it.
With channels, a producer would look like this :
ChannelReader<Message> Producer(CancellationToken token)
{
var channel=Channel.CreateBounded(5);
var writer=channel.Writer;
_ = Task.Run(()=>{
while(!token.IsCancellationRequested)
{
...
await writer.SendAsync(msg);
}
},token)
.ContinueWith(t=>writer.TryComplete(t.Exception));
return channel.Reader;
}
The unusual .ContinueWith(t=>writer.TryComplete(t.Exception)); is used to signal completion to the writer. This will signal readers to complete as well. This way completion propagates from one method to the next. Any exceptions are propagated as well
writer.TryComplete(t.Exception)) doesn't block or perform any significant work so it doesn't matter what thread it executes on. This means there's no need to use await on the worker task, which would complicate the code by rethrowing any exceptions.
A consuming method only needs the ChannelReader as source.
async Task ConsumerAsync(ChannelReader<Message> source)
{
await Parallel.ForEachAsync(source.ReadAllAsync(),
new ParallelOptions { MaxDegreeOfParallelism = 4},
msg=>consumeMessageAsync(msg)
);
}
A method may read from one channel and publish new data to another using the producer pattern :
ChannelReader<OtherMessage> ConsumerAsync(ChannelReader<Message> source)
{
var channel=Channel.CreateBounded<OtherMessage>();
var writer=channel.Writer;
await Parallel.ForEachAsync(source.ReadAllAsync(),
new ParallelOptions { MaxDegreeOfParallelism = 4},
async msg=>{
var newMsg=await consumeMessageAsync(msg);
await writer.SendAsync(newMsg);
})
.ContinueWith(t=>writer.TryComplete(t.Exception));
}
You could look at using the Dataflow library. I'm not sure if it is more performant than a BlockingCollection. As others have said, there is no guarantee that you can consume faster than produce, so it is always possible to fall behind.

Clarification on running multiple async tasks in parallel with throttling

EDIT: Since the Bulkhead policy needs to be wrapped with a WaitAndRetry policy, anyway...I'm leaning towards example 3 as the best solution to keep parallelism, throttling, and polly policy retrying. Just seems strange since I thought the Parallel.ForEach was for sync operations and Bulkhead would be better for async
I'm trying to run multiple async Tasks in parallel with throttling using polly AsyncBulkheadPolicy. My understanding so far is that the policy method ExecuteAsync does not itself make a call onto a thread, but is leaving that to the default TaskScheduler or someone before it. Thus, if my tasks are CPU bound in some way then I need to use Parallel.ForEach when executing tasks or Task.Run() with the ExecuteAsync method in order to schedule the tasks to background threads.
Can someone look at the examples below and clarify how they would work in terms of parallism and threadpooling?
https://github.com/App-vNext/Polly/wiki/Bulkhead - Operation: Bulkhead policy does not create it's own threads, it assumes we have already done so.
async Task DoSomething(IEnumerable<object> objects);
//Example 1:
//Simple use, but then I don't have access to retry policies from polly
Parallel.ForEach(groupedObjects, (set) =>
{
var task = DoSomething(set);
task.Wait();
});
//Example 2:
//Uses default TaskScheduler which may or may not run the tasks in parallel
var parallelTasks = new List<Task>();
foreach (var set in groupedObjects)
{
var task = bulkheadPolicy.ExecuteAsync(async () => DoSomething(set));
parallelTasks.Add(task);
};
await Task.WhenAll(parallelTasks);
//Example 3:
//seems to defeat the purpose of the bulkhead since Parallel.ForEach and
//PolicyBulkheadAsync can both do throttling...just use basic RetryPolicy
//here?
Parallel.ForEach(groupedObjects, (set) =>
{
var task = bulkheadPolicy.ExecuteAsync(async () => DoSomething(set));
task.Wait();
});
//Example 4:
//Task.Run still uses the default Task scheduler and isn't any different than
//Example 2; just makes more tasks...this is my understanding.
var parallelTasks = new List<Task>();
foreach (var set in groupedObjects)
{
var task = Task.Run(async () => await bulkheadPolicy.ExecuteAsync(async () => DoSomething(set)));
parallelTasks.Add(task);
};
await Task.WhenAll(parallelTasks);
DoSomething is an async method doing operations on a set of objects. I'd like this to happen in parallel threads while respecting retry policies from polly and allowing for throttling.
I seem to have confused myself along the way in what exactly the functional behavior of Parallel.ForEach and using Bulkhead.ExecuteAsync does, however, when it comes to how tasks/threads are handled.
You are probably right that using Parallel.ForEach defeats the purpose of the bulkhead. I think that a simple loop with a delay will do the job of feeding the bulkhead with tasks. Although I guess that in a real life example there would be a continuous stream of data, and not a predefined list or array.
using Polly;
using Polly.Bulkhead;
static async Task Main(string[] args)
{
var groupedObjects = Enumerable.Range(0, 10)
.Select(n => new object[] { n }); // Create 10 sets to work with
var bulkheadPolicy = Policy
.BulkheadAsync(3, 3); // maxParallelization, maxQueuingActions
var parallelTasks = new List<Task>();
foreach (var set in groupedObjects)
{
Console.WriteLine(#$"Scheduling, Available: {bulkheadPolicy
.BulkheadAvailableCount}, QueueAvailable: {bulkheadPolicy
.QueueAvailableCount}");
// Start the task
var task = bulkheadPolicy.ExecuteAsync(async () =>
{
// Await the task without capturing the context
await DoSomethingAsync(set).ConfigureAwait(false);
});
parallelTasks.Add(task);
await Task.Delay(50); // Interval between scheduling more tasks
}
var whenAllTasks = Task.WhenAll(parallelTasks);
try
{
// Await all the tasks (await throws only one of the exceptions)
await whenAllTasks;
}
catch when (whenAllTasks.IsFaulted) // It might also be canceled
{
// Ignore rejections, rethrow other exceptions
whenAllTasks.Exception.Handle(ex => ex is BulkheadRejectedException);
}
Console.WriteLine(#$"Processed: {parallelTasks
.Where(t => t.Status == TaskStatus.RanToCompletion).Count()}");
Console.WriteLine($"Faulted: {parallelTasks.Where(t => t.IsFaulted).Count()}");
}
static async Task DoSomethingAsync(IEnumerable<object> set)
{
// Pretend we are doing something with the set
await Task.Delay(500).ConfigureAwait(false);
}
Output:
Scheduling, Available: 3, QueueAvailable: 3
Scheduling, Available: 2, QueueAvailable: 3
Scheduling, Available: 1, QueueAvailable: 3
Scheduling, Available: 0, QueueAvailable: 3
Scheduling, Available: 0, QueueAvailable: 2
Scheduling, Available: 0, QueueAvailable: 1
Scheduling, Available: 0, QueueAvailable: 0
Scheduling, Available: 0, QueueAvailable: 0
Scheduling, Available: 0, QueueAvailable: 0
Scheduling, Available: 0, QueueAvailable: 1
Processed: 7
Faulted: 3
Try it on Fiddle.
Update: A slightly more realistic version of DoSomethingAsync, that actually forces the CPU to do some real work (CPU utilization near 100% in my quad core machine).
private static async Task DoSomethingAsync(IEnumerable<object> objects)
{
await Task.Run(() =>
{
long sum = 0; for (int i = 0; i < 500000000; i++) sum += i;
}).ConfigureAwait(false);
}
This method is not running for all the data sets. It's running only for the sets that are not rejected by the bulkhead.

Linking multiple TransformBlocks to single BatchBlock TPL

I've built two pipelines using TPL Dataflow:
TransformBlock => TransformBlock => BatchBlock => ....
TransformBlock => BatchBlock => TransformBlock => ....
I want to accomplish
/ => Transform Block => TransformBlock => BatchBlock => ....
BatchBlock /
\
\ => Transform Block => BatchBlock => TransformBlock => ....
However only the first pipeline gets executed.
My code
batchMediaBlock.LinkTo(pipelineA.FirstBlock, new DataflowLinkOptions {PropagateCompletion = true});
batchMediaBlock.LinkTo(pipelineB.FirstBlock, new DataflowLinkOptions {PropagateCompletion = true});
How can I accomplish this?
You'll need a BroadcastBlock after your BatchBlock. But be advised, completion will only propagate to one of your TransformBlocks. See below for a partial example to handle completion:
using System.Threading.Tasks.Dataflow;
namespace MyDataflow {
class MyDataflow {
public void HandlingCompletion() {
var batchBlock = new BatchBlock<int>(10);
var broadcastBlock = new BroadcastBlock<int[]>(_ => _);
var xForm1 = new TransformBlock<int[], int[]>(_ => _);
var xForm2 = new TransformBlock<int[], int[]>(_ => _);
batchBlock.LinkTo(broadcastBlock, new DataflowLinkOptions() { PropagateCompletion = true });
broadcastBlock.LinkTo(xForm1);
broadcastBlock.LinkTo(xForm1);
broadcastBlock.Completion.ContinueWith(broadcastBlockCompletionTask => {
if (!broadcastBlockCompletionTask.IsFaulted) {
xForm1.Complete();
xForm2.Complete();
}else {
((IDataflowBlock)xForm1).Fault(broadcastBlockCompletionTask.Exception);
((IDataflowBlock)xForm2).Fault(broadcastBlockCompletionTask.Exception);
}
});
xForm1.Completion.ContinueWith(async _ => {
try {
await xForm2.Completion;
//continue passing completion / fault on to rest of pipeline
} catch {
}
});
}
}
}
Alternatively, if your pipeline never converges again you can handle completion separately for each pipeline after continuing the BroacastBlock. The example provided will complete each step in the pipeline at the same time, flowing completion along in sync.
By default, linking in TPL Dataflow is considered greedy, so the first target always get message and removes it from previous block' output, that's why your second block doesn't get any messages. Such situations can be addressed by BroadcastBlock<T>, which
ensures that the current element is broadcast to any linked targets before allowing the element to be overwritten.
You also should note that this block do clone the message.
So you basically should add a broadcast after your batch block, but! you should not propagate your completion from broadcast block to consumers - only first one will get a completion. You should add a ContinueWith handler for your broadcast, as #JSteward suggested.

Ensuring completion of async OnNext code before process terminates

The unit test below will never print "Async 3" because the test finishes first. How can I ensure it runs to completion? The best I could come up with was an arbitrary Task.Delay at the end or WriteAsync().Result, neither are ideal.
public async Task TestMethod1() // eg. webjob
{
TestContext.WriteLine("Starting test...");
var observable = Observable.Create<int>(async ob =>
{
ob.OnNext(1);
await Task.Delay(1000); // Fake async REST api call
ob.OnNext(2);
await Task.Delay(1000);
ob.OnNext(3);
ob.OnCompleted();
});
observable.Subscribe(i => TestContext.WriteLine($"Sync {i}"));
observable.SelectMany(i => WriteAsync(i).ToObservable()).Subscribe();
await observable;
TestContext.WriteLine("Complete.");
}
public async Task WriteAsync(int value) // Fake async DB call
{
await Task.Delay(1000);
TestContext.WriteLine($"Async {value}");
}
Edit
I realise that mentioning unit tests was probably misleading.This isn't a testing question. The code is a simulation of a real issue of a process running in an Azure WebJob, where both the producer and consumer need to call some Async IO. The issue is that the webjob runs to completion before the consumer has really finished. This is because I can't figure out how to properly await anything from the consumer side. Maybe this just isn't possible with RX...
EDIT:
You're basically looking for a blocking operator. The old blocking operators (like ForEach) were deprecated in favor of async versions. You want to await the last item like so:
public async Task TestMethod1()
{
TestContext.WriteLine("Starting test...");
var observable = Observable.Create<int>(async ob =>
{
ob.OnNext(1);
await Task.Delay(1000);
ob.OnNext(2);
await Task.Delay(1000);
ob.OnNext(3);
ob.OnCompleted();
});
observable.Subscribe(i => TestContext.WriteLine($"Sync {i}"));
var selectManyObservable = observable.SelectMany(i => WriteAsync(i).ToObservable()).Publish().RefCount();
selectManyObservable.Subscribe();
await selectManyObservable.LastOrDefaultAsync();
TestContext.WriteLine("Complete.");
}
While that will solve your immediate problem, it looks like you're going to keep running into issues because of the below (and I added two more). Rx is very powerful when used right, and confusing as hell when not.
Old answer:
A couple things:
Mixing async/await and Rx generally results in getting the pitfalls of both and the benefits of neither.
Rx has robust testing functionality. You're not using it.
Side-Effects, like a WriteLine are best performed exclusively in a subscribe, and not in an operator like SelectMany.
You may want to brush up on cold vs hot observables.
The reason it isn't running to completion is because of your test runner. Your test runner is terminating the test at the conclusion of TestMethod1. The Rx subscription would live on otherwise. When I run your code in Linqpad, I get the following output:
Starting test...
Sync 1
Sync 2
Async 1
Sync 3
Async 2
Complete.
Async 3
...which is what I'm assuming you want to see, except you probably want the Complete after the Async 3.
Using Rx only, your code would look something like this:
public void TestMethod1()
{
TestContext.WriteLine("Starting test...");
var observable = Observable.Concat<int>(
Observable.Return(1),
Observable.Empty<int>().Delay(TimeSpan.FromSeconds(1)),
Observable.Return(2),
Observable.Empty<int>().Delay(TimeSpan.FromSeconds(1)),
Observable.Return(3)
);
var syncOutput = observable
.Select(i => $"Sync {i}");
syncOutput.Subscribe(s => TestContext.WriteLine(s));
var asyncOutput = observable
.SelectMany(i => WriteAsync(i, scheduler));
asyncOutput.Subscribe(s => TestContext.WriteLine(s), () => TestContext.WriteLine("Complete."));
}
public IObservable<string> WriteAsync(int value, IScheduler scheduler)
{
return Observable.Return(value)
.Delay(TimeSpan.FromSeconds(1), scheduler)
.Select(i => $"Async {value}");
}
public static class TestContext
{
public static void WriteLine(string s)
{
Console.WriteLine(s);
}
}
This still isn't taking advantage of Rx's testing functionality. That would look like this:
public void TestMethod1()
{
var scheduler = new TestScheduler();
TestContext.WriteLine("Starting test...");
var observable = Observable.Concat<int>(
Observable.Return(1),
Observable.Empty<int>().Delay(TimeSpan.FromSeconds(1), scheduler),
Observable.Return(2),
Observable.Empty<int>().Delay(TimeSpan.FromSeconds(1), scheduler),
Observable.Return(3)
);
var syncOutput = observable
.Select(i => $"Sync {i}");
syncOutput.Subscribe(s => TestContext.WriteLine(s));
var asyncOutput = observable
.SelectMany(i => WriteAsync(i, scheduler));
asyncOutput.Subscribe(s => TestContext.WriteLine(s), () => TestContext.WriteLine("Complete."));
var asyncExpected = scheduler.CreateColdObservable<string>(
ReactiveTest.OnNext(1000.Ms(), "Async 1"),
ReactiveTest.OnNext(2000.Ms(), "Async 2"),
ReactiveTest.OnNext(3000.Ms(), "Async 3"),
ReactiveTest.OnCompleted<string>(3000.Ms() + 1) //+1 because you can't have two notifications on same tick
);
var syncExpected = scheduler.CreateColdObservable<string>(
ReactiveTest.OnNext(0000.Ms(), "Sync 1"),
ReactiveTest.OnNext(1000.Ms(), "Sync 2"),
ReactiveTest.OnNext(2000.Ms(), "Sync 3"),
ReactiveTest.OnCompleted<string>(2000.Ms()) //why no +1 here?
);
var asyncObserver = scheduler.CreateObserver<string>();
asyncOutput.Subscribe(asyncObserver);
var syncObserver = scheduler.CreateObserver<string>();
syncOutput.Subscribe(syncObserver);
scheduler.Start();
ReactiveAssert.AreElementsEqual(
asyncExpected.Messages,
asyncObserver.Messages);
ReactiveAssert.AreElementsEqual(
syncExpected.Messages,
syncObserver.Messages);
}
public static class MyExtensions
{
public static long Ms(this int ms)
{
return TimeSpan.FromMilliseconds(ms).Ticks;
}
}
...So unlike your Task tests, you don't have to wait. The test executes instantly. You can bump up the Delay times to minutes or hours, and the TestScheduler will essentially mock the time for you. And then your test runner will probably be happy.
Well, you can use Observable.ForEach to block until an IObservable has terminated:
observable.ForEach(unusedValue => { });
Can you make TestMethod1 a normal, non-async method, and then replace await observable; with this?

How to correctly queue up tasks to run in C#

I have an enumeration of items (RunData.Demand), each representing some work involving calling an API over HTTP. It works great if I just foreach through it all and call the API during each iteration. However, each iteration takes a second or two so I'd like to run 2-3 threads and divide up the work between them. Here's what I'm doing:
ThreadPool.SetMaxThreads(2, 5); // Trying to limit the amount of threads
var tasks = RunData.Demand
.Select(service => Task.Run(async delegate
{
var availabilityResponse = await client.QueryAvailability(service);
// Do some other stuff, not really important
}));
await Task.WhenAll(tasks);
The client.QueryAvailability call basically calls an API using the HttpClient class:
public async Task<QueryAvailabilityResponse> QueryAvailability(QueryAvailabilityMultidayRequest request)
{
var response = await client.PostAsJsonAsync("api/queryavailabilitymultiday", request);
if (response.IsSuccessStatusCode)
{
return await response.Content.ReadAsAsync<QueryAvailabilityResponse>();
}
throw new HttpException((int) response.StatusCode, response.ReasonPhrase);
}
This works great for a while, but eventually things start timing out. If I set the HttpClient Timeout to an hour, then I start getting weird internal server errors.
What I started doing was setting a Stopwatch within the QueryAvailability method to see what was going on.
What's happening is all 1200 items in RunData.Demand are being created at once and all 1200 await client.PostAsJsonAsync methods are being called. It appears it then uses the 2 threads to slowly check back on the tasks, so towards the end I have tasks that have been waiting for 9 or 10 minutes.
Here's the behavior I would like:
I'd like to create the 1,200 tasks, then run them 3-4 at a time as threads become available. I do not want to queue up 1,200 HTTP calls immediately.
Is there a good way to go about doing this?
As I always recommend.. what you need is TPL Dataflow (to install: Install-Package System.Threading.Tasks.Dataflow).
You create an ActionBlock with an action to perform on each item. Set MaxDegreeOfParallelism for throttling. Start posting into it and await its completion:
var block = new ActionBlock<QueryAvailabilityMultidayRequest>(async service =>
{
var availabilityResponse = await client.QueryAvailability(service);
// ...
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
foreach (var service in RunData.Demand)
{
block.Post(service);
}
block.Complete();
await block.Completion;
Old question, but I would like to propose an alternative lightweight solution using the SemaphoreSlim class. Just reference System.Threading.
SemaphoreSlim sem = new SemaphoreSlim(4,4);
foreach (var service in RunData.Demand)
{
await sem.WaitAsync();
Task t = Task.Run(async () =>
{
var availabilityResponse = await client.QueryAvailability(serviceCopy));
// do your other stuff here with the result of QueryAvailability
}
t.ContinueWith(sem.Release());
}
The semaphore acts as a locking mechanism. You can only enter the semaphore by calling Wait (WaitAsync) which subtracts one from the count. Calling release adds one to the count.
You're using async HTTP calls, so limiting the number of threads will not help (nor will ParallelOptions.MaxDegreeOfParallelism in Parallel.ForEach as one of the answers suggests). Even a single thread can initiate all requests and process the results as they arrive.
One way to solve it is to use TPL Dataflow.
Another nice solution is to divide the source IEnumerable into partitions and process items in each partition sequentially as described in this blog post:
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate
{
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
While the Dataflow library is great, I think it's a bit heavy when not using block composition. I would tend to use something like the extension method below.
Also, unlike the Partitioner method, this runs the async methods on the calling context - the caveat being that if your code is not truly async, or takes a 'fast path', then it will effectively run synchronously since no threads are explicitly created.
public static async Task RunParallelAsync<T>(this IEnumerable<T> items, Func<T, Task> asyncAction, int maxParallel)
{
var tasks = new List<Task>();
foreach (var item in items)
{
tasks.Add(asyncAction(item));
if (tasks.Count < maxParallel)
continue;
var notCompleted = tasks.Where(t => !t.IsCompleted).ToList();
if (notCompleted.Count >= maxParallel)
await Task.WhenAny(notCompleted);
}
await Task.WhenAll(tasks);
}

Categories

Resources