TPL Dataflow Consumer to Process Multiple Items at a time - c#

I have a requirement to iterate through a large list and, for each item, call a web service to get some data. However, I want to throttle the number of requests to the WS to, say, no more than 5 concurrent requests executing at any one time. All calls to the WS are made using async/await. I am using the TPL Dataflow BufferBlock with a BoundedCapacity of 5.
Everything is working properly, but what I am noticing is that the consumer, which awaits the WS call, blocks the queue until it's finished, resulting in all the requests in the BufferBlock being performed serially. Is it possible to have the consumer always processing 5 items off the queue at a time? Or do I need to set up multiple consumers, or start looking into ActionBlocks?
So in a nutshell, I want to seed the queue with 5 items. As one item is processed, a sixth will take its place, and so on, so I always have 5 concurrent requests going until there are no more items to process.
I used this as my guide: Async Producer/Consumer Queue using Dataflow
Thanks for any help. Below is a simplified version of the code
//set up
BufferBlock<CustomObject> queue = new BufferBlock<CustomObject>(new DataflowBlockOptions { BoundedCapacity = 5 });
var producer = QueueValues(queue, values);
var consumer = ConsumeValues(queue);
await Task.WhenAll(producer, consumer, queue.Completion);
counter = await consumer;

//producer
async Task QueueValues(BufferBlock<CustomObject> queue, IList<CustomObject> values)
{
    foreach (CustomObject value in values)
    {
        await queue.SendAsync(value);
    }
    queue.Complete();
}

//consumer: processes one item at a time, which is the serial behavior described above
async Task<int> ConsumeValues(BufferBlock<CustomObject> queue)
{
    int counter = 0;
    while (await queue.OutputAvailableAsync())
    {
        CustomObject value = await queue.ReceiveAsync();
        await CallWebServiceAsync(value);
        counter++;
    }
    return counter;
}

Your use of TPL Dataflow is rather strange. Normally, you'd move the consumption and processing into the flow: append a TransformBlock to call the web service and delete ConsumeValues.
ConsumeValues executes sequentially, which is fundamentally not what you want.
Instead of BoundedCapacity, I think you rather want MaxDegreeOfParallelism.
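For illustration only, a minimal sketch of that shape, assuming CallWebServiceAsync returns a Task<CustomResult> (CustomResult and HandleResult are hypothetical placeholders):
var fetch = new TransformBlock<CustomObject, CustomResult>(
    value => CallWebServiceAsync(value), // assumed to return Task<CustomResult>
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

var handle = new ActionBlock<CustomResult>(result => HandleResult(result)); // hypothetical sink

fetch.LinkTo(handle, new DataflowLinkOptions { PropagateCompletion = true });
Post the values into fetch, call fetch.Complete(), and await handle.Completion; the TransformBlock then keeps 5 web-service calls in flight for you.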

You should be using ActionBlock with MaxDegreeOfParallelism set to 5. You may also want to set a BoundedCapacity but that's for throttling the producer and not the consumer:
var block = new ActionBlock<CustomObject>(
    item => CallWebServiceAsync(item),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 5,
        BoundedCapacity = 1000
    });

foreach (CustomObject value in values)
{
    await block.SendAsync(value);
}

block.Complete();
await block.Completion;

Related

Why do multiple short Tasks end up getting the same id?

I have a list of items to process; I create a task for each one and then await using Task.WhenAny(). I am following the pattern described here: Start Multiple Async Tasks and Process Them As They Complete.
I have changed one thing: I am using HashSet<Task> instead of List<Task>. But I notice that all the tasks end up getting the same id, so the HashSet only adds one of them, and hence I end up waiting for only one task.
I have a working example here in dotnetfiddle: https://dotnetfiddle.net/KQN2ow
Also pasting the code below:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace ReproTasksWithSameId
{
    public class Program
    {
        public static async Task Main(string[] args)
        {
            List<int> itemIds = new List<int>() { 1, 2, 3, 4 };
            await ProcessManyItems(itemIds);
        }

        private static async Task ProcessManyItems(List<int> itemIds)
        {
            //
            // Create tasks for each item and then wait for them using Task.WhenAny.
            // Following the Task.WhenAny() pattern described here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/start-multiple-async-tasks-and-process-them-as-they-complete
            // But replaced List<Task> with HashSet<Task>.
            //
            HashSet<Task> tasks = new HashSet<Task>();

            // We map the task ids to item ids so that we have enough info to log if a task throws an exception.
            Dictionary<int, int> taskIdToItemId = new Dictionary<int, int>();

            foreach (int itemId in itemIds)
            {
                Task task = ProcessOneItem(itemId);
                Console.WriteLine("Created task with id: {0}", task.Id);
                tasks.Add(task);
                taskIdToItemId[task.Id] = itemId;
            }

            // Process the tasks one at a time until none remain.
            while (tasks.Count > 0)
            {
                // Identify the first task that completes.
                Task task = await Task.WhenAny(tasks);

                // Remove the selected task from the set so that we don't
                // process it more than once.
                tasks.Remove(task);

                // Get the item id from our map, so that we can log rich information.
                int itemId = taskIdToItemId[task.Id];

                try
                {
                    // Await the completed task to unwrap any exception.
                    await task;
                    Console.WriteLine("Successfully processed task with id: {0}, itemId: {1}", task.Id, itemId);
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Failed to process task with id: {0}, itemId: {1}. Just logging & eating the exception {2}", task.Id, itemId, ex);
                }
            }
        }

        private static async Task ProcessOneItem(int itemId)
        {
            // Assume this method awaits on some asynchronous IO.
            Console.WriteLine("item: {0}", itemId);
        }
    }
}
The output I get is this:
item: 1
Created task with id: 1
item: 2
Created task with id: 1
item: 3
Created task with id: 1
item: 4
Created task with id: 1
Successfully processed task with id: 1, itemId: 4
So basically the program exits after awaiting just the first task.
Why do multiple short Tasks end up getting the same id? BTW I also tested with a method that returns Task<TResult> instead of Task, and in that case it works fine.
Is there a better approach I can use?
The question's code is synchronous, so there's only one completed task going around. async doesn't make something run asynchronously; it's syntactic sugar that allows using await to wait for an already-running asynchronous operation to complete without blocking the calling thread.
As for the documentation example, that's what it is: a documentation example, not a pattern, and certainly not something that can be used in production except for simple cases.
What happens if you can only make 5 requests at a time to avoid flooding your network or CPU? You'd need to keep only a fixed number of downloads in flight at a time. What if you need to process the downloaded data? What if the list of URLs comes from another thread?
Those issues are handled by concurrent containers, pub/sub patterns and the purpose-built Dataflow and Channel classes.
Dataflow
The older Dataflow classes take care of buffering input and output and handling worker tasks automatically. The entire download code can be replaced with an ActionBlock:
var client = new HttpClient(....);

// Cancel if the process takes longer than 30 minutes
var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));

var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10,
    BoundedCapacity = 5,
    CancellationToken = cts.Token
};

// Pass options so the block actually gets the DOP and capacity configured above
var block = new ActionBlock<string>(url => ProcessUrl(url, client, cts.Token), options);
That's it. The block will use up to 10 concurrent tasks to perform up to 10 concurrent downloads. It will keep up to 5 URLs in memory (it would buffer everything otherwise). If the input buffer becomes full, sending items to the block will await asynchronously, thus preventing slow downloads from flooding memory with URLs.
On the same or a different thread, the "publisher" of URLs can post as many URLs as it wants, for as long as it wants.
foreach (var url in urls)
{
    await block.SendAsync(url);
}

// Tell the block we're done
block.Complete();

// Wait until all downloads are complete
await block.Completion;
We can use other blocks like TransformBlock to produce output and pass it to another block, creating a concurrent processing pipeline. Let's say we have two methods, DownloadUrlAsync and ParseResponse, instead of just ProcessUrl:
Task<string> DownloadUrlAsync(string url, HttpClient client)
{
    return client.GetStringAsync(url);
}

void ParseResponse(string content)
{
    var json = JObject.Parse(content);
    DoSomethingWith(json);
}
We could create a separate block for each step in the pipeline, with a different DOP and buffer size for each:
var dlOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 5,
    BoundedCapacity = 5,
    CancellationToken = cts.Token
};

var downloader = new TransformBlock<string, string>(
    url => DownloadUrlAsync(url, client),
    dlOptions);

var parseOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10,
    BoundedCapacity = 2,
    CancellationToken = cts.Token
};

// Pass parseOptions so the parser actually gets its own DOP and capacity
var parser = new ActionBlock<string>(ParseResponse, parseOptions);

downloader.LinkTo(parser, new DataflowLinkOptions { PropagateCompletion = true });
We can post URLs to the downloader now and wait until all of them are parsed. By using different DOPs and capacities, we can balance the number of downloader and parser tasks to download as many URLs as we can parse, and handle e.g. slow downloads or big responses.
foreach (var url in urls)
{
    await downloader.SendAsync(url);
}

// Tell the block we're done
downloader.Complete();

// Wait until all urls are parsed
await parser.Completion;
Channels
System.Threading.Channels introduces Go-style channels. These are actually lower-level constructs than a Dataflow block. If channels had been available back in 2012, the Dataflow classes would probably have been written on top of them.
An equivalent download method would look like this:
ChannelReader<string> Downloader(ChannelReader<string> urls, HttpClient client,
    int capacity, CancellationToken token = default)
{
    var channel = Channel.CreateBounded<string>(capacity);
    var writer = channel.Writer;
    _ = Task.Run(async () =>
    {
        await foreach (var url in urls.ReadAllAsync(token))
        {
            var response = await client.GetStringAsync(url);
            await writer.WriteAsync(response);
        }
    }).ContinueWith(t => writer.Complete(t.Exception));
    return channel.Reader;
}
That's more verbose, but it allows us to do things like create the HttpClient inside the method and reuse it. Using a ChannelReader as both input and output may look weird, but now we can chain such methods simply by passing one method's output reader as another method's input.
The "magic" is that we create a worker task that waits to process messages, and we return a reader immediately. Whenever a result is produced, it's sent to the channel and on to the next step in the pipeline.
To use multiple worker tasks, we can use Enumerable.Range to start several of them and use Task.WhenAll to close the channel when all the workers are done:
ChannelReader<string> Downloader(ChannelReader<string> urls, HttpClient client,
    int capacity, int dop, CancellationToken token = default)
{
    var channel = Channel.CreateBounded<string>(capacity);
    var writer = channel.Writer;
    var tasks = Enumerable
        .Range(0, dop)
        .Select(_ => Task.Run(async () =>
        {
            await foreach (var url in urls.ReadAllAsync(token))
            {
                var response = await client.GetStringAsync(url);
                await writer.WriteAsync(response);
            }
        }))
        .ToList();
    _ = Task.WhenAll(tasks)
        .ContinueWith(t => writer.Complete(t.Exception));
    return channel.Reader;
}
Publishers can create their own channel and pass a reader to the Downloader method. They don't need to publish anything in advance either:
var channel = Channel.CreateUnbounded<string>();
var dlReader = Downloader(channel.Reader, client, 5, 5);

foreach (var url in someUrlList)
{
    await channel.Writer.WriteAsync(url);
}

channel.Writer.Complete();
Fluent pipelines
This is so common that someone could create an extension method for it. E.g., to convert an IEnumerable<T> to a ChannelReader<T>, we don't need to wait, as all the results are already available:
static ChannelReader<T> Generate<T>(this IEnumerable<T> source)
{
    var channel = Channel.CreateUnbounded<T>();
    foreach (var item in source)
    {
        channel.Writer.TryWrite(item);
    }
    channel.Writer.Complete();
    return channel.Reader;
}
If we convert the Downloader to an extension method too, we can use:
var pipeline = someUrls.Generate()
                       .Downloader(client, 5, 5);
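Draining the end of the pipeline is then just a matter of reading the final ChannelReader, e.g.:
await foreach (var response in pipeline.ReadAllAsync())
{
    // handle each downloaded response as it arrives
}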
It's because ProcessOneItem completes synchronously: it's marked async but never awaits anything, so the compiler-generated state machine hands back a cached, already-completed Task, and every call returns the same instance.
You should see the following warning:
This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
Once you add an await (...) to ProcessOneItem, the returned task will have a unique-ish id.
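For example, this minimal change makes the repro behave as expected (Task.Delay standing in for real asynchronous IO):
private static async Task ProcessOneItem(int itemId)
{
    // Once the method truly awaits, the state machine can no longer hand back
    // a cached completed task, so each call returns a distinct Task instance.
    await Task.Delay(10);
    Console.WriteLine("item: {0}", itemId);
}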
From the documentation of Task.Id property:
Task IDs are assigned on-demand and do not necessarily represent the order in which task instances are created. Note that although collisions are very rare, task identifiers are not guaranteed to be unique.
From what I understand this property is mainly there for debugging purposes. You should probably avoid depending on it for production code.
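Since the question notes that Task<TResult> works fine, one sketch of a more robust approach is to carry the item id in the task's result instead of keying a dictionary on Task.Id:
private static async Task<int> ProcessOneItem(int itemId)
{
    await Task.Delay(10); // stands in for real asynchronous IO
    return itemId;        // the task carries its own item id
}

// In the loop, tasks is a HashSet<Task<int>>:
Task<int> finished = await Task.WhenAny(tasks);
tasks.Remove(finished);
int itemId = await finished; // no id-to-item map needed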

Completion in TPL Dataflow Loops

I have a problem with determining how to detect completion within a looping TPL Dataflow.
I have a feedback loop in part of a dataflow which is making GET requests to a remote server and processing data responses (transforming these with more dataflow then committing the results).
The data source splits its results into pages of 1000 records and won't tell me how many pages it has. I have to just keep reading until I get less than a full page of data.
Usually the number of pages is 1, frequently it is up to 10, every now and again we have 1000s.
I have many requests to fetch at the start.
I want to be able to use a pool of threads to deal with this, all of which is fine, I can queue multiple requests for data and request them concurrently. If I stumble across an instance where I need to get a big number of pages I want to be using all of my threads for this. I don't want to be left with one thread churning away whilst the others have finished.
The issue I have is when I drop this logic into dataflow, such as:
//generate initial requests for activity
var request = new TransformManyBlock<int, DataRequest>(cmp => QueueRequests(cmp));

//fetch the initial requests and feed more requests back into our input buffer if we need to
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    var resp = await Fetch(req);
    if (resp.Results.Count == 1000)
        await fetch.SendAsync(QueueAnotherRequest(req));
    return resp;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

//commit each type of request
var commit = new ActionBlock<DataResponse>(async resp => await Commit(resp));

request.LinkTo(fetch);
fetch.LinkTo(commit);

//when are we complete?
QueueRequests produces an IEnumerable<DataRequest>. I queue the next N page requests at once, accepting that this means I send slightly more calls than I need to. DataRequest instances share a LastPage counter to avoid needlessly making requests that we know are after the last page. All this is fine.
The problem:
If I loop by feeding back more requests into fetch's input buffer, as I've shown in this example, then I have a problem with how to signal (or even detect) completion. I can't set completion on fetch from request, since once completion is set I can't feed back any more requests.
I can monitor for the input and output buffers being empty on fetch, but I think I'd be risking fetch still being busy with a request when I set completion, thus preventing queuing requests for additional pages.
I could do with some way of knowing that fetch is busy (either has input or is busy processing an input).
Am I missing an obvious/straightforward way to solve this?
I could loop within fetch, rather than queuing more requests. The problem with that is I want to be able to use a set maximum number of threads to throttle what I'm doing to the remote server. Could a parallel loop inside the block share a scheduler with the block itself and the resulting thread count be controlled via the scheduler?
I could create a custom transform block for fetch to handle the completion signalling. Seems like a lot of work for such a simple scenario.
Many thanks for any help offered!
In TPL Dataflow, you can link the blocks with DataflowLinkOptions, specifying that completion should be propagated:
request.LinkTo(fetch, new DataflowLinkOptions { PropagateCompletion = true });
fetch.LinkTo(commit, new DataflowLinkOptions { PropagateCompletion = true });
After that, you simply call the Complete() method for the request block, and you're done!
// the completion will be propagated to all the blocks
request.Complete();
The final thing you should use is the Completion task property of the last block:
commit.Completion.ContinueWith(t =>
{
/* check the status of the task and correctness of the requests handling */
});
For now I have added a simple busy-state counter to the fetch block:
int fetch_busy = 0;

TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    Interlocked.Increment(ref fetch_busy);
    try
    {
        var resp = await Fetch(req);
        if (resp.Results.Count == 1000)
        {
            await fetch.SendAsync(QueueAnotherRequest(req));
        }
        return resp;
    }
    finally
    {
        // decrement on both the success and failure paths without swallowing the stack trace
        Interlocked.Decrement(ref fetch_busy);
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });
Which I then use to signal completion as follows:
request.Completion.ContinueWith(async _ =>
{
    while (fetch.InputCount > 0 || fetch_busy > 0)
    {
        await Task.Delay(100);
    }
    fetch.Complete();
});
Which doesn't seem very elegant, but I think it should work.

How to correctly queue up tasks to run in C#

I have an enumeration of items (RunData.Demand), each representing some work involving calling an API over HTTP. It works great if I just foreach through it all and call the API during each iteration. However, each iteration takes a second or two so I'd like to run 2-3 threads and divide up the work between them. Here's what I'm doing:
ThreadPool.SetMaxThreads(2, 5); // Trying to limit the amount of threads
var tasks = RunData.Demand
    .Select(service => Task.Run(async delegate
    {
        var availabilityResponse = await client.QueryAvailability(service);
        // Do some other stuff, not really important
    }));

await Task.WhenAll(tasks);
The client.QueryAvailability call basically calls an API using the HttpClient class:
public async Task<QueryAvailabilityResponse> QueryAvailability(QueryAvailabilityMultidayRequest request)
{
    var response = await client.PostAsJsonAsync("api/queryavailabilitymultiday", request);

    if (response.IsSuccessStatusCode)
    {
        return await response.Content.ReadAsAsync<QueryAvailabilityResponse>();
    }

    throw new HttpException((int)response.StatusCode, response.ReasonPhrase);
}
This works great for a while, but eventually things start timing out. If I set the HttpClient Timeout to an hour, then I start getting weird internal server errors.
What I started doing was setting a Stopwatch within the QueryAvailability method to see what was going on.
What's happening is all 1200 items in RunData.Demand are being created at once and all 1200 await client.PostAsJsonAsync methods are being called. It appears it then uses the 2 threads to slowly check back on the tasks, so towards the end I have tasks that have been waiting for 9 or 10 minutes.
Here's the behavior I would like:
I'd like to create the 1,200 tasks, then run them 3-4 at a time as threads become available. I do not want to queue up 1,200 HTTP calls immediately.
Is there a good way to go about doing this?
As I always recommend: what you need is TPL Dataflow (to install: Install-Package System.Threading.Tasks.Dataflow).
You create an ActionBlock with an action to perform on each item. Set MaxDegreeOfParallelism for throttling. Start posting into it and await its completion:
var block = new ActionBlock<QueryAvailabilityMultidayRequest>(
    async service =>
    {
        var availabilityResponse = await client.QueryAvailability(service);
        // ...
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

foreach (var service in RunData.Demand)
{
    block.Post(service);
}

block.Complete();
await block.Completion;
Old question, but I would like to propose an alternative lightweight solution using the SemaphoreSlim class (in the System.Threading namespace).
SemaphoreSlim sem = new SemaphoreSlim(4, 4);

foreach (var service in RunData.Demand)
{
    await sem.WaitAsync();
    Task t = Task.Run(async () =>
    {
        // 'service' is safe to capture: foreach creates a fresh variable per iteration in C# 5+
        var availabilityResponse = await client.QueryAvailability(service);
        // do your other stuff here with the result of QueryAvailability
    });
    t.ContinueWith(done => sem.Release());
}
The semaphore acts as a locking mechanism. You can only enter it by calling Wait (or WaitAsync), which subtracts one from the count; calling Release adds one back.
You're using async HTTP calls, so limiting the number of threads will not help (nor will ParallelOptions.MaxDegreeOfParallelism in Parallel.ForEach as one of the answers suggests). Even a single thread can initiate all requests and process the results as they arrive.
One way to solve it is to use TPL Dataflow.
Another nice solution is to divide the source IEnumerable into partitions and process items in each partition sequentially as described in this blog post:
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
    return Task.WhenAll(
        from partition in Partitioner.Create(source).GetPartitions(dop)
        select Task.Run(async delegate
        {
            using (partition)
                while (partition.MoveNext())
                    await body(partition.Current);
        }));
}
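Usage might look like this, reusing the question's client (a dop of 4 is just an example value):
await RunData.Demand.ForEachAsync(4, async service =>
{
    var availabilityResponse = await client.QueryAvailability(service);
    // process the response
});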
While the Dataflow library is great, I think it's a bit heavy when not using block composition. I would tend to use something like the extension method below.
Also, unlike the Partitioner method, this runs the async methods on the calling context - the caveat being that if your code is not truly async, or takes a 'fast path', then it will effectively run synchronously since no threads are explicitly created.
public static async Task RunParallelAsync<T>(this IEnumerable<T> items, Func<T, Task> asyncAction, int maxParallel)
{
    var tasks = new List<Task>();

    foreach (var item in items)
    {
        tasks.Add(asyncAction(item));

        if (tasks.Count < maxParallel)
            continue;

        var notCompleted = tasks.Where(t => !t.IsCompleted).ToList();

        if (notCompleted.Count >= maxParallel)
            await Task.WhenAny(notCompleted);
    }

    await Task.WhenAll(tasks);
}
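A hypothetical call site, mirroring the question's loop:
await RunData.Demand.RunParallelAsync(async service =>
{
    var availabilityResponse = await client.QueryAvailability(service);
    // process the response
}, maxParallel: 4);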

How do I detect all TransformManyBlocks have completed

I have a TransformManyBlock that creates many "actors". They flow through several TransformBlocks of processing. Once all of the actors have completed all of the steps I need to process everything as a whole again.
The final block looks like this:
var CopyFiles = new TransformBlock<Actor, Actor>(async actor =>
{
    //Copy all of the files and then wait until they are all done
    await actor.CopyFiles();
    //pass me along to the next process
    return actor;
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = -1 });
How do I detect when all of the Actors have been processed? Completion seems to be propagated immediately and only tells you there are no more items to process; it doesn't tell me when the last Actor has finished processing.
Completion in TPL Dataflow is done by calling Complete, which returns immediately and signals completion. That makes the block refuse further messages, but it continues processing the items it already contains.
When a block completes processing all its items it completes its Completion task. You can await that task to be notified when all the block's work has been done.
When you link blocks together (with PropagateCompletion turned on) you only need to call Complete on the first and await the last one's Completion property:
var copyFilesBlock = new TransformBlock<Actor, Actor>(async actor =>
{
    await actor.CopyFilesAsync();
    return actor;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = -1 });

// Fill copyFilesBlock

copyFilesBlock.Complete();
await copyFilesBlock.Completion;
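For a longer pipeline the same idea looks like this sketch, where finalizeBlock stands in for whatever block ends your chain:
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
copyFilesBlock.LinkTo(finalizeBlock, linkOptions); // finalizeBlock: hypothetical last block

copyFilesBlock.Complete();      // signal that no more actors are coming
await finalizeBlock.Completion; // completes only after every actor has been processed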

How to dequeue when new item in queue

I have an application that works with a queue of strings (which correspond to different tasks the application needs to perform). At random moments the queue can be filled with strings (sometimes several times a minute, but it can also take a few hours).
Until now I have always had a timer that checked the queue every few seconds for items and removed them.
I think there must be a nicer solution than this. Is there any way to get an event or similar when an item is added to the queue?
Yes. Take a look at TPL Dataflow, in particular, the BufferBlock<T>, which does more or less the same as BlockingCollection without the nasty side-effect of jamming up your threads by leveraging async/await.
So you can:
static async Task Main()
{
    var b = new BufferBlock<string>();
    var producer = AddToBlockAsync(b);
    var consumer = ReadFromBlockAsync(b);
    await Task.WhenAll(producer, consumer);
}

public static async Task AddToBlockAsync(BufferBlock<string> b)
{
    while (true)
    {
        b.Post("hello");
        await Task.Delay(1000);
    }
}

public static async Task ReadFromBlockAsync(BufferBlock<string> b)
{
    await Task.Delay(10000); // let some messages buffer up...
    while (true)
    {
        var msg = await b.ReceiveAsync();
        Console.WriteLine(msg);
    }
}
I'd take a look at BlockingCollection.GetConsumingEnumerable. The collection is backed by a queue by default, and it's a nice way to automatically take values from the queue as they are added, using a simple foreach loop.
There is also an overload that allows you to supply a CancellationToken meaning you can cleanly break out.
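A rough sketch of that overload in use, assuming a BlockingCollection<string> named queue:
var cts = new CancellationTokenSource();
try
{
    foreach (var item in queue.GetConsumingEnumerable(cts.Token))
    {
        // process item
    }
}
catch (OperationCanceledException)
{
    // another thread called cts.Cancel(); exit cleanly
}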
Have you looked at BlockingCollection? The GetConsumingEnumerable() method allows an indefinite loop to be run on the consumer, to which new items will be yielded as they become available, with no need for timers or Thread.Sleep calls:
// Common:
BlockingCollection<string> _blockingCollection = new BlockingCollection<string>();

// Producer:
for (var i = 0; i < 100; i++)
{
    _blockingCollection.Add(i.ToString());
    Thread.Sleep(500); // So you can track the consumer synchronization. Remove.
}

// Consumer:
foreach (var item in _blockingCollection.GetConsumingEnumerable())
{
    Debug.WriteLine(item);
}
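One caveat worth noting: GetConsumingEnumerable only returns (rather than blocking and waiting for more items) once the producer signals the end of the stream:
// When the producer is finished, unblock the consumer's foreach:
_blockingCollection.CompleteAdding();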
