It seems that I do not understand TPL Dataflow error handling.
Let's assume I have a list of items I want to process and I use an ActionBlock for that:
var actionBlock = new ActionBlock<int[]>(async tasks =>
{
foreach (var task in tasks)
{
await Task.Delay(1);
if (task > 30)
{
throw new InvalidOperationException();
}
Console.WriteLine("{0} Completed", task);
}
}, new ExecutionDataflowBlockOptions
{
BoundedCapacity = 200,
MaxDegreeOfParallelism = 4
});
for (var i = 0; i < 10000; i++)
{
if (!await actionBlock.SendAsync(new[] { i }))
{
break;
}
}
actionBlock.Complete();
await actionBlock.Completion;
If an error occurs, the block transitions to the faulted state and SendAsync(...) returns false. I can then stop my loop and complete the block, and when I await the completion the exception is thrown. So far so good.
When I put a BufferBlock in between it does not work anymore:
bufferBlock.LinkTo(actionBlock, new DataflowLinkOptions
{
PropagateCompletion = true
});
for (var i = 0; i < 10000; i++)
{
if (!await bufferBlock.SendAsync(i, cts.Token))
{
break;
}
}
bufferBlock.Complete();
await actionBlock.Completion;
The call to SendAsync() just "blocks" forever, because the BufferBlock never transitions to faulted state.
The only solution I found is this:
using (var cts = new CancellationTokenSource())
{
actionBlock.Completion.ContinueWith(x =>
{
if (x.Status != TaskStatus.RanToCompletion)
{
cts.Cancel();
}
});
var i = 0;
try
{
for (i = 0; i < 10000; i++)
{
if (cts.Token.IsCancellationRequested)
{
break;
}
if (!await bufferBlock.SendAsync(i, cts.Token))
{
break;
}
}
}
catch (OperationCanceledException)
{
}
bufferBlock.Complete();
await actionBlock.Completion;
}
Because the state only propagates forward, I have to listen to the state of the last block in my network, and when this block stops I have to stop my loop.
Is this the intended way to work with the Dataflow library, or is there a better solution?
Don't allow unhandled exceptions. An unhandled exception in a block means the block and by extension the entire pipeline is terminally broken and must be aborted. That's not a TPL Dataflow bug, that's how the overall dataflow paradigm works. Exceptions are meant to signal errors up a call stack. There's no call stack in a dataflow though.
Blocks are independent workers that communicate through messages. There's no ownership relation between linked blocks and a faulting block doesn't mean any previous or following blocks should have to abort as well. That's why PropagateCompletion is false by default.
If a source links to more than one block, the messages can easily go to the other blocks. It's also possible to change the links between blocks at runtime.
In a pipeline there are two different kinds of errors:
Message errors that occur when a block/actor/worker processes a message
Pipeline errors that invalidate the pipeline and may require aborting
There's no reason to abort the pipeline if a single message faults.
Message errors
If something goes wrong while processing a message, the actor should do something with that message and proceed with the next one. That something may be:
Log the error and go on
Send an "error" message to another block
Use a Result<TMessage,TError> class in the entire pipeline instead of using raw message types, and add any errors to the result
Retry and recovery strategies can be built on top of that, e.g. forwarding any failed messages to a "retry" block or a dead message block.
The simplest way would be to just catch the exceptions and log them:
var block=new ActionBlock<int[]>(msg=>{
try
{
...
}
catch(Exception exc)
{
_logger.LogError(exc);
}
});
Another option is to manually post to e.g. a dead-letter queue:
var dead = new BufferBlock<(int[] data, Exception error)>();
var block = new ActionBlock<int[]>(async msg => {
    try
    {
        ...
    }
    catch (Exception exc)
    {
        // Keep the failed message together with its exception
        await dead.SendAsync((msg, exc));
        _logger.LogError(exc);
    }
});
Going even further, one could define a Result<TMessage,TError> class to wrap results. Downstream blocks could ignore faulted results. The LinkTo predicate can also be used to reroute error messages. I'll cheat and hard-code the error to Exception. A better implementation would use different types for success and error :
record Result<TMessage>(TMessage? Message, Exception? Error)
{
    public bool HasError => Error != null;
}
var block1 = new TransformBlock<Result<int[]>, Result<double>>(msg => {
    if (msg.HasError)
    {
        // Propagate the error message
        return new Result<double>(default, msg.Error);
    }
    try
    {
        var sum = (double)msg.Message.Sum();
        if (sum % 5 == 0)
        {
            throw new Exception("Why not?");
        }
        return new Result<double>(sum, null);
    }
    catch (Exception exc)
    {
        // Wrap the failure in a result instead of faulting the block
        return new Result<double>(default, exc);
    }
});
var block2=new ActionBlock<Result<double>>(...);
block1.LinkTo(block2);
Another option is to redirect error messages to a different block:
var errorBlock=new ActionBlock<Result<int[]>>(msg=>{
_logger.LogError(msg.Error);
});
block1.LinkTo(errorBlock,msg=>msg.HasError);
block1.LinkTo(block2);
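This redirects all errored messages to the error block. All other messages move on to block2. The order of the links matters: the filtered link must be created first, otherwise the unfiltered link to block2 would accept every message.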
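The predicate-based routing can be tried out in isolation. What follows is only a minimal sketch, assuming C# 9 or later: the Result<T> record, the ErrorRoutingDemo class and the "empty batch fails" rule are hypothetical names and rules chosen to keep the example self-contained, not part of the pipeline discussed above.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical wrapper type, modelled on the Result idea described above.
public record Result<T>(T? Value, Exception? Error)
{
    public bool HasError => Error != null;
}

public static class ErrorRoutingDemo
{
    // Returns how many messages reached the normal target and the error target.
    public static async Task<(int Ok, int Failed)> RunAsync()
    {
        int ok = 0, failed = 0;

        var sum = new TransformBlock<int[], Result<double>>(data =>
        {
            try
            {
                // Assumed failure rule, for illustration only.
                if (data.Length == 0) throw new InvalidOperationException("Empty batch");
                return new Result<double>(data.Sum(), null);
            }
            catch (Exception exc)
            {
                // Turn the exception into a message instead of faulting the block.
                return new Result<double>(default, exc);
            }
        });

        // Each ActionBlock handles one message at a time by default,
        // so the plain counters are safe here.
        var errors = new ActionBlock<Result<double>>(r => { failed++; });
        var output = new ActionBlock<Result<double>>(r => { ok++; });

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        sum.LinkTo(errors, link, r => r.HasError); // filtered link first
        sum.LinkTo(output, link);                  // everything else

        sum.Post(new[] { 1, 2, 3 });
        sum.Post(Array.Empty<int>()); // becomes an error result
        sum.Post(new[] { 4, 5 });
        sum.Complete();

        await Task.WhenAll(errors.Completion, output.Completion);
        return (ok, failed);
    }

    public static async Task Main()
    {
        var (ok, failed) = await RunAsync();
        Console.WriteLine($"ok={ok}, failed={failed}");
    }
}
```

Note that the failing message never faults the block: the pipeline completes normally and both targets finish.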
Pipeline errors
In some cases, an error is so severe the current block can't recover and perhaps even the entire pipeline must be cancelled/aborted. Cancellation in .NET is handled through a CancellationToken. All blocks accept a CancellationToken to allow aborting.
There's no single abort strategy that's appropriate to all pipelines. Propagating cancellation forward is common but definitely not the only option.
In the simplest case, a block is created with a token from a shared CancellationTokenSource:
var pipeLineCancellation = new CancellationTokenSource();
var block1=new TransformBlock<Result<int[]>,Result<double>>(msg=>{
...
},
new ExecutionDataflowBlockOptions {
CancellationToken=pipeLineCancellation.Token
});
The block exception handler could request cancellation in case of a serious error :
//Wrong table name. We can't use the database
catch(SqlException exc) when (exc.Number ==208)
{
...
pipeLineCancellation.Cancel();
}
This would abort all blocks that use the same CancellationTokenSource. That doesn't mean that all blocks should be connected to the same CancellationTokenSource though.
Flowing cancellation backwards
In Go pipelines it's common to use an error channel that sends a cancellation message to the previous block. The same can be done in C# using linked CancellationTokenSources. One could even say this is better than Go.
It's possible to create multiple linked CancellationTokenSources with CreateLinkedTokenSource. By creating sources that link backwards we can have a block signal cancellation for its own source and have the cancellation flow to the root.
var cts5=new CancellationTokenSource();
var cts4=CancellationTokenSource.CreateLinkedTokenSource(cts5.Token);
...
var cts1=CancellationTokenSource.CreateLinkedTokenSource(cts2.Token);
...
var block3=new TransformBlock<Result<int[]>,Result<double>>(msg=>{
...
catch(SqlException)
{
cts3.Cancel();
}
},
new ExecutionDataflowBlockOptions {
CancellationToken=cts3.Token
});
This will signal cancellation backwards, block by block, without cancelling the downstream blocks.
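The direction of propagation is easy to get wrong, so here is a small sketch that checks it with three sources only. The names upstream, middle and downstream (and the LinkedCancellationDemo class) are hypothetical; they stand in for cts1, cts3 and cts5 in the snippet above.

```csharp
using System;
using System.Threading;

public static class LinkedCancellationDemo
{
    // Cancels the middle source and reports which sources ended up cancelled.
    public static (bool Upstream, bool Middle, bool Downstream) Run()
    {
        // The source of the last stage is created first...
        using var downstream = new CancellationTokenSource();
        // ...and each earlier stage links to the token of the stage after it.
        using var middle = CancellationTokenSource.CreateLinkedTokenSource(downstream.Token);
        using var upstream = CancellationTokenSource.CreateLinkedTokenSource(middle.Token);

        // A failure in the middle stage cancels only its own source.
        middle.Cancel();

        return (upstream.IsCancellationRequested,
                middle.IsCancellationRequested,
                downstream.IsCancellationRequested);
    }

    public static void Main()
    {
        var (up, mid, down) = Run();
        Console.WriteLine($"upstream={up}, middle={mid}, downstream={down}");
    }
}
```

A linked source is cancelled when the token it was created from is cancelled, never the other way around, which is why the downstream source stays untouched.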
Pipeline Patterns
Dataflow in .NET is a gem few people know about, so it's really hard to find good references and patterns. The concepts are similar in Go though, so one could use the patterns found in Go Concurrency Patterns: Pipelines and cancellation.
The TPL Dataflow implements the processing loop and completion propagation so one typically only needs to provide the Action or Func that processes messages. The rest of the patterns have to be implemented, although .NET offers some advantages over Go.
The done channel is essentially a CancellationTokenSource.
Fan-in, fan-out are already handled through existing blocks, or can be handled using a relatively simple custom block that clones messages
CancellationTokenSources can be linked explicitly. In Go each "stage" (essentially a block) has to propagate completion/cancellation to other stages
One CancellationTokenSource can be used by all stages/blocks.
Linking allows not just easier composition but even runtime modifications to the pipeline/mesh.
Let's say we want to just stop processing messages after a while, even though there's no error. All that's needed is to create a CTS used by all blocks:
var pipeLineCancellation = new CancellationTokenSource();
var block1=new TransformBlock<Result<int[]>,Result<double>>(msg=>{
...
},
new ExecutionDataflowBlockOptions {
CancellationToken=pipeLineCancellation.Token
});
var block2 =.....;
pipeLineCancellation.Cancel();
Perhaps we want to run the pipeline for only a minute? Easy with
var pipeLineCancellation = new CancellationTokenSource(60000);
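To see what a timed-out pipeline looks like from the outside, here is a sketch with a single slow block and a much shorter timeout. The TimeoutDemo class, the 200 ms timeout and the 50 ms of simulated work are assumptions made for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class TimeoutDemo
{
    // Returns the final status of the block's Completion task.
    public static async Task<TaskStatus> RunAsync()
    {
        // Cancels itself after 200 ms (the text above uses 60000 ms).
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(200));

        var block = new ActionBlock<int>(
            async n => await Task.Delay(50), // simulated slow work
            new ExecutionDataflowBlockOptions { CancellationToken = cts.Token });

        // Far more work than fits into the timeout.
        for (int i = 0; i < 1000; i++) block.Post(i);
        block.Complete();

        try
        {
            await block.Completion;
        }
        catch (OperationCanceledException)
        {
            // Expected: the block dropped its remaining messages.
        }
        return block.Completion.Status;
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync());
    }
}
```

The block ends in the Canceled state rather than Faulted, so callers can tell a deliberate stop from an error.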
There are some disadvantages too, as a Dataflow block has no access to the "channels" or control over the loop:
In Go it's easy to pass data, error and done channels to each stage, simplifying the error reporting and completion. In .NET the block delegates may have to access other blocks or CTSs directly.
In Go it's easier to use common state to e.g. accumulate data, or manage session/remote connection state. Imagine a stage/block that controls a screen scraper like Selenium. We really don't want to restart the browser on every message.
Or we may want to insert data into a database using SqlBulkCopy. With an ActionBlock we'd have to create a new instance for each batch, which may or may not be a problem.
Related
I've been working on a project and saw the code below. I am new to the async/await world. As far as I know, only a single task is performed in the method, so why is it decorated with async/await? What benefits am I getting by using async/await, and what is the drawback if I remove async/await, i.e. make it synchronous? I am a little confused, so any help will be appreciated.
[Route("UpdatePersonalInformation")]
public async Task<DataTransferObject<bool>> UpdatePersonalInformation([FromBody] UserPersonalInformationRequestModel model)
{
DataTransferObject<bool> transfer = new DataTransferObject<bool>();
try
{
model.UserId = UserIdentity;
transfer = await _userService.UpdateUserPersonalInformation(model);
}
catch (Exception ex)
{
transfer.TransactionStatusCode = 500;
transfer.ErrorMessage = ex.Message;
}
return transfer;
}
Service code
public async Task<DataTransferObject<bool>> UpdateUserPersonalInformation(UserPersonalInformationRequestModel model)
{
DataTransferObject<bool> transfer = new DataTransferObject<bool>();
await Task.Run(() =>
{
try
{
var data = _userProfileRepository.FindBy(x => x.AspNetUserId == model.UserId)?.FirstOrDefault();
if (data != null)
{
var userProfile = mapper.Map<UserProfile>(model);
userProfile.UpdatedBy = model.UserId;
userProfile.UpdateOn = DateTime.UtcNow;
userProfile.CreatedBy = data.CreatedBy;
userProfile.CreatedOn = data.CreatedOn;
userProfile.Id = data.Id;
userProfile.TypeId = data.TypeId;
userProfile.AspNetUserId = data.AspNetUserId;
userProfile.ProfileStatus = data.ProfileStatus;
userProfile.MemberSince = DateTime.UtcNow;
if(userProfile.DOB==DateTime.MinValue)
{
userProfile.DOB = null;
}
_userProfileRepository.Update(userProfile);
transfer.Value = true;
}
else
{
transfer.Value = false;
transfer.Message = "Invalid User";
}
}
catch (Exception ex)
{
transfer.ErrorMessage = ex.Message;
}
});
return transfer;
}
What benefits am I getting by using async/await?
Normally, on ASP.NET, the benefit of async is that your server is more scalable - i.e., can handle more requests than it otherwise could. The "Synchronous vs. Asynchronous Request Handling" section of this article goes into more detail, but the short explanation is that async/await frees up a thread so that it can handle other requests while the asynchronous work is being done.
However, in this specific case, that's not actually what's going on. Using async/await in ASP.NET is good and proper, but using Task.Run on ASP.NET is not. Because what happens with Task.Run is that another thread is used to run the delegate within UpdateUserPersonalInformation. So this isn't asynchronous; it's just synchronous code running on a background thread. UpdateUserPersonalInformation will take another thread pool thread to run its synchronous repository call and then yield the request thread by using await. So it's just doing a thread switch for no benefit at all.
A proper implementation would make the repository asynchronous first, and then UpdateUserPersonalInformation can be implemented without Task.Run at all:
public async Task<DataTransferObject<bool>> UpdateUserPersonalInformation(UserPersonalInformationRequestModel model)
{
DataTransferObject<bool> transfer = new DataTransferObject<bool>();
try
{
var data = _userProfileRepository.FindBy(x => x.AspNetUserId == model.UserId)?.FirstOrDefault();
if (data != null)
{
...
await _userProfileRepository.UpdateAsync(userProfile);
transfer.Value = true;
}
else
{
transfer.Value = false;
transfer.Message = "Invalid User";
}
}
catch (Exception ex)
{
transfer.ErrorMessage = ex.Message;
}
return transfer;
}
The await keyword indicates that execution of the current method is suspended until the awaited Task has completed. This means that if you remove the await, the method will continue executing and therefore return immediately, even if the UpdateUserPersonalInformation Task has not finished.
Take a look at this example:
private void showInfo()
{
Task.Delay(1000);
MessageBox.Show("Info");
}
private async void showInfoAsync()
{
await Task.Delay(1000);
MessageBox.Show("Info");
}
In the first method, the MessageBox is displayed immediately, since the newly created Task (which only waits a specified amount of time) is not awaited. However, the second method uses the await keyword, so the MessageBox is displayed only after the Task has finished (in the example, after 1000 ms have elapsed).
But in both cases the delay Task runs asynchronously in the background, so the main thread (for example the UI) will not freeze.
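The same difference can be observed without a UI. This is a console sketch under the assumption that elapsed time is a good enough stand-in for "when the MessageBox would appear"; the AwaitDemo class and its method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class AwaitDemo
{
    // Starts the delay but does not wait for it, like showInfo above.
    public static long WithoutAwait()
    {
        var sw = Stopwatch.StartNew();
        _ = Task.Delay(500); // the task runs, but nobody waits for it
        return sw.ElapsedMilliseconds;
    }

    // Waits for the delay before continuing, like showInfoAsync above.
    public static async Task<long> WithAwait()
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(500);
        return sw.ElapsedMilliseconds;
    }

    public static async Task Main()
    {
        Console.WriteLine($"without await: {WithoutAwait()} ms");
        Console.WriteLine($"with await:    {await WithAwait()} ms");
    }
}
```

The first method returns almost immediately, while the second takes roughly the full delay. The sketch returns Task instead of using async void so that the caller can await it.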
The async/await mechanism is mainly used:
when you have a long-running operation that takes some time and you want it to run in the background
in a UI, when you don't want to block the main thread, which would hurt the responsiveness of the UI.
You can read more here:
https://learn.microsoft.com/en-us/dotnet/csharp/async
Timeouts
One of the main uses of async and await is to prevent timeouts by waiting for long operations to complete. However, there is another, less well-known but very powerful use.
If you don't await a long operation, you will get a result back, such as null, even though the actual request has not completed yet.
Cancellation Tokens
Async requests have a default parameter you can add:
public async Task<DataTransferObject<bool>> UpdatePersonalInformation(
[FromBody] UserPersonalInformationRequestModel model,
CancellationToken cancellationToken){..}
A CancellationToken allows the request to stop when the user changes pages or interrupts the connection. A good example of this is a user has a search box, and every time a letter is typed you filter and search results from your API. Now imagine the user types a very long string with say 15 characters. That means that 15 requests are sent and 15 requests need to be completed. Even if the front end is not awaiting the first 14 results, the API is still doing all the 15 requests.
A cancellation token simply tells the API to stop working on requests whose results are no longer needed.
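How a method reacts to such a token can be sketched outside ASP.NET. In the example below the CancellationDemo class, the CountAsync method and the timings are hypothetical; a timer-based CancellationTokenSource stands in for the client that disconnects.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationDemo
{
    // Simulated long-running work that observes the token at every step.
    public static async Task<int> CountAsync(int steps, CancellationToken ct)
    {
        int done = 0;
        for (int i = 0; i < steps; i++)
        {
            // Task.Delay throws OperationCanceledException once ct is cancelled.
            await Task.Delay(20, ct);
            done++;
        }
        return done;
    }

    // Returns true when the work was abandoned before it finished.
    public static async Task<bool> RunAsync()
    {
        // Stands in for a client that goes away after 100 ms.
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
        try
        {
            await CountAsync(1000, cts.Token);
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
    }

    public static async Task Main()
    {
        Console.WriteLine($"cancelled: {await RunAsync()}");
    }
}
```

The token only has an effect if it is passed on to the calls that do the waiting; a method that accepts a token but never checks or forwards it keeps running to the end.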
I would like to chime in on this because most answers, although good, do not point to a definite rule for when to use it and when not to.
From my experience, if you are developing anything with a front end, add async/await to your methods when expecting output from other threads to be input to your UI. This is the best strategy for handling multithreaded output, and Microsoft should be commended for coming out with this when they did. Without async/await you would have to add more code to handle thread output to the UI (e.g. Event, Event Handler, Delegate, Event Subscription, Marshaller).
You don't need it anywhere else, except when using it strategically for slow peripherals.
I'm trying to create an AWS SQS Windows service consumer that will poll messages in batches of 10. Each message will be executed in its own task for parallel execution. Message processing includes calling different APIs and sending email, so it might take some time.
My problem is that first, I only want to poll the queue when 10 messages can be processed immediately. This is due to the SQS visibility timeout: having the received messages "wait" might exceed the visibility timeout and put them "back" on the queue, which produces duplication. I don't think tweaking the visibility timeout is good, because there is still a chance that messages will be duplicated, and that's what I'm trying to avoid. Second, I want to have some sort of limit on parallelism (e.g. a max limit of 100 concurrent tasks), so that server resources can be kept in check, since there are also other apps running on the server.
How can I achieve this? Or is there any other way to remedy these problems?
This answer makes the following assumptions:
Fetching messages from the AWS should be serialized. Only the processing of messages should be parallelized.
Every message fetched from the AWS should be processed. The whole execution should not terminate before all fetched messages have a chance to be processed.
Every message-processing operation should be awaited. The whole execution should not terminate before the completion of all started tasks.
Any error that occurs during the processing of a message should be ignored. The whole execution should not terminate because the processing of a single message failed.
Any error that occurs during the fetching of messages from the AWS should be fatal. The whole execution should terminate, but not before all currently running message-processing operations have completed.
The execution mechanism should be able to handle the case that a fetch-from-the-AWS operation returned a batch having a different number of messages than the requested number.
Below is an implementation that (hopefully) satisfies these requirements:
/// <summary>
/// Starts an execution loop that fetches batches of messages sequentially,
/// and processes them one by one in parallel.
/// </summary>
public static async Task ExecutionLoopAsync<TMessage>(
Func<int, Task<TMessage[]>> fetchMessagesAsync,
Func<TMessage, Task> processMessageAsync,
int fetchCount,
int maxDegreeOfParallelism,
CancellationToken cancellationToken = default)
{
// Arguments validation omitted
var semaphore = new SemaphoreSlim(maxDegreeOfParallelism, maxDegreeOfParallelism);
// Count how many times we have acquired the semaphore, so that we know
// how many more times we have to acquire it before we exit from this method.
int acquiredCount = 0;
try
{
while (true)
{
Debug.Assert(acquiredCount == 0);
for (int i = 0; i < fetchCount; i++)
{
await semaphore.WaitAsync(cancellationToken);
acquiredCount++;
}
TMessage[] messages = await fetchMessagesAsync(fetchCount)
?? Array.Empty<TMessage>();
for (int i = 0; i < messages.Length; i++)
{
if (i >= fetchCount) // We got more messages than we asked for
{
await semaphore.WaitAsync();
acquiredCount++;
}
ProcessAndRelease(messages[i]);
acquiredCount--;
}
if (messages.Length < fetchCount)
{
// We got fewer messages than we asked for
semaphore.Release(fetchCount - messages.Length);
acquiredCount -= fetchCount - messages.Length;
}
// This method is 'async void' because it is not expected to throw ever
async void ProcessAndRelease(TMessage message)
{
try { await processMessageAsync(message); }
catch { } // Swallow exceptions
finally { semaphore.Release(); }
}
}
}
catch (SemaphoreFullException)
{
// Guard against the (unlikely) scenario that the counting logic is flawed.
// The counter is no longer reliable, so skip the awaiting in finally.
acquiredCount = maxDegreeOfParallelism;
throw;
}
finally
{
// Wait for all pending operations to complete. This could cause a deadlock
// in case the counter has become out of sync.
for (int i = acquiredCount; i < maxDegreeOfParallelism; i++)
await semaphore.WaitAsync();
}
}
Usage example:
var cts = new CancellationTokenSource();
Task executionTask = ExecutionLoopAsync<Message>(async count =>
{
return await GetBatchFromAwsAsync(count);
}, async message =>
{
await ProcessMessageAsync(message);
}, fetchCount: 10, maxDegreeOfParallelism: 100, cts.Token);
I have a problem with determining how to detect completion within a looping TPL Dataflow.
I have a feedback loop in part of a dataflow which is making GET requests to a remote server and processing data responses (transforming these with more dataflow then committing the results).
The data source splits its results into pages of 1000 records, and won't tell me how many pages it has available for me. I have to just keep reading until I get less than a full page of data.
Usually the number of pages is 1, frequently it is up to 10, every now and again we have 1000s.
I have many requests to fetch at the start.
I want to be able to use a pool of threads to deal with this, all of which is fine, I can queue multiple requests for data and request them concurrently. If I stumble across an instance where I need to get a big number of pages I want to be using all of my threads for this. I don't want to be left with one thread churning away whilst the others have finished.
The issue I have is when I drop this logic into dataflow, such as:
//generate initial requests for activity
var request = new TransformManyBlock<int, DataRequest>(cmp => QueueRequests(cmp));
//fetch the initial requests and feedback more requests to our input buffer if we need to
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
var resp = await Fetch(req);
if (resp.Results.Count == 1000)
await fetch.SendAsync(QueueAnotherRequest(req));
return resp;
}
, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });
//commit each type of request
var commit = new ActionBlock<DataResponse>(async resp => await Commit(resp));
request.LinkTo(fetch);
fetch.LinkTo(commit);
//when are we complete?
QueueRequests produces an IEnumerable<DataRequest>. I queue the next N page requests at once, accepting that this means I send slightly more calls than I need to. DataRequest instances share a LastPage counter to avoid needlessly making requests that we know are after the last page. All this is fine.
The problem:
If I loop by feeding back more requests into fetch's input buffer as I've shown in this example, then I have a problem with how to signal (or even detect) completion. I can't set completion on fetch from request, as once completion is set I can't feed back any more requests.
I can monitor for the input and output buffers being empty on fetch, but I think I'd be risking fetch still being busy with a request when I set completion, thus preventing queuing requests for additional pages.
I could do with some way of knowing that fetch is busy (either has input or is busy processing an input).
Am I missing an obvious/straightforward way to solve this?
I could loop within fetch, rather than queuing more requests. The problem with that is I want to be able to use a set maximum number of threads to throttle what I'm doing to the remote server. Could a parallel loop inside the block share a scheduler with the block itself and the resulting thread count be controlled via the scheduler?
I could create a custom transform block for fetch to handle the completion signalling. Seems like a lot of work for such a simple scenario.
Many thanks for any help offered!
In TPL Dataflow, you can link the blocks using DataflowLinkOptions that specify that completion should be propagated:
request.LinkTo(fetch, new DataflowLinkOptions { PropagateCompletion = true });
fetch.LinkTo(commit, new DataflowLinkOptions { PropagateCompletion = true });
After that, you simply call the Complete() method for the request block, and you're done!
// the completion will be propagated to all the blocks
request.Complete();
The final thing you should use is the Completion task property of the last block:
commit.Completion.ContinueWith(t =>
{
/* check the status of the task and correctness of the requests handling */
});
For now I have added a simple busy state counter to the fetch block:
int fetch_busy = 0;
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    Interlocked.Increment(ref fetch_busy);
    try
    {
        var resp = await Fetch(req);
        if (resp.Results.Count == 1000)
        {
            await fetch.SendAsync(QueueAnotherRequest(req));
        }
        return resp;
    }
    finally
    {
        // Always decrement, whether the fetch succeeded or threw
        Interlocked.Decrement(ref fetch_busy);
    }
}
, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });
I then use this to signal completion as follows:
request.Completion.ContinueWith(async _ =>
{
while ( fetch.InputCount > 0 || fetch_busy > 0 )
{
await Task.Delay(100);
}
fetch.Complete();
});
This doesn't seem very elegant, but it should work, I think.
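A variation on the counter idea avoids the polling loop: count every request that has been queued but not yet finished, and complete the block from inside the delegate when that count reaches zero. The sketch below is an assumption-laden illustration, not the code from the question: it uses an ActionBlock<int> of page numbers, a fake fetch (Task.Delay), and the made-up rule that every page below 5 is "full" and therefore needs a follow-up request.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class FeedbackLoopDemo
{
    // Returns the total number of pages that were processed.
    public static async Task<int> RunAsync()
    {
        int[] initialPages = { 1, 3 };
        // Requests that are queued or running. It is set before anything is
        // posted, so it cannot reach zero while initial requests are missing.
        int pending = initialPages.Length;
        int processed = 0;

        ActionBlock<int> fetch = null!;
        fetch = new ActionBlock<int>(async page =>
        {
            await Task.Delay(10); // simulated remote call
            Interlocked.Increment(ref processed);

            if (page < 5) // assumed rule: a "full" page needs a follow-up
            {
                // Count the follow-up BEFORE this request is counted as done.
                Interlocked.Increment(ref pending);
                fetch.Post(page + 1);
            }

            // The request that brings the counter to zero is the last one.
            if (Interlocked.Decrement(ref pending) == 0)
            {
                fetch.Complete();
            }
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        foreach (var page in initialPages) fetch.Post(page);

        await fetch.Completion;
        return processed;
    }

    public static async Task Main()
    {
        Console.WriteLine($"processed {await RunAsync()} pages");
    }
}
```

The ordering inside the delegate is what makes this safe: a follow-up is always counted before its parent is subtracted, so the counter can only hit zero once nothing is queued or running. The sketch assumes an unbounded block and a delegate that does not throw; with a BoundedCapacity or a fault, the counter would need the same try/finally care as the busy counter above.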
I am trying to implement a data processing pipeline using TPL Dataflow. However, I am relatively new to dataflow and not completely sure how to use it properly for the problem I am trying to solve.
Problem:
I am trying to iterate through the list of files and process each file to read some data and then further process that data. Each file is roughly 700MB to 1GB in size. Each file contains JSON data. In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data.
Once I get list of files, I want to process maximum 4-5 files at a time in parallel. My confusion comes from:
How to use IEnumerable<> and yield return with async/await and dataflow. I came across this answer by svick, but I am still not sure how to convert IEnumerable<> to ISourceBlock and then link all blocks together and track completion.
In my case, producer will be really fast (going through list of files), but consumer will be very slow (processing each file - read data, deserialize JSON). In this case, how to track completion.
Should I use LinkTo feature of datablocks to connect various blocks? or use method such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another.
Code:
private const int ProcessingSize= 4;
private BufferBlock<string> _fileBufferBlock;
private ActionBlock<string> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync()
{
PrepareDataflow(token);
var bufferTask = ListFilesAsync(_fileBufferBlock, token);
var tasks = new List<Task> { bufferTask, _processingBlock.Completion };
return Task.WhenAll(tasks);
}
private async Task ListFilesAsync(ITargetBlock<string> targetBlock, CancellationToken token)
{
...
// Get list of file Uris
...
foreach(var fileNameUri in fileNameUris)
await targetBlock.SendAsync(fileNameUri, token);
targetBlock.Complete();
}
private async Task ProcessFileAsync(string fileNameUri, CancellationToken token)
{
var httpClient = new HttpClient();
try
{
using (var stream = await httpClient.GetStreamAsync(fileNameUri))
using (var sr = new StreamReader(stream))
using (var jsonTextReader = new JsonTextReader(sr))
{
while (jsonTextReader.Read())
{
if (jsonTextReader.TokenType == JsonToken.StartObject)
{
try
{
var data = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
await _messageBufferBlock.SendAsync(data, token);
}
catch (Exception ex)
{
_logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
}
}
}
}
}
catch(Exception ex)
{
// Should throw?
// Or if converted to block then report using Fault() method?
}
finally
{
httpClient.Dispose();
buffer.Complete();
}
}
private void PrepareDataflow(CancellationToken token)
{
_fileBufferBlock = new BufferBlock<string>(new DataflowBlockOptions
{
CancellationToken = token
});
var actionExecuteOptions = new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = ProcessingSize,
MaxMessagesPerTask = 1,
MaxDegreeOfParallelism = ProcessingSize
};
_processingBlock = new ActionBlock<string>(async fileName =>
{
try
{
await ProcessFileAsync(fileName, token);
}
catch (Exception ex)
{
_logger.Fatal(ex, $"Failed to process file: {fileName}, Error: {ex.Message}");
// Should fault the block?
}
}, actionExecuteOptions);
_fileBufferBlock.LinkTo(_processingBlock, new DataflowLinkOptions { PropagateCompletion = true });
_messageBufferBlock = new BufferBlock<DataType>(new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = 50000
});
_messageBufferBlock.LinkTo(DataflowBlock.NullTarget<DataType>());
}
In the above code, I am not using IEnumerable<DataType> and yield return, as I cannot use them with async/await. So I am linking the input buffer to an ActionBlock<DataType>, which in turn posts to another queue. However, by using ActionBlock<>, I cannot link it to the next block for processing and have to manually Post/SendAsync from ActionBlock<> to BufferBlock<>. Also, in this case, I am not sure how to track completion.
This code works, but I am sure there could be a better solution than this, where I can just link all the blocks (instead of using ActionBlock<DataType> and then sending messages from it to BufferBlock<DataType>).
Another option could be to convert IEnumerable<> to IObservable<> using Rx, but again I am not very familiar with Rx and don't know exactly how to mix TPL Dataflow and Rx.
Question 1
You plug an IEnumerable<T> producer into your TPL Dataflow chain by using Post or SendAsync directly on the consumer block, as follows:
foreach (string fileNameUri in fileNameUris)
{
await _processingBlock.SendAsync(fileNameUri).ConfigureAwait(false);
}
You can also use a BufferBlock<TInput>, but in your case it actually seems rather unnecessary (or even harmful - see the next part).
Question 2
When would you prefer SendAsync instead of Post? If your producer runs faster than the URIs can be processed (and you have indicated this to be the case), and you choose to give your _processingBlock a BoundedCapacity, then when the block's internal buffer reaches the specified capacity, your SendAsync will "hang" until a buffer slot frees up, and your foreach loop will be throttled. This feedback mechanism creates back pressure and ensures that you don't run out of memory.
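The difference between the two calls on a full block can be sketched as follows. The BackpressureDemo class, the capacity of 1 and the 100 ms of simulated work are assumptions made to keep the behaviour easy to observe.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BackpressureDemo
{
    // Returns whether each of the three messages was accepted.
    public static async Task<(bool First, bool Second, bool Third)> RunAsync()
    {
        var slow = new ActionBlock<int>(
            async n => await Task.Delay(100), // simulated slow consumer
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        // The block is empty, so the first message is accepted.
        bool first = slow.Post(1);

        // The single slot is still taken, so Post gives up immediately.
        bool second = slow.Post(2);

        // SendAsync waits until the slot frees up and then delivers.
        bool third = await slow.SendAsync(3);

        slow.Complete();
        await slow.Completion;
        return (first, second, third);
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync());
    }
}
```

Post never waits, so on a bounded block a declined message is simply lost unless the caller checks the return value. Awaiting SendAsync is what throttles the producer.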
Question 3
You should definitely use the LinkTo method to link your blocks in most cases. Unfortunately yours is a corner case due to the interplay of IDisposable and very large (potentially) sequences. So your completion will flow automatically between the buffer and processing blocks (due to LinkTo), but after that - you need to propagate it manually. This is tricky, but doable.
I'll illustrate this with a "Hello World" example where the producer iterates over each character and the consumer (which is really slow) outputs each character to the Debug window.
Note: LinkTo is not present.
// REALLY slow consumer.
var consumer = new ActionBlock<char>(async c =>
{
await Task.Delay(100);
Debug.Print(c.ToString());
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
var producer = new ActionBlock<string>(async s =>
{
foreach (char c in s)
{
await consumer.SendAsync(c);
Debug.Print($"Yielded {c}");
}
});
try
{
producer.Post("Hello world");
producer.Complete();
await producer.Completion;
}
finally
{
consumer.Complete();
}
// Observe combined producer and consumer completion/exceptions/cancellation.
await Task.WhenAll(producer.Completion, consumer.Completion);
This outputs:
Yielded H
H
Yielded e
e
Yielded l
l
Yielded l
l
Yielded o
o
Yielded
Yielded w
w
Yielded o
o
Yielded r
r
Yielded l
l
Yielded d
d
As you can see from the output above, the producer is throttled and the handover buffer between the blocks never grows too large.
EDIT
You might find it cleaner to propagate completion via
producer.Completion.ContinueWith(
_ => consumer.Complete(), TaskContinuationOptions.ExecuteSynchronously
);
... right after producer definition. This allows you to slightly reduce producer/consumer coupling - but at the end you still have to remember to observe Task.WhenAll(producer.Completion, consumer.Completion).
In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data.
I don't believe this step is necessary. What you're actually avoiding here is just a list of filenames. Even if you had millions of files, the list of filenames is just not going to take up a significant amount of memory.
I am linking input buffer to ActionBlock which in turn posts to another queue. However by using ActionBlock<>, I cannot link it to next block for processing and have to manually Post/SendAsync from ActionBlock<> to BufferBlock<>. Also, in this case, not sure, how to track completion.
ActionBlock<TInput> is an "end of the line" block. It only accepts input and does not produce any output. In your case, you don't want ActionBlock<TInput>; you want TransformManyBlock<TInput, TOutput>, which takes input, runs a function on it, and produces output (with any number of output items for each input item).
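To make the "any number of output items for each input item" part concrete, here is a small sketch that is unrelated to your file-processing code. The class name TransformManyDemo and the word-splitting task are assumptions chosen only for illustration. Each input line produces zero or more words, which flow through a link to a terminal ActionBlock.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class TransformManyDemo
{
    public static async Task<List<string>> SplitAllAsync(IEnumerable<string> lines)
    {
        // One input line yields zero or more output words.
        var splitter = new TransformManyBlock<string, string>(
            line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

        // Terminal block; with the default MaxDegreeOfParallelism of 1,
        // adding to the list from here is safe.
        var words = new List<string>();
        var collector = new ActionBlock<string>(word => words.Add(word));

        // The link moves data and, with PropagateCompletion, completion as well.
        splitter.LinkTo(collector, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var line in lines)
            await splitter.SendAsync(line);

        splitter.Complete();
        await collector.Completion;
        return words;
    }
}
```

Because the blocks are linked, there is no manual Post/SendAsync between them, and awaiting the last block's Completion is enough to track the whole chain.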
Another point to keep in mind is that all dataflow blocks have an input buffer. So the extra BufferBlock is unnecessary.
Finally, if you're already in "dataflow land", it's usually best to end with a dataflow block that actually does something (e.g., ActionBlock instead of BufferBlock). In this case, you could use the BufferBlock as a bounded producer/consumer queue, where some other code is consuming the results. Personally, I would consider that it may be cleaner to rewrite the consuming code as the action of an ActionBlock, but it may also be cleaner to keep the consumer independent of the dataflow. For the code below, I left in the final bounded BufferBlock, but if you use this solution, consider changing that final block to a bounded ActionBlock instead.
private const int ProcessingSize = 4;
private static readonly HttpClient HttpClient = new HttpClient();
private TransformManyBlock<string, DataType> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync(CancellationToken token)
{
    PrepareDataflow(token);
    ListFiles(_processingBlock, token);
    _processingBlock.Complete();
    return _processingBlock.Completion;
}
private void ListFiles(ITargetBlock<string> targetBlock, CancellationToken token)
{
    ... // Get list of file Uris, occasionally calling token.ThrowIfCancellationRequested()
    foreach (var fileNameUri in fileNameUris)
        targetBlock.Post(fileNameUri);
}
private async Task<IEnumerable<DataType>> ProcessFileAsync(string fileNameUri, CancellationToken token)
{
    return Process(await HttpClient.GetStreamAsync(fileNameUri), fileNameUri, token);
}
private IEnumerable<DataType> Process(Stream stream, string fileNameUri, CancellationToken token)
{
    using (stream)
    using (var sr = new StreamReader(stream))
    using (var jsonTextReader = new JsonTextReader(sr))
    {
        while (jsonTextReader.Read())
        {
            token.ThrowIfCancellationRequested();
            if (jsonTextReader.TokenType != JsonToken.StartObject)
                continue;
            DataType item;
            try
            {
                item = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
            }
            catch (Exception ex)
            {
                _logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
                continue;
            }
            // yield return is not allowed inside a try block that has a catch clause,
            // so the item is yielded after the try/catch.
            yield return item;
        }
    }
}
private void PrepareDataflow(CancellationToken token)
{
    var executeOptions = new ExecutionDataflowBlockOptions
    {
        CancellationToken = token,
        MaxDegreeOfParallelism = ProcessingSize
    };
    _processingBlock = new TransformManyBlock<string, DataType>(fileName =>
        ProcessFileAsync(fileName, token), executeOptions);
    _messageBufferBlock = new BufferBlock<DataType>(new DataflowBlockOptions
    {
        CancellationToken = token,
        BoundedCapacity = 50000
    });
    // Link the blocks so results flow into the bounded queue and completion propagates.
    _processingBlock.LinkTo(_messageBufferBlock, new DataflowLinkOptions { PropagateCompletion = true });
}
Alternatively, you could use Rx. Learning Rx can be pretty difficult though, especially for mixed asynchronous and parallel dataflow situations, which you have here.
As for your other questions:
How to use IEnumerable<> and yield return with async/await and dataflow.
async and yield are not compatible at all. At least in today's language. In your situation, the JSON readers have to read from the stream synchronously anyway (they don't support asynchronous reading), so the actual stream processing is synchronous and can be used with yield. Doing the initial back-and-forth to get the stream itself can still be asynchronous and can be used with async. This is as good as we can get today, until the JSON readers support asynchronous reading and the language supports async yield. (Rx could do an "async yield" today, but the JSON reader still doesn't support async reading, so it won't help in this particular situation).
In this case, how to track completion.
If the JSON readers did support asynchronous reading, then the solution above would not be the best one. In that case, you would want to use a manual SendAsync call, and would need to link just the completion of these blocks, which can be done as such:
_processingBlock.Completion.ContinueWith(
task =>
{
if (task.IsFaulted)
((IDataflowBlock)_messageBufferBlock).Fault(task.Exception);
else if (!task.IsCanceled)
_messageBufferBlock.Complete();
},
CancellationToken.None,
TaskContinuationOptions.DenyChildAttach | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);
Should I use LinkTo feature of datablocks to connect various blocks? or use method such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another.
Use LinkTo whenever you can. It handles all the corner cases for you.
// Should throw?
// Should fault the block?
That's entirely up to you. By default, when any processing of any item fails, the block faults, and if you are propagating completion, the entire chain of blocks would fault.
Faulting blocks are rather drastic; they throw away any work in progress and refuse to continue processing. You have to build a new dataflow mesh if you want to retry.
If you prefer a "softer" error strategy, you can either catch the exceptions and do something like log them (which your code currently does), or you can change the nature of your dataflow block to pass along the exceptions as data items.
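A minimal sketch of the "exceptions as data items" idea follows. The Result<T> wrapper and the SoftErrorDemo class are hypothetical names introduced here, not types from TPL Dataflow, and parsing integers merely stands in for your real per-item work. The delegate catches its own exceptions and wraps them, so the block never faults and keeps processing later items.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class SoftErrorDemo
{
    // Hypothetical wrapper: carries either a value or the exception that replaced it.
    public sealed class Result<T>
    {
        public Result(T value, Exception error) { Value = value; Error = error; }
        public T Value { get; }
        public Exception Error { get; }
        public bool IsSuccess => Error == null;
    }

    public static async Task<List<Result<int>>> ParseAllAsync(IEnumerable<string> inputs)
    {
        var parser = new TransformBlock<string, Result<int>>(text =>
        {
            try
            {
                return new Result<int>(int.Parse(text), null);
            }
            catch (Exception ex)
            {
                // The failure becomes a data item instead of faulting the block.
                return new Result<int>(default(int), ex);
            }
        });

        var results = new List<Result<int>>();
        var sink = new ActionBlock<Result<int>>(r => results.Add(r));
        parser.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var input in inputs)
            await parser.SendAsync(input);

        parser.Complete();
        await sink.Completion; // does not throw: nothing faulted
        return results;
    }
}
```

Downstream blocks can then decide per item whether to log, retry, or skip, which is the trade-off against the all-or-nothing behavior of a faulted block.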
It would be worth looking at Rx. Unless I'm missing something, the entire code that you need (apart from your existing ProcessFileAsync method) would look like this:
var query =
fileNameUris
.Select(fileNameUri =>
Observable
.FromAsync(ct => ProcessFileAsync(fileNameUri, ct)))
.Merge(maxConcurrent : 4);
var subscription =
query
.Subscribe(
u => { },
() => { Console.WriteLine("Done."); });
Done. It runs asynchronously. It's cancellable by calling subscription.Dispose(). And you can specify the maximum parallelism.
I'm working on a system that involves accepting commands over a TCP network connection, then sending responses upon execution of those commands. Fairly basic stuff, but I'm looking to support a few requirements:
Multiple clients can connect at the same time and establish separate sessions. Sessions can last as long or as short as desired, with the same client IP able to establish multiple parallel sessions, if desired.
Each session can process multiple commands at the same time, as some of the requested operations can be performed in parallel.
I'd like to implement this cleanly using async/await and, based on what I've read, TPL Dataflow sounds like a good way to cleanly break up the processing into nice chunks that can run on the thread pool instead of tying up threads for different sessions/commands, blocking on wait handles.
This is what I'm starting with (some parts stripped out to simplify, such as details of exception handling; I've also omitted a wrapper that provides an efficient awaitable for the network I/O):
private readonly Task _serviceTask;
private readonly Task _commandsTask;
private readonly CancellationTokenSource _cancellation;
private readonly BufferBlock<Command> _pendingCommands;
public NetworkService(ICommandProcessor commandProcessor)
{
_commandProcessor = commandProcessor;
IsRunning = true;
_cancellation = new CancellationTokenSource();
_pendingCommands = new BufferBlock<Command>();
_serviceTask = Task.Run((Func<Task>)RunService);
_commandsTask = Task.Run((Func<Task>)RunCommands);
}
public bool IsRunning { get; private set; }
private async Task RunService()
{
_listener = new TcpListener(IPAddress.Any, ServicePort);
_listener.Start();
while (IsRunning)
{
Socket client = null;
try
{
client = await _listener.AcceptSocketAsync();
client.Blocking = false;
var session = RunSession(client);
lock (_sessions)
{
_sessions.Add(session);
}
}
catch (Exception ex)
{ //Handling here...
}
}
}
private async Task RunCommands()
{
while (IsRunning)
{
var command = await _pendingCommands.ReceiveAsync(_cancellation.Token);
var task = Task.Run(() => RunCommand(command));
}
}
private async Task RunCommand(Command command)
{
try
{
var response = await _commandProcessor.RunCommand(command.Content);
Send(command.Client, response);
}
catch (Exception ex)
{
//Deal with general command exceptions here...
}
}
private async Task RunSession(Socket client)
{
while (client.Connected)
{
var reader = new DelimitedCommandReader(client);
try
{
var content = await reader.ReceiveCommand();
_pendingCommands.Post(new Command(client, content));
}
catch (Exception ex)
{
//Exception handling here...
}
}
}
The basics seem straightforward, but one part is tripping me up: how do I make sure that when I'm shutting down the application, I wait for all pending command tasks to complete? I get the Task object when I use Task.Run to execute the command, but how do I keep track of pending commands so that I can make sure that all of them are complete before allowing the service to shut down?
I've considered using a simple List, with removal of commands from the List as they finish, but I'm wondering if I'm missing some basic tools in TPL Dataflow that would allow me to accomplish this more cleanly.
EDIT:
Reading more about TPL Dataflow, I'm wondering if what I should be using is a TransformBlock with an increased MaxDegreeOfParallelism to allow processing parallel commands? This sets an upper limit on the number of commands that can run in parallel, but that's a sensible limitation for my system, I think. I'm curious to hear from those who have experience with TPL Dataflow to know if I'm on the right track.
Yeah, so... you're kinda half using the power of TPL here. The fact that you're still manually receiving items from the BufferBlock in your own while loop in a background Task is not the "way" you want to do it if you're subscribing to the TPL Dataflow style.
What you would do is link an ActionBlock to the BufferBlock and do your command processing/sending from within that. This is also the block where you would set the MaxDegreeOfParallelism to control just how many concurrent commands you want to process. So that setup might look something like this:
// Initialization logic to build up the TPL flow.
// The block gets its own field name (_commandBlock) so that it does not clash
// with the ICommandProcessor field _commandProcessor from the question.
_pendingCommands = new BufferBlock<Command>();
_commandBlock = new ActionBlock<Command>(this.ProcessCommand);
// PropagateCompletion lets Complete() on the buffer flow through to the ActionBlock.
_pendingCommands.LinkTo(_commandBlock, new DataflowLinkOptions { PropagateCompletion = true });
private async Task ProcessCommand(Command command)
{
    var response = await _commandProcessor.RunCommand(command.Content);
    this.Send(command.Client, response);
}
Then, in your shutdown code, you would need to signal that you're done adding items into the pipeline by calling Complete on the _pendingCommands BufferBlock and then wait on the _commandBlock ActionBlock to complete to ensure that all items have made their way through the pipeline. Because the link propagates completion, you do this by grabbing the Task returned by the block's Completion property and calling Wait on it:
_pendingCommands.Complete();
_commandBlock.Completion.Wait();
If you want to go for bonus points, you can even separate the command processing from the command sending. This would allow you to configure those steps separately from one another. For example, maybe you need to limit the number of threads processing commands, but want to have more sending out the responses. You would do this by simply introducing a TransformBlock into the middle of the flow:
_pendingCommands = new BufferBlock<Command>();
_commandBlock = new TransformBlock<Command, Tuple<Client, Response>>(this.ProcessCommand);
_commandSender = new ActionBlock<Tuple<Client, Response>>(this.SendResponseToClient);
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
_pendingCommands.LinkTo(_commandBlock, linkOptions);
_commandBlock.LinkTo(_commandSender, linkOptions);
private async Task<Tuple<Client, Response>> ProcessCommand(Command command)
{
    var response = await _commandProcessor.RunCommand(command.Content);
    return Tuple.Create(command.Client, response);
}
private void SendResponseToClient(Tuple<Client, Response> clientAndResponse)
{
    this.Send(clientAndResponse.Item1, clientAndResponse.Item2);
}
You probably want to use your own data structure instead of Tuple; it was used here just for illustrative purposes. The point is that this is exactly the kind of structure you want to use to break up the pipeline, so that you can control its various aspects exactly as you need to.
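For reference, here is a self-contained sketch of the shutdown pattern with a parallelism limit. The ShutdownDemo class, the integer "commands" and the Task.Delay standing in for real work are all assumptions for illustration. It awaits the completion instead of calling Wait, which is usually preferable when the shutdown path is itself async.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ShutdownDemo
{
    public static async Task<(int Processed, int PeakParallel)> ProcessAllAsync(int commandCount, int maxParallel)
    {
        int running = 0, peak = 0, processed = 0;
        var sync = new object();

        var pending = new BufferBlock<int>();
        var worker = new ActionBlock<int>(async command =>
        {
            lock (sync) { running++; if (running > peak) peak = running; }
            await Task.Delay(10); // stand-in for running the command and sending the response
            lock (sync) { running--; processed++; }
        },
        // Upper limit on how many commands are processed concurrently.
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        // Without PropagateCompletion, worker.Completion would never finish.
        pending.LinkTo(worker, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < commandCount; i++)
            pending.Post(i);

        // Shutdown: stop accepting input, then wait for everything in flight to drain.
        pending.Complete();
        await worker.Completion;
        return (processed, peak);
    }
}
```

Once worker.Completion finishes, every posted command has been handled, so no separate list of outstanding tasks is needed.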
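Tasks run on thread-pool threads by default, and those are background threads, which means that when the application terminates they are immediately terminated as well. You should use a Thread, not a Task. Then, on your Thread instance (called thread here), you can set:
thread.IsBackground = false;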
This will prevent your application from terminating while the worker thread is running.
Although of course this will require some changes in your above code.
What's more, when executing the shutdown method, you could also just wait for any outstanding tasks from the main thread.
I do not see a better solution to this.