Completion in TPL Dataflow Loops - c#

I have a problem with determining how to detect completion within a looping TPL Dataflow.
I have a feedback loop in part of a dataflow which is making GET requests to a remote server and processing data responses (transforming these with more dataflow, then committing the results).
The data source splits its results into pages of 1,000 records and won't tell me how many pages are available. I just have to keep reading until I get less than a full page of data.
Usually the number of pages is 1, frequently it is up to 10, every now and again we have 1000s.
I have many requests to fetch at the start.
I want to be able to use a pool of threads to deal with this, all of which is fine, I can queue multiple requests for data and request them concurrently. If I stumble across an instance where I need to get a big number of pages I want to be using all of my threads for this. I don't want to be left with one thread churning away whilst the others have finished.
The issue I have is when I drop this logic into dataflow, such as:
//generate initial requests for activity
var request = new TransformManyBlock<int, DataRequest>(cmp => QueueRequests(cmp));

//fetch the initial requests, and feed more requests back into our input buffer if we need to
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    var resp = await Fetch(req);
    if (resp.Results.Count == 1000)
        await fetch.SendAsync(QueueAnotherRequest(req));
    return resp;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

//commit each type of request
var commit = new ActionBlock<DataResponse>(async resp => await Commit(resp));

request.LinkTo(fetch);
fetch.LinkTo(commit);
//when are we complete?
QueueRequests produces an IEnumerable<DataRequest>. I queue the next N page requests at once, accepting that this means I send slightly more calls than I need to. DataRequest instances share a LastPage counter to avoid needlessly making requests that we know are after the last page. All this is fine.
The problem:
If I loop by feeding more requests back into fetch's input buffer, as I've shown in this example, then I have a problem with how to signal (or even detect) completion. I can't set completion on fetch from request, as once completion is set I can't feed back any more requests.
I can monitor for the input and output buffers being empty on fetch, but I think I'd be risking fetch still being busy with a request when I set completion, thus preventing queuing requests for additional pages.
I could do with some way of knowing that fetch is busy (either has input or is busy processing an input).
Am I missing an obvious/straightforward way to solve this?
I could loop within fetch, rather than queuing more requests. The problem with that is I want to be able to use a set maximum number of threads to throttle what I'm doing to the remote server. Could a parallel loop inside the block share a scheduler with the block itself and the resulting thread count be controlled via the scheduler?
I could create a custom transform block for fetch to handle the completion signalling. Seems like a lot of work for such a simple scenario.
Many thanks for any help offered!

In TPL Dataflow, you can link the blocks with DataflowLinkOptions, specifying that completion should propagate between them:
request.LinkTo(fetch, new DataflowLinkOptions { PropagateCompletion = true });
fetch.LinkTo(commit, new DataflowLinkOptions { PropagateCompletion = true });
After that, you simply call the Complete() method for the request block, and you're done!
// the completion will be propagated to all the blocks
request.Complete();
The final thing you should use is Completion task property of the last block:
commit.Completion.ContinueWith(t =>
{
    /* check the status of the task and correctness of the requests handling */
});
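As a side note, awaiting the completion task directly reads more naturally in modern async code, and rethrows the first faulting exception so you can catch it:
try
{
    await commit.Completion;
    // all blocks have drained and the requests were handled
}
catch (Exception ex)
{
    // a block faulted; inspect/log the failure here
}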

For now I have added a simple busy state counter to the fetch block:-
int fetch_busy = 0;
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    Interlocked.Increment(ref fetch_busy);
    try
    {
        var resp = await Fetch(req);
        if (resp.Results.Count == 1000)
        {
            await fetch.SendAsync(QueueAnotherRequest(req));
        }
        return resp;
    }
    finally
    {
        // decrement on success and failure alike, without losing the stack trace
        Interlocked.Decrement(ref fetch_busy);
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });
Which I then use to signal completion as follows:-
request.Completion.ContinueWith(async _ =>
{
    while (fetch.InputCount > 0 || fetch_busy > 0)
    {
        await Task.Delay(100);
    }
    fetch.Complete();
});
Which doesn't seem very elegant, but I think it should work.
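An alternative that avoids the polling loop (a sketch, not from the original post; initialRequests is hypothetical): count outstanding requests and complete the block when the count drops to zero. A follow-up page is counted before the current request is un-counted, so the counter can only reach zero when the loop has truly drained:
int pending = 0;
TransformBlock<DataRequest, DataResponse> fetch = null;

// sends a request, counting it as outstanding work first
async Task SendCountedAsync(DataRequest req)
{
    Interlocked.Increment(ref pending);
    await fetch.SendAsync(req);
}

fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    var resp = await Fetch(req);
    if (resp.Results.Count == 1000)
        await SendCountedAsync(QueueAnotherRequest(req)); // counted before the decrement below
    if (Interlocked.Decrement(ref pending) == 0)
        fetch.Complete(); // nothing queued and nothing in flight, so the loop has drained
    return resp;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

// seed: count ALL initial requests before sending any, so the counter
// can't reach zero while the first request races ahead
Interlocked.Add(ref pending, initialRequests.Count);
foreach (var req in initialRequests)
    await fetch.SendAsync(req);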

Related

How to do error handling with connected TPL dataflow blocks?

It seems that I do not understand TPL Dataflow error handling.
Let's assume I have a list of items I want to process, and I use an ActionBlock for that:
var actionBlock = new ActionBlock<int[]>(async tasks =>
{
    foreach (var task in tasks)
    {
        await Task.Delay(1);
        if (task > 30)
        {
            throw new InvalidOperationException();
        }
        Console.WriteLine("{0} Completed", task);
    }
}, new ExecutionDataflowBlockOptions
{
    BoundedCapacity = 200,
    MaxDegreeOfParallelism = 4
});

for (var i = 0; i < 10000; i++)
{
    if (!await actionBlock.SendAsync(new[] { i }))
    {
        break;
    }
}
actionBlock.Complete();
await actionBlock.Completion;
If an error occurs the block transitions to faulted state and SendAsync(...) returns false. I can just stop my loop and complete it and when I await the completion an exception is thrown. So far so good.
When I put a BufferBlock in between it does not work anymore:
bufferBlock.LinkTo(actionBlock, new DataflowLinkOptions
{
    PropagateCompletion = true
});

for (var i = 0; i < 10000; i++)
{
    if (!await bufferBlock.SendAsync(new[] { i }))
    {
        break;
    }
}

bufferBlock.Complete();
await actionBlock.Completion;
The call to SendAsync() just "blocks" forever, because the BufferBlock never transitions to the faulted state.
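(One way to see why: completion, and therefore faults, only propagate forward along links. A hedged sketch of one common workaround is to mirror the fault backwards by hand, so the BufferBlock faults too and the pending SendAsync completes with false:
bufferBlock.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });

// mirror a downstream fault back to the producer-facing block
actionBlock.Completion.ContinueWith(
    t => ((IDataflowBlock)bufferBlock).Fault(t.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
)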
The only solution I found is this:
using (var cts = new CancellationTokenSource())
{
    actionBlock.Completion.ContinueWith(x =>
    {
        if (x.Status != TaskStatus.RanToCompletion)
        {
            cts.Cancel();
        }
    });

    try
    {
        for (var i = 0; i < 10000; i++)
        {
            if (cts.Token.IsCancellationRequested)
            {
                break;
            }
            if (!await bufferBlock.SendAsync(new[] { i }, cts.Token))
            {
                break;
            }
        }
    }
    catch (OperationCanceledException)
    {
    }

    bufferBlock.Complete();
    await actionBlock.Completion;
}
Because the state propagates I have to listen to the state of the last block in my network and when this block stops I have to stop my loop.
Is this the intended way to work with Dataflow library or is there a better solution?
Don't allow unhandled exceptions. An unhandled exception in a block means the block and by extension the entire pipeline is terminally broken and must be aborted. That's not a TPL Dataflow bug, that's how the overall dataflow paradigm works. Exceptions are meant to signal errors up a call stack. There's no call stack in a dataflow though.
Blocks are independent workers that communicate through messages. There's no ownership relation between linked blocks and a faulting block doesn't mean any previous or following blocks should have to abort as well. That's why PropagateCompletion is false by default.
If a source links to more than one block, the messages can easily go to the other blocks. It's also possible to change the links between blocks at runtime.
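For example, LinkTo returns an IDisposable that removes the link when disposed, so a mesh can be rewired while it runs (block names here are hypothetical):
IDisposable link = source.LinkTo(workerA);

// later, at runtime: detach workerA and route messages to workerB instead
link.Dispose();
source.LinkTo(workerB);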
In a pipeline there are two different kinds of errors:
Message errors that occur when a block/actor/worker processes a message
Pipeline errors that invalidate the pipeline and may require aborting
There's no reason to abort the pipeline if a single message faults.
Message errors
If something goes wrong while processing a message, the actor should do something with that message and proceed with the next one. That something may be:
Log the error and go on
Send an "error" message to another block
Use a Result<TMessage,TError> class in the entire pipeline instead of using raw message types, and add any errors to the result
Retry and recovery strategies can be built on top of that, eg forwarding any failed messages to a "retry" block or dead message block
The simplest way would be to just catch the exceptions and log them:
var block = new ActionBlock<int[]>(msg => {
    try
    {
        ...
    }
    catch (Exception exc)
    {
        _logger.LogError(exc);
    }
});
Another option is to manually post to e.g. a dead-letter queue:
var dead = new BufferBlock<(int[] data, Exception error)>();
var block = new ActionBlock<int[]>(async msg => {
    try
    {
        ...
    }
    catch (Exception exc)
    {
        await dead.SendAsync((msg, exc));
        _logger.LogError(exc);
    }
});
Going even further, one could define a Result<TMessage,TError> class to wrap results. Downstream blocks could ignore faulted results. The LinkTo predicate can also be used to reroute error messages. I'll cheat and hard-code the error to Exception; a better implementation would use different types for success and error:
record Result<TMessage>(TMessage? Message, Exception? Error)
{
    public bool HasError => Error != null;
}

var block1 = new TransformBlock<Result<int[]>, Result<double>>(msg => {
    if (msg.HasError)
    {
        //Propagate the error message
        return new Result<double>(default, msg.Error);
    }
    try
    {
        var sum = (double)msg.Message.Sum();
        if (sum % 5 == 0)
        {
            throw new Exception("Why not?");
        }
        return new Result<double>(sum, null);
    }
    catch (Exception exc)
    {
        return new Result<double>(default, exc);
    }
});
var block2 = new ActionBlock<Result<double>>(...);
block1.LinkTo(block2);
Another option is to redirect error messages to a different block:
var errorBlock = new ActionBlock<Result<int[]>>(msg => {
    _logger.LogError(msg.Error);
});
block1.LinkTo(errorBlock, msg => msg.HasError);
block1.LinkTo(block2);
This redirects all errored messages to the error block. All other messages move on to block2.
Pipeline errors
In some cases, an error is so severe the current block can't recover and perhaps even the entire pipeline must be cancelled/aborted. Cancellation in .NET is handled through a CancellationToken. All blocks accept a CancellationToken to allow aborting.
There's no single abort strategy that's appropriate to all pipelines. Propagating cancellation forward is common but definitely not the only option.
In the simplest case,
var pipeLineCancellation = new CancellationTokenSource();
var block1 = new TransformBlock<Result<int[]>, Result<double>>(msg => {
    ...
},
new ExecutionDataflowBlockOptions
{
    CancellationToken = pipeLineCancellation.Token
});
The block exception handler could request cancellation in case of a serious error :
//Wrong table name. We can't use the database
catch (SqlException exc) when (exc.Number == 208)
{
    ...
    pipeLineCancellation.Cancel();
}
This would abort all blocks that use the same CancellationTokenSource. That doesn't mean that all blocks should be connected to the same CancellationTokenSource though.
Flowing cancellation backwards
In Go pipelines it's common to use an error channel that sends a cancellation message to the previous block. The same can be done in C# using linked CancellationTokenSources. One could even say this is better than Go.
It's possible to create multiple linked CancellationTokenSources with CreateLinkedTokenSource. By creating sources that link backwards we can have a block signal cancellation for its own source and have the cancellation flow to the root.
var cts5=new CancellationTokenSource();
var cts4=CancellationTokenSource.CreateLinkedTokenSource(cts5.Token);
...
var cts1=CancellationTokenSource.CreateLinkedTokenSource(cts2.Token);
...
var block3=new TransformBlock<Result<int[]>,Result<double>>(msg=>{
...
catch(SqlException)
{
cts3.Cancel();
}
},
new ExecutionDataflowBlockOptions {
CancellationToken=cts3.Token
});
This will signal cancellation backwards, block by block, without cancelling the downstream blocks.
Pipeline Patterns
Dataflow in .NET is a gem few people know about, so it's really hard to find good references and patterns. The concepts are similar in Go though, so one could use the patterns found in Go Concurrency Patterns: Pipelines and cancellation.
The TPL Dataflow implements the processing loop and completion propagation so one typically only needs to provide the Action or Func that processes messages. The rest of the patterns have to be implemented, although .NET offers some advantages over Go.
The done channel is essentially a CancellationTokenSource.
Fan-in, fan-out are already handled through existing blocks, or can be handled using a relatively simple custom block that clones messages
CancellationTokenSources can be linked explicitly. In Go each "stage" (essentially a block) has to propagate completion/cancellation to other stages
One CancellationTokenSource can be used by all stages/blocks.
Linking allows not just easier composition but even runtime modifications to the pipeline/mesh.
Let's say we want to just stop processing messages after a while, even though there's no error. All that's needed is to create a CTS used by all blocks:
var pipeLineCancellation = new CancellationTokenSource();
var block1 = new TransformBlock<Result<int[]>, Result<double>>(msg => {
    ...
},
new ExecutionDataflowBlockOptions
{
    CancellationToken = pipeLineCancellation.Token
});
var block2 =.....;
pipeLineCancellation.Cancel();
Perhaps we want to run the pipeline for only a minute? Easy with
var pipeLineCancellation = new CancellationTokenSource(60000);
There are some disadvantages too, as a Dataflow block has no access to the "channels" or control over the loop:
In Go it's easy to pass data, error and done channels to each stage, simplifying the error reporting and completion. In .NET the block delegates may have to access other blocks or CTSs directly.
In Go it's easier to use common state to eg accumulate data, or manage session/remote connection state. Imagine stage/block that controls a screen scraper like Selenium. We really don't want to restart the browser on every message.
Or we may want to insert data into a database using SqlBulkCopy. With an ActionBlock we'd have to create a new instance for each batch, which may or may not be a problem.
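As a sketch of that last scenario (all names and the connection string are made up, and it assumes System.Data, a SqlClient package, and System.Threading.Tasks.Dataflow): a BatchBlock groups rows so that each SqlBulkCopy instance at least writes a full batch rather than a single row:
record Row(int Id, string Name);

var batch = new BatchBlock<Row>(1000);
var bulkInsert = new ActionBlock<Row[]>(async rows =>
{
    // build a DataTable for this batch
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("Name", typeof(string));
    foreach (var row in rows)
        table.Rows.Add(row.Id, row.Name);

    // a new connection and SqlBulkCopy instance per batch
    using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Rows" };
    await bulk.WriteToServerAsync(table);
});
batch.LinkTo(bulkInsert, new DataflowLinkOptions { PropagateCompletion = true });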

Parallel.ForEach faster than Task.WaitAll for I/O bound tasks?

I have two versions of my program that submit ~3000 HTTP GET requests to a web server.
The first version is based off of what I read here. That solution makes sense to me because making web requests is I/O bound work, and the use of async/await along with Task.WhenAll or Task.WaitAll means that you can submit 100 requests all at once and then wait for them all to finish before submitting the next 100 requests so that you don't bog down the web server. I was surprised to see that this version completed all of the work in ~12 minutes - way slower than I expected.
The second version submits all 3000 HTTP GET requests inside a Parallel.ForEach loop. I use .Result to wait for each request to finish before the rest of the logic within that iteration of the loop can execute. I thought that this would be a far less efficient solution, since using threads to perform tasks in parallel is usually better suited to CPU-bound work, but I was surprised to see that this version completed all of the work within ~3 minutes!
My question is why is the Parallel.ForEach version faster? This came as an extra surprise because when I applied the same two techniques against a different API/web server, version 1 of my code was actually faster than version 2 by about 6 minutes - which is what I expected. Could performance of the two different versions have something to do with how the web server handles the traffic?
You can see a simplified version of my code below:
private async Task<ObjectDetails> TryDeserializeResponse(HttpResponseMessage response)
{
try
{
using (Stream stream = await response.Content.ReadAsStreamAsync())
using (StreamReader readStream = new StreamReader(stream, Encoding.UTF8))
using (JsonTextReader jsonTextReader = new JsonTextReader(readStream))
{
JsonSerializer serializer = new JsonSerializer();
ObjectDetails objectDetails = serializer.Deserialize<ObjectDetails>(
jsonTextReader);
return objectDetails;
}
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<HttpResponseMessage> TryGetResponse(string urlStr)
{
try
{
HttpResponseMessage response = await httpClient.GetAsync(urlStr)
.ConfigureAwait(false);
if (response.StatusCode != HttpStatusCode.OK)
{
throw new WebException("Response code is "
+ response.StatusCode.ToString() + "... not 200 OK.");
}
return response;
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<ObjectDetails> GetObjectDetailsAsync(string baseUrl, int id)
{
string urlStr = baseUrl + "objects/id/" + id + "/details";
HttpResponseMessage response = await TryGetResponse(urlStr);
ObjectDetails objectDetails = await TryDeserializeResponse(response);
return objectDetails;
}
// With ~3000 objects to retrieve, this code will create 100 API calls
// in parallel, wait for all 100 to finish, and then repeat that process
// ~30 times. In other words, there will be ~30 batches of 100 parallel
// API calls.
private Dictionary<int, Task<ObjectDetails>> GetAllObjectDetailsInBatches(
string baseUrl, Dictionary<int, MyObject> incompleteObjects)
{
int batchSize = 100;
int numberOfBatches = (int)Math.Ceiling(
(double)incompleteObjects.Count / batchSize);
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= new Dictionary<int, Task<ObjectDetails>>(incompleteObjects.Count);
var orderedIncompleteObjects = incompleteObjects.OrderBy(pair => pair.Key);
for (int i = 0; i < numberOfBatches; i++)
{
var batchOfObjects = orderedIncompleteObjects.Skip(i * batchSize)
.Take(batchSize);
// materialize the query once, so the tasks aren't re-created on every enumeration
var batchObjectsTaskList = batchOfObjects.Select(
pair => GetObjectDetailsAsync(baseUrl, pair.Key)).ToList();
Task.WaitAll(batchObjectsTaskList.ToArray());
foreach (var objTask in batchObjectsTaskList)
objectTaskDict.Add(objTask.Result.id, objTask);
}
return objectTaskDict;
}
public void GetObjectsVersion1()
{
string baseUrl = @"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= GetAllObjectDetailsInBatches(baseUrl, incompleteObjects);
foreach (KeyValuePair<int, MyObject> pair in incompleteObjects)
{
ObjectDetails objectDetails = objectTaskDict[pair.Key].Result;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
};
}
public void GetObjectsVersion2()
{
string baseUrl = @"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Parallel.ForEach(incompleteObjects, pair =>
{
ObjectDetails objectDetails = GetObjectDetailsAsync(
baseUrl, pair.Key).Result;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
});
}
A possible reason why Parallel.ForEach may run faster is because it creates the side-effect of throttling. Initially x threads are processing the first x elements (where x is the number of available cores), and progressively more threads may be added depending on internal heuristics. Throttling I/O operations is a good thing because it protects the network and the server that handles the requests from becoming overburdened. Your alternative improvised method of throttling, by making requests in batches of 100, is far from ideal for many reasons, one of them being that 100 concurrent requests are a lot of requests! Another one is that a single long-running operation may delay the completion of the batch until long after the completion of the other 99 operations.
Note that Parallel.ForEach is also not ideal for parallelizing I/O operations. It just happened to perform better than the alternative, wasting memory all along. For better approaches look here: How to limit the amount of concurrent async I/O operations?
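For reference, a common pattern from that link is to throttle the naturally-async calls with a SemaphoreSlim instead of batches or Parallel.ForEach (a sketch; urls and httpClient are assumed to exist):
var throttler = new SemaphoreSlim(10); // at most 10 requests in flight

var tasks = urls.Select(async url =>
{
    await throttler.WaitAsync();
    try
    {
        return await httpClient.GetStringAsync(url);
    }
    finally
    {
        throttler.Release();
    }
});

string[] bodies = await Task.WhenAll(tasks);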
https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netframework-4.8
Basically, Parallel.ForEach allows iterations to run in parallel, so you are not constraining the iterations to run serially. On a host that is not thread-constrained, this will tend to improve throughput.
In short:
Parallel.ForEach() is most useful for CPU-bound tasks.
Task.WaitAll() is more useful for I/O-bound tasks.
So in your case, you are getting information from web servers, which is I/O. If the async methods are implemented correctly, they won't block any thread (they will use I/O completion ports to wait). This way the threads can do other stuff.
By running the async method synchronously with GetObjectDetailsAsync(baseUrl, pair.Key).Result, you block a thread, so the thread pool will be flooded with waiting threads.
So I think the Task-based solution is a better fit.

Run two methods in parallel, where the 1st method gets data in chunks and the 2nd method needs to process that data in parallel

My process gets the data through HTTP requests, and it gets the data in chunks (100 records at a time). In my case I had 100,000 records.
Then I need to process that data and load it into the DB.
My current process:
GrabAllRecords()
{
GRAB all 100,000 records (i.e. 1,000 requests).. it's a big amount of time.
Load into ArrayData
}
then..
Process Data(ArrayData)
{
}
But I need something like this...
START:
step1:
Grab 100 Records load into arraylist..
repeat step1 until it reaches 100,000
step2:
process arrayList
This screams for the producer-consumer design pattern: one producer produces something at its own pace, while one or more consumers wait until something is produced, grab the produced information and process it, possibly leading to new produced output that other consumers might process.
Microsoft has good support for this via the Microsoft TPL Dataflow NuGet package.
Implement a Producer-Consumer Dataflow Pattern
Also helpful to start: Walkthrough: Creating a Dataflow Pipeline
The producer produces output in processable units, in your case: chunks. The output will be sent to an object of class BufferBlock<T>, where T is your chunk type. Code will be similar to:
public class ChunkProducer
{
    // whenever the ChunkProducer produces a chunk it is put in this buffer
    private BufferBlock<Chunk> outputBuffer = new BufferBlock<Chunk>();

    // consumers will need access to this output buffer as their source of data:
    public ISourceBlock<Chunk> OutputBuffer
    { get { return this.outputBuffer as ISourceBlock<Chunk>; } }

    public async Task ProduceAsync()
    {
        while (someThingsToProcess)
        {
            Chunk chunk = CreateChunk(...);
            await this.outputBuffer.SendAsync(chunk);
        }
        // if here: nothing to process anymore.
        // notify consumers that all output has been produced
        this.outputBuffer.Complete();
    }
}
The efficiency of this can be enhanced by creating the next chunk while the previous one is being sent, and awaiting before sending the next chunk. This is a bit out of scope here; more info about this is available on Stack Overflow.
You'll also need a ChunkConsumer. The ChunkConsumer will wait for chunks on the buffer block and process them:
public class ChunkConsumer
{
    // the chunk consumer will wait for input at this source
    private readonly ISourceBlock<Chunk> chunkSource;

    public ChunkConsumer(ISourceBlock<Chunk> chunkSource)
    {
        this.chunkSource = chunkSource;
    }

    public async Task ConsumeAsync()
    {
        // wait until there is some data in the buffer
        while (await this.chunkSource.OutputAvailableAsync())
        {
            // get the chunk and process it:
            Chunk chunk = this.chunkSource.Receive();
            ProcessChunk(chunk);
        }
        // if here: chunkSource has been completed. No more data to expect
    }
}
Put it all together:
private async Task ProcessAsync()
{
ChunkProducer producer = new ChunkProducer();
ChunkConsumer consumer = new ChunkConsumer(producer.OutputBuffer);
// start a thread for the consumer to consume:
Task consumeTask = Task.Run( () => consumer.ConsumeAsync());
// let this thread start producing, and await until it is completed
await producer.ProduceAsync();
// if here, I know the producer finished producing
// wait until the consumer finished consuming:
await consumeTask;
// finished, all produced data is consumed.
}
Possible enhancements:
If producing is faster than consuming, consider using multiple consumers listening to the same ISourceBlock. Check TPL to see which of the BufferBlock types can handle multiple listeners
If producing is slower than consuming, consider using multiple producers producing to the same ITargetBlock. Check which type of buffer block can handle this.
Consider enabling cancellation using CancellationToken
If your chunk is not always the same number of records, consider using a BatchBlock: the consumer gets notified when a batch has enough records to process (see the sketch below).
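A minimal BatchBlock sketch for that last point (Record, FetchRecords and ProcessChunk are hypothetical):
var batchBlock = new BatchBlock<Record>(100); // groups records into arrays of 100
var processBlock = new ActionBlock<Record[]>(records => ProcessChunk(records));
batchBlock.LinkTo(processBlock, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var record in FetchRecords())
    batchBlock.Post(record);

batchBlock.Complete();            // also flushes a final partial batch
await processBlock.Completion;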
You can use the DataFlow library to do something like this:
ActionBlock<Record[]> action_block = new ActionBlock<Record[]>(
x => ConsumeRecords(x),
new ExecutionDataflowBlockOptions
{
//Use one thread to process data.
//You can increase it if you want
//That would make sense if you produce the records faster than you consume them
MaxDegreeOfParallelism = 1
});
for (int i = 0; i < 1000; i++)
{
action_block.Post(ProduceNext100Records());
}
I am assuming that you have a method called ProduceNext100Records that produces records (e.g. via web service call) and another method called ConsumeRecords that consumes the records.
The easy answer I think is to use Microsoft Reactive Extensions (NuGet "Rx-Main").
Then you can do something like this:
var query =
    from records in Get100Records().ToObservable()
    from record in records.ToObservable()
    from result in Observable.Start(() => ProcessRecord(record))
    select new { record, result };

IDisposable subscription =
    query.Subscribe(
        rr =>
        {
            /* Process each `rr.record`/`rr.result`
               as they are produced */
        },
        () => { /* Run when all completed */ });
This will process in parallel and you'll start getting results as soon as the first ProcessRecord call is completed.
If you need to stop the processing early you just call subscription.Dispose().

How do I detect all TransformManyBlocks have completed

I have a TransformManyBlock that creates many "actors". They flow through several TransformBlocks of processing. Once all of the actors have completed all of the steps I need to process everything as a whole again.
The final block looks like this:
var CopyFiles = new TransformBlock<Actor, Actor>(async actor =>
{
    //Copy all of the files and then wait until they are all done
    await actor.CopyFiles();
    //pass me along to the next process
    return actor;
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = -1 });
How do I detect when all of the Actors have been processed? Completion seems to be propagated immediately and only tells you there are no more items to process. That doesn't tell me when the last Actor has finished processing.
Completion in TPL Dataflow is done by calling Complete, which returns immediately and signals completion. That makes the block refuse further messages, but it continues processing the items it already contains.
When a block completes processing all its items it completes its Completion task. You can await that task to be notified when all the block's work has been done.
When you link blocks together (with PropagateCompletion turned on) you only need to call Complete on the first block and await the last one's Completion property:
var copyFilesBlock = new TransformBlock<Actor, Actor>(async actor =>
{
    await actor.CopyFilesAsync();
    return actor;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = -1 });

// Fill copyFilesBlock
copyFilesBlock.Complete();
await copyFilesBlock.Completion;
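Applied to the chain from the question (block names hypothetical), that looks like:
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
createActors.LinkTo(copyFilesBlock, linkOptions);
copyFilesBlock.LinkTo(finalStep, linkOptions);

createActors.Complete();       // signal: no more actors will be posted
await finalStep.Completion;    // resumes only when every actor has finished every stage
// now it's safe to process everything as a whole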

BrokeredMessage Automatically Disposed after calling OnMessage()

I am trying to queue up items from an Azure Service Bus so I can process them in bulk. I am aware that the Azure Service Bus has a ReceiveBatch() but it seems problematic for the following reasons:
I can only get a max of 256 messages at a time, and even this can be random based on message size.
Even if I peek to see how many messages are waiting, I don't know how many ReceiveBatch calls to make, because I don't know how many messages each call will give me back. Since messages will keep coming in, I can't just continue to make requests until it's empty, since it will never be empty.
I decided to just use the message listener which is cheaper than doing wasted peeks and will give me more control.
Basically I am trying to let a set number of messages build up and then process them at once. I use a timer to force a delay, but I need to be able to queue my items as they come in.
Based on my timer requirement it seemed like the blocking collection was not a good option so I am trying to use ConcurrentBag.
var batchingQueue = new ConcurrentBag<BrokeredMessage>();

myQueueClient.OnMessage((m) =>
{
    Console.WriteLine("Queueing message");
    batchingQueue.Add(m);
});

while (true)
{
    var sw = WaitableStopwatch.StartNew();
    BrokeredMessage msg;
    while (batchingQueue.TryTake(out msg)) // <== Object is already disposed
    {
        // ...do this until I have a thousand ready to be written to DB in batch
        Console.WriteLine("Completing message");
        msg.Complete(); // <== ERRORS HERE
    }
    sw.Wait(MINIMUM_DELAY);
}
However, as soon as I access the message outside of the OnMessage pipeline, it shows the BrokeredMessage as already being disposed.
I am thinking this must be some automatic behavior of OnMessage, and I don't see any way to do anything with the message other than process it right away, which I don't want to do.
This is incredibly easy to do with BlockingCollection.
var batchingQueue = new BlockingCollection<BrokeredMessage>();

myQueueClient.OnMessage((m) =>
{
    Console.WriteLine("Queueing message");
    batchingQueue.Add(m);
});
And your consumer thread:
foreach (var msg in batchingQueue.GetConsumingEnumerable())
{
    Console.WriteLine("Completing message");
    msg.Complete();
}
GetConsumingEnumerable returns an iterator that consumes items in the queue until the IsCompleted property is set and the queue is empty. If the queue is empty but IsCompleted is False, it does a non-busy wait for the next item.
To cancel the consumer thread (i.e. shut down the program), you stop adding things to the queue and have the main thread call batchingQueue.CompleteAdding. The consumer will empty the queue, see that the IsCompleted property is True, and exit.
Using BlockingCollection here is better than ConcurrentBag or ConcurrentQueue, because the BlockingCollection interface is easier to work with. In particular, the use of GetConsumingEnumerable relieves you from having to worry about checking the count or doing busy waits (polling loops). It just works.
Also note that ConcurrentBag has some rather strange removal behavior. In particular, the order in which items are removed differs depending on which thread removes the item. The thread that created the bag removes items in a different order than other threads. See Using the ConcurrentBag Collection for the details.
You haven't said why you want to batch items on input. Unless there's an overriding performance reason to do so, it doesn't seem like a particularly good idea to complicate your code with that batching logic.
If you want to do batch writes to the database, then I would suggest using a simple List<T> to buffer the items. If you have to process the items before they're written to the database, then use the technique I showed above to process them. Then, rather than writing directly to the database, add the item to a list. When the list gets 1,000 items, or a given amount of time elapses, allocate a new list and start a task to write the old list to the database. Like this:
// at class scope
// Flush every 5 minutes.
private readonly TimeSpan FlushDelay = TimeSpan.FromMinutes(5);
private const int MaxBufferItems = 1000;
// Create a timer for the buffer flush.
System.Threading.Timer _flushTimer = new System.Threading.Timer(
TimedFlush, null, FlushDelay, Timeout.InfiniteTimeSpan);
// A lock for the list. Unless you're getting hundreds of thousands
// of items per second, this will not be a performance problem.
object _listLock = new Object();
List<BrokeredMessage> _recordBuffer = new List<BrokeredMessage>();
Then, in your consumer:
foreach (var msg in batchingQueue.GetConsumingEnumerable())
{
// process the message
Console.WriteLine("Completing message");
msg.Complete();
lock (_listLock)
{
_recordBuffer.Add(msg);
if (_recordBuffer.Count >= MaxBufferItems)
{
// Stop the timer
_flushTimer.Change(Timeout.Infinite, Timeout.Infinite);
// Save the old list and allocate a new one
var myList = _recordBuffer;
_recordBuffer = new List<BrokeredMessage>();
// Start a task to write to the database
Task.Factory.StartNew(() => FlushBuffer(myList));
// Restart the timer
_flushTimer.Change(FlushDelay, Timeout.InfiniteTimeSpan);
}
}
}
private void TimedFlush(object state)
{
bool lockTaken = false;
List<BrokeredMessage> myList = null;
try
{
Monitor.TryEnter(_listLock, 0, ref lockTaken);
if (lockTaken)
{
// Save the old list and allocate a new one
myList = _recordBuffer;
_recordBuffer = new List<BrokeredMessage>();
}
}
finally
{
if (lockTaken)
{
Monitor.Exit(_listLock);
}
}
if (myList != null)
{
FlushBuffer(myList);
}
// Restart the timer
_flushTimer.Change(FlushDelay, Timeout.InfiniteTimeSpan);
}
The idea here is that you get the old list out of the way, allocate a new list so that processing can continue, and then write the old list's items to the database. The lock is there to prevent the timer and the record counter from stepping on each other. Without the lock, things would likely appear to work fine for a while, and then you'd get weird crashes at unpredictable times.
I like this design because it eliminates polling by the consumer. The only thing I don't like is that the consumer has to be aware of the timer (i.e. it has to stop and then restart the timer). With a little more thought, I could eliminate that requirement. But it works well the way it's written.
Switching to OnMessageAsync solved the problem for me
_queueClient.OnMessageAsync(async receivedMessage =>
I reached out to Microsoft about the BrokeredMessage-being-disposed issue on MSDN; this is the response:
Very basic rule and I am not sure if this is documented. The received message needs to be processed in the callback function's life time. In your case, messages will be disposed when async callback completes, this is why your complete attempts are failing with ObjectDisposedException in another thread.
I don't really see how queuing messages for further processing helps on the throughput. This will add more burden to client for sure. Try processing the message in the async callback, that should be performant enough.
In my case that means I can't use ServiceBus in the way I wanted to, and I have to re-think how I wanted things to work. Bugger.
I had the same issue when I started to work with the Azure Service Bus service.
I found that the OnMessage method always disposes the BrokeredMessage object. The approach proposed by Jim Mischel didn't help me (but it was very interesting to read - thanks!).
After some investigation I found that the whole approach was wrong. Let me explain the right way to do what you want.
Use the BrokeredMessage.Complete() method only inside the OnMessage handler.
If you need to process a message outside of this method, you should use QueueClient.Complete(Guid lockToken). LockToken is a property of the BrokeredMessage object.
Example:
var messageOptions = new OnMessageOptions {
    AutoComplete = false,
    AutoRenewTimeout = TimeSpan.FromMinutes(5),
    MaxConcurrentCalls = 1
};
var buffer = new Dictionary<string, Guid>();

// get messages from the queue
myQueueClient.OnMessage(
    m => buffer.Add(key: m.GetBody<string>(), value: m.LockToken),
    messageOptions // this option tells Service Bus to "freeze" the message in the queue until we process it
);

foreach (var item in buffer) {
    try {
        Console.WriteLine($"Process item: {item.Key}");
        myQueueClient.Complete(item.Value); // you can also use CompleteBatch(...) to improve performance
    }
    catch {
        // "unfreeze" the message in Service Bus; it will be delivered to another listener
        myQueueClient.Defer(item.Value);
    }
}
My solution was to get the message SequenceNumber, then defer the message and add the SequenceNumber to the BlockingCollection. Once the BlockingCollection picks up a new item, it can receive the deferred message by the SequenceNumber and mark the message as complete. If for some reason the BlockingCollection doesn't process the SequenceNumber, it will remain in the queue as deferred, so it can be picked up later when the process is restarted. This protects against losing messages if the process abnormally terminates while there are still items in the BlockingCollection.
BlockingCollection<long> queueSequenceNumbers = new BlockingCollection<long>();
//This finds any deferred/unfinished messages on startup.
BrokeredMessage existingMessage = client.Peek();
while (existingMessage != null)
{
if (existingMessage.State == MessageState.Deferred)
{
queueSequenceNumbers.Add(existingMessage.SequenceNumber);
}
existingMessage = client.Peek();
}
//setup the message handler
Action<BrokeredMessage> processMessage = new Action<BrokeredMessage>((message) =>
{
try
{
//skip deferred messages if they are already in the queueSequenceNumbers collection.
if (message.State != MessageState.Deferred || (message.State == MessageState.Deferred && !queueSequenceNumbers.Any(x => x == message.SequenceNumber)))
{
message.Defer();
queueSequenceNumbers.Add(message.SequenceNumber);
}
}
catch (Exception ex)
{
// Indicates a problem, unlock message in queue
message.Abandon();
}
});
// Callback to handle newly received messages
client.OnMessage(processMessage, new OnMessageOptions() { AutoComplete = false, MaxConcurrentCalls = 1 });
//start the blocking loop to process messages as they are added to the collection
foreach (var queueSequenceNumber in queueSequenceNumbers.GetConsumingEnumerable())
{
var message = client.Receive(queueSequenceNumber);
//mark the message as complete so it's removed from the queue
message.Complete();
//do something with the message
}
