We have a data processing pipeline that we're trying to use the TPL Dataflow framework for.
Basic gist of the pipeline:
Iterate through CSV files on the filesystem (10,000)
Verify that we haven't already imported its contents; if we have, skip the file
Iterate through the contents of a single CSV file (20,000-120,000 rows) and create a data structure that fits our needs.
Batch up 100 of these new data structure items and push them into a database
Mark the CSV file as being imported.
Now we have an existing Python file that does all the above in a very slow & painful way - the code is a mess.
My thinking was the following looking at TPL Dataflow.
BufferBlock<string> to Post all the files into
TransformBlock<string, SensorDataDto> predicate to detect whether to import this file
TransformBlock<string, SensorDataDto> reads through the CSV file and creates SensorDataDto structure
BatchBlock<SensorDataDto> is used within the TransformBlock delegate to batch up 100 requests.
ActionBlock<SensorDataDto> to push the 100 records into the Database.
ActionBlock to mark the CSV as imported.
I've created the first few operations and they're working (BufferBlock -> TransformBlock + Predicate && Process if it hasn't been imported), but I'm unsure how to continue the flow so that I can post 100 items to the BatchBlock within the TransformBlock and wire up the following actions.
Does this look right - basic gist, and how do I tackle the BufferBlock bits in a TPL data flowy way?
bufferBlock.LinkTo(readCsvFile, ShouldImportFile);
bufferBlock.LinkTo(DataflowBlock.NullTarget<string>());
readCsvFile.LinkTo(normaliseData);
normaliseData.LinkTo(updateCsvImport);
updateCsvImport.LinkTo(completionBlock);
batchBlock.LinkTo(insertSensorDataBlock);
bufferBlock.Completion.ContinueWith(t => readCsvFile.Complete());
readCsvFile.Completion.ContinueWith(t => normaliseData.Complete());
normaliseData.Completion.ContinueWith(t => updateCsvImport.Complete());
updateCsvImport.Completion.ContinueWith(t => completionBlock.Complete());
batchBlock.Completion.ContinueWith(t => insertSensorDataBlock.Complete());
Inside the normaliseData method I'm calling BatchBlock.Post<..>(...), is that a good pattern or should it be structured differently? My problem is that I can only mark the file as being imported after all the records have been pushed through.
Task.WhenAll(bufferBlock.Completion, batchBlock.Completion).Wait();
If we have a batch of 100, what if 80 are pushed in, is there a way to drain the last 80?
I wasn't sure if I should Link the BatchBlock in the main pipeline, I do wait till both are finished.
First of all, you don't need to use Completion in that manner; you can set the PropagateCompletion property when linking:
// with predicate
bufferBlock.LinkTo(readCsvFile, new DataflowLinkOptions { PropagateCompletion = true }, ShouldImportFile);
// without predicate
readCsvFile.LinkTo(normaliseData, new DataflowLinkOptions { PropagateCompletion = true });
Now, back to your problem with batches. You could attach a JoinBlock<T1, T2> or BatchedJoinBlock<T1, T2> to your pipeline and gather the results of the joins, so that you get the full picture of the work being done. Alternatively, you could implement your own ITargetBlock<TInput> and consume the messages in your own way.
According to the official docs, the blocks are greedy and gather data from a linked block as soon as it becomes available. A join block may therefore get stuck if one target is ready and the other is not, and a batch block may sit at 80% of its batch size, so keep that in mind. In your own implementation you can use the ITargetBlock<TInput>.OfferMessage method to get information from your sources.
BatchBlock<T> is capable of executing in both greedy and non-greedy modes. In the default greedy mode, all messages offered to the block from any number of sources are accepted and buffered to be converted into batches.
In non-greedy mode, all messages are postponed from sources until enough sources have offered messages to the block to create a batch. Thus, a BatchBlock<T> can be used to receive 1 element from each of N sources, N elements from 1 source, and a myriad of options in between.
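For the "drain the last 80" part of the question, here is a minimal sketch rather than the asker's actual pipeline. It assumes the BatchBlock is linked with PropagateCompletion and relies on BatchBlock<T> emitting whatever it has buffered as a final, smaller batch once Complete() is called. The names BatchFlushDemo, RunAsync and insertBlock are made up for illustration, and plain integers stand in for SensorDataDto.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchFlushDemo
{
    // Posts `itemCount` integers through a BatchBlock of size `batchSize`
    // and returns the size of every batch the ActionBlock received.
    public static async Task<List<int>> RunAsync(int itemCount, int batchSize)
    {
        var batchSizes = new List<int>();

        var batchBlock = new BatchBlock<int>(batchSize);

        // Stand-in for the database insert. The default MaxDegreeOfParallelism
        // is 1, so the list is never touched by two threads at once.
        var insertBlock = new ActionBlock<int[]>(batch => batchSizes.Add(batch.Length));

        // PropagateCompletion forwards Complete() (and faults) to the target.
        batchBlock.LinkTo(insertBlock, new DataflowLinkOptions { PropagateCompletion = true });

        for (var i = 0; i < itemCount; i++)
        {
            batchBlock.Post(i);
        }

        // Completing the block emits the buffered remainder as a final, smaller batch.
        batchBlock.Complete();
        await insertBlock.Completion;

        return batchSizes;
    }
}
```

Calling `await BatchFlushDemo.RunAsync(280, 100)` should yield batches of 100, 100 and 80. If the block has to stay open, BatchBlock<T>.TriggerBatch() can be used to force out a partial batch without completing it.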
I am researching the possibility of using pipelines for processing binary messages coming from network.
The binary messages I will be processing come with a payload, and it is desirable to keep the payload in its binary form.
The idea is to read out the whole message and create a slice of the message and its payload. Once the message is completely read, it will be passed to a channel chain for processing. The processing will not be instant: it might take some time or be executed later, and the goal is not to have the pipe reader wait until the processing is complete. Once the message processing is complete, I would need to release the processed buffer region to the pipe writer.
Now of course I could just create a new byte array and copy the data coming from the pipe writer, but that would defeat the purpose of no-copy. So, as I understand it, I would need some buffer synchronization between the pipeline and the channel?
I looked at the available APIs of the pipe reader (AdvanceTo), where it is possible to tell the pipe reader what was consumed and what was examined, but I can't figure out how this could be synchronized outside of the pipe-reading method.
So the question would be whether there are some techniques or examples on how this can be achieved.
The buffer obtained from TryRead/ReadAsync is only valid until you call AdvanceTo, with the expectation that as soon as you've done that: anything you reported as consumed is available to be recycled for use elsewhere (which could be parallel/concurrent readers). Strictly speaking: even the bits you haven't reported as consumed: you still shouldn't treat as valid once you've called AdvanceTo (although in reality, it is likely that they'll still be the same segments - just: that isn't the concern of the caller; to the caller, it is only valid between the read and the advance).
This means that you explicitly can't do:
while (...)
{
    var result = await reader.ReadAsync();
    var buffer = result.Buffer;
    if (TryIdentifyFrameBoundary(buffer, out var frame)) {
        BeginProcessingInBackground(frame); // <==== THIS IS A PROBLEM!
        reader.AdvanceTo(frame.End, frame.End);
    }
    else { // take nothing
        reader.AdvanceTo(buffer.Start, buffer.End);
        if (result.IsCompleted) break; // that's all folks
    }
}
because the "in background" bit, when it fires, could now be reading someone else's data (due to it being reused already).
So: either you need to process the frame contents as part of the read loop, or you're going to have to make a copy of the data, most likely by using:
var len = checked ((int)buffer.Length);
var oversized = ArrayPool<byte>.Shared.Rent(len);
buffer.CopyTo(oversized);
and pass oversized to your background processing, remembering to only look at the first len bytes of it. You could pass this as a ReadOnlyMemory<byte>, but you need to consider that you're also going to want to return it to the array-pool afterwards (probably in a finally block), and passing it as a memory makes it a little more awkward (but not impossible, thanks to MemoryMarshal.TryGetArray).
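Put together, the copy-then-process approach might look like the sketch below. FrameCopy, ProcessInBackground and the `process` callback are hypothetical names introduced here, not part of the pipelines API. The copy runs synchronously, so the method is meant to be called from inside the read loop before AdvanceTo.

```csharp
using System;
using System.Buffers;
using System.Threading.Tasks;

public static class FrameCopy
{
    // Copies `frame` into a pooled array and hands the copy to `process`
    // on a background task. Once this method has returned, the background
    // work no longer depends on the pipe's buffer.
    public static Task ProcessInBackground(
        ReadOnlySequence<byte> frame,
        Action<ReadOnlyMemory<byte>> process) // hypothetical callback
    {
        var len = checked((int)frame.Length);
        var oversized = ArrayPool<byte>.Shared.Rent(len);
        frame.CopyTo(oversized); // synchronous copy, done before AdvanceTo

        return Task.Run(() =>
        {
            try
            {
                // The rented array may be larger than requested: expose only len bytes.
                process(new ReadOnlyMemory<byte>(oversized, 0, len));
            }
            finally
            {
                // Always give the array back, even if processing throws.
                ArrayPool<byte>.Shared.Return(oversized);
            }
        });
    }
}
```

In the read loop this would replace the problematic call, for example `_ = FrameCopy.ProcessInBackground(frame, HandleFrame);` followed by `reader.AdvanceTo(frame.End, frame.End);`, where HandleFrame is whatever processing routine you have. The callback must not keep the memory after it returns, because the array goes back to the pool at that point.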
Note: in early versions of the pipelines API, there was an element of reference-counting, which did allow you to preserve buffers, but it had a few problems:
it complicated the API hugely
it led to leaked buffers
it was ambiguous and confusing what "preserved" meant; is the count until it gets reused? or released completely?
so that feature was dropped.
I have C# list which contains around 8000 items (file paths). I want to run a method on all of these items in parallel. For this i have below 2 options:
1) Manually divide the list into small chunks (say, 500 items each), create an array of actions for these small lists, and then call Parallel.Invoke like below:
var partitionedLists = MainList.DivideIntoChunks(500);
List<Action> actions = new List<Action>();
foreach (var lst in partitionedLists)
{
actions.Add(() => CallMethod(lst));
}
Parallel.Invoke(actions.ToArray());
2) Second option is to run Parallel.ForEach like below
Parallel.ForEach(MainList, item => { CallMethod(item); });
What would be the best option here?
How does Parallel.ForEach divide the list into small chunks?
Please suggest, thanks in advance.
The first option is a form of task-parallelism, in which you divide your task into a group of sub-tasks and execute them in parallel. As is obvious from the code you provided, you are responsible for choosing the level of granularity [chunks] while creating the sub-tasks. Without appropriate heuristics the selected granularity might be too coarse or too fine, and the resulting performance gain might not be significant. Task-parallelism is used in scenarios where the operation to be performed takes similar time for all input values.
The second option is a form of data-parallelism, in which the input data is divided into smaller chunks based on the number of hardware threads/cores/processors available, and then each individual chunk is processed in isolation. In this case, the .NET library chooses the right level of granularity for you and ensures better CPU utilization. Conventionally, data-parallelism is used in scenarios when the operation to be performed can vary in terms of time taken, depending on the input value.
In conclusion, if your operation is more or less uniform over the range of input values and you know the right granularity [chunk size], go ahead with the first option. If however that's not the case or if you are unsure about the above questions, go with the second option which usually pans out better in most scenarios.
NOTE: If this is a very performance-critical component of your application, I would advise benchmarking both approaches in a production-like environment to get more data, in addition to the above recommendations.
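If you want explicit control over chunk size without building the action array yourself, one possible middle ground (not something the answer above prescribes) is to give Parallel.ForEach a range partitioner. The sketch below uses Partitioner.Create(fromInclusive, toExclusive, rangeSize) from System.Collections.Concurrent. ChunkedParallel and ProcessInChunks are hypothetical names, and `work` stands in for CallMethod.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkedParallel
{
    // Runs `work` on every item, handing each worker a contiguous range of
    // up to `chunkSize` indexes. Returns the number of items processed.
    public static int ProcessInChunks<T>(IReadOnlyList<T> items, int chunkSize, Action<T> work)
    {
        if (items.Count == 0) return 0; // Partitioner.Create rejects an empty range

        var processed = 0;
        var ranges = Partitioner.Create(0, items.Count, chunkSize);

        Parallel.ForEach(ranges, range =>
        {
            // range.Item1 is inclusive, range.Item2 is exclusive.
            for (var i = range.Item1; i < range.Item2; i++)
            {
                work(items[i]);
            }
            Interlocked.Add(ref processed, range.Item2 - range.Item1);
        });

        return processed;
    }
}
```

With the 8000 file paths from the question this could be called as `ChunkedParallel.ProcessInChunks(MainList, 500, CallMethod)`, assuming CallMethod accepts a single item. Whether 500 is a good chunk size is exactly the kind of thing the benchmarking note is about.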
I have a sequence of Images (IObservable<ImageSource>) that goes through this "pipeline".
Each image is recognized using OCR
If the results have valid values, they are uploaded to a service that can register a set of results at a given time (not concurrently).
If the results have any invalid value, they are presented to the user in order to fix them. After they are fixed, the process continues.
During the process, the UI should stay responsive.
The problem is that I don't know how to handle the case when the user has to interact. I just cannot do this
subscription = images
.Do(source => source.Freeze())
.Select(image => OcrService.Recognize(image))
.Subscribe(ocrResults => Upload(ocrResults));
...because when ocrResults have to be fixed by the user, the flow should be kept on hold until the valid values are accepted (ie. the user could execute a Command clicking a Button)
How do I say: if the results are NOT valid, wait until the user fixes them?
This seems to be a mix of UX, WPF and Rx all wrapped up in one problem. Trying to solve it with only Rx is probably going to send you into a tailspin. I am sure you could solve it with just Rx, and give it no more thought, but would you want to? Would it be testable, loosely coupled and easy to maintain?
In my understanding of the problem you have the following steps
User Uploads/Selects some images
The system performs OCR on each image
If the OCR tool deems the image source to be valid, the result of the processing is uploaded
If the OCR tool deems the image source to be invalid, the user "fixes" the result and the result is uploaded
But this may be better described as
User Uploads/Selects some images
The system performs OCR on each image
The result of the OCR is placed in a validation queue
While the result is invalid, a user is required to manually update it to a valid state.
The valid result is uploaded
So to me this suggests that you need a task/queue-based UI, so that a user can see the invalid OCR results that they need to work on. It also tells me that if a person is involved, that part should probably be outside of the Rx query.
Step 1 - Perform OCR
subscription = images
.Subscribe(image=>
{
//image.Freeze() --probably should be done by the source sequence
var result = _ocrService.Recognize(image);
_validator.Enqueue(result);
});
Step 2 - Validate Result
//In the Enqueue method, if queue is empty, ProcessHead();
//Else add to queue.
//When Head item is updated, ProcessHead();
//ProcessHead method checks if the head item is valid, and if it is uploads it and remove from queue. Once removed from queue, if(!IsEmpty) {ProcessHead();}
//Display Head of the Queue (and probably queue depth) to user so they can interact with it.
Step 3 - Upload result
Upload(ocrResults)
So here Rx is just a tool in our arsenal, not the one hammer that needs to solve all problems. I have found that with most "Rx" problems that grow in size, that Rx just acts as the entry and exit points for various Queue structures. This allows us to make the queuing in our system explicit instead of implicit (i.e. hidden inside of Rx operators).
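The queue described in Step 2 could be sketched roughly as follows. This is an illustration under stated assumptions rather than a finished component: ValidationQueue, NotifyHeadUpdated and the two delegates are hypothetical names, the class is not thread-safe, and it assumes all calls arrive on the UI thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the validation queue from Step 2.
// Not thread-safe: assumes every call is made on the UI thread.
public sealed class ValidationQueue<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly Func<T, bool> _isValid;
    private readonly Action<T> _upload;

    public ValidationQueue(Func<T, bool> isValid, Action<T> upload)
    {
        _isValid = isValid;
        _upload = upload;
    }

    // Number of results still waiting, e.g. for showing the queue depth.
    public int Count => _queue.Count;

    public void Enqueue(T result)
    {
        _queue.Enqueue(result);
        if (_queue.Count == 1) ProcessHead(); // the queue was empty before this item
    }

    // Call after the user has edited the head item.
    public void NotifyHeadUpdated() => ProcessHead();

    private void ProcessHead()
    {
        // Upload valid results in order and stop at the first invalid one,
        // which stays at the head until the user fixes it.
        while (_queue.Count > 0 && _isValid(_queue.Peek()))
        {
            _upload(_queue.Dequeue());
        }
    }
}
```

The Rx subscription from Step 1 then only calls Enqueue, and the command bound to the user's "accept" button calls NotifyHeadUpdated. Because results are uploaded one at a time from a single thread, the "not concurrently" requirement of the upload service is met as long as the upload delegate is synchronous.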
I'm assuming your UploadAsync method returns a Task to allow you to wait for it to finish? If so, there are overloads of SelectMany that handle tasks.
images.Select(originalImage => ImageOperations.Resize(originalImage))
.SelectMany(resizedImg => imageUploader.UploadAsync(resizedImg))
.Subscribe();
Assuming you've got an async method which implements the "user fix process":
/* shows the result to the user, who fixes it; returns true if fixed, false if it should be skipped */
Task<bool> UserFixesTheOcrResults(OcrResults ocrResults); // OcrResults: whatever type Recognize returns
Then your observable becomes:
subscription = images
.Do(source => source.Freeze())
.Select(image => OcrService.Recognize(image))
.Select(ocrResults=> {
if (ocrResults.IsValid)
return Observable.Return(ocrResults);
else
return UserFixesTheOcrResults(ocrResults).ToObservable().Select(_ => ocrResults);
})
.Concat()
.Subscribe(ocrResults => Upload(ocrResults));
Is there a way to use an aggregate function (Max, Count, ...) with Buffer before a sequence is completed?
When the sequence completes this produces results, but with a continuous stream it does not give any results.
I was expecting there to be some way to make this work with Buffer.
IObservable<long> source;
IObservable<IGroupedObservable<long, long>> group = source
.Buffer(TimeSpan.FromSeconds(5))
.GroupBy(i => i % 3);
IObservable<long> sub = group.SelectMany(grp => grp.Max());
sub.Subscribe(l =>
{
Console.WriteLine("working");
});
Use Scan instead of Aggregate. Scan works just like Aggregate except that it sends out intermediate values as the stream advances. It is good for "running totals", which appears to be what you are asking for.
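As a rough illustration of the difference, the sketch below keeps a running maximum with Scan and records every intermediate value while the stream is still open. It assumes the System.Reactive package is referenced. RunningMaxDemo and RunningMax are hypothetical names, and a Subject stands in for the real source.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Reactive.Subjects;

public static class RunningMaxDemo
{
    // Pushes `values` through a Subject and records everything Scan emits.
    public static List<long> RunningMax(IEnumerable<long> values)
    {
        var emitted = new List<long>();
        using var source = new Subject<long>();

        // Scan emits the accumulated value after every element, so the
        // subscriber gets results without waiting for OnCompleted.
        using var subscription = source
            .Scan(long.MinValue, (max, value) => Math.Max(max, value))
            .Subscribe(max => emitted.Add(max));

        foreach (var v in values)
        {
            source.OnNext(v); // OnCompleted is deliberately never called
        }

        return emitted;
    }
}
```

Feeding it 3, 1, 4, 1, 5 records 3, 3, 4, 4, 5. Replacing Scan with Max in the same code would record nothing, because Max only emits once the source completes.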
All the "statistical" operators in Rx (Min/Max/Sum/Count/Average) propagate the calculated value only when the source sequence completes. That is the big difference between Scan and Aggregate: if you want to be notified each time a new value is pushed, you need to use Scan.
In your case, if you want to keep the same logic, you should combine it with the GroupByUntil or Window operators. Both let you create and complete the group subscription regularly, and each completion is what pushes the next aggregated value.
You can get more info here: http://www.introtorx.com/content/v1.0.10621.0/07_Aggregation.html#BuildYourOwn
By the way I wrote a text related to what you want. Check in: http://www.codeproject.com/Tips/853256/Real-time-statistics-with-Rx-Statistical-Demo-App
A TPL Dataflow block has .InputCount and .OutputCount properties. But a block can be executing an item right now, and there is no property like .Busy (Boolean). So is there a way to know whether a block is currently operating on an item?
UPDATE:
Let me explain my issue. Here on pic is my current Dataflow network scheme.
A BufferBlock holds the URLs to load, a number of TransformBlocks load pages through proxy servers, and an ActionBlock at the end processes the loaded pages. The TransformBlocks have a predefined .BoundedCapacity, so the BufferBlock waits for any of the TransformBlocks to become free and then posts an item to it.
Initially I post all URLs to the BufferBlock. Also, if one of the TransformBlocks throws an exception while loading HTML, it returns its URL to the BufferBlock. So my goal is to somehow wait until all of my URLs are guaranteed to have been loaded and parsed. For now I'm waiting like this:
Do While _BufferBlock.Count > 0 Or _
GetLoadBlocksTotalInputOutputCount(_TransformBlocks) > 0 Or _
_ActionBlock.InputCount > 0
Await Task.Delay(1000)
Loop
Then I call TransformBlock.Complete on all of them. But in this case, the last URLs can still be loading in the TransformBlocks. If the last URL is not loaded successfully, it becomes 'lost', because none of the TransformBlocks will take it back. That's why I want to know whether the TransformBlocks are still operating. Sorry for my bad English.
Even if you could find out whether a block is processing an item, it wouldn't really help you achieve your goal. That's because you would need to check the state of all the blocks at exactly the same moment, and there is no way to do that.
What I think you need is to somehow manually track how many items have been fully processed and compare that with the total number of items to process.
You should know the number of items to process from the start (it's you who sends them to the buffer block). To track the number of items that have been fully processed, you can add a counter to your parsing action block (don't forget to make the counter thread-safe, since your action block is parallel).
Then, if the counter reaches the total number of items to process, you know that all work is done.
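One way to sketch that counter is shown below. CompletionCounter, ItemProcessed and AllProcessed are hypothetical names. The example is written in C# although the question uses VB, and it assumes the total number of URLs is known up front.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: completes a task once `total` items have been reported as done.
public sealed class CompletionCounter
{
    private readonly TaskCompletionSource<bool> _done =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    private int _remaining;

    public CompletionCounter(int total)
    {
        if (total <= 0) throw new ArgumentOutOfRangeException(nameof(total));
        _remaining = total;
    }

    // Await this instead of polling InputCount/OutputCount.
    public Task AllProcessed => _done.Task;

    // Call once per item at the end of the parsing action.
    // Interlocked makes this safe to call from parallel workers.
    public void ItemProcessed()
    {
        if (Interlocked.Decrement(ref _remaining) == 0)
        {
            _done.TrySetResult(true);
        }
    }
}
```

In the network from the question, the parsing ActionBlock would call ItemProcessed only after a page has been parsed successfully. A URL that is sent back to the BufferBlock after a failure is not counted, so `await counter.AllProcessed` finishes only when every URL has made it through. After that it is safe to call Complete on the blocks.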