I have a sequence of Images (IObservable<ImageSource>) that goes through this "pipeline".
Each image is recognized using OCR
If the results have valid values, they are uploaded to a service that can register a set of results at a given time (not concurrently).
If the results have any invalid value, they are presented to the user in order to fix them. After they are fixed, the process continues.
During the process, the UI should stay responsive.
The problem is that I don't know how to handle the case when the user has to interact. I just cannot do this
subscription = images
    .Do(source => source.Freeze())
    .Select(image => OcrService.Recognize(image))
    .Subscribe(ocrResults => Upload(ocrResults));
...because when ocrResults have to be fixed by the user, the flow should be kept on hold until the valid values are accepted (i.e. the user could execute a Command by clicking a Button)
How do I say: if the results are NOT valid, wait until the user fixes them?
This seems to be a mix of UX, WPF and Rx all wrapped up in one problem. Trying to solve it with only Rx is probably going to send you into a tailspin. I am sure you could solve it with just Rx, and no more thought about it, but would you want to? Would it be testable, loosely coupled and easy to maintain?
In my understanding of the problem you have the following steps:
User Uploads/Selects some images
The system performs OCR on each image
If the OCR tool deems the image source to be valid, the result of the processing is uploaded
If the OCR tool deems the image source to be invalid, the user "fixes" the result and the result is uploaded
But this may be better described as
User Uploads/Selects some images
The system performs OCR on each image
The result of the OCR is placed in a validation queue
While the result is invalid, a user is required to manually update it to a valid state.
The valid result is uploaded
So this to me seems like you need a task/queue based UI so that a user can see invalid OCR results that they need to work on. This also tells me that if a person is involved, it should probably be outside of the Rx query.
Step 1 - Perform OCR
subscription = images
    .Subscribe(image =>
    {
        //image.Freeze() -- probably should be done by the source sequence
        var result = _ocrService.Recognize(image);
        _validator.Enqueue(result);
    });
Step 2 - Validate Result
//In the Enqueue method: if the queue is empty, ProcessHead();
//else add to the queue.
//When the head item is updated, ProcessHead().
//ProcessHead checks whether the head item is valid; if it is, uploads it and removes it from the queue. Once removed from the queue, if (!IsEmpty) { ProcessHead(); }
//Display the head of the queue (and probably the queue depth) to the user so they can interact with it.
Step 3 - Upload result
Upload(ocrResults)
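The validation queue described in step 2 could be sketched roughly like this (OcrResult, IsValid and the upload delegate are assumed names for illustration, not any real API):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type produced by the OCR step.
public class OcrResult
{
    public bool IsValid { get; set; }
    public string Text { get; set; }
}

public class ValidationQueue
{
    private readonly Queue<OcrResult> _queue = new Queue<OcrResult>();
    private readonly Action<OcrResult> _upload;

    public ValidationQueue(Action<OcrResult> upload)
    {
        _upload = upload;
    }

    // Raised so the UI can display the head item (and queue depth) for fixing.
    public event Action<OcrResult> HeadChanged;

    public int Count { get { return _queue.Count; } }

    public void Enqueue(OcrResult result)
    {
        _queue.Enqueue(result);
        if (_queue.Count == 1)
            ProcessHead();
    }

    // Call this after the user has edited the head item.
    public void HeadUpdated()
    {
        ProcessHead();
    }

    private void ProcessHead()
    {
        // Upload valid items from the head; stop at the first invalid one,
        // which is then surfaced to the user.
        while (_queue.Count > 0 && _queue.Peek().IsValid)
            _upload(_queue.Dequeue());

        if (_queue.Count > 0 && HeadChanged != null)
            HeadChanged(_queue.Peek());
    }
}
```

The UI binds to HeadChanged to show the item being fixed, and the Rx subscription above simply calls Enqueue; the queue itself stays plain, testable C#.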
So here Rx is just a tool in our arsenal, not the one hammer that needs to solve all problems. I have found that with most "Rx" problems that grow in size, that Rx just acts as the entry and exit points for various Queue structures. This allows us to make the queuing in our system explicit instead of implicit (i.e. hidden inside of Rx operators).
I'm assuming your UploadAsync method returns a Task to allow you to wait for it to finish? If so, there are overloads of SelectMany that handle tasks.
images.Select(originalImage => ImageOperations.Resize(originalImage))
    .SelectMany(resizedImg => imageUploader.UploadAsync(resizedImg))
    .Subscribe();
Assuming you've got an async method which implements the "user fix process":
/* Shows the image to the user, who fixes it; returns true if fixed,
   false if the result should be skipped */
Task<bool> UserFixesTheOcrResults(OcrResults ocrResults);
Then your observable becomes:
subscription = images
    .Do(source => source.Freeze())
    .Select(image => OcrService.Recognize(image))
    .Select(ocrResults =>
    {
        if (ocrResults.IsValid)
            return Observable.Return(ocrResults);
        else
            return UserFixesTheOcrResults(ocrResults)
                .ToObservable()
                .Where(wasFixed => wasFixed) // drop results the user skipped
                .Select(_ => ocrResults);
    })
    .Concat()
    .Subscribe(ocrResults => Upload(ocrResults));
Related
We have a data processing pipeline that we're trying to use the TPL Dataflow framework for.
Basic gist of the pipeline:
Iterate through CSV files on the filesystem (10,000)
Verify we haven't already imported its contents; if we have, ignore it.
Iterate through the contents of a single CSV file (20,000-120,000 rows) and create a data structure that fits our needs.
Batch up 100 of these new dataStructured items and push them into a Database
Mark the CSV file as being imported.
Now we have an existing Python file that does all the above in a very slow & painful way - the code is a mess.
My thinking was the following looking at TPL Dataflow.
BufferBlock<string> to Post all the files into
TransformBlock<string, SensorDataDto> predicate to detect whether to import this file
TransformBlock<string, SensorDataDto> reads through the CSV file and creates SensorDataDto structure
BatchBlock<SensorDataDto> is used within the TransformBlock delegate to batch up 100 requests.
4.5. ActionBlock<SensorDataDto> to push the 100 records into the Database.
ActionBlock to mark the CSV as imported.
I've created the first few operations and they're working (BufferBlock -> TransformBlock + predicate, and process if it hasn't been imported), but I'm unsure how to continue the flow so that I can post 100 items to the BatchBlock within the TransformBlock and wire up the subsequent actions.
Does this look right - basic gist, and how do I tackle the BufferBlock bits in a TPL data flowy way?
bufferBlock.LinkTo(readCsvFile, ShouldImportFile)
bufferBlock.LinkTo(DataflowBlock.NullTarget<string>())
readCsvFile.LinkTo(normaliseData)
normaliseData.LinkTo(updateCsvImport)
updateCsvImport.LinkTo(completionBlock)
batchBlock.LinkTo(insertSensorDataBlock)
bufferBlock.Completion.ContinueWith(t => readCsvFile.Complete());
readCsvFile.Completion.ContinueWith(t => normaliseData.Complete());
normaliseData.Completion.ContinueWith(t => updateCsvImport.Complete());
updateCsvImport.Completion.ContinueWith(t => completionBlock.Complete());
batchBlock.Completion.ContinueWith(t => insertSensorDataBlock.Complete());
Inside the normaliseData method I'm calling BatchBlock.Post<..>(...). Is that a good pattern, or should it be structured differently? My problem is that I can only mark the file as being imported after all the records have been pushed through.
Task.WhenAll(bufferBlock.Completion, batchBlock.Completion).Wait();
If we have a batch size of 100, what if only 80 are pushed in? Is there a way to drain the last 80?
I wasn't sure if I should link the BatchBlock into the main pipeline; I do wait until both are finished.
First of all, you don't need to use Completion in that manner; you can use the PropagateCompletion property of DataflowLinkOptions when linking:
// with predicate
bufferBlock.LinkTo(readCsvFile, new DataflowLinkOptions { PropagateCompletion = true }, ShouldImportFile);
// without predicate
readCsvFile.LinkTo(normaliseData, new DataflowLinkOptions { PropagateCompletion = true });
Now, back to your problem with batches. Maybe you can use a JoinBlock<T1, T2> or BatchedJoinBlock<T1, T2> here, by attaching them to your pipeline and gathering the results of the joins, so you get the full picture of the work being done. Or you can implement your own ITargetBlock<TInput> so you can consume the messages your own way.
According to the official docs, the blocks are greedy and gather data from linked blocks as soon as it becomes available, so join blocks may get stuck if one target is ready and the other is not, or a batch block holds only 80% of its batch size, so you need to keep that in mind. For your own implementation you can use the ITargetBlock<TInput>.OfferMessage method to get information from your sources.
BatchBlock<T> is capable of executing in both greedy and non-greedy modes. In the default greedy mode, all messages offered to the block from any number of sources are accepted and buffered to be converted into batches.
In non-greedy mode, all messages are postponed from sources until enough sources have offered messages to the block to create a batch. Thus, a BatchBlock<T> can be used to receive 1 element from each of N sources, N elements from 1 source, and a myriad of options in between.
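On the question of draining a partial batch (the 80-of-100 case): BatchBlock<T> has a TriggerBatch() method that immediately emits whatever has accumulated as a smaller batch, and completing the block also flushes the remainder as a final batch. A sketch, using the block names from the question:

```csharp
using System.Threading.Tasks.Dataflow;

var batchBlock = new BatchBlock<SensorDataDto>(100);
batchBlock.LinkTo(insertSensorDataBlock,
    new DataflowLinkOptions { PropagateCompletion = true });

// ... post items; when the source is exhausted:
batchBlock.TriggerBatch(); // emits the partial batch (e.g. the last 80) now
batchBlock.Complete();     // any remaining items are emitted as a final batch
```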
Is there a way to use an aggregate function (Max, Count, ...) with Buffer before a sequence is completed?
When the sequence completes this will produce results, but with a continuous stream it does not give any results.
I was expecting there is some way to make this work with Buffer.
IObservable<long> source;
IObservable<IGroupedObservable<long, long>> group = source
.Buffer(TimeSpan.FromSeconds(5))
.GroupBy(i => i % 3);
IObservable<long> sub = group.SelectMany(grp => grp.Max());
sub.Subscribe(l =>
{
Console.WriteLine("working");
});
Use Scan instead of Aggregate. Scan works just like Aggregate except that it sends out intermediate values as the stream advances. It is good for "running totals", which appears to be what you are asking for.
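For example, a running maximum and a running count might look like this (a sketch; Scan emits the accumulated value for every element instead of waiting for completion):

```csharp
using System;
using System.Reactive.Linq;

IObservable<long> source = Observable.Range(1, 10).Select(i => (long)i);

// Emits the max seen so far on every element.
IObservable<long> runningMax = source.Scan(long.MinValue, Math.Max);

// Emits the element count so far on every element.
IObservable<int> runningCount = source.Scan(0, (count, _) => count + 1);

runningMax.Subscribe(max => Console.WriteLine("max so far: " + max));
```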
All the "statistical" operators in Rx (Min/Max/Sum/Count/Average) propagate the calculated value only when the subscription completes; that is the big difference between Aggregate and Scan. Basically, if you want to be notified every time a new value is pushed into your subscription, you need to use Scan.
In your case, if you want to keep the same logic, you should combine it with the GroupByUntil or Window operators; the conditions you give them create and complete the group subscriptions regularly, and each completion pushes out the next value.
You can get more info here: http://www.introtorx.com/content/v1.0.10621.0/07_Aggregation.html#BuildYourOwn
By the way, I wrote an article related to what you want. Check it out at: http://www.codeproject.com/Tips/853256/Real-time-statistics-with-Rx-Statistical-Demo-App
Please bear with me as I am relatively new to Appium. I am writing C# tests in Appium for my Android app. I am stuck finding answers to questions below.
1) How to check if a particular element exists? Is there any boolean property or function returning true or false? The methods driver.GetElementById, driver.GetElementByName etc. throw exceptions if the element doesn't exist.
2) Suppose I want to write a test for login. The user enters username and password and hits the login button. The request goes to the server, which checks whether the username-password pair exists in the database. Meanwhile a loading indicator (progress dialog in Android) is shown on screen. How shall I make the test suspend its execution until the response comes from the server, assuming I don't want to use something like the Thread.Sleep function?
3) Can I check whether textfield validation has failed on screen? A control with a black background and white text is shown below the textfield upon validation failure if we set validation for that textfield through the setError function. Is there any way to check that validation has failed?
Looking forward to your answers. Thanks.
For the first 2 questions (this is what I do in Java; it can definitely be implemented in C#) -
1) Use the polling technique - in a loop, check the return of the following
#param - By by, int time
driver.findElement(By by);
This must not be null or empty.
If the element is not present within the threshold time, fail the test.
In Appium, isVisible() will be the same as the above, as an element that is not visible will not be present.
2) Check for the next activity to be awaited. Use the same polling technique to keep comparing the current activity with the awaited activity; if the awaited activity does not start within the threshold time, fail the test.
#param int time, String awaitedActivity
1) Get the current activity.
2) Compare with the awaited activity.
3) If same then break the loop.
4) Else sleep for a second and then continue till the time is exhausted.
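In C#, the polling from point 1 might look like this (a sketch using the Selenium client types; the timeout and polling interval are up to you):

```csharp
using System;
using System.Threading;
using OpenQA.Selenium;

public static class ElementPolling
{
    public static bool ElementExists(IWebDriver driver, By by, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                driver.FindElement(by); // throws if the element is not present
                return true;
            }
            catch (NoSuchElementException)
            {
                Thread.Sleep(500); // poll every half second
            }
        }
        return false; // threshold exceeded - fail the test
    }
}
```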
A TPL Dataflow block has .InputCount and .OutputCount properties. But it can be executing an item right now, and there is no property like .Busy (Boolean). So is there a way to know whether a block is currently operating on an item?
UPDATE:
Let me explain my issue. The picture shows my current Dataflow network scheme.
The BufferBlock holds URLs to load, a number of TransformBlocks load pages through proxy servers, and an ActionBlock at the end performs work on the loaded pages. The TransformBlocks have a predefined .BoundedCapacity, so the BufferBlock waits for one of the TransformBlocks to become free and then posts an item into it.
Initially I post all URLs to the BufferBlock. Also, if one of the TransformBlocks throws an exception while loading HTML, it returns its URL back to the BufferBlock. So my goal is to somehow wait until all of my URLs are guaranteed to be loaded and parsed. For now I'm waiting like this:
Do While _BufferBlock.Count > 0 Or _
GetLoadBlocksTotalInputOutputCount(_TransformBlocks) > 0 Or _
_ActionBlock.InputCount > 0
Await Task.Delay(1000)
Loop
Then I call TransformBlock.Complete on all of them. But in this case, the last URLs can still be loading in the TransformBlocks. If a last URL was not successfully loaded, it becomes 'lost', because none of the TransformBlocks would take it back. That's why I want to know whether the TransformBlocks are still operating. Sorry for my bad English.
Even if you could find out whether a block is processing an item, it wouldn't really help you achieve your goal. That's because you would need to check the state of all the blocks at exactly the same moment, and there is no way to do that.
What I think you need is to somehow manually track how many items have been fully processed and compare that with the total number of items to process.
You should know the number of items to process from the start (it's you who sends them to the buffer block). To track the number of items that have been fully processed, you can add a counter to your parsing action block (don't forget to make the counter thread-safe, since your action block is parallel).
Then, if the counter reaches the total number of items to process, you know that all work is done.
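A sketch of that counting approach (the page type, the parsing work, and the numbers are made-up placeholders):

```csharp
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

int processed = 0;
int total = 1000; // you know this up front: it's how many URLs you posted
var allDone = new TaskCompletionSource<bool>();

var parseBlock = new ActionBlock<string>(page =>
{
    // ... your existing parsing work on the loaded page ...

    // Interlocked keeps the counter thread-safe, since the block is parallel.
    if (Interlocked.Increment(ref processed) == total)
        allDone.TrySetResult(true);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Elsewhere, instead of polling the block counts:
await allDone.Task; // completes only when every URL is fully processed
```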
I have a webpage that is used to submit a CSV file to the server. I have to validate the file, for stuff like correct number of columns, correct data type, cross field validations, data-range validations, etc. And finally either show a successful message or return a CSV with error messages and line numbers.
Currently every row and every column is looped through to find all the errors in the CSV file. But this becomes very slow for bigger files, sometimes resulting in a server time-out. Can someone please suggest a better way to do this?
Thanks
To validate a CSV file you will surely need to check each column. The best way, if it is possible in your scenario, is to validate each entry as it is appended to the CSV file.
Edit
As an error was pinpointed by @accolaum, I have edited my code.
It will only work provided each row is delimited with a `\n`
If you only want to validate the number of columns, then it's easier: just check that each row splits into the expected number of columns.
bool file_isvalid;
string data = streamreader.ReadLine();
while (data != null)
{
    if (data.Split(',').Length == Num_Of_Columns)
    {
        file_isvalid = true;
        //Perform operation
    }
    else
    {
        file_isvalid = false;
        //Perform operation
    }
    data = streamreader.ReadLine();
}
Hope it helps
I would suggest a rule-based approach, similar to unit tests. Think of every(!) error that can possibly occur and order them by increasing abstraction level:
Correct file encoding
Correct number of lines/columns
Correct column headers
Correct number/text/date formats
Correct number ranges
Business rules?
...
These rules could also have automatic fixes. So if you could automatically detect the encoding, you could correct it before testing all the rules.
Implementation could be done using the command pattern:
public abstract class RuleBase
{
    public abstract bool Test();

    public virtual bool CanCorrect()
    {
        return false;
    }

    // Override together with CanCorrect() for rules with automatic fixes.
    public virtual void Correct() { }
}
Then create a subclass for each test you want to make and put them in a list.
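For example, a column-count rule might look like this (ColumnCountRule and its inputs are illustrative, not from any existing codebase):

```csharp
using System.Collections.Generic;
using System.Linq;

public abstract class RuleBase
{
    public abstract bool Test();
    public virtual bool CanCorrect() { return false; }
}

public class ColumnCountRule : RuleBase
{
    private readonly IReadOnlyList<string> _rows;
    private readonly int _expected;

    public ColumnCountRule(IReadOnlyList<string> rows, int expected)
    {
        _rows = rows;
        _expected = expected;
    }

    // Passes only when every row has the expected number of columns.
    public override bool Test()
    {
        return _rows.All(row => row.Split(',').Length == _expected);
    }
}

// Usage: run the rules in order, stopping at the first failure.
// var rules = new List<RuleBase> { new ColumnCountRule(rows, 5), /* ... */ };
// bool valid = rules.All(rule => rule.Test());
```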
The timeout can be overcome by using a background thread solely for testing incoming files. The user has to wait until his file is validated and becomes "active". When finished, you can forward him to the next page.
You may be able to optimize your code to perform faster, but what you really want to do is to spawn a worker thread to do the processing.
Two benefits of this
You can redirect the user to another page so that they know their request has been submitted
The worker thread can be given a callback so that it can report its status - if you want to, you could put a progress bar or a percentage on the 'submitted' page so that the user can see as their file is being processed.
It is not good design to have the user waiting for long running processes to complete - they should be given updates or notifications, rather than just a 'loading' icon on their browser.
edit: This is my answer because (1) I can't recommend code improvements without seeing your code, and (2) efficiency improvements are probably only going to yield incremental improvements (unless you are doing something really wrong), which won't solve your problem long term.
Validation of CSV data almost always needs to look at every single cell. Can you post some of your code? There may be ways to optimise it.
EDIT
in most cases this is the best solution
foreach(row) {
foreach (column) {
validate cell
}
}
if you were really keen, you could try something with regexes
foreach(row) {
validate row by regex
}
but then you are really just offloading the validation code from your code to the regex, and I really hate using regexes
You could use XmlReader and validate against an XSD