I am building a component that downloads information from given URLs and parses it into my business classes.
This has to happen in two stages. The pages that are being downloaded contain URLs to a set of further pages which are downloaded in a second stage.
I want all of this to be as parallel as possible and am trying to reduce the overall complexity by using the TPL Dataflow framework.
This is my (simplified) setup:
I post URLs to the buffer block which moves them to the download block.
In the download block the HTML is downloaded.
The download block has a conditional link to both parse blocks, so the HTML of Page Type A is moved to "Parse Page A", which is a TransformManyBlock.
Parse Page A generates a set of URLs to pages of type B.
Those are posted to the Download Block again.
Finally, the conditional link posts the HTML of Page Type B to the last block.
I am reusing the Download Block because I want to limit the maximum number of connections to the server this way, by setting MaxDegreeOfParallelism.
The setup would be a lot easier if I simply could use two separate download blocks, but then I would be unable to limit the number of connections this way and still have as many parallel connections as possible.
Now my problem with this setup:
How can I propagate completion correctly? I call Complete() on the Buffer Block when I am done posting all URLs. But I cannot propagate this to the Download Block directly, because it might still be needed for the URLs produced by the "Parse Page A" block, even after the Buffer Block has posted all of its URLs.
But I also cannot couple the Download Block's completion to the completion of both the Buffer Block and "Parse Page A", because then "Parse Page A" would never complete.
I also thought about calling Complete() on "Parse Page A" when the Buffer Block is done, but then there might still be data in the Download Block which would get rejected by "Parse Page A".
Is there a way out of this circular dilemma?
Or am I on the wrong track completely and should do it in some other fashion?
You logically have a linear pipeline, so I think that's how you should model it in code too. This means having a separate download block for each page type. This way, completion will work fine, but you'll have to deal with connection limiting separately.
There are two ways I can see to solve that:
If you're always connecting to the same server, you can limit the number of connections to it by using ServicePoints. You can either set that limit globally at the start of the program:
ServicePointManager.DefaultConnectionLimit = limit;
or just for the one server:
ServicePointManager.FindServicePoint(new Uri("http://myserver.com"))
    .ConnectionLimit = limit;
If using ServicePoints won't work for you (because you don't have just one server, because it affects the whole application, …), you can limit the requests manually using something like SemaphoreSlim. The semaphore would be set to your desired limit and it would be shared between the two download blocks.
MaxDegreeOfParallelism for each block would be set to the same limit (a higher value won't add anything, and a lower value could be inefficient), and their code could look like this:
await semaphore.WaitAsync();
try
{
    // perform the download
}
finally
{
    semaphore.Release();
}
If you do need this kind of limiting often, you could create a helper class that encapsulates this logic. Its usage could look like this:
var factory = new SharedLimitBlockFactory<Input, Output>(
    limit, input => Download(input));

var downloadBlock1 = factory.CreateBlock();
var downloadBlock2 = factory.CreateBlock();
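A minimal sketch of what such a factory could look like; the class is hypothetical (it mirrors the usage above, it is not an existing API) and assumes an async transform delegate:

using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical helper: every block it creates shares one SemaphoreSlim,
// so the total number of concurrent executions across all created blocks
// never exceeds the given limit.
public class SharedLimitBlockFactory<TInput, TOutput>
{
    private readonly SemaphoreSlim _semaphore;
    private readonly int _limit;
    private readonly Func<TInput, Task<TOutput>> _transform;

    public SharedLimitBlockFactory(int limit, Func<TInput, Task<TOutput>> transform)
    {
        _semaphore = new SemaphoreSlim(limit);
        _limit = limit;
        _transform = transform;
    }

    public IPropagatorBlock<TInput, TOutput> CreateBlock()
    {
        return new TransformBlock<TInput, TOutput>(
            async input =>
            {
                await _semaphore.WaitAsync();
                try
                {
                    return await _transform(input);
                }
                finally
                {
                    _semaphore.Release();
                }
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = _limit });
    }
}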
I am calling an external API which is slow. If I haven't called the API to fetch orders for a while, the results can be broken up into pages (pagination).
So fetching the orders can take multiple calls rather than one. Each call can take around 10 seconds, so the whole thing could be about a minute in total, which is far too long.
GetOrdersCall getOrders = new GetOrdersCall();
getOrders.DetailLevelList.Add(DetailLevelCodeType.ReturnSummary);
getOrders.CreateTimeFrom = lastOrderDate;
getOrders.CreateTimeTo = DateTime.Now;

PaginationType paging = new PaginationType();
paging.EntriesPerPage = 20;
paging.PageNumber = 1;
getOrders.Pagination = paging;

getOrders.Execute();
var response = getOrders.ApiResponse;
OrderTypeCollection orders = new OrderTypeCollection();
orders.AddRange(response.OrderArray); // include the first page's results

while (response != null && response.OrderArray.Count > 0)
{
    eBayConverter.ConvertOrders(response.OrderArray, 1);
    if (response.HasMoreOrders)
    {
        getOrders.Pagination.PageNumber++;
        getOrders.Execute();
        response = getOrders.ApiResponse;
        orders.AddRange(response.OrderArray);
    }
    else
    {
        break; // last page reached; without this the loop never ends
    }
}
This is a summary of my code above... getOrders.Execute() is where the API fires.
After the first getOrders.Execute() there is a pagination result which tells me how many pages of data there are. My thinking is that I should be able to start an asynchronous call for each page to populate the OrderTypeCollection, and when all the calls are made and the collection is fully loaded, commit it to the database.
I have never done asynchronous calls in C# before. I can roughly follow async/await, but I think my scenario falls outside the reading I have done so far.
Questions:
I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db.
I've read somewhere that I want to avoid combining the API call and the db write to avoid locking in SQL server - Is this correct?
If someone can point me in the right direction - It would be greatly appreciated.
I think I can set it up to fire off the multiple calls asynchronously
but I'm not sure how to check when all tasks have been completed i.e.
ready to commit to db.
Yes, you can break this up.
The problem is that eBay doesn't have an async Task Execute method, so you are left with blocking, threaded calls and no IO-optimised async/await pattern. If it did, you could take advantage of a TPL Dataflow pipeline, which is async-aware (and fun for the whole family); you still could anyway, though I propose a vanilla TPL solution here...
However, all is not lost: just fall back to Parallel.For and a ConcurrentBag<OrderType>.
Example
var concurrentBag = new ConcurrentBag<OrderType>();

// make the first call synchronously,
// add its results to concurrentBag,
// and take the total page count from the pagination result
int pageCount = ...;

// fetch the remaining pages in parallel
// (note: Parallel.For's upper bound is exclusive)
Parallel.For(2, pageCount + 1,
    page =>
    {
        // set up a fresh GetOrdersCall for this iteration
        // (a single call object is not safe to share between threads),
        // set its PageNumber to page, and make the call
        foreach (var order in getOrders.ApiResponse.OrderArray)
            concurrentBag.Add(order);
    });
// all orders have been downloaded
// save to db
Note: there is a MaxDegreeOfParallelism option (via ParallelOptions) which you can configure; maybe set it to 50. Though it won't really matter how high you set it: the Task Scheduler is not going to aggressively give you threads, maybe 10 or so initially, growing slowly.
The other way you can do this is to create your own TaskScheduler, or just spin up your own threads with the old-fashioned Thread class.
I've read somewhere that I want to avoid combining the API call and
the db write to avoid locking in SQL server - Is this correct?
If you mean locking as in slow DB inserts, use SQL bulk insert and update tools (e.g. SqlBulkCopy).
If you mean locking as in the DB deadlock error message, then that is an entirely different thing and worthy of its own question.
Additional Resources
Parallel.For(Int32, Int32, ParallelOptions, Action<Int32>)
Executes a for (For in Visual Basic) loop in which iterations may run in parallel and loop options can be configured.
ParallelOptions Class
Stores options that configure the operation of methods on the Parallel class.
MaxDegreeOfParallelism
Gets or sets the maximum number of concurrent tasks enabled by this ParallelOptions instance.
ConcurrentBag<T> Class
Represents a thread-safe, unordered collection of objects.
Yes, ConcurrentBag<T> can serve the purpose of one of your questions, which was: "I think I can set it up to fire off the multiple calls asynchronously but I'm not sure how to check when all tasks have been completed i.e. ready to commit to db."
The collection is thread-safe, so every task can add its results to it in parallel; combine it with Task.WaitAll to wait for all your tasks to complete before doing further processing.
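As a rough sketch of that pattern (FetchPage and pageCount are placeholders for your own paged API call and its pagination result):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var results = new ConcurrentBag<OrderType>();
var tasks = new Task[pageCount];

for (int page = 1; page <= pageCount; page++)
{
    int currentPage = page; // capture a copy of the loop variable
    tasks[currentPage - 1] = Task.Run(() =>
    {
        // FetchPage stands in for one paged call to the API
        foreach (var order in FetchPage(currentPage))
            results.Add(order);
    });
}

Task.WaitAll(tasks); // every task has finished: safe to commit to the database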
I'm working on a C# app with a time-consuming sequential workflow that must be performed asynchronously. It starts when the user presses a button and the app receives a few images captured from a camera within just a few milliseconds. The workflow then:
Saves the images to disk.
Aligns them.
Generates 3d data from them.
Groups them into a larger, collective object (called a "Scan").
Adds optional analysis data to this scan and executes it.
Finally saves the scan itself to an XML file alongside the images.
Some of these steps are optional and configurable.
Since the processing can take so long, there will often be a queue of scans awaiting processing, so I need to present the user with a visual representation of the queue of captured scans and their current processing state (e.g. "Saving", "Analyzing", "Finished", etc.).
I've looked into using TPL Dataflow for this. But while the mesh is simple to create, I'm not getting just how I might monitor the status of what is going on so that I can update a user interface. Do I try to link custom action blocks that post messages back to the UI for that? Something else?
Is TPL Dataflow even the right tool for this job?
Reporting Overall Progress
When you consider that a TPL Dataflow graph has a beginning and an end block, and that you know how many items you posted into the graph, all you need to do is track how many messages have reached the final block and compare it to the count of messages posted into the head. This allows you to report progress.
This works trivially if the blocks are 1:1; that is, for any message in, there is a single message out. If there is a one-to-many block, you will need to change your progress reporting accordingly.
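For example, a counting tail block might look like this (ReportProgress and totalPosted are illustrative, not part of TPL Dataflow):

using System.Threading;
using System.Threading.Tasks.Dataflow;

int processed = 0;

// the last block in the graph: count each message as it arrives
var tailBlock = new ActionBlock<Scan>(scan =>
{
    int done = Interlocked.Increment(ref processed);
    ReportProgress(done, totalPosted); // e.g. update a progress bar
});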
Reporting Job Stage Progress
If you wish to present progress of a job as it travels throughout the graph, you will need to pass job details to each block, not just the data needed for the actual block. A job is a single task that must span all the steps 1-6 listed in your question.
So, for example, step 2 may require image data in order to perform alignment, but it does not care about filenames, how many steps there are in the job, or anything else job-related. The block input alone carries insufficient detail about the current job's state and makes it difficult to look up the original job. You could refer to some external dictionary, but graphs are best designed when they are isolated and deal only with the data passed into each block.
So a simple example would be to change this minimal code from:
var alignmentBlock = new TransformBlock<Image, Image>(n => { ... });
...to:
var alignmentBlock = new TransformBlock<Job, Job>(job =>
{
    job.Stage = Stages.Aligning;

    // perform alignment here
    job.Aligned = ImageAligner.Align(job.Image, ...);

    // report progress
    job.Stage = Stages.AlignmentComplete;
    return job;
});
...and repeat the process for the other blocks.
The stage property could fire a PropertyChanged notification or use any other form of notification pattern suitable for your UI.
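For instance, a minimal sketch of a Job whose Stage setter raises the notification (the Stages enum comes from the snippet above):

using System.ComponentModel;

public class Job : INotifyPropertyChanged
{
    private Stages _stage;

    public Stages Stage
    {
        get { return _stage; }
        set
        {
            _stage = value;
            var handler = PropertyChanged;
            if (handler != null)
                handler(this, new PropertyChangedEventArgs("Stage"));
        }
    }

    public event PropertyChangedEventHandler PropertyChanged;
}

Keep in mind the setter runs on a dataflow thread, so a UI-bound handler must marshal back to the UI thread (e.g. via Dispatcher.BeginInvoke or Control.BeginInvoke).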
Notes
Now you will notice that I introduce a Job class that is passed as the only argument to each block. Job contains input data for the block as well as being a container for block output.
Now this will work, but the purist in me feels that it would be better to keep the job metadata separate from what is TPL block input and output; otherwise there is potential for state damage from multiple threads.
To get around this you may want to consider using Tuple<> and passing that into the block.
e.g.
var alignmentBlock = new TransformBlock<Tuple<Job, UnalignedImages>,
                                        Tuple<Job, AlignedImages>>(n => { ... });
I have 10 lists of over 100 MB each with emails, and I want to process them using multiple threads as fast as possible and without loading them into memory (something like reading line by line or reading in small blocks).
I have created a function which removes invalid emails based on a regex, and another which organizes them by domain into other lists.
I managed to do it using one thread with:
while (reader.Peek() != -1)
but it takes too damn long.
How can I use multiple threads (around 100-200), and maybe a BackgroundWorker or something, so I can still use the form while processing the lists in parallel?
I'm new to C# :P
Unless the data is on multiple physical discs, chances are that any more than a few threads will slow down, rather than speed up, the process.
What'll happen is that rather than reading consecutive data (pretty fast), you'll end up seeking to one place to read data for one thread, then seeking to somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower -- often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disc, then another to process the data -- but unless your processing is quite complex, or you have a lot of fast hard drives, one thread for processing may be entirely adequate.
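A rough sketch of that split, with one reading thread feeding one processing thread through a BlockingCollection (path and ProcessEmail are placeholders for your file and your regex logic):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var lines = new BlockingCollection<string>(boundedCapacity: 10000);

// one thread reads the file sequentially, line by line
var reader = Task.Run(() =>
{
    foreach (var line in File.ReadLines(path))
        lines.Add(line);
    lines.CompleteAdding();
});

// one thread consumes and processes the lines
var processor = Task.Run(() =>
{
    foreach (var line in lines.GetConsumingEnumerable())
        ProcessEmail(line); // your regex filtering / domain grouping
});

Task.WaitAll(reader, processor);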
There are multiple approaches to it:
1.) You can create threads explicitly, like Thread t = new Thread(), but creating and managing threads this way is expensive.
2.) You can use the .NET ThreadPool and pass your worker function's address to the QueueUserWorkItem static method of the ThreadPool class. This approach needs some manual code management and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task, each processing one list, which are executed in parallel using all your available processors on the machine, and pass that array to Task.WaitAll(Task[]) to wait for their completion. This approach is called task parallelism, and you can find detailed information on MSDN.
Task[] tasks = new Task[10]; // allocate the array before assigning into it
for (int i = 0; i < 10; i++)
{
    // automatically create an async task and execute it using a ThreadPool thread
    tasks[i] = Task.Factory.StartNew([address of function/lambda expression]);
}
try
{
    // wait for all tasks to complete
    Task.WaitAll(tasks);
}
catch (AggregateException ae)
{
    // handle the aggregate exception here;
    // it is raised if one or more tasks throw, and all exceptions from the
    // faulting tasks are accumulated in this exception object
}
// continue your processing further
You will want to take a look at the Task Parallel Library (TPL).
This library is made for exactly this kind of parallel work. It will perform your actions on the ThreadPool in whatever way is typically most efficient. The one thing I would caution is that if you run 100-200 threads at one time, you may have to deal with context switching; that is, unless you have 100-200 processors. A good rule of thumb is to only run as many tasks in parallel as you have processors.
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.
I would be inclined to use Parallel LINQ (PLINQ).
Something along the lines of:
Lists.AsParallel()
     .SelectMany(list => list)
     .Where(MyItemFilteringFunction)
     .GroupBy(DomainExtractionFunction);
AsParallel tells LINQ it can run the query in parallel (which means the ordering of everything that follows will not be maintained).
SelectMany takes your individual lists and unrolls them, so that all items from all lists are effectively in a single enumerable.
Where filters the items using your predicate function.
GroupBy collects them by key, where DomainExtractionFunction is a function which extracts a key (the domain name, in your case) from each item (i.e. the email).
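Put together with streaming file reads, that might look roughly like this (the file names, MyItemFilteringFunction, DomainExtractionFunction and WriteListForDomain are placeholders for your own paths and logic):

using System.IO;
using System.Linq;

var fileNames = new[] { "list1.txt", "list2.txt" /* etc. */ };

var byDomain = fileNames
    .AsParallel()
    .SelectMany(file => File.ReadLines(file)) // streams each file line by line
    .Where(MyItemFilteringFunction)           // drop invalid emails (your regex)
    .GroupBy(DomainExtractionFunction);       // key = the domain of each email

foreach (var group in byDomain)
    WriteListForDomain(group.Key, group);     // write each domain's list out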
Good morning,
At the startup of the application I am writing I need to read about 1,600,000 entries from a file to a Dictionary<Tuple<String, String>, Int32>. It is taking about 4-5 seconds to build the whole structure using a BinaryReader (using a FileReader takes about the same time). I profiled the code and found that the function doing the most work in this process is BinaryReader.ReadString(). Although this process needs to be run only once and at startup, I would like to make it as quick as possible. Is there any way I can avoid BinaryReader.ReadString() and make this process faster?
Thank you very much.
Are you sure that you absolutely have to do this before continuing?
I would examine the possibility of hiving off the task to a separate thread which sets a flag when finished. Then your startup code simply kicks off that thread and continues on its merry way, pausing only when both:
the flag is not yet set; and
no more work can be done without the data.
Often, the illusion of speed is good enough, as anyone who has coded up a splash screen will tell you.
Another possibility, if you control the data, is to store it in a more binary form so you can just blat it all in with one hit (i.e., no interpretation of the data, just read in the whole thing). That, of course, makes it harder to edit the data from outside your application but you haven't stated that as a requirement.
If it is a requirement or you don't control the data, I'd still look into my first suggestion above.
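As a sketch of that first suggestion, using a Task as both the background worker and the "flag" (LoadDictionary and path stand in for your existing read code; on .NET 4, use Task.Factory.StartNew instead of Task.Run):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// kick off the load at startup; the Task's completion is the "flag"
Task<Dictionary<Tuple<string, string>, int>> loadTask =
    Task.Run(() => LoadDictionary(path));

// ... continue with the rest of the startup ...

// blocks only if the load hasn't finished yet
Dictionary<Tuple<string, string>, int> data = loadTask.Result;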
If you think that reading the file line by line is the bottleneck, and depending on its size, you can try to read it all at once:
// read the entire file at once
string entireFile = System.IO.File.ReadAllText(path);
If this doesn't help, you can try adding a separate thread with a semaphore, which would start reading in the background immediately when the program starts, but block the requesting thread at the moment you try to access the data.
This is called a Future, and you have an implementation in Jon Skeet's miscutil library.
You call it like this at the app startup:
// following line invokes "DoTheActualWork" method on a background thread.
// DoTheActualWork returns an instance of MyData when it's done
Future<MyData> calculation = new Future<MyData>(() => DoTheActualWork(path));
And then, some time later, you can access the value in the main thread:
// following line blocks the calling thread until
// the background thread completes
MyData result = calculation.Value;
If you look at the Future's Value property, you can see that it blocks at the AsyncWaitHandle if the thread is still running:
public TResult Value
{
    get
    {
        if (!IsCompleted)
        {
            _asyncResult.AsyncWaitHandle.WaitOne();
            _lock.WaitOne();
        }
        return _value;
    }
}
If strings are repeated across tuples, you could reorganize your file so that all the distinct strings appear at the start, with the body of the file containing integer references to those strings. Your main Dictionary would not have to change, but during startup you would need a temporary dictionary (or array) mapping each reference (key) to its string (value).
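Sketched out, the read side might look like this; the file layout (a string table up front, then integer references) is hypothetical:

using System;
using System.Collections.Generic;
using System.IO;

using (var reader = new BinaryReader(File.OpenRead(path)))
{
    // read each distinct string exactly once
    int stringCount = reader.ReadInt32();
    var strings = new string[stringCount];
    for (int i = 0; i < stringCount; i++)
        strings[i] = reader.ReadString();

    // the body holds integer references into the string table
    var dict = new Dictionary<Tuple<string, string>, int>();
    int entryCount = reader.ReadInt32();
    for (int i = 0; i < entryCount; i++)
    {
        int first = reader.ReadInt32();
        int second = reader.ReadInt32();
        int value = reader.ReadInt32();
        dict[Tuple.Create(strings[first], strings[second])] = value;
    }
}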
I'm working on an image processing application where I have two threads on top of my main thread:
1 - CameraThread that captures images from the webcam and writes them into a buffer
2 - ImageProcessingThread that takes the latest image from that buffer for filtering.
The reason this is multithreaded is that speed is critical: I need CameraThread to keep grabbing pictures and have the latest capture ready for ImageProcessingThread to pick up while it is still processing the previous image.
My problem is finding a fast and thread-safe way to access that common buffer. I've figured that, ideally, it should be a triple buffer (image[3]), so that if ImageProcessingThread is slow, CameraThread can keep writing to the two other images, and vice versa.
What sort of locking mechanism would be the most appropriate for this to be thread-safe?
I looked at the lock statement, but it seems it would make one thread block, waiting for the other to finish, which would defeat the point of triple buffering.
Thanks in advance for any idea or advice.
J.
This could be a textbook example of the Producer-Consumer Pattern.
If you're going to be working in .NET 4, you can use the IProducerConsumerCollection<T> and associated concrete classes to provide your functionality.
If not, have a read of this article for more information on the pattern, and this question for guidance in writing your own thread-safe implementation of a blocking First-In First-Out structure.
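In .NET 4, a minimal sketch could use a BlockingCollection<T>, which wraps a ConcurrentQueue<T> by default (CaptureFrame, Process, the done flag and the Image type stand in for your own camera and processing code):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var frames = new BlockingCollection<Image>(boundedCapacity: 3);

// producer: the camera thread
Task.Factory.StartNew(() =>
{
    while (!done)
        frames.Add(CaptureFrame()); // blocks if all three slots are full
    frames.CompleteAdding();
});

// consumer: the image processing thread
Task.Factory.StartNew(() =>
{
    foreach (var frame in frames.GetConsumingEnumerable())
        Process(frame);
});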
Personally, I think you might want to look at a different approach. Rather than writing to a centralized "buffer" whose access you have to manage, could you switch to an approach that uses events? Once the camera thread has received an image, it could raise an event that passes the image data off to the process that actually handles the image processing.
An alternative would be to use a Queue, which is a FIFO (First In, First Out) data structure. It is not thread-safe for access, so you would have to lock it, but your locking time would be very minimal just to put an item in the queue. There are also thread-safe queue classes out there that you could use.
With your approach there are a number of issues you would have to contend with: blocking as you access the array, the question of what happens after you run out of available array slots, and so on.
Given the amount of processing needed for a picture, I don't think that a simple locking scheme would be your bottleneck. Measure before you start wasting time on the wrong problem.
Be very careful with "lock-free" solutions; they are always more complicated than they look.
And you need a Queue, not an array.
If you can use .NET 4, I would use the ConcurrentQueue<T>.
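A quick sketch of how the two threads would use it (Image stands in for your frame type):

using System.Collections.Concurrent;

var queue = new ConcurrentQueue<Image>();

// camera thread: no explicit lock needed
queue.Enqueue(capturedImage);

// processing thread: no explicit lock needed
Image image;
if (queue.TryDequeue(out image))
{
    // process the image
}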
You will have to run some performance metrics, but take a look at lock-free queues.
See this question and its associated answers, for example.
In your particular application, though, your processor is only really interested in the most recent image. In effect this means you only really want to maintain a queue of two items (the new item and the previous item) so that there is no contention between reading and writing. You could, for example, have your producer remove old entries from the queue once a new one is written.
Edit: having said all this, I think there is a lot of merit in what is said in Mitchel Sellers's answer.
I would look at using a ReaderWriterLockSlim, which allows fast reads and upgradable locks for writes.
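For example, a small sketch of a shared latest-frame holder (Image again stands in for your frame type):

using System.Threading;

public class FrameBuffer
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private Image _latestFrame;

    // called from the camera thread
    public void Write(Image frame)
    {
        _lock.EnterWriteLock();
        try { _latestFrame = frame; }
        finally { _lock.ExitWriteLock(); }
    }

    // called from the processing thread
    public Image Read()
    {
        _lock.EnterReadLock();
        try { return _latestFrame; }
        finally { _lock.ExitReadLock(); }
    }
}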
This isn't a direct answer to your question, but it may be better to rethink your concurrency model. Locks are a terrible way to synchronize anything: too low-level, error-prone, etc. Try to rethink your problem in terms of message-passing concurrency.
The idea here is that each thread is its own tightly contained message loop, and each thread has a "mailbox" for sending and receiving messages; we're going to use the term MailboxThread to distinguish these types of objects from plain-jane threads.
So instead of having two threads accessing the same buffer, you have two MailboxThreads sending and receiving messages between one another (pseudocode):
let filter =
    while true
        let image = getNextMsg() // blocks until the next message is received
        process image

let camera(filterMailbox) =
    while true
        let image = takePicture()
        filterMailbox.SendMsg(image) // sends a message asynchronously

let filterMailbox = Mailbox.Start(filter)
let cameraMailbox = Mailbox.Start(camera(filterMailbox))
Now your processing threads don't know or care about any buffers at all. They just wait for messages and process them whenever they're available. If you send too many messages for the filterMailbox to handle, those messages get enqueued to be processed later.
The hard part here is actually implementing your MailboxThread object. Although it requires some creativity to get right, it's wholly possible to implement these types of objects so that they only hold a thread open while processing a message, and release the executing thread back to the thread pool when there are no messages left to handle (this implementation allows you to terminate your application without dangling threads).
The advantage here is how threads send and receive messages without worrying about locking or synchronization. Behind the scenes, you need to lock your message queue between enqueuing and dequeuing messages, but that implementation detail is completely transparent to your client-side code.
Just an Idea.
Since we're talking about only two threads, we can make some assumptions.
Let's use your triple buffer idea. Assuming there is only one writer thread and one reader thread, we can toss a "flag" back and forth in the form of an integer. Both threads will continuously spin, but each updates only its own buffers.
WARNING: This will only work for 1 reader thread
Pseudo Code
Shared Variables:
int Status = 0; // 0 = ready to write; 1 = ready to read (should be volatile)
Buffer1 = New bytes[]
Buffer2 = New bytes[]
Buffer3 = New bytes[]
BufferTmp = null

thread1
{
    while (true)
    {
        WriteData(Buffer1);
        if (Status == 0)
        {
            // hand the freshly written buffer over via Buffer2
            BufferTmp = Buffer1;
            Buffer1 = Buffer2;
            Buffer2 = BufferTmp;
            Status = 1;
        }
    }
}

thread2
{
    while (true)
    {
        ReadData(Buffer3);
        if (Status == 1)
        {
            // take the handed-over buffer from Buffer2; only Buffer2 and
            // Buffer3 are touched here, so the writer keeps Buffer1 to itself
            BufferTmp = Buffer2;
            Buffer2 = Buffer3;
            Buffer3 = BufferTmp;
            Status = 0;
        }
    }
}
Just remember: your WriteData method shouldn't create new byte arrays, but update the existing one; creating new objects is expensive.
Also, you may want a Thread.Sleep(1) in an else branch accompanying each if statement; otherwise, on a single-core CPU, a spinning thread will increase the latency before the other thread gets scheduled. E.g. the write thread may spin 2-3 times before the read thread gets scheduled, because the scheduler sees the write thread doing "work".