PLINQ vs Tasks vs Async vs Producer/Consumer queue? What to use? - c#

I was reading C# 5.0 in a Nutshell and after reading the author's views, I am quite confused as to what I should adopt. My requirement is this: say I have a really long-running (computationally heavy) task, for example calculating the SHA1 (or some other) hash of millions of files, or really anything else that is computationally heavy and likely to take some time. What should my approach be toward developing it (in WinForms if that matters, using VS 2012, C# 5.0), so that I can also report progress to the user?
The following scenarios come to mind:
Create a Task (with the LongRunning option) that computes the hashes and reports progress to the user, either by implementing IProgress<T>/Progress<T> or by letting the task capture the SynchronizationContext and posting to the UI.
Create an async method like:
async Task CalculateHashesAsync()
{
    // await a task that calculates the hashes off the UI thread
    await Task.Run(() => CalculateHash());
    // how do I report progress???
}
Use TPL (or PLINQ) as:
void CalculateHashes()
{
    Parallel.For(0, allFiles.Count, i => calcHash(allFiles[i]));
    // how do I report progress here?
}
Use a producer / consumer Queue.
(I don't really know how.)
The author in the book says...
Running one long running task on a pooled thread won't cause
trouble. It's when you run multiple long running tasks in parallel
(particularly ones that block) that performance can suffer. In that
case, there are usually better solutions than
TaskCreationOptions.LongRunning.
If tasks are IO bound, TaskCompletionSource and asynchronous functions let you
implement concurrency with callbacks instead of threads.
If tasks are compute bound, a producer/consumer queue lets you throttle the concurrency for those tasks, avoiding starvation of other threads and processes.
About the Producer/Consumer the author says...
A producer/consumer queue is a useful structure, both in parallel
programming and general concurrency scenarios as it gives you precise
control over how many worker threads execute at once, which is useful
not only in limiting CPU consumption, but other resources as well.
So, should I not use a Task, meaning that the first option is out? Is the second one the best option? Are there any other options? And if I were to follow the author's advice and implement a producer/consumer queue, how would I do that? (I don't even have an idea of how to get started with a producer/consumer queue in my scenario, if that is indeed the best approach!)
I'd like to know whether someone has come across such a scenario and how they implemented it. If not, what would be the most performance-effective and/or easiest approach to develop and maintain? (I know the word performance is subjective, but let's just consider the very general case: that it works, and works well!)

really long running (computationally heavy) task, say for example, calculate SHA1 (or some other) hash of millions of files
That example clearly has both heavy CPU (hashing) and I/O (file) components. Perhaps this is a non-representative example, but in my experience even a secure hash is far faster than reading the data from disk.
If you just have CPU-bound work, the best solution is either Parallel or PLINQ. If you just have I/O-bound work, the best solution is to use async. If you have a more realistic and complex scenario (with both CPU and I/O work), then you should either hook up your CPU and I/O parts with producer/consumer queues or use a more complete solution such as TPL Dataflow.
TPL Dataflow works well with both parallelism (MaxDegreeOfParallelism) and async, and has a built-in producer/consumer queue between each pair of linked blocks.
One thing to keep in mind when mixing massive amounts of I/O and CPU usage is that different situations can cause massively different performance characteristics. To be safe, you'll want to throttle the data going through your queues so you won't end up with memory usage issues. TPL Dataflow has built-in support for throttling via BoundedCapacity.
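To make this concrete for the original hashing scenario, here is a minimal sketch of such a throttled pipeline, requiring the System.Threading.Tasks.Dataflow NuGet package. The bound of 100, the SHA1 hashing, and the progress-string format are illustrative assumptions, not a prescription (Convert.ToHexString needs .NET 5+):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class HashPipeline
{
    public static async Task RunAsync(string[] filePaths, IProgress<string> progress)
    {
        // CPU-bound block: hash file contents, limited to the core count.
        var hashBlock = new TransformBlock<string, (string Path, string Hash)>(
            path =>
            {
                using var sha1 = SHA1.Create();
                using var stream = File.OpenRead(path);
                return (path, Convert.ToHexString(sha1.ComputeHash(stream)));
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount,
                BoundedCapacity = 100 // throttle: producers wait when the queue is full
            });

        // Reporting block: forwards results to the caller's IProgress<T>.
        var reportBlock = new ActionBlock<(string Path, string Hash)>(
            result => progress.Report($"{result.Path}: {result.Hash}"),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 100 });

        hashBlock.LinkTo(reportBlock, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var path in filePaths)
            await hashBlock.SendAsync(path); // waits asynchronously when BoundedCapacity is reached

        hashBlock.Complete();
        await reportBlock.Completion;
    }
}
```

On the UI thread you would pass a Progress<string> that updates a label; Progress<T> captures the SynchronizationContext at construction, so reports are marshalled back to the UI automatically.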

Related

Type of threading to use in c# for heavy IO operations

I am tasked with updating a C# application (non-GUI) that is very single-threaded in its operation, and adding multi-threading to it to get it to turn queues of work over quicker.
Each thread will need to perform a very minimal amount of calculation, but most of the work will be calling and waiting on SQL Server requests. So, lots of waiting compared to CPU time.
A couple of requirements will be:
Running on some limited hardware (that is, just a couple of cores). The current system, when it's being "pushed", only takes about 25% CPU. But since it's mostly waiting for SQL Server to respond (a different server), we would like the capability to have more threads than cores.
Be able to limit the number of threads. I also can't just have an unlimited number of threads going either. I don't mind doing the limiting myself via an Array, List, etc.
Be able to keep track of when these threads complete so that I can do some post-processing.
It just seems to me that the .NET Framework has so many different ways of doing threads that I'm not sure if one is better than the other for this task. I'm not sure if I should be using Task, Thread, ThreadPool, something else... It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.
I'm not sure if I should be using Task, Thread, ThreadPool, something else...
In your case it matters less than you would think. You can focus on what fits your (existing) code style and dataflow the best.
since it's mostly doing waits for the SQL Server to respond
Your main goal would be to get as many of those SQL queries going in parallel as possible.
Be able to limit the number of threads.
Don't worry about that too much. On 4 cores, with 25% CPU, you can easily have 100 threads going; more on 64-bit. But you don't want thousands of threads. A .NET Thread uses 1 MB of stack minimum; estimate how much RAM you can spare.
So it depends on your application how many queries you can get running at the same time. Worry about thread-safety first.
When the number of parallel queries is > 1000, you will need async/await to run on fewer threads.
As long as it is < 100, just let threads block on I/O. Parallel.ForEach(), Parallel.Invoke(), etc. look like good tools.
The 100 - 1000 range is the grey area.
add multi-threading to it to get it to turn queues of work over quicker.
Each thread will need to perform a very minimal amount of calculations, but most of the work will be calling on and wait on SQL Server requests. So, lots of waiting as compared to CPU time.
With that kind of processing, it's not clear how multithreading will benefit you. Multithreading is one form of concurrency, and since your workload is primarily I/O-bound, asynchrony (and not multithreading) would be the first thing to consider.
It just seems to me that the .NET Framework has so many different ways of doing threads, I'm not sure if one is better than the other for this task.
Indeed. For reference, Thread and ThreadPool are pretty much legacy these days; there are much better higher-level APIs. Task should also be rare if used as a delegate task (e.g., Task.Factory.StartNew).
It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.
await will wait on one task at a time, yes. Task.WhenAll can be used to combine multiple tasks, and then you can await the combined task.
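A minimal sketch of that, where RunQueryAsync is a hypothetical stand-in for an asynchronous SQL Server call:

```csharp
using System.Linq;
using System.Threading.Tasks;

class WhenAllExample
{
    // Hypothetical stand-in for an asynchronous SQL Server query.
    static async Task<int> RunQueryAsync(int id)
    {
        await Task.Delay(10); // simulate I/O wait without holding a thread
        return id * 2;
    }

    public static async Task<int[]> RunAllAsync(int[] ids)
    {
        // Start all queries without awaiting, so they are in flight concurrently.
        Task<int>[] tasks = ids.Select(RunQueryAsync).ToArray();
        // Await the combined task; it completes when every query has finished,
        // and the results come back in the same order as the inputs.
        return await Task.WhenAll(tasks);
    }
}
```

Any post-processing placed after the await then runs exactly once, when all queries are done.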
get it to turn queues of work over quicker.
Be able to limit the number of threads.
Be able to keep track of when these threads complete so that I can do some post-processing.
It sounds to me that TPL Dataflow would be the best approach for your system. Dataflow allows you to define a "pipeline" through which data flows, with some steps being asynchronous (e.g., querying SQL Server) and other steps being parallel (e.g., data processing).
I was asking a high-level question to try and get back a high-level answer.
You may be interested in my book.
The TPL Dataflow library is probably one of the best options for this job. Here is how you could construct a simple dataflow pipeline consisting of two blocks. The first block accepts a filepath and produces some intermediate data, that can be later inserted to the database. The second block consumes the data coming from the first block, by sending them to the database.
var inputBlock = new TransformBlock<string, IntermediateData>(filePath =>
{
    return GetIntermediateDataFromFilePath(filePath);
}, new ExecutionDataflowBlockOptions()
{
    MaxDegreeOfParallelism = Environment.ProcessorCount // What the local machine can handle
});

var databaseBlock = new ActionBlock<IntermediateData>(item =>
{
    SaveItemToDatabase(item);
}, new ExecutionDataflowBlockOptions()
{
    MaxDegreeOfParallelism = 20 // What the database server can handle
});

inputBlock.LinkTo(databaseBlock);
Now every time a user uploads a file, you just save the file in a temp path, and post the path to the first block:
inputBlock.Post(filePath);
And that's it. The data will flow from the first to the last block of the pipeline automatically, transformed and processed along the way, according to the configuration of each block.
This is an intentionally simplified example to demonstrate the basic functionality. A production-ready implementation would probably have more options defined, like CancellationToken and BoundedCapacity; would watch the return value of inputBlock.Post to react in case the block can't accept the job; might propagate completion; would watch the databaseBlock.Completion property for errors; etc.
If you are interested at following this route, it would be a good idea to study the library a bit, in order to become familiar with the options available. For example there is a TransformManyBlock available, suitable for producing multiple outputs from a single input. The BatchBlock may also be useful in some cases.
TPL Dataflow is built into .NET Core and available as a package for .NET Framework. It has some learning curve and some gotchas to be aware of, but it's nothing terrible.
It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.
That is wrong. Async/await is just syntax that simplifies a state-machine mechanism for asynchronous code. It waits without consuming any thread; in other words, the async keyword does not create a thread, and await does not hold up any thread.
Be able to limit the number of threads
see How to limit the amount of concurrent async I/O operations?
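The usual technique from that question is to gate the work with a SemaphoreSlim. A minimal sketch, assuming a limit of 10 concurrent operations (the limit and the Func<Task<int>> shape are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThrottledQueries
{
    // At most 10 operations may be in flight at once.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10);

    public static async Task<int> RunThrottledAsync(Func<Task<int>> operation)
    {
        await Gate.WaitAsync(); // waits asynchronously, without blocking a thread
        try
        {
            return await operation(); // e.g. a SQL query
        }
        finally
        {
            Gate.Release(); // let the next queued operation start
        }
    }
}
```

You can start hundreds of these calls at once; only 10 ever run concurrently, and the rest queue on the semaphore rather than on threads.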
Be able to keep track of when these threads complete so that I can do some post-processing.
If you don't use "fire and forget" pattern then you can keep track of the task and its exceptions just by writing await task
var task = MethodAsync();
await task;
PostProcessing();
async Task MethodAsync(){ ... }
Or, for a similar approach, you can use ContinueWith:
var task = MethodAsync();
await task.ContinueWith(t => PostProcessing());
async Task MethodAsync(){ ... }
read more:
Releasing threads during async tasks
https://learn.microsoft.com/en-us/dotnet/standard/asynchronous-programming-patterns/?redirectedfrom=MSDN

C# TPL Threading Task

What is the difference between the Task class and the Parallel class, both part of the TPL, from an implementation point of view?
I believe the Task class has more benefits than ThreadPool and Thread, but context switching still happens with the Task class as well.
And the Parallel class is basically designed to run programs on multicore processors?
Your question is extremely broad and could take a very detailed answer, but let me restrict myself to the specifics.
Task wraps a method for execution down the line; it uses a lambda (an Action or Func delegate) to do so. You can wrap now and execute any time later.
Parallel is an API which helps achieve data parallelism: you divide a collection (an IEnumerable type) into smaller chunks, each chunk is executed in parallel, and the results are finally aggregated.
There are broadly two kinds of parallelism. In one, you subdivide a bigger task into smaller ones, wrap them in a Task type, and wait for all or some of them to complete in parallel. This is task parallelism.
In the other, you take each data unit in a collection and work on it in a mutually exclusive manner; this is data parallelism, achieved by the Parallel.ForEach and Parallel.For APIs.
These were introduced from .NET 4.0 onward to make parallelism easy for the developer; before that we had to dabble with the Thread and ThreadPool classes, which require a much more in-depth understanding of how threads work. Here a lot of the complexity is taken care of internally.
However, don't be under the impression that the current mechanism doesn't use threads. Both of the above-mentioned forms of parallelism rely completely on ThreadPool threads, which is why things like context switching and multiple threads being invoked still happen; Microsoft has just made the developer's life easier by handling it internally.
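A minimal sketch contrasting the two forms; the squaring is a hypothetical stand-in for real work:

```csharp
using System.Threading.Tasks;

class ParallelismKinds
{
    // Task parallelism: subdivide the work, wrap each part in a Task,
    // and wait for both parts to complete.
    public static long[] SquaresWithTasks(long[] input)
    {
        int mid = input.Length / 2;
        var first = Task.Run(() => SquareRange(input, 0, mid));
        var second = Task.Run(() => SquareRange(input, mid, input.Length));
        Task.WaitAll(first, second);
        return input;
    }

    // Data parallelism: Parallel.For partitions the index range into chunks
    // and processes each chunk on a ThreadPool thread.
    public static long[] SquaresWithParallelFor(long[] input)
    {
        Parallel.For(0, input.Length, i => input[i] *= input[i]);
        return input;
    }

    static void SquareRange(long[] data, int from, int to)
    {
        for (int i = from; i < to; i++) data[i] *= data[i];
    }
}
```

Both versions end up running on ThreadPool threads; the difference is whether you partition the work yourself (tasks) or let the API partition the data for you (Parallel).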
You may want to go through following links for a better understanding, let me know if there's still a specific query:
Parallel.ForEach vs Task.Factory.StartNew
Past, Present and Future of Parallelism
Parallel.ForEach vs Task.Run and Task.WhenAll
TPL is designed to minimize pre-emptive context switching (caused by thread oversubscription: having more threads than cores). Task abstractions, of which TPL is an implementation, are designed for cooperative parallelism, where the developer controls when a task will relinquish its execution (typically upon completion). If you schedule more tasks than you have cores, TPL will only execute concurrently approximately as many tasks as you have cores; the rest will be queued. This promotes throughput, since it avoids the overheads of context switching, but reduces responsiveness, as each task may take longer to start being processed.
The Parallel class is yet a higher level of abstraction that builds on top of TPL. Implementation-wise, Parallel generates a graph of tasks, but can use heuristics for deciding the granularity of the said tasks depending on your work.

Optimum use of Concurrent Collections with Threads Vs. Tasks

I've been reading this article on MSDN about C# Concurrent Collections. It talks about the optimum threading to use for particular scenarios to get the most benefit out of the collections e.g:
ConcurrentQueue performs best when one dedicated thread is queuing and one dedicated thread is de-queuing. If you do not enforce this rule, then Queue might even perform slightly faster than ConcurrentQueue on computers that have multiple cores.
Is this advice still valid when one is using Tasks instead of raw Threads? From my (limited) understanding of C# Tasks, there is no guarantee that a particular Task will always run on the same thread between context switches, or does maintaining the stack frame mean that the same rules apply in terms of best usage?
Thanks.
One task always runs on the same thread. TPL is a user-mode library. User mode has no (practical) way of migrating executing code from thread to thread. Also there would be no point to doing that.
This advice applies exactly to tasks as it does to threads.
What that piece of advice means to say is that at the same time there should be one producer and one consumer only. You can have 100 threads enqueuing from time to time as long as they do not contend.
(I'm not questioning that advice here since that is out of scope for this question. But that is what's meant here.)
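A minimal sketch of that shape: one dedicated producer task enqueuing and one dedicated consumer task dequeuing, which is the fast path ConcurrentQueue is optimised for. The summing is a hypothetical stand-in for real work, and the busy-wait loop is for brevity only; a real consumer would use BlockingCollection or similar:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class SingleProducerConsumer
{
    static volatile bool _producerDone;

    public static long SumWithQueue(int count)
    {
        var queue = new ConcurrentQueue<int>();
        _producerDone = false;
        long sum = 0;

        // One dedicated producer enqueues all items.
        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= count; i++) queue.Enqueue(i);
            _producerDone = true;
        });

        // One dedicated consumer dequeues until the producer finishes
        // and the queue drains.
        var consumer = Task.Run(() =>
        {
            while (!_producerDone || !queue.IsEmpty)
                if (queue.TryDequeue(out int item)) sum += item;
        });

        Task.WaitAll(producer, consumer);
        return sum;
    }
}
```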

Task Parallel Library and IIS worker Threads?

I want to use Task Parallel Library for some calculation intensive tasks, but I have been told by a colleague there is a huge overhead for IIS creating worker threads.
I am not quite sure what is done when you call Task.Factory.StartNew()... say, 100 times. How does IIS handle this? Is it a huge risk, or are there ways to make this very beneficial for an application?
First: Tasks != Threads. You may have many tasks being serviced by a few threads (which are already pooled).
As a general rule, I'm against running long-running processes on web servers. There are tons of problems keeping long-running jobs up, and you tend to reduce your web server's scalability, especially if you are parallelizing long-running, CPU-intensive jobs. Don't forget that the optimal number of threads to have running on a machine is equal to the number of "logical" cores. You want to avoid creating excess threads (each managed thread eats something like a megabyte in overhead). Running CPU-intensive jobs takes CPU time away from serving requests.
In my opinion the best way to use the TPL on a web server is to use it with the goal in mind that you are making requests as non-blocking as possible, which allows the greatest number of requests to be served with the smallest number of threads. Keep in mind that many people decide that the extra scale gained by highly asynchronous request handling is not worth the extra complexity. It depends on your specific case.
So in short, running many long running cpu bound tasks on a web server risks your scalability. Doesn't really matter if you are using tasks, threads, backgroundworkers, or the threadpool. It boils down to the same thing.
One of the great things about the Task abstraction is that it abstracts creating threads away. What that means is that the TPL (actually, the ThreadPool) can decide what the best amount of actual threads is. Because of this, creating 100 Tasks most likely won't create 100 Threads. Because of that, you don't have to worry about the overhead of creating Threads.
But it also depends on what kind of Tasks they are. If you have 100 Tasks that perform some long IO-bound operations and so they block most of the time, that's not a good use of TPL and your code will be quite inefficient (and you may actually end up with 100 Threads).
On the other hand, if you have 100 CPU-bound, relatively short Tasks, that's the sweet spot of TPL and you will get good efficiency.
If you are really concerned about efficiency, you should also know that Tasks do have some overhead. Because of that, in some cases it might make sense to merge multiple Tasks into one larger one to make the overhead smaller. Or you can use something that already does that: Parallel.ForEach or Parallel.For, if they fit your use case. As another advantage, code using them will be more readable than using Tasks manually.
How about just creating a service to handle this work? You'll be much better off in terms of scaling and can isolate that unit of work nicely... even if the work is compute-bound.
In my opinion, don't use the ThreadPool, BackgroundWorker, or Thread directly in ASP.NET. In your case, the TPL simply wraps the thread pool. It's usually more trouble than it's worth.
Threading overheads are the same for any host. Has nothing to do with IIS, at least when it comes to performance.
There are other concerns as well. For example, at application shutdown, user threads are rudely aborted.

Embarrassingly parallelizable tasks in .NET

I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. The task is created by reading data from the database but a collection of all tasks would exceed the amount of memory on the machine so tasks have to be created, processed and disposed. I am wondering what would be a good approach to solve this problem? I am thinking the following two approaches:
Implement a synchronized task queue. Implement a producer (task creator) that reads data from the database and puts tasks in the queue (limiting the number of tasks currently in the queue to a constant value, to make sure the amount of memory is not exceeded). Have multiple consumer processes (task processors) that read tasks from the queue, process them, store the results, and dispose of the tasks. What would be a good number of consumer processes in this approach?
Use the .NET Parallel Extensions (PLINQ or Parallel.For), but I understand that a collection of tasks has to be created (can we add tasks to the collection while processing in the parallel for?). So we would create batches of tasks, say N tasks at a time, process that batch, and then read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren't CPU bound then you'll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
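A bounded producer/consumer queue along these lines can be sketched with BlockingCollection<T>, which wraps a ConcurrentQueue by default; the capacity of 100 and the summing work are illustrative assumptions:

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class BoundedPipeline
{
    public static long Process(int taskCount, int workerCount)
    {
        // Bounded queue: Add blocks when 100 items are pending,
        // so memory usage stays under control.
        using var queue = new BlockingCollection<int>(boundedCapacity: 100);
        long total = 0;

        var workers = new Task[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Run(() =>
            {
                // GetConsumingEnumerable blocks until an item arrives,
                // and exits once CompleteAdding has been called and the queue drains.
                foreach (int item in queue.GetConsumingEnumerable())
                    Interlocked.Add(ref total, item); // stand-in for real processing
            });
        }

        // Producer: e.g. read rows from the database one at a time.
        for (int i = 1; i <= taskCount; i++) queue.Add(i);
        queue.CompleteAdding();

        Task.WaitAll(workers);
        return total;
    }
}
```

For CPU-bound work, a workerCount near Environment.ProcessorCount is a reasonable starting point; for I/O-bound work you would experiment with higher values.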
I've not had the opportunity to actually use PLINQ; however, I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e., the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it could process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
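A sketch of that idea: an iterator block yields work items lazily, and PLINQ pulls from it as its worker threads need input, so the full task set is never materialised in memory (ReadWorkItems and the Select body are hypothetical stand-ins for reading from the database and the real computation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LazyPlinq
{
    // Iterator block: items are produced one at a time, on demand.
    static IEnumerable<int> ReadWorkItems(int count)
    {
        for (int i = 1; i <= count; i++)
            yield return i; // stand-in for reading one row from the database
    }

    public static long ProcessAll(int count)
    {
        // PLINQ consumes the iterator in chunks as workers become free.
        return ReadWorkItems(count)
            .AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .Select(item => (long)item * 2) // stand-in for the real computation
            .Sum();
    }
}
```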
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything and items will be run as threads become available to the pool without overwhelming the system. The only trick is determining the optimum number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer we need a few questions answered.
Is each individual task parallelizable? Or is each task the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good thing to use to do this. It might also give you some extra benefits such as pause/resume on your tasks.
