I have some DB operations to perform and I tried using PLINQ:
someCollection.AsParallel()
.WithCancellation(token)
.ForAll(element => ExecuteDbOperation(element))
And I notice it is quite slow compared to:
var tasks = someCollection.Select(element =>
Task.Run(() => ExecuteDbOperation(element), token))
.ToList()
await Task.WhenAll(tasks)
I prefer the PLINQ syntax, but I am forced to use the second version for performance.
Can someone explain the big difference in performance?
My supposition is that this is because of the number of threads created.
In the first example this number will be roughly equal to the number of cores of your computer. By contrast, the second example will create as many threads as someCollection has elements. For I/O operations, that's generally more efficient.
The Microsoft guide "Patterns_of_Parallel_Programming_CSharp" recommends creating more threads than the default for I/O operations (p. 33):
var addrs = new[] { addr1, addr2, ..., addrN };
var pings = from addr in addrs.AsParallel().WithDegreeOfParallelism(16)
select new Ping().Send(addr);
Both PLINQ and Parallel.ForEach() were primarily designed to deal with CPU-bound workloads, which is why they don't work so well for your IO-bound work. For some specific IO-bound work, there is an optimal degree of parallelism, but it doesn't depend on the number of CPU cores, while the degree of parallelism in PLINQ and Parallel.ForEach() does depend on the number of CPU cores, to a greater or lesser degree.
Specifically, the way PLINQ works is to use a fixed number of Tasks, by default based on the number of CPU cores on your computer. This is meant to work well for a chain of PLINQ methods. But it seems this number is smaller than the ideal degree of parallelism for your work.
On the other hand Parallel.ForEach() delegates deciding how many Tasks to run to the ThreadPool. And as long as its threads are blocked, ThreadPool slowly keeps adding them. The result is that, over time, Parallel.ForEach() might get closer to the ideal degree of parallelism.
The right solution is to figure out what the right degree of parallelism for your work is by measuring, and then using that.
Ideally, you would make your code asynchronous and then use some approach to limit the degree of parallelism for async code.
Since you said you can't do that (yet), I think a decent solution might be to avoid the ThreadPool and run your work on dedicated threads (you can create those by using Task.Factory.StartNew() with TaskCreationOptions.LongRunning).
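A minimal sketch of that approach, assuming ExecuteDbOperation is your blocking database call (the int element type is illustrative):

```csharp
using System.Linq;
using System.Threading.Tasks;

class Sketch
{
    static void ExecuteDbOperation(int element)
    {
        // Stand-in for the real blocking database call.
    }

    static async Task RunAsync(int[] someCollection)
    {
        // LongRunning hints the scheduler to give each task its own
        // dedicated thread instead of borrowing a ThreadPool thread,
        // so blocking on I/O doesn't starve the pool.
        var tasks = someCollection
            .Select(element => Task.Factory.StartNew(
                () => ExecuteDbOperation(element),
                TaskCreationOptions.LongRunning))
            .ToList();

        await Task.WhenAll(tasks);
    }
}
```

Note that this creates one dedicated thread per element, so it only makes sense when the collection is reasonably small or you process it in batches.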
If you're okay with sticking to the ThreadPool, another solution would be to use PLINQ ForAll(), but also call WithDegreeOfParallelism().
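A sketch of combining the two (16 is a placeholder; measure to find the optimum for your workload):

```csharp
using System.Linq;
using System.Threading;

class Sketch
{
    static void ExecuteDbOperation(int element)
    {
        // Stand-in for the real blocking database call.
    }

    static void Run(int[] someCollection, CancellationToken token)
    {
        // WithDegreeOfParallelism overrides PLINQ's core-count default
        // (it accepts values up to 512).
        someCollection.AsParallel()
            .WithDegreeOfParallelism(16)
            .WithCancellation(token)
            .ForAll(element => ExecuteDbOperation(element));
    }
}
```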
I believe that once you get, let's say, more than 10,000 elements, it will be better to use PLINQ, because it won't create a task for each element of your collection; it uses a Partitioner internally. Each task creation has some data-initialization overhead. The Partitioner will create only as many tasks as are optimal for the currently available cores, and it will reuse those tasks with new data to process. You can read more about it here: http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx
Related
I am tasked with updating a C# application (non-GUI) that is very single-threaded in its operation, and adding multi-threading to it to get it to turn queues of work over quicker.
Each thread will need to perform a very minimal amount of calculations, but most of the work will be calling on and wait on SQL Server requests. So, lots of waiting as compared to CPU time.
A couple of requirements will be:
Running on some limited hardware (that is, just a couple of cores). The current system, when it's being "pushed", only takes about 25% CPU. But since it's mostly waiting for the SQL Server to respond (a different server), we would like the capability to have more threads than cores.
Be able to limit the number of threads. I also can't just have an unlimited number of threads going either. I don't mind doing the limiting myself via an Array, List, etc.
Be able to keep track of when these threads complete so that I can do some post-processing.
It just seems to me that the .NET Framework has so many different ways of doing threads that I'm not sure if one is better than the other for this task. I'm not sure if I should be using Task, Thread, ThreadPool, something else... It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.
I'm not sure if I should be using Task, Thread, ThreadPool, something else...
In your case it matters less than you would think. You can focus on what fits your (existing) code style and dataflow the best.
since it's mostly doing waits for the SQL Server to respond
Your main goal would be to get as many of those SQL queries going in parallel as possible.
Be able to limit the number of threads.
Don't worry about that too much. On 4 cores, with 25% CPU, you can easily have 100 threads going; more on 64-bit. But you don't want thousands of threads. A .NET Thread uses 1 MB minimum; estimate how much RAM you can spare.
So it depends on your application how many queries you can get running at the same time. Worry about thread safety first.
When the number of parallel queries is > 1000, you will need async/await to run on fewer threads.
As long as it is < 100, just let threads block on I/O. Parallel.ForEach(), Parallel.Invoke(), etc. look like good tools.
The 100 - 1000 range is the grey area.
add multi-threading to it to get it to turn queues of work over quicker.
Each thread will need to perform a very minimal amount of calculations, but most of the work will be calling on and wait on SQL Server requests. So, lots of waiting as compared to CPU time.
With that kind of processing, it's not clear how multithreading will benefit you. Multithreading is one form of concurrency, and since your workload is primarily I/O-bound, asynchrony (and not multithreading) would be the first thing to consider.
It just seems to me that the .NET Framework has so many different ways of doing threads, I'm not sure if one is better than the other for this task.
Indeed. For reference, Thread and ThreadPool are pretty much legacy these days; there are much better higher-level APIs. Using Task as a delegate task (e.g., via Task.Factory.StartNew) should also be rare.
It appears to me that the async/await model would not be a good fit in this case though as it waits on one specific task to complete.
await will wait on one task at a time, yes. Task.WhenAll can be used to combine multiple tasks, and then you can await the combined task.
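For example, with a made-up GetCustomerAsync standing in for real async work:

```csharp
using System.Threading.Tasks;

class Sketch
{
    // Hypothetical async operation standing in for real work.
    static Task<string> GetCustomerAsync(int id) =>
        Task.FromResult($"customer-{id}");

    static async Task RunAsync()
    {
        Task<string> t1 = GetCustomerAsync(1);
        Task<string> t2 = GetCustomerAsync(2);

        // Both tasks run concurrently; WhenAll combines them into a
        // single task that completes when every component completes.
        string[] results = await Task.WhenAll(t1, t2);
    }
}
```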
get it to turn queues of work over quicker.
Be able to limit the number of threads.
Be able to keep track of when these threads complete so that I can do some post-processing.
It sounds to me that TPL Dataflow would be the best approach for your system. Dataflow allows you to define a "pipeline" through which data flows, with some steps being asynchronous (e.g., querying SQL Server) and other steps being parallel (e.g., data processing).
I was asking a high-level question to try and get back a high-level answer.
You may be interested in my book.
The TPL Dataflow library is probably one of the best options for this job. Here is how you could construct a simple dataflow pipeline consisting of two blocks. The first block accepts a file path and produces some intermediate data that can later be inserted into the database. The second block consumes the data coming from the first block by sending it to the database.
var inputBlock = new TransformBlock<string, IntermediateData>(filePath =>
{
return GetIntermediateDataFromFilePath(filePath);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = Environment.ProcessorCount // What the local machine can handle
});
var databaseBlock = new ActionBlock<IntermediateData>(item =>
{
SaveItemToDatabase(item);
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 20 // What the database server can handle
});
inputBlock.LinkTo(databaseBlock);
Now every time a user uploads a file, you just save the file in a temp path, and post the path to the first block:
inputBlock.Post(filePath);
And that's it. The data will flow from the first to the last block of the pipeline automatically, transformed and processed along the way, according to the configuration of each block.
This is an intentionally simplified example to demonstrate the basic functionality. A production-ready implementation will probably have more options defined, like the CancellationToken and BoundedCapacity, will watch the return value of inputBlock.Post to react in case the block can't accept the job, may propagate completion, watch the databaseBlock.Completion property for errors, etc.
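For instance, the databaseBlock above could be configured along these lines (the capacity of 100 and the token wiring are illustrative):

```csharp
using System.Threading;
using System.Threading.Tasks.Dataflow;

class IntermediateData { }

class Sketch
{
    static void SaveItemToDatabase(IntermediateData item)
    {
        // Stand-in for the real database insert.
    }

    static ActionBlock<IntermediateData> CreateDatabaseBlock(CancellationToken token)
    {
        return new ActionBlock<IntermediateData>(
            item => SaveItemToDatabase(item),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 20,
                BoundedCapacity = 100,      // back-pressure: Post returns false when full
                CancellationToken = token   // allows aborting the pipeline
            });
    }
}
```

When linking, `inputBlock.LinkTo(databaseBlock, new DataflowLinkOptions { PropagateCompletion = true })` also propagates completion and faults downstream.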
If you are interested at following this route, it would be a good idea to study the library a bit, in order to become familiar with the options available. For example there is a TransformManyBlock available, suitable for producing multiple outputs from a single input. The BatchBlock may also be useful in some cases.
The TPL Dataflow library is built into .NET Core, and available as a package for .NET Framework. It has some learning curve and some gotchas to be aware of, but it's nothing terrible.
It appears to me that the async/await model would not be a good fit in this case though as it waits on one specific task to complete.
That is wrong. async/await is just syntax that simplifies the state-machine mechanism behind asynchronous code. It waits without consuming any thread; in other words, the async keyword does not create a thread, and await does not hold up any thread.
Be able to limit the number of threads
see How to limit the amount of concurrent async I/O operations?
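The usual pattern from that question is a SemaphoreSlim throttle. A minimal sketch, with QueryAsync and the limit of 10 as placeholders:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Sketch
{
    // Hypothetical async I/O operation.
    static Task QueryAsync(int item) => Task.Delay(10);

    static async Task RunAsync(IEnumerable<int> items)
    {
        var throttle = new SemaphoreSlim(10); // at most 10 operations in flight

        var tasks = items.Select(async item =>
        {
            await throttle.WaitAsync();
            try { await QueryAsync(item); }
            finally { throttle.Release(); }
        });

        await Task.WhenAll(tasks);
    }
}
```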
Be able to keep track of when these threads complete so that I can do some post-processing.
If you don't use the "fire and forget" pattern, then you can keep track of the task and its exceptions just by writing await task:
var task = MethodAsync();
await task;
PostProcessing();
async Task MethodAsync(){ ... }
Or, for a similar approach, you can use ContinueWith (note that the continuation receives the completed task as a parameter):
var task = MethodAsync();
await task.ContinueWith(t => PostProcessing());
async Task MethodAsync(){ ... }
read more:
Releasing threads during async tasks
https://learn.microsoft.com/en-us/dotnet/standard/asynchronous-programming-patterns/?redirectedfrom=MSDN
I have a list of objects, and I have to do some processing for each of them, all in the least amount of time possible.
Since those operations are independent of each other, we've decided to do them in parallel with Parallel.ForEach.
Parallel.ForEach(hugeObjectList,
new ParallelOptions { MaxDegreeOfParallelism = 50 },
obj => DoSomeWork(obj)
);
Since it seems unreasonable to me to set a huge number for ParallelOptions.MaxDegreeOfParallelism (e.g. 50 or 100), how can we find the optimal number of parallel tasks to crunch this list?
Does Parallel.ForEach start DoSomeWork on a different core? (So, since we have 4 cores, would the correct degree of parallelism be 4?)
I think this says it all
By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
MSDN
Asking the platform should get you close to the optimum (for CPU bound work).
new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
Doing nothing is another very good option, ie
//new ParallelOptions { MaxDegreeOfParallelism = 50 },
Edit
there's a lot of io with a database ...
That makes MaxDegreeOfParallelism = 1 another very good candidate. Or maybe 2.
What you really should be looking into is async/await and async database calls. Not the Parallel class.
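A sketch of what an async database call looks like with ADO.NET (the connection string and query are placeholders; on newer stacks the package is Microsoft.Data.SqlClient):

```csharp
using System.Data.SqlClient;
using System.Threading.Tasks;

class Sketch
{
    static async Task<int> CountRowsAsync(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM SomeTable", conn))
        {
            cmd.Connection = conn;
            // The thread is released back to the pool while the server
            // works; nothing blocks during the wait.
            await conn.OpenAsync();
            return (int)await cmd.ExecuteScalarAsync();
        }
    }
}
```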
The only way to know for sure is to test it. More threads does not equal better performance, and may often yield worse performance. Some thoughts:
Designing an algorithm for a single thread, and then adding Parallel.For around it is pointless. You must change your algorithm to take advantage of multiple threads or the benefits to parallel processing will be minor or negative.
If you are reading from disk or downloading data over a network connection where the server is able to feed you as fast as you get the data, you may find that a producer/consumer pattern performs best. If the processing is computationally expensive, use many consumer threads (I tend to use Num Cores - 2. One for the UI, one for the producer). If not computationally expensive, it won't matter how many consumer threads you use.
If you are downloading data from the Internet from a variety of sources, and the servers take time to respond, you should start up quite a few threads (50-100 is not crazy). This is because the threads will just sit there waiting for the server to respond.
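The producer/consumer pattern mentioned above can be sketched with BlockingCollection (Download, Process, and the capacity are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Sketch
{
    static string Download(string url) => url; // stand-in for network I/O
    static void Process(string data) { /* CPU-heavy work */ }

    static async Task RunAsync(IEnumerable<string> urls)
    {
        // Bounded so a fast producer can't fill memory.
        var queue = new BlockingCollection<string>(boundedCapacity: 100);

        var producer = Task.Run(() =>
        {
            foreach (var url in urls)
                queue.Add(Download(url)); // blocks when the queue is full
            queue.CompleteAdding();
        });

        // "Num Cores - 2" consumers, as suggested above, with a floor of 1.
        int consumerCount = Math.Max(1, Environment.ProcessorCount - 2);
        var consumers = Enumerable.Range(0, consumerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                    Process(item);
            }))
            .ToArray();

        await producer;
        await Task.WhenAll(consumers);
    }
}
```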
What is the difference between Task class and parallel class which part of TPL at implementation point of view.?
I believe the Task class has more benefits than ThreadPool and Thread, but context switching still happens with the Task class as well.
Is the Parallel class basically designed to run programs on multicore processors?
Your question is extremely wide and can contain lots of details as an answer, but let me restrict to specific details.
Task - wraps a method for execution down the line; it uses lambdas (Action and Func delegates) to do so. You can wrap now and execute any time later.
Parallel is an API which helps achieve data parallelism, where you can divide a collection (an IEnumerable type) into smaller chunks, execute each chunk in parallel, and finally aggregate the results.
There are broadly two kinds of parallelism. In one, you subdivide a bigger task into smaller ones, wrap them in a Task type, and wait for all or some of them to complete in parallel. This is task parallelism.
In the other, you take each data unit in a collection and work on it in a mutually exclusive manner; this is data parallelism, achieved by the Parallel.ForEach or Parallel.For APIs.
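A minimal sketch of the two forms side by side (the Compute/Process methods are placeholders):

```csharp
using System.Linq;
using System.Threading.Tasks;

class Sketch
{
    static int ComputePartA() => 1;
    static int ComputePartB() => 2;
    static void ProcessNumber(int n) { }

    static async Task RunAsync()
    {
        // Task parallelism: distinct subtasks of a bigger job run concurrently.
        var a = Task.Run(() => ComputePartA());
        var b = Task.Run(() => ComputePartB());
        await Task.WhenAll(a, b);

        // Data parallelism: the same operation applied to each element.
        var numbers = Enumerable.Range(0, 1000).ToArray();
        Parallel.ForEach(numbers, n => ProcessNumber(n));
    }
}
```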
These were introduced in .NET 4.0 to make parallelism easier for developers. Before that, we had to dabble with the Thread and ThreadPool classes, which require a much more in-depth understanding of how threads work; a lot of that complexity is now taken care of internally.
However, don't be under the impression that the current mechanism doesn't use threads. Both of the above-mentioned forms of parallelism rely completely on ThreadPool threads, which is why context switching still happens and multiple threads still get invoked; Microsoft has just made the developer's life easier.
You may want to go through following links for a better understanding, let me know if there's still a specific query:
Parallel.ForEach vs Task.Factory.StartNew
Past, Present and Future of Parallelism
Parallel.ForEach vs Task.Run and Task.WhenAll
TPL is designed to minimize pre-emptive context switching (caused by thread oversubscription: having more threads than cores). Task abstractions, of which TPL is an implementation, are designed for cooperative parallelism, where the developer controls when a task will relinquish its execution (typically upon completion). If you schedule more tasks than you have cores, TPL will only execute concurrently approximately as many tasks as you have cores; the rest will be queued. This promotes throughput, since it avoids the overhead of context switching, but reduces responsiveness, as each task may take longer to start being processed.
The Parallel class is yet a higher level of abstraction that builds on top of TPL. Implementation-wise, Parallel generates a graph of tasks, but can use heuristics for deciding the granularity of the said tasks depending on your work.
I need to scrape data from a website.
I have over 1,000 links I need to access. Previously I was dividing the links 10 per thread, and would start 100 threads, each pulling 10. After a few test cases, 100 threads was the best count to minimize the time it took to retrieve the content for all the links.
I realized that .NET 4.0 offers better support for multi-threading out of the box, but the degree of parallelism is based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize the pulling of 1,000 links? Should I use .ForEach and let the Parallel extension control the number of threads that get spawned, or find a way to tell it how many threads to start and how to divide the work?
I have not worked with Parallel before, so maybe my approach is wrong.
You can use the MaxDegreeOfParallelism property in Parallel.ForEach to control the number of threads that will be spawned.
Here's the code snippet:
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5;
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
In general, Parallel.ForEach() is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).
You can control the maximum degree of parallelization, but there's no mechanism to force more threads to be used.
Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
Something worth checking out is the TPL Dataflow library.
DataFlow on MSDN.
See Nesting await in Parallel.ForEach
The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.
Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
Hard to say without looking at your code and how the collection is defined. I've found that Parallel.Invoke is the most flexible. Try MSDN? It sounds like you are looking for the Parallel.For(Int32, Int32, Action<Int32, ParallelLoopState>) overload.
I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. Each task is created by reading data from the database, but a collection of all the tasks would exceed the amount of memory on the machine, so tasks have to be created, processed, and disposed incrementally. I am wondering what would be a good approach to solve this problem. I am thinking of the following two approaches:
Implement a synchronized task queue. Implement a producer (task creator) that reads data from the database and puts tasks in the queue (limiting the number of tasks currently in the queue to a constant value to make sure the amount of memory is not exceeded). Have multiple consumer processes (task processors) that read tasks from the queue, process them, store the results, and dispose of the tasks. What would be a good number of consumer processes in this approach?
Use the .NET Parallel Extensions (PLINQ or Parallel.For), but I understand that a collection of tasks has to be created up front (can we add tasks to the collection while processing in the parallel for?). So we would create batches of, say, N tasks at a time, process that batch, and then read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren't CPU bound then you'll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
I've not had the opportunity to actually use PLINQ, however I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e. the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it could process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
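A sketch of that idea, with a hypothetical iterator block streaming work items from the database:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WorkItem : IDisposable
{
    public void Dispose() { /* release per-task resources */ }
}

class Sketch
{
    // Hypothetical producer: yields one task at a time instead of
    // materializing the whole set in memory.
    static IEnumerable<WorkItem> ReadTasksFromDatabase(int rowCount)
    {
        for (int i = 0; i < rowCount; i++)
            yield return new WorkItem(); // one DB row -> one task
    }

    static void Run()
    {
        // PLINQ pulls from the iterator in chunks as workers need input,
        // so only a bounded window of tasks is alive at any moment.
        ReadTasksFromDatabase(rowCount: 100_000)
            .AsParallel()
            .ForAll(item =>
            {
                using (item)
                {
                    // process the task and store the result
                }
            });
    }
}
```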
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything, and items will run as threads become available in the pool, without overwhelming the system. The only trick is determining the optimal number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer, we need a few questions answered.
Is each individual task parallelizable? Or is each task the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good thing to use to do this. It might also give you some extra benefits such as pause/resume on your tasks.