I need to scrape data from a website.
I have over 1,000 links I need to access. Previously I divided the links 10 per thread and started 100 threads, each pulling 10. After a few test cases, 100 threads turned out to be the best count to minimize the total time to retrieve the content for all the links.
I realized that .NET 4.0 offers better support for multi-threading out of the box, but it decides based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize the pulling of 1,000 links? Should I use .ForEach and let the Parallel extension control the number of threads that get spawned, or find a way to tell it how many threads to start and divide the work?
I have not worked with Parallel before, so maybe my approach is wrong.
You can use the MaxDegreeOfParallelism property of ParallelOptions to control the number of threads that Parallel.ForEach will spawn.
Here's a code snippet:
// limit the loop to at most 5 concurrent tasks
var opt = new ParallelOptions { MaxDegreeOfParallelism = 5 };
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
In general, Parallel.ForEach() is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).
You can control the maximum degree of parallelization, but there's no mechanism to force more threads to be used.
Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
Something worth checking out is the TPL Dataflow library.
DataFlow on MSDN.
See Nesting await in Parallel.ForEach
The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.
Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
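For a flavor of what that looks like, here is a minimal sketch (not the walkthrough's code) using an ActionBlock with bounded parallelism; `urls` stands in for your own link collection, and HttpClient assumes .NET 4.5 or the System.Net.Http package:
using System;
using System.Net.Http;
using System.Threading.Tasks.Dataflow;

class Scraper
{
    static void Main()
    {
        var urls = new[] { "http://example.com/a", "http://example.com/b" }; // your 1,000 links
        var client = new HttpClient(); // shared; HttpClient is safe for concurrent use

        // each posted URL is downloaded by the block, at most 10 at a time
        var downloader = new ActionBlock<string>(async url =>
        {
            string html = await client.GetStringAsync(url);
            Console.WriteLine("{0}: {1} chars", url, html.Length);
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

        foreach (var url in urls)
            downloader.Post(url);

        downloader.Complete();        // signal no more input
        downloader.Completion.Wait(); // block until all downloads finish
    }
}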
Hard to say without looking at your code and how the collection is defined. I've found that Parallel.Invoke is the most flexible. Try MSDN? It sounds like you are looking for the Parallel.For(Int32, Int32, Action<Int32, ParallelLoopState>) overload.
Related
I'm using Windows Forms and HTMLAgilityPack to load a website's HTML into a variable, with Parallel.For calling it 10 at a time.
How can I calculate how many of these calls I could do in parallel? By watching CPU usage in Task Manager? Am I calling too few or too many times?
Thanks
By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
https://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism.aspx
Are you trying to determine what degree of parallelism (how many threads) yields the best performance?
I'd recommend the following:
1. Don't worry about it unless you determine that performance is actually an issue. You can fine-tune it, but the perfect settings on the computer you're using might not be so great on another where the code executes.
2. Perform multiple tests and see which concludes faster (see the timing sketch at the end of this answer). Back before we had Parallel.ForEach I would do that sometimes. I often found that adding more threads meant faster execution, and then adding some more produced diminishing returns or even slower performance.
Unless you're seeing performance issues and you think this is the fix I wouldn't worry about it.
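If you do decide to measure, a rough sketch like this (with `urls` and `FetchPage` as placeholders for your own collection and work) times a few candidate degrees of parallelism so you can compare runs:
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// time the same workload at several degrees of parallelism
foreach (int dop in new[] { 2, 5, 10, 20 })
{
    var sw = Stopwatch.StartNew();
    Parallel.ForEach(urls,
                     new ParallelOptions { MaxDegreeOfParallelism = dop },
                     url => FetchPage(url)); // FetchPage: your download/parse method
    Console.WriteLine("DOP {0}: {1} ms", dop, sw.ElapsedMilliseconds);
}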
I have some DB operations to perform and I tried using PLINQ:
someCollection.AsParallel()
    .WithCancellation(token)
    .ForAll(element => ExecuteDbOperation(element));
And I notice it is quite slow compared to:
var tasks = someCollection.Select(element =>
        Task.Run(() => ExecuteDbOperation(element), token))
    .ToList();

await Task.WhenAll(tasks);
I prefer the PLINQ syntax, but I am forced to use the second version for performance.
Can someone explain the big difference in performance?
My supposition is that this is because of the number of threads created.
In the first example this number will be roughly equal to the number of cores of your computer. By contrast, the second example will create as many threads as someCollection has elements. For IO operations that's generally more efficient.
The Microsoft guide "Patterns_of_Parallel_Programming_CSharp" recommends creating more threads than the default for IO operations (p. 33):
var addrs = new[] { addr1, addr2, ..., addrN };
var pings = from addr in addrs.AsParallel().WithDegreeOfParallelism(16)
            select new Ping().Send(addr);
Both PLINQ and Parallel.ForEach() were primarily designed to deal with CPU-bound workloads, which is why they don't work so well for your IO-bound work. For some specific IO-bound work, there is an optimal degree of parallelism, but it doesn't depend on the number of CPU cores, while the degree of parallelism in PLINQ and Parallel.ForEach() does depend on the number of CPU cores, to a greater or lesser degree.
Specifically, the way PLINQ works is to use a fixed number of Tasks, by default based on the number of CPU cores on your computer. This is meant to work well for a chain of PLINQ methods. But it seems this number is smaller than the ideal degree of parallelism for your work.
On the other hand, Parallel.ForEach() delegates the decision about how many Tasks to run to the ThreadPool. And as long as its threads are blocked, the ThreadPool slowly keeps adding more. The result is that, over time, Parallel.ForEach() might get closer to the ideal degree of parallelism.
The right solution is to figure out what the right degree of parallelism for your work is by measuring, and then using that.
Ideally, you would make your code asynchronous and then use some approach to limit the degree of parallelism for async code.
Since you said you can't do that (yet), I think a decent solution might be to avoid the ThreadPool and run your work on dedicated threads (you can create those by using Task.Factory.StartNew() with TaskCreationOptions.LongRunning).
If you're okay with sticking to the ThreadPool, another solution would be to use PLINQ ForAll(), but also call WithDegreeOfParallelism().
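As a rough sketch of that last suggestion (the 16 is a placeholder; use whatever value your measurements favor):
someCollection.AsParallel()
    .WithDegreeOfParallelism(16) // measured optimum, not a magic number
    .WithCancellation(token)
    .ForAll(element => ExecuteDbOperation(element));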
I believe if you have, say, more than 10,000 elements it will be better to use PLINQ, because it won't create a task for each element of your collection: it uses a Partitioner internally. Each task creation has some initialization overhead. The Partitioner will create only as many tasks as are optimal for the currently available cores, and it will reuse those tasks with new data to process. You can read more about it here: http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx
Question 1.
Is using Parallel.For and Parallel.ForEach better suited to working with tasks that are ordered or unordered?
My reason for asking is that I recently updated a serial loop in which a StringBuilder was used to generate a SQL statement based on various parameters. The resulting SQL was jumbled (to the point that it contained syntax errors) compared to the output of a standard foreach loop, so my gut feeling is that TPL is not suited to performing tasks where the data must appear in a particular order.
Question 2.
Does the TPL automatically make use of multicore architectures, or must I provision anything prior to execution?
My reason for asking this relates back to an earlier question I asked about performance profiling of TPL operations. An answer to that question enlightened me to the fact that TPL is not always more efficient than a standard serial loop: the application may not have access to multiple cores, in which case the overhead of creating additional threads and loops causes a performance decrease compared to a standard serial loop.
my gut feeling is that TPL is not suited to performing tasks where the data must appear in a particular order.
Correct. If you expect things in order, you might have a misunderstanding about what's going to happen when you "parallelize" a loop.
Does the TPL automatically make use of multicore architectures, or must I provision anything prior to execution?
See the following article on the msdn magazine:
http://msdn.microsoft.com/en-us/magazine/cc163340.aspx
Using the library, you can conveniently express potential parallelism in existing sequential code, where the exposed parallel tasks will be run concurrently on all available processors.
If the results must be ordered, then to parallelize the loop you need to be able to do the actual work in any order and then sort the results. This may or may not be more efficient than doing the work serially in the first place, depending on the situation. If the benefit of parallelizing the work that can be done in any order exceeds the cost of ordering the results, then it's a net gain. If the task isn't sufficiently complex, your hardware doesn't allow for much parallelization, or the work doesn't parallelize well (i.e. there is a lot of waiting due to data dependencies), then sorting the results could take more time than you gain from parallelizing the loop (or, worse yet, parallelizing the loop takes longer even without the sort; see question two), and in that case you shouldn't parallelize it.
Note that if the actual units of work need to be run in a certain order, rather than just needing the results in a certain order, then you either won't be able to parallelize the loop, or you won't be able to parallelize it nearly as effectively. If you don't properly synchronize access to shared resources then you'll actually end up with the wrong result (as happened in your case). To that end, remember that performance optimizations are meaningless if you can't actually come up with the correct result.
You don't really need to worry much about your hardware with the TPL. You don't need to explicitly add or restrict tasks. While there are a few ways that you can, virtually anytime you do something like this you'll hurt performance. When you do stuff like that you're adding restrictions to the TPL so it can't do what it wants. Often it knows better than you.
You also touch on another point here, and that's that parallelizing a loop will often take longer than not (you just didn't give likely reasons to cause this behavior). Often the actual work that needs to be done is just very small, so small that the work of creating threads, managing them, dealing with context shifts and synchronizing data as needed can be more work than what you gain from doing some work in parallel. This is why it's important to actually do lots of testing when deciding to parallelize some work to ensure that it actually benefits from it.
It's not better or worse for unordered lists - your issue in #1 was that you had a shared dependency on a StringBuilder which is why the parallel query failed. TPL works wonderfully on independent units of work. Even then, there are simple tricks that you can use to force an evaluation of a parallel query and keep the original ordering of the results when the parallel operations have all finished.
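One such trick is PLINQ's AsOrdered(), sketched here with hypothetical `items` and `Process` placeholders; the work runs in parallel, but the results come back in source order:
var results = items.AsParallel()
    .AsOrdered()              // preserve the source ordering of results
    .Select(x => Process(x))  // Process still runs on multiple threads
    .ToList();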
TPL and PLINQ are technically distinct things; PLINQ uses TPL to accomplish its goal. That said, PLINQ tries to inspect your architecture and structure the execution of the set as best it can. TPL is simply the wrapper around the task architecture. It's going to be up to you to determine whether the overhead of creating a task (something like 1 MB of memory) plus the overhead of the context switching needed to execute the tasks is greater than the cost of simply running the task serially.
On point 1: if using TPL, you don't know in which order the tasks run. That's the beauty of parallel vs. sequential. There are ways to control the order of things, but then you probably lose the benefits of parallelism.
On 2: TPL makes use of multiple cores out of the box. But there is indeed always overhead in using multiple threads. Load on the scheduler increases, and thread (context) switching is not free. To keep data in sync and avoid race conditions you'll probably need some locking mechanism, which also adds to the overhead.
Crafting a fast parallel algorithm is made a lot easier by the TPL, but it is still something of an art.
Clearly TPL is not a good tool for building an ordered set like a query.
If you have a series of tasks to perform on a set of items, you can use a BlockingCollection. The tasks can be performed in parallel while the order of the set is maintained.
BlockingCollection Class
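As a minimal sketch (GetWorkItems and Process are placeholders for your own source and work): one producer enqueues in order, several consumers take items FIFO and work on them in parallel:
using System.Collections.Concurrent;
using System.Threading.Tasks;

var queue = new BlockingCollection<string>(boundedCapacity: 100);

// producer: enqueue items in the order they must be picked up
var producer = Task.Factory.StartNew(() =>
{
    foreach (var item in GetWorkItems())
        queue.Add(item);
    queue.CompleteAdding(); // tell consumers there is no more work
});

// consumers: GetConsumingEnumerable() hands out items in FIFO order
var consumers = new Task[4];
for (int i = 0; i < consumers.Length; i++)
    consumers[i] = Task.Factory.StartNew(() =>
    {
        foreach (var item in queue.GetConsumingEnumerable())
            Process(item);
    });

Task.WaitAll(consumers);
Note that items are handed out in order, but with several consumers the completion order of the Process calls can still interleave.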
We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread.
Is this possible, and would we be better off using the ThreadPool or spawning new threads, keeping in mind that each item of "work" only takes 30ms, but it's possible that hundreds of files will need to be processed?
Any ideas to make this more efficient are appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?
I would suggest you use ThreadPool.QueueUserWorkItem(...); with it, threads are managed by the system and the .NET Framework. The chances of messing something up with your own thread pool are much higher, so I would recommend using the ThreadPool provided by .NET.
It's very easy to use:
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), parameterToBeUsedByMethod);

void YourMethod(object o)
{
    // your code here...
}
For more reading, please follow this link: http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
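If you also need to know when all the queued work items have finished, one common sketch (assuming a fixed, known number of files; ProcessFile is a placeholder for your own work) is a counter plus a ManualResetEvent:
using System.Threading;

int pending = files.Length; // files: your array of file paths
using (var allDone = new ManualResetEvent(false))
{
    foreach (var file in files)
        ThreadPool.QueueUserWorkItem(state =>
        {
            ProcessFile((string)state);
            // the last worker to finish signals the waiting thread
            if (Interlocked.Decrement(ref pending) == 0)
                allDone.Set();
        }, file);

    allDone.WaitOne(); // blocks until every work item has completed
}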
Hope this helps.
I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. if you have 400 files to process, give each thread 100 files, split evenly. You then spawn the threads, pass each its work, and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.
Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
// AsParallel() on the source so the file reads and the processing
// actually run in parallel, not just the final ToDictionary step
var filesContent = from file in enumerableOfFilesToProcess.AsParallel()
                   select new
                   {
                       File = file,
                       Content = File.ReadAllText(file)
                   };

var processedContent = from content in filesContent
                       select new
                       {
                           content.File,
                           ProcessedContent = ProcessContent(content.Content)
                       };

var dictionary = processedContent.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load, while you get to concentrate on the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.
I suggest using the CCR (Concurrency and Coordination Runtime); it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach, depending on how you attempt to write to the dictionary, because you may create heavy contention: dictionaries aren't thread-safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// the CCR takes care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}
In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.
Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.
The general rule is to use the ThreadPool when you don't want to worry about when the threads finish (or about using mutexes to track them), or about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress or stop threads, your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth
Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than help it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned its own ThreadPool, initialized with a capacity of ~100 threads. When you are executing 400 operations in parallel, it does not take long to fill the queue with requests, and now you have ~100 threads all competing for CPU cycles. Yes, the .NET framework does a great job of throttling and prioritizing the queue; however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
1. Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to.
2. Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations (see the sketch after this list). This will cause the .NET framework to use native APIs to thread and execute the IO (IOCP).
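A minimal sketch of the BeginRead half (path and HandleData are placeholders; passing true for useAsync is what opts the stream into IOCP-based completion):
using System.IO;

// open with useAsync = true so reads complete via IO completion ports
var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                        FileShare.Read, 4096, true /* useAsync */);
var buffer = new byte[fs.Length];

fs.BeginRead(buffer, 0, buffer.Length, ar =>
{
    int bytesRead = fs.EndRead(ar); // completes on an IOCP thread
    HandleData(buffer, bytesRead);  // HandleData: your processing step
    fs.Close();
}, null);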
This gives you two advantages: first, your requests will still be processed in parallel while the operating system manages file system access and threading. Second, because the bottleneck on the vast majority of systems will be the HDD, you can implement custom priority sorting and throttling on your request thread to give greater control over resource usage.
Currently I am writing a similar application, and using this method is both efficient and fast... Without any threading or throttling, my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved; however, it made my PC as slow as if an application were using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused: they are optimized for performance, even if that performance means your HDD is squealing like a pig.
The only problem I have had is that memory usage ran a little high (50+ MB) during the testing phase with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)
I have some embarrassingly parallel work in a .NET 3.5 console app and I want to take advantage of hyperthreading and multi-core processors. How do I pick the best number of worker threads to make the best use of either on an arbitrary system? For example, if it's a dual core I will want 2 threads; quad core, 4 threads. What I'm ultimately after is determining the processor characteristics so I can know how many threads to create.
I'm not asking how to split up the work nor how to do threading, I'm asking how do I determine the "optimal" number of the threads on an arbitrary machine this console app will run on.
I'd suggest that you don't try to determine it yourself. Use the ThreadPool and let .NET manage the threads for you.
You can use Environment.ProcessorCount if that's the only thing you're after. But usually using a ThreadPool is indeed the better option.
The .NET thread pool also has provisions for sometimes allocating more threads than you have cores to maximise throughput in certain scenarios where many threads are waiting for I/O to finish.
The correct number is obviously 42.
Now, on a serious note: just use the thread pool, always.
1) If you have a lengthy processing task (i.e. CPU-intensive) that can be partitioned into multiple pieces of work, then partition the task and submit all the individual work items to the ThreadPool. The thread pool will pick up work items and start churning on them dynamically, since it has self-monitoring capabilities that include starting new threads as needed, and it can be configured at deployment by administrators according to the deployment site's requirements, as opposed to pre-computing the numbers at development time. While it is true that the proper partitioning size of your processing task can take into account the number of CPUs available, the right answer depends so much on the nature of the task and the data that it is not even worth talking about at this stage (and besides, the primary concerns should be your NUMA nodes, memory locality, and interlocked cache contention, and only after that the number of cores).
2) If you're doing I/O (including DB calls) then you should use Asynchronous I/O and complete the calls in ThreadPool called completion routines.
These two are the only valid reasons to have multiple threads, and they're both best handled by using the ThreadPool. Anything else, including starting a thread per 'request' or 'connection', is in fact an anti-pattern in the Win32 API world (fork is a valid pattern in *nix, but definitely not on Windows).
For a more specialized and way, way more detailed discussion of the topic I can only recommend the Rick Vicik papers on the subject:
designing-applications-for-high-performance-part-1.aspx
designing-applications-for-high-performance-part-ii.aspx
designing-applications-for-high-performance-part-iii.aspx
The optimal number would just be the processor count. Optimally you would always have one thread running on each CPU (logical or physical) to minimise context switches and the overhead that comes with them.
Whether that is the right number depends (very much as everyone has said) on what you are doing. The threadpool (if I understand it correctly) pretty much tries to use as few threads as possible but spins up another one each time a thread blocks.
The blocking is never optimal but if you are doing any form of blocking then the answer would change dramatically.
The simplest and easiest way to get good (not necessarily optimal) behaviour is to use the threadpool. In my opinion it's really hard to do any better than the threadpool, so that's simply the best place to start; only think about something else if you can demonstrate why it is not good enough.
A good rule of thumb, given that you're completely CPU-bound, is processorCount + 1.
That's +1 because you will always get some tasks started/stopped/interrupted and n tasks will almost never completely fill up n processors.
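In code the heuristic is a one-liner; note that Environment.ProcessorCount reports logical processors, so hyperthreaded cores are already counted:
using System;

// one worker per logical processor, plus one to absorb stalls
int workerCount = Environment.ProcessorCount + 1;
Console.WriteLine("Using {0} worker threads", workerCount);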
The only way is a combination of data and code analysis based on performance data.
Different CPU families and speeds vs. memory speed vs other activities on the system are all going to make the tuning different.
Potentially some self-tuning is possible, but this will mean having some form of live performance tuning and self adjustment.
Or even better than the ThreadPool, use .NET 4.0 Task instances from the TPL. The Task Parallel Library is built on a foundation in the .NET 4.0 framework that will actually determine the optimal number of threads to perform the tasks as efficiently as possible for you.
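For instance, a sketch of fanning work out as .NET 4.0 tasks (workItems and Process are placeholders) while letting the TPL pick the thread count:
using System.Linq;
using System.Threading.Tasks;

// one Task per work item; the TPL scheduler decides how many threads run them
var tasks = workItems
    .Select(item => Task.Factory.StartNew(() => Process(item)))
    .ToArray();

Task.WaitAll(tasks); // block until everything completes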
I read something on this recently (see the accepted answer to this question for example).
The simple answer is that you let the operating system decide. It can do a far better job of deciding what's optimal than you can.
There are a number of questions on a similar theme; searching for "optimal number threads" (without the quotes) gives a couple of pages of results.
I would say it also depends on what you are doing. If you're making a server application, then using all you can get out of the CPUs via either Environment.ProcessorCount or a thread pool is a good idea.
But if this is running on a desktop or a machine that's not dedicated to this task, you might want to leave some CPU idle so the machine still "functions" for the user.
It can be argued that the real way to pick the best number of threads is for the application to profile itself and adaptively change its threading behavior based on what gives the best performance.
I wrote a simple number crunching app that used multiple threads, and found that on my Quad-core system, it completed the most work in a fixed period using 6 threads.
I think the only real way to determine is through trialling or profiling.
In addition to processor count, you may want to take into account the process's processor affinity by counting bits in the affinity mask returned by the GetProcessAffinityMask function.
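From managed code, a sketch of the same idea uses Process.ProcessorAffinity (the managed counterpart of GetProcessAffinityMask) and counts the set bits:
using System;
using System.Diagnostics;

// the affinity mask has one bit set per processor this process may run on
long mask = (long)Process.GetCurrentProcess().ProcessorAffinity;
int usableProcessors = 0;
while (mask != 0)
{
    usableProcessors += (int)(mask & 1);
    mask >>= 1;
}
Console.WriteLine("Process may run on {0} processors", usableProcessors);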
If there is no excessive I/O processing and there are no system calls while the threads are running, then the number of threads (excluding the main thread) should in general equal the number of processors/cores in your system; otherwise you can try to increase the number of threads by testing.