How to decrease CPU utilization on intensive multithread computation

How to decrease CPU utilization on intensive multithread computation - c#

I have implemented meta-heuristic solver and utilized .NET 4.0 Parallel.For and Parallel.Foreach.
It works fines on my medium-end machine. But the search is too intensified and consumes too much resources especially the CPU time on that on lower-end machine.
So I think I have to put an option to tune down intensity of the search when needed. I would like to get CPU utilization down without much touch on an algorithm. It is fine if the search completes slower as long as it won't lock up the machine and allows the other work aside.
I'm considering to put Thread.Sleep on methods as all threads are 100% CPU bound (no I/O). Does that gonna decrease intensity of CPU usage I need? Is there any better solution?

If you use diffrent threads I suggest the following:
Thread.Priority = ThreadPriority.BelowNormal
So the thread get less priority.
Edit
Okay, this solution is only accetpable, if it's not used with ThreadPool..
Here I'm not sure, but whats about:
Process.ProcessorAffinity
Gets or sets the processors on which the threads in this process can be scheduled to run.
If we limit the cores, we got more space for other work, right?

(sticks head above parapet, preparing to get shot down with many downvotes..:)
If you use diffrent threads I suggest the following:
Thread.Priority = ThreadPriority.BelowNormal
So the thread get less priority.
[Yes, this is a straight copy of C Sharper answer, but without the later edits about threadpool threads and affinity-bodging. If you think this is good idea - upvote C Sharper, not me].
Alternatively, reduce the priority of the whole process with the Task Manager or other means. This will reduce the 'real' scheduling priority of all the thread in the process, (including the 'really bad idea' threadpool threads), and improve the user experience for other apps that are not CPU-intensive but should respond quickly when they need to do something. The lower-priority heavy app will still soak up all the remaining CPU and, if you decide not to edit this month's sales figures or watch the latest illegally-downloaded copyrighted films, will perform as well as if run at normal priority.

Related

Slow worker thread performance with priority queue

I was attempting to use worker threads to speed up a larger algorithm when I noticed that using independent priority queue's on more threads actually slowed performance down. So I wrote a small test case.
In which I query how many threads to start, set each thread to its own processor, and the push and pop a lot of stuff from my priority queues. Each thread owns it's own priority queue, and they're allocated separately so I don't suspect false sharing.
I put the test case here, because it's longer than a snippet.
(The processor affinity bit comes from NCrunch)
The priority queue is of my own creation because .NET didn't have a built in queue. It uses a Pairing Heap if that makes any difference.
At any rate if I run the program with one thread and one core, it gets about 100% usage.
Usage drops with two threads / two cores
And eventually pittles down to 30% usage with all 8 cores.
Which is a problem because the drop in performance nullifies any would be gain from multithreading. What's causing the drop in performance? Each queue is completely independent of the other thread's

Some problems like solving pi are more suited to parallelization and hyperthreading can actually give you a speed up. When you are dealing with a heavy memory problem like you are, Hyperthreading can't help and can actually hurt. Check out "pipelining" in CPU architecture.
There are not many practical problems for which you can get a 2x speedup by using 2-cpus. The more cpus, the more overhead. In your test case algorithm, I suspect cores are having to wait for the memory subsystem. If you tweak the memory requirements, you will see an increase in performance (and utilization) as you move the memory requirements closer to the CPU cache-size.

The OS is assigning the processing to whichever CPU it wishes to at a moment. Therefore, you are seeing every processor do some amount of work.
Additionally, when you say "drop in performance", have you checked how many contentions the system is creating? You probably are relieving yourself of contentions amongst the threads as well.

C# ThreadPool Implementation / Performance Spikes

In an attempt to speed up processing of physics objects in C# I decided to change a linear update algorithm into a parallel algorithm. I believed the best approach was to use the ThreadPool as it is built for completing a queue of jobs.
When I first implemented the parallel algorithm, I queued up a job for every physics object. Keep in mind, a single job completes fairly quickly (updates forces, velocity, position, checks for collision with the old state of any surrounding objects to make it thread safe, etc). I would then wait on all jobs to be finished using a single wait handle, with an interlocked integer that I decremented each time a physics object completed (upon hitting zero, I then set the wait handle). The wait was required as the next task I needed to do involved having the objects all be updated.
The first thing I noticed was that performance was crazy. When averaged, the thread pooling seemed to be going a bit faster, but had massive spikes in performance (on the order of 10 ms per update, with random jumps to 40-60ms). I attempted to profile this using ANTS, however I could not gain any insight into why the spikes were occurring.
My next approach was to still use the ThreadPool, however instead I split all the objects into groups. I initially started with only 8 groups, as that was how any cores my computer had. The performance was great. It far outperformed the single threaded approach, and had no spikes (about 6ms per update).
The only thing I thought about was that, if one job completed before the others, there would be an idle core. Therefore, I increased the number of jobs to about 20, and even up to 500. As I expected, it dropped to 5ms.
So my questions are as follows:
Why would spikes occur when I made the job sizes quick / many?
Is there any insight into how the ThreadPool is implemented that would help me to understand how best to use it?

Using threads has a price - you need context switching, you need locking (the job queue is most probably locked when a thread tries to fetch a new job) - it all comes at a price. This price is usually small compared to the actual work your thread is doing, but if the work ends quickly, the price becomes meaningful.
Your solution seems correct. A reasonable rule of thumb is to have twice as many threads as there are cores.

As you probably expect yourself, the spikes are likely caused by the code that manages the thread pools and distributes tasks to them.
For parallel programming, there are more sophisticated approaches than "manually" distributing work across different threads (even if using the threadpool).
See Parallel Programming in the .NET Framework for instance for an overview and different options. In your case, the "solution" may be as simple as this:
Parallel.ForEach(physicObjects, physicObject => Process(physicObject));

Here's my take on your two questions:
I'd like to start with question 2 (how the thread pool works) because it actually holds the key to answering question 1. The thread pool is implemented (without going into details) as a (thread-safe) work queue and a group of worker threads (which may shrink or enlarge as needed). As the user calls QueueUserWorkItem the task is put into the work queue. The workers keep polling the queue and taking work if they are idle. Once they manage to take a task, they execute it and then return to the queue for more work (this is very important!). So the work is done by the workers on-demand: as the workers become idle they take more pieces of work to do.
Having said the above, it's simple to see what is the answer to question 1 (why did you see a performance difference with more fine-grained tasks): it's because with fine-grain you get more load-balancing (a very desirable property), i.e. your workers do more or less the same amount of work and all cores are exploited uniformly. As you said, with a coarse-grain task distribution, there may be longer and shorter tasks, so one or more cores may be lagging behind, slowing down the overall computation, while other do nothing. With small tasks the problem goes away. Each worker thread takes one small task at a time and then goes back for more. If one thread picks up a shorter task it will go to the queue more often, If it takes a longer task it will go to the queue less often, so things are balanced.
Finally, when the jobs are too fine-grained, and considering that the pool may enlarge to over 1K threads, there is very high contention on the queue when all threads go back to take more work (which happens very often), which may account for the spikes you are seeing. If the underlying implementation uses a blocking lock to access the queue, then context switches are very frequent which hurts performance a lot and makes it seem rather random.

answer of question 1:
this is because of Thread switching , thread switching (or context switching in OS concepts) is CPU clocks that takes to switch between each thread , most of times multi-threading increases the speed of programs and process but when it's process is so small and quick size then context switching will take more time than thread's self process so the whole program throughput decreases, you can find more information about this in O.S concepts books .
answer of question 2:
actually i have a overall insight of ThreadPool , and i cant explain what is it's structure exactly.

to learn more about ThreadPool start here ThreadPool Class
each version of .NET Framework adds more and more capabilities utilizing ThreadPool indirectly. such as Parallel.ForEach Method mentioned before added in .NET 4 along with System.Threading.Tasks which makes code more readable and neat. You can learn more on this here Task Schedulers as well.
At very basic level what it does is: it creates let's say 20 threads and puts them into a lits. Each time it receives a delegate to execute async it takes idle thread from the list and executes delegate. if no available threads found it puts it into a queue. every time deletegate execution completes it will check if queue has any item and if so peeks one and executes in the same thread.

ThreadPool behaviour: not growing from minimum size

I had set up my thread pool like this:
ThreadPool.SetMaxThreads(10000, 10000);
ThreadPool.SetMinThreads(20, 20);
However, my app started hanging under heavy load. This seemed to be because worker tasks were not executing: I had used ThreadPool.QueueUserWorkItem to run some tasks which in turn used the same method to queue further work. This is obviously dangerous with a limited thread pool (a deadlock situation), but I am using a thread pool not to limit maximum threads but to reduce thread creation overhead.
I can see the potential trap there, but I believed that setting a maximum of 10000 threads on the pool would mean that if an item was queued, all threads were busy, and there weren't 10000 threads in the pool, a new one would be created and the task processed there.
However, I changed to this:
ThreadPool.SetMaxThreads(10000, 10000);
ThreadPool.SetMinThreads(200, 200);
..and the app started working. If that made it start working, am I missing something about how/when the thread pool expands from minimum toward maximum size?

The job of the threadpool scheduler is to ensure there are no more executing TP threads than cpu cores. The default minimum is equal to the number of cores. A happy number since that minimizes the overhead due to thread context switching. Twice a second, the scheduler steps in and allows another thread to execute if the existing ones haven't completed.
It will therefore take a hour and twenty minutes of having threads that don't complete to get to your new maximum. It is fairly unlikely to ever get there, a 32-bit machine will keel over when 2000 threads have consumed all available virtual memory. You'd have a shot at it on a 64-bit operating system with a very large paging file. Lots of RAM required to avoid paging death, you'd need at least 12 gigabytes.
The generic diagnostic is that you are using TP threads inappropriately. They take too long, usually caused by blocking on I/O. A regular Thread is the proper choice for those kind of jobs. That's probably hard to fix right now, especially since you're happy with what you got. Raising the minimum is indeed a quick workaround. You'll have to hand-tune it since the TP scheduler can't do a reasonable job anymore.

Whenever you use the thread pool, you are at the mercy of its "thread injection and retirement algorithm".
The algorithm is not properly documented ( that I know of ) and not configurable.
If you're using Tasks, you can write your own Task Scheduler

The performance issue you described, is similar to what is documented in this ASP.NET KB article,
http://support.microsoft.com/kb/821268
To summarize, you need to carefully choose the parameters (this article mentions the typical settings for default ASP.NET thread pool, but you can apply the trick to your app), and further tune them based on performance testing and the characteristics of your app.
Notice that the more you learn about load, you will see that "heavy load" is no longer a good term to describe the situation. Sometimes you need to further categorize the cases, to include detailed term, such as burst load, and so on.

If your logic depends on having a minimum amount of threads you need to change that, urgently.
Setting a MinThreads of 200 (or even 20) is wasting quite a bit of memory. Note that the MaxThreads won't be relevant here, you probably don't have the 10 GB mem for that.
The fact that a min of 200 helps you out is suspicious and as a solution it is probably very brittle.
Take a look at normal Producer/Consumer patterns, and/or use a bounded queue to couple your tasks.

Threading cost - minimum execution time when threads would add speed

I am working on a C# application that works with an array. It walks through it (meaning that at one time only a narrow part of the array is used). I am considering adding threads in it to make it perform faster (it runs on a dualcore computer). The problem is that I do not know if it would actually help, because threads cost something and this cost could easily be more than the parallel gain... So how do I determine if threading would help?

Try writing some benchmarks that mimic, as closely as possible, the real-world conditions in which your software will actually be used.
Test and time the single-threaded version. Test and time the multi-threaded version. Compare the two sets of results.

If your application is CPU bound (i.e. it isn't spending time trying to read files or waiting for data from a device) and there is little to no sharing of live data (data being altered, if its read only its fine) between the threads then you can pretty much increase the speed by 50->75% by adding another thread (as long as it still remains CPU bound of course).
The main overhead in multithreading comes from 2 places.
Creation & initialization of the thread. Creating a thread requires quite a few resources to be allocated and involves swaps between kernel and user mode, this is expensive though a once off per thread so you can pretty much ignore it if the thread is running for any reasonable amount of time. The best way to mitigate this problem is to use a thread pool as it will keep the thread on hand and not need to be recreated.
Handling synchronization of data. If one thread is reading from data that another is writing, bad things will generally happen (worse if both are changing it). This requires you to lock your data before altering it so that no thread reads a half written value. These locks are generally quite slow as well. To mitigate this problem, you need to design your data layout so that the threads don't need to read or write to the same data as much as possible. If you do need a lot of these locks it can then become slower than the single thread option.
In short, if you are doing something that requires the CPU's to share a lot of data, then multi-threading it will be slower and if the program isn't CPU bound there will be little or no difference (could be a lot slower depending on what it is bound to, e.g. a cd/hard drive). If your program matches these conditions, then it will PROBABLY be worthwhile to add another thread (though the only way to be certain would be profiling).
One more little note, you should only create as many CPU bound threads as you have physical cores (threads that idle most of the time, such as a GUI message pump thread, can be ignored for this condition).
P.S. You can reduce the cost of locking data by using a methodology called "lock-free programming", though this something that should really only be attempted by people with a lot of experience with multi-threading and a clear understanding of their target architecture (including how the cache is treated and the memory bus).

I agree with Luke's answer. Benchmark it, it's the only way to be sure.
I can also give a prediction of the results - the fastest version will be when the number of threads matches the number of cores, EXCEPT if the array is very small and each thread would have to process just a few items, the setup/teardown times might get larger than the processing itself. How few - that depends on what you do. Again - benchmark.
I'd advise to find out a "minimum number of items for a thread to be useful". Then, when you are deciding how many threads to spawn (or take from a pool), check how many cores the computer has and how many items there are. Spawn as many threads as possible, but no more than the computer has cores, and not so many that each thread would have less than the minimum number of items to process.
For example if the minimum number of items is, say, 1000; and the computer has 4 cores; and your list contains 2500 items, you would spawn just 2 threads, because more threads would be inefficient (each would process less than 1000 items).

Making a step by step list for Luke's idea:
Make a single threaded test app
Download Sysinternals Process Monitor and run it
Run your test app and find it on the process list (remember to run it as a release build outside of Visual Studio)
Double click the process and select the Performance Graph tab
Observe the CPU time used by your process
If the CPU time is sittling flat 50% for more than a few seconds, you can probably speed your overall process up using threads (assuming the bunch of stuff Mr Peters refered to holds true)
(However, the best you can do on a duel core machine is to halve the time it takes to run. If your process only take 4 seconds, it might not be worth getting it to run in 2 seconds)
Using the task parallel library / Rx provides a friendlier interface than System.Threading.ThreadPool, which might make your world a bit easier.

You miss imho one item, which is that it is not always about execution time. There is:
The problem to koop a UI operational during an operation. Even if the UI is "dormant", a nonresponsive message pump makes a worse impression.
The possibility to use a thread pool to actually not ahve to start / stop threads all the time. I use thread pools very extensively, and various parts of the applications keep them busy.
Anyhow, ignoring my point 1 - where you may go multi threaded without speeding things up in order to keep your UI responsive - I would say it is always then faster when you can actually either split up work (so you can keep more than one core busy) or offload it for othe reasons.

How do I pick the best number of threads for hyptherthreading/multicore?

I have some embarrassingly-parallelizable work in a .NET 3.5 console app and I want to take advantage of hyperthreading and multi-core processors. How do I pick the best number of worker threads to utilize either of these the best on an arbitrary system? For example, if it's a dual core I will want 2 threads; quad core I will want 4 threads. What I'm ultimately after is determining the processor characteristics so I can know how many threads to create.
I'm not asking how to split up the work nor how to do threading, I'm asking how do I determine the "optimal" number of the threads on an arbitrary machine this console app will run on.

I'd suggest that you don't try to determine it yourself. Use the ThreadPool and let .NET manage the threads for you.

You can use Environment.ProcessorCount if that's the only thing you're after. But usually using a ThreadPool is indeed the better option.
The .NET thread pool also has provisions for sometimes allocating more threads than you have cores to maximise throughput in certain scenarios where many threads are waiting for I/O to finish.

The correct number is obviously 42.
Now on the serious note. Just use the thread pool, always.
1) If you have a lengthy processing task (ie. CPU intensive) that can be partitioned into multiple work piece meals then you should partition your task and then submit all individual work items to the ThreadPool. The thread pool will pick up work items and start churning on them in a dynamic fashion as it has self monitoring capabilities that include starting new threads as needed and can be configured at deployment by administrators according to the deployment site requirements, as opposed to pre-compute the numbers at development time. While is true that the proper partitioning size of your processing task can take into account the number of CPUs available, the right answer depends so much on the nature of the task and the data that is not even worth talking about at this stage (and besides the primary concerns should be your NUMA nodes, memory locality and interlocked cache contention, and only after that the number of cores).
2) If you're doing I/O (including DB calls) then you should use Asynchronous I/O and complete the calls in ThreadPool called completion routines.
These two are the the only valid reasons why you should have multiple threads, and they're both best handled by using the ThreadPool. Anything else, including starting a thread per 'request' or 'connection' are in fact anti patterns on the Win32 API world (fork is a valid pattern in *nix, but definitely not on Windows).
For a more specialized and way, way more detailed discussion of the topic I can only recommend the Rick Vicik papers on the subject:
designing-applications-for-high-performance-part-1.aspx
designing-applications-for-high-performance-part-ii.aspx
designing-applications-for-high-performance-part-iii.aspx

The optimal number would just be the processor count. Optimally you would always have one thread running on a CPU (logical or physical) to minimise context switches and the overhead that has with it.
Whether that is the right number depends (very much as everyone has said) on what you are doing. The threadpool (if I understand it correctly) pretty much tries to use as few threads as possible but spins up another one each time a thread blocks.
The blocking is never optimal but if you are doing any form of blocking then the answer would change dramatically.
The simplest and easiest way to get good (not necessarily optimal) behaviour is to use the threadpool. In my opinion its really hard to do any better than the threadpool so thats simply the best place to start and only ever think about something else if you can demonstrate why that is not good enough.

A good rule of the thumb, given that you're completely CPU-bound, is processorCount+1.
That's +1 because you will always get some tasks started/stopped/interrupted and n tasks will almost never completely fill up n processors.

The only way is a combination of data and code analysis based on performance data.
Different CPU families and speeds vs. memory speed vs other activities on the system are all going to make the tuning different.
Potentially some self-tuning is possible, but this will mean having some form of live performance tuning and self adjustment.

Or even better than the ThreadPool, use .NET 4.0 Task instances from the TPL. The Task Parallel Library is built on a foundation in the .NET 4.0 framework that will actually determine the optimal number of threads to perform the tasks as efficiently as possible for you.

I read something on this recently (see the accepted answer to this question for example).
The simple answer is that you let the operating system decide. It can do a far better job of deciding what's optimal than you can.
There are a number of questions on a similar theme - search for "optimal number threads" (without the quotes) gives you a couple of pages of results.

I would say it also depends on what you are doing, if your making a server application then using all you can out of the CPU`s via either Environment.ProcessorCount or a thread pool is a good idea.
But if this is running on a desktop or a machine that not dedicated to this task, you might want to leave some CPU idle so the machine "functions" for the user.

It can be argued that the real way to pick the best number of threads is for the application to profile itself and adaptively change its threading behavior based on what gives the best performance.

I wrote a simple number crunching app that used multiple threads, and found that on my Quad-core system, it completed the most work in a fixed period using 6 threads.
I think the only real way to determine is through trialling or profiling.

In addition to processor count, you may want to take into account the process's processor affinity by counting bits in the affinity mask returned by the GetProcessAffinityMask function.

If there is no excessive i/o processing or system calls when the threads are running, then the number of thread (except the main thread) is in general equal to the number of processors/cores in your system, otherwise you can try to increase the number of threads by testing.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.