Slow worker thread performance with priority queue - c#

I was attempting to use worker threads to speed up a larger algorithm when I noticed that using independent priority queues on more threads actually slowed performance down.
So I wrote a small test case in which I query how many threads to start, pin each thread to its own processor, and then push and pop a lot of items from my priority queues. Each thread owns its own priority queue, and the queues are allocated separately, so I don't suspect false sharing.
I put the test case here, because it's longer than a snippet.
(The processor affinity bit comes from NCrunch)
The priority queue is of my own creation because .NET didn't have a built-in one. It uses a pairing heap, if that makes any difference.
At any rate, if I run the program with one thread on one core, it gets about 100% usage.
Usage drops with two threads on two cores,
and eventually dwindles to 30% usage with all 8 cores.
This is a problem because the drop in performance nullifies any would-be gain from multithreading. What's causing the drop in performance? Each queue is completely independent of the other threads' queues.

Some problems, like computing pi, are well suited to parallelization, and hyperthreading can actually give you a speedup. When you are dealing with a memory-heavy problem like yours, hyperthreading can't help and can actually hurt. Check out "pipelining" in CPU architecture.
There are not many practical problems for which you can get a 2x speedup by using 2 CPUs. The more CPUs, the more overhead. In your test case, I suspect the cores are having to wait for the memory subsystem. If you tweak the memory requirements, you should see an increase in performance (and utilization) as the working set gets closer to the CPU cache size.
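One way to test the memory suspicion is to time the same private workload on 1, 2, ... N threads while varying the working-set size. The following is only a sketch: `ScalingProbe`, `Walk`, and the parameter names are made up for illustration, and a plain sequential array walk stands in for the real priority-queue traffic.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ScalingProbe
{
    // Hypothetical workload: each thread walks its own private array,
    // so no data is shared between threads.
    static long Walk(int[] data, int passes)
    {
        long sum = 0;
        for (int p = 0; p < passes; p++)
            for (int i = 0; i < data.Length; i++)
                sum += data[i];
        return sum;
    }

    // Runs the same workload on threadCount threads at once and
    // returns the wall-clock time until all of them have finished.
    public static TimeSpan Measure(int threadCount, int workingSetInts, int passes)
    {
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            var data = new int[workingSetInts]; // private to this thread
            threads[t] = new Thread(() => Walk(data, passes));
        }

        var sw = Stopwatch.StartNew();
        foreach (var th in threads) th.Start();
        foreach (var th in threads) th.Join();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

If the elapsed time stays roughly flat as `threadCount` grows toward the core count, the work scales. If it climbs once the combined working set outgrows the cache, the threads are probably competing for memory rather than for CPU.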

The OS assigns the work to whichever CPU it wishes at any given moment. Therefore, you see every processor do some amount of work.
Additionally, when you say "drop in performance", have you checked how much contention the system is creating? You are probably relieving yourself of contention among the threads as well.


Threading issues with > 30 threads. CPU scales non-linearly

I am having some trouble with my C# application.
I made sure threads do not access any resources outside themselves.
Now I have a thread-pool thread that makes a TCP connection, creates the thread objects, and runs them. With 1 thread, performance is great. With 50 threads it seems the same, maybe 5-10% slower, with CPU at 10-20%. With 100 threads, CPU usage goes from 10-20% to 70-99%.
One of our developers said that Windows threads suck compared to Linux threads and that the context switching is incurring huge penalties. He proposes multiplexing all the instances onto 4-8 core threads.
But I thought problems like this start happening once you have 1000+ threads. Can anyone comment with some good sources to read more about this topic, and about thread / cpu performance and correct practices?
EDIT: OK, many answers seem a little off point because some assumptions are being made, so I will add some extra points:
Running 3 applications with 50 threads each at 10-20% CPU usage makes them all use that much: 30-60% CPU usage in total.
Running 1 application with 150 threads makes it cap the CPU at 70-99%.
This is what I mean by threads not scaling.
To expand on my comment..
It's not that Windows threads "suck" in comparison to POSIX threads; it's just that you're trying to do more things than your CPU can physically handle at a time. CPU usage is not a particularly relevant performance indicator to be looking at here.
If your CPU has 4 cores, your optimum number of constantly running threads is 4. Any more and performance will degrade because, yes, context switching has a performance impact as the CPU tries to work through the threads simultaneously with only 1 resource.
Think of your threads as giant stacks of books on your table. You've got to knock each individual book off the top of each stack, and you want them all done as fast as you can. You've got 4 of these book stacks (threads) but only 2 arms (cores), so how do you do it? The most likely option is to alternate which stack you knock books off each time, so there's no real performance benefit, as each stack is going to take as long as any other.
The only time this would differ is if you're running a blocking operation (i.e. waiting for I/O) and your threads are idle. In this idle time your cores are free to work on another thread, which can give a perceived performance benefit. Of course, when the resource that your other thread is waiting for becomes available, you're back in the same situation you are in currently.
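The "multiplexing" idea from the question, a fixed handful of threads servicing all the instances, could look roughly like this. It is a sketch under assumptions: `CoreSizedWorkers` and `RunAll` are hypothetical names, and each instance's work is assumed to be expressible as a plain `Action`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class CoreSizedWorkers
{
    // Runs every job on a fixed set of worker threads, one per logical core,
    // instead of giving each job a thread of its own.
    public static void RunAll(Action[] jobs)
    {
        using (var queue = new BlockingCollection<Action>())
        {
            int workerCount = Environment.ProcessorCount;
            var workers = new Thread[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = new Thread(() =>
                {
                    // Blocks while the queue is empty; the loop ends once
                    // CompleteAdding has been called and the queue is drained.
                    foreach (var job in queue.GetConsumingEnumerable())
                        job();
                });
                workers[i].Start();
            }

            foreach (var job in jobs) queue.Add(job);
            queue.CompleteAdding();

            foreach (var w in workers) w.Join();
        }
    }
}
```

With CPU-bound jobs this keeps every core busy without asking the scheduler to juggle more runnable threads than there are cores. Jobs that mostly block on I/O would leave cores idle under this scheme, which is the exception described above.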

Program executing slow when running many threads

I have written a program in C# that does a lot of parallel work using different threads. When I reach approximately 300 threads, the GUI of the program starts to become slow and the execution of the threads also slows down drastically. The threads are reading and writing data from a MySQL database running on a different machine.
The funny thing is that if I split the work between two processes on the same machine, everything runs perfectly. Is there a thread limit per process in the .NET Framework or in Windows? Or why am I getting this behaviour? Could it be a network-related problem? I am running Windows 7 Ultimate and I have tried both VS2010 and VS2012 with the same behaviour.
The way processor time is allocated is that the Operating System gives processor time to every process, then every process gives time to every thread.
So two processes will get twice the processor time, and that's why it works faster if you divide the program into two processes.
If you want to make the GUI run smoother, just set the priority higher for that thread.
This way the GUI thread will get more processor time than the other threads, but not so much that it will noticeably slow down the other threads.
300 threads is silly.
The number of threads should be in the range of your number of cores (2..8) and/or the max simultaneous connections (sometimes only 4 over TCP) your system supports.
Get beyond that and you're only wasting memory, at 1 MB per thread. On a 32-bit system, 300 MB already consumes a lot of the available memory space. And I assume each thread has some buffers attached.
If 2 separate processes perform better than 1, then it probably isn't the context switching but either memory usage or a connection limit that holds you back.
Use ThreadPool. That should automatically allocate the optimal number of threads based on your system by throttling the number of threads in existence. You can also set the maximum number of threads allowable at any one time.
Also, if you're allocating threads to parallelize tasks from within a for loop, a foreach loop, or a LINQ statement, you should look at the Parallel class or PLINQ.
The accepted answer to this question will probably explain what is happening, but 300 threads seems like too many to be a good idea for any normal application.
First of all, if you have 300 threads in one application, you should probably rethink your program design.
Raising the GUI thread's priority may make the GUI perform better. But if you run that many threads, the OS has to allocate stack space for each of them. The stack is a contiguous segment of memory, so each time you create a new thread, the space allocated for the stack may be too small to hold it. The OS then has to allocate a larger contiguous space in memory and copy all the data from the old stack to the new one. So this may obviously slow your program down.
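For the loop case, a minimal sketch of the Parallel class and PLINQ suggestions might look as follows. `ParallelLoops` and `Process` are placeholder names, and squaring an integer stands in for whatever each iteration really does.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ParallelLoops
{
    // Hypothetical per-item work.
    static int Process(int x) { return x * x; }

    // Parallel.For schedules the iterations on thread-pool threads;
    // MaxDegreeOfParallelism caps how many run at the same time.
    public static int[] WithParallelFor(int[] input)
    {
        var output = new int[input.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
        Parallel.For(0, input.Length, options, i => { output[i] = Process(input[i]); });
        return output;
    }

    // The PLINQ equivalent; AsOrdered keeps results in input order.
    public static int[] WithPlinq(int[] input)
    {
        return input.AsParallel().AsOrdered().Select(x => Process(x)).ToArray();
    }
}
```

Either way the runtime decides how many threads actually exist, so 300 work items do not turn into 300 threads.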

How to decrease CPU utilization on intensive multithread computation

I have implemented a meta-heuristic solver that uses .NET 4.0 Parallel.For and Parallel.ForEach.
It works fine on my mid-range machine. But the search is too intensive and consumes too many resources, especially CPU time, on a lower-end machine.
So I think I have to add an option to tune down the intensity of the search when needed. I would like to get CPU utilization down without touching the algorithm much. It is fine if the search completes more slowly, as long as it doesn't lock up the machine and leaves room for other work.
I'm considering putting Thread.Sleep in the methods, as all threads are 100% CPU bound (no I/O). Will that decrease the CPU usage the way I need? Is there any better solution?
If you use different threads, I suggest the following:
Thread.Priority = ThreadPriority.BelowNormal
So the thread gets less priority.
Edit
Okay, this solution is only acceptable if it's not used with the ThreadPool.
Here I'm not sure, but what about:
Process.ProcessorAffinity
Gets or sets the processors on which the threads in this process can be scheduled to run.
If we limit the cores, we get more room for other work, right?
(sticks head above parapet, preparing to get shot down with many downvotes..:)
If you use different threads, I suggest the following:
Thread.Priority = ThreadPriority.BelowNormal
So the thread gets less priority.
[Yes, this is a straight copy of C Sharper's answer, but without the later edits about threadpool threads and affinity-bodging. If you think this is a good idea, upvote C Sharper, not me.]
Alternatively, reduce the priority of the whole process with the Task Manager or by other means. This will reduce the 'real' scheduling priority of all the threads in the process (including the 'really bad idea' threadpool threads) and improve the user experience for other apps that are not CPU-intensive but should respond quickly when they need to do something. The lower-priority heavy app will still soak up all the remaining CPU and, if you decide not to edit this month's sales figures or watch the latest illegally-downloaded copyrighted films, will perform as well as if run at normal priority.
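Both suggestions, lowering one thread's priority and lowering the whole process's priority, can be done from code as well as from Task Manager. The sketch below uses real .NET members (`Thread.Priority`, `Process.PriorityClass`), but the `Politeness` wrapper and its method names are invented for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Politeness
{
    // Lowers the scheduling priority of every thread in this process,
    // thread-pool threads included.
    public static void LowerProcessPriority()
    {
        using (var me = Process.GetCurrentProcess())
            me.PriorityClass = ProcessPriorityClass.BelowNormal;
    }

    // Starts a dedicated (non-pool) thread at reduced priority.
    public static Thread StartLowPriority(ThreadStart work)
    {
        var t = new Thread(work) { Priority = ThreadPriority.BelowNormal };
        t.Start();
        return t;
    }
}
```

Calling `Politeness.LowerProcessPriority()` once at startup is the programmatic counterpart of the Task Manager approach. The solver still uses whatever CPU is left over, but interactive applications get scheduled ahead of it.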

What is the practical limit of threads per CPU?

I've been playing around with threading, attempting to push some limits to the extreme, for my own amusement. I know the threadpool defaults to 25 threads and can be pushed up to 1000 (according to MSDN). What, though, is the practical limit of threads per CPU core? At some point, context switching is going to cause more of a bottleneck than threading saves. Does anyone have any best practices covering this? Are we talking 100, 200, 500? Does it depend on what the threads are doing? What determines, other than framework-dictated architecture, how many threads operate optimally per CPU core?
It's all dependent on what the threads are doing, of course. If they are CPU-bound (say sitting tight in an infinite loop) then one thread per core will be enough to saturate the CPU; any more than that (and you will already have more, from background processes etc) and you will start getting contention.
On the other extreme, if the threads are not eligible to run (e.g. blocked on some synchronization object), then the limit of how many you could have would be dictated by factors other than the CPU (memory for the stacks, OS internal limits, etc).
If your application is not CPU bound (like the majority), then context switches are not a big deal, because every time your app has to wait, a context switch is necessary anyway. The problem with having too many threads is about OS data structures and some synchronization anomalies like starvation, where a thread never (or very rarely) gets a chance to execute due to the randomness of synchronization algorithms.
If your application is CPU bound (it spends 99% of its time working on memory and very rarely does I/O or waits for something else, such as user input or another thread), then the optimum would be 1 thread per logical core, because in this case there will be no context switching.
Beware that the OS interrupts threads all the time, even when there's only one thread for multiple CPUs. The OS interrupts threads not only to switch tasks, but also for thread management purposes (like updating counters to show in Task Manager, or allowing a superuser to kill the thread).

Threading cost - minimum execution time when threads would add speed

I am working on a C# application that works with an array. It walks through it (meaning that at any one time only a narrow part of the array is used). I am considering adding threads to make it perform faster (it runs on a dual-core computer). The problem is that I do not know if it would actually help, because threads cost something and this cost could easily be more than the parallel gain... So how do I determine if threading would help?
Try writing some benchmarks that mimic, as closely as possible, the real-world conditions in which your software will actually be used.
Test and time the single-threaded version. Test and time the multi-threaded version. Compare the two sets of results.
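A bare-bones version of such a comparison might look like the following. Summing an `int[]` is only a stand-in for the real array walk, and `ArrayBenchmark`, `SumSequential`, `SumParallel`, and `Time` are names invented for this sketch.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class ArrayBenchmark
{
    public static long SumSequential(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++) sum += data[i];
        return sum;
    }

    // Splits the array into one contiguous chunk per requested worker.
    // Each worker accumulates into a local variable and writes its result
    // once, so the workers share no mutable state while they run.
    public static long SumParallel(int[] data, int workerCount)
    {
        var partial = new long[workerCount];
        int chunk = (data.Length + workerCount - 1) / workerCount;
        Parallel.For(0, workerCount, t =>
        {
            int start = t * chunk;
            int end = Math.Min(start + chunk, data.Length);
            long local = 0;
            for (int i = start; i < end; i++) local += data[i];
            partial[t] = local;
        });

        long sum = 0;
        foreach (long p in partial) sum += p;
        return sum;
    }

    // Times repeated runs after one untimed warm-up call,
    // so JIT compilation is not part of the measurement.
    public static TimeSpan Time(Func<long> action, int repetitions)
    {
        action();
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++) action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Comparing `Time(() => SumSequential(data), 100)` with `Time(() => SumParallel(data, Environment.ProcessorCount), 100)` on realistic array sizes, in a release build, gives the two sets of results to compare.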
If your application is CPU bound (i.e. it isn't spending time trying to read files or waiting for data from a device) and there is little to no sharing of live data between the threads (data being altered; if it's read-only, it's fine), then you can pretty much increase the speed by 50-75% by adding another thread (as long as it still remains CPU bound, of course).
The main overhead in multithreading comes from 2 places:
1. Creation and initialization of the thread. Creating a thread requires quite a few resources to be allocated and involves switches between kernel and user mode. This is expensive, though it is a one-off cost per thread, so you can pretty much ignore it if the thread runs for any reasonable amount of time. The best way to mitigate this problem is to use a thread pool, as it keeps the threads on hand so they don't need to be recreated.
2. Handling synchronization of data. If one thread is reading data that another is writing, bad things will generally happen (worse if both are changing it). This requires you to lock your data before altering it so that no thread reads a half-written value. These locks are generally quite slow as well. To mitigate this problem, you need to design your data layout so that, as far as possible, the threads don't need to read or write the same data. If you do need a lot of these locks, the result can become slower than the single-threaded option.
In short, if you are doing something that requires the CPUs to share a lot of data, then multi-threading it will be slower, and if the program isn't CPU bound there will be little or no difference (it could be a lot slower depending on what it is bound to, e.g. a CD or hard drive). If your program avoids both of these problems, then it will PROBABLY be worthwhile to add another thread (though the only way to be certain is profiling).
One more little note: you should only create as many CPU-bound threads as you have physical cores (threads that idle most of the time, such as a GUI message pump thread, can be ignored for this condition).
P.S. You can reduce the cost of locking data by using a methodology called "lock-free programming", though this is something that should really only be attempted by people with a lot of experience with multi-threading and a clear understanding of their target architecture (including how the cache is treated and the memory bus).
I agree with Luke's answer. Benchmark it, it's the only way to be sure.
I can also give a prediction of the results: the fastest version will be the one where the number of threads matches the number of cores. The EXCEPTION is when the array is very small and each thread would process just a few items; then the setup/teardown times might get larger than the processing itself. How few? That depends on what you do. Again: benchmark.
I'd advise finding out a "minimum number of items for a thread to be useful". Then, when you are deciding how many threads to spawn (or take from a pool), check how many cores the computer has and how many items there are. Spawn as many threads as possible, but no more than the computer has cores, and not so many that each thread would have fewer than the minimum number of items to process.
For example, if the minimum number of items is, say, 1000, the computer has 4 cores, and your list contains 2500 items, you would spawn just 2 threads, because more threads would be inefficient (each would process fewer than 1000 items).
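The rule above reduces to a small calculation. `ThreadPlanner` and its parameter names are hypothetical, and falling back to a single thread when there are fewer items than the minimum is an assumption added here.

```csharp
using System;

static class ThreadPlanner
{
    // As many threads as possible, but no more than there are cores, and
    // never so many that a thread gets fewer than minItemsPerThread items.
    public static int ThreadCount(int itemCount, int minItemsPerThread, int coreCount)
    {
        int byItems = itemCount / minItemsPerThread; // threads that can be kept "full"
        return Math.Max(1, Math.Min(coreCount, byItems));
    }
}
```

With the numbers from the example, `ThreadPlanner.ThreadCount(2500, 1000, 4)` returns 2. In a real program the last argument would typically be `Environment.ProcessorCount`.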
Making a step-by-step list for Luke's idea:
1. Make a single-threaded test app
2. Download Sysinternals Process Explorer and run it
3. Run your test app and find it in the process list (remember to run it as a release build outside of Visual Studio)
4. Double-click the process and select the Performance Graph tab
5. Observe the CPU time used by your process
6. If the CPU time is sitting flat at 50% for more than a few seconds, you can probably speed up your overall process using threads (assuming the bunch of stuff Mr Peters referred to holds true)
(However, the best you can do on a dual-core machine is to halve the time it takes to run. If your process only takes 4 seconds, it might not be worth getting it to run in 2 seconds.)
Using the Task Parallel Library / Rx provides a friendlier interface than System.Threading.ThreadPool, which might make your world a bit easier.
IMHO you are missing one item, which is that it is not always about execution time. There is:
1. The problem of keeping a UI operational during an operation. Even if the UI is "dormant", a nonresponsive message pump makes a bad impression.
2. The possibility of using a thread pool so you don't actually have to start and stop threads all the time. I use thread pools very extensively, and various parts of the application keep them busy.
Anyhow, ignoring my point 1 (where you may go multi-threaded without speeding things up, in order to keep your UI responsive), I would say it is always faster when you can actually either split up the work (so you can keep more than one core busy) or offload it for other reasons.
