What is the practical limit of threads per CPU? - c#

I've been playing around with threading, attempting to push some limits to the extreme - for my own amusement. I know the threadpool defaults to 25 threads and can be pushed up to 1000 (according to MSDN). What, though, is the practical limit of threads per CPU core? At some point, context switching is going to cause more of a bottleneck than threading saves. Does anyone have any best practices covering this? Are we talking 100, 200, 500? Does it depend on what the threads are doing? What, other than framework-dictated architecture, determines how many threads operate optimally per CPU core?

It's all dependent on what the threads are doing, of course. If they are CPU-bound (say, sitting tight in an infinite loop) then one thread per core will be enough to saturate the CPU; any more than that (and you will already have more, from background processes etc.) and you will start getting contention.
On the other extreme, if the threads are not eligible to run (e.g. blocked on some synchronization object), then the limit of how many you could have would be dictated by factors other than the CPU (memory for the stacks, OS internal limits, etc).

If your application is not CPU-bound (like the majority), then context switches are not a big deal, because every time your app has to wait, a context switch is necessary anyway. The problem with having too many threads is about OS data structures and some synchronization anomalies, like starvation, where a thread never (or very rarely) gets a chance to execute due to the randomness of synchronization algorithms.
If your application is CPU-bound (it spends 99% of its time working on memory, very rarely doing I/O or waiting for something else such as user input or another thread), then the optimum is one thread per logical core, because in this case there will be essentially no context switching.
Beware that the OS still interrupts threads from time to time, even when there's only one thread per CPU. The OS interrupts threads not only to perform task switching, but also for thread-management purposes (like updating the counters shown in Task Manager, or allowing a superuser to kill the thread).
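To make the CPU-bound case concrete, here is a minimal sketch (not from the original answer) that caps the degree of parallelism at the number of logical cores; BusyWork is just a hypothetical stand-in for real CPU-bound work:
using System;
using System.Threading.Tasks;

class CpuBoundDemo
{
    static void Main()
    {
        // One worker per logical core is enough to saturate the CPU for purely
        // CPU-bound work; more workers only add context-switching overhead.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 1000, options, i => BusyWork(i));
    }

    static double BusyWork(int seed)
    {
        // Hypothetical stand-in for real CPU-bound work.
        double x = seed;
        for (int i = 0; i < 1000000; i++)
            x = Math.Sqrt(x + i);
        return x;
    }
}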

Related

What is the advantage of creating a thread outside threadpool?

Okay, so I wanted to know what happens when I use TaskCreationOptions.LongRunning. From this answer, I came to know that for long-running tasks I should use this option, because it creates a thread outside the threadpool.
Cool. But what advantage would I get when I create a thread outside the threadpool? And when should I do it, and when should I avoid it?
what advantage would I get when I create a thread outside the threadpool?
The threadpool, as its name states, is a pool of threads which are allocated once and re-used, in order to save the time and resources needed to allocate a thread. The pool itself re-sizes on demand. If you queue more work items than there are workers in the pool, it will allocate more threads at 500 ms intervals, one at a time (this exists to avoid allocating multiple threads simultaneously, when existing threads may already have finished executing and can serve requests). If many long-running operations are performed on the thread pool, it causes "thread starvation", meaning delegates will start getting queued and will run only once a thread frees up. That's why you'd want to avoid doing lengthy work on thread-pool threads.
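A small sketch of that starvation effect, assuming nothing about your real workload: it queues more blocking delegates than there are cores and prints how long each one sat in the queue before a worker picked it up (once the initial threads are busy, each later item waits roughly another half second for a new thread to be injected):
using System;
using System.Diagnostics;
using System.Threading;

class StarvationDemo
{
    static void Main()
    {
        var clock = Stopwatch.StartNew();
        for (int i = 0; i < 32; i++)
        {
            int id = i;
            TimeSpan queuedAt = clock.Elapsed;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                TimeSpan waited = clock.Elapsed - queuedAt;
                Console.WriteLine("item {0} waited {1:F1}s before starting", id, waited.TotalSeconds);
                Thread.Sleep(5000);   // simulate a lengthy blocking operation
            });
        }
        Console.ReadLine();   // keep the process alive while the items run
    }
}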
The Managed Thread-Pool docs also have a section on this question:
There are several scenarios in which it is appropriate to create and manage your own threads instead of using thread pool threads:
You require a foreground thread.
You require a thread to have a particular priority.
You have tasks that cause the thread to block for long periods of time. The thread pool has a maximum number of threads, so a large number of blocked thread pool threads might prevent tasks from starting.
You need to place threads into a single-threaded apartment. All ThreadPool threads are in the multithreaded apartment.
You need to have a stable identity associated with the thread, or to dedicate a thread to a task.
For more, see:
Thread vs ThreadPool
When should I not use the ThreadPool in .Net?
Dedicated thread or thread-pool thread?
"Long running" can be quantified pretty well, a thread that takes more than half a second is running long. That's a mountain of processor instructions on a modern machine, you'd have to burn a fat five billion of them per second. Pretty hard to do in a constructive way unless you are calculating the value of Pi to thousands of decimals in the fraction.
Practical threads can only take that long when they are not burning core but are waiting a lot. Invariably on an I/O completion, like reading data from a disk, a network, a dbase server. And often the reason you'd start considering using a thread in the first place.
The threadpool has a "manager". It determines when a threadpool thread is allowed to start. It doesn't happen immediately when you start it in your program. The manager tries to limit the number of running threads to the number of CPU cores you have. It is much more efficient that way, context switching between too many active threads is expensive. And a good throttle, preventing your program from consuming too many resources in a burst.
But the threadpool manager has the very common problem with managers, it doesn't know enough about what is going on. Just like my manager doesn't know that I'm goofing off at Stackoverflow.com, the tp manager doesn't know that a thread is waiting for something and not actually performing useful work. Without that knowledge it cannot make good decisions. A thread that does a lot of waiting should be ignored and another one should be allowed to run in its place. Actually doing real work.
Just like you tell your manager that you go on vacation, so he can expect no work to get done, you tell the threadpool manager the same thing with LongRunning.
Do note that it isn't quite as bad as it, perhaps, sounds in this answer. In particular, .NET 4.0 hired a new manager that's a lot smarter at figuring out the optimum number of running threads. It does so with a feedback loop, collecting data to discover whether active threads actually get work done, and adjusts the optimum accordingly. The only problem with this approach is the common one when you close a feedback loop: you have to make it slow so the loop cannot become unstable. In other words, it isn't particularly quick at driving up the number of active threads.
If you know ahead of time that the thread is pretty abysmal, running for many seconds with no real CPU load, then always pick LongRunning. Otherwise it is a tuning job: observe the program when it is done and tinker with it to make it more optimal.
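As a minimal sketch of that choice (PollQueueForever is a hypothetical stand-in for work that mostly waits), the LongRunning hint hands the task its own dedicated thread instead of tying up a pool worker:
using System;
using System.Threading;
using System.Threading.Tasks;

class LongRunningDemo
{
    static void Main()
    {
        // LongRunning: the scheduler gives this task a dedicated (background) thread
        // rather than borrowing a thread-pool worker for many seconds.
        Task pump = Task.Factory.StartNew(
            PollQueueForever,
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);

        // Short, CPU-light work items belong on the regular pool.
        Task quick = Task.Run(() => Console.WriteLine("quick work item"));
        quick.Wait();

        pump.Wait(TimeSpan.FromSeconds(2));   // demo only; a real pump runs until shutdown
    }

    static void PollQueueForever()
    {
        // Hypothetical stand-in for work that mostly waits (network, device, queue...).
        while (true)
            Thread.Sleep(500);
    }
}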

Is my understanding of the C# threadpool correct?

I'm reading Essential C# 5.0, which says:
The thread pool also assumes that all the work will be relatively short-running (that is, consuming milliseconds or seconds of processor time, not hours or days). By making this assumption, it can ensure that each processor is working full out on a task, and not inefficiently time slicing multiple tasks. The thread pool attempts to prevent excessive time slicing by ensuring that thread creation is "throttled" and so that no one processor is "oversubscribed" with too many threads.
I've always thought one of the benefits of multithreading was time slicing.
If you have more than one processor, then you can run threads concurrently and achieve true multithreading. But unless that's the case, you'd have to resort to time slicing for applications with multiple threads, right?
So, if the ThreadPool in C# doesn't time slice, then,
a. Does it mean the ThreadPool is only used to get around the overhead in creating new threads?
b. Does it mean the ThreadPool can't run multiple threads simultaneously unless the processor has multiple cores, where each core can run a single process?
The .NET thread pool will create multiple threads per core, but has heuristics to keep the number of threads as low as possible while performing the maximum amount of work.
This means if your code is CPU-bound, you may end up with a single thread per core. If your code is I/O-bound and blocks, or queues up a huge amount of work items, you may end up with many threads per core.
It's not just thread creation that is expensive: context switching between hundreds of threads takes up a lot of time that you'd rather spend running your own code. More threads are almost never better.
The quote you mention refers to the rate at which the Threadpool creates new threads when all of the threads it has already created are allocated to a task. The rate of new thread creation is throttled (and new tasks are queued) to avoid creating many threads (and swamping the CPU) when a large burst of tasks is put on the Threadpool.
The current algorithm does indeed create many threads per CPU core, but it creates them relatively slowly, in the hope that the current backlog of tasks will be quickly satisfied by the threads that it has already created, and adding threads will not be needed.
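If you want to watch that throttled growth yourself, a hedged little experiment (using only the standard ThreadPool API, assuming nothing about your app) is to queue far more blocking items than you have cores and sample how many worker threads are busy each second:
using System;
using System.Threading;

class PoolGrowthDemo
{
    static void Main()
    {
        // Queue far more blocking items than there are cores...
        for (int i = 0; i < 100; i++)
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(10000));

        int maxWorkers, maxIo;
        ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);

        // ...then watch the busy-thread count creep up as the pool injects threads.
        for (int s = 0; s < 15; s++)
        {
            int freeWorkers, freeIo;
            ThreadPool.GetAvailableThreads(out freeWorkers, out freeIo);
            Console.WriteLine("busy worker threads: {0}", maxWorkers - freeWorkers);
            Thread.Sleep(1000);
        }
    }
}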
Q. Does it mean the ThreadPool is only used to get around the overhead in creating new threads?
A. No. Thread creation is usually not the problem. The goal is to make your CPU work at 100%, while using as few concurrently executing threads as possible. A low number of threads avoids excessive context switching, and improves overall execution time.
Q. Does it mean the ThreadPool can't run multiple threads simultaneously unless the processor has multiple cores, where each core can run a single process?
A. It runs multiple threads simultaneously. Ideally it runs exactly one CPU-intensive task per core, in which case your CPU works at 100%.

Threading issues with > 30 threads. CPU scales non-linearly

I am having some trouble with my C# application.
I made sure threads do not access any resources outside themselves.
Now I have a threadpool thread that makes a TCP connection, creates the thread objects and runs them. With 1 thread, performance is great. With 50 threads it seems the same, maybe 5-10% slower, with CPU at 10-20%. With 100 threads, the CPU usage goes from 10-20% to 70-99%.
One of our developers said that Windows threads suck compared to Linux threads and that the context switching is incurring huge penalties. He proposes multiplexing, with 4-8 core threads running all the instances.
But I thought problems like this start happening once you have 1000+ threads. Can anyone comment with some good sources to read more about this topic, and about thread/CPU performance and correct practices?
EDIT: OK, many answers seem a little off point because some assumptions are being made, so I will add some extra points:
Running 3 applications with 50 threads each at 10-20% CPU usage makes them all use that much: 30-60% CPU usage total.
Running 1 application with 150 threads makes it cap the CPU at 70-99%.
This is what I mean by threads not scaling.
To expand on my comment..
It's not that Windows threads "suck" in comparison to POSIX threads; it's just that you're trying to do more things than your CPU can physically handle at a time. CPU usage is not a particularly relevant performance indicator here.
If your CPU has 4 cores, your optimum number of constantly-running threads is 4. Any more and performance degradation is going to happen, because context switching has a cost as the CPU tries to work through more threads than it has cores for.
Think of your threads as giant stacks of books on your table: you've got to knock each individual book off the top of each stack, and you want it all done as fast as you can. You've got 4 of these book stacks (threads) but only 2 arms (cores). How do you do it? The most likely option is to alternate which stack you knock books off each time, so there's no real performance benefit, as each stack is going to take as long as it would on its own.
The only time when this would differ is if you're running a blocking (ie. waiting for I/O) operation and your threads are idle. In this idle time your cores are free to work on another thread which can give a perceived performance benefit. Of course, when the resource that your other thread is waiting for becomes available you're back in the same situation you are in currently.

ThreadPool behaviour: not growing from minimum size

I had set up my thread pool like this:
ThreadPool.SetMaxThreads(10000, 10000);
ThreadPool.SetMinThreads(20, 20);
However, my app started hanging under heavy load. This seemed to be because worker tasks were not executing: I had used ThreadPool.QueueUserWorkItem to run some tasks which in turn used the same method to queue further work. This is obviously dangerous with a limited thread pool (a deadlock situation), but I am using a thread pool not to limit maximum threads but to reduce thread creation overhead.
I can see the potential trap there, but I believed that setting a maximum of 10000 threads on the pool would mean that if an item was queued while all threads were busy and there weren't yet 10000 threads in the pool, a new one would be created and the task processed there.
However, I changed to this:
ThreadPool.SetMaxThreads(10000, 10000);
ThreadPool.SetMinThreads(200, 200);
..and the app started working. If that made it start working, am I missing something about how/when the thread pool expands from minimum toward maximum size?
The job of the threadpool scheduler is to ensure there are no more executing TP threads than CPU cores. The default minimum is equal to the number of cores. A happy number, since that minimizes the overhead due to thread context switching. Twice a second, the scheduler steps in and allows another thread to execute if the existing ones haven't completed.
It will therefore take an hour and twenty minutes of having threads that don't complete to get to your new maximum. It is fairly unlikely to ever get there; a 32-bit machine will keel over when 2000 threads have consumed all available virtual memory. You'd have a shot at it on a 64-bit operating system with a very large paging file. Lots of RAM is required to avoid paging death; you'd need at least 12 gigabytes.
The generic diagnostic is that you are using TP threads inappropriately: they take too long, usually caused by blocking on I/O. A regular Thread is the proper choice for those kinds of jobs. That's probably hard to fix right now, especially since you're happy with what you've got. Raising the minimum is indeed a quick workaround. You'll have to hand-tune it, since the TP scheduler can't do a reasonable job anymore.
Whenever you use the thread pool, you are at the mercy of its "thread injection and retirement algorithm".
The algorithm is not properly documented (that I know of) and not configurable.
If you're using Tasks, you can write your own TaskScheduler.
The performance issue you described is similar to what is documented in this ASP.NET KB article:
http://support.microsoft.com/kb/821268
To summarize, you need to carefully choose the parameters (the article mentions the typical settings for the default ASP.NET thread pool, but you can apply the trick to your app), and further tune them based on performance testing and the characteristics of your app.
Notice that as you learn more about load, you will see that "heavy load" is no longer a good term to describe the situation. Sometimes you need to further categorize the cases, using more detailed terms such as burst load, and so on.
If your logic depends on having a minimum number of threads, you need to change that, urgently.
Setting a MinThreads of 200 (or even 20) wastes quite a bit of memory. Note that MaxThreads won't be relevant here; you probably don't have the 10 GB of memory for that.
The fact that a minimum of 200 helps you out is suspicious, and as a solution it is probably very brittle.
Take a look at normal Producer/Consumer patterns, and/or use a bounded queue to couple your tasks.
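A minimal sketch of that bounded producer/consumer coupling; the capacity of 100, the 4 workers and the Process method are placeholders, not a recommendation:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BoundedQueueDemo
{
    static void Main()
    {
        // Producers block (back-pressure) once 100 items are pending, instead of
        // the thread pool drowning in queued work items.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 100))
        {
            var consumers = new Task[4];   // a fixed, known number of workers
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (int item in queue.GetConsumingEnumerable())
                        Process(item);   // hypothetical work
                }, TaskCreationOptions.LongRunning);
            }

            for (int i = 0; i < 1000; i++)
                queue.Add(i);             // blocks while the queue is full

            queue.CompleteAdding();       // signal "no more work"
            Task.WaitAll(consumers);
        }
    }

    static void Process(int item)
    {
        // Stand-in for the real task.
    }
}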

Are Socket.*Async methods threaded?

I'm currently trying to figure out the best way to minimize the number of threads I use in a TCP master server, in order to maximize performance.
As I've been reading a lot recently about the new async features of C# 5.0, asynchronous does not necessarily mean multithreaded. It could mean the work is separated into smaller chunks of finite-state objects, which are then processed alongside other operations by alternating. However, I don't see how this could be done in networking, since I'm basically "waiting" for input (from the client).
Therefore, I wouldn't use ReceiveAsync() for all my sockets; it would just be creating and ending threads continuously (assuming it does create threads).
Consequently, my question is more or less: what architecture can a master server take without having one "thread" per connection?
Side question for bonus coolness points: Why is having multiple threads bad, considering that having more threads than processing cores simply makes the machine "fake" multithreading, just like any other asynchronous method would?
No, you would not necessarily be creating threads. There are two possible ways you can do async without setting up and tearing down threads all the time:
You can have a "small" number of long-lived threads, and have them sleep when there's no work to do (this means that the OS will never schedule them for execution, so the resource drain is minimal). Then, when work arrives (i.e. Async method called), wake one of them up and tell it what needs to be done. Pleased to meet you, managed thread pool.
In Windows, the most efficient mechanism for async is I/O completion ports which synchronizes access to I/O operations and allows a small number of threads to manage massive workloads.
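To make that concrete, here's a hedged sketch of an accept loop that rides on I/O completion ports via async/await instead of one thread per connection; the port number and the echo logic are placeholders:
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class AsyncEchoServer
{
    static void Main()
    {
        RunAsync().GetAwaiter().GetResult();
    }

    static async Task RunAsync()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9000);
        listener.Start();

        while (true)
        {
            TcpClient client = await listener.AcceptTcpClientAsync();
            // No dedicated thread per connection: the awaits below park the
            // operation on an I/O completion port until data actually arrives.
            Task ignored = HandleClientAsync(client);
        }
    }

    static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        {
            NetworkStream stream = client.GetStream();
            var buffer = new byte[4096];
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
                await stream.WriteAsync(buffer, 0, read);   // echo back (placeholder logic)
        }
    }
}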
Regarding multiple threads:
Having multiple threads is not bad for performance, if
the number of threads is not excessive
the threads do not oversaturate the CPU
If the number of threads is excessive then obviously we are taxing the OS with having to keep track of and schedule all these threads, which uses up global resources and slows it down.
If the threads are CPU-bound, then the OS will need to perform much more frequent context switches in order to maintain fairness, and context switches kill performance. In fact, with user-mode threads (which all highly scalable systems use -- think RDBMS) we make our lives harder just so we can avoid context switches.
Update:
I just found this question, which lends support to the position that you can't say beforehand how many threads are too many -- there are just too many unknown variables.
Seems like the *Async methods use IOCP (by looking at the code with Reflector).
Jon's answer is great. As for the 'side question'... see http://en.wikipedia.org/wiki/Amdahl%27s_law. Amdahl's law says that serial code quickly diminishes the gains to be had from parallel code. We also know that thread coordination (scheduling, context switching, etc.) is serial - so at some point more threads means there are so many serial steps that the parallelization benefits are lost and you have a net negative performance. This is tricky stuff. That's why there is so much effort going into letting .NET manage threads while we define 'tasks' for the framework to decide what thread to run on. The framework can switch between tasks much more efficiently than the OS can switch between threads, because the OS has a lot of extra things it needs to worry about when doing so.
Asynchronous work can be done without one thread per connection or a thread pool, given OS support for select or poll (Windows supports this; it is exposed via Socket.Select). I am not sure of the performance on Windows, but this is a very common idiom elsewhere.
One thread is the "pump" that manages the IO connections and monitors changes to the streams and then dispatches messages to/from other threads (conceivably 0 ... n depending upon model). Approaches with 0 or 1 additional threads may fall into the "Event Machine" category like twisted (Python) or POE (Perl). With >1 threads the callers form an "implicit thread pool" (themselves) and basically just offload the blocking IO.
There are also approaches like Actors, Continuations or Fibres exposed in the underlying models of some languages which alter how the basic problem is approached -- don't wait, react.
Happy coding.
