Threading cost - minimum execution time when threads would add speed

Threading cost - minimum execution time when threads would add speed - c#

I am working on a C# application that works with an array. It walks through it (meaning that at one time only a narrow part of the array is used). I am considering adding threads in it to make it perform faster (it runs on a dualcore computer). The problem is that I do not know if it would actually help, because threads cost something and this cost could easily be more than the parallel gain... So how do I determine if threading would help?

Try writing some benchmarks that mimic, as closely as possible, the real-world conditions in which your software will actually be used.
Test and time the single-threaded version. Test and time the multi-threaded version. Compare the two sets of results.

If your application is CPU bound (i.e. it isn't spending time trying to read files or waiting for data from a device) and there is little to no sharing of live data (data being altered, if its read only its fine) between the threads then you can pretty much increase the speed by 50->75% by adding another thread (as long as it still remains CPU bound of course).
The main overhead in multithreading comes from 2 places.
Creation & initialization of the thread. Creating a thread requires quite a few resources to be allocated and involves swaps between kernel and user mode, this is expensive though a once off per thread so you can pretty much ignore it if the thread is running for any reasonable amount of time. The best way to mitigate this problem is to use a thread pool as it will keep the thread on hand and not need to be recreated.
Handling synchronization of data. If one thread is reading from data that another is writing, bad things will generally happen (worse if both are changing it). This requires you to lock your data before altering it so that no thread reads a half written value. These locks are generally quite slow as well. To mitigate this problem, you need to design your data layout so that the threads don't need to read or write to the same data as much as possible. If you do need a lot of these locks it can then become slower than the single thread option.
In short, if you are doing something that requires the CPU's to share a lot of data, then multi-threading it will be slower and if the program isn't CPU bound there will be little or no difference (could be a lot slower depending on what it is bound to, e.g. a cd/hard drive). If your program matches these conditions, then it will PROBABLY be worthwhile to add another thread (though the only way to be certain would be profiling).
One more little note, you should only create as many CPU bound threads as you have physical cores (threads that idle most of the time, such as a GUI message pump thread, can be ignored for this condition).
P.S. You can reduce the cost of locking data by using a methodology called "lock-free programming", though this something that should really only be attempted by people with a lot of experience with multi-threading and a clear understanding of their target architecture (including how the cache is treated and the memory bus).

I agree with Luke's answer. Benchmark it, it's the only way to be sure.
I can also give a prediction of the results - the fastest version will be when the number of threads matches the number of cores, EXCEPT if the array is very small and each thread would have to process just a few items, the setup/teardown times might get larger than the processing itself. How few - that depends on what you do. Again - benchmark.
I'd advise to find out a "minimum number of items for a thread to be useful". Then, when you are deciding how many threads to spawn (or take from a pool), check how many cores the computer has and how many items there are. Spawn as many threads as possible, but no more than the computer has cores, and not so many that each thread would have less than the minimum number of items to process.
For example if the minimum number of items is, say, 1000; and the computer has 4 cores; and your list contains 2500 items, you would spawn just 2 threads, because more threads would be inefficient (each would process less than 1000 items).

Making a step by step list for Luke's idea:
Make a single threaded test app
Download Sysinternals Process Monitor and run it
Run your test app and find it on the process list (remember to run it as a release build outside of Visual Studio)
Double click the process and select the Performance Graph tab
Observe the CPU time used by your process
If the CPU time is sittling flat 50% for more than a few seconds, you can probably speed your overall process up using threads (assuming the bunch of stuff Mr Peters refered to holds true)
(However, the best you can do on a duel core machine is to halve the time it takes to run. If your process only take 4 seconds, it might not be worth getting it to run in 2 seconds)
Using the task parallel library / Rx provides a friendlier interface than System.Threading.ThreadPool, which might make your world a bit easier.

You miss imho one item, which is that it is not always about execution time. There is:
The problem to koop a UI operational during an operation. Even if the UI is "dormant", a nonresponsive message pump makes a worse impression.
The possibility to use a thread pool to actually not ahve to start / stop threads all the time. I use thread pools very extensively, and various parts of the applications keep them busy.
Anyhow, ignoring my point 1 - where you may go multi threaded without speeding things up in order to keep your UI responsive - I would say it is always then faster when you can actually either split up work (so you can keep more than one core busy) or offload it for othe reasons.

Related

How can I get started creating a multithreaded load balancer?

I have an interesting exercise to solve from my professor. But I need a little bit of help so it does not become boring during the holidays.
The exercise is to
create a multithreaded load balancer, that reads 1 measuring point from 5 sensors every second. (therefore 5 values every second).
Then do some "complex" calculations with those values.
Printing results of the calculations on the screen. (like max value or average value of sensor 1-5 and so on, of course multithreaded)
As an additional task I also have to ensure that if in the future for example 500 sensors would be read every second the computer doesn't quit the job.(load balancing).
I have a csv textfile with ~400 measuring points from 5 imaginary sensors.
What I think I have to do:
Read the measuring points into an array
Ensure thread safe access to that array
Spawn a new thread for every value that calculates some math stuff
Set a max value for maximum concurrent working threads
I am new to multithreading applications in c# but I think using threadpool is the right way. I am currently working on a queue and maybe starting it inside a task so it wont block the application.
What would you recommend?

There are a couple of environment dependencies here:
What version of .NET are you using?
What UI are you using - desktop (WPF/WinForms) or ASP.NET?
Let's assume that it's .NET 4.0 or higher and a desktop app.
Reading the sensors
In a WPF or WinForms application, I would use a single BackgroundWorker to read data from the sensors. 500 reads per second is trivial - even 500,00 is usually trivial. And the BackgroundWorker type is specifically designed for interacting with desktop apps, for example handing-off results to the UI without worrying about thread interactions.
Processing the calculations
Then you need to process the "complex" calculations. This depends on how long-lived these calculations are. If we assume they're short-lived (say less than 1 second each), then I think using the TaskScheduler and the standard ThreadPool will be fine. So you create a Task for each calculation, and then let the TaskScheduler take care of allocating tasks to threads.
The job of the TaskScheduler is to load-balance the work by queuing lightweight tasks to more heavyweight threads, and managing the ThreadPool to best balance the workload vs the number of cores on the machine. You can even override the default TaskScheduler to schedule tasks in whatever manner you want.
The ThreadPool is a FIFO queue of work items that need to be processed. In .NET 4.0, the ThreadPool has improved performance by making the work queue a thread-safe ConcurrentQueue collection.
Measuring task throughput and efficiency
You can use PerformanceCounter to measure both CPU and memory usage. This will give you a good idea of whether the cores and memory are being used efficiently. The task throughput is simply measured by looking at the rate at which tasks are being processed and supplying results.
Note that I haven't included any code here, as I assume you want to deal with the implementation details for your professor :-)

C# ThreadPool Implementation / Performance Spikes

In an attempt to speed up processing of physics objects in C# I decided to change a linear update algorithm into a parallel algorithm. I believed the best approach was to use the ThreadPool as it is built for completing a queue of jobs.
When I first implemented the parallel algorithm, I queued up a job for every physics object. Keep in mind, a single job completes fairly quickly (updates forces, velocity, position, checks for collision with the old state of any surrounding objects to make it thread safe, etc). I would then wait on all jobs to be finished using a single wait handle, with an interlocked integer that I decremented each time a physics object completed (upon hitting zero, I then set the wait handle). The wait was required as the next task I needed to do involved having the objects all be updated.
The first thing I noticed was that performance was crazy. When averaged, the thread pooling seemed to be going a bit faster, but had massive spikes in performance (on the order of 10 ms per update, with random jumps to 40-60ms). I attempted to profile this using ANTS, however I could not gain any insight into why the spikes were occurring.
My next approach was to still use the ThreadPool, however instead I split all the objects into groups. I initially started with only 8 groups, as that was how any cores my computer had. The performance was great. It far outperformed the single threaded approach, and had no spikes (about 6ms per update).
The only thing I thought about was that, if one job completed before the others, there would be an idle core. Therefore, I increased the number of jobs to about 20, and even up to 500. As I expected, it dropped to 5ms.
So my questions are as follows:
Why would spikes occur when I made the job sizes quick / many?
Is there any insight into how the ThreadPool is implemented that would help me to understand how best to use it?

Using threads has a price - you need context switching, you need locking (the job queue is most probably locked when a thread tries to fetch a new job) - it all comes at a price. This price is usually small compared to the actual work your thread is doing, but if the work ends quickly, the price becomes meaningful.
Your solution seems correct. A reasonable rule of thumb is to have twice as many threads as there are cores.

As you probably expect yourself, the spikes are likely caused by the code that manages the thread pools and distributes tasks to them.
For parallel programming, there are more sophisticated approaches than "manually" distributing work across different threads (even if using the threadpool).
See Parallel Programming in the .NET Framework for instance for an overview and different options. In your case, the "solution" may be as simple as this:
Parallel.ForEach(physicObjects, physicObject => Process(physicObject));

Here's my take on your two questions:
I'd like to start with question 2 (how the thread pool works) because it actually holds the key to answering question 1. The thread pool is implemented (without going into details) as a (thread-safe) work queue and a group of worker threads (which may shrink or enlarge as needed). As the user calls QueueUserWorkItem the task is put into the work queue. The workers keep polling the queue and taking work if they are idle. Once they manage to take a task, they execute it and then return to the queue for more work (this is very important!). So the work is done by the workers on-demand: as the workers become idle they take more pieces of work to do.
Having said the above, it's simple to see what is the answer to question 1 (why did you see a performance difference with more fine-grained tasks): it's because with fine-grain you get more load-balancing (a very desirable property), i.e. your workers do more or less the same amount of work and all cores are exploited uniformly. As you said, with a coarse-grain task distribution, there may be longer and shorter tasks, so one or more cores may be lagging behind, slowing down the overall computation, while other do nothing. With small tasks the problem goes away. Each worker thread takes one small task at a time and then goes back for more. If one thread picks up a shorter task it will go to the queue more often, If it takes a longer task it will go to the queue less often, so things are balanced.
Finally, when the jobs are too fine-grained, and considering that the pool may enlarge to over 1K threads, there is very high contention on the queue when all threads go back to take more work (which happens very often), which may account for the spikes you are seeing. If the underlying implementation uses a blocking lock to access the queue, then context switches are very frequent which hurts performance a lot and makes it seem rather random.

answer of question 1:
this is because of Thread switching , thread switching (or context switching in OS concepts) is CPU clocks that takes to switch between each thread , most of times multi-threading increases the speed of programs and process but when it's process is so small and quick size then context switching will take more time than thread's self process so the whole program throughput decreases, you can find more information about this in O.S concepts books .
answer of question 2:
actually i have a overall insight of ThreadPool , and i cant explain what is it's structure exactly.

to learn more about ThreadPool start here ThreadPool Class
each version of .NET Framework adds more and more capabilities utilizing ThreadPool indirectly. such as Parallel.ForEach Method mentioned before added in .NET 4 along with System.Threading.Tasks which makes code more readable and neat. You can learn more on this here Task Schedulers as well.
At very basic level what it does is: it creates let's say 20 threads and puts them into a lits. Each time it receives a delegate to execute async it takes idle thread from the list and executes delegate. if no available threads found it puts it into a queue. every time deletegate execution completes it will check if queue has any item and if so peeks one and executes in the same thread.

Multithreading takes more time than Sequential Execution in C#

I am using Multithreading in my while loop ,
as
while(reader.read())
{
string Number= (reader["Num"] != DBNull.Value) ? reader["Num"].ToString() : string.Empty;
threadarray[RowCount] = new Thread(() =>
{
object ID= (from r in Datasetds.Tables[0].AsEnumerable()
where r.Field<string>("Num") == Number
select r.Field<int>("ID")).First<int>();
});
threadarray[RowCount].Start();
RowCount++;
}
But with sequential execution ,for 200 readers it just takes 0.4 s
but with threading it takes 1.1 sec... This is an example but i have same problem when i execute it with number of lines of code in threading with multiple database operations.
for sequential it takes 10 sec for threading it takes more...
can anyone please suggest me?
Thanks...

Threading is not always quicker and in many cases can be slower (such as the case seen here). There are plenty of reasons why but the two most significant are
Creating a thread is a relatively expensive OS operation
Context switching (where the CPU stops working on one thread and starts working on another) is again a relatively expensive operation
Creating 200 threads will take a fair amount of time (with the default stack size this will allocate 200MB of memory for stacks alone), and unless you have a machine with 200 cores the OS will also need to spend a fair amount of time context switching between those threads too.
The end result is that the time that the machine spends creating threads and switching between them simply outstrips the amount of time that the machine spends doing any work. You may see a performance improvement if you reduce the number of threads being used. Try starting with 1 thread for each core that your machine has.
Multithreading where you have more threads than cores is generally only useful in scenarios where the CPU is hanging around waiting for things to happen (like disk IO or network communication). This isn't the case here.

Threading isn't always the solution and the way you're using it is definitely not thread-safe. Things like disk I/O or other bottlenecks won't benefit from threading in certain circumstances.
Also, there is a cost for starting up threads. Not that I would recommend it for your situation, but check out the TPL. http://msdn.microsoft.com/en-us/library/dd460717.aspx

Multithreading usually is a choice for non blocking execution. Like everything on earth it has its associated costs.
For the commodity of parallel execution, we pay with performance.
Usually there is nothing faster then sequential execution of a single task.
It's hard to suggest something real, in you concrete scenario.
May be you can think about multiple process execution, instead of multiple threads execution.
But I repeast it's hard to tell if you will get benefits from this, without knowing application complete architecture and requirements.

it seems you are creating a thread for each read(). so if it has 200 read(), you have 200 threads running (maybe fewer since some may finished quickly). depends on what you are doing in the thread, 200 threads running at the same time may actually slow down the system because of overheads like others mentioned.
multuthread helps you when 1) the job in the thread takes some time to finish; 2) you have a control on how many threads running at the same time.
in your case, you need try, say 10 threads. if 10 threads are running, wait until 1 of them finished, then allocate the thread to a new read().
if the job in the thread does not take much time, then better use single thread.

Sci Fi author and technologist Jerry Pournelle once said that in an ideal world every process should have its own processor. This is not an ideal world, and your machine has probably 1 - 4 processors. Your Windows system alone is running scads of processes even when you yourself are daydreaming. I just counted the processes running on my Core 2 Quad XP machine, and SYSTEM is running 65 processes. That's 65 processes that have to be shared between 4 processors. Add more threads and each one gets only a slice of processor power.
If you had a Beowulf Cluster you could share threads off into individual machines and you'd probably get very good timings. But your machine can't do this with only 4 processors. The more you ask it to do the worse performance is going to be.

Do I need a Multi-threaded WPF application for this scenario?

I need an application/service which runs in the background and generate bills on a particular date of every month.
I went through many articles explaining the difference between Windows Service and Scheduled task Application and came to a conclusion that an application would suit my scenario.
Having said this, I wonder if I need to use Multi-threading in my application as I understand Multi-threading is basically to create a responsive UI while doing long running tasks but since my application will have no UI , do I need to have multi-threading actually?
Is there any difference in performance for a single threaded application to get the data from various sources (say database,webservice) and a multi-threaded application where we distribute each task to a thread and finally integrate all the output?

Typically, an application like this will have no user interface at all, in which case your rationale for multi-threading may be meaningless in this case.
That being said, whether or not to use multiple threads for processing your data is another issue entirely. You could, if it makes sense to do so. If this is an application that's going to run once per month, it may be just as easy to leave it single threaded, as there's likely not a time constraint for completion.
If you need to process the items quickly, though, it may make sense to thread portions of the application.
Is there any difference in performance for a single threaded application to get the data from various sources (say database,webservice) and a multi-threaded application where we distribute each task to a thread and finally integrate all the output?
Typically, yes. That's the most common reason to introduce threading - it allows you to do more work in less time. It does add a fair amount of complexity (depends on the scenario), however.

You would, presumably, get a faster response time from a multi-threaded program if these two cases are true: You have a multi core processor, which almost everyone does these days. Pulling data from all of your sources could be done in any order, and accessing that source with one thread would not lock it up from another.
The best reason to use multiple threads in this cause would be if you spend a lot of time blocking; waiting for something else to respond. If you're reading tons of data from your hard drive as fast as the disk can give it, then having two threads that read data shouldn't give you anything faster. In fact, I think it would be a bit slower. But if you're getting a lot of data from, say, sockets (the internet), and your threads spend a fair amount of time waiting for external servers to respond (and you're not using all of your bandwidth), then a multi threaded program would give you a boost in speed.

Whats a good architecture for a life simulator

I was thinking the other day about creating a little life simulator. I have only brushed over the idea and I was wondering the best way to implement so it will run efficiently.
At the lowest level there will be male and female entities wondering around and when they meet they will produce offspring.
I would like to use multithreading to manage the entities but im not sure of number of threads etc.
Would it be best to have a couple of threads that manage males and females, or would it be ok to start a thread for each entity so each instance is running on its own thread?
I have read a few posts about maximum number of threads and the limit ranges from 20-1000 threads in an app.
Anyone have any suggestions on architecture?
N

DO NOT HAVE ONE THREAD PER ENTITY. That is a recipe for disaster. That is not what threads were designed for. Threads are extremely heavyweight; remember, every thread consumes one million bytes of virtual address space immediately. Also, remember that every time two threads interact on a shared data structure -- like your simulated world -- they need to take out locks to prevent corruption. Your program will be a mass of hundreds of blocked threads, and blocked threads are useless; they can't work. High contention is anathema to good performance, and lots of threads sharing one data structure is nothing but constant contention.
The number of threads in your program should be one unless you have an extremely good reason to have two. In general the number of threads in a program should be as small as you can possibly make it while still getting the performance characteristics you need.
If your program really is "embarrassingly parallel" -- that is, it is extremely easy to do calculations in parallel without locking a shared data structure -- then the correct number of threads is equal to the number of processor cores in the machine. Remember, threads slow each other down. If you have four bank tellers, and each one is servicing one customer, things go fast. You are describing a situation where you have four bank tellers (CPU cores) each servicing a hundred people at the same time by handing out pennies round-robin.
Simulations are rarely embarrassingly parallel because it is hard to break up the work into independent parts. Ray tracing, for example, is embarrassingly parallel; the contents of one pixel do not depend on the contents of any other pixel, so they can be calculated in parallel. But you would not have one thread per pixel! You'd have one thread per processor, and let each of four processors work on a quarter of the pixels. It's hard to do that with simulations because the entities do interact with each other.
High quality professional simulators such as physics engines that need to deal with interacting entities do not solve their problems with threads. Typically they'll have one thread doing the simulation and one thread running the user interface, so that a costly simulation calculation does not hang the UI.
The right architecture for you is probably to have one thread going like blazes just doing the simulation. Work out all the interactions of the entities for a single frame and then signal the UI thread that it needs to update. This will allow you to figure out what your maximum frame rate is, by measuring how many microseconds it takes to work out all the interactions of every entity.

Using 100s of Threads (that would Sleep() a lot) is not a good solution. You will quickly run out of memory.
The TPL might make this idea workable but it wasn't really designed for this either.
Look into Discrete Event Simulation and Fibers. There are pseudo-Fiber libraries for .NET

I assume that your application manages the timeline in delta-Ts.
You could use C# 4.0's parallel framework and work with TASKs not threads.
Each delta-T run a parallel-for to update the entities.
The parallel framework will decide how many threads to open and manage the threads.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.