I am interning for a company this summer, and I got passed down this program which is a total piece. It does very computationally intensive operations throughout most of its duration. It takes about 5 minutes to complete a run on a small job, and the guy I work with said that the larger jobs have taken up to 4 days to run. My job is to find a way to make it go faster. My idea was that I could split the input in half and pass the halves to two new threads or processes, I was wondering if I could get some feedback on how effective that might be and whether threads or processes are the way to go.
Any input would be welcome.
Hunter
I'd take a strong look at the TPL that was introduced in .NET 4 :) PLINQ might be especially useful for easy speedups.
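For example, here is a minimal PLINQ sketch; ProcessJob and the integer inputs are just placeholders for whatever your real work items are:

using System;
using System.Linq;

class PlinqSketch
{
    // Placeholder for the real CPU-bound computation.
    static int ProcessJob(int item)
    {
        return item * item;
    }

    static void Main()
    {
        int[] inputs = Enumerable.Range(1, 1000).ToArray();

        // AsParallel() partitions the input across worker threads for you,
        // so you don't have to split it in half by hand.
        int[] results = inputs.AsParallel()
                              .Select(ProcessJob)
                              .ToArray();

        Console.WriteLine(results.Length);
    }
}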
Generally speaking, splitting into different processes (exe files) is inadvisable for performance, since starting processes is expensive. Processes do have other merits, such as isolation (if part of a program crashes), but I don't think those apply to your problem.
If the jobs are splittable, then going multithreaded/multiprocessed will bring better speed. That is assuming, of course, that the computer they run on actually has multiple cores/cpus.
Whether you use threads or processes doesn't really matter for speed (if the threads don't share data). The only reason to use processes that I know of is when a job is likely to crash an entire process, which is not likely in .NET.
Use threads if there's lots of memory sharing in your code, but if you think you'd like to scale the program to run across multiple computers (when the required cores exceed 16), then develop it using processes with a client/server model.
The best way when optimising code, always, is to profile it to find out where the logjams are, IMO.
Sometimes you can find non-obvious, huge speed increases with little effort.
EQATEC and SlimTune are two free C# profilers which may be worth trying out.
(Of course the other comments about which parallelization architecture to use are spot on - it's just that I prefer analysis first.)
Have a look at the Task Parallel Library -- this sounds like a prime candidate problem for using it.
As for the threads vs processes dilemma: threads are fine unless there is a specific reason to use processes (e.g. if you were using buggy code that you couldn't fix, and you did not want a bad crash in that code to bring down your whole process).
Well, if the problem has a parallel solution, then this is the right way to increase performance, ideally significantly (but not always).
However, you can't really control spawning additional processes except by running an app that launches multiple mini apps ... which is not going to help you with this problem.
You are going to need to utilize multiple threads. There is a pretty cool library added to .NET for parallel programming that you should take a look at. I believe its namespace is System.Threading.Tasks, which contains the Parallel class.
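A rough sketch of what that looks like with the Parallel class; DoExpensiveWork is a stand-in for your real computation:

using System;
using System.Threading.Tasks;

class ParallelForEachSketch
{
    // Stand-in for the real CPU-intensive computation.
    static void DoExpensiveWork(string job)
    {
        Console.WriteLine("processing " + job);
    }

    static void Main()
    {
        string[] jobs = { "a", "b", "c", "d" };

        // The runtime partitions the items and picks the thread count;
        // no manual splitting of the input required.
        Parallel.ForEach(jobs, job => DoExpensiveWork(job));
    }
}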
Edit: I would definitely suggest, though, that you think about whether or not a linear solution may fit better. Sometimes a parallel solution can take even longer. It all depends on the problem in question.
If you need to communicate/pass data, go with threads (and if you can go .Net 4, use the Task Parallel Library as others have suggested). If you don't need to pass info that much, I suggest processes (scales a bit better on multiple cores, you get the ability to do multiple computers in a client/server setup [server passes info to clients and gets a response, but other than that not much info passing], etc.).
Personally, I would invest my effort into profiling the application first. You can gain a much better awareness of where the problem spots are before attempting a fix. You can parallelize this problem all day long, but it will only give you a linear improvement in speed (assuming that it can be parallelized at all). But, if you can figure out how to transform the solution into something that only takes O(n) operations instead of O(n^2), for example, then you have hit the jackpot. I guess what I am saying is that you should not necessarily focus on parallelization.
You might find spots that are looping through collections to find specific items. Instead you can transform these loops into hash table lookups. You might find spots that do frequent sorting. Instead you could convert those frequent sorting operations into a single binary search tree (SortedDictionary) which maintains a sorted collection efficiently through the many add/remove operations. And maybe you will find spots that repeatedly make the same calculations. You can cache the results of already made calculations and look them up later if necessary.
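To make those last two ideas concrete, here is a small sketch (the Item type and the stand-in calculation are hypothetical) showing a list scan replaced by a Dictionary lookup, plus a simple calculation cache:

using System;
using System.Collections.Generic;
using System.Linq;

class OptimizationSketch
{
    // Hypothetical item type, purely for illustration.
    class Item { public int Id; public string Name; }

    static readonly Dictionary<int, double> Cache = new Dictionary<int, double>();

    // Memoization: compute each result once, then look it up.
    static double ExpensiveCalculation(int n)
    {
        double result;
        if (Cache.TryGetValue(n, out result))
            return result;
        result = Math.Sqrt(n) * Math.Log(n + 1);  // stand-in for real work
        Cache[n] = result;
        return result;
    }

    static void Main()
    {
        List<Item> items = Enumerable.Range(0, 100000)
            .Select(i => new Item { Id = i, Name = "item" + i })
            .ToList();

        // O(n) per lookup: scans the list every time.
        Item slow = items.FirstOrDefault(x => x.Id == 12345);

        // Build a hash table once; each lookup is then O(1) on average.
        Dictionary<int, Item> byId = items.ToDictionary(x => x.Id);
        Item fast = byId[12345];

        Console.WriteLine("{0} {1} {2}", slow.Name, fast.Name, ExpensiveCalculation(42));
    }
}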
The more I use Parallel.ForEach and PLINQ in my code, the more strange faces and code-review pushback I am getting. So I wonder: is there any reason for me NOT to use PLINQ, at the extreme, on every LINQ statement? Can the runtime not be smart enough, and start spawning so many threads (or consuming so many threads from the thread pool) that the app's performance would actually degrade instead of improve? The same question applies to the Parallel library.
I do understand the implications related to thread safety and the overhead of using multithreading. I also realize not everything is good for parallelizing. All I am wondering is whether I should stop defending my approaches and just give up on these two fine tools because my peers think I'd be better off doing thread control myself instead of relying on .NET facilities.
UPDATE: please assume the hardware is sufficiently good to satisfy prerequisites for use of multithreading.
It all comes down to two things:
Is the extra work required to partition the collection and synchronize the threads greater than the performance gain compared to a regular foreach?
Are all the threads going to use a shared resource that will become a bottleneck?
An example of the second case is doing a Parallel.ForEach over the results of a LINQ to SQL statement. In that case, if your results are coming from the DB very slowly, each thread may spend more time waiting for data to process than actually doing something.
See: http://msdn.microsoft.com/en-us/library/dd997392.aspx
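One way around that particular bottleneck, sketched below with a fake query standing in for the database (GetOrdersQuery just simulates slow row retrieval), is to materialize the slow, serial I/O first and parallelize only the CPU-bound part:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class DbBottleneckSketch
{
    // Stand-in for a LINQ to SQL query that trickles rows out of the DB.
    static IEnumerable<int> GetOrdersQuery()
    {
        for (int i = 0; i < 100; i++)
        {
            System.Threading.Thread.Sleep(10);  // simulate slow I/O
            yield return i;
        }
    }

    static void Main()
    {
        // Materialize the slow, sequential I/O first...
        List<int> rows = GetOrdersQuery().ToList();

        // ...then parallelize only the CPU-bound processing.
        Parallel.ForEach(rows, row => Console.WriteLine(row * row));
    }
}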
To set the number of worker threads you can use .WithDegreeOfParallelism(N), e.g.

var query = from item in source.AsParallel().WithDegreeOfParallelism(2)
            where Compute(item) > 42
            select item;
See http://msdn.microsoft.com/en-us/library/dd997425.aspx
When digging into performance questions this deep, I think the best thing to do is... measure, measure and measure. Even if somebody answered that PLINQ is great and will boost the performance of your application, would you trust that without verifying it with profiling? Although general answers may exist, you cannot spare the effort of measuring the performance in your exact case. The overall performance depends on so many things, and it can be that PLINQ helps in one case but not in another.

My personal experience with PLINQ is that after switching every LINQ query to PLINQ, the response times are way better when the load is small, and there is no difference when the load is around its maximum. But I can imagine a case where PLINQ hurts the overall performance under a huge load. You have to check it for your own particular case.

Well... and if you want to convince other people that you are walking the right path, what would be better than measurement results?
Folks, I've been programming high-speed software for over 20 years and know virtually every trick in the book, from micro-benchmarking, profiling, cooperative user-mode multitasking, tail recursion, you name it, for very high performance stuff on Linux, Windows, and more.
The problem is that I find myself befuddled by what happens when multiple threads of CPU-intensive work run on a multi-core processor.
The performance results from micro-benchmarks of various ways of sharing data between threads (on different cores) don't seem to follow logic.
It's clear that there is some "hidden interaction" between the cores which isn't obvious from my own program code. I hear of the L1 cache and other issues, but those are opaque to me.
Question is: where can I learn this stuff? I am looking for an in-depth book on how multi-core processors work, and how to program to capitalize on their memory caches and other hardware architecture instead of being punished by them.
Any advice or great websites or books? After much Googling, I'm coming up empty.
Sincerely,
Wayne
This book taught me a lot about these sorts of issues and why raw CPU power is not necessarily the only thing to pay attention to. I used it in grad school years ago, but I think all of the principles still apply:
http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901
And essentially, a major issue in multi-processor configurations is synchronizing access to main memory; if you don't do this right, it can be a real performance bottleneck. It's pretty complex with the caches that have to be kept in sync.
My own question, with answer, on Stack Overflow's sister site: https://softwareengineering.stackexchange.com/questions/126986/where-can-i-find-an-overview-of-known-multithreading-design-patterns/126993#126993
I will copy the answer to avoid the need for click-through:
Quote Boris:
Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures https://rads.stackoverflow.com/amzn/click/0735651590

This is a book I recommend wholeheartedly. It is:

New - published last year, which means you are not reading somewhat outdated practices.

Short - about 200+ pages, dense with information. These days there is too much to read and too little time to read 1000+ page books.

Easy to read - not only is it very well written, it introduces hard-to-grasp concepts in a really simple-to-read way.

Intended to teach - each chapter gives exercises to do. I know it is always beneficial to do these, but rarely do. This book gives very compelling and interesting tasks. Surprisingly, I did most of them and enjoyed doing them.
Additionally, if you wish to learn more of the low-level details, this is the best resource I have found: "The Art of Multiprocessor Programming". It's written using Java for its code samples, which plays nicely with my C# background.
PS: I have about 5 years of "hard core" parallel programming experience (albeit using C#), so I hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks.
My answer on "Are you concerned about multicores"
Herb Sutter's articles
Video Series on Parallel Programming
One specific cause of unexpectedly poor results in parallelized code is false sharing; you won't see that coming if you don't know what's going on down there (I didn't). Here are two articles that discuss the cause and remedy for .NET:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
http://www.codeproject.com/KB/threads/FalseSharing.aspx
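To see the effect directly, here is a small sketch (hedged: exact timings vary by machine, and a 64-byte cache line is assumed):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingSketch
{
    const int Iterations = 100000000;

    static void Time(string label, Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        // Adjacent counters share a cache line, so the two cores keep
        // invalidating each other's caches on every write.
        long[] shared = new long[2];
        Time("adjacent", () => Parallel.For(0, 2, i =>
        {
            for (int n = 0; n < Iterations; n++) shared[i]++;
        }));

        // Spacing the counters 64 bytes (8 longs) apart gives each core
        // its own cache line to write to.
        long[] padded = new long[16];
        Time("padded", () => Parallel.For(0, 2, i =>
        {
            for (int n = 0; n < Iterations; n++) padded[i * 8]++;
        }));
    }
}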
Rgds GJ
There are different aspects to multi-threading requiring different approaches.
On a web server, for example, thread pools are widely used, since they supposedly are "good for" performance. Such pools may contain hundreds of threads waiting to be put to work. Using that many threads will cause the scheduler to work overtime, which is detrimental to performance but can't be avoided on Linux systems. For Windows, the method of choice is the IOCP mechanism, which recommends a number of threads not greater than the number of cores installed. It causes an application to become (I/O completion) event driven, which means that no cycles are wasted on polling. The few threads involved reduce scheduler work to a minimum.
If the objective is to implement functionality that is scalable (more cores <=> higher performance), then the main issue will be memory bus saturation. Saturation will occur due to code fetching, data reading and data writing. Incorrectly implemented code will run slower with two threads than with one. The only way around this is to reduce the memory bus work by actively:
tailoring the code to a minimal memory footprint (= fits in the code cache) and which doesn't call other functions or jump all over the place.
tailoring memory reads and writes to a minimum size.
informing the prefetch mechanism of upcoming RAM reads.
tailoring the work such that the ratio of work performed inside the core's own caches (L1 & L2) is as great as possible in relation to the work outside them (L3 & RAM).
To put this another way: fit the applicable code and data chunks into as few cache lines (64 bytes each) as possible, because ultimately this is what will decide the scalability. If the cache/memory system is capable of x cache-line operations every second, your code will run faster if it requires five cache lines per unit of work (=> x/5) rather than eleven (x/11) or fifty-two (x/52).
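A single-threaded illustration of that cache-line economics (a sketch; the exact numbers depend on your hardware): traversing the same 2D array in row-major versus column-major order does identical arithmetic but touches cache lines very differently.

using System;
using System.Diagnostics;

class CacheLineSketch
{
    static void Main()
    {
        const int N = 4096;
        int[,] grid = new int[N, N];
        long sum = 0;

        // Row-major order reads consecutive addresses, so every fetched
        // 64-byte cache line is fully used before the next one is needed.
        Stopwatch sw = Stopwatch.StartNew();
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += grid[row, col];
        Console.WriteLine("row-major:    {0} ms", sw.ElapsedMilliseconds);

        // Column-major order jumps N*4 bytes per step, so nearly every
        // access pulls in a new cache line and wastes most of it.
        sw.Restart();
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += grid[row, col];
        Console.WriteLine("column-major: {0} ms (sum={1})", sw.ElapsedMilliseconds, sum);
    }
}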
Achieving this is not trivial since it requires a more or less unique solution every time. Some compilers do a good job of instruction ordering to take advantage of the host processor's pipelining. This does not necessarily mean that it will be a good ordering for multiple cores.
An efficient implementation of scalable code will not necessarily be a pretty one. Recommended coding techniques and styles may, in the end, hinder the code's execution.
My advice is to test how this works by writing a simple multi-threaded application in a low-level language (such as C) that can be adjusted to run in single or multi-threaded mode and then profiling the code for the different modes. You will need to analyze the code at the instruction level. Then you experiment using different (C) code constructs, data organization, etc. You may have to think outside the box and rethink the algorithm to make it more cache-friendly.
The first time will require lots of work. You will not learn what will work for all multi-threaded solutions but you will perhaps get an inkling of what not to do and what indications to look for when analyzing profiled code.
I found this link that specifically explains the issues of multicore cache handling on CPUs that were affecting my multithreaded program: http://www.multicoreinfo.com/research/intel/mem-issues.pdf
The site multicoreinfo.com in general has lots of good information and references about multicore programming.
I have a small list of rather large files that I want to process, which got me thinking...
In C#, I was thinking of using Parallel.ForEach from the TPL to take advantage of modern multi-core CPUs, but my question is more of a hypothetical character:
Does the use of multithreading in practice mean that it would take longer to load the files in parallel (using as many CPU cores as possible), as opposed to loading each file sequentially (but with probably less CPU utilization)?
Or to put it in another way (:
What is the point of multi-threading? More tasks in parallel but at a slower rate, as opposed to focusing all computing resources on one task at a time?
In order to not increase latency, parallel computational programs typically only create one thread per core. Applications which aren't purely computational tend to add more threads so that the number of runnable threads is the number of cores (the others are in I/O wait, and not competing for CPU time).
Now, parallelism in disk-I/O-bound programs may well cause performance to decrease: if the disk has a non-negligible seek time, then much more time will be wasted performing seeks and less time actually reading. This is called "churning" or "thrashing". Elevator sorting helps somewhat; true random access (such as solid-state memories) helps more.
Parallelism does almost always increase the total raw work done, but this is only important if battery life is of foremost importance (and by the time you account for power used by other components, such as the screen backlight, completing quicker is often still more efficient overall).
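With the TPL, that one-thread-per-core rule for computational work can be stated explicitly (a sketch; the loop body is a placeholder):

using System;
using System.Threading.Tasks;

class OneThreadPerCoreSketch
{
    static void Main()
    {
        ParallelOptions options = new ParallelOptions
        {
            // Cap the workers at the number of logical cores so purely
            // computational work doesn't oversubscribe the CPU.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 1000, options, i =>
        {
            double x = Math.Sqrt(i);  // placeholder CPU-bound body
        });
    }
}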
You asked multiple questions, so I've broken up my response into multiple answers:
Multithreading may have no effect on loading speed, depending on what your bottleneck during loading is. If you're loading a lot of data off disk or a database, I/O may be your limiting factor. On the other hand if 'loading' involves doing a lot of CPU work with some data, you may get a speed up from using multithreading.
Generally speaking, you can't focus "all computing resources on one task." Some multicore processors have the ability to overclock a single core in exchange for disabling other cores, but this speed boost is not equal to the potential performance benefit you would get from fully utilizing all of the cores using multithreading/multiprocessing. In other words, it's asymmetrical -- if you have a 4-core 1 GHz CPU, it won't be able to overclock a single core all the way to 4 GHz in exchange for disabling the others. In fact, that's the reason the industry is going multicore in the first place -- at least for now, we've hit limits on how fast we can make a single CPU run, so instead we've gone the route of adding more CPUs.
There are 2 reasons for multithreading. The first is that you want two tasks to run at the same time simply because it's desirable for both to be able to happen simultaneously -- e.g. you want your GUI to continue to respond to clicks or keyboard presses while it's doing other work (event loops are another way to accomplish this, though). The second is to utilize multiple cores to get a performance boost.
For loading files from disk, this is likely to make things much slower. What happens is the operating system tries to lay out files on disk such that you should only need to do an expensive disk seek once for each file. If you have a lot of threads reading a lot of files, you're gonna have contention over which thread has access to the disk, and you'll have to seek back to the right place in the file every time the next thread gets a turn.
What you can do is use exactly two threads. Set one to load all of the files in the background, and let the other remain available for other tasks, like handling user input. In C# winforms, you can do this easily with a BackgroundWorker control.
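A minimal sketch of that two-thread setup (assumes a WinForms project; LoadAllFiles stands in for the real sequential loading):

using System;
using System.ComponentModel;
using System.Windows.Forms;

public class MainForm : Form
{
    private readonly BackgroundWorker worker = new BackgroundWorker();
    private readonly Button loadButton = new Button { Text = "Load files" };

    public MainForm()
    {
        Controls.Add(loadButton);
        loadButton.Click += (s, e) => worker.RunWorkerAsync();

        // DoWork runs on a background thread, so the UI stays responsive.
        worker.DoWork += (s, e) => e.Result = LoadAllFiles();

        // RunWorkerCompleted is raised back on the UI thread.
        worker.RunWorkerCompleted += (s, e) => Text = (string)e.Result;
    }

    private static string LoadAllFiles()
    {
        System.Threading.Thread.Sleep(3000);  // stand-in for the slow file loads
        return "files loaded";
    }

    [STAThread]
    static void Main() { Application.Run(new MainForm()); }
}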
Multi-threading is useful for highly parallelizable tasks. CPU-intensive tasks are perfect. Your CPU has many cores, and many threads can use many cores. They'll use more CPU time, but in the end they'll use less "user" time. If your app is I/O bound, then multithreading isn't always the solution (but it COULD help).
It might be helpful to first understand the difference between Multithreading and Parallelism, as more often than not I see them being used rather interchangeably. Joseph Albahari has written a quite interesting guide about the subject: Threading in C# - Part 5 - Parallelism
As with all great programming endeavors, it depends. By and large, you'll be requesting files from one physical store, or one physical controller which will serialize the requests anyhow (or worse, cause a LOT of head back-and-forth on a classical hard drive) and slow down the already slow I/O.
OTOH, if the controllers and the medium are separate, multiple cores loading data from them should be improved over a sequential method.
Basically, I'm wondering if threading is useful or necessary, or possibly more specifically the uses and situations in which you would use it. I don't know much about threading, and have never used it (I primarily use C#) and have wondered if there are any gains to performance or stability if you use them. If anyone would be so kind to explain, I would be grateful.
In the world of desktop applications (my domain), threading is a vital construct in creating responsive user interfaces. Whenever a time-or-computationally-intensive operation needs to run, it's almost essential to run that operation in a separate thread. Otherwise, the user interface locks up and, in some cases, Windows will decide that the whole application has become unresponsive.
Threading is also a vital tool in animation, audio and communications. Basically, any situation in which you find yourself needing to do several things at once lends itself to the use of threads.
There are definitely no gains in stability :). I would suggest you get a basic understanding of threading, but don't jump to use it in any real production application until you have a real need. You're using C#, so I'm not sure if you are building websites or WinForms.
Usually the first threading use case for WinForms is when a user clicks a button and you want to run some expensive operation (a database or web service call), but you don't want the screen to freeze up.
A good way to deal with that situation is to look at the BackgroundWorker class in C#, as this will give you a first flavor of this space, and then you can go from there.
There was a time when our applications would speed up when we deployed them on a new CPU. And that speedup was to a large extent because the CPU speed (clock) was incremented by large factors.
But several years ago, CPU manufacturers stopped increasing CPU clocks because of physical limits (e.g. heat dissipation). And instead they started adding additional cores to CPUs.
Now, if your application runs only on one thread it cannot take advantage of complete CPU (e.g. of 4 cores it uses only 1).
So today, to fully utilize the CPU, we must make the effort to divide the task across multiple threads.
For ASP.NET this is already done for us by ASP.NET architecture and IIS.
Look here The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Here is a simple example of how threading can improve performance. You have n numbers that all need to be added together. In a single-threaded application, it will take n time units to add all of the numbers together for the final sum. However, if you split your numbers into 2 groups, you could have the same operation running side by side, each with a group of n/2 numbers. Each would take n/2 time units to find its respective sum, and then an additional unit to find the full sum. By creating two threads, you have effectively cut the compute time in half.
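Here's a sketch of that two-way split using tasks (the workload is a placeholder; on real hardware the speedup also depends on core count and memory bandwidth, as other answers note):

using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelSumSketch
{
    static long Sum(int[] data, int start, int end)
    {
        long sum = 0;
        for (int i = start; i < end; i++) sum += data[i];
        return sum;
    }

    static void Main()
    {
        int[] numbers = Enumerable.Range(1, 10000000).ToArray();
        int mid = numbers.Length / 2;

        // Each task sums one half of the array independently (n/2 each)...
        Task<long> left = Task.Factory.StartNew(() => Sum(numbers, 0, mid));
        Task<long> right = Task.Factory.StartNew(() => Sum(numbers, mid, numbers.Length));

        // ...and one extra step combines the partial sums.
        long total = left.Result + right.Result;
        Console.WriteLine(total);
    }
}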
Technically, on a single-core processor there is no such thing as multi-threading, just the illusion that multiple tasks are happening in parallel, since each task gets a small slice of time.
However, that being said, threading is very useful if you have to do some work that takes a long time but you want your application to be responsive (i.e. be able to do other things) while you wait for that task to finish. A good example is GUI applications.
On multi-core / multi-processor systems, you can have one process doing many things at once so the performance gain there is obvious :)
I am getting ready to perform a series of performance comparisons of various off-the-shelf products.
What do I need to do to show credibility in the tests? How do I design my benchmark tests so that they are respectable?
I am also interested in any suggestions on the actual design of the tests: ways to load data without affecting the tests (Heisenberg Uncertainty Principle), ways to monitor... etc.
This is a bit tricky to answer without knowing what sort of "off the shelf" products you are trying to assess. Are you looking for UI responsiveness, throughput (e.g. email, transactions/sec), startup time, etc - all of these have different criteria for what measures you should track and different tools for testing or evaluating. But to answer some of your general questions:
Credibility - this is important. Try to make sure that whatever you are measuring has little run-to-run variance. Utilize the technique of doing several runs of the same scenario, getting rid of outliers (i.e. your lowest and highest), and evaluating your avg/max/min/median values. If you're doing some sort of throughput test, consider making it long-running so you have a good sample set. For example, if you are looking at something like Microsoft Exchange and thus are using their perf counters, try to make sure you are taking frequent samples (once per second or every few seconds) and have the test run for 20 minutes or so. Again, chop off the first few minutes and the last few minutes to eliminate any startup/shutdown noise.
Heisenberg - tricky. In most modern systems, depending on what application/measures you are measuring, you can minimize this impact by being smart about what/how you are measuring. Sometimes (like in the Exchange example), you'll see near zero impact. Try to use the least invasive tools possible. For example, if you're measuring startup time, consider using xperfinfo and utilize the events built into the kernel. If you're using perfmon, don't flood the system with extraneous counters that you don't care about. If you're doing some extremely long-running test, ratchet down your sampling interval.
Also try to eliminate any sources of environment variability or possible sources of noise. If you're doing something network intensive, consider isolating the network. Try to disable any services or applications that you don't care about. Limit any sort of disk IO, memory intensive operations, etc. If disk IO might introduce noise in something that is CPU bound, consider using SSD.
When designing your tests, keep repeatability in mind. If you're doing some sort of microbenchmark-type testing (e.g. a perf unit test), then have your infrastructure support running the same operation n times exactly the same way. If you're driving UI, try not to physically drive the mouse, and instead use the underlying accessibility layer (MSAA, UIAutomation, etc.) to hit controls directly programmatically.
Again, this is just general advice. If you have more specifics then I can try to follow up with more relevant guidance.
Enjoy!
Your question is very interesting, but a bit vague, because without knowing what to test it is not easy to give you some clues.
You can test performance from many different angles, then, depending on the use or target of the library you should try one approach or another; I will try to enumerate some of the things you may have to consider for measurement:
- Multithreading: if the library uses it, or your software will use the library in a multithreaded context, then you may have to test it with many different processor and multiprocessor configurations to see how it reacts.
- Startup time: its importance depends on how intensively you will use the library and what the nature of the product being built with it is (client, server, ...).
- Response time: for this, do not take the first execution; try to execute the same call many times after the first one and take an average. Using System.Diagnostics.Stopwatch could be very useful for that.
- Memory consumption: analyze the growth, and beware of exponential ones ;). Go a step further and measure the quantity of objects being created and disposed.
- Responsiveness: you should not only measure raw performance; how the user feels about the speed of the product is very important too.
- Network: if the library uses resources on the network, you may have to test it with different bandwidth and latency configurations; there is software to simulate these situations.
- Data: try to create many different testing data packages, trying to cover, for example: a big bunch of raw data, then a large set made of many smaller chunks, a long iteration with small pieces of data, ...
Tools:
System.Diagnostics.Stopwatch: essential for benchmarking method calls (see the sketch after this list).
Performance counters: whenever available they are very useful to know what’s happening inside, allowing you to monitor the software without affecting its performance.
Profilers: there are some good memory and performance profilers in the market, but as you said, they always affect the measurements. They are good for finding bottlenecks in your software, but I don’t think you can use them for a comparison test.
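Here is one possible shape for such a Stopwatch harness, tying together the warm-up, repetition and outlier-trimming advice above (the workload below is a placeholder):

using System;
using System.Diagnostics;
using System.Linq;

class BenchmarkSketch
{
    // Runs the action several times, discards the warm-up run and the
    // best/worst samples, and reports the rest.
    static void Measure(string label, Action action, int runs)
    {
        action();  // warm-up: JIT compilation, caches, etc.

        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.ElapsedMilliseconds;
        }

        long[] trimmed = samples.OrderBy(t => t)
                                .Skip(1).Take(runs - 2).ToArray();
        Console.WriteLine("{0}: avg {1:F1} ms (min {2}, max {3})",
            label, trimmed.Average(), trimmed.First(), trimmed.Last());
    }

    static void Main()
    {
        Measure("sample workload", () =>
        {
            double x = 0;
            for (int i = 1; i < 5000000; i++) x += Math.Sqrt(i);
        }, 10);
    }
}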
Why do you care about the performance? In both cases, the time taken to write the message to wherever you are storing your log will be a lot slower than anything else.
If you are really doing that much logging, then you are likely to need to index your log files so you can find the log entry you need; at that point you are not doing standard logging.