How to measure multithread performance/capacity in a Windows Form? - c#

I'm using Windows Forms and HtmlAgilityPack to load a website's HTML into a variable, with Parallel.For fetching 10 pages at a time.
How can I work out how many of these requests I could run in parallel? By watching CPU usage in Task Manager? Am I making too few calls or too many?
Thanks

By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
https://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism.aspx

Are you trying to determine what degree of parallelism (how many threads) yields the best performance?
I'd recommend
Don't worry about it unless you determine that performance is actually an issue. You can fine-tune it, but the perfect settings on the computer you're using might not be so great on another machine where the code executes.
Perform multiple tests and see which concludes faster. Back before we had Parallel.ForEach I would do that sometimes. I often found that adding more threads meant faster execution, and then adding some more produced diminishing returns or even slower performance.
Unless you're seeing performance issues and you think this is the fix I wouldn't worry about it.
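If you do want to measure it, here is a minimal sketch; it assumes a list of URLs and HtmlAgilityPack's HtmlWeb, and the degrees of parallelism tested are arbitrary placeholders. Time the same workload at each setting and keep whichever finishes fastest on your machine.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using HtmlAgilityPack;

static void BenchmarkParallelism(IList<string> urls)
{
    foreach (int degree in new[] { 2, 5, 10, 20 })   // arbitrary settings to compare
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        Stopwatch timer = Stopwatch.StartNew();

        Parallel.ForEach(urls, options, url =>
        {
            // HtmlWeb.Load fetches the page and parses it into an HtmlDocument.
            HtmlDocument doc = new HtmlWeb().Load(url);
        });

        timer.Stop();
        Console.WriteLine("MaxDegreeOfParallelism {0}: {1} ms", degree, timer.ElapsedMilliseconds);
    }
}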

Related

best use of Parallel.ForEach / Multithreading

I need to scrape data from a website.
I have over 1,000 links I need to access, and previously I was dividing the links 10 per thread, so I would start 100 threads each pulling 10. After a few test cases, 100 threads was the best count to minimize the time it took to retrieve the content for all the links.
I realize that .NET 4.0 offers better support for multi-threading out of the box, but it is based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize pulling the 1,000 links? Should I use .ForEach and let the Parallel extension control the number of threads that get spawned, or find a way to tell it how many threads to start and divide the work?
I have not worked with Parallel before, so my approach may be wrong.
You can use the MaxDegreeOfParallelism property of ParallelOptions with Parallel.ForEach to control the number of threads that will be spawned.
Here's a code snippet:
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5;
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
In general, Parallel.ForEach() is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).
You can control the maximum degree of parallelization, but there's no mechanism to force more threads to be used.
Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
Something worth checking out is the TPL Dataflow library.
DataFlow on MSDN.
See Nesting await in Parallel.ForEach
The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.
Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
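As a minimal sketch of that scenario (the URL is a placeholder and HttpClient is just one way to fetch the page), an ActionBlock with bounded parallelism lets each download await without tying up a thread:

using System;
using System.Net.Http;
using System.Threading.Tasks.Dataflow;

var client = new HttpClient();
var downloader = new ActionBlock<string>(async url =>
{
    string html = await client.GetStringAsync(url);   // the thread is released during the await
    Console.WriteLine("{0}: {1} characters", url, html.Length);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

downloader.Post("http://example.com/");               // placeholder; post each link you need
downloader.Complete();
downloader.Completion.Wait();                         // block until every posted download has finished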
Hard to say without looking at your code and how the collection is defined. I've found that Parallel.Invoke is the most flexible; try MSDN? It sounds like you are looking for the Parallel.For(Int32, Int32, Action<Int32, ParallelLoopState>) overload.

At which time and in which context should I call ThreadPool.SetMinThreads

I have an ASP .NET MVC 2 - application, which calls functions of a C#-DLL.
The DLL itself is multithreaded. In the worst case it uses up to 200 threads, which do not run very long.
I use asynchronous delegates in order to generate the threads. In order to speed up the initialization of the delegates, I calculate the number of threads I need in advance and give it to the ThreadPool:
ThreadPool.SetMinThreads(my_num_threads, ...);
I just wonder whether I need to do this early enough that the ThreadPool has time to create the threads. Do I have to consider when I set the size of the ThreadPool, or are the threads available immediately after I call SetMinThreads?
Furthermore, if I set the size outside the DLL in my ASP .NET MVC-application (before I call the DLL), will this setting be available/visible for the DLL?
They share the same application domain, so setting the ThreadPool size anywhere affects everything. Note that this also impacts the ASP.NET framework, which uses the ThreadPool itself for all its own async tasks etc.
With that in mind, if you knew roughly the minimum number of threads you wanted you could set that on application startup ready for use later on.
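As a minimal sketch (the thread counts here are placeholders), the call could sit in Global.asax's Application_Start; note that SetMinThreads takes both worker and I/O completion-port thread counts and returns false if the values are rejected:

using System.Threading;

protected void Application_Start()
{
    int workerThreads = 200;            // placeholder: the number you calculated in advance
    int completionPortThreads = 200;    // placeholder
    bool accepted = ThreadPool.SetMinThreads(workerThreads, completionPortThreads);
    if (!accepted)
    {
        // The values were rejected (e.g. out of range); log or fall back to the defaults.
    }
}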
However, 200 threads seems somewhat excessive, to put it into context my Chrome with about 8 tabs open uses about 35 and my SQL Server ~50. What are you doing that demands so many?
Also realise that eventually you'll reach a limit where performance degrades because there are so many threads that must be serviced. Microsoft says as much on MSDN:
You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorithm for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance.

When to use multithreading?

When do you use threads in an application? For example, in simple CRUD operations, sending mail over SMTP, calling web services that may take some time if the server is facing bandwidth issues, etc.
To be honest, I don't know how to determine whether I need to use a thread (I know it should be when we're expecting that an operation will take some time to complete).
This may be a "noob" question, but it would be great if you shared your experience with threads.
Thanks
I added C# and .NET tags to your question because you mention C# in your title. If that is not accurate, feel free to remove the tags.
There are different styles of multithreading. For example, there are asynchronous operations with callback functions. .NET 4 introduces the parallel Linq library. The style of multithreading you would use, or whether to use any at all, depends on what you are trying to accomplish.
Parallel execution, such as what parallel Linq would generally be trying to do, takes advantage of multiple processor cores executing instructions that do not need to wait for data from each other. There are many sources for such algorithms outside Linq, such as this. However, it is possible that parallel execution is unavailable to you or that it does not suit your application.
More traditional multithreading takes advantage of the threading support in the .NET library (in this case) as provided by System.Threading.Thread. Remember that there is some overhead in starting work on threads, so only use threads when the advantages of doing so outweigh this overhead. Generally speaking, you would only want to use this kind of single-processor multithreading when the task running on the thread will have long gaps in which the processor could be doing something else. For example, I/O from a hard disk (and, consequently, from a database system that uses one) is many orders of magnitude slower than memory access. Network access can also be slow, as another example. Multithreading lets other work proceed while waiting for these slow (compared to the processor) operations to complete.
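As a minimal sketch of that idea (the file path is a placeholder), a slow disk read can be pushed onto its own thread so the calling thread is not blocked while waiting on I/O:

using System;
using System.IO;
using System.Threading;

Thread worker = new Thread(() =>
{
    // The slow disk I/O happens here while the calling thread keeps working.
    string contents = File.ReadAllText(@"C:\data\large-report.txt");
    Console.WriteLine("Read {0} characters.", contents.Length);
});
worker.IsBackground = true;   // don't keep the process alive just for this thread
worker.Start();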
Another example when I have used traditional multithreading is to cache some values the first time a particular ASP.NET page is accessed within a session. I kick off a thread so that the user does not have to wait for the caching to complete before interacting with the page. I also regulate the behavior when the caching does not complete before the user requests another page so that, if the caching does not complete, it is not a problem. It simply makes some further requests faster that were previously too slow.
Consider also the cost that multithreading has to the maintainability of your application. Threaded applications can be harder to debug, for example.
I hope this answers your question at least somewhat.
Joseph Albahari summarized it very well here:
Maintaining a responsive user interface
Making efficient use of an otherwise blocked CPU
Parallel programming
Speculative execution
Allowing requests to be processed simultaneously
One reason to use threads is to split large, CPU-bound tasks across a number of CPUs/cores, to finish faster. Another is to let an extended task execute asynchronously, so the foreground can remain responsive while it runs.
Your examples seem to be concentrating on the second of these. While it can be a good reason, if you can use asynchronous I/O instead, that's usually preferable (e.g., almost anything using sockets can/will be better off using the socket(s) asynchronously). Asynchronous I/O is easier to cancel, and it'll usually have lower CPU overhead as well.
You can use threads when you need different execution paths. This leads (when done correctly) to more responsive and/or faster applications, but it also leads to more complex code and debugging.
In a simple CRUD scenario it may not be that useful, but maybe your UI is consuming a slow web service. If your code is tied to your UI thread, you will have an unresponsive UI between the service calls.
In that case, using System.Threading.Thread may be overkill because you don't need that much control. Using a BackgroundWorker may be a better choice.
Threading is difficult to master, but the benefits when used correctly are huge; improved performance is the most common.
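A minimal BackgroundWorker sketch for the slow-web-service case above; CallSlowService and resultLabel are hypothetical stand-ins for your service call and a control on the form:

using System.ComponentModel;

BackgroundWorker worker = new BackgroundWorker();
worker.DoWork += (sender, e) =>
{
    // Runs on a ThreadPool thread, so the UI stays responsive.
    e.Result = CallSlowService();        // hypothetical slow web-service call
};
worker.RunWorkerCompleted += (sender, e) =>
{
    // Raised back on the UI thread; it is safe to touch controls here.
    resultLabel.Text = (string)e.Result; // hypothetical label on the form
};
worker.RunWorkerAsync();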
In a way you have answered your question yourself. Using threads whenever you execute time-consuming operations is the right choice. You should also use them when you want to make things faster. For example, if you want to process a number of files, each file can be processed by a different thread.
By using threads you can better utilize the power of multi-core/multi-processor machines.
Monitoring some data in the background of your application.
There are dozens of such scenarios.
Realising my comment might suffice as an answer ...
I like to view multi-threading scenarios from a resource perspective. In other words, UI (graphics), networking, disk IO, CPU (cores), RAM etc. I find that helps when deciding where to use multi-threading in the general sense at least.
The reasoning behind this is simply that I can take advantage of one resource on a specific thread (eg. Disk IO) while at the same time using another thread to accomplish something else using a different resource.

Appropriate Multi-Threading Option

Scenario
I have a very heavy number-crunching process that pools large datasets from 3 different databases and then does a bit of processing on each to eventually produce a result.
This process is fine when it is only run for a single asset. However, I now have 3,500 assets to process, which takes about 1 hour 30 minutes with the current process.
Question
What is my best option for speeding this process up in terms of a multi-threaded C# application? Realistically I don't have to share anything between the processing of each asset, so I'm confident that processing multiple assets at a time shouldn't cause too many issues.
Thoughts
I've heard good things about thread pools, but realistically I want something that isn't too big to implement, is easily understandable and can run a decent number of threads at a time.
Help would be greatly appreciated.
In .NET you can use the existing ThreadPool; there's no need to implement one yourself. Here is the relevant MSDN page.
You should take care not to run too many at once (3,500 is a bit much), but using the supplied queuing mechanism should get you started in the right direction.
Another thing to try is using PLINQ.
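A minimal PLINQ sketch, assuming each asset can be processed independently (assets and ProcessAsset are hypothetical placeholders):

using System;
using System.Linq;

var results = assets
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)  // cap concurrency explicitly if you want
    .Select(asset => ProcessAsset(asset))                  // hypothetical per-asset number crunching
    .ToList();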
If you don't have a multi-core processor, multiple machines, and/or the thread processes are not I/O bound, multithreading will not help. Start by profiling the current processing to see where the time is going.
Thread pools are fine, and you can use a task queue to do simple load-balancing, but if there's no spare CPU cycles in the current application this would be a waste of time.
The nicest option would be to use the new Task Parallel Library in .NET 4, if you can do this using VS 2010 RC. This has built-in load balancing and work stealing queues, so it will make this task easy to thread, and very scalable.
However, if you need to do this in .NET 3.5, I would recommend using the ThreadPool, and just using ThreadPool.QueueUserWorkItem to start each task.
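A minimal .NET 3.5-style sketch of that approach (Asset, assets and ProcessAsset are hypothetical), using a counter and an event so the caller knows when every queued item has finished:

using System;
using System.Threading;

int pending = assets.Count;
using (ManualResetEvent done = new ManualResetEvent(false))
{
    foreach (Asset asset in assets)
    {
        Asset current = asset;   // capture the loop variable for the closure
        ThreadPool.QueueUserWorkItem(state =>
        {
            try
            {
                ProcessAsset(current);                       // hypothetical per-asset work
            }
            finally
            {
                if (Interlocked.Decrement(ref pending) == 0)
                    done.Set();                              // the last item signals completion
            }
        });
    }
    done.WaitOne();   // block until every work item has run
}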
If your tasks are all very computationally intensive for their entire lifetime, you may want to prevent having too many running concurrently. Some form of queue, which you pull work from and execute, can be beneficial in this case. Just place all of your work items into a queue, and have threads pull work from the queue (with appropriate locking), and process.
If you have a multi-core system, and CPU cycles are your bottleneck, this should scale very well.
The built-in .NET ThreadPool satisfies both of your requirements: it runs a decent number of threads and it is simple to work with. I have previously written an article on the subject, which you can find here.
If you're using SQL Server 2005 or later, you can create user-defined functions in C# and use them from within T-SQL procedures, which can give a marked speedup for number crunching. SQL Server is multi-threaded and does a good job with it, so consider keeping as much of the processing in the database engine as you can.
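A minimal SQLCLR sketch of that idea (ComputeScore and its formula are hypothetical); once the assembly is registered, T-SQL can call it like any scalar function:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class AssetFunctions
{
    [SqlFunction]
    public static SqlDouble ComputeScore(SqlDouble input)
    {
        // Placeholder number crunching; this runs inside the SQL Server process.
        if (input.IsNull)
            return SqlDouble.Null;
        return new SqlDouble(input.Value * input.Value);
    }
}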

How do I pick the best number of threads for hyperthreading/multicore?

I have some embarrassingly parallelizable work in a .NET 3.5 console app and I want to take advantage of hyperthreading and multi-core processors. How do I pick the number of worker threads that makes the best use of either on an arbitrary system? For example, if it's a dual core I will want 2 threads; quad core I will want 4 threads. What I'm ultimately after is determining the processor characteristics so I can know how many threads to create.
I'm not asking how to split up the work nor how to do threading, I'm asking how do I determine the "optimal" number of the threads on an arbitrary machine this console app will run on.
I'd suggest that you don't try to determine it yourself. Use the ThreadPool and let .NET manage the threads for you.
You can use Environment.ProcessorCount if that's the only thing you're after. But usually using a ThreadPool is indeed the better option.
The .NET thread pool also has provisions for sometimes allocating more threads than you have cores to maximise throughput in certain scenarios where many threads are waiting for I/O to finish.
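A tiny sketch of both suggestions, querying the logical processor count and letting the ThreadPool report its own limits (pre-C# 7 out-variable syntax, since the question targets .NET 3.5):

using System;
using System.Threading;

int logicalProcessors = Environment.ProcessorCount;   // hyperthreaded cores count as separate logical processors

int workerThreads, completionPortThreads;
ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);

Console.WriteLine("Logical processors: {0}", logicalProcessors);
Console.WriteLine("ThreadPool max worker threads: {0}", workerThreads);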
The correct number is obviously 42.
Now on the serious note. Just use the thread pool, always.
1) If you have a lengthy processing task (i.e. CPU intensive) that can be partitioned into multiple smaller pieces of work, then you should partition your task and submit all the individual work items to the ThreadPool. The thread pool will pick up work items and start churning on them dynamically, since it has self-monitoring capabilities that include starting new threads as needed, and it can be configured at deployment time by administrators according to the deployment site's requirements, as opposed to pre-computing the numbers at development time. While it is true that the proper partitioning size of your processing task can take into account the number of CPUs available, the right answer depends so much on the nature of the task and the data that it is not even worth talking about at this stage (and besides, the primary concerns should be your NUMA nodes, memory locality and interlocked cache contention, and only after that the number of cores).
2) If you're doing I/O (including DB calls) then you should use asynchronous I/O and complete the calls in ThreadPool-invoked completion routines.
These two are the only valid reasons why you should have multiple threads, and they're both best handled by using the ThreadPool. Anything else, including starting a thread per 'request' or 'connection', is in fact an anti-pattern in the Win32 world (fork is a valid pattern in *nix, but definitely not on Windows).
For a more specialized and far more detailed discussion of the topic, I can only recommend Rick Vicik's papers on the subject:
designing-applications-for-high-performance-part-1.aspx
designing-applications-for-high-performance-part-ii.aspx
designing-applications-for-high-performance-part-iii.aspx
The optimal number would simply be the processor count. Optimally you would always have exactly one thread running on each CPU (logical or physical) to minimise context switches and the overhead that comes with them.
Whether that is the right number depends (very much as everyone has said) on what you are doing. The threadpool (if I understand it correctly) pretty much tries to use as few threads as possible but spins up another one each time a thread blocks.
The blocking is never optimal but if you are doing any form of blocking then the answer would change dramatically.
The simplest and easiest way to get good (not necessarily optimal) behaviour is to use the threadpool. In my opinion it's really hard to do any better than the threadpool, so that's simply the best place to start; only think about something else if you can demonstrate why it is not good enough.
A good rule of thumb, given that you're completely CPU-bound, is processorCount + 1.
That's +1 because you will always get some tasks started/stopped/interrupted and n tasks will almost never completely fill up n processors.
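A minimal sketch of that rule of thumb (WorkItem, pendingWork and ProcessItem are hypothetical), with each worker thread pulling items from a shared queue:

using System;
using System.Collections.Generic;
using System.Threading;

int workerCount = Environment.ProcessorCount + 1;          // the rule of thumb for CPU-bound work
Queue<WorkItem> queue = new Queue<WorkItem>(pendingWork);  // hypothetical pending work items
object gate = new object();

List<Thread> workers = new List<Thread>();
for (int i = 0; i < workerCount; i++)
{
    Thread t = new Thread(() =>
    {
        while (true)
        {
            WorkItem item;
            lock (gate)
            {
                if (queue.Count == 0)
                    return;               // no more work: let this thread exit
                item = queue.Dequeue();
            }
            ProcessItem(item);            // hypothetical CPU-bound processing
        }
    });
    t.Start();
    workers.Add(t);
}
workers.ForEach(t => t.Join());           // wait for all workers to finish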
The only way is a combination of data and code analysis based on performance data.
Different CPU families and speeds vs. memory speed vs other activities on the system are all going to make the tuning different.
Potentially some self-tuning is possible, but this will mean having some form of live performance tuning and self adjustment.
Or even better than the ThreadPool, use .NET 4.0 Task instances from the TPL. The Task Parallel Library is built on a foundation in the .NET 4.0 framework that will actually determine the optimal number of threads to perform the tasks as efficiently as possible for you.
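A minimal .NET 4 sketch (chunks and DoChunk are hypothetical placeholders); the TPL scheduler decides how many threads actually run at once:

using System.Linq;
using System.Threading.Tasks;

Task[] tasks = chunks
    .Select(chunk => Task.Factory.StartNew(() => DoChunk(chunk)))  // one Task per chunk of work
    .ToArray();

Task.WaitAll(tasks);   // block until the scheduler has run every task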
I read something on this recently (see the accepted answer to this question for example).
The simple answer is that you let the operating system decide. It can do a far better job of deciding what's optimal than you can.
There are a number of questions on a similar theme - search for "optimal number threads" (without the quotes) gives you a couple of pages of results.
I would say it also depends on what you are doing. If you're making a server application, then using everything you can out of the CPUs via either Environment.ProcessorCount or a thread pool is a good idea.
But if this is running on a desktop or a machine that is not dedicated to this task, you might want to leave some CPU idle so the machine still "functions" for the user.
It can be argued that the real way to pick the best number of threads is for the application to profile itself and adaptively change its threading behavior based on what gives the best performance.
I wrote a simple number crunching app that used multiple threads, and found that on my Quad-core system, it completed the most work in a fixed period using 6 threads.
I think the only real way to determine is through trialling or profiling.
In addition to processor count, you may want to take into account the process's processor affinity by counting bits in the affinity mask returned by the GetProcessAffinityMask function.
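The same mask is also exposed in managed code through Process.ProcessorAffinity; a minimal sketch that counts the set bits to get the number of processors this process may actually use:

using System;
using System.Diagnostics;

ulong mask = (ulong)(long)Process.GetCurrentProcess().ProcessorAffinity;
int usableProcessors = 0;
while (mask != 0)
{
    usableProcessors += (int)(mask & 1);   // count each CPU bit that is set
    mask >>= 1;
}
Console.WriteLine("Processors available to this process: {0}", usableProcessors);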
If there is no excessive I/O processing and no system calls while the threads are running, then the number of threads (excluding the main thread) is in general equal to the number of processors/cores in your system; otherwise you can try increasing the number of threads through testing.
