C# Multithreading File IO (Reading) - c#

We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread?
Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.
Any ideas to make this more efficient is appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?

I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net .
It's very easy to use,
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);
YourMethod(object o){
Your Code here...
}
For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
Hope, this helps

I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.

Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
var filesContent = from file in enumerableOfFilesToProcess
select new
{
File=file,
Content=File.ReadAllText(file)
};
var processedContent = from content in filesContent
select new
{
content.File,
ProcessedContent = ProcessContent(content.Content)
};
var dictionary = processedContent
.AsParallel()
.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.

I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// CCR take care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}

In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth

Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).
This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.
Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.
The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)

Related

Difference between Task (System.Threading.Task) and Thread

From what I understand about the difference between Task & Thread is that task happened in the thread-pool while the thread is something that I need to managed by myself .. ( and that task can be cancel and return to the thread-pool in the end of his mission )
But in some blog I read that if the operating system need to create task and create thread => it will be easier to create ( and destroy ) task.
Someone can explain please why creating task is simple that thread ?
( or maybe I missing something here ... )
I think that what you are talking about when you say Task is a System.Threading.Task. If that's the case then you can think about it this way:
A program can have many threads, but a processor core can only run one Thread at a time.
Threads are very expensive, and switching between the threads that are running is also very expensive.
So... Having thousands of threads doing stuff is inefficient. Imagine if your teacher gave you 10,000 tasks to do. You'd spend so much time cycling between them that you'd never get anything done. The same thing can happen to the CPU if you start too many threads.
To get around this, the .NET framework allows you to create Tasks. Tasks are a bit of work bundled up into an object, and they allow you to do interesting things like capture the output of that work and chain pieces of work together (first go to the store, then buy a magazine).
Tasks are scheduled on a pool of threads. The specific number of threads depends on the scheduler used, but the default scheduler tries to pick a number of threads that is optimal for the number of CPU cores that you have and how much time your tasks are spending actually using CPU time. If you want to, you can even write your own scheduler that does something specific like making sure that all Tasks for that scheduler always operate on a single thread.
So think of Tasks as items in your to-do list. You might be able to do 5 things at once, but if your boss gives you 10000, they will pile up in your inbox until the first 5 that you are doing get done. The difference between Tasks and the ThreadPool is that Tasks (as I mentioned earlier) give you better control over the relationship between different items of work (imagine to-do items with multiple instructions stapled together), whereas the ThreadPool just allows you to queue up a bunch of individual, single-stage items (Functions).
You are hearing two different notions of task. The first is the notion of a job, and the second is the notion of a process.
A long time ago (in computer terms), there were no threads. Each running instance of a program was called a process, since it simply performed one step after another after another until it exited. This matches the intuitive idea of a process as a series of steps, like that of a factory assembly line. The operating system manages the process abstraction.
Then, developers began to add multiple assembly lines to the factories. Now a program could do more than one thing at once, and either a library or (more commonly today) the operating system would manage the scheduling of the steps within each thread. A thread is kind of a lightweight process, but a thread belongs to a process, and all the threads in a process share memory. On the other hand, multiple processes can't mess with each others' memory. So, the multiple threads in your web server can each access the same information about the connection, but Word can't access Excel's in-memory data structures because Word and Excel are running as separate processes. The idea of a process as a series of steps doesn't really match the model of a process with threads, so some people took to calling the "abstraction formerly known as a process" a task. This is the second definition of task that you saw in the blog post. Note that plenty of people still use the word process to mean this thing.
Well, as threads became more commmon, developers added even more abstractions over top of them to make them easier to use. This led to the rise of the thread pool, which is a library-managed "pool" of threads. You pass the library a job, and the library picks a thread and runs the job on that thread. The .NET framework has a thread pool implementation, and the first time you heard about a "task" the documentation really meant a job that you pass to the thread pool.
So in a sense, both the documentation and the blog post are right. The overloading of the term task is the unfortunate source of confusion.
Threads have been a part of .Net from v1.0, Tasks were introduced in the Task Parallel Library TPL which was released in .Net 4.0.
You can consider a Task as a more sophisticated version of a Thread. They are very easy to use and have a lot of advantages over Threads as follows:
You can create return types to Tasks as if they are functions.
You can the "ContinueWith" method, which will wait for the previous task and then start the execution. (Abstracting wait)
Abstracts Locks which should be avoided as per guidlines of my company.
You can use Task.WaitAll and pass an array of tasks so you can wait till all tasks are complete.
You can attach task to the parent task, thus you can decide whether the parent or the child will exist first.
You can achieve data parallelism with LINQ queries.
You can create parallel for and foreach loops
Very easy to handle exceptions with tasks.
*Most important thing is if the same code is run on single core machine it will just act as a single process without any overhead of threads.
Disadvantage of tasks over threads:
You need .Net 4.0
Newcomers who have learned operating systems can understand threads better.
New to the framework so not much assistance available.
Some tips:-
Always use Task.Factory.StartNew method which is semantically perfect and standard.
Take a look at Task Parallel Libray for more information
http://msdn.microsoft.com/en-us/library/dd460717.aspx
Expanding on the comment by Eric Lippert:
Threads are a way that allows your application to do several things in parallel. For example, your application might have one thread that processes the events from the user, like button clicks, and another thread that performs some long computation. This way, you can do two different things “at the same time”. If you didn't do that, the user wouldn't be to click buttons until the computation finished. So, Thread is something that can execute some code you wrote.
Task, on the other hand represents an abstract notion of some job. That job can have a result, and you can wait until the job finishes (by calling Wait()) or say that you want to do something after the job finishes (by calling ContinueWith()).
The most common job that you want to represent is to perform some computation in parallel with the current code. And Task offers you a simple way to do that. How and when the code actually runs is defined by TaskScheduler. The default one uses a ThreadPool: a set of threads that can run any code. This is done because creating and switching threads in inefficient.
But Task doesn't have to be directly associated with some code. You can use TaskCompletionSource to create a Task and then set its result whenever you want. For example, you could create a Task and mark it as completed when the user clicks a button. Some other code could wait on that Task and while it's waiting, there is no code executing for that Task.
If you want to know when to use Task and when to use Thread: Task is simpler to use and more efficient that creating your own Threads. But sometimes, you need more control than what is offered by Task. In those cases, it makese sense to use Thread directly.
Tasks really are just a wrapper for the boilerplate code of spinning up threads manually. At the root, there is no difference. Tasks just make the management of threads easier, as well as they are generally more expressive due to the lessening of the boilerplate noise.

Using multithreading for loop

I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
// I want to start a new thread for each of these
agent.DoProcessLoop();
}
---
public void DoProcessLoop()
{
while (true)
{
// do the processing
// this is things like check folder for new files, update database
// if new files found
}
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie
Given the specific usage your describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.
I believe that the Parallels Extensions make this possible:
Parallel.Foreach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/
One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.
You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.
A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.
Hmm.. as Ragoczy points out, its better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware, no matter how many processers you have, there is a limit to it's capacity. You may not want to create as many threads as the number of concurrent users, for the simple reason that your number of agents can increase.
Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and a AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work then the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.

Alternative to Threads

I've read that threads are very problematic. What alternatives are available? Something that handles blocking and stuff automatically?
A lot of people recommend the background worker, but I've no idea why.
Anyone care to explain "easy" alternatives? The user will be able to select the number of threads to use (depending on their speed needs and computer power).
Any ideas?
To summarize the problems with threads:
if threads share memory, you can get
race conditions
if you avoid races by liberally using locks, you
can get deadlocks (see the dining philosophers problem)
An example of a race: suppose two threads share access to some memory where a number is stored. Thread 1 reads from the memory address and stores it in a CPU register. Thread 2 does the same. Now thread 1 increments the number and writes it back to memory. Thread 2 then does the same. End result: the number was only incremented by 1, while both threads tried to increment it. The outcome of such interactions depend on timing. Worse, your code may seem to work bug-free but once in a blue moon the timing is wrong and bad things happen.
To avoid these problems, the answer is simple: avoid sharing writable memory. Instead, use message passing to communicate between threads. An extreme example is to put the threads in separate processes and communicate via TCP/IP connections or named pipes.
Another approach is to share only read-only data structures, which is why functional programming languages can work so well with multiple threads.
This is a bit higher-level answer, but it may be useful if you want to consider other alternatives to threads. Anyway, most of the answers discussed solutions based on threads (or thread pools) or maybe tasks from .NET 4.0, but there is one more alternative, which is called message-passing. This has been successfuly used in Erlang (a functional language used by Ericsson). Since functional programming is becoming more mainstream in these days (e.g. F#), I thought I could mention it. In genral:
Threads (or thread pools) can usually used when you have some relatively long-running computation. When it needs to share state with other threads, it gets tricky (you have to correctly use locks or other synchronization primitives).
Tasks (available in TPL in .NET 4.0) are very lightweight - you can split your program into thousands of tasks and then let the runtime run them (it will use optimal number of threads). If you can write your algorithm using tasks instead of threads, it sounds like a good idea - you can avoid some synchronization when you run computation using smaller steps.
Declarative approaches (PLINQ in .NET 4.0 is a great option) if you have some higher-level data processing operation that can be encoded using LINQ primitives, then you can use this technique. The runtime will automatically parallelize your code, because LINQ doesn't specify how exactly should it evaluate the results (you just say what results you want to get).
Message-passing allows you two write program as concurrently running processes that perform some (relatively simple) tasks and communicate by sending messages to each other. This is great, because you can share some state (send messages) without the usual synchronization issues (you just send a message, then do other thing or wait for messages). Here is a good introduction to message-passing in F# from Robert Pickering.
Note that the last three techniques are quite related to functional programming - in functional programming, you desing programs differently - as computations that return result (which makes it easier to use Tasks). You also often write declarative and higher-level code (which makes it easier to use Declarative approaches).
When it comes to actual implementation, F# has a wonderful message-passing library right in the core libraries. In C#, you can use Concurrency & Coordination Runtime, which feels a bit "hacky", but is probably quite powerful too (but may look too complicated).
Won't the parallel programming options in .Net 4 be an "easy" way to use threads? I'm not sure what I'd suggest for .Net 3.5 and earlier...
This MSDN link to the Parallel Computing Developer Center has links to lots of info on Parellel Programming including links to videos, etc.
I can recommend this project. Smart Thread Pool
Project Description
Smart Thread Pool is a thread pool written in C#. It is far more advanced than the .NET built-in thread pool.
Here is a list of the thread pool features:
The number of threads dynamically changes according to the workload on the threads in the pool.
Work items can return a value.
A work item can be cancelled.
The caller thread's context is used when the work item is executed (limited).
Usage of minimum number of Win32 event handles, so the handle count of the application won't explode.
The caller can wait for multiple or all the work items to complete.
Work item can have a PostExecute callback, which is called as soon the work item is completed.
The state object, that accompanies the work item, can be disposed automatically.
Work item exceptions are sent back to the caller.
Work items have priority.
Work items group.
The caller can suspend the start of a thread pool and work items group.
Threads have priority.
Can run COM objects that have single threaded apartment.
Support Action and Func delegates.
Support for WindowsCE (limited)
The MaxThreads and MinThreads can be changed at run time.
Cancel behavior is imporved.
"Problematic" is not the word I would use to describe working with threads. "Tedious" is a more appropriate description.
If you are new to threaded programming, I would suggest reading this thread as a starting point. It is by no means exhaustive but has some good introductory information. From there, I would continue to scour this website and other programming sites for information related to specific threading questions you may have.
As for specific threading options in C#, here's some suggestions on when to use each one.
Use BackgroundWorker if you have a single task that runs in the background and needs to interact with the UI. The task of marshalling data and method calls to the UI thread are handled automatically through its event-based model. Avoid BackgroundWorker if (1) your assembly does not already reference the System.Windows.Form assembly, (2) you need the thread to be a foreground thread, or (3) you need to manipulate the thread priority.
Use a ThreadPool thread when efficiency is desired. The ThreadPool helps avoid the overhead associated with creating, starting, and stopping threads. Avoid using the ThreadPool if (1) the task runs for the lifetime of your application, (2) you need the thread to be a foreground thread, (3) you need to manipulate the thread priority, or (4) you need the thread to have a fixed identity (aborting, suspending, discovering).
Use the Thread class for long-running tasks and when you require features offered by a formal threading model, e.g., choosing between foreground and background threads, tweaking the thread priority, fine-grained control over thread execution, etc.
Any time you introduce multiple threads, each running at once, you open up the potential for race conditions. To avoid these, you tend to need to add synchronization, which adds complexity, as well as the potential for deadlocks.
Many tools make this easier. .NET has quite a few classes specifically meant to ease the pain of dealing with multiple threads, including the BackgroundWorker class, which makes running background work and interacting with a user interface much simpler.
.NET 4 is going to do a lot to ease this even more. The Task Parallel Library and PLINQ dramatically ease working with multiple threads.
As for your last comment:
The user will be able to select the number of threads to use (depending on their speed needs and computer power).
Most of the routines in .NET are built upon the ThreadPool. In .NET 4, when using the TPL, the work load will actually scale at runtime, for you, eliminating the burden of having to specify the number of threads to use. However, there are ways to do this now.
Currently, you can use ThreadPool.SetMaxThreads to help limit the number of threads generated. In TPL, you can specify ParallelOptions.MaxDegreesOfParallelism, and pass an instance of the ParallelOptions into your routine to control this. The default behavior scales up with more threads as you add more processing cores, which is usually the best behavior in any case.
Threads are not problematic if you understand what causes problems with them.
For ex. if you avoid statics, you know which API's to use (e.g. use synchronized streams), you will avoid many of the issues that come up for their bad utilization.
If threading is a problem (this can happen if you have unsafe/unmanaged 3rd party dll's that cannot support multithreading. In this can an option is to create a meachism to queue the operations. ie store the parameters of the action to a database and just run through them one at a time. This can be done in a windows service. Obviously this will take longer but in some cases is the only option.
Threads are indispensable tools for solving many problems, and it behooves the maturing developer to know how to effectively use them. But like many tools, they can cause some very difficult-to-find bugs.
Don't shy away from some so useful just because it can cause problems, instead study and practice until you become the go-to guy for multi-threaded apps.
A great place to start is Joe Albahari's article: http://www.albahari.com/threading/.

Embarrassingly parallelizable tasks in .NET

I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. The task is created by reading data from the database but a collection of all tasks would exceed the amount of memory on the machine so tasks have to be created, processed and disposed. I am wondering what would be a good approach to solve this problem? I am thinking the following two approaches:
Implement a synchronized task queue. Implement a producer (task creater) that read data from database and put task in the queue (limit the number of tasks currently in the queue to a constant value to make sure that the amount of memory is not exceeded). Have multiple consumer processes (task processor) that read task from the queue, process task, store the result and dispose the task. What would be a good number of consumer processes in this approach?
Use .NET Parallel extension (PLINQ or parallel for), but I understand that a collection of tasks have to be created (Can we add tasks to the collection while processing in the parallel for?). So we will create a batch of tasks -- say N tasks at a time and do process these batch of tasks and read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren't CPU bound then you'll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
I've not had the opportunity to actually use PLINQ, however I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e. the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it could process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything and items will be run as threads become available to the pool without overwhelming the system. The only trick is determining the optimum number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer we need a few questions answered.
Is each individual task parallelizable? Or each task is the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good thing to use to do this. It might also give you some extra benefits such as pause/resume on your tasks.

When to use thread pool in C#? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have been trying to learn multi-threaded programming in C# and I am confused about when it is best to use a thread pool vs. create my own threads. One book recommends using a thread pool for small tasks only (whatever that means), but I can't seem to find any real guidelines.
What are some pros and cons of thread pools vs creating my own threads? And what are some example use cases for each?
I would suggest you use a thread pool in C# for the same reasons as any other language.
When you want to limit the number of threads running or don't want the overhead of creating and destroying them, use a thread pool.
By small tasks, the book you read means tasks with a short lifetime. If it takes ten seconds to create a thread which only runs for one second, that's one place where you should be using pools (ignore my actual figures, it's the ratio that counts).
Otherwise you spend the bulk of your time creating and destroying threads rather than simply doing the work they're intended to do.
If you have lots of logical tasks that require constant processing and you want that to be done in parallel use the pool+scheduler.
If you need to make your IO related tasks concurrently such as downloading stuff from remote servers or disk access, but need to do this say once every few minutes, then make your own threads and kill them once you're finished.
Edit: About some considerations, I use thread pools for database access, physics/simulation, AI(games), and for scripted tasks ran on virtual machines that process lots of user defined tasks.
Normally a pool consists of 2 threads per processor (so likely 4 nowadays), however you can set up the amount of threads you want, if you know how many you need.
Edit: The reason to make your own threads is because of context changes, (thats when threads need to swap in and out of the process, along with their memory). Having useless context changes, say when you aren't using your threads, just leaving them sit around as one might say, can easily half the performance of your program (say you have 3 sleeping threads and 2 active threads). Thus if those downloading threads are just waiting they're eating up tons of CPU and cooling down the cache for your real application
Here's a nice summary of the thread pool in .Net: http://blogs.msdn.com/pedram/archive/2007/08/05/dedicated-thread-or-a-threadpool-thread.aspx
The post also has some points on when you should not use the thread pool and start your own thread instead.
I highly recommend reading the this free e-book:
Threading in C# by Joseph Albahari
At least read the "Getting Started" section. The e-book provides a great introduction and includes a wealth of advanced threading information as well.
Knowing whether or not to use the thread pool is just the beginning. Next you will need to determine which method of entering the thread pool best suits your needs:
Task Parallel Library (.NET Framework
4.0)
ThreadPool.QueueUserWorkItem
Asynchronous Delegates
BackgroundWorker
This e-book explains these all and advises when to use them vs. create your own thread.
The thread pool is designed to reduce context switching among your threads. Consider a process that has several components running. Each of those components could be creating worker threads. The more threads in your process, the more time is wasted on context switching.
Now, if each of those components were queuing items to the thread pool, you would have a lot less context switching overhead.
The thread pool is designed to maximize the work being done across your CPUs (or CPU cores). That is why, by default, the thread pool spins up multiple threads per processor.
There are some situations where you would not want to use the thread pool. If you are waiting on I/O, or waiting on an event, etc then you tie up that thread pool thread and it can't be used by anyone else. Same idea applies to long running tasks, though what constitutes a long running task is subjective.
Pax Diablo makes a good point as well. Spinning up threads is not free. It takes time and they consume additional memory for their stack space. The thread pool will re-use threads to amortize this cost.
Note: you asked about using a thread pool thread to download data or perform disk I/O. You should not use a thread pool thread for this (for the reasons I outlined above). Instead use asynchronous I/O (aka the BeginXX and EndXX methods). For a FileStream that would be BeginRead and EndRead. For an HttpWebRequest that would be BeginGetResponse and EndGetResponse. They are more complicated to use, but they are the proper way to perform multi-threaded I/O.
Beware of the .NET thread pool for operations that may block for any significant, variable or unknown part of their processing, as it is prone to thread starvation. Consider using the .NET parallel extensions, which provide a good number of logical abstractions over threaded operations. They also include a new scheduler, which should be an improvement on ThreadPool. See here
One reason to use the thread pool for small tasks only is that there are a limited number of thread pool threads. If one is used for a long time then it stops that thread from being used by other code. If this happens many times then the thread pool can become used up.
Using up the thread pool can have subtle effects - some .NET timers use thread pool threads and will not fire, for example.
If you have a background task that will live for a long time, like for the entire lifetime of your application, then creating your own thread is a reasonable thing. If you have short jobs that need to be done in a thread, then use thread pooling.
In an application where you are creating many threads, the overhead of creating the threads becomes substantial. Using the thread pool creates the threads once and reuses them, thus avoiding the thread creation overhead.
In an application that I worked on, changing from creating threads to using the thread pool for the short lived threads really helpped the through put of the application.
For the highest performance with concurrently executing units, write your own thread pool, where a pool of Thread objects are created at start up and go to blocking (formerly suspended), waiting on a context to run (an object with a standard interface implemented by your code).
So many articles about Tasks vs. Threads vs. the .NET ThreadPool fail to really give you what you need to make a decision for performance. But when you compare them, Threads win out and especially a pool of Threads. They are distributed the best across CPUs and they start up faster.
What should be discussed is the fact that the main execution unit of Windows (including Windows 10) is a thread, and OS context switching overhead is usually negligible. Simply put, I have not been able to find convincing evidence of many of these articles, whether the article claims higher performance by saving context switching or better CPU usage.
Now for a bit of realism:
Most of us won’t need our application to be deterministic, and most of us do not have a hard-knocks background with threads, which for instance often comes with developing an operating system. What I wrote above is not for a beginner.
So what may be most important is to discuss is what is easy to program.
If you create your own thread pool, you’ll have a bit of writing to do as you’ll need to be concerned with tracking execution status, how to simulate suspend and resume, and how to cancel execution – including in an application-wide shut down. You might also have to be concerned with whether you want to dynamically grow your pool and also what capacity limitation your pool will have. I can write such a framework in an hour but that is because I’ve done it so many times.
Perhaps the easiest way to write an execution unit is to use a Task. The beauty of a Task is that you can create one and kick it off in-line in your code (though caution may be warranted). You can pass a cancellation token to handle when you want to cancel the Task. Also, it uses the promise approach to chaining events, and you can have it return a specific type of value. Moreover, with async and await, more options exist and your code will be more portable.
In essence, it is important to understand the pros and cons with Tasks vs. Threads vs. the .NET ThreadPool. If I need high performance, I am going to use threads, and I prefer using my own pool.
An easy way to compare is start up 512 Threads, 512 Tasks, and 512 ThreadPool threads. You’ll find a delay in the beginning with Threads (hence, why write a thread pool), but all 512 Threads will be running in a few seconds while Tasks and .NET ThreadPool threads take up to a few minutes to all start.
Below are the results of such a test (i5 quad core with 16 GB of RAM), giving each 30 seconds to run. The code executed performs simple file I/O on an SSD drive.
Test Results
Thread pools are great when you have more tasks to process than available threads.
You can add all the tasks to a thread pool and specify the maximum number of threads that can run at a certain time.
Check out this page on MSDN:
http://msdn.microsoft.com/en-us/library/3dasc8as(VS.80).aspx
Always use a thread pool if you can, work at the highest level of abstraction possible. Thread pools hide creating and destroying threads for you, this is usually a good thing!
Most of the time you can use the pool as you avoid the expensive process of creating the thread.
However in some scenarios you may want to create a thread. For example if you are not the only one using the thread pool and the thread you create is long-lived (to avoid consuming shared resources) or for example if you want to control the stacksize of the thread.
Don't forget to investigate the Background worker.
I find for a lot of situations, it gives me just what i want without the heavy lifting.
Cheers.
I usually use the Threadpool whenever I need to just do something on another thread and don't really care when it runs or ends. Something like logging or maybe even background downloading a file (though there are better ways to do that async-style). I use my own thread when I need more control. Also what I've found is using a Threadsafe queue (hack your own) to store "command objects" is nice when I have multiple commands that I need to work on in >1 thread. So you'd may split up an Xml file and put each element in a queue and then have multiple threads working on doing some processing on these elements. I wrote such a queue way back in uni (VB.net!) that I've converted to C#. I've included it below for no particular reason (this code might contain some errors).
using System.Collections.Generic;
using System.Threading;
namespace ThreadSafeQueue {
public class ThreadSafeQueue<T> {
private Queue<T> _queue;
public ThreadSafeQueue() {
_queue = new Queue<T>();
}
public void EnqueueSafe(T item) {
lock ( this ) {
_queue.Enqueue(item);
if ( _queue.Count >= 1 )
Monitor.Pulse(this);
}
}
public T DequeueSafe() {
lock ( this ) {
while ( _queue.Count <= 0 )
Monitor.Wait(this);
return this.DeEnqueueUnblock();
}
}
private T DeEnqueueUnblock() {
return _queue.Dequeue();
}
}
}
I wanted a thread pool to distribute work across cores with as little latency as possible, and that didn't have to play well with other applications. I found that the .NET thread pool performance wasn't as good as it could be. I knew I wanted one thread per core, so I wrote my own thread pool substitute class. The code is provided as an answer to another StackOverflow question over here.
As to the original question, the thread pool is useful for breaking repetitive computations up into parts that can be executed in parallel (assuming they can be executed in parallel without changing the outcome). Manual thread management is useful for tasks like UI and IO.

Categories

Resources