I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. The task is created by reading data from the database but a collection of all tasks would exceed the amount of memory on the machine so tasks have to be created, processed and disposed. I am wondering what would be a good approach to solve this problem? I am thinking the following two approaches:
Implement a synchronized task queue. Implement a producer (task creater) that read data from database and put task in the queue (limit the number of tasks currently in the queue to a constant value to make sure that the amount of memory is not exceeded). Have multiple consumer processes (task processor) that read task from the queue, process task, store the result and dispose the task. What would be a good number of consumer processes in this approach?
Use .NET Parallel extension (PLINQ or parallel for), but I understand that a collection of tasks have to be created (Can we add tasks to the collection while processing in the parallel for?). So we will create a batch of tasks -- say N tasks at a time and do process these batch of tasks and read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren't CPU bound then you'll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
I've not had the opportunity to actually use PLINQ, however I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e. the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it could process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything and items will be run as threads become available to the pool without overwhelming the system. The only trick is determining the optimum number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer we need a few questions answered.
Is each individual task parallelizable? Or each task is the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good thing to use to do this. It might also give you some extra benefits such as pause/resume on your tasks.
Related
I am creating a data cleansing application that loops through the tables in a database and cleanses NPI data from the different columns. I have created a class for each of the tables that has a method that performs the cleansing operation. What I want to do is loop over the table classes and using reflection call each class and its cleansing method. I would like to do 10 tables at a time and as one table completes, spawn a new thread/task on the next table in the list.
I have a treeview where the user can select one or more tables from the database to cleanse. I have been able to loop over the selected tables and invoke the cleanse method for each table on its own thread, but I end up with over 100 threads (if all tables are selected) executing at the same time. Not an ideal situation.
Any suggestions on how to do this? I am using C# and .NET 4.6, so Task code would be preferred.
One simple approach is to use Parallel.ForEach with the MaxDegreeOfParallelism option set to the maximum threads you would like.
A more advanced and rather elegant framework for this kind of problem is the Dataflow Task Parallel Library (TPL). Use an ActionBlock to perform the work and set it's degree of parallelism as desired.
Take a look at the Task Parallel Library, it has a type Task and Task that I think will suite you just fine.
You could create 10 Tasks, put them in a collection and do a Tasks.WhenAny(myTasks).Result. At that point you can figure out how many tasks are done and put more into the collection (IsCompleted property).
Instead of .WhenAny you could probably make it less complicated and use WhenAll(myTasks), and just do this all in batches of x. The parallel foreach answer is also an excellent option, there's a whole world in the TPL for you to explore.
Very basic example, not sure your full context here:
var myTasks = new List<Task>();
myTasks.Add(Task.Run(() => someLongProcess))
myTasks.Add(Task.Run(() => someLongProcess))
Task.WhenAny(myTasks).Result
//Check to see how many tasks are done, and then add more to your collection and repeat until you're done
You better using Task, as it implements inside Work sharing: which basically means that Task=Work, which maps on hardware thread of your OS, via special Task Scheduler.
Threads may not be available always
There may be more work to do than threads, so if you have a queue to process, the same thread will be used to work over a new data (spawning a thread has its own costs)
Possible successful False-Sharing management (on CPU cache lines). You may be concerned less, but still worth knowing what it is.
..More.
A lot of that possibly thought, scheduled and processed by thread scheduler to get optimal generic performance without much hustle. To be clear, you will not get the best possible multi-threaded performance, but you, most likely, do not need that either.
What you are asking about is Data Parallelism.
The simple example of how to use it you can find in: How to: Write a Simple Parallel.For Loop
I've been reading this article on MSDN about C# Concurrent Collections. It talks about the optimum threading to use for particular scenarios to get the most benefit out of the collections e.g:
ConcurrentQueue performs best when one dedicated thread is queuing and one dedicated thread is de-queuing. If you do not enforce this rule, then Queue might even perform slightly faster than ConcurrentQueue on computers that have multiple cores.
Is this advice still valid when one is using Tasks instead of raw Threads? From my (limited) understanding of C# Tasks, there is no guarantee that a particular Task will always run on the same thread between context switches, or does maintaining the stack frame mean that the same rules apply in terms of best usage?
Thanks.
One task always runs on the same thread. TPL is a user-mode library. User mode has no (practical) way of migrating executing code from thread to thread. Also there would be no point to doing that.
This advice applies exactly to tasks as it does to threads.
What that piece of advice means to say is that at the same time there should be one producer and one consumer only. You can have 100 threads enqueuing from time to time as long as they do not contend.
(I'm not questioning that advice here since that is out of scope for this question. But that is what's meant here.)
I was reading C# 5.0 in nutshell and after reading author's view(s), I am quite confused as to what should I adopt. My requirement is that say I have a really long running (computationally heavy) task, say for example, calculate SHA1 (or some other) hash of millions of file, or really any other thing is is computationally heavy and is likely to take some time, what should be my approach toward developing it (in winforms if that matters, using VS 2012, C# 5.0), so that I can also report progress to the user.
Following scenario(s) come to mind...
Create a Task (with LongRunning option that computes the hashes and report the progress to user either by implementing IProgess<T> or Progess<T> or letting the task capture the SynchronizationContext context and posting to the UI.
Create a Async method like
async CalculateHashesAsync()
{
// await here for tasks the calculate the hash
await Task.Rung(() => CalculateHash();
// how do I report progress???
}
Use TPL (or PLINQ) as
void CalcuateHashes()
{
Parallel.For(0, allFiles.Count, file => calcHash(file)
// how do I report progress here?
}
Use a producer / consumer Queue.
Don't really know how?
The author in the book says...
Running one long running task on a pooled thread won't cause
trouble. It's when you run multiple long running tasks in parallel
(particularly ones that block) that performance can suffer. In that
case, there are usually better solutions than
TaskCreationOptions.LongRunnging
If tasks are IO bound, TaskCompletionSource and asynchronous functions let you
implement concurrency with callbacks instead of threads.
If tasks are compute bound, a producer/consumer queue lets you throttle the concurrency for those tasks, avoiding starvation for
other threads and process.
About the Producer/Consumer the author says...
A producer/consumer queue is a useful structure, both in parallel
programming and general concurrency scenarios as it gives you precise
control over how many worker threads execute at once, which is useful
not only in limiting CPU consumption, but other resources as well.
So, should I not use task, meaning that first option is out? Is second one the best option? Are there any other options? And If I were to follow author's advice, and implement a producer/consumer, how would I do that (I don't even have an idea of how to get started with producer/consumer in my scenario, if that is the best approach!)
I'd like to know if someone has ever come across such a scenario, how would they implement? If not, what would be the most performance effective and/or easy to develop/maintain (I know the word performance is subjective, but let's just consider the very general case that it works, and works well!)
really long running (computationally heavy) task, say for example, calculate SHA1 (or some other) hash of millions of file
That example clearly has both heavy CPU (hashing) and I/O (file) components. Perhaps this is a non-representative example, but in my experience even a secure hash is far faster than reading the data from disk.
If you just have CPU-bound work, the best solution is either Parallel or PLINQ. If you just have I/O-bound work, the best solution is to use async. If you have a more realistic and complex scenario (with both CPU and I/O work), then you should either hook up your CPU and I/O parts with producer/consumer queues or use a more complete solution such as TPL Dataflow.
TPL Dataflow works well with both parallel (MaxDegreeOfParallelism) and async, and has a builtin producer/consumer queue in-between each block.
One thing to keep in mind when mixing massive amounts of I/O and CPU usage is that different situations can cause massively different performance characteristics. To be safe, you'll want to throttle the data going through your queues so you won't end up with memory usage issues. TPL Dataflow has built-in support for throttling via BoundedCapacity.
From what I understand about the difference between Task & Thread is that task happened in the thread-pool while the thread is something that I need to managed by myself .. ( and that task can be cancel and return to the thread-pool in the end of his mission )
But in some blog I read that if the operating system need to create task and create thread => it will be easier to create ( and destroy ) task.
Someone can explain please why creating task is simple that thread ?
( or maybe I missing something here ... )
I think that what you are talking about when you say Task is a System.Threading.Task. If that's the case then you can think about it this way:
A program can have many threads, but a processor core can only run one Thread at a time.
Threads are very expensive, and switching between the threads that are running is also very expensive.
So... Having thousands of threads doing stuff is inefficient. Imagine if your teacher gave you 10,000 tasks to do. You'd spend so much time cycling between them that you'd never get anything done. The same thing can happen to the CPU if you start too many threads.
To get around this, the .NET framework allows you to create Tasks. Tasks are a bit of work bundled up into an object, and they allow you to do interesting things like capture the output of that work and chain pieces of work together (first go to the store, then buy a magazine).
Tasks are scheduled on a pool of threads. The specific number of threads depends on the scheduler used, but the default scheduler tries to pick a number of threads that is optimal for the number of CPU cores that you have and how much time your tasks are spending actually using CPU time. If you want to, you can even write your own scheduler that does something specific like making sure that all Tasks for that scheduler always operate on a single thread.
So think of Tasks as items in your to-do list. You might be able to do 5 things at once, but if your boss gives you 10000, they will pile up in your inbox until the first 5 that you are doing get done. The difference between Tasks and the ThreadPool is that Tasks (as I mentioned earlier) give you better control over the relationship between different items of work (imagine to-do items with multiple instructions stapled together), whereas the ThreadPool just allows you to queue up a bunch of individual, single-stage items (Functions).
You are hearing two different notions of task. The first is the notion of a job, and the second is the notion of a process.
A long time ago (in computer terms), there were no threads. Each running instance of a program was called a process, since it simply performed one step after another after another until it exited. This matches the intuitive idea of a process as a series of steps, like that of a factory assembly line. The operating system manages the process abstraction.
Then, developers began to add multiple assembly lines to the factories. Now a program could do more than one thing at once, and either a library or (more commonly today) the operating system would manage the scheduling of the steps within each thread. A thread is kind of a lightweight process, but a thread belongs to a process, and all the threads in a process share memory. On the other hand, multiple processes can't mess with each others' memory. So, the multiple threads in your web server can each access the same information about the connection, but Word can't access Excel's in-memory data structures because Word and Excel are running as separate processes. The idea of a process as a series of steps doesn't really match the model of a process with threads, so some people took to calling the "abstraction formerly known as a process" a task. This is the second definition of task that you saw in the blog post. Note that plenty of people still use the word process to mean this thing.
Well, as threads became more commmon, developers added even more abstractions over top of them to make them easier to use. This led to the rise of the thread pool, which is a library-managed "pool" of threads. You pass the library a job, and the library picks a thread and runs the job on that thread. The .NET framework has a thread pool implementation, and the first time you heard about a "task" the documentation really meant a job that you pass to the thread pool.
So in a sense, both the documentation and the blog post are right. The overloading of the term task is the unfortunate source of confusion.
Threads have been a part of .Net from v1.0, Tasks were introduced in the Task Parallel Library TPL which was released in .Net 4.0.
You can consider a Task as a more sophisticated version of a Thread. They are very easy to use and have a lot of advantages over Threads as follows:
You can create return types to Tasks as if they are functions.
You can the "ContinueWith" method, which will wait for the previous task and then start the execution. (Abstracting wait)
Abstracts Locks which should be avoided as per guidlines of my company.
You can use Task.WaitAll and pass an array of tasks so you can wait till all tasks are complete.
You can attach task to the parent task, thus you can decide whether the parent or the child will exist first.
You can achieve data parallelism with LINQ queries.
You can create parallel for and foreach loops
Very easy to handle exceptions with tasks.
*Most important thing is if the same code is run on single core machine it will just act as a single process without any overhead of threads.
Disadvantage of tasks over threads:
You need .Net 4.0
Newcomers who have learned operating systems can understand threads better.
New to the framework so not much assistance available.
Some tips:-
Always use Task.Factory.StartNew method which is semantically perfect and standard.
Take a look at Task Parallel Libray for more information
http://msdn.microsoft.com/en-us/library/dd460717.aspx
Expanding on the comment by Eric Lippert:
Threads are a way that allows your application to do several things in parallel. For example, your application might have one thread that processes the events from the user, like button clicks, and another thread that performs some long computation. This way, you can do two different things “at the same time”. If you didn't do that, the user wouldn't be to click buttons until the computation finished. So, Thread is something that can execute some code you wrote.
Task, on the other hand represents an abstract notion of some job. That job can have a result, and you can wait until the job finishes (by calling Wait()) or say that you want to do something after the job finishes (by calling ContinueWith()).
The most common job that you want to represent is to perform some computation in parallel with the current code. And Task offers you a simple way to do that. How and when the code actually runs is defined by TaskScheduler. The default one uses a ThreadPool: a set of threads that can run any code. This is done because creating and switching threads in inefficient.
But Task doesn't have to be directly associated with some code. You can use TaskCompletionSource to create a Task and then set its result whenever you want. For example, you could create a Task and mark it as completed when the user clicks a button. Some other code could wait on that Task and while it's waiting, there is no code executing for that Task.
If you want to know when to use Task and when to use Thread: Task is simpler to use and more efficient that creating your own Threads. But sometimes, you need more control than what is offered by Task. In those cases, it makese sense to use Thread directly.
Tasks really are just a wrapper for the boilerplate code of spinning up threads manually. At the root, there is no difference. Tasks just make the management of threads easier, as well as they are generally more expressive due to the lessening of the boilerplate noise.
We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread?
Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.
Any ideas to make this more efficient is appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?
I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net .
It's very easy to use,
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);
YourMethod(object o){
Your Code here...
}
For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
Hope, this helps
I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.
Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
var filesContent = from file in enumerableOfFilesToProcess
select new
{
File=file,
Content=File.ReadAllText(file)
};
var processedContent = from content in filesContent
select new
{
content.File,
ProcessedContent = ProcessContent(content.Content)
};
var dictionary = processedContent
.AsParallel()
.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.
I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// CCR take care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}
In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.
Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.
The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth
Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).
This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.
Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.
The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)