I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
// I want to start a new thread for each of these
agent.DoProcessLoop();
}
---
public void DoProcessLoop()
{
while (true)
{
// do the processing
// this is things like check folder for new files, update database
// if new files found
}
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie
Given the specific usage your describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.
I believe that the Parallels Extensions make this possible:
Parallel.Foreach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/
One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.
You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.
A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.
Hmm.. as Ragoczy points out, its better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware, no matter how many processers you have, there is a limit to it's capacity. You may not want to create as many threads as the number of concurrent users, for the simple reason that your number of agents can increase.
Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and a AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work then the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.
Related
First, thanks for all the replies!
I want to be more specific - I have a website that shows some current and historical reports. I want to be able to allow users to delete all or some of the history, while still navigating the website.
Therefore, I want to run a separate thread that will handle deleting the data, but I want to give this thread a low priority so it doesn't make the web site slow or unresponsive.
I'm in the design phase right now and I'd appreciate some strategy suggestions. Thanks!
You should be fine. Lowering the priority of CPU-intensive background tasks to allow 'normal' response from a GUI and/or other apps is one of the better uses for altering thread priorities.
Note that lowering the thread priority is not going to have an enormous efect on any resource except CPU, but still worth doing IME.
If you want to carry on using Firefox, Office, VLC etc. while running your app then, yes, lower the priority of the CPU-heavy threads. You should then be able to browse SO, listen to some albums or watch a few films while waiting for your results to eventually come out :)
Note that if you want to change the priority for a thread, you should make sure it's one you've created yourself. It's not good to change the priority on a thread pool thread. This means avoiding the new Task features, unless you're prepared to write a TaskScheduler that doesn't use the thread pool.
Also consider setting the process priority instead, if that suits your scenario. See MSDN for more info. This would affect threads equally.
Edit: Thanks for the additional information. It sounds as though your code is hosted in IIS. From this answer we can confirm that IIS uses the same thread pool as ThreadPool.QueueUserWorkItem - the standard .NET thread pool. Therefore you must not alter the priority of a thread pool thread; these threads belong to IIS.
You could create your own thread. But it seems ill advised to try to host a background operation in IIS like this. You never know when the app pool might be recycled, for example.
It would be better to consider a couple of other options. The best solution for a potentially long running background operation seems to be workflow services. Used in conjunction with AppFabric Server, these are very powerful and sound as though they would handle your situation.
A simpler alternative would be to move the process outside of IIS. Maybe the user's action could mark items for deletion, then a scheduled task outside of IIS could run to perform the slow operation.
I would use ThreadPool.UnsafeQueueUserWorkItem(new WaitCallback(MyDeleteMethod), itemsTobeDeleted)
In MyDeleteMethod, you may consider breaking the deletion process in different bulks. In other words, you can specify a max number of rows to be deleted every time you are running the delete method.
If you are worried about number of available connections in the db pool, you can improve the performance by defining a static object (with a timer) and add the itemsTobeDeleted to that list(you need to use lock for sync access). Whenever timer Elapsed event fires, you can perform the bulk delete ( For instance you have 5,000 record and you are going to delete them 500 by 500 ).
I have an app that works great for processing files that land in a directory on my server. The process is:
1) check for files in a directory
2) queue a user work item to handle each file in the background
3) wait until all workers have completed
4) goto 1
This works nicely and I never worry about the same file being processed twice or multiple threads being spawned for the same file. However, if there's one file that takes too long to process, step #3 hangs on that one file and holds up all other processing.
So my question is, what's the correct paradigm to spawn exactly one thread for each file I need to process, while not blocking if one file takes too long? I considered FileSystemWatcher, but the files may not be immediately readable which is why I continually look at all files and spawn a process for each (which will immediately exit if the file is locked).
Should I remove step #3 and maintain a list of files I've already processed? That seems messy and the list would grow very large over time so I suspect there's a more elegant solution.
I would suggest that you maintain a list of files which you are currently processing. Have the thread remove itself from this list when the thread finishes. When looking for new files, exclude those in the currently-running list.
Move the files to a processing directory before you start threads. Then you can fire-and-forget the threads and any admins can see at a glance what's going on.
Spawning one thread per item to process is almost never good approach. In your case when number of files will go above several hundreds one-thread-per-file will make application performance pretty bad and with 32-bit process will start running out of address space.
List solution by Dark Falcon is simple enough and matches your algorithm. I would actually use queue (likle ConcurrentQueue - http://msdn.microsoft.com/en-us/library/dd267265.aspx) to put items to process on one side (i.e. based on periodic scans of file watcher) and pick items for processing by one or several threads on other side. You generally want smaller number of threads (i.e. 1-2x number of CPUs for CPU intensive load).
Also consider using Task Parallel Library (like Parallel.ForEach - http://msdn.microsoft.com/en-us/library/dd989744.aspx) to deal with multiple thread.
To minimize number of files to handle I would keep persistent (i.e. disk file) list of items that are already processed - file path + last modified date (unless you can obtain this information from other source).
My two main questions would be:
What are the size of the files?
How often will files appear?
Depending on your answer there, I might go with the following producer-consumer algorithm:
Use a file system watcher to see that there is activity in the directory you are monitoring
When activity occurs, start polling "lightly"; that is test each file available to see if it is not locked (i.e., try open w/ write privileges using a simple IsLocked extension method that tests via a try..catch); if 1 or more files are not free, set a timer to go off in some amount of time (longer if expecting larger fewer files, shorter if smaller and/or more frequent) to again test files
As soon as you see that a file is free, process it (i.e., move it to another folder, put an item in a concurrent queue, have your consumer threads process the queue, archive the file/results).
Have some kind of persistence mechanism like Alexei mentions (i.e., disk/database) to be able to recover your processing where you left off in case of system failure.
I feel that this is a good combination of non-blocking, low cpu-usage behavior. But measure your before and after results. I would recommend using the ThreadPool and try to keep threads from blocking (i.e., try to ensure thread re-use by not blocking by doing something like Thread.Sleep)
Notes:
Base the number of threads processing files on the number of CPUs and cores available on the machine; also consider server load
FileSystemWatcher can be finicky; be sure that it's running from the same machine that you are monitoring (i.e., not watching a remote server), otherwise you'll need to reinitialize connectivity from time to time.
I definitely would not spawn a different process per file; multiple threads should be plenty sufficient; reusing threads is best. Spawning processes is a very very expensive operation and spawning threads is an expensive operation. Alexei has some good information wrt Task Parallel Library; it uses the ThreadPool.
I have never written multi-threaded code before (barring a few basic backgroundworker tricks) and am hoping for some guidance about how I would approach my problem.
I have an XML file which is a serialized List<Stock>. For each one of these stock items I need to perform a webservice call called UpdatePrice().
What I want to do is take each one of these items, create a threadpool (who's size depends on the amount of rows I will need to process) and begin making webservice calls.
I am not asking for a complete solution (obviously) but would really appreciate some guidance about how one would typically solve this problem.
The biggest issue that I see arising is how I would designate which threads would work on which objects. Do I simply take the list divide it by the number of threads I make and split the work? Or am I better off allowing each thread to arbitrarily pick an item from the list to process? (Then I have locking issues but as a plus can ensure no thread is idle)
As I said before I am not looking for a complete solution but just some basic guidance on where to start because honestly I am lost on this one and haven't written a single line of code.
PS: Also are autogenerated webservice proxies in .NET threadsafe?
I would suggest looking into TPL and PLINQ for a solution. A simple example solution using Parallel.ForEach() could look like this (parallel calls limited to 5 in the example).
List<Stock> stocks;
Parallel.ForEach(stocks,
new ParallelOptions() { MaxDegreeOfParallelism = 5 },
(stock) =>
{
float newPrice = UpdatePrice(stock.TickerSymbol); //web service call
stock.Price = newPrice;
});
i would:
First read the whole XML data synchronously.
Then, i would put each element to be processed in a single queue.
Then, you can spawn N processing threads, in which at the beginning of each one, it would "pop" an element of your queue, wrapping this specific piece of code in a mutex / semaphore (Google C# mutex, or concurrent access, or anything related). This is easily done in C# with the "lock" keyword on an arbitrary object.
Hope this helps.
Pierre.
There's no point in using threads here. A thread can only give you one resource: more cpu cycles, provided that you have a CPU with multiple cores. That is not the resource that you need to speed up your program. You need a faster Internet connection.
If you have an UI you don't want frozen then the BackgroundWorker tricks will work just fine.
We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread?
Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.
Any ideas to make this more efficient is appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?
I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net .
It's very easy to use,
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);
YourMethod(object o){
Your Code here...
}
For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
Hope, this helps
I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.
Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
var filesContent = from file in enumerableOfFilesToProcess
select new
{
File=file,
Content=File.ReadAllText(file)
};
var processedContent = from content in filesContent
select new
{
content.File,
ProcessedContent = ProcessContent(content.Content)
};
var dictionary = processedContent
.AsParallel()
.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.
I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// CCR take care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}
In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.
Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.
The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth
Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).
This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.
Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.
The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)
I was googling for some advise about this and I found some links. The most obvious was this one but in the end what im wondering is how well my code is implemented.
I have basically two classes. One is the Converter and the other is ConverterThread
I create an instance of this Converter class that has a property ThreadNumber that tells me how many threads should be run at the same time (this is read from user) since this application will be used on multi-cpu systems (physically, like 8 cpu) so it is suppossed that this will speed up the import
The Converter instance reads a file that can range from 100mb to 800mb and each line of this file is a tab-delimitted value record that is imported to another destination like a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification so when its work is done it can notify the Converter class and then I can sum up the progress for all these threads and notify the user (in the GUI for example) about how many of these records have been imported and how many bytes have been read.
It seems, however that I'm having some trouble because I get random errors about the file not being able to be read or that the sum of the progress (percentage) went above 100% which is not possible and I think that happens because threads are not being well managed and probably the information returned by the event is malformed (since it "travels" from one thread to another)
Do you have any advise on better practices of implementation of threads so I can accomplish this?
Thanks in advance.
I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be unreproduceable on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.
Multiple threads reading a file at a time is asking for trouble. I would set up a producer consumer model such that the producer read the lines in the file, perhaps into a buffer, and then handed them out to the consumer threads when they complete processing their current work load. It does mean you have a blocking point where the lines are handed out but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part then you really don't need multiple consumers anyway.
You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.