Process files concurrently as they arrive in C#

I have an app that works great for processing files that land in a directory on my server. The process is:
1) check for files in a directory
2) queue a user work item to handle each file in the background
3) wait until all workers have completed
4) goto 1
This works nicely and I never worry about the same file being processed twice or multiple threads being spawned for the same file. However, if there's one file that takes too long to process, step #3 hangs on that one file and holds up all other processing.
So my question is, what's the correct paradigm to spawn exactly one thread for each file I need to process, while not blocking if one file takes too long? I considered FileSystemWatcher, but the files may not be immediately readable which is why I continually look at all files and spawn a process for each (which will immediately exit if the file is locked).
Should I remove step #3 and maintain a list of files I've already processed? That seems messy and the list would grow very large over time so I suspect there's a more elegant solution.

I would suggest that you maintain a list of files which you are currently processing. Have the thread remove itself from this list when the thread finishes. When looking for new files, exclude those in the currently-running list.
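A minimal sketch of that "currently running" list, assuming a ConcurrentDictionary is used as a thread-safe set. The class name InFlightScanner, the TryStart helper and the "incoming" directory are made up for illustration, not taken from the question:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

static class InFlightScanner
{
    // Paths currently being processed; the byte value is unused (the dictionary acts as a set).
    static readonly ConcurrentDictionary<string, byte> InFlight =
        new ConcurrentDictionary<string, byte>(StringComparer.OrdinalIgnoreCase);

    // Queues a work item for the path unless one is already running for it.
    // Returns true if a new work item was queued.
    public static bool TryStart(string path, Action<string> process)
    {
        if (!InFlight.TryAdd(path, 0))
            return false; // a worker already owns this file

        ThreadPool.QueueUserWorkItem(state =>
        {
            try { process(path); }
            finally
            {
                // The worker removes itself from the list when it finishes.
                byte ignored;
                InFlight.TryRemove(path, out ignored);
            }
        });
        return true;
    }

    static void Main()
    {
        // Hypothetical drop directory, used only for this demo.
        string dir = Path.Combine(Path.GetTempPath(), "incoming");
        Directory.CreateDirectory(dir);

        for (int pass = 0; pass < 3; pass++)
        {
            // No "wait for all workers" step: slow files no longer block the scan.
            foreach (string file in Directory.GetFiles(dir))
                TryStart(file, p => Console.WriteLine("processing " + p));
            Thread.Sleep(1000);
        }
    }
}
```

This only prevents two workers from running on the same file at the same time. It assumes the worker moves or deletes the file when it is done, otherwise the next scan will pick the file up again.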

Move the files to a processing directory before you start threads. Then you can fire-and-forget the threads and any admins can see at a glance what's going on.

Spawning one thread per item is almost never a good approach. In your case, once the number of files goes above several hundred, one thread per file will hurt application performance badly, and a 32-bit process will start running out of address space.
The list solution by Dark Falcon is simple enough and matches your algorithm. I would actually use a queue (like ConcurrentQueue - http://msdn.microsoft.com/en-us/library/dd267265.aspx): put items to process on one side (e.g. from periodic scans or a file watcher) and have one or several threads pick items off the other side. You generally want a small number of threads (e.g. 1-2x the number of CPUs for a CPU-intensive load).
Also consider using the Task Parallel Library (like Parallel.ForEach - http://msdn.microsoft.com/en-us/library/dd989744.aspx) to deal with multiple threads.
To minimize the number of files to handle, I would keep a persistent list (i.e. a file on disk) of items that have already been processed: file path + last modified date (unless you can obtain this information from another source).
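The queue idea could look roughly like this. It is a sketch only: it swaps in BlockingCollection (which wraps a ConcurrentQueue by default) so that idle workers block instead of spinning, and the names FileQueueDemo and ProcessAll are invented for the example:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class FileQueueDemo
{
    // Processes every path with a fixed number of workers and returns how many were handled.
    public static int ProcessAll(IEnumerable<string> paths, int workerCount, Action<string> process)
    {
        int handled = 0;
        using (var queue = new BlockingCollection<string>())
        {
            // Consumer side: a small, fixed set of workers drains the queue.
            var workers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (string path in queue.GetConsumingEnumerable())
                    {
                        process(path);
                        Interlocked.Increment(ref handled);
                    }
                }, TaskCreationOptions.LongRunning);
            }

            // Producer side: in a real service this would be the periodic scan or watcher.
            foreach (string path in paths)
                queue.Add(path);

            queue.CompleteAdding(); // lets the workers' loops end once the queue is empty
            Task.WaitAll(workers);
        }
        return handled;
    }

    static void Main()
    {
        var paths = new[] { "a.csv", "b.csv", "c.csv" };
        int n = ProcessAll(paths, Environment.ProcessorCount,
                           p => Console.WriteLine("processed " + p));
        Console.WriteLine(n + " files handled");
    }
}
```

In a long-running service the producer would keep adding paths and CompleteAdding would only be called at shutdown. A slow file then ties up one worker while the others keep draining the queue.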

My two main questions would be:
What is the size of the files?
How often will files appear?
Depending on your answer there, I might go with the following producer-consumer algorithm:
1) Use a file system watcher to see that there is activity in the directory you are monitoring
2) When activity occurs, start polling "lightly"; that is test each file available to see if it is not locked (i.e., try open w/ write privileges using a simple IsLocked extension method that tests via a try..catch); if 1 or more files are not free, set a timer to go off in some amount of time (longer if expecting larger, fewer files, shorter if smaller and/or more frequent) to again test files
3) As soon as you see that a file is free, process it (i.e., move it to another folder, put an item in a concurrent queue, have your consumer threads process the queue, archive the file/results).
4) Have some kind of persistence mechanism like Alexei mentions (i.e., disk/database) to be able to recover your processing where you left off in case of system failure.
I feel that this is a good combination of non-blocking, low cpu-usage behavior. But measure your before and after results. I would recommend using the ThreadPool and keeping its threads from blocking (i.e., help ensure thread re-use by not stalling them with something like Thread.Sleep).
Notes:
Base the number of threads processing files on the number of CPUs and cores available on the machine; also consider server load
FileSystemWatcher can be finicky; be sure that it's running from the same machine that you are monitoring (i.e., not watching a remote server), otherwise you'll need to reinitialize connectivity from time to time.
I definitely would not spawn a different process per file; multiple threads should be plenty sufficient; reusing threads is best. Spawning processes is a very very expensive operation and spawning threads is an expensive operation. Alexei has some good information wrt Task Parallel Library; it uses the ThreadPool.
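The IsLocked extension method mentioned above is not shown in the answer; one possible shape, under the assumption that "locked" means "cannot be opened for exclusive read/write access right now", is:

```csharp
using System;
using System.IO;

static class FileInfoExtensions
{
    // Returns true if the file cannot currently be opened for exclusive read/write access.
    // Note: a missing file also throws an IOException subclass and is reported as locked.
    public static bool IsLocked(this FileInfo file)
    {
        try
        {
            using (file.Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None))
            {
                return false; // opened exclusively, so nobody else holds it
            }
        }
        catch (IOException)
        {
            return true; // sharing violation or similar I/O failure
        }
    }
}

static class Program
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        var info = new FileInfo(path);

        Console.WriteLine("free file locked? " + info.IsLocked());

        // Hold the file exclusively and test again.
        using (File.Open(path, FileMode.Open, FileAccess.Read, FileShare.None))
        {
            Console.WriteLine("held file locked? " + info.IsLocked());
        }

        File.Delete(path);
    }
}
```

The check is inherently racy: another process can lock the file between the test and the actual processing, so the processing code still has to handle an IOException. It also does not catch UnauthorizedAccessException (for example for read-only files).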

Related

C# processing received socket data via threads

I have a problem with scalability and processing and I want to get the opinion of the stack overflow community.
I basically have XML data coming down a socket and I want to process that data. For each XML line sent processing can include writing to a text file, opening a socket to another server and using various database queries; all of which take time.
At the minute my solution involves the following threads:
Thread 1
Accepts incoming sockets and thus generates child threads that handle each socket (there will only be a couple of incoming sockets from clients). When an XML line comes through (ReadLine() method on StreamReader) I basically put this line into a Queue, which is accessible via a static method on a class. This static method contains locking logic to ensure that the program is threadsafe (I could use Concurrent Queue for this of course instead of manual locking).
Threads 2-5
Constantly take XML lines from the queue and processes them one at a time (database queries, file writes etc).
This method seems to be working but I was curious if there is a better way of doing things because this seems very crude. If I take the processing that threads 2-5 do into thread 1 this results in extremely slow performance, which I expected, so I created my worker threads (2-5).
I appreciate I could replace threads 2-5 with a thread pool but the thread pool would still be reading from the same Queue of XML lines so I wondered if there is a more efficient way of processing these events instead of using the Queue?
A queue [1] is the right approach. But I would certainly move from manual thread control to the thread pool (so that you don't need to do thread management yourself) and let it manage the number of threads. [2]
But in the end there is only so much processing a single computer (however expensive) can do. At some point one of memory size, CPU-memory bandwidth, storage IO, network IO, … is going to be saturated. At that point using an external queuing system (MSMQ, WebSphere MQ, RabbitMQ, …) with each task being a separate message allows many workers on many computers to process the data ("competing consumers" pattern).
[1] I would move immediately to ConcurrentQueue: getting locking right is hard, and the less of it you need to do yourself the better.
[2] At some point you might find you need more control than the thread pool provides; that is the time to switch to a custom thread pool. But prototype and test: it is quite possible your implementation will actually be worse (see paragraph 2).

Quickest way to process large number of files with thousands of Data in each file

I need to process data from a large number of files, each with thousands of rows. Previously I read each file row by row and processed it. As the number of files increased, processing them all took a lot of time. Then someone said that threads can be used to perform the task in less time. Can threading make this process faster? I'm using C#.
It certainly can although it depends on the particular job in question. A very common pattern is to have one thread doing the file IO and multiple threads processing the actual lines.
How many processing threads to start will depend on how many processors/cores you have on your system, and how the results of the processing get written out. If the processing time per line is very small however, you probably won't get too much speed improvement having multiple processing threads and a single processing thread would be optimal.
A good approach to a performance question is to assume that your code is doing something unnecessary and try to find what it is: measure, review, draw, whatever works for you. I'm not saying that the code you have is slow; it's just a way to look at it.
Once you add multithreading to the mix, you may find the code much harder to analyze.
More concretely for your task: combining multiple similar operations (like reading a record from a file or committing to the DB) may save a significant amount of time (you need to prototype and measure).
I would recommend you do batch insert to your database.
You can have one thread that reads lines into a concurrent queue while another thread pulls the data from the queue, aggregates it if necessary or performs whatever operation you need on it, and then batch inserts the data into the database. It will save you quite a lot of time.
Inserting one line at a time into the DB would be very slow. You have to do batch inserts.
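As an illustration of the batching part only, here is a small helper that groups rows into fixed-size batches. The InBatches name and the batch size are assumptions, and the database call is deliberately left as a placeholder because the answer does not say which database or API is in use:

```csharp
using System;
using System.Collections.Generic;

static class Batching
{
    // Groups a stream of rows into lists of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> rows, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var batch = new List<T>(batchSize);
        foreach (T row in rows)
        {
            batch.Add(row);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)
            yield return batch; // final, possibly smaller batch
    }

    static void Main()
    {
        var rows = new[] { "r1", "r2", "r3", "r4", "r5" };
        foreach (List<string> batch in InBatches(rows, 2))
        {
            // Placeholder: a real program would issue one bulk insert per batch here.
            Console.WriteLine("insert " + batch.Count + " rows");
        }
        // prints "insert 2 rows", "insert 2 rows", "insert 1 rows"
    }
}
```

The same helper works on a lazily read file (for example the sequence returned by File.ReadLines), so the whole file never has to be held in memory.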
Yes, using threads can speed things up.
Threads are useful when you have time-consuming tasks that can run in the background (for example, when you process say 10 files but only need one, you can have a thread process each of them, which will be a lot faster than processing them all on your main thread).
Please note that this can introduce bugs, so you should make sure all threads have finished running before continuing and trying to access their results.
Look up "C#.NET multithreading".
Any thread can run a specified function, and BackgroundWorker is a nice class as well (I prefer pure multithreading though).
Also note that this may backfire and wind up slower, but it's a good idea to try.
Threading is one way (there are others) of letting you overlap the processing with the I/O. That means instead of the total time being the sum of the time to read the data and the time to process the data, you can reduce it to (roughly) whichever of the two is larger (usually the I/O time).
If you mostly want to overlap the I/O time, you might want to look at overlapped I/O and/or I/O completion ports.
Edit: If you're going to do this, you normally want to base the number of I/O threads on the number of separate physical disks you're going to be reading from, and the number of processing threads on the number of processors you have available to do the processing (but only as many as necessary to keep up with the data being supplied by the reader thread). For a typical desktop machine, that will often mean only two threads, one to read and one to process data.

Question regarding threading/background workers

I have a question around threading and background workers that I hope you can help with.
I plan on making an ftp application to upload a file to 50 servers. Rather than the user having to wait for each upload to finish before the next one starts I was looking at threading/background workers. Once an upload finishes I want to report the status of the upload "completed/failed" back to the UI. From my understanding, I will need to use background workers for this so I know when the task has completed. I know with threading I can use producer/consumer queue or a semaphore to run a given amount of threads at once but I am not quite sure how I can achieve this with back ground workers.
So my question is, what would be a sensible number of background workers controlling uploading to run at once and what would be the best way to queue the rest?
There is no limit on the size of the upload file so this could be quite small or up to a few MB.
Thanks in advance.
Edit - I tested one BackgroundWorker per server, all running simultaneously. The results were faster than with a single BackgroundWorker, but I can't say that I was fully comfortable running 50-plus background workers at once, and since the server count may increase in the future, I decided to stick with just the one, which seems to be fast enough. I may in future look at increasing the count of workers to 2 or 3 but currently 1 seems to be adequate. Thanks for everyone's help.
Thanks
I'd go in a completely different direction with it, tbh. Your app should take the file and store it once, responding to the client that it's got it. The file should then be propagated to the other servers. You can do this many ways, but if you want it controlled by the same application (i.e. not done using a windows service or the like) then a good way would be to use a message queue (either MSMQ or one of the OS ones).
This is much easier than using a semaphore or producer-consumer queue.
Put all your tasks in a queue (doesn't need to be a thread-safe queue, it will only be used from the UI thread).
Loop from 1 to N, taking out a task and starting a BackgroundWorker. (Be sure to handle the empty queue, when there were less than N tasks to begin with). In the RunWorkerCompleted event, update your UI, dequeue another task, and start another BackgroundWorker.
The bottleneck here is going to be your network bandwidth. If your local upstream connection is so fast that you can saturate the incoming connections on two or more remote hosts, then you'll benefit from running multiple uploads in parallel. If not, then it makes very little difference to the total upload time, since it'll be dictated by (file size * number of uploads) / (local bandwidth). In other words - if you do 20 uploads one at a time, it'll take an hour; if you do 20 uploads in parallel, it'll still take an hour. The advantage of the first approach is that if you lose connectivity you'll only need to resume/restart a single upload - whichever one was in progress when the connection was lost.
I'd therefore use a single background thread to sequentially upload the file to each server in turn. If you're using the .NET BackgroundWorker to do this, you can get it to ReportProgress at the end of each file (and you know in advance how many files are to be uploaded so you can calculate progress as a percentage), and attach some custom state to the progress update to inform the user whether the last upload succeeded or not.
The only way to know for sure is to test and measure, but it can be different from machine to machine, mostly depending on uplink speed.
Starting 50 backgroundworkers at the same time is a bit on the high end, but is not incredibly many. A simple approach would be to start 50 all at the same time and measure memory consumption and upload speed.
If the FTP servers are each much faster than the client uplink speed the most efficient would be to just upload one (or possibly two) at a time.

Using multithreading for loop

I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
    // I want to start a new thread for each of these
    agent.DoProcessLoop();
}
---
public void DoProcessLoop()
{
    while (true)
    {
        // do the processing
        // this is things like check folder for new files, update database
        // if new files found
    }
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie
Given the specific usage you describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.
I believe that the Parallel Extensions make this possible:
Parallel.ForEach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/
One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.
You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
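One way to sketch that single dedicated thread, with a clean exit, is to replace Thread.Sleep with a wait on a stop signal: the wait doubles as the pause between passes and returns early on shutdown. The AgentPoller class and its members are invented for this example, and each agent is reduced to a plain Action:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class AgentPoller
{
    private readonly ManualResetEvent _stop = new ManualResetEvent(false);
    private readonly Thread _thread;
    private readonly IList<Action> _agents;
    private readonly TimeSpan _interval;

    // Number of completed passes; read it after Stop() has returned.
    public int Passes { get; private set; }

    public AgentPoller(IList<Action> agents, TimeSpan interval)
    {
        _agents = agents;
        _interval = interval;
        // A dedicated thread, not a ThreadPool thread, because the loop is long-running.
        _thread = new Thread(Run) { IsBackground = true };
    }

    public void Start() { _thread.Start(); }

    // Signals the loop to exit and waits for the current pass to finish.
    public void Stop()
    {
        _stop.Set();
        _thread.Join();
    }

    private void Run()
    {
        do
        {
            foreach (Action agent in _agents)
                agent(); // e.g. check the agent's folder and update the database
            Passes++;
        }
        while (!_stop.WaitOne(_interval)); // pauses for the interval, wakes early on Stop
    }

    static void Main()
    {
        var agents = new List<Action>
        {
            () => Console.WriteLine("checking agent 1"),
            () => Console.WriteLine("checking agent 2"),
        };
        var poller = new AgentPoller(agents, TimeSpan.FromMilliseconds(100));
        poller.Start();
        Thread.Sleep(350);
        poller.Stop();
        Console.WriteLine("passes completed: " + poller.Passes);
    }
}
```

Because every agent is handled on one thread, a slow agent delays the others, which is the trade-off this answer accepts in exchange for simplicity.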
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.
A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.
Hmm.. as Ragoczy points out, it's better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware, no matter how many processors you have, there is a limit to their capacity. You may not want to create as many threads as the number of concurrent users, for the simple reason that your number of agents can increase.
Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and an AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work then the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.

Reading same file from multiple threads in C#

I was googling for some advice about this and I found some links. The most obvious was this one, but in the end what I'm wondering is how well my code is implemented.
I have basically two classes. One is the Converter and the other is ConverterThread
I create an instance of this Converter class, which has a property ThreadNumber that tells me how many threads should run at the same time (this is read from the user). The application will be used on multi-CPU systems (physically, like 8 CPUs), so this is supposed to speed up the import.
The Converter instance reads a file that can range from 100 MB to 800 MB, and each line of this file is a tab-delimited record that is imported to another destination like a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification so when its work is done it can notify the Converter class and then I can sum up the progress for all these threads and notify the user (in the GUI for example) about how many of these records have been imported and how many bytes have been read.
It seems, however that I'm having some trouble because I get random errors about the file not being able to be read or that the sum of the progress (percentage) went above 100% which is not possible and I think that happens because threads are not being well managed and probably the information returned by the event is malformed (since it "travels" from one thread to another)
Do you have any advice on better practices for implementing threads so I can accomplish this?
Thanks in advance.
I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be irreproducible on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.
Multiple threads reading a file at a time is asking for trouble. I would set up a producer consumer model such that the producer read the lines in the file, perhaps into a buffer, and then handed them out to the consumer threads when they complete processing their current work load. It does mean you have a blocking point where the lines are handed out but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part then you really don't need multiple consumers anyway.
You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.
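The enqueued/dequeued bookkeeping described above could be kept in a pair of counters updated with Interlocked, which avoids the cross-thread event payloads that may be behind the "over 100%" readings in the question. The QueueProgress class is a hypothetical helper; it assumes the reader calls ItemEnqueued before adding a line to the queue and workers call ItemDequeued after finishing one:

```csharp
using System;
using System.Threading;

class QueueProgress
{
    private long _enqueued;
    private long _dequeued;

    // Called by the reader thread for every line it puts on the queue.
    public void ItemEnqueued() { Interlocked.Increment(ref _enqueued); }

    // Called by a worker thread for every line it has finished processing.
    public void ItemDequeued() { Interlocked.Increment(ref _dequeued); }

    // Fraction of the lines read so far that have been processed, from 0.0 to 1.0.
    public double Fraction
    {
        get
        {
            // Read "done" first: the enqueued count only grows, so the total read
            // afterwards can never be smaller than the done count read before it.
            long done = Interlocked.Read(ref _dequeued);
            long total = Interlocked.Read(ref _enqueued);
            return total == 0 ? 0.0 : (double)done / total;
        }
    }

    static void Main()
    {
        var progress = new QueueProgress();
        for (int i = 0; i < 4; i++) progress.ItemEnqueued();
        progress.ItemDequeued();
        Console.WriteLine("fraction done: " + progress.Fraction);
    }
}
```

As the answer notes, this fraction can go down while the reader is still adding lines. If the total number of lines (or bytes) is known up front, dividing by that fixed total instead gives a figure that only moves forward.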
