Reading same file from multiple threads in C#

Reading same file from multiple threads in C# - c#

I was googling for some advise about this and I found some links. The most obvious was this one but in the end what im wondering is how well my code is implemented.
I have basically two classes. One is the Converter and the other is ConverterThread
I create an instance of this Converter class that has a property ThreadNumber that tells me how many threads should be run at the same time (this is read from user) since this application will be used on multi-cpu systems (physically, like 8 cpu) so it is suppossed that this will speed up the import
The Converter instance reads a file that can range from 100mb to 800mb and each line of this file is a tab-delimitted value record that is imported to another destination like a database.
The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification so when its work is done it can notify the Converter class and then I can sum up the progress for all these threads and notify the user (in the GUI for example) about how many of these records have been imported and how many bytes have been read.
It seems, however that I'm having some trouble because I get random errors about the file not being able to be read or that the sum of the progress (percentage) went above 100% which is not possible and I think that happens because threads are not being well managed and probably the information returned by the event is malformed (since it "travels" from one thread to another)
Do you have any advise on better practices of implementation of threads so I can accomplish this?
Thanks in advance.

I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.
Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be unreproduceable on another machine with different hardware characteristics.
Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.
Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.

Multiple threads reading a file at a time is asking for trouble. I would set up a producer consumer model such that the producer read the lines in the file, perhaps into a buffer, and then handed them out to the consumer threads when they complete processing their current work load. It does mean you have a blocking point where the lines are handed out but if processing takes much longer than reading then it shouldn't be that big of a deal. If reading is the slow part then you really don't need multiple consumers anyway.

You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.
You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously as your file reader thread puts more lines into the queue your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than workers can process the lines.

Related

Quickest way to process large number of files with thousands of Data in each file

I need to process data from some large number of file with thousands of data in terms of rows.Earlier i was reading the whole file row by row and processing.It took a lot of time for processing all the file when the number of files increased.Then some one said that threads can be used to perform the task in less amount of time??Can threading make this process fast.I'm using c# language.

It certainly can although it depends on the particular job in question. A very common pattern is to have one thread doing the file IO and multiple threads processing the actual lines.
How many processing threads to start will depend on how many processors/cores you have on your system, and how the results of the processing get written out. If the processing time per line is very small however, you probably won't get too much speed improvement having multiple processing threads and a single processing thread would be optimal.

Good thing with performance question is to assume that your code is just doing something unnecessary and try to find what it is - measure, review, draw - whatever works for you. I'm not saying that the code you have is slow, it just a way to look at it.
With adding multithreading to the mix first you may find it to be much harder to analyze the code.
More concrete for your task: combining multiple similar operation (like read a record from file or commit to DB) together may save significant amount of time (you need to prototype and measure).

I would recommend you do batch insert to your database.
You can have a thread that reads a line to a concurrent queue. while other thread is pulling the data from concurrent queue. agregating it if necessary or if you are doing any operation on it. then batch insert the data to database. it will save you quite a time.
Inserting a line to db would be very slow. you have to do batch inserts.

Yes, using threads can speed thigns up.
Threads are to be used when you have time onsuming tasks you can run in the background (like, when you process say 10 files, but only need one, you can have a thread process each of them which will be a lot faster then processing them all on your main thread).
Please not that there may be bugs related, so you should make sure all threads finished running before continuing and trying to access what they got.
Look up "C#.NET multithreading"
any thread can run a specified function, and background worker is a nice class as well (I prefer pure multithreading though).
Also note that this may backfire and wind up slower, but it's a good idea to try.

Threading is one way (there are others) of letting you overlap the processing with the I/O. That means instead of the total time being the sum of the time to read the data and the time to processing the data, you can reduce it to (roughly) whichever of the two is larger (usually the I/O time).
If you mostly want to overlap the I/O time, you might want to look at overlapped I/O and/or I/O completion ports.
Edit: If you're going to do this, you normally want to base the number of I/O threads on the number of separate physical disks you're going to be reading from, and the number of processing threads on the number of processors you have available to do the processing (but only as many as necessary to keep up with the data being supplied by the reader thread). For a typical desktop machine, that will often mean only two threads, one to read and one to process data.

Process files concurrently as they arrive in c#

I have an app that works great for processing files that land in a directory on my server. The process is:
1) check for files in a directory
2) queue a user work item to handle each file in the background
3) wait until all workers have completed
4) goto 1
This works nicely and I never worry about the same file being processed twice or multiple threads being spawned for the same file. However, if there's one file that takes too long to process, step #3 hangs on that one file and holds up all other processing.
So my question is, what's the correct paradigm to spawn exactly one thread for each file I need to process, while not blocking if one file takes too long? I considered FileSystemWatcher, but the files may not be immediately readable which is why I continually look at all files and spawn a process for each (which will immediately exit if the file is locked).
Should I remove step #3 and maintain a list of files I've already processed? That seems messy and the list would grow very large over time so I suspect there's a more elegant solution.

I would suggest that you maintain a list of files which you are currently processing. Have the thread remove itself from this list when the thread finishes. When looking for new files, exclude those in the currently-running list.

Move the files to a processing directory before you start threads. Then you can fire-and-forget the threads and any admins can see at a glance what's going on.

Spawning one thread per item to process is almost never good approach. In your case when number of files will go above several hundreds one-thread-per-file will make application performance pretty bad and with 32-bit process will start running out of address space.
List solution by Dark Falcon is simple enough and matches your algorithm. I would actually use queue (likle ConcurrentQueue - http://msdn.microsoft.com/en-us/library/dd267265.aspx) to put items to process on one side (i.e. based on periodic scans of file watcher) and pick items for processing by one or several threads on other side. You generally want smaller number of threads (i.e. 1-2x number of CPUs for CPU intensive load).
Also consider using Task Parallel Library (like Parallel.ForEach - http://msdn.microsoft.com/en-us/library/dd989744.aspx) to deal with multiple thread.
To minimize number of files to handle I would keep persistent (i.e. disk file) list of items that are already processed - file path + last modified date (unless you can obtain this information from other source).

My two main questions would be:
What are the size of the files?
How often will files appear?
Depending on your answer there, I might go with the following producer-consumer algorithm:
Use a file system watcher to see that there is activity in the directory you are monitoring
When activity occurs, start polling "lightly"; that is test each file available to see if it is not locked (i.e., try open w/ write privileges using a simple IsLocked extension method that tests via a try..catch); if 1 or more files are not free, set a timer to go off in some amount of time (longer if expecting larger fewer files, shorter if smaller and/or more frequent) to again test files
As soon as you see that a file is free, process it (i.e., move it to another folder, put an item in a concurrent queue, have your consumer threads process the queue, archive the file/results).
Have some kind of persistence mechanism like Alexei mentions (i.e., disk/database) to be able to recover your processing where you left off in case of system failure.
I feel that this is a good combination of non-blocking, low cpu-usage behavior. But measure your before and after results. I would recommend using the ThreadPool and try to keep threads from blocking (i.e., try to ensure thread re-use by not blocking by doing something like Thread.Sleep)
Notes:
Base the number of threads processing files on the number of CPUs and cores available on the machine; also consider server load
FileSystemWatcher can be finicky; be sure that it's running from the same machine that you are monitoring (i.e., not watching a remote server), otherwise you'll need to reinitialize connectivity from time to time.
I definitely would not spawn a different process per file; multiple threads should be plenty sufficient; reusing threads is best. Spawning processes is a very very expensive operation and spawning threads is an expensive operation. Alexei has some good information wrt Task Parallel Library; it uses the ThreadPool.

Using multithreading for loop

I'm new to threading and want to do something similar to this question:
Speed up loop using multithreading in C# (Question)
However, I'm not sure if that solution is the best one for me as I want them to keep running and never finish. (I'm also using .net 3.5 rather than 2.0 as for that question.)
I want to do something like this:
foreach (Agent agent in AgentList)
{
// I want to start a new thread for each of these
agent.DoProcessLoop();
}
---
public void DoProcessLoop()
{
while (true)
{
// do the processing
// this is things like check folder for new files, update database
// if new files found
}
}
Would a ThreadPool be the best solution or is there something that suits this better?
Update: Thanks for all the great answers! I thought I'd explain the use case in more detail. A number of agents can upload files to a folder. Each agent has their own folder which they can upload assets to (csv files, images, pdfs). Our service (it's meant to be a windows service running on the server they upload their assets to, rest assured I'll be coming back with questions about windows services sometime soon :)) will keep checking every agent's folder if any new assets are there, and if there are, the database will be updated and for some of them static html pages created. As it could take a while for them to upload everything and we want them to be able to see their uploaded changes pretty much straight away, we thought a thread per agent would be a good idea as no agent then needs to wait for someone else to finish (and we have multiple processors so wanted to use their full capacity). Hope this explains it!
Thanks,
Annelie

Given the specific usage your describe (watching for files), I'd suggest you use a FileSystemWatcher to determine when there are new files and then fire off a thread with the threadpool to process the files until there are no more to process -- at which point the thread exits.
This should reduce i/o (since you're not constantly polling the disk), reduce CPU usage (since the constant looping of multiple threads polling the disk would use cycles), and reduce the number of threads you have running at any one time (assuming there aren't constant modifications being made to the file system).
You might want to open and read the files only on the main thread and pass the data to the worker threads (if possible), to limit i/o to a single thread.

I believe that the Parallels Extensions make this possible:
Parallel.Foreach
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
http://blogs.msdn.com/pfxteam/

One issue with ThreadPool would be that if the pool happens to be smaller than the number of Agents you would like to have, the ones you try to start later may never execute. Some tasks may never begin to execute, and you could starve everything else in your app domain that uses the thread pool as well. You're probably better off not going down that route.

You definitely don't want to use the ThreadPool for this purpose. ThreadPool threads are not meant to be used for long-running tasks ("infinite" counts as long-running), since that would obviously tie up resources meant to be shared.
For your application, it would probably be better to create one thread (not from the ThreadPool) and in that thread execute your while loop, inside of which you iterate through your Agents collection and perform the processing for each one. In the while loop you should also use a Thread.Sleep call so you don't max out the processor (there are better ways of executing code periodically, but Thread.Sleep will work for your purposes).
Finally, you need to include some way for the while loop to exit when your program terminates.
Update: Finally finally, multi-threading does not automatically speed up slow-running code. Nine women can't make a baby in one month.

A thread pool is useful when you expect threads to be coming into and out of existence fairly regularly, not for a predefined set number of threads.

Hmm.. as Ragoczy points out, its better to use FileSystemWatcher to monitor the files. However, since you have additional operations, you may think in terms of multithreading.
But beware, no matter how many processers you have, there is a limit to it's capacity. You may not want to create as many threads as the number of concurrent users, for the simple reason that your number of agents can increase.

Until you upgrade to .NET 4, the ThreadPool might be your best option. You may also want to use a Semaphore and a AutoResetEvent to control the number of concurrent threads. If you're talking about long-running work then the overhead of starting up and managing your own threads is low and the solution is more elegant. That will allow you to use a WorkerThread.Join() so you can make sure all worker threads are complete before you resume execution.

C# Multithreading File IO (Reading)

We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.
Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary
We would like to perform each file's work on a new thread?
Is this possible and should be we better to use the ThreadPool or spawn new threads keeping in mind that each item of "work" only takes 30ms however its possible that hundreds of files will need to be processed.
Any ideas to make this more efficient is appreciated.
EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.
Is it suitable to make use of the threadpool for this?

I would suggest you to use ThreadPool.QueueUserWorkItem(...), in this, threads are managed by the system and the .net framework. The chances of you meshing up with your own threadpool is much higher. So I would recommend you to use Threadpool provided by .net .
It's very easy to use,
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);
YourMethod(object o){
Your Code here...
}
For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx
Hope, this helps

I suggest you have a finite number of threads (say 4) and then have 4 pools of work. I.e. If you have 400 files to process have 100 files per thread split evenly. You then spawn the threads, and pass to each their work and let them run until they have finished their specific work.
You only have a certain amount of I/O bandwidth so having too many threads will not provide any benefits, also remember that creating a thread also takes a small amount of time.

Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):
var filesContent = from file in enumerableOfFilesToProcess
select new
{
File=file,
Content=File.ReadAllText(file)
};
var processedContent = from content in filesContent
select new
{
content.File,
ProcessedContent = ProcessContent(content.Content)
};
var dictionary = processedContent
.AsParallel()
.ToDictionary(c => c.File);
PEX will handle thread management according to available cores and load while you get to concentrate about the business logic at hand (wow, that sounded like a commercial!)
PEX is part of the .Net Framework 4.0 but a back-port to 3.5 is also available as part of the Reactive Framework.

I suggest using the CCR (Concurrency and Coordination Runtime) it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you attempt to write to the dictionary, because you may create heavy contention since dictionaries aren't thread safe.
Here's some sample code using the CCR, an Interleave would work nicely here:
Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
new TeardownReceiverGroup(Arbiter.Receive<bool>(
false, mainPort, new Handler<bool>(Teardown))),
new ExclusiveReceiverGroup(Arbiter.Receive<object>(
true, mainPort, new Handler<object>(WriteData))),
new ConcurrentReceiverGroup(Arbiter.Receive<string>(
true, mainPort, new Handler<string>(ReadAndProcessData)))));
public void WriteData(object data)
{
// write data to the dictionary
// this code is never executed in parallel so no synchronization code needed
}
public void ReadAndProcessData(string s)
{
// this code gets scheduled to be executed in parallel
// CCR take care of the task scheduling for you
}
public void Teardown(bool b)
{
// clean up when all tasks are done
}

In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.
Build a worker class that does the processing and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

The general rule for using the ThreadPool is if you don't want to worry about when the threads finish (or use Mutexes to track them), or worry about stopping the threads.
So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track the overall progress, stop threads then your own collection of threads is best.
ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
Hth

Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than helping it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned it's own ThreadPool that is initialized with ~100 thread capacity. When you are executing 400 operations in a parallel, it does not take long to fill the queue with requests and now you have ~100 threads all competing for CPU cycles. Yes the .NET framework does a great job with throttling and prioritizing the queue, however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:
Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This will cause the .NET framework to use native API's to thread and execute the IO (IOCP).
This will give you 2 leverages, one is that your requests will still get processed in parallel while allowing the operating system to manage file system access and threading. The second is that because the bottleneck of the vast majority of systems will be the HDD, you can implement a custom priority sort and throttling to your request thread to give greater control over resource usage.
Currently I have been writing a similar application and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved, however, it made my PC as slow as if an application was using 80%+ of the CPU. This was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused, they are optimized for performance, even if that performance means your HDD is squeeling like a pig.
The only problem I have had is memory usage ran a little high (50+ mb) during the testing phaze with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.
Hope this helps somebody with their decision making in the future :)

Threading cost - minimum execution time when threads would add speed

I am working on a C# application that works with an array. It walks through it (meaning that at one time only a narrow part of the array is used). I am considering adding threads in it to make it perform faster (it runs on a dualcore computer). The problem is that I do not know if it would actually help, because threads cost something and this cost could easily be more than the parallel gain... So how do I determine if threading would help?

Try writing some benchmarks that mimic, as closely as possible, the real-world conditions in which your software will actually be used.
Test and time the single-threaded version. Test and time the multi-threaded version. Compare the two sets of results.

If your application is CPU bound (i.e. it isn't spending time trying to read files or waiting for data from a device) and there is little to no sharing of live data (data being altered, if its read only its fine) between the threads then you can pretty much increase the speed by 50->75% by adding another thread (as long as it still remains CPU bound of course).
The main overhead in multithreading comes from 2 places.
Creation & initialization of the thread. Creating a thread requires quite a few resources to be allocated and involves swaps between kernel and user mode, this is expensive though a once off per thread so you can pretty much ignore it if the thread is running for any reasonable amount of time. The best way to mitigate this problem is to use a thread pool as it will keep the thread on hand and not need to be recreated.
Handling synchronization of data. If one thread is reading from data that another is writing, bad things will generally happen (worse if both are changing it). This requires you to lock your data before altering it so that no thread reads a half written value. These locks are generally quite slow as well. To mitigate this problem, you need to design your data layout so that the threads don't need to read or write to the same data as much as possible. If you do need a lot of these locks it can then become slower than the single thread option.
In short, if you are doing something that requires the CPU's to share a lot of data, then multi-threading it will be slower and if the program isn't CPU bound there will be little or no difference (could be a lot slower depending on what it is bound to, e.g. a cd/hard drive). If your program matches these conditions, then it will PROBABLY be worthwhile to add another thread (though the only way to be certain would be profiling).
One more little note, you should only create as many CPU bound threads as you have physical cores (threads that idle most of the time, such as a GUI message pump thread, can be ignored for this condition).
P.S. You can reduce the cost of locking data by using a methodology called "lock-free programming", though this something that should really only be attempted by people with a lot of experience with multi-threading and a clear understanding of their target architecture (including how the cache is treated and the memory bus).

I agree with Luke's answer. Benchmark it, it's the only way to be sure.
I can also give a prediction of the results - the fastest version will be when the number of threads matches the number of cores, EXCEPT if the array is very small and each thread would have to process just a few items, the setup/teardown times might get larger than the processing itself. How few - that depends on what you do. Again - benchmark.
I'd advise to find out a "minimum number of items for a thread to be useful". Then, when you are deciding how many threads to spawn (or take from a pool), check how many cores the computer has and how many items there are. Spawn as many threads as possible, but no more than the computer has cores, and not so many that each thread would have less than the minimum number of items to process.
For example if the minimum number of items is, say, 1000; and the computer has 4 cores; and your list contains 2500 items, you would spawn just 2 threads, because more threads would be inefficient (each would process less than 1000 items).

Making a step by step list for Luke's idea:
Make a single threaded test app
Download Sysinternals Process Monitor and run it
Run your test app and find it on the process list (remember to run it as a release build outside of Visual Studio)
Double click the process and select the Performance Graph tab
Observe the CPU time used by your process
If the CPU time is sittling flat 50% for more than a few seconds, you can probably speed your overall process up using threads (assuming the bunch of stuff Mr Peters refered to holds true)
(However, the best you can do on a duel core machine is to halve the time it takes to run. If your process only take 4 seconds, it might not be worth getting it to run in 2 seconds)
Using the task parallel library / Rx provides a friendlier interface than System.Threading.ThreadPool, which might make your world a bit easier.

You miss imho one item, which is that it is not always about execution time. There is:
The problem to koop a UI operational during an operation. Even if the UI is "dormant", a nonresponsive message pump makes a worse impression.
The possibility to use a thread pool to actually not ahve to start / stop threads all the time. I use thread pools very extensively, and various parts of the applications keep them busy.
Anyhow, ignoring my point 1 - where you may go multi threaded without speeding things up in order to keep your UI responsive - I would say it is always then faster when you can actually either split up work (so you can keep more than one core busy) or offload it for othe reasons.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.