Why isn't this Parallel.ForEach loop improving performance? - c#

I have the following code:
if (!this.writeDataStore.Exists(mat))
{
    BlockingCollection<ImageFile> imageFiles = new BlockingCollection<ImageFile>();
    Parallel.ForEach(fileGrouping, fi => DecompressAndReadGzFile(fi, imageFiles));
    this.PushIntoDb(mat, imageFiles.ToList());
}
DecompressAndReadGzFile is a static method in the same class that this method is contained in. As per the method name, I am decompressing and reading gz files, lots of them, i.e. up to 1000, so the overhead of parallelisation is worth it for the benefits. However, I'm not seeing the benefits. When I use the ANTS performance profiler I see that they are running at exactly the same times as if no parallelisation is occurring. I also checked the CPU cores with Process Explorer and it looks like there is possibly work being done on two cores, but one core seems to be doing most of the work. What am I not understanding as far as getting Parallel.ForEach to decompress and read files in parallel?
UPDATED QUESTION: What is the fastest way to read information in from a list of files?
The Problem (simplified):
There is a large list of .gz files (1200).
Each file has a line containing "DATA: "; its location and line number are not static and can vary from file to file.
We need to retrieve the first number after "DATA: " (just for simplicity's sake) and store it in an object in memory (e.g. a List)
In the initial question, I was using the Parallel.ForEach loop but I didn't seem to be CPU bound on more than 1 core.

Is it possible that the threads are spending most of their time waiting for IO? By reading multiple files at a time, you may be making the disk thrash more than it would with a single operation. It's possible that you could improve performance by using a single thread reading sequentially, but then doling out the CPU-bound decompression to separate threads... but you may actually find that you only really need one thread performing the decompression anyway, if the disk is slower than the decompression process itself.
One way to test this would be to copy the files requiring decompression onto a ramdisk first and still use your current code. I suspect you'll then find you're CPU-bound, and all the processors are busy almost all the time.
(You should also consider what you're doing with the decompressed files. Are you writing those back to disk? If so, again there's the possibility that you're basically waiting for a thrashing disk.)
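If you want to experiment with that split, here is a minimal sketch, assuming fileGrouping yields FileInfo objects and that the CPU-bound part of DecompressAndReadGzFile can be factored into a hypothetical DecompressBytes helper. It is not your original code, just an illustration of one reader feeding several decompression workers:
// Needs System.Collections.Concurrent, System.IO, System.Linq and System.Threading.Tasks.
var rawFiles = new BlockingCollection<byte[]>(boundedCapacity: 16);
var results = new ConcurrentBag<ImageFile>();

// One reader keeps the disk streaming sequentially.
var reader = Task.Run(() =>
{
    foreach (var fi in fileGrouping)
        rawFiles.Add(File.ReadAllBytes(fi.FullName));   // sequential I/O only
    rawFiles.CompleteAdding();
});

// CPU-bound decompression spread over the available cores.
var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var bytes in rawFiles.GetConsumingEnumerable())
            results.Add(DecompressBytes(bytes));        // hypothetical CPU-only helper
    }))
    .ToArray();

Task.WaitAll(workers);
reader.Wait();
If the ramdisk test shows you really are CPU-bound, this shape lets you scale the decompression without multiplying the disk seeks.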

Is there any chance your static method is sharing any global resource among its calls? Because in that case the calls will effectively run sequentially and there will be no parallel benefit.
Can you post your fileGrouping class code?

Related

CPU & Memory spikes during Parallel.ForEach

I am building an application for work to copy files and folders, with a few extra options, but those are not being used when this issue occurs.
The function in question iterates through each file in a directory, and then copies the file to an identical directory, in a new location (so it preserves nested file structures).
The application is a Windows Form, and due to issues writing to a text box at the same time, I have surrounded the parallel function in a Task.Factory.StartNew(), which fixed that issue.
Task.Factory.StartNew(() =>
{
    Parallel.ForEach(Directory.GetFiles(root, "*.*", SearchOption.AllDirectories), newPath =>
    {
        try
        {
            File.Copy(newPath, newPath.Replace(root, destination), false);
            WriteToOutput("recreated the file '" + newPath.Replace(root, destination) + "'");
        }
        catch (Exception e)
        {
            WriteToOutput(e.Message);
        }
    });
});
When run, the diagnostic tools show spikes every few seconds. How can I 'even out' these spikes and make the performance consistent? I am also writing to the screen for each file that is moved, and there is a noticeable pause of a second or so after maybe every 20-25 files.
The below screenshot is a sample from the Diagnostic Tools.
Your work is primarily IO bound, not CPU bound. You don't have any work for a CPU to do most of the time; you're just waiting for the hard drive to do its work. The spikes in your CPU are merely the short periods after the disk has finished an operation where the CPU is trying to figure out what to ask it to do next, which takes very little time, which is why you see spikes, not plateaus.
I am concerned by this sentence:
due to issues writing to a text box at the same time, I have surrounded the parallel function in a Task.Factory.StartNew(), which fixed that issue
I honestly doubt that fixed the issue. It probably concealed it. You do not appear to be awaiting or checking on the Task, so you are therefore not observing any exceptions. The short CPU spike and the delay in output could easily be caused by a stack unwind of some kind.
If you're having trouble updating the UI from your worker threads, make sure you understand the purpose of Invoke and be sure you are using it. Then get rid of the StartNew, or make sure you are handling any exceptions.
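As a rough sketch of what that looks like in Windows Forms, assuming a TextBox called outputTextBox (the control name is a placeholder, not from your question):
private void WriteToOutput(string message)
{
    if (outputTextBox.InvokeRequired)
    {
        // Called from a worker thread: marshal the call onto the UI thread.
        outputTextBox.Invoke(new Action(() => WriteToOutput(message)));
        return;
    }

    outputTextBox.AppendText(message + Environment.NewLine);
}
With that in place, the Parallel.ForEach can run directly (or inside a Task you actually wait on), and any exception will surface instead of disappearing into an unobserved task.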
What you're doing is hammering the disk with many file read requests in parallel. Disks, like any other I/O device, don't work well in that mode.
For one thing, if you're reading from an HDD, it definitely cannot answer the parallel requests, simply because it would have to move the read head to multiple locations at the same time.
Even with an SSD, the device cannot answer requests at the same rate at which the CPU can issue them.
In any case, the disk will definitely not be able to return the data at a uniform speed. Many file read requests will be pending for an eternity (measured in CPU time), leaving those tasks blocked. That is the reason why performance is uneven when storming the disk with many parallel operations.
When attempting to process many files, you might choose to allocate one task to read them, and then process the loaded data in parallel. Think about that design instead. There would be only one I/O-bound task, and it won't be blocked more than necessary. That will let the drive return the data at the maximum speed it can achieve at the time. The CPU-bound tasks would be non-blocking, obviously, because their data would already be in memory by the time any of them is started. I would expect that design to provide smooth performance.

Async/Await vs Parallel.For, which is better in this instance?

So I have 1000s of items to check whether they are up to date. Each one of those items requires reading thousands of files (some of which might be the same file across different items).
Currently this is implemented using the TPL (async/await), with one task for each file it has to read and one for each item it has to check. This works fine, except that when I profile it, roughly the third most expensive function is TrySteal in the thread pool.
Using the Visual Studio Concurrency Visualizer, I see that 99% of a thread's time is spent on concurrency-related items, and only 1% on execution. It is this that leads me to think that I am perhaps just creating too many tasks (note: I don't use Task.Run anywhere, just await).
Would Parallel.For have any less overhead than reading a bunch of files using async/await? How much overhead should I expect from the Task Parallel Library?
If you are checking files on the hard drive, I don't think this task parallelizes very well. If you try to read thousands of files at the same time, you just make the process much slower, because the drive cannot read that many of them at once and, even worse, it cannot cache that many in memory.
The fastest option, without optimizing the checking process itself, should be just running it sequentially.
If you really want to optimize it, I suggest looping through the files and checking every item against the current file, instead of looping through the items and re-reading each file for every item. In that case, it might even be effective to do it in multiple threads (though not all at once).
Update:
For the case where you have enough memory to cache all your files, the disk does not restrict multithreading that much. Still, I would suggest limiting the number of parallel threads to something comparable to the number of processor cores you are going to work with. It is easiest to do this with Parallel.ForEach(), which also states clearly that your loop runs in parallel, so the code will be easier to understand.
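A minimal sketch of that, with items as a placeholder collection and IsUpToDate standing in for whatever file-reading check your real code performs:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

Parallel.ForEach(items, options, item =>
{
    // IsUpToDate is a hypothetical stand-in for the real check that reads the item's files.
    item.UpToDate = IsUpToDate(item);
});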

Write Large File Listing To File Efficiently in C#

I have what I would consider to be a fairly common problem, but have not managed to find a good solution on my own or by browsing this forum.
Problem
I have written a tool to get a file listing of a folder with some additional information such as file name, file path, file size, hash, etc.
The biggest problem that I have is that some of the folders contain millions of files (maybe 50 million in the structure).
Possible Solutions
I have two solutions, but neither of them are ideal.
Every time a file is read, the information is written straight to file. This is OK, but it means I can't multi-thread the process without running into issues with threads locking the output file.
Every time a file is read, the information is added to some form of collection such as a ConcurrentBag. This means I can multi-thread the enumeration of the files and add them to the collection. Once the enumeration is done, I can write the whole collection to a file using File.WriteAllLines; however, adding 50 million entries to the collection makes most machines run out of memory.
Other Options
Is there any way to add items to a collection and then write them to a file when it gets to a certain number of records in the collection or something like that?
I looked into a BlockingCollection, but that will just fill up really quickly, since the producers are multi-threaded while the consumer would only be single-threaded.
Create a FileStream that is shared by all threads. Before writing to that FileStream, a thread must lock it. FileStream has an internal buffer (4096 bytes if I remember right), so it doesn't actually write to disk on every call. You can wrap it in a BufferedStream if 4096 bytes is still not enough.
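A minimal sketch of that approach, with listing.txt, writeLock and AppendRecord as placeholder names (needs System.IO):
private static readonly object writeLock = new object();
private static readonly StreamWriter listingWriter =
    new StreamWriter(new BufferedStream(
        new FileStream("listing.txt", FileMode.Create, FileAccess.Write, FileShare.None),
        1 << 20));   // ~1 MB buffer; the size is a guess, tune as needed

private static void AppendRecord(string line)
{
    lock (writeLock)
    {
        listingWriter.WriteLine(line);
    }
}

// Remember to Flush/Dispose listingWriter once every producer thread has finished.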
BlockingCollection is precisely what you need. You can create one with a large buffer and have a single writer thread writing to a file that it keeps open for the duration of the run.
If reading is the dominant operation time-wise the queue will be near empty the whole time and total time will be just slightly more than the read time.
If writing is the dominant operation time-wise, the queue will fill up until you reach the limit you set (to prevent out-of-memory situations) and producers will only advance as the writer advances. The total time will be the time needed to write all the records to a single file sequentially, and you cannot do better than that (when the writer is the slowest part).
You may be able to get slightly better performance by pipelining through multiple blocking collections, e.g. making the hash calculation (a CPU-bound operation) separate from the read and write operations. If you want to do that, though, consider the TPL Dataflow library.
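A minimal sketch of the bounded-queue-plus-single-writer shape described above, assuming a rootFolder variable and a hypothetical DescribeFile method that builds the name/size/hash line:
// Needs System.Collections.Concurrent, System.IO and System.Threading.Tasks.
var records = new BlockingCollection<string>(boundedCapacity: 100000);

// Single consumer: keeps the output file open and writes sequentially.
var writer = Task.Run(() =>
{
    using (var output = new StreamWriter("listing.txt"))
    {
        foreach (var line in records.GetConsumingEnumerable())
            output.WriteLine(line);
    }
});

// Multi-threaded producers enumerate and hash the files; Add blocks once the bound is hit.
Parallel.ForEach(Directory.EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories),
    path => records.Add(DescribeFile(path)));   // DescribeFile is hypothetical

records.CompleteAdding();   // lets the writer drain the queue and exit
writer.Wait();
The bounded capacity is what gives you the back-pressure: producers stall instead of filling memory when the writer falls behind.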

Threads vs Processes in .NET

I have a long-running process that reads large files and writes summary files. To speed things up, I'm processing multiple files simultaneously using regular old threads:
ThreadStart ts = new ThreadStart(Work);
Thread t = new Thread(ts);
t.Start();
What I've found is that even with separate threads reading separate files and no locking between them and using 4 threads on a 24-core box, I can't even get up to 10% on the CPU or 10% on disk I/O. If I use more threads in my app, it seems to run even more slowly.
I'd guess I'm doing something wrong, but where it gets curious is that if I start the whole exe a second and third time, then it actually processes files two and three times faster. My question is, why can't I get 12 threads in my one app to process data and tax the machine as well as 4 threads in 3 instances of my app?
I've profiled the app and the most time-intensive and frequently called functions are all string processing calls.
It's possible that your computing problem is not CPU bound, but I/O bound. It doesn't help to state that your disk I/O is "only at 10%"; I'm not sure such a performance counter even exists.
The reason why it gets slower with more threads is that those threads are all trying to get to their respective files at the same time, while the disk subsystem is having a hard time trying to accommodate all of the different threads. You see, even with a modern technology like SSDs, where the seek time is several orders of magnitude smaller than with traditional hard drives, there's still a penalty involved.
Rather, you should conclude that your problem is disk bound and a single thread will probably be the fastest way to solve your problem.
One could argue that you could use asynchronous techniques to process a bit that's been read, while on the background the next bit is being read in, but I think you'll see very little performance improvement there.
I had a similar problem not too long ago in a small tool where I wanted to calculate MD5 signatures of all the files on my hard drive, and I found that the CPU is way too fast compared to the storage system; I got similar results when trying to get more performance by using more threads.
Using the Task Parallel Library didn't alleviate this problem.
First of all, on a 24-core box, if you are using only 4 threads the most CPU they could ever use is 16.7%, so at 10% you are really getting about 60% utilization of those threads, which is fairly good.
It is hard to tell whether your program is I/O bound at this point; my guess is that it is. You need to run a profiler on your project and see which sections of code it spends most of its time in. If it is sitting on a read/write operation, it is I/O bound.
It is possible you have some form of inter-thread locking in use. That would cause the program to slow down as you add more threads, and yes, running a second process would work around that, but fixing your locking would too.
What it all boils down to is that without profiling information we cannot say whether using a second process will speed things up or slow them down; we need to know whether the program is hanging on an I/O operation, a locking operation, or just spending a long time in a function that could be parallelized better.
I think you have found that the file cache is not ideal when one process writes data to many files concurrently. The file cache syncs to disk when the number of dirty cache pages exceeds a threshold, and it seems that concurrent writers in one process hit that threshold faster than a single-threaded writer does. You can read about the file system cache here: File Cache Performance and Tuning.
Try using the Task library from .NET 4 (System.Threading.Tasks). This library has built-in optimizations for different numbers of processors.
I have no clue what your problem is, perhaps because your code snippet is not very informative.
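For what it's worth, a minimal sketch of that suggestion, with filesToProcess and ProcessFile as placeholders (the original Work() takes no arguments, so adapt accordingly):
// Needs System.Linq and System.Threading.Tasks.
Task[] tasks = filesToProcess
    .Select(file => Task.Factory.StartNew(() => ProcessFile(file)))
    .ToArray();

Task.WaitAll(tasks);   // observe exceptions here instead of losing them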

C#: poor performance with multithreading with heavy I/O

I've written an application in C# that moves jpgs from one set of directories to another set of directories concurrently (one thread per fixed subdirectory). The code looks something like this:
string destination = "";
DirectoryInfo dir = new DirectoryInfo("");
DirectoryInfo[] subDirs = dir.GetDirectories();
foreach (DirectoryInfo d in subDirs)
{
    FileInfo[] files = d.GetFiles();
    foreach (FileInfo f in files)
    {
        f.MoveTo(destination);
    }
}
However, the performance of the application is horrendous: tons of page faults per second. The number of files in each subdirectory can get quite large, so I think a big performance penalty comes from context switching, where it can't keep all the different file arrays in RAM at the same time and so it's going to disk nearly every time.
There's a two different solutions that I can think of. The first is rewriting this in C or C++, and the second is to use multiple processes instead of multithreading.
Edit: The files are named based on a time stamp, and the directory they are moved to is based on that name. So the directories they are moved to would correspond to the hour each file was created; 3-27-2009/10, for instance.
We are creating a background worker per directory for threading.
Any suggestions?
Rule of thumb: don't parallelize operations with serial dependencies. In this case your hard drive is the bottleneck, and too many threads are just going to make performance worse.
If you are going to use threads, try to limit their number to the number of resources you have available (cores and hard disks), not the number of jobs you have pending (directories to copy).
Reconsidered answer
I've been rethinking my original answer below. I still suspect that using fewer threads would probably be a good idea, but as you're just moving files, it shouldn't actually be that IO intensive. It's possible that just listing the files is taking a lot of disk work.
However, I doubt that you're really running out of memory for the files. How much memory have you got? How much memory is the process taking up? How many threads are you using, and how many cores do you have? (Using significantly more threads than you have cores is a bad idea, IMO.)
I suggest the following plan of attack:
Work out where the bottlenecks actually are. Try fetching the list of files but not actually moving them. See how hard the disk is hit, and how long it takes.
Experiment with different numbers of threads, with a queue of directories still to process.
Keep an eye on the memory use and garbage collections. The Windows performance counters for the CLR are good for this.
Original answer
Rewriting in C or C++ wouldn't help. Using multiple processes wouldn't help. What you're doing is akin to giving a single processor a hundred threads - except you're doing it with the disk instead.
It makes sense to parallelise tasks which use IO if there's also a fair amount of computation involved, but if it's already disk bound, asking the disk to work with lots of files at the same time is only going to make things worse.
You may be interested in a benchmark (description and initial results) I've recently been running, testing "encryption" of individual lines of a file. When the level of "encryption" is low (i.e. it's hardly doing any CPU work) the best results are always with a single thread.
If you've got a block of work that is dependent on a system bottleneck, in this case disk IO, you would be better off not using multiple threads or processes. All that you will end up doing is generating a lot of extra CPU and memory activity while waiting for the disk. You would probably find the performance of your app improved if you used a single thread to do your moves.
It seems you are moving a directory; surely just renaming/moving the directory would be sufficient. If the source and destination are on the same hard disk, it would be instant.
Also, capturing all the file info for every file is unnecessary; just the name of the file would suffice.
The performance problem comes from the hard drive; there is no point in redoing everything in C/C++, nor in using multiple processes.
Are you looking at the page-fault count and inferring memory pressure from that? You might well find that the underlying Win32/OS file copy is using mapped files/page faults to do its work, and the faults are not a sign of a problem anyway. Much of Windows' own file handling is done via page faults (e.g. 'loading' executable code) - they're not a bad thing per se.
If you are suffering from memory pressure, then I would surmise that it's more likely to be caused by creating a huge number of threads (which are very expensive), rather than by the file copying.
Don't change anything without profiling, and if you profile and find the time is spent in framework methods which are merely wrappers on Win32 functions (download the framework source and have a look at how those methods work), then don't waste time on C++.
If GetFiles() is indeed returning a large set of data, you could write an enumerator, as in:
IEnumerable<string> GetFiles();
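In .NET 4 and later you don't even have to write it yourself; Directory.EnumerateFiles streams the paths lazily, so memory stays flat even for huge folders. A sketch, with sourceDir and destinationDir as placeholder names and a deliberately naive path mapping:
foreach (string path in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
{
    // The mapping here is only illustrative; the real one is based on the timestamp.
    string target = path.Replace(sourceDir, destinationDir);
    File.Move(path, target);
}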
So, you're moving files, one at a time, from one subfolder to another subfolder? Wouldn't you be causing lots of disk seeks as the drive head moves back and forth? You might get better performance from reading the files into memory (at least in batches if not all at once), writing them to disk, then deleting the originals from disk.
And if you're doing multiple sets of folders in separate threads, then you're moving the disk head around even more. This is one case where multiple threads isn't doing you a favor (although you might get some benefit if you have a RAID or SAN, etc).
If you were processing the files in some way, then multithreading could help if different CPUs could work on multiple files at once. But you can't get four CPUs to move one disk head to four different locations at once.
