file copy is faster after one pass

I have a program that copies a large number of files from several directories into one directory. The exact number is not known (about 50K), but they are all on the same drive. The program should show a progress bar. When I first wrote the program I did not include the progress bar, and it ran slowly: it took about 15-20 minutes to copy the files. To drive the progress bar I needed to know how many files there are, so I now walk the directories first and list the files. That first run through the files takes about 5 minutes, but the copy then takes only 5-7 minutes.
Can anyone explain the phenomenon? I'm sorry that I can't share the code, but it's a simple use of File.Copy and a simple C# .NET 3.5 ProgressBar.

This approach minimizes the most expensive operation on a disk drive, moving the read head. Disk speeds are rated by two basic mechanical constraints. One is how fast the platters spin, which sets an upper bound on the data transfer speed. That's fixed. The other is how fast the read head can be moved to another track: the seek time. Around a dozen milliseconds for an average seek is typical, which is a very long time in CPU cycles. That makes the order in which you access disk data very important. Constantly jumping the read head back and forth between the directory records and the file data clusters, like you did originally, is very expensive.
To what degree the data on the disk is fragmented is also very important, the reason defrag utilities exist. A drive that sees a high rate of files getting created and deleted tends to get fragmented quicker. The higher the fragmentation rate, the more disk seeks you'll need to read data from the drive.
By reading the directory entries first you can avoid a lot of seeks. They are localized in an area of the drive called the MFT, physically close to each other so far fewer long seeks. You'll read them again when you actually start copying the files, but this time they come from the file system cache. Stored in RAM when you scanned the directories the first time. So no need for an expensive seek back to the MFT.
Also notably the reason why SSDs work so much better than mechanical drives, they have a very low seek time.
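A minimal sketch of the two-pass pattern the question describes, assuming all files go into a single target folder. The class name TwoPassCopy, the method names, and the progress callback are hypothetical, since the original code was not shared; overwriting on name clashes is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TwoPassCopy
{
    // Pass 1: walk every source directory and collect the file paths.
    // Nothing is copied yet, so the disk mostly reads directory metadata.
    public static List<string> ListFiles(IEnumerable<string> sourceDirs)
    {
        var files = new List<string>();
        foreach (string dir in sourceDirs)
            files.AddRange(Directory.GetFiles(dir, "*", SearchOption.AllDirectories));
        return files;
    }

    // Pass 2: copy the files and report progress as (done, total).
    public static void CopyAll(IList<string> files, string targetDir, Action<int, int> reportProgress)
    {
        Directory.CreateDirectory(targetDir);
        for (int i = 0; i < files.Count; i++)
        {
            string target = Path.Combine(targetDir, Path.GetFileName(files[i]));
            File.Copy(files[i], target, true); // overwrite on name clashes (assumption)
            reportProgress(i + 1, files.Count);
        }
    }
}
```

The list returned by ListFiles gives the total for the progress bar, and the callback passed to CopyAll could simply set the bar's value.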

This is not a phenomenon, it is Caching

Related

.Net Write continuously data to the disk in different files

We have an application that extracts data from several hardware devices. Each device's data should be stored in a different file.
Currently we have one FileStream per file and do a write whenever data comes in, and that's it.
We have a lot of data coming in, and the disk is struggling when it is an HDD (not with an SSD), I guess because flash is faster, but also because an SSD does not have to jump between different file locations all the time.
Some metrics for the default case: 400 different data sources (each should have its own file), and we receive ~50KB/s from each source (so 20MB/s in total). Each data source acquisition runs concurrently, and in total we use ~6% of the CPU.
Is there a way to organize the flushes to the disk to ensure a better flow?
We will also consider improving the hardware, but that's not really the subject here, since it's an obvious way to improve our read/write performance.

Windows and NTFS handle multiple concurrent sequential IO streams to the same disk terribly inefficiently. Probably, you are suffering from random IO. You need to schedule the IO yourself in bigger chunks.
You might also see extreme fragmentation. In such cases NTFS sometimes allocates every Nth sector to each of the N files. It is hard to believe how bad NTFS is in such scenarios.
Buffer data for each file until you have like 16MB. Then, flush it out. Do not write to multiple files at the same time. That way you have one disk seek for each 16MB segment which reduces seek overhead to near zero.
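The buffering advice above could be sketched roughly as follows. ChunkedWriter is a hypothetical helper, not an existing API; the flush threshold is a constructor parameter so that something like 16 MB can be tuned by measurement, and a single lock is assumed to be acceptable for serializing the writes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collects incoming data per file in memory and
// writes one file at a time in large sequential chunks.
sealed class ChunkedWriter
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, MemoryStream> _buffers =
        new Dictionary<string, MemoryStream>();
    private readonly int _flushThreshold;

    public ChunkedWriter(int flushThresholdBytes)
    {
        _flushThreshold = flushThresholdBytes;
    }

    // Called by the acquisition threads whenever a device delivers data.
    public void Append(string path, byte[] data)
    {
        lock (_gate) // flushing under the lock means only one file is written at a time
        {
            MemoryStream buffer;
            if (!_buffers.TryGetValue(path, out buffer))
            {
                buffer = new MemoryStream();
                _buffers[path] = buffer;
            }
            buffer.Write(data, 0, data.Length);
            if (buffer.Length >= _flushThreshold)
                Flush(path, buffer);
        }
    }

    // Writes out whatever is still buffered, for example at shutdown.
    public void FlushAll()
    {
        lock (_gate)
        {
            foreach (KeyValuePair<string, MemoryStream> entry in _buffers)
                Flush(entry.Key, entry.Value);
        }
    }

    private static void Flush(string path, MemoryStream buffer)
    {
        if (buffer.Length == 0) return;
        using (var file = new FileStream(path, FileMode.Append, FileAccess.Write))
            buffer.WriteTo(file);
        buffer.SetLength(0); // reuse the buffer for the next chunk
    }
}
```

Because the flush happens while the lock is held, producers wait during a disk write. A dedicated writer thread with a queue would avoid that, at the cost of more code.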

Use multithreading for multiple file copies

I have to copy a large number of files (10,000 files), and the copy takes a long time.
I have tried using two threads instead of a single thread: one copies the odd-numbered files in the list and the other copies the even-numbered ones.
I have used this code:
ThreadPool.QueueUserWorkItem(new WaitCallback(this.RunFileCopy),object)
but there is no significant difference in time when using single thread and when using two threads.
What could be the reason for this?

File copying is not a CPU-bound process, it is an IO-bound process, so multithreading or parallelism won't help you.
Multithreading will slow you down in almost all cases. Even if the disk is an SSD, it has a limited read/write speed, and a single thread will already use it efficiently. If you use parallelism you just split that speed into pieces, and on an HDD this creates a huge overhead.
Multithreading only helps when more than one disk is involved, that is, when you read from different disks and write to different disks.
If the files are very small, zipping them and unzipping the archive on the target drive can be faster in most cases, and with a low compression level the zipping itself is quite fast.
using System.IO;
using System.IO.Compression;
.....
string startPath = @"c:\example\start";
string zipPath = @"c:\example\result.zip";
string extractPath = @"c:\example\extract";
ZipFile.CreateFromDirectory(startPath, zipPath, CompressionLevel.Fastest, true);
ZipFile.ExtractToDirectory(zipPath, extractPath);
More implementation details here
How to: Compress and Extract Files

I'm going to provide a minority opinion here. Everybody is telling you that Disk I/O is preventing you from getting any speedup from multiple threads. That's ... sort ... of right, but...
Given a single disk request, the OS can only choose to move the heads to the point on the disk selected implicitly by the file access, usually incurring an average of half of the full-stroke seek time (tens of milliseconds) and rotational delays (another 10 milliseconds) to access the data. And sticking with single disk requests, this is a pretty horrendous (and unavoidable) cost to pay.
Because disk accesses take a long time, the OS has plenty of CPU to consider the best order to access the disk when there are multiple requests, should they occur while it is already waiting for the disk to do something. The OS does so usually with an elevator algorithm, causing the heads to efficiently scan across the disk in one direction in one pass, and scan efficiently in the other direction when the "furthest" access has been reached.
The idea is simple: if you process multiple disk requests in exactly the time order in which they occur, the disk heads will likely jump randomly about the disk (under the assumption the files are placed randomly), thus incurring the half-full seek + rotational delay on every access. With 1000 live accesses processed in order, 1000 average half-full seeks will occur. Ick.
Instead, given N near-simultaneous accesses, the OS can sort these accesses by the physical cylinder they will touch, and then process them in cylinder order. With 1000 live accesses processed in cylinder order (even with random file distributions), there is likely to be about one request per cylinder. Now the heads only have to step from one cylinder to the next, and that's a lot less than the average seek.
So having lots of requests should help the OS make better access-order decisions.
Since OP has lots of files, there's no reason he could not run a lot of threads, each copying its own file and generating demand for disk locations. He would want each thread to issue a read and write of something like a full track, so that when the heads arrive at a cylinder, a full track is read or written (under the assumption the OS lays files out contiguously on a track where it can).
OP would want to make sure his machine had enough RAM to buffer his threadcount times tracksize. An 8 GB machine with 4 GB unbusy during the copy has essentially a 4 GB disk cache. At 100 KB per track (it's been a long time since I looked), that suggests "room" for some 40,000 threads. I seriously doubt he needs that many; mostly he needs enough threads to overwhelm the number of cylinders on his disk. I'd surely consider several hundred threads.
Two threads surely is not enough. (Windows appears to use one thread when you ask it to copy a bunch of files. That's always seemed pretty dumb to me.)
Another poster suggested zipping the files. With lots of threads, and everything waiting on the disk (the elevator algorithm doesn't change that, just the average wait time), many threads can afford to throw computational cycles at zipping. This won't help with reads; files are what they are when read. But it may shorten the amount of data to write, and provide effectively larger buffers in memory, providing some additional speedup.
Note: If one has an SSD, then there are no physical cylinders, thus no seek time, and nothing for an elevator algorithm to optimize. Here, lots of threads don't buy any cylinder ordering time. They shouldn't hurt, either.
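For anyone who wants to test this claim against the single-thread advice in the other answers, here is a minimal sketch with a configurable number of concurrent copies. It assumes .NET 4 or later for Parallel.ForEach; ParallelCopy and maxParallel are hypothetical names. It only controls how many copies are in flight, not the track-sized reads and writes described above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class ParallelCopy
{
    // Copies every file into targetDir with at most maxParallel copies in flight.
    public static void CopyAll(IEnumerable<string> files, string targetDir, int maxParallel)
    {
        Directory.CreateDirectory(targetDir);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        Parallel.ForEach(files, options, file =>
        {
            string target = Path.Combine(targetDir, Path.GetFileName(file));
            File.Copy(file, target, true); // overwrite on name clashes (assumption)
        });
    }
}
```

Timing the same copy with maxParallel set to 1, 2, and a few hundred on the actual hardware would show which of the competing answers holds for a given disk.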

Performance when downloading thousands of images

I have a function that downloads thousands of images at a time from a 3rd party source. The number of images can range from 2,500 to 250,000 per run. As you can imagine, this process takes some time, and I am looking to optimize it as best I can.
The way it works is that I take a list of image paths, loop through them, and request each image from the 3rd party. Currently, before I make the request, I check whether the image already exists on the server. If it does, the image is skipped; if it does not, it is downloaded.
My question is whether the check before the download is slowing down the process (or possibly speeding it up). Would it be more efficient to download the file and let it overwrite already existing images, thus cutting out the step of checking for existence?
If anyone else has any tips for downloading this volume of images, they are welcome!

The real answer depends on three things:
1: how often you come across an image that already exists. The less often you have a hit, the less useful checking is.
2: The latency of the destination storage. Is the destination storage location local or far away? If it is in India with a 300ms latency (and probably high packet loss), the check becomes more expensive relative to the download. This is mitigated significantly by smart threading.
3: Your bandwidth / throughput from your source to your destination. The higher your bandwidth, the less downloading a file twice costs you.
If you have a less than 1% hit rate for images that already exist, you're not getting much of a gain from the check (max ~1%), but if 90% of the images already exist, it would probably be worth checking even if the destination file store is remote / far away. Either way it is a balancing act, but if you have a hit rate high enough to ask, it's likely that checking to see if you already have the file would be useful.
If images you already have don't get deleted, the best way to do this would probably be to keep a database of images that you've downloaded, and check your list of files to download against that database.
If that isn't feasible because images get deleted / renamed or something, minimize the impact of the check by threading it. The performance difference between foreach and Parallel.ForEach for operations with high latency is huge.
Finally, 250k images can be a lot of data if they're large images. It might be faster to send physical media (i.e. put the data on a hard drive and send the drive).
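As a rough illustration of checking the download list against a record of what you already have, the sketch below uses an in-memory HashSet built from one directory listing as a stand-in for the database. DownloadPlanner and FilterMissing are hypothetical names, and the sketch assumes images are identified by file name and stored in one local folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DownloadPlanner
{
    // Returns only the image names that are not already present in localDir.
    public static List<string> FilterMissing(IEnumerable<string> wanted, string localDir)
    {
        // One directory listing up front instead of one existence check per image.
        var have = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string path in Directory.GetFiles(localDir))
            have.Add(Path.GetFileName(path));

        var missing = new List<string>();
        foreach (string name in wanted)
            if (!have.Contains(name))
                missing.Add(name);
        return missing;
    }
}
```

Only the names returned by FilterMissing would then be requested from the 3rd party, sequentially or in parallel.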

Doing a
System.IO.File.Exists(pathName);
is a lot less expensive than doing a download. So it would speed up the process by avoiding the time to do the download.

Multiple Threading in the eyes of I/O operations?

I was thinking...
Will multithreading in C# for I/O operations (let's say copying many files from c:\1\ to c:\2\) perform differently from doing the operation sequentially?
The reason I'm struggling with this is that an I/O operation is, in the end, handled by one device that has to do the work. So even if I work in parallel, it will still execute those copy orders sequentially...
Or maybe my assumption is wrong?
In that case, is there any benefit to using a multithreaded copy to:
copy many small files (4 GB in total)
copy 4 big files (4 GB in total, 1000 MB each)
Thanks

Like others say, it has to be measured in the concrete application context.
But I would just like to draw attention to this.
Every time you copy a file, write permission on the destination location is checked, which is slow.
All of us have met the case where you have to copy a sequence of already compressed files: if you compress them again into one big ZIP file, so that the total compressed size is not significantly smaller than the sum of all the content, the IO operation still executes much faster. (Just try it if you haven't before; you will see a huge difference.)
So I would assume (again, it has to be measured on a concrete system, mine are just guesses) that writing one big file to a single disk will be faster than writing a lot of small files.
Hope this helps.

Multithreading with files is not so much about the CPU but about IO. This means that totally different rules apply. Different devices have different characteristics:
Magnetic disks like sequential IO
SSDs like sequential or parallel random IO (mine has 4 hardware "threads")
The network likes many parallel operations to amortize latency

I'm no expert in hard-disk related questions, but maybe this will shed some light for you:
Windows uses the NTFS file system. This system doesn't "like" lots of small files, for example files of about 1 KB. It will "magically" make 100 files of 1 KB take up 400 KB on disk instead of 100 KB, because each file occupies at least one whole cluster (4 KB by default). It is also VERY slow when dealing with a lot of "small" files. Therefore, copying one big file instead of many small files of the same total size will be much faster.
Also, from my personal experience and knowledge, multithreading will NOT speed up the copying of many files, because the actual hardware disk acts as one unit and can't be sped up by sending many requests at the same time (it will process them one by one).
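The 100 KB versus 400 KB figure follows from rounding each file up to whole clusters. A small worked sketch, assuming the default 4096-byte NTFS cluster size and ignoring very small files that NTFS can store directly in the MFT record; ClusterMath is a hypothetical name.

```csharp
static class ClusterMath
{
    // Space a file occupies when it is stored in whole clusters.
    public static long SizeOnDisk(long fileBytes, long clusterBytes)
    {
        long clusters = (fileBytes + clusterBytes - 1) / clusterBytes; // round up
        return clusters * clusterBytes;
    }
}
```

With these assumptions, 100 * ClusterMath.SizeOnDisk(1024, 4096) is 409,600 bytes, that is 400 KB on disk for 100 KB of data.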

C#: poor performance with multithreading with heavy I/O

I've written an application in C# that moves jpgs from one set of directories to another set of directories concurrently (one thread per fixed subdirectory). The code looks something like this:
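string destination = "";
DirectoryInfo dir = new DirectoryInfo("");
DirectoryInfo[] subDirs = dir.GetDirectories();
foreach (DirectoryInfo d in subDirs)
{
    // List the files of this subdirectory (the real application runs one thread per subdirectory).
    FileInfo[] files = d.GetFiles();
    foreach (FileInfo f in files)
    {
        // MoveTo expects the full target path, including the file name.
        f.MoveTo(Path.Combine(destination, f.Name));
    }
}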
However, the performance of the application is horrendous: tons of page faults/sec. The number of files in each subdirectory can get quite large, so I think a big performance penalty comes from context switching, to the point where it can't keep all the different file arrays in RAM at the same time and goes to disk nearly every time.
There are two different solutions that I can think of. The first is rewriting this in C or C++, and the second is to use multiple processes instead of multithreading.
Edit: The files are named based on a time stamp, and the directory they are moved to is based on that name. So the directory a file is moved to corresponds to the hour it was created; 3-27-2009/10 for instance.
We are creating a background worker per directory for threading.
Any suggestions?

Rule of thumb: don't parallelize operations with serial dependencies. In this case your hard drive is the bottleneck, and too many threads are just going to make performance worse.
If you are going to use threads, try to limit their number to the number of resources you have available (cores and hard disks), not the number of jobs you have pending (directories to copy).

Reconsidered answer
I've been rethinking my original answer below. I still suspect that using fewer threads would probably be a good idea, but as you're just moving files, it shouldn't actually be that IO intensive. It's possible that just listing the files is taking a lot of disk work.
However, I doubt that you're really running out of memory for the files. How much memory have you got? How much memory is the process taking up? How many threads are you using, and how many cores do you have? (Using significantly more threads than you have cores is a bad idea, IMO.)
I suggest the following plan of attack:
Work out where the bottlenecks actually are. Try fetching the list of files without actually moving them. See how hard the disk is hit, and how long it takes.
Experiment with different numbers of threads, with a queue of directories still to process.
Keep an eye on the memory use and garbage collections. The Windows performance counters for the CLR are good for this.
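The first step of the plan, listing the files without moving them and timing it, could be sketched like this. ListingTimer is a hypothetical name; the loop mirrors the directory walk from the question.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class ListingTimer
{
    // Lists the files in every subdirectory of root without moving anything.
    // Returns the file count; the elapsed time comes back through the out parameter.
    public static int TimeListing(string root, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        int count = 0;
        foreach (DirectoryInfo d in new DirectoryInfo(root).GetDirectories())
            count += d.GetFiles().Length;
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }
}
```

Comparing this time with the duration of a full run gives a first idea of how much of the cost is listing rather than moving. A second run right after the first will likely be faster because of the file system cache, so the first measurement is the more telling one.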
Original answer
Rewriting in C or C++ wouldn't help. Using multiple processes wouldn't help. What you're doing is akin to giving a single processor a hundred threads - except you're doing it with the disk instead.
It makes sense to parallelise tasks which use IO if there's also a fair amount of computation involved, but if it's already disk bound, asking the disk to work with lots of files at the same time is only going to make things worse.
You may be interested in a benchmark (description and initial results) I've recently been running, testing "encryption" of individual lines of a file. When the level of "encryption" is low (i.e. it's hardly doing any CPU work) the best results are always with a single thread.

If you've got a block of work that is dependent on a system bottleneck, in this case disk IO, you would be better off not using multiple threads or processes. All that you will end up doing is generating a lot of extra CPU and memory activity while waiting for the disk. You would probably find the performance of your app improved if you used a single thread to do your moves.

It seems you are moving a directory; surely just renaming/moving the directory would be sufficient. If the source and destination are on the same hard disk, it would be instant.
Also, capturing all the file info for every file is unnecessary; the name of the file would suffice.

The performance problem comes from the hard drive; there is no point in redoing everything in C/C++ or in using multiple processes.

Are you looking at the page-fault count and inferring memory pressure from that? You might well find that the underlying Win32/OS file copy is using mapped files/page faults to do its work, and the faults are not a sign of a problem anyway. Much of Windows' own file handling is done via page faults (e.g. 'loading' executable code); they're not a bad thing per se.
If you are suffering from memory pressure, then I would surmise that it's more likely to be caused by creating a huge number of threads (which are very expensive), rather than by the file copying.
Don't change anything without profiling, and if you profile and find the time is spent in framework methods which are merely wrappers on Win32 functions (download the framework source and have a look at how those methods work), then don't waste time on C++.

If GetFiles() is indeed returning a large set of data, you could write an enumerator, as in:
IEnumerable<string> GetFiles();
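One way such an enumerator might look, assuming .NET 4 or later, where Directory.EnumerateFiles is available (on .NET 3.5 it does not exist). LazyFiles is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.IO;

static class LazyFiles
{
    // Yields one path at a time instead of building a full array first.
    public static IEnumerable<string> GetFiles(string root)
    {
        foreach (string dir in Directory.GetDirectories(root))
            foreach (string file in Directory.EnumerateFiles(dir))
                yield return file;
    }
}
```

A caller can then write foreach (string path in LazyFiles.GetFiles(root)) and move each file as it is produced, so a very large directory never has to be held in memory as one array.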

So, you're moving files, one at a time, from one subfolder to another subfolder? Wouldn't you be causing lots of disk seeks as the drive head moves back and forth? You might get better performance from reading the files into memory (at least in batches if not all at once), writing them to disk, then deleting the originals from disk.
And if you're doing multiple sets of folders in separate threads, then you're moving the disk head around even more. This is one case where multiple threads aren't doing you a favor (although you might get some benefit if you have a RAID or SAN, etc).
If you were processing the files in some way, then multithreading could help if different CPUs could calculate on multiple files at once. But you can't get four CPUs to move one disk head to four different locations at once.
