Performance when downloading thousands of images - C#

I have a function that downloads thousands of images at a time from a 3rd party source. The number of images can range from 2,500 to 250,000 per run. As you can imagine, this process takes some time, and I am looking to optimize it as best I can.
The way it works is I take a list of image paths, loop through them, and request each image from the 3rd party. Currently, before I make the request, I check whether the image already exists on the server: if it does, I skip that image; if it does not, I download it.
My question is whether the check before the download is slowing down the process (or possibly speeding it up). Would it be more efficient to download the file and let it overwrite any already existing image, cutting out the existence check entirely?
Any other tips for downloading this volume of images are welcome!

The real answer depends on three things:
1: How often you come across an image that already exists. The less often you have a hit, the less useful checking is.
2: The latency of the destination storage. Is the destination storage location local or far away? If it is in India with a 300ms latency (and probable high packet loss), the check becomes more expensive relative to the download. This is mitigated significantly by smart threading.
3: Your bandwidth / throughput from your source to your destination. The higher your bandwidth, the less downloading a file twice costs you.
If you have a less than 1% hit rate for images that already exist, you're not getting much of a gain from the check (max ~1%), but if 90% of the images already exist, it would probably be worth checking even if the destination file store is remote / far away. Either way it is a balancing act, but if you have a hit rate high enough to ask, it's likely that checking to see if you already have the file would be useful.
If images you already have don't get deleted, the best way to do this would probably be to keep a database of images that you've downloaded, and check your list of files to download against that database.
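A minimal sketch of that idea, assuming a plain text manifest instead of a real database (the downloaded.txt file name, the imagePaths list, and keying on Path.GetFileName are all assumptions, not the OP's code; requires System.Collections.Generic, System.IO and System.Linq):
// Assumed manifest: one already-downloaded file name per line.
var downloaded = new HashSet<string>(
    File.Exists("downloaded.txt") ? File.ReadAllLines("downloaded.txt") : new string[0],
    StringComparer.OrdinalIgnoreCase);
// imagePaths is the list of remote image paths to process (placeholder name).
var toFetch = imagePaths.Where(p => !downloaded.Contains(Path.GetFileName(p))).ToList();
// ...download everything in toFetch, then append the new names to downloaded.txt
// so the next run can skip them without touching the file store at all.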
If that isn't feasible because images get deleted / renamed or something, minimize the impact of the check by threading it. The performance difference between foreach and Parallel.ForEach for operations with high latency is huge.
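As a rough sketch (the imageUrls list, the GetLocalPath helper and the degree of parallelism are placeholders I've made up; WebClient is just one of several ways to do the download; requires System.Net, System.IO and System.Threading.Tasks):
Parallel.ForEach(
    imageUrls,
    new ParallelOptions { MaxDegreeOfParallelism = 16 }, // tune for your latency/bandwidth
    url =>
    {
        string localPath = GetLocalPath(url); // hypothetical mapping from URL to server path
        if (File.Exists(localPath))
            return; // already have it, skip the download

        using (var client = new WebClient())
        {
            client.DownloadFile(url, localPath); // blocking call; the threads hide the latency
        }
    });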
Finally, 250k images can be a lot of data if they're large images. It might be faster to send physical media (i.e. put the data on a hard drive and send the drive).

Doing a
System.IO.File.Exists(pathName);
is a lot less expensive than doing a download, so the check will speed up the process whenever it lets you skip a download.

Related

Use multithreading for multiple file copies

I have to copy a large number of files (about 10,000).
Because it takes a long time to copy them,
I have tried using two threads instead of a single thread: one to copy the odd-numbered files in the list and the other to copy the even-numbered files.
I have used this code:
ThreadPool.QueueUserWorkItem(new WaitCallback(this.RunFileCopy),object)
but there is no significant difference in time between using a single thread and using two threads.
What could be the reason for this?
File copying is not a CPU-bound process, it is an IO-bound process, so multithreading or parallelism won't help you.
Multithreading will slow you down in almost all cases. Even if the disk is an SSD, it has a limited read/write speed, and a single thread will already use it efficiently. If you use parallelism you will just split that speed into pieces, and on an HDD this creates a huge seek overhead.
Multithreading only helps when more than one disk is involved, i.e. when you read from one disk and write to a different disk.
If the files are small, zipping them and unzipping on the target drive can be faster in most cases, and if you zip the files with low compression it will be faster still:
using System.IO;
using System.IO.Compression;
.....
string startPath = @"c:\example\start";     // folder whose contents will be zipped
string zipPath = @"c:\example\result.zip";  // archive to create
string extractPath = @"c:\example\extract"; // where to unpack it on the target
ZipFile.CreateFromDirectory(startPath, zipPath, CompressionLevel.Fastest, true);
ZipFile.ExtractToDirectory(zipPath, extractPath);
More implementation details here
How to: Compress and Extract Files
I'm going to provide a minority opinion here. Everybody is telling you that Disk I/O is preventing you from getting any speedup from multiple threads. That's ... sort ... of right, but...
Given a single disk request, the OS can only choose to move the heads to the point on the disk selected implicitly by the file access, usually incurring an average of half of the full-stroke seek time (tens of milliseconds) and rotational delays (another 10 milliseconds) to access the data. And sticking with single disk requests, this is a pretty horrendous (and unavoidable) cost to pay.
Because disk accesses take a long time, the OS has plenty of CPU to consider the best order to access the disk when there are multiple requests, should they occur while it is already waiting for the disk to do something. The OS does so usually with an elevator algorithm, causing the heads to efficiently scan across the disk in one direction in one pass, and scan efficiently in the other direction when the "furthest" access has been reached.
The idea is simple: if you process multiple disk requests in exactly the time order in which they occur, the disk heads will likely jump randomly about the disk (under the assumption the files are placed randomly), thus incurring the half-full seek + rotational delay on every access. With 1000 live accesses processed in order, 1000 average half-full seeks will occur. Ick.
Instead, given N near-simultaneous accesses, the OS can sort these accesses by the physical cylinder they will touch, and then process them in cylinder order. 1000 live accesses, processed in cylinder order (even with random file distributions), are likely to produce about one request per cylinder. Now the heads only have to step from one cylinder to the next, and that's a lot less than the average seek.
So having lots of requests should help the OS make better access-order decisions.
Since OP has lots of files, there's no reason he could not run a lot of threads, each copying its own file and generating demand for disk locations. He would want each thread to issue a read and write of something like a full track, so that when the heads arrive at a cylinder, a full track is read or written (under the assumption the OS lays files out contiguously on a track where it can).
OP would want to make sure his machine had enough RAM to buffer his thread count times the track size. An 8 GB machine with 4 GB unbusy during the copy has essentially a 4 GB disk cache. At 100 KB per track (it's been a long time since I looked), that suggests "room" for 10,000 threads. I seriously doubt he needs that many; mostly he needs enough threads to overwhelm the number of cylinders on his disk. I'd surely consider several hundred threads.
Two threads surely is not enough. (Windows appears to use one thread when you ask it to copy a bunch of files. That has always seemed pretty dumb to me.)
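A rough sketch of what that looks like in code (the directory names and the degree of parallelism are placeholder guesses to tune, not a recommendation):
// Give the OS many outstanding copy requests so it can reorder the disk accesses.
Parallel.ForEach(
    Directory.EnumerateFiles(sourceDir),
    new ParallelOptions { MaxDegreeOfParallelism = 64 },
    srcFile =>
    {
        string dest = Path.Combine(destDir, Path.GetFileName(srcFile));
        File.Copy(srcFile, dest, true); // overwrite if it already exists
    });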
Another poster suggested zipping the files. With lots of threads, and everything waiting on the disk (the elevator algorithm doesn't change that, just the average wait time), many threads can afford to throw computational cycles at zipping. This won't help with reads; files are what they are when read. But it may shorten the amount of data to write, and provide effectively larger buffers in memory, providing some additional speedup.
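As a sketch of that trade-off (placeholder file paths, with System.IO.Compression doing the actual work), compressing on the fly while copying looks roughly like this:
// Spend a little CPU per file to shrink what has to be written to the destination.
using (var src = File.OpenRead(srcFile))
using (var dst = File.Create(destFile + ".gz"))
using (var gzip = new GZipStream(dst, CompressionLevel.Fastest))
{
    src.CopyTo(gzip); // Fastest keeps the CPU cost low while still cutting the write size
}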
Note: If one has an SSD, then there are no physical cylinders, thus no seek time, and nothing for an elevator algorithm to optimize. Here, lots of threads don't buy any cylinder ordering time. They shouldn't hurt, either.

File copy is faster after one pass

I have a program that copies a large number of files from several directories to one directory. The exact count is not known (about 50K), but they are all on the same drive. The program should show a progress bar. When I wrote the program the first time I did not write the progress bar, and the program ran slowly; it took about 15-20 minutes to copy the files. In order to write the progress bar I needed to know how many files I have, so I now go through the directories and list the files first. The first run through the files takes about 5 minutes, but then the copy takes only 5-7 minutes.
Can anyone explain this phenomenon? I'm sorry that I can't share the code, but it's a simple use of File.Copy and a simple C# .NET 3.5 ProgressBar.
This approach minimizes the most expensive operation on a disk drive, moving the reader head. Disk speeds are rated by two basic mechanical constraints. One is how fast the platters spin, which sets an upper bound on the data transfer speed. That's fixed. The other is how fast the read head can be moved to another track. The seek time, a fat dozen milliseconds to move it by one track, is typical. A very long time in CPU cycles. Which makes the order in which you access disk data very important. Constantly jumping the reader head back and forth between the directory records and the file data clusters, like you did originally, is very expensive.
To what degree the data on the disk is fragmented is also very important, the reason defrag utilities exist. A drive that sees a high rate of files getting created and deleted tends to get fragmented quicker. The higher the fragmentation rate, the more disk seeks you'll need to read data from the drive.
By reading the directory entries first you can avoid a lot of seeks. They are localized in an area of the drive called the MFT, physically close to each other so far fewer long seeks. You'll read them again when you actually start copying the files, but this time they come from the file system cache. Stored in RAM when you scanned the directories the first time. So no need for an expensive seek back to the MFT.
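That pattern is essentially "list first, copy second". A sketch of it, with placeholder directory names and a WinForms ProgressBar (since the question mentions one):
// First pass: touch only the directory metadata (the MFT) and get the file count.
string[] files = Directory.GetFiles(sourceDir, "*.*", SearchOption.AllDirectories);
progressBar.Maximum = files.Length;

// Second pass: the directory entries are now in the file system cache,
// so the copy no longer has to seek back to the MFT for every file.
foreach (string file in files)
{
    File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)), true);
    progressBar.Increment(1);
}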
This is also, notably, the reason why SSDs work so much better than mechanical drives: they have a very low seek time.
This is not a phenomenon; it is caching.

Extremely high rates of paging active memory to disk but low constant memory usage

As the title states, I have a problem with high page file activity.
I am developing a program that process a lot of images, which it loads from the hard drive.
From every image it generates some data that I save in a list. For every 3600 images, I save the list to the hard drive; its size is about 5 to 10 MB. The program runs as fast as it can, so it maxes out one CPU thread.
The program works, it generates the data that it is supposed to, but when I analyze it in Visual Studio I get a warning saying: DA0014: Extremely high rates of paging active memory to disk.
The memory consumption of the program, according to Task Manager, is about 50 MB and seems to be stable. When I ran the program I had about 2 GB left out of 4 GB, so I guess I am not running out of RAM.
http://i.stack.imgur.com/TDAB0.png
The DA0014 rule description says "The number of Pages Output/sec is frequently much larger than the number of Page Writes/sec, for example. Because Pages Output/sec also includes changed data pages from the system file cache. However, it is not always easy to determine which process is directly responsible for the paging or why."
Does this mean that I get this warning simply because I read a lot of images from the hard drive, or is it something else? Not really sure what kind of bug I am looking for.
EDIT: Link to image inserted.
EDIT1: The images are about 300 KB each. I dispose of each one before loading the next.
UPDATE: It looks from experiments like the paging comes from just loading the large number of files. As I am no expert in C# or the underlying GDI+ API, I don't know which of the answers is most correct. I chose Andras Zoltan's answer as it was well explained and because it seems he did a lot of work to explain the reason to a newcomer like me. :)
Updated following more info
The working set of your application might not be very big - but what about the virtual memory size? Paging can occur because of this and not just because of its physical size. See this screen shot from Process Explorer of VS2012 running on Windows 8:
And in Task Manager? Apparently the private working set for the same process is 305,376 KB.
We can take from this a) that Task Manager can't necessarily be trusted and b) an application's size in memory, as far as the OS is concerned, is far more complicated than we'd like to think.
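You can see the difference for your own process without any external tools. A quick sketch using System.Diagnostics (the numbers will of course differ from the screenshots above):
using System;
using System.Diagnostics;

class MemorySizes
{
    static void Main()
    {
        Process p = Process.GetCurrentProcess();
        Console.WriteLine("Working set:   {0:N0} bytes", p.WorkingSet64);
        Console.WriteLine("Private bytes: {0:N0} bytes", p.PrivateMemorySize64);
        Console.WriteLine("Virtual size:  {0:N0} bytes", p.VirtualMemorySize64);
    }
}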
You might want to take a look at this.
The paging is almost certainly because of what you do with the files, and the high final figures almost certainly because of the number of files you're working with. A simple test of that would be to experiment with different numbers of files and generate a dataset of final paging figures alongside those. If the number of files is causing the paging, then you'll see a clear correlation.
Then take out any processing (but keep the image-loading) you do and compare again - note the difference.
Then stub out the image-loading code completely - note the difference.
Clearly you'll see the biggest drop in faults when you take out the image loading.
Now, looking at the Emgu.CV Image code, it uses the Image class internally to get the image bits - so that's firing up GDI+ via the function GdipLoadImageFromFile (second entry on this index) to decode the image (using system resources, plus potentially large byte arrays) - and then it copies the data to an uncompressed byte array containing the actual RGB values.
This byte array is allocated using GCHandle.Alloc (also surrounded by GC.AddMemoryPressure and GC.RemoveMemoryPressure) to create a pinned byte array to hold the image data (uncompressed). Now I'm no expert on .Net memory management, but it seems to me that we have a potential for heap fragmentation here, even if each file is loaded sequentially and not in parallel.
Whether that's causing the hard paging I don't know. But it seems likely.
In particular, the in-memory representation of the image could be specifically geared around displaying it, as opposed to being the original file bytes. So if we're talking JPEGs, for example, then a 300 KB JPEG could be considerably larger in physical memory, depending on its pixel dimensions. E.g. a 1024x768 32-bit image is 3 MB - and that's allocated twice for each image, since it's loaded (first allocation) and then copied (second allocation) into the EMGU image object before being disposed.
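As a back-of-the-envelope check (a sketch using the example figures above, not your actual images):
int width = 1024, height = 768, bytesPerPixel = 4;        // a 32-bit image
long decodedBytes = (long)width * height * bytesPerPixel; // 3,145,728 bytes, about 3 MB
// ...and that buffer exists twice per image while it is being copied into the EMGU object.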
But you have to ask yourself if it's necessary to find a way around the problem. If your application is not consuming vast amounts of physical RAM, then it will have much less of an impact on other applications; one process hitting the page file lots and lots won't badly affect another process that doesn't, if there's sufficient physical memory.
However, it is not always easy to determine which process is directly responsible for the paging or why.
The devil is in that cop-out note. Bitmaps are mapped into memory from the file that contains the pixel data using a memory-mapped file. That's an efficient way to avoid reading and writing the data directly into/from RAM; you only pay for what you use. The mechanism that keeps the file in sync with RAM is paging. So it is inevitable that if you process a lot of images then you'll see a lot of page faults. The tool you use just isn't smart enough to know that this is by design.
Feature, not a bug.

Multiple threading in the eyes of I/O operations?

I was thinking...
Does multithreading in C# for I/O operations (let's say copying many files from c:\1\ to c:\2\) make any performance difference compared to doing the operation sequentially?
The reason I'm wrestling with this is that an IO operation ultimately comes down to one device that has to do the work, so even if I work in parallel, it will still execute those copy orders sequentially...
Or maybe my assumption is wrong?
In that case, is there any benefit to using a multithreaded copy to:
copy many small files (4 GB total)
copy 4 big files (4 GB total, 1000 MB each)
Thanks
Like others say, it has to be measured against the concrete application context.
But I would just like to draw attention to this.
Every time you copy a file, write-access permission on the destination location is checked, which is slow.
All of us have met the case where you have to copy/paste a sequence of already compressed files; if you compress them again into one big ZIP file, even though the total compressed size is not significantly smaller than the sum of all the content, the IO operation will execute much faster. (Just try it; you will see a huge difference if you haven't done it before.)
So I would assume (again, it has to be measured on a concrete system; these are just guesses) that writing one big file to a single disk will be faster than writing a lot of small files.
Hope this helps.
Multithreading with files is not so much about the CPU but about IO. This means that totally different rules apply. Different devices have different characteristics:
Magnetic disks like sequential IO
SSDs like sequential or parallel random IO (mine has 4 hardware "threads")
The network likes many parallel operations to amortize latency
I'm no expert in hard-disk related questions, but maybe this will shed some light for you:
Windows uses the NTFS file system. This file system doesn't "like" lots of small files, for example files under 1 KB. It will "magically" make 100 files of 1 KB weigh 400 KB instead of 100 KB. It is also VERY slow when dealing with a lot of "small" files. Therefore, copying one big file instead of many small files of the same total weight will be much faster.
Also, from my personal experience and knowledge, multithreading will NOT speed up the copying of many files, because the actual hardware disk acts like one unit and can't be sped up by sending many requests at the same time (it will process them one by one).

Limit Bandwidth Speeds

I wrote an app that syncs local folders with online folders, but it eats all my bandwidth. How can I limit the amount of bandwidth the app uses (programmatically)?
Take a look at http://www.codeproject.com/KB/IP/MyDownloader.aspx
He's using the well-known technique, which can be found in Downloader.Extension\SpeedLimit.
Basically, before more data is read off a stream, a check is performed on how much data has actually been read since the previous iteration. If that rate exceeds the maximum rate, the read is suspended for a very short time and the check is repeated. Most applications use this technique.
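In rough outline (a sketch of the idea, not the CodeProject code; source and destination stand in for whatever streams you are copying between, and the 100 KB/s cap is arbitrary):
long maxBytesPerSecond = 100 * 1024;          // assumed cap
long totalRead = 0;
var watch = System.Diagnostics.Stopwatch.StartNew();
var buffer = new byte[8 * 1024];
int read;
while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
{
    destination.Write(buffer, 0, read);
    totalRead += read;

    // How long *should* it have taken to move this much data at the capped rate?
    double expectedSeconds = (double)totalRead / maxBytesPerSecond;
    double aheadBySeconds = expectedSeconds - watch.Elapsed.TotalSeconds;
    if (aheadBySeconds > 0)
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(aheadBySeconds)); // back off until under the cap
}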
Try this: http://www.netlimiter.com/ It's been on my "check this out" list for a long time (though I haven't tried it yet myself).
I'd say "don't". Unless you're doing something really wrong, your program shouldn't be hogging bandwidth. Your router should be balancing the available bandwidth between all requests.
I'd recommend you do the following:
a) Create MD5 hashes for all the files. Compare hashes and/or dates and sizes for the files and only sync the files that have changed (see the sketch after this list). Unless you're syncing massive files you shouldn't have to sync a whole lot of data.
b) Limit the sending rate. In your upload thread, read the files in 1-8 KB chunks and then call Thread.Sleep after every chunk to limit the rate. You have to do this on the upload side, however.
c) Pipe everything through a GZip stream (System.IO.Compression). For text files this can reduce the size of the data that needs to be transferred.
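For (a), hashing a local file is short with System.Security.Cryptography. A sketch (where the remote hash comes from depends on your sync protocol, and the Upload method is a placeholder):
using System;
using System.IO;
using System.Security.Cryptography;

static class SyncHelpers
{
    public static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // Hex string, e.g. "9E107D9D372BB6826BD81D3542A419D6"
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }
    }
}

// Only upload when the hashes (or dates and sizes) differ:
// if (SyncHelpers.Md5Of(localPath) != remoteHash) Upload(localPath);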
Hope this helps!
