Batch Uploading Huge Sets of Images to Azure Blob Storage - c#

I have about 110,000 images of various formats (jpg, png and gif) and sizes (2-40KB) stored locally on my hard drive. I need to upload them to Azure Blob Storage. While doing this, I need to set some metadata and the blob's ContentType, but otherwise it's a straight up bulk upload.
I'm currently using the following to handle uploading one image at a time (paralleled over 5-10 concurrent Tasks).
static void UploadPhoto(Image pic, string filename, ImageFormat format)
{
    // convert the image to bytes
    using (MemoryStream ms = new MemoryStream())
    {
        pic.Save(ms, format);
        ms.Position = 0;
        // create the blob, set metadata and properties
        var blob = container.GetBlobReference(filename);
        blob.Metadata["Filename"] = filename;
        blob.Properties.ContentType = MimeHandler.GetContentType(Path.GetExtension(filename));
        // upload!
        blob.UploadFromStream(ms);
        blob.SetMetadata();
        blob.SetProperties();
    }
}
I was wondering if there was another technique I could employ to handle the uploading, to make it as fast as possible. This particular project involves importing a lot of data from one system to another, and for customer reasons it needs to happen as quickly as possible.

Okay, here's what I did. I tinkered around with running BeginUploadFromStream(), then BeginSetMetadata(), then BeginSetProperties() in an asynchronous chain, paralleled over 5-10 threads (a combination of ElvisLive's and knightpfhor's suggestions). This worked, but anything over 5 threads had terrible performance, taking upwards of 20 seconds for each thread (working on a page of ten images at a time) to complete.
So, to sum up the performance differences:
Asynchronous: 5 threads, each running an async chain, each working on ten images at a time (paged for statistical reasons): ~15.8 seconds (per thread).
Synchronous: 1 thread, ten images at a time (paged for statistical reasons): ~3.4 seconds
Okay, that's pretty interesting. One instance uploading blobs synchronously performed roughly 5x better than each thread in the async approach, so even the best async balance of 5 threads nets essentially the same aggregate throughput as a single synchronous instance.
So, I tweaked my image file importing to separate the images into folders containing 10,000 images each. Then I used Process.Start() to launch an instance of my blob uploader for each folder. I have 170,000 images to work with in this batch, so that means 17 instances of the uploader. When running all of those on my laptop, performance across all of them leveled out at ~4.3 seconds per set.
Long story short, instead of trying to get threading working optimally, I just run a blob uploader instance for every 10,000 images, all on the one machine at the same time. Total performance boost?
Async Attempts: 14-16 hours, based on average execution time when running it for an hour or two.
Synchronous with 17 separate instances: ~1 hour, 5 minutes.
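For reference, a minimal sketch of that fan-out step, assuming the uploader executable takes the folder path as its only argument (the executable name and argument convention here are hypothetical):
using System.Diagnostics;
using System.IO;

// Launch one uploader process per folder of ~10,000 images.
foreach (string folder in Directory.GetDirectories(@"C:\images"))
{
    Process.Start("BlobUploader.exe", "\"" + folder + "\"");
}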

You should definitely upload in parallel in several streams (i.e. post multiple files concurrently), but before you run any experiment that (erroneously) shows there is no benefit, make sure you actually increase the value of ServicePointManager.DefaultConnectionLimit:
The maximum number of concurrent connections allowed by a ServicePoint object. The default value is 2.
With a default value of 2, you can have at most two outstanding HTTP requests against any destination.
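For example, a one-line change at application startup (the value 48 is an illustrative guess; tune it for your hardware and connection):
using System.Net;

// Allow many concurrent HTTP connections per endpoint (the default is 2).
// Set this once, before any requests are issued.
ServicePointManager.DefaultConnectionLimit = 48;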

As the files that you're uploading are pretty small, I think the code that you've written is probably about as efficient as you can get. Based on your comment it looks like you've tried running these uploads in parallel which was really the only other code suggestion I had.
I suspect that getting the greatest throughput will come down to finding the right number of threads for your hardware, your connection and your file size. You could try using the Azure Throughput Analyzer to make finding this balance easier.
Microsoft's Extreme Computing group also has benchmarks and suggestions on improving throughput. It's focused on throughput from worker roles deployed in Azure, but it will give you an idea of the best you could hope for.

You may want to increase ParallelOperationThreadCount as shown below. I haven't checked the latest SDK, but in 1.3 the limit was 64. Not setting this value resulted in lower concurrent operations.
CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds);
// todo: set this in blob extensions
blobStorage.ParallelOperationThreadCount = 64;

If the parallel method takes 5 times longer to upload than the serial one, then you either:
have awful bandwidth
have a very slow computer
do something wrong
My command-line util gets quite a boost when running in parallel, even though I don't use memory streams or any other nifty stuff like that; I simply generate a string array of the filenames, then upload them with Parallel.ForEach.
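A minimal sketch of that approach, reusing the container object from the question (CloudBlob.UploadFile existed in the StorageClient-era SDK; treat the exact calls as assumptions):
using System.IO;
using System.Threading.Tasks;

string[] fileNames = Directory.GetFiles(@"C:\images");

// Upload files concurrently; each iteration gets its own blob reference.
Parallel.ForEach(fileNames, path =>
{
    var blob = container.GetBlobReference(Path.GetFileName(path));
    blob.UploadFile(path); // streams straight from disk, no MemoryStream
});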
Additionally, the Properties.ContentType call probably sets you back quite a bit. Personally I never set it, and I guess it shouldn't even matter unless you want to view the blobs right in the browser via direct URLs.

You could always try the async methods of uploading.
public override IAsyncResult BeginUploadFromStream(
    Stream source,
    AsyncCallback callback,
    Object state
)
http://msdn.microsoft.com/en-us/library/windowsazure/ee772907.aspx
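A rough sketch of chaining those APM calls, as the accepted answer describes (error handling omitted; the exact flow is an assumption, and the stream must stay open until EndUploadFromStream runs):
// Kick off the upload; continue the chain in the callbacks.
blob.BeginUploadFromStream(ms, ar1 =>
{
    blob.EndUploadFromStream(ar1);
    blob.BeginSetMetadata(ar2 =>
    {
        blob.EndSetMetadata(ar2);
        blob.BeginSetProperties(ar3 => blob.EndSetProperties(ar3), null);
    }, null);
}, null);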

Related

Use multithreading for multiple file copies

I have to copy a large number of files (10,000 files), and it takes a long time to copy them.
I tried using two threads instead of a single thread: one to copy the odd-numbered files in the list, and the other to copy the even-numbered ones.
I used this code:
ThreadPool.QueueUserWorkItem(new WaitCallback(this.RunFileCopy), state); // 'state' stands in for whatever you pass to RunFileCopy
but there is no significant difference in time between using a single thread and using two threads.
What could be the reason for this?
File copying is not a CPU-bound process, it is an I/O-bound process, so multithreading or parallelism won't help you.
Multithreading will slow you down in almost all cases. Even if the disk is an SSD, it has a limited read/write speed, and a single thread can already use it efficiently. If you use parallelism you just split that speed into pieces, and on an HDD the extra seeking creates a huge overhead.
Multithreading only helps when more than one disk is involved, i.e. when you read from different disks and write to different disks.
If the files are small, zipping them and unzipping them on the target drive can be faster in most cases, and if you zip the files with low compression it will be faster still.
using System.IO;
using System.IO.Compression;
.....
string startPath = @"c:\example\start";
string zipPath = @"c:\example\result.zip";
string extractPath = @"c:\example\extract";
ZipFile.CreateFromDirectory(startPath, zipPath, CompressionLevel.Fastest, true);
ZipFile.ExtractToDirectory(zipPath, extractPath);
More implementation details here
How to: Compress and Extract Files
I'm going to provide a minority opinion here. Everybody is telling you that Disk I/O is preventing you from getting any speedup from multiple threads. That's ... sort ... of right, but...
Given a single disk request, the OS can only move the heads to the point on the disk selected implicitly by the file access, usually incurring an average of half of the full-stroke seek time (tens of milliseconds) plus rotational delay (another ~10 milliseconds) to access the data. Sticking with single disk requests, this is a pretty horrendous (and unavoidable) cost to pay.
Because disk accesses take a long time, the OS has plenty of CPU time to consider the best order in which to access the disk when multiple requests are outstanding. It usually does so with an elevator algorithm, causing the heads to scan efficiently across the disk in one direction, then scan efficiently back in the other direction once the "furthest" access has been reached.
The idea is simple: if you process multiple disk requests in exactly the time order in which they arrive, the disk heads will likely jump randomly about the disk (assuming the files are placed randomly), incurring the half-stroke seek plus rotational delay on every access. With 1,000 live accesses processed in arrival order, 1,000 average half-stroke seeks will occur. Ick.
Instead, given N near-simultaneous accesses, the OS can sort them by the physical cylinder they will touch, and then process them in cylinder order. 1,000 live accesses processed in cylinder order (even with random file distribution) are likely to average one request per cylinder. Now the heads only have to step from one cylinder to the next, and that's a lot less than the average seek.
So having lots of requests should help the OS make better access-order decisions.
Since the OP has lots of files, there's no reason he could not run a lot of threads, each copying its own file and generating demand for disk locations. He would want each thread to issue reads and writes of something like a full track, so that when the heads arrive at a cylinder, a full track is read or written (under the assumption that the OS lays files out contiguously on a track where it can).
The OP would want to make sure his machine has enough RAM to buffer his thread count times the track size. An 8 GB machine with 4 GB idle during the copy effectively has a 4 GB disk cache. At 100 KB per track (it's been a long time since I looked), that suggests "room" for 10,000 threads. I seriously doubt he needs that many; mostly he needs enough threads to overwhelm the number of cylinders on his disk. I'd surely consider several hundred threads.
Two threads surely are not enough. (Windows appears to use one thread when you ask it to copy a bunch of files. That has always seemed pretty dumb to me.)
Another poster suggested zipping the files. With lots of threads, and everything waiting on the disk (the elevator algorithm doesn't change that, just the average wait time), many threads can afford to throw computational cycles at zipping. This won't help with reads; files are what they are when read. But it may shorten the amount of data to write, and provide effectively larger buffers in memory, giving some additional speedup.
Note: If one has an SSD, then there are no physical cylinders, thus no seek time, and nothing for an elevator algorithm to optimize. Here, lots of threads don't buy any cylinder ordering time. They shouldn't hurt, either.

Asynchronous S3 PutBucketRequests appear to be executing in series

I'm working on improving the upload performance of a .net app that uploads groups of largeish (~15mb each) files to S3.
I have adjusted multipart options (threads, chunk size, etc.) and I think I've improved that as much as possible, but while closely watching the network utilization, I noticed something unexpected.
I iterate over a number of files in a directory and then submit each of them for upload using an instance of the S3 transfer utility like so:
// prepare the upload
this._transferUtility.S3Client.PutBucket(new PutBucketRequest().WithBucketName(streamingBucket));
request = new TransferUtilityUploadRequest()
.WithBucketName(streamingBucket)
.WithFilePath(streamFile)
.WithKey(targetFile)
.WithTimeout(uploadTimeout)
.WithSubscriber(this.uploadFileProgressCallback);
// start the upload
this._transferUtility.Upload(request);
Then I watch for these to complete in the uploadFileProgressCallback specified above.
However when I watch the network interface, I can see a number of distinct "humps" in my outbound traffic graph which coincide precisely with the number of files I'm uploading to S3.
Since these are asynchronous calls, I was under the impression that each transfer would begin immediately, and that I'd see a stepped increase in outbound data followed by a stepped decrease as each upload completed. Based on what I'm seeing now, I wonder if these requests, while asynchronous to the calling code, are being queued up somewhere and then executed in series?
If so, I'd like to change that so the requests all begin uploading at (close to) the same time, so I can maximize the upload bandwidth I have available and reduce the overall execution time.
I poked around in the S3 .net SDK documentation but I couldn't find any mention of this queueing mechanism or any properties/etc. that appeared to provide a way of increasing the concurrency of these calls.
Any pointers appreciated!
This is something that isn't intrinsically supported by the SDKs, perhaps to keep them simple. I implemented my own concurrent part uploads based on this article:
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
Some observations:
This approach is only good when you have the complete content in memory, since you have to break it into chunks and wrap them in part uploads. In many cases it may not make sense to hold gigabytes of data in memory just so that you can do concurrent uploads. You may have to evaluate the tradeoff there.
The SDKs have a limit of up to 16 MB for a single PUT upload, and any file beyond that size is divided into 5 MB chunks for part uploads. Unfortunately these values are not configurable, so I had to pretty much write my own multipart upload logic. The values mentioned above are for the Java SDK, and I'd expect them to be the same for the C# one too.
All operations are non-blocking, which is good.
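A rough sketch of hand-rolled concurrent part uploads against the v1 .NET SDK's low-level API (the fluent request style matches that era, but treat the exact calls as assumptions, and add error handling and retries before real use):
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using Amazon.S3.Model;

// 1) Start the multipart upload.
var initResponse = s3Client.InitiateMultipartUpload(
    new InitiateMultipartUploadRequest().WithBucketName(bucket).WithKey(key));

long partSize = 5 * 1024 * 1024; // 5 MB parts
long fileLength = new FileInfo(filePath).Length;
var partResponses = new List<UploadPartResponse>();
var threads = new List<Thread>();

// 2) Upload each part on its own thread so the parts transfer concurrently.
int partNumber = 1;
for (long pos = 0; pos < fileLength; pos += partSize, partNumber++)
{
    var partRequest = new UploadPartRequest()
        .WithBucketName(bucket)
        .WithKey(key)
        .WithUploadId(initResponse.UploadId)
        .WithPartNumber(partNumber)
        .WithPartSize(Math.Min(partSize, fileLength - pos))
        .WithFilePosition(pos)
        .WithFilePath(filePath);
    var t = new Thread(() =>
    {
        var resp = s3Client.UploadPart(partRequest);
        lock (partResponses) { partResponses.Add(resp); }
    });
    threads.Add(t);
    t.Start();
}
threads.ForEach(t => t.Join());

// 3) Complete the upload with the collected part ETags.
s3Client.CompleteMultipartUpload(new CompleteMultipartUploadRequest()
    .WithBucketName(bucket)
    .WithKey(key)
    .WithUploadId(initResponse.UploadId)
    .WithPartETags(partResponses));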
In C# you could try to set the part size manually:
TransferUtilityUploadRequest request = new TransferUtilityUploadRequest()
    .WithPartSize(??);
Or:
TransferUtilityConfig utilityConfig = new TransferUtilityConfig();
utilityConfig.MinSizeBeforePartUpload = ??;
But I don't know the defaults.

Threads vs Processes in .NET

I have a long-running process that reads large files and writes summary files. To speed things up, I'm processing multiple files simultaneously using regular old threads:
ThreadStart ts = new ThreadStart(Work);
Thread t = new Thread(ts);
t.Start();
What I've found is that even with separate threads reading separate files and no locking between them and using 4 threads on a 24-core box, I can't even get up to 10% on the CPU or 10% on disk I/O. If I use more threads in my app, it seems to run even more slowly.
I'd guess I'm doing something wrong, but where it gets curious is that if I start the whole exe a second and third time, then it actually processes files two and three times faster. My question is, why can't I get 12 threads in my one app to process data and tax the machine as well as 4 threads in 3 instances of my app?
I've profiled the app and the most time-intensive and frequently called functions are all string processing calls.
It's possible that your computing problem is not CPU bound, but I/O bound. Stating that your disk I/O is "only at 10%" doesn't help much; I'm not sure such a performance counter even exists.
The reason it gets slower with more threads is that those threads are all trying to get to their respective files at the same time, while the disk subsystem has a hard time trying to accommodate all of them. Even with a modern technology like SSDs, where the seek time is several orders of magnitude smaller than with traditional hard drives, there's still a penalty involved.
Rather, you should conclude that your problem is disk bound and a single thread will probably be the fastest way to solve your problem.
One could argue that you could use asynchronous techniques to process a chunk that's already been read while the next chunk is being read in the background, but I think you'd see very little performance improvement there.
I've had a similar problem not too long ago in a small tool where I wanted to calculate MD5 signatures of all the files on my harddrive and I found that the CPU is way too fast compared to the storage system and I got similar results trying to get more performance by using more threads.
Using the Task Parallel Library didn't alleviate this problem.
First of all, on a 24-core box, if you are using only 4 threads the most CPU they could ever use is 16.7%, so really you are getting 60% utilization, which is fairly good.
It is hard to tell whether your program is I/O bound at this point; my guess is that it is. You need to run a profiler on your project and see which sections of code it spends most of its time in. If it is sitting in a read/write operation, it is I/O bound.
It is possible you have some form of inter-thread locking being used. That would cause the program to slow down as you add more threads; running a second process would work around that, but fixing your locking would too.
What it all boils down to is that without profiling information we cannot say whether using a second process will speed things up or slow things down. We need to know whether the program is hanging on an I/O operation, a locking operation, or just taking a long time in a function that could be parallelized better.
I think you've found a case where the file cache is not ideal: one process writing data to many files concurrently. The file cache syncs to disk when the number of dirty pages exceeds a threshold, and it seems that concurrent writers in one process hit that threshold faster than a single-threaded writer does. You can read about the file system cache here: File Cache Performance and Tuning
Try the Task library from .NET 4 (System.Threading.Tasks). It has built-in optimizations for different numbers of processors.
I have no clue what your problem is, though; your code snippet is not really informative.
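A minimal sketch of that suggestion, assuming a ProcessFile method playing the role of the OP's Work routine (the method name is hypothetical):
using System.Linq;
using System.Threading.Tasks;

// One task per input file; the scheduler balances them across cores.
Task[] tasks = files
    .Select(f => Task.Factory.StartNew(() => ProcessFile(f)))
    .ToArray();
Task.WaitAll(tasks);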

How to Download 5 files at a time using Thread in .net framework 3.5

I need to download certain files using FTP. It's already implemented without threads, but it takes too much time to download all the files, so I need to use threads to speed up the process.
My code is like:
foreach (string str1 in files)
{
    download_FTP(str1);
}
I referred to this, but I don't want every file to be queued at once; say, for example, 5 files at a time.
If the process is too slow, it means most likely that the network/Internet connection is the bottleneck. In that case, downloading the files in parallel won't significantly increase the performance.
It might be another story though if you are downloading from different servers. We may then imagine that some of the servers are slower than others. In that case, parallel downloads would increase the overall performance since the program would download files from other servers while being busy with slow downloads.
EDIT: OK, we have more info from you: Single server, many small files.
Downloading multiple files involves some overhead. You can decrease this overhead by somehow grouping the files (tar, zip, whatever) on the server side. Of course, this may not be possible. If your app were talking to a web server, I'd advise creating a zip file on the fly server-side according to the list of files transmitted in the request. But you are on an FTP server, so I'll assume you have nearly no flexibility server-side.
Downloading several files in parallel will probably increase the throughput in your case. Be very careful though about restrictions set by the server, such as the maximum number of simultaneous connections. Also keep in mind that if you have many simultaneous users, you'll end up with a big number of connections on the server: users x threads. That may prove counter-productive depending on the scalability of the server.
A commonly accepted rule of good behaviour is to limit yourself to at most 2 simultaneous connections per user. YMMV.
Okay, as you're not using .NET 4 that makes it slightly harder - the Task Parallel Library would make it really easy to create five threads reading from a producer/consumer queue. However, it still won't be too hard.
Create a Queue<string> with all the files you want to download
Create 5 threads, each of which has a reference to the queue
Make each thread loop, taking an item off the queue and downloading it, or finishing if the queue is empty
Note that as Queue<T> isn't thread-safe, you'll need to lock to make sure that only one thread tries to fetch an item from the queue at a time:
string fileToDownload = null;
lock (padlock)
{
    if (queue.Count == 0)
    {
        return; // Done
    }
    fileToDownload = queue.Dequeue();
}
As noted elsewhere, threading may not speed things up at all - it depends where the bottleneck is. If the bottleneck is the user's network connection, you won't be able to get more data down the same size of pipe just by using multi-threading. On the other hand, if you have a lot of small files to download from different hosts, then it may be latency rather than bandwidth which is the problem, in which case threading will help.
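Putting those steps together, a minimal sketch (DownloadFile stands in for whatever your FTP download call is; it's an assumption):
using System.Collections.Generic;
using System.Threading;

Queue<string> queue = new Queue<string>(files);
object padlock = new object();

ThreadStart worker = () =>
{
    while (true)
    {
        string fileToDownload;
        lock (padlock)
        {
            if (queue.Count == 0)
            {
                return; // queue drained; this worker is done
            }
            fileToDownload = queue.Dequeue();
        }
        DownloadFile(fileToDownload); // hypothetical FTP download call
    }
};

// Five workers, as in the question.
List<Thread> threads = new List<Thread>();
for (int i = 0; i < 5; i++)
{
    Thread t = new Thread(worker);
    threads.Add(t);
    t.Start();
}
threads.ForEach(t => t.Join());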
Look up ParameterizedThreadStart; it lets you pass each filename to its own thread:
List<Thread> threadsToUse = new List<Thread>();
foreach (string str1 in files)
{
    // the lambda adapts download_FTP(string) to the object parameter
    // that ParameterizedThreadStart requires
    Thread aThread = new Thread(obj => download_FTP((string)obj));
    threadsToUse.Add(aThread);
    aThread.Start(str1);
}
There is also Thread.Join, which lets you block until a given thread has finished.
Something else you might want to look at, which I'm still trying to fully grasp, is asynchronous calls: with these you will know when the file has been downloaded, whereas with a normal thread you're going to have to find another way to flag that it's finished.
This may or may not help your speed. If your line speed is low, it won't help you much; on the other hand, some servers cap each connection to a certain speed, in which case multiple connections should, in theory, give you a slight increase in speed. How much of an increase I cannot answer.
Hope this helps in some way
I can add some experience to the comments already posted. In an app some years ago I had to generate a treeview of files on an FTP server. Listing files does not normally require actual downloading, but some of the files were zipped folders and I had to download these and unzip them (sometimes recursively) to display the files/folders inside. For a multithreaded solution, this required a 'FolderClass' for each folder that could keep state and so handle both unzipped and zipped folders. To start the operation off, one of these was set up with the root folder and submitted to a P-C queue and a pool of threads. As each folder was LISTed and iterated, more FolderClass instances were submitted to the queue for each subfolder. When a FolderClass instance reached the end of its LIST, it PostMessaged itself to the UI thread (it was not C#; there you would need BeginInvoke or the like), where its info was added to the listview.
This activity was characterised by a lot of latency-sensitive TCP connect/disconnect with occasional download/unzip.
A pool of, IIRC, 4-6 threads (as already suggested by other posters) provided the best performance on the single-core system I had at the time and, in this particular case, was much faster than a single-threaded solution. I can't remember the figures exactly, but no stopwatch was needed to detect the performance boost - something like 3-4 times faster. On a modern box with multiple cores, where LISTs and unzips could occur concurrently, I would expect even more improvement.
There were some problems - the visual ListView component could not keep up with the incoming messages (because of the multiple threads, data arrived for apparently 'random' positions in the treeview and so required continual tree navigation for display), and so the UI tended to freeze during the operation. Another problem was detecting when the operation had actually finished. These snags are probably not relevant to your download-many-small-files app.
Conclusion - I expect that downloading a lot of small files is going to be faster if multithreaded with multiple connections, if only from mitigating the connect/disconnect latency which can be larger than the actual data download time. In the extreme case of a satellite connection with high speed but very high latency, a large thread pool would provide a massive speedup.
Note the valid caveats from the other posters - if the server, (or its admin), disallows or gets annoyed at the multiple connections, you may get no boost, limited bandwidth or a nasty email from the admin!
Rgds,
Martin

Limit Bandwidth Speeds

I wrote an app that syncs local folders with online folders, but it eats all my bandwidth. How can I limit the amount of bandwidth the app uses (programmatically)?
Take a look at http://www.codeproject.com/KB/IP/MyDownloader.aspx
He's using the well-known technique found in Downloader.Extension\SpeedLimit.
Basically, before more data is read off a stream, a check is performed on how much data has actually been read since the previous iteration. If that rate exceeds the maximum rate, the read is suspended for a very short time and the check is repeated. Most applications use this technique.
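A minimal sketch of that check (the method and parameter names are illustrative, not the article's actual code):
using System;
using System.IO;
using System.Threading;

static void CopyThrottled(Stream input, Stream output, int maxBytesPerSecond)
{
    byte[] buffer = new byte[8 * 1024];
    long totalRead = 0;
    DateTime start = DateTime.UtcNow;
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
        totalRead += read;
        // If we're ahead of the byte budget, pause briefly and re-check.
        while (totalRead / Math.Max(0.001, (DateTime.UtcNow - start).TotalSeconds)
               > maxBytesPerSecond)
        {
            Thread.Sleep(50);
        }
    }
}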
Try this: http://www.netlimiter.com/ It's been on my "check this out" list for a long time (though I haven't tried it yet myself).
I'd say "don't". Unless you're doing something really wrong, your program shouldn't be hogging bandwidth. Your router should be balancing the available bandwidth between all requests.
I'd recommend you do the following:
a) Create MD5 hashes for all the files. Compare hashes and/or dates and sizes, and only sync the files that have changed. Unless you're syncing massive files, you shouldn't have to sync a whole lot of data.
b) Limit the sending rate. In your upload thread, read the files in 1-8 KB chunks and then call Thread.Sleep after every chunk to limit the rate. This has to be done on the upload side (see the sketch after this list).
c) Pipe everything through a GZip stream (System.IO.Compression.GZipStream). For text files this can reduce the size of the data that needs to be transferred.
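A crude sketch of (b), assuming a fixed chunk size and sleep interval (the tuning values are illustrative):
using System.IO;
using System.Threading;

// Write ~4 KB, then sleep; e.g. 4 KB every 40 ms is roughly 100 KB/s.
static void UploadRateLimited(Stream file, Stream network)
{
    byte[] chunk = new byte[4 * 1024];
    int read;
    while ((read = file.Read(chunk, 0, chunk.Length)) > 0)
    {
        network.Write(chunk, 0, read);
        Thread.Sleep(40); // crude rate limit on the sending side
    }
}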
Hope this helps!
