I'm working on improving the upload performance of a .NET app that uploads groups of largish (~15 MB each) files to S3.
I have adjusted multipart options (threads, chunk size, etc.) and I think I've improved that as much as possible, but while closely watching the network utilization, I noticed something unexpected.
I iterate over a number of files in a directory and then submit each of them for upload using an instance of the S3 transfer utility like so:
// prepare the upload: make sure the bucket exists, then build the request
this._transferUtility.S3Client.PutBucket(new PutBucketRequest().WithBucketName(streamingBucket));
var request = new TransferUtilityUploadRequest()
    .WithBucketName(streamingBucket)
    .WithFilePath(streamFile)
    .WithKey(targetFile)
    .WithTimeout(uploadTimeout)
    .WithSubscriber(this.uploadFileProgressCallback);
// start the upload
this._transferUtility.Upload(request);
Then I watch for these to complete in the uploadFileProgressCallback specified above.
However when I watch the network interface, I can see a number of distinct "humps" in my outbound traffic graph which coincide precisely with the number of files I'm uploading to S3.
Since this is an asynchronous call, I was under the impression that each transfer would begin immediately, and I'd see a stepped increase in outbound data followed by a stepped decrease as each upload completed. Based on what I'm seeing now, I wonder if these requests, while asynchronous to the calling code, are being queued up somewhere and then executed in series?
If so I'd like to change that so the requests all begin uploading at (close to) the same time, so I can maximize the upload bandwidth I have available and reduce the overall execution time.
I poked around in the S3 .NET SDK documentation, but I couldn't find any mention of this queueing mechanism or any properties that appeared to provide a way of increasing the concurrency of these calls.
Any pointers appreciated!
This may be something that isn't intrinsically supported by the SDKs, perhaps to keep them simple. I implemented my own concurrent part uploads based on this article:
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
Some observations:
This approach is good only when you have the complete content in memory, as you have to break it into chunks and wrap them up in part uploads. In many cases it may not make sense to hold on the order of GBs of data in memory just so you can do concurrent uploads. You may have to evaluate that tradeoff.
The SDKs have a limit of up to 16 MB for a single PUT upload, and any file beyond this size is divided into 5 MB chunks for part uploads. Unfortunately these values are not configurable, so I had to pretty much write my own multipart upload logic. The values mentioned above are for the Java SDK, and I'd expect them to be the same for the C# one too.
All operations are non-blocking which is good.
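For illustration, here is a minimal sketch of concurrent part uploads with the low-level C# S3 client. This is my own sketch of the technique from the article, not the article's code: s3Client, bucketName, key and filePath are assumed to exist, and exact request/property names vary between SDK versions. Note that a C# part request can read directly from a file path and offset, which can ease the in-memory concern above.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Amazon.S3.Model;

var initResponse = s3Client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
{
    BucketName = bucketName,
    Key = key
});

const long partSize = 5 * 1024 * 1024; // S3's minimum part size of 5 MB
long fileLength = new FileInfo(filePath).Length;
int partCount = (int)Math.Ceiling((double)fileLength / partSize);
var partETags = new ConcurrentBag<PartETag>();

// Upload the parts concurrently; each part reads its own slice of the file.
Parallel.For(1, partCount + 1,
    new ParallelOptions { MaxDegreeOfParallelism = 4 }, // illustrative value
    partNumber =>
    {
        long offset = (partNumber - 1) * partSize;
        var partResponse = s3Client.UploadPart(new UploadPartRequest
        {
            BucketName = bucketName,
            Key = key,
            UploadId = initResponse.UploadId,
            PartNumber = partNumber,
            FilePath = filePath,
            FilePosition = offset,
            PartSize = Math.Min(partSize, fileLength - offset)
        });
        partETags.Add(new PartETag(partNumber, partResponse.ETag));
    });

s3Client.CompleteMultipartUpload(new CompleteMultipartUploadRequest
{
    BucketName = bucketName,
    Key = key,
    UploadId = initResponse.UploadId,
    PartETags = partETags.OrderBy(p => p.PartNumber).ToList()
});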
In C# you could try to set the part size manually:
TransferUtilityUploadRequest request =
    new TransferUtilityUploadRequest()
        .WithPartSize(??);
Or
TransferUtilityConfig utilityConfig = new TransferUtilityConfig();
utilityConfig.MinSizeBeforePartUpload = ??;
But I don't know the defaults.
Related
I have read a number of closely related questions but not one that hits this exactly. If it is a duplicate, please send me a link.
I am using an Angular version of the flowjs library for doing HTML5 file uploads (https://github.com/flowjs/ng-flow). This works very well, and I am able to upload multiple files simultaneously in 1 MB chunks. There is an ASP.NET Web API Files controller that accepts these and saves them on disk. Although I can make this work, I am not doing it efficiently and would like to know a better approach.
First, I used the MultipartFormDataStreamProvider in an async method that worked great as long as the file uploaded in a single chunk. Then I switched to just using the FileStream to write the file to disk. This also worked as long as the chunks arrived in order, but of course, I cannot rely on that.
Next, just to see it work, I wrote the chunks to individual file streams and combined them after the fact, hence the inefficiency. A 1GB file would generate a thousand chunks that needed to be read and rewritten after the upload was complete. I could hold all file chunks in memory and flush them after they are all uploaded, but I'm afraid the server would blow up.
It seems that there should be a nice asynchronous solution to this dilemma, but I don't know what it is. One possibility might be to use async/await to combine previous chunks while writing the current chunk. Another might be to use Begin/EndInvoke to create a separate thread so that the file manipulation on disk is handled independently of the thread reading from the HttpContext, but this would rely on the ThreadPool, and I'm afraid the created threads would be unduly terminated when my MVC controller returns. I could create a FileWatcher that ran completely independently of ASP.Net, but that would be very kludgey.
So my questions are, 1) is there a simple solution already that I am missing? (seems like there should be) and 2) if not, what is the best approach to solving this inside the Web API framework?
Thanks, bob
I'm not familiar with that kind of chunked upload, but I believe this should work:
Use flowTotalSize to pre-allocate the file when the first chunk comes in.
Have one SemaphoreSlim per file to serialize the asynchronous writes for that file.
Each chunk will write to its own offset (flowChunkSize * (flowChunkNumber - 1)) within the file.
This doesn't handle situations where the uploads are unexpectedly terminated. That kind of solution usually involves allocating/writing a temporary file (with a special extension) and then moving/renaming that file once the last chunk arrives.
Don't forget to ensure that your file writing is actually asynchronous.
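Here's a minimal sketch of that approach (my own illustration, not tested production code; the flowTotalSize/flowChunkNumber/flowChunkSize values come from the flow.js request parameters, and the helper name and wiring are assumptions):
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkWriter
{
    // One SemaphoreSlim per target file to serialize the asynchronous writes.
    private static readonly ConcurrentDictionary<string, SemaphoreSlim> Locks =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public static async Task WriteChunkAsync(
        string path, long flowTotalSize, int flowChunkNumber, int flowChunkSize, byte[] chunkData)
    {
        var gate = Locks.GetOrAdd(path, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try
        {
            // useAsync: true makes WriteAsync genuinely asynchronous.
            using (var fs = new FileStream(path, FileMode.OpenOrCreate,
                FileAccess.Write, FileShare.None, bufferSize: 4096, useAsync: true))
            {
                if (fs.Length != flowTotalSize)
                    fs.SetLength(flowTotalSize); // pre-allocate on the first chunk

                // Each chunk lands at its own offset within the file.
                fs.Position = (long)flowChunkSize * (flowChunkNumber - 1);
                await fs.WriteAsync(chunkData, 0, chunkData.Length);
            }
        }
        finally
        {
            gate.Release();
        }
    }
}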
Using @Stephen Cleary's answer and this thread: https://github.com/flowjs/ng-flow/issues/41, I was able to make an ASP.NET Web API implementation and uploaded it for those still wondering about this question, such as @Herb Caudill:
https://github.com/samhowes/NgFlowSample/tree/master.
The original answer is the real answer to this question, but I don't have enough reputation yet to comment. I did not use a SemaphoreSlim, but instead enabled file write sharing; I did, however, pre-allocate the file and make sure each chunk gets written to the right location by calculating an offset.
I will be contributing this to the Flow samples at: https://github.com/flowjs/flow.js/tree/master/samples
This is what I have done: upload the chunks, save them on the server, and save the location of the chunks in the database along with their order (not the order they arrived in, but the order of the chunks within the file).
Then I introduced another endpoint to merge those chunks. Since this can be a long process, I used a messaging service to run it in the background.
Once the service is done merging the file, it sends a notification (or you can trigger an event).
Agreed, it won't fix the problem of having to save all those chunks, but after the merging is done we can just delete them from the disk. There is, however, some IIS configuration required for the upload to work smoothly.
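A minimal sketch of the merge step (illustrative names; it assumes the chunk file paths come back from the database already sorted by their recorded order):
using System.Collections.Generic;
using System.IO;

public static void MergeChunks(string targetPath, IEnumerable<string> orderedChunkPaths)
{
    using (var output = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
    {
        foreach (var chunkPath in orderedChunkPaths)
        {
            // Append each chunk to the final file in its recorded order.
            using (var input = File.OpenRead(chunkPath))
            {
                input.CopyTo(output);
            }
            File.Delete(chunkPath); // reclaim disk space as we go
        }
    }
}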
Here's my two cents on this old question. These days most applications use Azure or AWS for storage, but I'm still sharing my thoughts in case it helps someone.
I wonder about the pros and cons of using these different techniques.
Sending an image as a response stream will result in just one request to the server, but will take more processing power on the service side.
Compare that to saving the image on a file share and sending a URL back to the client, letting the client request the image directly from the file share.
What strategy would you recommend? This service will have a huge number of requests.
I think the best way is to send the image directly, because:
there is only one request
the hard disk is not touched
less RAM is used for the image
For the other approach, I see several cons:
you use the hard drive twice (write and read)
your server must handle two requests
the image in RAM is used to write, to read, and to send
You should run some tests, but I think sending the image directly is greedier at any given instant, while in total the saved-image solution is the greedier one.
Basically it depends on the processing need for your image.
If your image is specifically processing for each request, you'll need to create a new image on the fly each time. In this case, there is no reason to save your file and share it in a link.
If your image will be reused by other requests, then you could consider both. The streaming overhead is not a total deal-breaker, but it does exist:
processing overhead
the request is processed by managed code
all HTTP modules/handlers are invoked for each request
even when cached by IIS (dynamic caching), streaming is more costly than sending a static file
you'll have difficulty sharing this file across multiple servers -> local cache only (and if you have to pay for distributed cache access, it's no longer worthwhile); a static file, by contrast, can be served through a CDN
Moreover, for performance reasons, you want to free your HTTP request from processing as soon as possible so the server can take on other requests; fetching a static image file is then just another network task handled by the browser.
To sum up, consider your usage and the life cycle of your image data.
I have about 110,000 images of various formats (jpg, png and gif) and sizes (2-40KB) stored locally on my hard drive. I need to upload them to Azure Blob Storage. While doing this, I need to set some metadata and the blob's ContentType, but otherwise it's a straight up bulk upload.
I'm currently using the following to handle uploading one image at a time (paralleled over 5-10 concurrent Tasks).
static void UploadPhoto(Image pic, string filename, ImageFormat format)
{
    // convert image to bytes
    using (MemoryStream ms = new MemoryStream())
    {
        pic.Save(ms, format);
        ms.Position = 0;

        // create the blob, set metadata and properties
        var blob = container.GetBlobReference(filename);
        blob.Metadata["Filename"] = filename;
        blob.Properties.ContentType = MimeHandler.GetContentType(Path.GetExtension(filename));

        // upload!
        blob.UploadFromStream(ms);
        blob.SetMetadata();
        blob.SetProperties();
    }
}
I was wondering if there was another technique I could employ to handle the uploading, to make it as fast as possible. This particular project involves importing a lot of data from one system to another, and for customer reasons it needs to happen as quickly as possible.
Okay, here's what I did. I tinkered around with running BeginUploadFromStream(), then BeginSetMetadata(), then BeginSetProperties() in an asynchronous chain, parallelized over 5-10 threads (a combination of ElvisLive's and knightpfhor's suggestions). This worked, but anything over 5 threads had terrible performance, taking upwards of 20 seconds for each thread (working on a page of ten images at a time) to complete.
So, to sum up the performance differences:
Asynchronous: 5 threads, each running an async chain, each working on ten images at a time (paged for statistical reasons): ~15.8 seconds (per thread).
Synchronous: 1 thread, ten images at a time (paged for statistical reasons): ~3.4 seconds
Okay, that's pretty interesting. One instance uploading blobs synchronously performed 5x better than each thread in the other approach. So, even running the best async balance of 5 threads nets essentially the same performance.
So, I tweaked my image file importing to separate the images into folders containing 10,000 images each. Then I used Process.Start() to launch an instance of my blob uploader for each folder. I have 170,000 images to work with in this batch, so that means 17 instances of the uploader. When running all of those on my laptop, performance across all of them leveled out at ~4.3 seconds per set.
Long story short, instead of trying to get threading working optimally, I just run a blob uploader instance for every 10,000 images, all on the one machine at the same time. Total performance boost?
Async Attempts: 14-16 hours, based on average execution time when running it for an hour or two.
Synchronous with 17 separate instances: ~1 hour, 5 minutes.
You should definitely upload in parallel in several streams (i.e. post multiple files concurrently), but before you run any experiment that (erroneously) shows there is no benefit, make sure you actually increase the value of ServicePointManager.DefaultConnectionLimit:
The maximum number of concurrent connections allowed by a ServicePoint
object. The default value is 2.
With a default value of 2, you can have at most two outstanding HTTP requests against any destination.
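For example (the value of 20 here is just a starting point to experiment with):
using System.Net;

// Raise the per-host connection cap before issuing any requests; with the
// default of 2, only two uploads to the same endpoint can be in flight.
ServicePointManager.DefaultConnectionLimit = 20; // illustrative value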
As the files that you're uploading are pretty small, I think the code that you've written is probably about as efficient as you can get. Based on your comment it looks like you've tried running these uploads in parallel which was really the only other code suggestion I had.
I suspect that in order to get the greatest throughput will be about finding the right number of threads for your hardware, your connection and your file size. You could try using the Azure Throughput Analyzer to make finding this balance easier.
Microsoft's Extreme Computing group has also published benchmarks and suggestions on improving throughput. It's focused on throughput from worker roles deployed on Azure, but it will give you an idea of the best you could hope for.
You may want to increase ParallelOperationThreadCount as shown below. I haven't checked the latest SDK, but in 1.3 the limit was 64. Not setting this value resulted in fewer concurrent operations.
CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds);
// todo: set this in blob extensions
blobStorage.ParallelOperationThreadCount = 64;
If the parallel method takes 5 times longer to upload than the serial one, then you either
have awful bandwidth
have a very slow computer
do something wrong
My command-line util gets quite a boost when running in parallel even though I don't use memory streams nor any other nifty stuff like that, I simply generate a string array of the filenames, then upload them with Parallel.ForEach.
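Something along these lines (a sketch of the approach, not my exact code; container is an already-initialized blob container as in the question, and the degree of parallelism is illustrative):
using System.IO;
using System.Threading.Tasks;

string[] files = Directory.GetFiles(sourceDir);
Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 8 }, path =>
{
    // One blob per file; the classic storage client uploads straight from disk.
    var blob = container.GetBlobReference(Path.GetFileName(path));
    blob.UploadFile(path); // newer SDK versions call this UploadFromFile
});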
Additionally, the Properties.ContentType call probably sets you back quite a bit. Personally I never set it, and I guess it shouldn't matter unless you want to view the blobs right in the browser via direct URLs.
You could always try the async methods of uploading.
public override IAsyncResult BeginUploadFromStream(
    Stream source,
    AsyncCallback callback,
    Object state
)
http://msdn.microsoft.com/en-us/library/windowsazure/ee772907.aspx
I need to download certain files using FTP. It is already implemented without using threads, and it takes too much time to download all the files.
So I need to use some threads to speed up the process.
my code is like
foreach (string str1 in files)
{
    download_FTP(str1);
}
I referred to this, but I don't want every file to be queued at once; say, for example, 5 files at a time.
If the process is too slow, it means most likely that the network/Internet connection is the bottleneck. In that case, downloading the files in parallel won't significantly increase the performance.
It might be another story though if you are downloading from different servers. We may then imagine that some of the servers are slower than others. In that case, parallel downloads would increase the overall performance since the program would download files from other servers while being busy with slow downloads.
EDIT: OK, we have more info from you: Single server, many small files.
Downloading multiple files involves some overhead. You can decrease this overhead by somehow grouping the files (tar, zip, whatever) on the server side. Of course, this may not be possible. If your app were talking to a web server, I'd advise creating a zip file on the fly server-side according to the list of files transmitted in the request. But you are on an FTP server, so I'll assume you have nearly no flexibility server-side.
Downloading several files in parallel may well increase the throughput in your case. Be very careful though about restrictions set by the server, such as the maximum number of simultaneous connections. Also, keep in mind that if you have many simultaneous users, you'll end up with a large number of connections on the server: users x threads. That may prove counter-productive depending on the scalability of the server.
A commonly accepted rule of good behaviour is to limit yourself to at most 2 simultaneous connections per user. YMMV.
Okay, as you're not using .NET 4 that makes it slightly harder - the Task Parallel Library would make it really easy to create five threads reading from a producer/consumer queue. However, it still won't be too hard.
Create a Queue<string> with all the files you want to download
Create 5 threads, each of which has a reference to the queue
Make each thread loop, taking an item off the queue and downloading it, or finishing if the queue is empty
Note that as Queue<T> isn't thread-safe, you'll need to lock to make sure that only one thread tries to fetch an item from the queue at a time:
string fileToDownload = null;
lock(padlock)
{
if (queue.Count == 0)
{
return; // Done
}
fileToDownload = queue.Dequeue();
}
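Putting it all together, a fuller sketch of the producer/consumer worker described above (the queue, padlock and download routine are from the snippets above; the class wrapper and names are illustrative):
using System;
using System.Collections.Generic;
using System.Threading;

class FtpDownloader
{
    private readonly object padlock = new object();
    private readonly Queue<string> queue;
    private readonly Action<string> download; // e.g. your existing download_FTP

    public FtpDownloader(IEnumerable<string> files, Action<string> download)
    {
        this.queue = new Queue<string>(files);
        this.download = download;
    }

    public void DownloadAll(int threadCount)
    {
        var workers = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(Worker);
            t.Start();
            workers.Add(t);
        }
        // Block until every worker has drained the queue.
        foreach (var t in workers)
            t.Join();
    }

    private void Worker()
    {
        while (true)
        {
            string fileToDownload;
            lock (padlock)
            {
                if (queue.Count == 0)
                    return; // queue empty: this worker is done
                fileToDownload = queue.Dequeue();
            }
            download(fileToDownload);
        }
    }
}
Usage would then be something like new FtpDownloader(files, download_FTP).DownloadAll(5);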
As noted elsewhere, threading may not speed things up at all - it depends where the bottleneck is. If the bottleneck is the user's network connection, you won't be able to get more data down the same size of pipe just by using multi-threading. On the other hand, if you have a lot of small files to download from different hosts, then it may be latency rather than bandwidth which is the problem, in which case threading will help.
Look up ParameterizedThreadStart:
// Note: to match ParameterizedThreadStart, download_FTP must take a single
// object parameter, e.g. void download_FTP(object file)
List<System.Threading.Thread> threadsToUse = new List<System.Threading.Thread>();
foreach (string str1 in files)
{
    var aThread = new System.Threading.Thread(
        new System.Threading.ParameterizedThreadStart(download_FTP));
    aThread.Start(str1);
    threadsToUse.Add(aThread);
}
You can then call Thread.Join on each thread in threadsToUse to block until all of the downloads have finished.
There is also something else you might want to look up, which I'm still trying to fully grasp: asynchronous delegates. With these you will know when the file has been downloaded; with a normal thread you're going to have to find another way to flag that it's finished.
This may or may not help your speed: if your line speed is low, it won't help much.
On the other hand, some servers cap each connection to a certain speed, in which case multiple connections should, in theory, give a slight increase in speed. How much of an increase, though, I cannot say.
Hope this helps in some way
I can add some experience to the comments already posted. In an app some years ago I had to generate a treeview of files on an FTP server. Listing files does not normally require actual downloading, but some of the files were zipped folders, and I had to download and unzip them (sometimes recursively) to display the files/folders inside. For a multithreaded solution, this required a 'FolderClass' for each folder that could keep state and so handle both unzipped and zipped folders. To start the operation off, one of these was set up with the root folder and submitted to a P-C queue and a pool of threads. As each folder was LISTed and iterated, more FolderClass instances were submitted to the queue for each subfolder. When a FolderClass instance reached the end of its LIST, it PostMessaged itself (it was not C#, for which you would need BeginInvoke or the like) to the UI thread, where its info was added to the listview.
This activity was characterised by a lot of latency-sensitive TCP connect/disconnect with occasional download/unzip.
A pool of, IIRC, 4-6 threads (as already suggested by other posters) provided the best performance on the single-core system I had at the time and, in this particular case, was much faster than a single-threaded solution. I can't remember the figures exactly, but no stopwatch was needed to detect the performance boost - something like 3-4 times faster. On a modern box with multiple cores, where LISTs and unzips could occur concurrently, I would expect even more improvement.
There were some problems - the visual ListView component could not keep up with the incoming messages (because of the multiple threads, data arrived for apparently 'random' positions in the treeview and so required continual tree navigation for display), and so the UI tended to freeze during the operation. Another problem was detecting when the operation had actually finished. These snags are probably not relevant to your download-many-small-files app.
Conclusion - I expect that downloading a lot of small files is going to be faster if multithreaded with multiple connections, if only because it mitigates the connect/disconnect latency, which can be larger than the actual data download time. In the extreme case of a satellite connection with high speed but very high latency, a large thread pool would provide a massive speedup.
Note the valid caveats from the other posters - if the server, (or its admin), disallows or gets annoyed at the multiple connections, you may get no boost, limited bandwidth or a nasty email from the admin!
Rgds,
Martin
I wrote an app that syncs local folders with online folders, but it eats all my bandwidth. How can I limit the amount of bandwidth the app uses (programmatically)?
Take a look at http://www.codeproject.com/KB/IP/MyDownloader.aspx
He's using the well-known technique found in Downloader.Extension\SpeedLimit.
Basically, before more data is read off the stream, a check is performed on how much data has actually been read since the previous iteration. If that rate exceeds the max rate, the read call is suspended for a very short time and the check is repeated. Most applications use this technique.
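A minimal sketch of that check-then-sleep loop (my own illustration; the 8 KB buffer and the bytes-per-second parameter are arbitrary choices):
using System.Diagnostics;
using System.IO;
using System.Threading;

static void CopyThrottled(Stream input, Stream output, long maxBytesPerSecond)
{
    var buffer = new byte[8192]; // illustrative buffer size
    var stopwatch = Stopwatch.StartNew();
    long totalBytes = 0;
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
        totalBytes += read;

        // How long should this many bytes have taken at the allowed rate?
        long expectedMs = totalBytes * 1000 / maxBytesPerSecond;
        long aheadMs = expectedMs - stopwatch.ElapsedMilliseconds;
        if (aheadMs > 0)
            Thread.Sleep((int)aheadMs); // ahead of schedule: pause briefly
    }
}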
Try this: http://www.netlimiter.com/ It's been on my "check this out" list for a long time (though I haven't tried it yet myself).
I'd say "don't". Unless you're doing something really wrong, your program shouldn't be hogging bandwidth. Your router should be balancing the available bandwidth between all requests.
I'd recommend you do the following:
a) Create MD5 hashes for all the files. Compare hashes and/or dates and sizes, and only sync the files that have changed. Unless you're syncing massive files, you shouldn't have to sync a whole lot of data.
b) Limit the sending rate. In your upload thread, read the files in 1-8 KB chunks and then call Thread.Sleep after every chunk to limit the rate. You have to do this on the upload side, however.
c) Pipe everything through a GZip stream (System.IO.Compression). For text files this can reduce the size of the data that needs to be transferred.
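For example (a sketch; networkStream and fileStream stand in for whatever streams the app already has open):
using System.IO.Compression;

// Compress on the fly as data is written to the network stream; the
// receiving side must wrap its end in a decompressing GZipStream.
using (var gzip = new GZipStream(networkStream, CompressionMode.Compress, true)) // leaveOpen: true
{
    fileStream.CopyTo(gzip);
}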
Hope this helps!