I have read a number of closely related questions but not one that hits this exactly. If it is a duplicate, please send me a link.
I am using the Angular version of the flow.js library for doing HTML5 file uploads (https://github.com/flowjs/ng-flow). This works very well and I am able to upload multiple files simultaneously in 1MB chunks. There is an ASP.NET Web API Files controller that accepts these and saves them on disk. Although I can make this work, I am not doing it efficiently and would like to know a better approach.
First, I used the MultipartFormDataStreamProvider in an async method that worked great as long as the file uploaded in a single chunk. Then I switched to just using the FileStream to write the file to disk. This also worked as long as the chunks arrived in order, but of course, I cannot rely on that.
Next, just to see it work, I wrote the chunks to individual file streams and combined them after the fact, hence the inefficiency. A 1GB file would generate a thousand chunks that needed to be read and rewritten after the upload was complete. I could hold all file chunks in memory and flush them after they are all uploaded, but I'm afraid the server would blow up.
It seems that there should be a nice asynchronous solution to this dilemma but I don't know what it is. One possibility might be to use async/await to combine previous chunks while writing the current chunk. Another might be to use Begin/EndInvoke to create a separate thread so that the file manipulation on disk is handled independently of the thread reading from the HttpContext, but this would rely on the ThreadPool and I'm afraid the created threads would be unduly terminated when my MVC controller returns. I could create a FileWatcher that ran completely independently of ASP.NET, but that would be very kludgey.
So my questions are, 1) is there a simple solution already that I am missing? (seems like there should be) and 2) if not, what is the best approach to solving this inside the Web API framework?
Thanks, bob
I'm not familiar with that kind of chunked upload, but I believe this should work:
Use flowTotalSize to pre-allocate the file when the first chunk comes in.
Have one SemaphoreSlim per file to serialize the asynchronous writes for that file.
Each chunk will write to its own offset (flowChunkSize * (flowChunkNumber - 1)) within the file.
This doesn't handle situations where the uploads are unexpectedly terminated. That kind of solution usually involves allocating/writing a temporary file (with a special extension) and then moving/renaming that file once the last chunk arrives.
Don't forget to ensure that your file writing is actually asynchronous.
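A minimal sketch of that approach, assuming the chunk bytes and the flow.js metadata (flowChunkNumber, flowChunkSize, flowTotalSize, flowFilename) have already been pulled out of the multipart request (e.g. via a MultipartMemoryStreamProvider); the ChunkWriter class and the path handling are placeholders, not part of any library:

using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkWriter
{
    // One SemaphoreSlim per target file so concurrent chunk writes to the same file are serialized.
    private static readonly ConcurrentDictionary<string, SemaphoreSlim> Locks =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public static async Task WriteChunkAsync(
        string path, byte[] chunk, int chunkNumber, int chunkSize, long totalSize)
    {
        var fileLock = Locks.GetOrAdd(path, _ => new SemaphoreSlim(1, 1));
        await fileLock.WaitAsync();
        try
        {
            // useAsync: true so that WriteAsync is real asynchronous I/O.
            using (var stream = new FileStream(path, FileMode.OpenOrCreate,
                FileAccess.Write, FileShare.None, bufferSize: 4096, useAsync: true))
            {
                if (stream.Length == 0)
                    stream.SetLength(totalSize);                       // pre-allocate on the first chunk

                stream.Position = (long)chunkSize * (chunkNumber - 1); // this chunk's own offset
                await stream.WriteAsync(chunk, 0, chunk.Length);
            }
        }
        finally
        {
            fileLock.Release();
        }
    }
}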
Using @Stephen Cleary's answer and this thread: https://github.com/flowjs/ng-flow/issues/41, I was able to make an ASP.NET Web API implementation and uploaded it for those still wondering about this question, such as @Herb Caudill:
https://github.com/samhowes/NgFlowSample/tree/master.
The original answer is the real answer to this question, but I don't have enough reputation yet to comment. I did not use a SemaphoreSlim, but instead enabled file write sharing (roughly as sketched below). I did, in fact, pre-allocate the file and make sure that each chunk gets written to the right location by calculating an offset.
I will be contributing this to the Flow samples at: https://github.com/flowjs/flow.js/tree/master/samples
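For those curious, the write-sharing variant looks roughly like this; it mirrors the WriteChunkAsync sketch above but relies on FileShare.Write instead of a lock. This is an illustration of the idea, not the exact code in the sample repo:

using System.IO;
using System.Threading.Tasks;

public static class SharedChunkWriter
{
    // No per-file lock: every request opens the pre-allocated file with write sharing
    // enabled and writes only to its own chunk's offset, so the writes never overlap.
    public static async Task WriteChunkAsync(string path, byte[] chunk, int chunkNumber, int chunkSize)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Write,
            FileShare.Write, bufferSize: 4096, useAsync: true))
        {
            stream.Position = (long)chunkSize * (chunkNumber - 1);
            await stream.WriteAsync(chunk, 0, chunk.Length);
        }
    }
}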
Here's my two cents on this old question. Most applications now use Azure or AWS for storage, but I'm still sharing my thoughts in case they help someone.
This is what I have done: upload the chunks, save them on the server, and store each chunk's location in the database along with its order (not the order in which it arrived, but its position in the final file).
Then I introduced another endpoint to merge those chunks (a rough sketch of the merge step follows at the end of this answer). Since this part can be a long process, I used a messaging service to run it in the background.
After the service has finished merging the file, it sends a notification (or you can trigger an event).
Agreed, it won't fix the problem of having to save all those chunks, but once the merge is done we can simply delete them from disk. There is, however, some IIS configuration required for the upload to work smoothly.
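Roughly, the merge step the background service runs looks like the following; it assumes the database hands back the chunk file paths already sorted by their position in the final file. The ChunkMerger name is made up for illustration:

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class ChunkMerger
{
    // chunkPaths must be ordered by each chunk's position in the final file, not by arrival order.
    public static async Task MergeAsync(IReadOnlyList<string> chunkPaths, string targetPath)
    {
        using (var target = new FileStream(targetPath, FileMode.Create, FileAccess.Write,
            FileShare.None, bufferSize: 81920, useAsync: true))
        {
            foreach (var chunkPath in chunkPaths)
            {
                using (var chunk = File.OpenRead(chunkPath))
                {
                    await chunk.CopyToAsync(target);
                }
            }
        }

        // Only after the merge has succeeded do we delete the individual chunk files.
        foreach (var chunkPath in chunkPaths)
            File.Delete(chunkPath);
    }
}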
Related
My team requires a bulletproof way to save a file (less than 100 KB) on Windows 10 IoT.
The file cannot be corrupted, but it's OK to lose the most recent version if a save fails because of power loss etc.
Since the file IO has changed significantly (no more File.Replace), we are not sure how to achieve this.
We can see that:
var file = await folder.CreateFileAsync(fileName, CreationCollisionOption.OpenIfExists);
await Windows.Storage.FileIO.WriteTextAsync(file, data);
is reliably unreliable (it repeatedly broke when we stopped debugging or reset the device), and we end up with a corrupted file (full of zeroes) and a .tmp file next to it. We can recover this .tmp file, but I'm not confident that we should base our solution on undocumented behaviour.
One way we want to try is:
var tmpfile = await folder.CreateFileAsync(fileName + ".tmp",
    CreationCollisionOption.ReplaceExisting);
await Windows.Storage.FileIO.WriteTextAsync(tmpfile, data);
var file = await folder.CreateFileAsync(fileName, CreationCollisionOption.OpenIfExists);
// can this end up with a corrupt or missing file?
await tmpfile.MoveAndReplaceAsync(file);
In summary, is there a safe way to save some text to a file that will never corrupt the file?
Not sure if there's a best practice for this, but if I needed to come up with something myself:
I would do something like calculating a checksum and saving it along with the file.
When saving the next time, don't overwrite the file; save the new version next to the previous one (which should be "known good"), and delete the previous one only after verifying that the new save completed successfully (i.e. the checksum matches).
I would also assume that a rename operation should not corrupt the file, but I haven't researched that.
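A minimal sketch of that idea using plain System.IO and SHA-256 (the same pattern can be expressed with the StorageFile APIs; the SafeSave name and the file-naming scheme are just illustrative):

using System;
using System.IO;
using System.Security.Cryptography;

public static class SafeSave
{
    // Write the new version next to the previous ("known good") one, verify it,
    // and only then remove the previous version.
    public static void Save(string path, byte[] data)
    {
        string newPath = path + ".new";
        string newHashPath = newPath + ".sha256";

        using (var sha = SHA256.Create())
        {
            File.WriteAllBytes(newPath, data);
            File.WriteAllText(newHashPath, Convert.ToBase64String(sha.ComputeHash(data)));

            // Verify the bytes that actually landed on disk before trusting the new file.
            string onDiskHash = Convert.ToBase64String(sha.ComputeHash(File.ReadAllBytes(newPath)));
            if (onDiskHash != File.ReadAllText(newHashPath))
                throw new IOException("New file failed verification; the previous version is untouched.");
        }

        // The previous version stays on disk until this point.
        File.Delete(path);
        File.Delete(path + ".sha256");
        File.Move(newPath, path);
        File.Move(newHashPath, path + ".sha256");
    }
}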
This article, Best practices for writing to files, has a good explanation of the underlying processes involved in writing to files in UWP.
The following common issues are highlighted:
A file is partially written.
The app receives an exception when calling one of the methods.
The operations leave behind .TMP files with a file name similar to the target file name.
What is not easily deduced from the discussion of the convenience-vs-control trade-off is that while create and edit operations are more prone to failure, because they do a lot of things, rename operations are far more fault tolerant when they are not physically writing bits around the file system.
Your suggestion of creating a temp file first is on the right track and may serve you well, but using MoveAndReplaceAsync means that you are still susceptible to these known issues if the destination file already exists.
UWP will use a transactional pattern with the file system and may create various backup copies of the source and the destination files.
You can take control of the final element by deleting the original file before calling MoveAndReplaceAsync, or you could simply use RenameAsync if your temp file is in the same folder; these have fewer moving parts, which should reduce the surface for failure.
@hansmbakker has an answer along these lines. How you identify that the file write was successful is up to you, but isolating the heavy write operation and verifying it before overwriting your original is a good idea if you need it to be bulletproof.
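Roughly, that pattern looks like this with the StorageFile APIs (a sketch only; error handling omitted, and the SafeTextWriter name is made up):

using System;
using System.Threading.Tasks;
using Windows.Storage;

public static class SafeTextWriter
{
    public static async Task SaveAsync(StorageFolder folder, string fileName, string data)
    {
        // Do the heavy write into a temp file in the same folder.
        StorageFile tmpFile = await folder.CreateFileAsync(
            fileName + ".tmp", CreationCollisionOption.ReplaceExisting);
        await FileIO.WriteTextAsync(tmpFile, data);

        // Remove the original (if any), then rename the temp file into place.
        // A same-folder RenameAsync avoids the copy machinery behind MoveAndReplaceAsync.
        var existing = await folder.TryGetItemAsync(fileName);
        if (existing != null)
            await existing.DeleteAsync();

        await tmpFile.RenameAsync(fileName);
    }
}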
About Failure
I have observed the .TMP files a lot when using the Append variants of FileIO writing. The .TMP files have the content of the original file before the append, but the actual file does not always have all of the original content; sometimes it's a mix of old and new content.
In my experience, UWP file writes are very reliable when your entire call structure to the write operation is asynchronous and correctly awaits the pipeline, AND you take steps to ensure that only one process is trying to access the same file at any point in time.
When you try to manipulate files from a synchronous context, you can start to see the "unreliable" nature you have identified; this happens a lot in code that is being transitioned from the old synchronous operations to the newer async variants of the FileIO operations.
Make sure the code calling your write method is non-blocking and correctly awaits; this will allow you to catch any exceptions that might be raised.
It is common for traditionally synchronous-minded developers to try to use a lock(){} pattern to ensure single access to the file, but you cannot easily await inside a lock, and attempts to do so often become the source of UWP file write issues.
If your code has a locking mechanism to ensure singleton access to the file, have a read over these articles for a different approach; they're old, but they're a good resource covering the transition of a traditionally synchronous C# developer into async and parallel development (a minimal sketch using SemaphoreSlim follows the links below).
What’s New For Parallelism in .NET 4.5
Building Async Coordination Primitives, Part 6: AsyncLock
Building Async Coordination Primitives, Part 7: AsyncReaderWriterLock
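As a minimal illustration of the idea behind those articles (not their full AsyncLock implementation), a SemaphoreSlim can act as an awaitable lock around the write:

using System;
using System.Threading;
using System.Threading.Tasks;
using Windows.Storage;

public class FileWriter
{
    // SemaphoreSlim can be awaited, unlike lock(){}, so the write stays fully asynchronous.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task WriteAsync(StorageFile file, string data)
    {
        await _gate.WaitAsync();
        try
        {
            await FileIO.WriteTextAsync(file, data);
        }
        finally
        {
            _gate.Release();
        }
    }
}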
Other times we encounter a synchronous constraint are when an Event, Timer or Dispose context is the trigger for writing to the file in the first place. There are different techniques involved there; please post another question that covers that scenario specifically if you think it might be contributing to your issues. :)
We’ve got a process that obtains a list of files from a remote transatlantic Samba share. This is naturally on the slow side, however it’s made worse by the fact that we don’t just need the names of the files, we need the last write times to spot updates. There are quite a lot of files in the directory, and as far as I can tell, the .NET file API insists on me asking for each one individually. Is there a faster way of obtaining the information we need?
I would love to find a way myself. I have exactly the same problem - huge number of files on a slow network location, and I need to scan for changes.
As far as I know, you do need to ask for file properties one by one.
The amount of information transferred per file should not be high, though; the round-trip request-response time is probably the main problem. You can help the situation by running multiple requests in parallel (e.g. using Parallel.ForEach).
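Something along these lines, for example; the RemoteScanner name is made up, and the degree of parallelism is a knob to tune for your own latency and bandwidth:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class RemoteScanner
{
    // Collect last-write times for every file on the share, issuing several
    // round trips at a time instead of one after another.
    public static IDictionary<string, DateTime> GetLastWriteTimes(string directory)
    {
        var result = new ConcurrentDictionary<string, DateTime>();
        Parallel.ForEach(
            Directory.EnumerateFiles(directory),
            new ParallelOptions { MaxDegreeOfParallelism = 16 },
            path => result[path] = File.GetLastWriteTimeUtc(path));
        return result;
    }
}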
The answer to your question is most likely no, at least not in a meaningful way.
Exactly how you enumerate the files in your code is almost irrelevant since they all boil down to the same file system API in Windows. Unfortunately, there is no function that returns a list of file details in one call*.
So, no matter what your code looks like, somewhere below, it's still enumerating the directory contents and calling a particular file function individually for each file.
If this is really a problem, I would look into moving the detection logic closer to the files and send your app the results periodically.
*Disclaimer: It's been a long time since I've been down this far in the stack and I'm just browsing the API docs now, there may be a new function somewhere that does exactly this.
Let's say I received a .csv file over the network,
so I have a byte[].
I also have a parser that reads .csv files and does business things with them,
using File.ReadAllLines().
So far I did:
File.WriteAllBytes(tempPath, incomingBuffer);
parser.Open(tempPath);
I won't ever need the actual file on this device, though.
Is there a way to "store" this file in some virtual place and "open" it again from there, but all in memory?
That would save me ages of waiting for the IO operations to complete (there's a good article on that on Coding Horror),
plus reduce wear on the drive (relevant if this occurred a few dozen times a minute, 24/7),
and in general eliminate a point of failure.
This is a bit in the UNIX direction, where everything is a file stream, but we're talking Windows here.
"I won't ever need the actual file on this device, though." - Well, you kind of do if all your APIs expect a file on disk.
You can:
1) Get decent APIs (I am sure there are CSV parsers that take a Stream as a constructor parameter - you could then use a MemoryStream, for example; a sketch follows at the end of this answer).
2) If performance is a serious issue and there is no way to change the APIs, there's one simple solution: write your own RAM disk implementation, which will cache everything that is needed and page things out to the HDD if necessary.
http://code.msdn.microsoft.com/windowshardware/RAMDisk-Storage-Driver-9ce5f699 (Oh did I mention that you absolutely need to have mad experience with drivers :p?)
There's also "ready" solutions for ramdisk(Google!), which means you can just run(in your application initializer) 'CreateRamDisk.exe -Hdd "__MEMDISK__"'(for example), and use File.WriteAllBytes("__MEMDISK__:\yourFile.csv");
Alternatively you can read about memory-mapped files(>= C# 4.0 has nice support). However, by the sounds of it, that probably does not help you too much.
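For option 1, if the parser can be changed (or wrapped) to read from a Stream instead of a file path, the round trip to disk disappears entirely. A minimal sketch, where the per-line processing stands in for your business logic:

using System.IO;

public static class InMemoryCsv
{
    public static void Parse(byte[] incomingBuffer)
    {
        // Wrap the received bytes in an in-memory stream; nothing touches the disk.
        using (var stream = new MemoryStream(incomingBuffer))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Equivalent of iterating File.ReadAllLines(tempPath) line by line.
                // ... do business things with `line` ...
            }
        }
    }
}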
I am writing an internet speed measurer. I want to measure internet speed without saving the remote file to either disk or memory - I need to fetch the data and forget it, so it seems that WebClient and StreamReader are not good for this (maybe I should inherit from them and override some private methods). How can I do this?
I think that when writing such a system you should also decide exactly what size of file you want to download.
Anyway, you can maybe download parts of the file if the "very large" or "infinite" size is a problem. You can use HttpWebRequest.AddRange for that:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.addrange.aspx
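For example, something like this asks the server for only the first few megabytes; the URL and the 5 MB range are placeholders:

using System.Net;

var request = (HttpWebRequest)WebRequest.Create("http://example.com/largefile.bin");
request.AddRange(0, 5 * 1024 * 1024 - 1);   // bytes 0 .. 5 MB - 1
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
    // Read and discard the response while timing it.
    var buffer = new byte[64 * 1024];
    while (stream.Read(buffer, 0, buffer.Length) > 0) { }
}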
You can use System.Net.WebRequest to measure the throughput, using asynchronous calls to capture the data as it is being read.
The MSDN example code for WebRequest.BeginGetResponse shows one method for using the asynchronous methods to read the data from a remote file as it is being received. The example stores the response data in a StringBuilder, but you can skip that since you're not actually interested in the data itself.
I added some timing code to their example and tested it against a couple of large file downloads; it seems to do the job you need it for.
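The await-based equivalent of that approach is roughly the following sketch: the data is read into a scratch buffer, counted, and thrown away, and the elapsed time gives you the throughput (the URL is whatever test file you choose):

using System;
using System.Diagnostics;
using System.Net;
using System.Threading.Tasks;

public static class SpeedTest
{
    // Download the resource asynchronously, discarding the data, and report throughput.
    public static async Task<double> MeasureMbitPerSecondAsync(string url)
    {
        var request = WebRequest.Create(url);
        var stopwatch = Stopwatch.StartNew();
        long totalBytes = 0;

        using (var response = await request.GetResponseAsync())
        using (var stream = response.GetResponseStream())
        {
            var buffer = new byte[64 * 1024];
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
                totalBytes += read;   // just count the bytes; nothing is stored
        }

        stopwatch.Stop();
        return totalBytes * 8 / 1e6 / stopwatch.Elapsed.TotalSeconds;
    }
}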
I'm working on improving the upload performance of a .NET app that uploads groups of largish (~15 MB each) files to S3.
I have adjusted multipart options (threads, chunk size, etc.) and I think I've improved that as much as possible, but while closely watching the network utilization, I noticed something unexpected.
I iterate over a number of files in a directory and then submit each of them for upload using an instance of the S3 transfer utility like so:
// prepare the upload
this._transferUtility.S3Client.PutBucket(new PutBucketRequest().WithBucketName(streamingBucket));
request = new TransferUtilityUploadRequest()
.WithBucketName(streamingBucket)
.WithFilePath(streamFile)
.WithKey(targetFile)
.WithTimeout(uploadTimeout)
.WithSubscriber(this.uploadFileProgressCallback);
// start the upload
this._transferUtility.Upload(request);
Then I watch for these to complete in the uploadFileProgressCallback specified above.
However when I watch the network interface, I can see a number of distinct "humps" in my outbound traffic graph which coincide precisely with the number of files I'm uploading to S3.
Since this is an asynchronous call, I was under the impression that each transfer would begin immediately, and I'd see a stepped increase in outbound data followed by a stepped decrease as each upload was completed. Based on what I'm seeing now, I wonder if these requests, while asynchronous to the calling code, are being queued up somewhere and then executed in series?
If so, I'd like to change that so the requests all begin uploading at (close to) the same time, so I can maximize the upload bandwidth I have available and reduce the overall execution time.
I poked around in the S3 .net SDK documentation but I couldn't find any mention of this queueing mechanism or any properties/etc. that appeared to provide a way of increasing the concurrency of these calls.
Any pointers appreciated!
This is something that's not intrinsically supported by the SDKs, maybe due to simplicity requirements? I implemented my own concurrent part uploads based on this article (a rough sketch follows after the observations below):
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
Some observations:
This approach is good only when you have the complete content in memory, as you have to break it into chunks and wrap them up in part uploads. In many cases it may not make sense to hold data on the order of GBs in memory just so that you can do concurrent uploads. You may have to evaluate the trade-off there.
The SDKs have a limit of up to 16 MB for a single PUT upload, and any file beyond this size is divided into 5 MB chunks for part uploads. Unfortunately these values are not configurable, so I had to pretty much write my own multipart upload logic. The values mentioned above are for the Java SDK, and I'd expect them to be the same for the C# one too.
All operations are non-blocking, which is good.
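For reference, a rough outline of that kind of concurrent part upload using the low-level multipart API of the AWS SDK for .NET (shown in the later property-based style; the older fluent With* style is equivalent). Here the parts are read straight from the file on disk via FilePath/FilePosition, which sidesteps the in-memory concern above for the files-on-disk case; error handling and AbortMultipartUpload on failure are omitted, and the class name and parallelism degree are placeholders:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class ConcurrentMultipartUploader
{
    // Upload one file as an S3 multipart upload, running the part uploads in parallel.
    public static void Upload(IAmazonS3 client, string bucket, string key, string filePath)
    {
        const long partSize = 5 * 1024 * 1024;                 // 5 MB, the S3 minimum part size
        long fileLength = new FileInfo(filePath).Length;
        int partCount = (int)Math.Ceiling(fileLength / (double)partSize);

        var init = client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key
        });

        var responses = new ConcurrentBag<UploadPartResponse>();
        Parallel.For(0, partCount, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
        {
            responses.Add(client.UploadPart(new UploadPartRequest
            {
                BucketName = bucket,
                Key = key,
                UploadId = init.UploadId,
                PartNumber = i + 1,
                PartSize = Math.Min(partSize, fileLength - i * partSize),
                FilePath = filePath,
                FilePosition = i * partSize
            }));
        });

        var complete = new CompleteMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId
        };
        complete.AddPartETags(responses.OrderBy(r => r.PartNumber));
        client.CompleteMultipartUpload(complete);
    }
}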
In C# you could try to set the part size manually:
TransferUtilityUploadRequest request = new TransferUtilityUploadRequest()
    .WithPartSize(??);
Or
TransferUtilityConfig utilityConfig = new TransferUtilityConfig();
utilityConfig.MinSizeBeforePartUpload = ??;
But I don't know the defaults.
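For completeness, setting those two knobs looks something like this in the later property-based SDK style; the 5 MB figures and the bucket/file names are example values only, not the defaults:

using Amazon.S3;
using Amazon.S3.Transfer;

var config = new TransferUtilityConfig
{
    MinSizeBeforePartUpload = 5 * 1024 * 1024   // example threshold for switching to multipart
};
var transferUtility = new TransferUtility(new AmazonS3Client(), config);

transferUtility.Upload(new TransferUtilityUploadRequest
{
    BucketName = "my-bucket",          // placeholder
    FilePath = @"C:\uploads\file.bin", // placeholder
    Key = "file.bin",                  // placeholder
    PartSize = 5 * 1024 * 1024         // example part size
});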