I need to download a file from a website using multiple threads, each downloading a different section of the file at once. I know I can use the WebClient.DownloadFile method, but it doesn't support downloading a file in chunks. If you could point me to a tutorial or give me an idea of how to do it, I would appreciate it. Thanks!
The server at the other end, the one providing the file, has to support downloading in chunks as well. It would need some way to specify which byte position in the file you want to start at, instead of starting at the first and sending until the client stops accepting them, or it reaches the end of the file.
Assuming the server does support that, it would provide some kind of documentation on how to use it, and you would definitely find help here turning that into code.
To piggyback on Rex's answer, there is no foolproof way to know. Some web servers will provide you with a content length, and some will return -1 for the length. Annoying, I know.
Your best bet is to specify a fixed range and use some heuristics or analysis to estimate the length of your chunks over time.
You'll also want to look at this similar SO question on Multipart Downloading in C#.
The WebClient object has a 'Headers' property, which should let you define a 'Range' header to ask for only a part of the file.
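If setting a Range header directly on WebClient.Headers is rejected (Range is on the restricted-header list in some framework versions), HttpWebRequest.AddRange does the same job. A minimal sketch, where the URL, local path and offsets are all placeholders and each range is assumed to run on its own thread:

// requires: using System.IO; using System.Net;
// Assumes the server answers with 206 Partial Content for ranged requests.
static void DownloadRange(string url, string localPath, long from, long to)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.AddRange(from, to);                      // sends "Range: bytes=from-to"

    using (var response = request.GetResponse())
    using (var body = response.GetResponseStream())
    using (var file = new FileStream(localPath, FileMode.OpenOrCreate,
                                     FileAccess.Write, FileShare.Write))
    {
        file.Seek(from, SeekOrigin.Begin);           // each thread writes its own slice
        body.CopyTo(file);
    }
}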
There are a lot of ifs here, but if you are downloading, say, a giant text file, you could actually split it into many files on the server and return the addresses of each to the client (or use a filename convention and just report how many sections there are). The client could in turn spin up threads to download each of the sections, which it could then reconstitute into a single, large file.
I'm not sure of your use case, but this particular scenario may not be likely to make anything go faster, if that is the idea.
We’ve got a process that obtains a list of files from a remote transatlantic Samba share. This is naturally on the slow side, but it’s made worse by the fact that we don’t just need the names of the files; we need the last write times to spot updates. There are quite a lot of files in the directory, and as far as I can tell, the .NET file API insists on me asking for each one individually. Is there a faster way of obtaining the information we need?
I would love to find a way myself. I have exactly the same problem - huge number of files on a slow network location, and I need to scan for changes.
As far as I know, you do need to ask for file properties one by one.
The amount of information transferred per file should not be high, though; the round-trip request-response time is probably the main problem. You can help the situation by running multiple requests in parallel (e.g. using Parallel.ForEach).
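A rough sketch of that approach; the UNC path and degree of parallelism are placeholders, and each property read still costs one round trip, but several are in flight at once:

// requires: using System; using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
var lastWrites = new ConcurrentDictionary<string, DateTime>();

Parallel.ForEach(
    Directory.EnumerateFiles(@"\\remote-server\share\logs"),    // placeholder share
    new ParallelOptions { MaxDegreeOfParallelism = 16 },        // tune for your latency
    path => lastWrites[path] = File.GetLastWriteTimeUtc(path)); // one round trip per file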
The answer to your question is most likely no, at least not in a meaningful way.
Exactly how you enumerate the files in your code is almost irrelevant since they all boil down to the same file system API in Windows. Unfortunately, there is no function that returns a list of file details in one call*.
So, no matter what your code looks like, somewhere below, it's still enumerating the directory contents and calling a particular file function individually for each file.
If this is really a problem, I would look into moving the detection logic closer to the files and send your app the results periodically.
*Disclaimer: It's been a long time since I've been down this far in the stack and I'm just browsing the API docs now, there may be a new function somewhere that does exactly this.
I have read a number of closely related questions but not one that hits this exactly. If it is a duplicate, please send me a link.
I am using an Angular version of the flowjs library for doing HTML5 file uploads (https://github.com/flowjs/ng-flow). This works very well, and I am able to upload multiple files simultaneously in 1MB chunks. There is an ASP.NET Web API Files controller that accepts these and saves them to disk. Although I can make this work, I am not doing it efficiently and would like to know a better approach.
First, I used the MultipartFormDataStreamProvider in an async method; that worked great as long as the file was uploaded in a single chunk. Then I switched to just using a FileStream to write the file to disk. This also worked as long as the chunks arrived in order, but of course I cannot rely on that.
Next, just to see it work, I wrote the chunks to individual file streams and combined them after the fact, hence the inefficiency. A 1GB file would generate a thousand chunks that needed to be read and rewritten after the upload was complete. I could hold all file chunks in memory and flush them after they are all uploaded, but I'm afraid the server would blow up.
It seems that there should be a nice asynchronous solution to this dilemma, but I don't know what it is. One possibility might be to use async/await to combine previous chunks while writing the current chunk. Another might be to use Begin/EndInvoke to create a separate thread so that the file manipulation on disk is handled independently of the thread reading from the HttpContext, but this would rely on the ThreadPool, and I'm afraid the created threads will be unduly terminated when my MVC controller returns. I could create a FileWatcher that runs completely independently of ASP.NET, but that would be very kludgy.
So my questions are, 1) is there a simple solution already that I am missing? (seems like there should be) and 2) if not, what is the best approach to solving this inside the Web API framework?
Thanks, bob
I'm not familiar with that kind of chunked upload, but I believe this should work:
Use flowTotalSize to pre-allocate the file when the first chunk comes in.
Have one SemaphoreSlim per file to serialize the asynchronous writes for that file.
Each chunk will write to its own offset (flowChunkSize * (flowChunkNumber - 1)) within the file.
This doesn't handle situations where the uploads are unexpectedly terminated. That kind of solution usually involves allocating/writing a temporary file (with a special extension) and then moving/renaming that file once the last chunk arrives.
Don't forget to ensure that your file writing is actually asynchronous.
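A rough sketch of that scheme (the flowChunkNumber/flowChunkSize/flowTotalSize values come from the flow.js form fields; the semaphore lookup, method name and file path handling are my own assumptions, not the library's API):

// requires: using System.Collections.Concurrent; using System.IO; using System.Threading; using System.Threading.Tasks;
static readonly ConcurrentDictionary<string, SemaphoreSlim> FileLocks =
    new ConcurrentDictionary<string, SemaphoreSlim>();

static async Task WriteChunkAsync(string filePath, long flowTotalSize,
                                  int flowChunkNumber, int flowChunkSize, Stream chunk)
{
    var gate = FileLocks.GetOrAdd(filePath, _ => new SemaphoreSlim(1, 1));
    await gate.WaitAsync();                                       // serialize writes per file
    try
    {
        using (var file = new FileStream(filePath, FileMode.OpenOrCreate, FileAccess.Write,
                                         FileShare.None, 4096, useAsync: true))
        {
            if (file.Length == 0)
                file.SetLength(flowTotalSize);                    // pre-allocate on first chunk
            file.Seek((long)flowChunkSize * (flowChunkNumber - 1), SeekOrigin.Begin);
            await chunk.CopyToAsync(file);                        // chunks may arrive out of order
        }
    }
    finally
    {
        gate.Release();
    }
}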
Using @Stephen Cleary's answer, and this thread: https://github.com/flowjs/ng-flow/issues/41, I was able to make an ASP.NET Web API implementation and uploaded it for those still wondering about this question, such as @Herb Caudill:
https://github.com/samhowes/NgFlowSample/tree/master.
The original answer is the real answer to this question, but I don't have enough reputation yet to comment. I did not use a SemaphoreSlim, but instead enabled file write sharing. I did, in fact, pre-allocate the file and make sure that each chunk gets written to the right location by calculating an offset.
I will be contributing this to the Flow samples at: https://github.com/flowjs/flow.js/tree/master/samples
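For what it's worth, the write-sharing variant looks roughly like this (a sketch inside an async action, not the sample's exact code; filePath, chunkSize, chunkNumber and chunkStream are assumed variables):

// Each request opens the pre-allocated file with write sharing, so concurrent
// chunks can write to their own offsets without a per-file semaphore.
using (var file = new FileStream(filePath, FileMode.Open, FileAccess.Write,
                                 FileShare.ReadWrite, 4096, useAsync: true))
{
    file.Seek((long)chunkSize * (chunkNumber - 1), SeekOrigin.Begin);
    await chunkStream.CopyToAsync(file);
}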
Here's my two cents on this old question. Most applications now use Azure or AWS for storage, but I'm still sharing my thoughts in case they help someone.
This is what I have done: upload the chunks, save them on the server, and save the location of each chunk in the database along with its order (not the order it arrived in, but the order of the chunk within the file).
Then I introduced another endpoint to merge those chunks. Since this can be a long-running process, I used a messaging service to run it in the background.
After the service is done merging the file, it sends a notification (or you can trigger an event).
Agreed, this doesn't avoid having to save all those chunks, but once the merge is done we can just delete them from disk. Note that some IIS configuration is required for the upload to work smoothly.
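A minimal sketch of the merge step described above, assuming the chunk paths come back from the database already ordered by their position in the file:

// requires: using System.Collections.Generic; using System.IO;
static void MergeChunks(IEnumerable<string> orderedChunkPaths, string targetPath)
{
    using (var target = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
    {
        foreach (var chunkPath in orderedChunkPaths)   // ordered by chunk index, not arrival
        {
            using (var chunk = File.OpenRead(chunkPath))
                chunk.CopyTo(target);
            File.Delete(chunkPath);                    // chunk is no longer needed after merging
        }
    }
}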
I am writing an internet speed meter. I want to measure internet speed without saving the remote file to either disk or memory - I need to fetch the data and forget it, so it seems that WebClient and StreamReader are not good for this (maybe I should inherit from them and override some private methods). How can I do this?
I think that when writing such a system you should also decide exactly what size of file you want to download.
Anyway, you can maybe download parts of the file if the "very large" or "infinite" size is a problem. You can maybe use HttpWebRequest.AddRange
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.addrange.aspx
You can use System.Net.WebRequest to measure the throughput, using asynchronous calls to capture the data as it is being read.
The MSDN example code for WebRequest.BeginGetResponse shows one method for using the asynchronous methods to read the data from a remote file as it is being received. The example stores the response data in a StringBuilder, but you can skip that since you're not actually interested in the data itself.
I added some timing code to their example and tested it against a couple of large file downloads; it seems to do the job you need it for.
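A simplified sketch of the idea (synchronous for brevity, whereas the MSDN sample uses Begin/EndGetResponse; the URL is a placeholder and the data is thrown away after being counted):

// requires: using System; using System.Diagnostics; using System.Net;
var request = WebRequest.Create("http://example.com/largefile.bin");   // placeholder URL
var stopwatch = Stopwatch.StartNew();
long totalBytes = 0;

using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
    var buffer = new byte[81920];
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        totalBytes += read;                                            // count, then forget the data
}

stopwatch.Stop();
Console.WriteLine("{0:F2} Mbit/s",
    totalBytes * 8 / 1e6 / stopwatch.Elapsed.TotalSeconds);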
We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings and other specific operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (and RichTextBox) don't handle the display of large text very well. I managed to implement a viewer using a DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas or can highlight problems with my approach, thank you.
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
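For example, continuing from the snippet above, a LINQ query that keeps the line numbers for display and highlighting ("ERROR" is just a placeholder search term):

// requires: using System.Linq;
var matches = lines
    .Select((text, index) => new { Number = index + 1, Text = text })  // keep 1-based line numbers
    .Where(line => line.Text.Contains("ERROR"))
    .ToList();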
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
// Lazily yields one line at a time, so the whole file never has to be held in memory.
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName)) {
        while (!streamReader.EndOfStream) {
            yield return streamReader.ReadLine();
        }
    }
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
.AsParallel()
.AsOrdered()
.Where(line => ...)
.ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
This solution assumes that you can extract the lines you need by doing a single forward pass of the file. If you need to do multiple passes, perhaps based on user input, you may need to cache all lines from the file in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files. Caching 1 GB can strain even a powerful computer, but using less memory and more CPU, as I suggest, will allow you to search very big files within a reasonable time on a modern desktop PC.
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday or do you have the luxury of time to dig into the problem and learning something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if that is true, access the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
I would suggest using MemoryMappedFile, available in .NET 4 (or via DllImport in previous versions), to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
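A rough sketch of that idea; the offset and window size are placeholders, the caller is assumed to clamp them to the file length, and you would still need a separate index of line-start offsets to map scroll position to a byte offset:

// requires: using System.IO; using System.IO.MemoryMappedFiles; using System.Text;
static string ReadWindow(string path, long offset, int windowSize)
{
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var view = mmf.CreateViewStream(offset, windowSize, MemoryMappedFileAccess.Read))
    using (var reader = new StreamReader(view, Encoding.UTF8))
    {
        return reader.ReadToEnd();   // only this window of the file is paged in
    }
}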
I am currently working on a small project, where I need to send a potentially large file over the internet.
After some debating I decided to go with the streaming option instead of a chunking approach. The files can potentially be very big; I don't really want to specify an exact upper bound - 2GB, maybe 4GB, who knows.
Naturally this can take a long time. Again I don't really want to have a timeout. It just takes as long as it takes, doesn't matter.
While poking around trying different files of varying size, I slowly, step by step, tuned the properties of my BasicHttpBinding. I am just wondering if the values I came up with are basically okay, or if they are totally evil?
transferMode="Streamed"
sendTimeout="10675199.02:48:05.4775807"
receiveTimeout="10675199.02:48:05.4775807"
openTimeout="10675199.02:48:05.4775807"
closeTimeout="10675199.02:48:05.4775807"
maxReceivedMessageSize="9223372036854775807"
This just doesn't feel right somehow, these are just the maximum possible values for each underlying data structure. But I don't know what else to do.
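For comparison, here is how those values translate to code if the binding is configured programmatically rather than in config (just a sketch; the literals above are exactly TimeSpan.MaxValue and long.MaxValue):

// requires: using System; using System.ServiceModel;
var binding = new BasicHttpBinding
{
    TransferMode           = TransferMode.Streamed,
    SendTimeout            = TimeSpan.MaxValue,    // 10675199.02:48:05.4775807
    ReceiveTimeout         = TimeSpan.MaxValue,
    OpenTimeout            = TimeSpan.MaxValue,
    CloseTimeout           = TimeSpan.MaxValue,
    MaxReceivedMessageSize = long.MaxValue         // 9223372036854775807
};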
So again:
Is this basically the right approach? Or did I completely misunderstand and misuse the framework here?
Thanks
Well, a more natural approach might be to send the file as a sequence of mid-size chunks, with a final message to commit; this also makes it possible to resume after an error. There is perhaps a slight DoS issue with fully open numbers, though...
I've already had a problem with streaming when the connection between the WCF client and server goes through a VPN. If you're interested, read more in this thread.
If the stream is big enough to be streamed for more than a minute, an exception occurs.