How to asynchronously download millions of files from a file storage? - c#

Let's assume I have a database managing millions of documents, which are stored on a WebDAV or SMB server that does not support retrieving documents in bulk.
Given a list of (potentially all) document IDs, how do I download the corresponding documents as fast as possible?
Iterating over the list and sequentially downloading them is far too slow.
The two options I see are threads and async downloads.
My gut says that async programming should be preferred to threads, because I'm just waiting for IO on the client side. But I am rather new to async programming and I don't know how to do it.
I assume that iterating over the whole list and sending an async download request for each document could lead to too many requests in a very short time, and therefore to rejected requests. So how do I throttle this? Is there a best-practice way to do this?

Take a look at this: How to limit the amount of concurrent async I/O? Using a SemaphoreSlim, as suggested in the accepted answer, is an easy and quite good solution.
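For example, a minimal sketch of the SemaphoreSlim approach (documentIds, GetDocumentUri and targetFolder are placeholders for your own lookup logic, and the limit of 20 is just a starting point to tune against your server):
// using System.IO; using System.Linq; using System.Net.Http;
// using System.Threading; using System.Threading.Tasks;
var httpClient = new HttpClient();
var throttler = new SemaphoreSlim(20); // at most 20 downloads in flight

var tasks = documentIds.Select(async id =>
{
    await throttler.WaitAsync();
    try
    {
        var bytes = await httpClient.GetByteArrayAsync(GetDocumentUri(id));
        File.WriteAllBytes(Path.Combine(targetFolder, id + ".dat"), bytes);
    }
    finally
    {
        throttler.Release();
    }
}).ToList();

await Task.WhenAll(tasks);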
My personal favorite though for this kind of job is the TPL Dataflow library. You can see here an example of using this library to download pages from the web asynchronously with a configurable level of concurrency, in combination with the HttpClient class. Here is another example.
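As a rough ActionBlock sketch with HttpClient (requires the System.Threading.Tasks.Dataflow package; GetDocumentUri and targetFolder are again placeholders, and the numbers are only a starting point):
// using System.IO; using System.Net.Http; using System.Threading.Tasks.Dataflow;
var httpClient = new HttpClient();

var downloader = new ActionBlock<string>(async id =>
{
    var bytes = await httpClient.GetByteArrayAsync(GetDocumentUri(id));
    File.WriteAllBytes(Path.Combine(targetFolder, id + ".dat"), bytes);
},
new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10, // concurrent downloads
    BoundedCapacity = 100        // keeps the input queue from growing without bound
});

foreach (var id in documentIds)
    await downloader.SendAsync(id); // waits while the block is full

downloader.Complete();
await downloader.Completion;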

I also found this great article explaining 4 different ways to limit the number of concurrent downloads.

Related

File transfer and bandwidth limitations c#

I need to copy a lot of files from many file systems into one big storage.
I also need to limit the bandwidth of the file transfer, because the network is not stable and I need the bandwidth for other things.
Another requirement is that it be done in C#.
I thought about using Microsoft File Sync Framework, but I think that it doesn't provide bandwidth limitations.
I also thought about robocopy, but it is an external process and handling errors might be a problem.
I looked at BITS, but there is a problem with the scalability of the jobs: I will need to transfer more than 100 files, and that means 100 jobs at the same time.
Any suggestions or recommendations?
Thank you
I'd take a look at How to improve the Performance of FtpWebRequest? Though it might not be exactly what you're looking for, it should give you some ideas.
I think you'll want some sort of limited tunnel, so the negotiating processes can't claim more bandwidth because none is available to them. A connection within a connection.
Alternatively, you could build a job queue that holds off on sending all files at the same time and instead sends n files, waiting until one is done before starting the next.
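For example, a rough sketch of that queue (CopyFileAsync stands in for your own copy routine and maxConcurrent is whatever limit your network can take):
// using System.Collections.Generic; using System.Threading.Tasks;
const int maxConcurrent = 5; // "n" files in flight at once
var running = new List<Task>();

foreach (var file in filesToCopy)
{
    if (running.Count >= maxConcurrent)
    {
        var finished = await Task.WhenAny(running); // wait for one copy to finish
        running.Remove(finished);
    }
    running.Add(CopyFileAsync(file)); // placeholder for your throttled copy
}

await Task.WhenAll(running);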
Well, you could just use the usual I/O methods (read + write) and throttle the rate.
A simple example (not exactly great, but working) would be something like this:
var buffer = new byte[4096];
int bytesRead;
while ((bytesRead = await fsInput.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    await fsOutput.WriteAsync(buffer, 0, bytesRead);
    await Task.Delay(100); // crude throttle: pause between chunks to cap the transfer rate
}
Now, obviously, bandwidth throttling isn't really the job of the application. Unless you're on a very simple network, this should be handled by the QoS of the router, which should ensure that the various services get their share of the bandwidth - a steady stream of data will usually have a lower QoS priority. Of course, it does usually require you to have a network administrator.

Dealing with a very large number of files

I am currently working on a research project that involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text, with file sizes ranging from a few KB to more than 100 MB.
With all the zip and rar files extracted, I get a final total of one million files.
I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm) and I use Lucene.net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.
I have a few questions:
Would I get performance benefits if I use TPL? It is mainly an I/O process and from what I understand, TPL doesn't offer much benefit when you heavily use I/O.
Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using BlockingCollection.
Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...
Should I use asynchronous programming or stick to synchronous in this case?
async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:
consume as little CPU as possible (this is where async/await will help)
perform multiple operations at the same time, in parallel
control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files.
You may take a look at a small project I published on github:
Parallel tree walker
It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.
For example:
await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    MaxDegreeOfParallelism = 10,
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;
        await DoStuffAsync(el);
    }
});
(if you cannot use the tool directly as a dll, you may still find some useful examples in the source code)
You could use Everything Search. The SDK is open source and has C# examples.
It's the fastest way to index files on Windows I've seen.
From FAQ:
1.2 How long will it take to index my files?
"Everything" only uses file and folder names and generally takes a few seconds to build its > database.
A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index.
1,000,000 files will take about 1 minute.
I'm not sure if you can use TPL with it though.

Design: Queue Management question (C#)

I want to build a Windows service that will use a remote encoding service (like encoding.com, zencoder, etc.) to upload video files for encoding, download them after the encoding process is complete, and process them.
In order to do that, I was thinking about having different queues: one for handling files currently waiting, one for files being uploaded, one for files waiting for encoding to complete, and one more for downloading them. Each queue has a limitation, for example only 5 files can be uploaded for encoding at any given time. The queues have to be visible and able to survive a crash - currently we do that by writing the queue to an SQL table and managing the number of items in a separate table.
I also want the queues to run in the background, independent of each other, but able to transfer files from one queue to another as the process goes on.
My biggest question mark is about how to build and manage the queues, and less about limiting the number of items in each queue.
I am not sure what is the right approach for this and would really appreciate any help.
Thanks!
You probably don't need to separate the work into separate queues, as long as they are logically separated in some way (tagged with different "job types" or such).
As I see it, the challenge is to not pick up and process more than a given limited number of jobs from the queue, based on the type of job. I had a somewhat similar issue a while ago which led to a question here on SO, and a subsequent blog post with my solution, both of which might give you some ideas.
In short, my solution was to keep a list of "tokens". Whenever I want to perform a job that has some sort of limitation, I first pick up a token. If no tokens are available, I have to wait for one to become available. Then you can use whatever queueing mechanism is suitable to handle the queue as such.
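For example, a minimal sketch of the token idea using one SemaphoreSlim per job type as the token pool (the job-type names and the limits are made up):
// using System; using System.Collections.Generic; using System.Threading; using System.Threading.Tasks;
class TokenPool
{
    // e.g. at most 5 uploads and 5 downloads in flight at any time
    private readonly Dictionary<string, SemaphoreSlim> _tokens =
        new Dictionary<string, SemaphoreSlim>
        {
            { "upload",   new SemaphoreSlim(5) },
            { "download", new SemaphoreSlim(5) }
        };

    public async Task RunAsync(string jobType, Func<Task> job)
    {
        var pool = _tokens[jobType];
        await pool.WaitAsync();        // take a token; waits if none is free
        try { await job(); }
        finally { pool.Release(); }    // hand the token back
    }
}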
There are various ways to approach this, and which one suits your case depends on reliability and resilience versus development and maintenance cost. You need to answer questions such as: if the server crashes, is it important to carry on with what you were doing?
The queue can be implemented in MSMQ, SQL Server, or simply in code with all queues in memory. For workflow you can use Windows Workflow Foundation, or implement it yourself, which would probably be easier to build but more difficult to change.
So if you give a few more hints, I should be able to help you better.

Scaling up Multiple HttpWebRequests?

I'm building a server application that needs to perform a lot of http requests to a couple other servers on an ongoing basis. Currently, I'm basically setting up about 30 threads and continuously running HttpWebRequests synchronously on each thread, achieving a throughput of about 30 requests per second.
I am indeed setting the ServicePoint ConnectionLimit in the app.config so that's not the limiting factor.
I need to scale this up drastically. At the very least I'll need some more CPU horsepower, but I'm wondering if I would gain any advantages by using the async methods of the HttpWebRequest object (eg: .BeginGetResponse() ) as opposed to creating threads myself and using the synchronous methods (eg: .GetResponse() ) on these threads.
If I go with the async methods, I obviously have to significantly redesign my app, so I'm wondering if anyone might have some insight before I go and recode everything, in case I'm out to lunch.
Thanks!
If you are on Windows NT, the System.Net.Sockets.Socket class always uses I/O completion ports for async operations. And HttpWebRequest in async mode uses async sockets, and hence will be using IOCP.
Without doing detailed benchmarking, it is difficult to say if your bottleneck is inside HttpWebRequest, or up the stack in your application, or on the remote side, in the server. But offhand, for sure, async will give you better performance, because it will end up using IOCP under the covers. And reimplementing the app for async is not that difficult.
So, I would suggest that you first change your app architecture to async. Then see how much max throughput you are getting. Then you can start benchmarking and finding out where the bottleneck is, and removing that.
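If you want a stepping stone before a full redesign, you can wrap the existing begin/end pair in a Task with Task.Factory.FromAsync (url and ProcessResponse are placeholders for your own code):
// using System.Net; using System.Threading.Tasks;
var request = (HttpWebRequest)WebRequest.Create(url);
var responseTask = Task.Factory.FromAsync<WebResponse>(
    request.BeginGetResponse, request.EndGetResponse, null);

responseTask.ContinueWith(t =>
{
    using (var response = (HttpWebResponse)t.Result)
    using (var stream = response.GetResponseStream())
    {
        ProcessResponse(stream); // placeholder for your own handling
    }
});
You can start many of these at once without dedicating a thread to each; the continuation runs on a ThreadPool thread when the response arrives.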
The fastest result so far for me is using 75 threads running synchronous HttpWebRequest.
About 140 requests per second on a Windows 2003 server, 4-core 3 GHz, 100 Mbit connection.
Async HttpWebRequest / Winsock got stuck at about 30-50 req/sec. I did not test sync Winsock, but I guess it would give about the same result as HttpWebRequest.
Tests were run against 1,200,000 blog feeds.
I've been struggling with this for the last month, so it would be interesting to know if someone has managed to squeeze more out of .NET.
EDIT
New test: got 350 req/sec with the xfserver IOCP component. Used a bunch of threads with one instance each before seeing any greater result. The "client part" of the lib had a couple of really annoying bugs that made implementation harder than the "server part". Not what you're asking for, and not recommended, but some kind of step.
Next: the former Winsock test did not use the .NET 3.5 SocketAsyncEventArgs; that will be next.
ANSWER
The answer to your question: no, it will not be worth the effort.
The async HttpWebRequest methods offload the main thread while keeping the download in the background; they do not improve the number/scalability of requests (at least not in 3.5, it might be different in 4.0?).
However, what might be worth looking at is building your own wrapper around async sockets/SocketAsyncEventArgs, where IOCP works, and perhaps implementing a begin/end pattern similar to HttpWebRequest (for the easiest possible integration into the current code). The improvement is really enormous.
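To give an idea of what such a wrapper sits on top of, here is a rough, non-production sketch of a single raw HTTP GET via SocketAsyncEventArgs (the host, port and request string are placeholders; a real wrapper would pool the SocketAsyncEventArgs instances and add timeouts, full response reading and error handling):
// using System; using System.Net; using System.Net.Sockets; using System.Text; using System.Threading;
class SaeaHttpGetSketch
{
    static readonly ManualResetEvent Done = new ManualResetEvent(false);

    static void Main()
    {
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        var connectArgs = new SocketAsyncEventArgs { RemoteEndPoint = new DnsEndPoint("example.com", 80) };
        connectArgs.Completed += (s, a) => OnConnected(socket, a);

        // If the call completes synchronously it returns false and Completed
        // is not raised, so invoke the handler ourselves.
        if (!socket.ConnectAsync(connectArgs))
            OnConnected(socket, connectArgs);

        Done.WaitOne();
    }

    static void OnConnected(Socket socket, SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success) { Done.Set(); return; }

        var request = Encoding.ASCII.GetBytes(
            "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
        var sendArgs = new SocketAsyncEventArgs();
        sendArgs.SetBuffer(request, 0, request.Length);
        sendArgs.Completed += (s, a) => OnSent(socket, a);
        if (!socket.SendAsync(sendArgs))
            OnSent(socket, sendArgs);
    }

    static void OnSent(Socket socket, SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success) { Done.Set(); return; }

        var receiveArgs = new SocketAsyncEventArgs();
        receiveArgs.SetBuffer(new byte[8192], 0, 8192);
        receiveArgs.Completed += (s, a) => OnReceived(a);
        if (!socket.ReceiveAsync(receiveArgs))
            OnReceived(receiveArgs);
    }

    static void OnReceived(SocketAsyncEventArgs e)
    {
        Console.WriteLine("Received {0} bytes of the response", e.BytesTransferred);
        Done.Set();
    }
}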

Reducing loading time of 100 pages of Google

For my project I need to access entire pages (100) of Google at a time for a particular keyword. I used a 'for' loop in my C# code to access the page URLs, but it is taking too long, and sometimes it shows an HttpRequest error. Is there any way to increase the speed?
Query them in parallel. HTTP is asynchronous by nature, and so should your request code be.
In your case, speed is limited by the time it takes to fulfill an I/O request. You can speed up the total task by accessing servers in parallel (e.g. using the ThreadPool). A browser will generally use a few (2-8) parallel I/O requests to a server, and so could you (useful, for instance, if you also need image files or CSS files referenced by the Google result). Since you'll have up to 100 servers, you can do it massively in parallel; again, a task the ThreadPool will help you with.
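For example, a rough sketch of fetching the result pages in parallel with a capped degree of parallelism (the Google URL format and the limit of 8 are only illustrative):
// using System.Collections.Concurrent; using System.Linq; using System.Net; using System.Threading.Tasks;
var urls = Enumerable.Range(0, 100)
    .Select(page => "http://www.google.com/search?q=keyword&start=" + page * 10) // illustrative URL only
    .ToList();

var pages = new ConcurrentBag<string>();

Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = 8 }, url =>
{
    using (var client = new WebClient())
    {
        pages.Add(client.DownloadString(url)); // each request runs on a ThreadPool thread
    }
});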
