Download multiple files concurrently from FTP using FluentFTP with a maximum number of connections - C#

I would like to download multiple files recursively from an FTP directory. To do this I'm using the FluentFTP library, and my code is the following:
private async Task downloadRecursively(string src, string dest, FtpClient ftp)
{
    foreach (var item in ftp.GetListing(src))
    {
        if (item.Type == FtpFileSystemObjectType.Directory)
        {
            if (item.Size != 0)
            {
                System.IO.Directory.CreateDirectory(Path.Combine(dest, item.Name));
                // Note: the recursive call must be awaited, otherwise it runs fire-and-forget
                await downloadRecursively(Path.Combine(src, item.Name), Path.Combine(dest, item.Name), ftp);
            }
        }
        else if (item.Type == FtpFileSystemObjectType.File)
        {
            await ftp.DownloadFileAsync(Path.Combine(dest, item.Name), Path.Combine(src, item.Name));
        }
    }
}
I know you need one FtpClient per concurrent download, but how can I cap the number of connections at a maximum? I guess the idea is to create, connect, download and close for every file I find, while only having X files downloading at the same time. I'm also not sure whether I should use Tasks with async, or Threads, and my biggest problem is how to implement all of this.
The answer from @Bradley here seems pretty good, but in that question the list of files to download is read from an external file, and there is no maximum concurrent download value, so I'm not sure how to apply both of these requirements.

Use:
ConcurrentBag class to implement a connection pool;
Parallel class to parallelize the operation;
ParallelOptions.MaxDegreeOfParallelism to limit the number of concurrent threads.
var clients = new ConcurrentBag<FtpClient>();

var opts = new ParallelOptions { MaxDegreeOfParallelism = maxConnections };
Parallel.ForEach(files, opts, file =>
{
    file = Path.GetFileName(file);
    string thread = $"Thread {Thread.CurrentThread.ManagedThreadId}";
    if (!clients.TryTake(out var client))
    {
        Console.WriteLine($"{thread} Opening connection...");
        client = new FtpClient(host, user, pass);
        client.Connect();
        Console.WriteLine($"{thread} Opened connection {client.GetHashCode()}.");
    }

    string remotePath = sourcePath + "/" + file;
    string localPath = Path.Combine(destPath, file);
    string desc =
        $"{thread}, Connection {client.GetHashCode()}, " +
        $"File {remotePath} => {localPath}";
    Console.WriteLine($"{desc} - Starting...");
    client.DownloadFile(localPath, remotePath);
    Console.WriteLine($"{desc} - Done.");

    clients.Add(client);
});

Console.WriteLine($"Closing {clients.Count} connections");
foreach (var client in clients)
{
    Console.WriteLine($"Closing connection {client.GetHashCode()}");
    client.Dispose();
}
Another approach is to start a fixed number of threads with one connection for each and have them pick files from a queue.
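A minimal sketch of that queue-based approach (assuming the same host, user, pass, sourcePath, destPath, files and maxConnections variables as above):
var queue = new ConcurrentQueue<string>(files.Select(Path.GetFileName));
var workers = Enumerable.Range(0, maxConnections)
    .Select(_ => Task.Run(() =>
    {
        // Each worker owns exactly one connection and drains the shared queue
        using (var client = new FtpClient(host, user, pass))
        {
            client.Connect();
            while (queue.TryDequeue(out var file))
            {
                client.DownloadFile(Path.Combine(destPath, file), sourcePath + "/" + file);
            }
        }
    }))
    .ToArray();
Task.WaitAll(workers);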
For an example of an implementation, see my article for WinSCP .NET assembly:
Automating transfers in parallel connections over SFTP/FTP protocol
A similar question about SFTP:
Processing SFTP files using C# Parallel.ForEach loop not processing downloads

Here is a TPL Dataflow approach. A BufferBlock<FtpClient> is used as a pool of FtpClient objects. The recursive enumeration takes a parameter of type IEnumerable<string> that holds the segments of one file path. These segments are combined differently when constructing the local and the remote file path. As a side effect of invoking the recursive enumeration, the paths of the remote files are sent to an ActionBlock<IEnumerable<string>>. This block handles the parallel downloading of the files. Its Completion property eventually contains all the exceptions that may have occurred during the whole operation.
public static Task FtpDownloadDeep(string ftpHost, string ftpRoot,
    string targetDirectory, string username = null, string password = null,
    int maximumConnections = 1)
{
    // Arguments validation omitted

    if (!Directory.Exists(targetDirectory))
        throw new DirectoryNotFoundException(targetDirectory);

    var fsLocker = new object();
    var ftpClientPool = new BufferBlock<FtpClient>();

    async Task<TResult> UsingFtpAsync<TResult>(Func<FtpClient, Task<TResult>> action)
    {
        var client = await ftpClientPool.ReceiveAsync();
        try { return await action(client); }
        finally { ftpClientPool.Post(client); } // Return to the pool
    }

    var downloader = new ActionBlock<IEnumerable<string>>(async path =>
    {
        var remotePath = String.Join("/", path);
        var localPath = Path.Combine(path.Prepend(targetDirectory).ToArray());
        var localDir = Path.GetDirectoryName(localPath);
        lock (fsLocker) Directory.CreateDirectory(localDir);
        var status = await UsingFtpAsync(client =>
            client.DownloadFileAsync(localPath, remotePath));
        if (status == FtpStatus.Failed) throw new InvalidOperationException(
            $"Download of '{remotePath}' failed.");
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = maximumConnections,
        BoundedCapacity = maximumConnections,
    });

    async Task Recurse(IEnumerable<string> path)
    {
        if (downloader.Completion.IsCompleted) return; // The downloader has failed
        var listing = await UsingFtpAsync(client =>
            client.GetListingAsync(String.Join("/", path)));

        foreach (var item in listing)
        {
            if (item.Type == FtpFileSystemObjectType.Directory)
            {
                if (item.Size != 0) await Recurse(path.Append(item.Name));
            }
            else if (item.Type == FtpFileSystemObjectType.File)
            {
                var accepted = await downloader.SendAsync(path.Append(item.Name));
                if (!accepted) break; // The downloader has failed
            }
        }
    }

    // Move on to the thread pool, to avoid ConfigureAwait(false) everywhere
    return Task.Run(async () =>
    {
        // Fill the FtpClient pool
        for (int i = 0; i < maximumConnections; i++)
        {
            var client = new FtpClient(ftpHost);
            if (username != null && password != null)
                client.Credentials = new NetworkCredential(username, password);
            ftpClientPool.Post(client);
        }
        try
        {
            // Enumerate the files to download
            await Recurse(new[] { ftpRoot });
            downloader.Complete();
        }
        catch (Exception ex) { ((IDataflowBlock)downloader).Fault(ex); }
        try
        {
            // Await the downloader to complete
            await downloader.Completion;
        }
        catch (OperationCanceledException)
            when (downloader.Completion.IsCanceled) { throw; }
        catch { downloader.Completion.Wait(); } // Propagate AggregateException
        finally
        {
            // Clean up
            if (ftpClientPool.TryReceiveAll(out var clients))
                foreach (var client in clients) client.Dispose();
        }
    });
}
Usage example:
await FtpDownloadDeep("ftp://ftp.test.com", "", @"C:\FtpTest",
    "username", "password", maximumConnections: 10);
Note: The above implementation enumerates the remote directory lazily, following the tempo of the downloading process. If you prefer to enumerate it eagerly, gathering all info available about the remote listings ASAP, just remove the BoundedCapacity = maximumConnections configuration from the ActionBlock that downloads the files. Be aware that doing so could result in high memory consumption, in case the remote directory has a deep hierarchy of subfolders, containing cumulatively a huge number of small files.

I'd split this into three parts.
1. Recursively build a list of source and destination pairs.
2. Create the directories required.
3. Concurrently download the files.
It's the last part that is slow and should be done in parallel.
Here's the code:
private async Task DownloadRecursively(string src, string dest, FtpClient ftp)
{
    /* 1 */
    IEnumerable<(string source, string destination)> Recurse(string s, string d)
    {
        foreach (var item in ftp.GetListing(s))
        {
            if (item.Type == FtpFileSystemObjectType.Directory)
            {
                if (item.Size != 0)
                {
                    foreach (var pair in Recurse(Path.Combine(s, item.Name), Path.Combine(d, item.Name)))
                    {
                        yield return pair;
                    }
                }
            }
            else if (item.Type == FtpFileSystemObjectType.File)
            {
                yield return (Path.Combine(s, item.Name), Path.Combine(d, item.Name));
            }
        }
    }

    var pairs = Recurse(src, dest).ToArray();

    /* 2 */
    // Create the directory of each destination file (not the file path itself)
    foreach (var d in pairs.Select(x => Path.GetDirectoryName(x.destination)).Distinct())
    {
        System.IO.Directory.CreateDirectory(d);
    }

    /* 3 */
    // FluentFTP's DownloadFileAsync takes the local path first, then the remote path
    var downloads =
        pairs
            .AsParallel()
            .Select(x => ftp.DownloadFileAsync(x.destination, x.source))
            .ToArray();

    await Task.WhenAll(downloads);
}
The result should be clean, neat, and easy-to-reason-about code.
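If you also want to cap the number of simultaneous downloads in part 3, a SemaphoreSlim gate is one option. A sketch, assuming a maxConnections value (note it still shares the single FtpClient, as the code above does):
var gate = new SemaphoreSlim(maxConnections);
var downloads =
    pairs
        .Select(async x =>
        {
            // Only maxConnections downloads can hold the gate at once
            await gate.WaitAsync();
            try { await ftp.DownloadFileAsync(x.destination, x.source); }
            finally { gate.Release(); }
        })
        .ToArray();
await Task.WhenAll(downloads);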


Run Async every x number of times in a for loop

I'm downloading 100K+ files and want to do it in batches, such as 100 files at a time.
static void Main(string[] args) {
    Task.WaitAll(
        new Task[]{
            RunAsync()
        });
}

// each group has 100 attachments.
static async Task RunAsync() {
    foreach (var group in groups) {
        var tasks = new List<Task>();
        foreach (var attachment in group.attachments) {
            tasks.Add(DownloadFileAsync(attachment, downloadPath));
        }
        await Task.WhenAll(tasks);
    }
}

static async Task DownloadFileAsync(Attachment attachment, string path) {
    using (var client = new HttpClient()) {
        using (var fileStream = File.Create(path + attachment.FileName)) {
            var downloadedFileStream = await client.GetStreamAsync(attachment.url);
            await downloadedFileStream.CopyToAsync(fileStream);
        }
    }
}
Expected
I was hoping it would download 100 files at a time, then download the next 100.
Actual
It downloads a lot more at the same time. I quickly got the error: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
Running tasks in "batches" is not a good idea in terms of performance: one long-running task would block the whole batch. A better approach is to start a new task as soon as one finishes.
This can be implemented with a queue, as @MertAkcakaya suggested. But I will post another alternative based on my other answer: Have a set of Tasks with only X running at a time
int maxThreads = 3;
System.Net.ServicePointManager.DefaultConnectionLimit = 50; // Set this once to a max value in your app

var urls = new Tuple<string, string>[] {
    Tuple.Create("http://cnn.com", "temp/cnn1.htm"),
    Tuple.Create("http://cnn.com", "temp/cnn2.htm"),
    Tuple.Create("http://bbc.com", "temp/bbc1.htm"),
    Tuple.Create("http://bbc.com", "temp/bbc2.htm"),
    Tuple.Create("http://stackoverflow.com", "temp/stackoverflow.htm"),
    Tuple.Create("http://google.com", "temp/google1.htm"),
    Tuple.Create("http://google.com", "temp/google2.htm"),
};

DownloadParallel(urls, maxThreads);
async Task DownloadParallel(IEnumerable<Tuple<string, string>> urls, int maxThreads)
{
    SemaphoreSlim maxThread = new SemaphoreSlim(maxThreads);
    var client = new HttpClient();

    foreach (var url in urls)
    {
        await maxThread.WaitAsync();
        DownloadFile(client, url.Item1, url.Item2)
            .ContinueWith((task) => maxThread.Release());
    }
}

async Task DownloadFile(HttpClient client, string url, string fileName)
{
    var stream = await client.GetStreamAsync(url);
    using (var fileStream = File.Create(fileName))
    {
        await stream.CopyToAsync(fileStream);
    }
}
PS: DownloadParallel will return as soon as it starts the last download, so don't await it. If you really want to await it, you should add for (int i = 0; i < maxThreads; i++) await maxThread.WaitAsync(); at the end of the method, as in the sketch below.
PS2: Don't forget to add exception handling to DownloadFile
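For reference, here is a sketch of the fully awaitable variant described in the first PS (the method name is hypothetical):
async Task DownloadParallelAndWait(IEnumerable<Tuple<string, string>> urls, int maxThreads)
{
    var gate = new SemaphoreSlim(maxThreads);
    var client = new HttpClient();

    foreach (var url in urls)
    {
        await gate.WaitAsync();
        var fireAndForget = DownloadFile(client, url.Item1, url.Item2)
            .ContinueWith(task => gate.Release());
    }

    // Drain the semaphore: this completes only after every download has released its slot.
    for (int i = 0; i < maxThreads; i++)
        await gate.WaitAsync();
}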

Why does this code lead to a deadlock?

The following code is meant to copy files asynchronously, but it causes a deadlock in my app. It uses a task combinator helper method called Interleaved(..) (found here) to return tasks in the order they complete.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_DEADLOCK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
    List<StorageFile> copiedFiles = new List<StorageFile>();
    List<Task<StorageFile>> copyTasks = new List<Task<StorageFile>>();

    foreach (var file in sourceFiles)
    {
        // Create the copy tasks and add to list
        var copyTask = file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
        copyTasks.Add(copyTask);
    }

    // Serve up each task as it completes
    foreach (var bucket in Interleaved(copyTasks))
    {
        var copyTask = await bucket;
        var copiedFile = await copyTask;
        copiedFiles.Add(copiedFile);
        progress.Report((int)((double)copiedFiles.Count / sourceFiles.Count() * 100.0));
    }

    return copiedFiles;
}
I originally created a simpler CopyFiles(...) which processes the tasks in the order they were supplied (as opposed to completed) and this works fine, but I can't figure out why this one deadlocks frequently, particularly when there are many files to process.
Here is the simpler 'CopyFiles' code that works:
public static async Task<List<StorageFile>> CopyFiles_RUNS_OK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
    List<StorageFile> copiedFiles = new List<StorageFile>();
    int sourceFilesCount = sourceFiles.Count();
    List<Task<StorageFile>> tasks = new List<Task<StorageFile>>();

    foreach (var file in sourceFiles)
    {
        // Create the copy tasks and add to list
        var copiedFile = await file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
        copiedFiles.Add(copiedFile);
        progress.Report((int)((double)copiedFiles.Count / sourceFilesCount * 100.0));
    }

    return copiedFiles;
}
EDIT:
In an attempt to find out what's going on I've changed the implementation of CopyFiles(...) to use TPL Dataflow. I am aware that this code will return items in the order they were supplied, which is not what I want, but it removes the Interleaved dependency as a start. Anyway, despite this the app still hangs. It seems as if it's not returning from the file.CopyAsync(..) call. There is of course the possibility I'm just doing something wrong here.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_HANGING_ALSO(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
    int sourceFilesCount = sourceFiles.Count();
    List<StorageFile> copiedFiles = new List<StorageFile>();

    // Store for input files.
    BufferBlock<StorageFile> inputFiles = new BufferBlock<StorageFile>();

    Func<StorageFile, Task<StorageFile>> copyFunc = sf => sf.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
    TransformBlock<StorageFile, Task<StorageFile>> copyFilesBlock = new TransformBlock<StorageFile, Task<StorageFile>>(copyFunc);

    inputFiles.LinkTo(copyFilesBlock, new DataflowLinkOptions() { PropagateCompletion = true });

    foreach (var file in sourceFiles)
    {
        inputFiles.Post(file);
    }
    inputFiles.Complete();

    while (await copyFilesBlock.OutputAvailableAsync())
    {
        Task<StorageFile> file = await copyFilesBlock.ReceiveAsync();
        copiedFiles.Add(await file);
        progress.Report((int)((double)copiedFiles.Count / sourceFilesCount * 100.0));
    }

    copyFilesBlock.Completion.Wait();

    return copiedFiles;
}
Many thanks in advance for any help.

C# Multi-threading - Upload to FTP Server

I would like to seek your help in implementing multi-threading in my C# program.
The program aims to upload 10,000+ files to an FTP server. I am planning to implement at least 10 threads to increase the speed of the process.
With this in mind, here is the code that I have:
I have initialized 10 Threads:
public ThreadStart[] threadstart = new ThreadStart[10];
public Thread[] thread = new Thread[10];
My plan is to assign one file to a thread, as follows:
file 1 > thread 1
file 2 > thread 2
file 3 > thread 3
...
file 10 > thread 10
file 11 > thread 1
...
And so I have the following:
foreach (string file in files)
{
    loop++;
    threadstart[loop] = new ThreadStart(() => ftp.uploadToFTP(uploadPath + @"/" + Path.GetFileName(file), file));
    thread[loop] = new Thread(threadstart[loop]);
    thread[loop].Start();

    if (loop == 9)
    {
        loop = 0;
    }
}
Passing files to their respective threads works. My problem is that the threads overlap: for example, while Thread 1 is still running, another file is assigned to it. That throws an error, since Thread 1 has not yet finished when a new ThreadStart is assigned to its slot. The same is true of the other threads.
What is the best way to implement this?
Any feedback will be greatly appreciated. Thank you! :)
Use async-await and just pass an array of files into it:
private static async Task TestFtpAsync(string userName, string password, string ftpBaseUri,
    IEnumerable<string> fileNames)
{
    var tasks = new List<Task<byte[]>>();
    var clients = new List<WebClient>();

    foreach (var fileInfo in fileNames.Select(fileName => new FileInfo(fileName)))
    {
        // Don't wrap the WebClient in a 'using' here: it must stay alive
        // until its upload task has completed.
        var webClient = new WebClient();
        clients.Add(webClient);
        webClient.Credentials = new NetworkCredential(userName, password);
        tasks.Add(webClient.UploadFileTaskAsync(ftpBaseUri + fileInfo.Name, fileInfo.FullName));
    }

    Console.WriteLine("Uploading...");
    foreach (var task in tasks)
    {
        try
        {
            await task;
            Console.WriteLine("Success");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }

    // Dispose the clients only after all uploads have finished
    foreach (var client in clients)
    {
        client.Dispose();
    }
}
Then call it like this:
const string userName = "username";
const string password = "password";
const string ftpBaseUri = "ftp://192.168.1.1/";
var fileNames = new[] { @"d:\file0.txt", @"d:\file1.txt", @"d:\file2.txt" };
TestFtpAsync(userName, password, ftpBaseUri, fileNames).Wait();
Why do it the hard way?
.NET already has a class called ThreadPool.
You can just use that, and it manages the threads itself.
Your code will be like this:
static void DoSomething(object n)
{
    Console.WriteLine(n);
    Thread.Sleep(10);
}

static void Main(string[] args)
{
    ThreadPool.SetMaxThreads(20, 10);
    for (int x = 0; x < 30; x++)
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(DoSomething), x);
    }
    Console.Read();
}
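One caveat: QueueUserWorkItem gives you no built-in way to wait for all the work items to finish; the Console.Read() above just keeps the process alive. If you need to block until everything has run, a CountdownEvent is one option (a sketch):
static void Main(string[] args)
{
    ThreadPool.SetMaxThreads(20, 10);
    using (var done = new CountdownEvent(30))
    {
        for (int x = 0; x < 30; x++)
        {
            int n = x; // capture a copy for the closure
            ThreadPool.QueueUserWorkItem(_ => { DoSomething(n); done.Signal(); });
        }
        done.Wait(); // blocks until all 30 items have signaled
    }
}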

C# Task.WaitAll isn't waiting

My aim is to download images from an Amazon Web Services bucket.
I have the following function which downloads multiple images at once:
public static void DownloadFilesFromAWS(string bucketName, List<string> imageNames)
{
    int batchSize = 50;
    int maxDownloadMilliseconds = 10000;

    List<Task> tasks = new List<Task>();

    for (int i = 0; i < imageNames.Count; i++)
    {
        string imageName = imageNames[i];
        Task task = Task.Run(() => GetFile(bucketName, imageName));
        tasks.Add(task);

        if (tasks.Count > 0 && tasks.Count % batchSize == 0)
        {
            Task.WaitAll(tasks.ToArray(), maxDownloadMilliseconds); // wait to download
            tasks.Clear();
        }
    }

    // if there are any left, wait for them
    Task.WaitAll(tasks.ToArray(), maxDownloadMilliseconds);
}
private static void GetFile(string bucketName, string filename)
{
    try
    {
        using (AmazonS3Client awsClient = new AmazonS3Client(Amazon.RegionEndpoint.EUWest1))
        {
            string key = Path.GetFileName(filename);
            GetObjectRequest getObjectRequest = new GetObjectRequest()
            {
                BucketName = bucketName,
                Key = key
            };

            using (GetObjectResponse response = awsClient.GetObject(getObjectRequest))
            {
                string directory = Path.GetDirectoryName(filename);
                if (!Directory.Exists(directory))
                {
                    Directory.CreateDirectory(directory);
                }
                if (!File.Exists(filename))
                {
                    response.WriteResponseStreamToFile(filename);
                }
            }
        }
    }
    catch (AmazonS3Exception amazonS3Exception)
    {
        if (amazonS3Exception.ErrorCode == "NoSuchKey")
        {
            return;
        }
        if (amazonS3Exception.ErrorCode != null && (amazonS3Exception.ErrorCode.Equals("InvalidAccessKeyId") || amazonS3Exception.ErrorCode.Equals("InvalidSecurity")))
        {
            // Log AWS invalid credentials
            throw new ApplicationException("AWS Invalid Credentials");
        }
        else
        {
            // Log generic AWS exception
            throw new ApplicationException("AWS Exception: " + amazonS3Exception.Message);
        }
    }
    catch
    {
        //
    }
}
The downloading of the images all works fine, but the Task.WaitAll seems to be ignored and the rest of the code continues to execute, meaning I try to get files that don't exist yet (as they've not yet been downloaded).
I found this answer to another question which seems to be the same as mine. I tried to use the answer to change my code, but it still wouldn't wait for all files to be downloaded.
Can anyone tell me where I am going wrong?
The code behaves as expected. Task.WaitAll returns after ten seconds even when not all files have been downloaded, because you specified a timeout of 10 seconds (10000 milliseconds) in the variable maxDownloadMilliseconds.
If you really want to wait for all downloads to finish, call Task.WaitAll without specifying a timeout.
Use
Task.WaitAll(tasks.ToArray()); // wait to download
at both places.
To see some good explanations on how to implement parallel downloads while not stressing the system (only have a maximum number of parallel downloads), see the answer at How can I limit Parallel.ForEach?
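Applied to the code above, that bounded approach can replace the manual batching entirely, since Parallel.ForEach blocks until all items are processed. A sketch (the limit of 10 is arbitrary):
public static void DownloadFilesFromAWS(string bucketName, List<string> imageNames)
{
    var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
    // Blocks until every image has been downloaded
    Parallel.ForEach(imageNames, options, imageName => GetFile(bucketName, imageName));
}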

AWAIT multiple file downloads with DownloadDataAsync

I have a zip file creator that takes in a string[] of URLs and returns a zip file with all of the files in the string[].
I figured there would be a number of examples of this, but I cannot seem to find an answer to "how to download many files asynchronously and return when done".
How do I download {n} files at once, and return the Dictionary only when all downloads are complete?
private static Dictionary<string, byte[]> ReturnedFileData(IEnumerable<string> urlList)
{
    var returnList = new Dictionary<string, byte[]>();
    using (var client = new WebClient())
    {
        foreach (var url in urlList)
        {
            client.DownloadDataCompleted += (sender1, e1) => returnList.Add(GetFileNameFromUrlString(url), e1.Result);
            client.DownloadDataAsync(new Uri(url));
        }
    }
    return returnList;
}

private static string GetFileNameFromUrlString(string url)
{
    var uri = new Uri(url);
    return System.IO.Path.GetFileName(uri.LocalPath);
}
First, you tagged your question with async-await without actually using it. There really is no reason anymore to use the old asynchronous paradigms.
To wait asynchronously for all concurrent async operations to complete, you should use Task.WhenAll, which means that you need to keep all the tasks in some construct (i.e. a dictionary) before actually extracting their results.
At the end, when you have all the results in hand, you just create the new result dictionary by parsing the URI into the file name and extracting the result out of the async tasks.
async Task<Dictionary<string, byte[]>> ReturnFileData(IEnumerable<string> urls)
{
    var dictionary = urls.ToDictionary(
        url => new Uri(url),
        url => new WebClient().DownloadDataTaskAsync(url));

    await Task.WhenAll(dictionary.Values);

    return dictionary.ToDictionary(
        pair => Path.GetFileName(pair.Key.LocalPath),
        pair => pair.Value.Result);
}
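Usage is then a straightforward await (sketch):
var fileData = await ReturnFileData(urlList);
// fileData maps each file name to its downloaded bytes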
public string JUST_return_dataURL_by_URL(string URL, int interval, int max_interval)
{
    var client = new WebClient();
    client.Proxy = proxy;       // assumes a 'proxy' field elsewhere in the class
    client.Headers = _headers;  // assumes a '_headers' field elsewhere in the class

    string downloaded_from_URL = "false"; // default - until downloaded

    client.DownloadDataCompleted += (sender, e) =>
    {
        Console.WriteLine("Done!");
        string dataURL = Convert.ToBase64String(e.Result);
        string filename = Guid.NewGuid().ToString().Trim('{', '}') + ".png";
        downloaded_from_URL =
            "Image Downloaded from " + URL
            + "<br>"
            + "<a href=\"" + dataURL + "\" download=\"" + filename + "\">"
            + "<img src=\"data:image/png;base64," + dataURL + "\"/>" + filename
            + "</a>";
    };
    client.DownloadDataAsync(new System.Uri(URL));

    // Blocking poll: wait until the callback fires or max_interval elapses.
    int i = 0;
    do
    {
        Thread.Sleep(interval);
        i += interval;
    }
    while ((downloaded_from_URL == "false") && (i < max_interval));

    return downloaded_from_URL;
}
You'd be wanting the Task.WaitAll method...
msdn link
Create each download as a separate task, then pass them as a collection.
A shortcut to this might be to wrap your download method in a task:
return new Task<DownloadResult>(() => { /* method body */ });
(Remember that a task created this way must be started with Start(), or use Task.Run to get an already-running task.)
Apologies for vagueness, working on an iPad sucks for coding.
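A minimal sketch of that idea, using Task.Run so the tasks are already started (assuming the same urlList as in the question):
var tasks = urlList
    .Select(url => Task.Run(() =>
    {
        using (var client = new WebClient())
        {
            // One download per task; the tuple pairs each URL with its bytes
            return Tuple.Create(url, client.DownloadData(url));
        }
    }))
    .ToArray();
Task.WaitAll(tasks); // blocks until every download completes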
EDIT:
Another implementation of this that may be worth considering is wrapping the downloads in the parallel framework.
Since your tasks all do the same thing and take a parameter, you could instead use Parallel.ForEach and wrap it in a single task:
public System.Threading.Tasks.Task<System.Collections.Generic.IDictionary<string, byte[]>> DownloadTask(System.Collections.Generic.IEnumerable<string> urlList)
{
    // Task.Run returns a "hot" (already started) task; a task created with the
    // Task constructor would never complete unless Start() were called on it.
    return System.Threading.Tasks.Task.Run<System.Collections.Generic.IDictionary<string, byte[]>>(() =>
    {
        var r = new System.Collections.Concurrent.ConcurrentDictionary<string, byte[]>();

        System.Threading.Tasks.Parallel.ForEach<string>(urlList, (url, s, l) =>
        {
            using (System.Net.WebClient client = new System.Net.WebClient())
            {
                var bytedata = client.DownloadData(url);
                r.TryAdd(url, bytedata);
            }
        });

        var results = new System.Collections.Generic.Dictionary<string, byte[]>();
        foreach (var value in r)
        {
            results.Add(value.Key, value.Value);
        }
        return results;
    });
}
This leverages a concurrent collection to support parallel access within the method before converting back to IDictionary.
This method returns a hot task, so it can be called with an await.
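For example (hypothetical URLs):
var fileData = await DownloadTask(new[] { "http://example.com/a.bin", "http://example.com/b.bin" });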
Hope this provides a helpful alternative.
