Why does this code lead to a deadlock? - c#

The following code is meant to copy files asynchronously but it causes deadlock in my app. It uses a task combinator helper method called 'Interleaved(..)' found here to return tasks in the order they complete.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_DEADLOCK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
List<Task<StorageFile>> copyTasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copyTask = file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copyTasks.Add(copyTask);
}
// Serve up each task as it completes
foreach (var bucket in Interleaved(copyTasks))
{
var copyTask = await bucket;
var copiedFile = await copyTask;
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFiles.Count() * 100.0));
}
return copiedFiles;
}
I originally created a simpler 'CopyFiles(...)' which processes the tasks in the order they were supplied (as opposed to completed) and this works fine, but I can't figure out why this one deadlocks frequently. Particularly, when there are many files to process.
Here is the simpler 'CopyFiles' code that works:
public static async Task<List<StorageFile>> CopyFiles_RUNS_OK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
int sourceFilesCount = sourceFiles.Count();
List<Task<StorageFile>> tasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copiedFile = await file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount *100.0));
}
return copiedFiles;
}
EDIT:
In an attempt to find out what's going on I've changed the implementation of CopyFiles(...) to use TPL Dataflow. I am aware that this code will return items in the order they were supplied, which is not what I want, but it removes the Interleaved dependency as a start. Anyway, despite this the app still hangs. It seems as if it's not returning from the file.CopyAsync(..) call. There is of course the possibility I'm just doing something wrong here.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_HANGING_ALSO(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
int sourceFilesCount = sourceFiles.Count();
List<StorageFile> copiedFiles = new List<StorageFile>();
// Store for input files.
BufferBlock<StorageFile> inputFiles = new BufferBlock<StorageFile>();
//
Func<StorageFile, Task<StorageFile>> copyFunc = sf => sf.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
TransformBlock<StorageFile, Task<StorageFile>> copyFilesBlock = new TransformBlock<StorageFile, Task<StorageFile>>(copyFunc);
inputFiles.LinkTo(copyFilesBlock, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (var file in sourceFiles)
{
inputFiles.Post(file);
}
inputFiles.Complete();
while (await copyFilesBlock.OutputAvailableAsync())
{
Task<StorageFile> file = await copyFilesBlock.ReceiveAsync();
copiedFiles.Add(await file);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount * 100.0));
}
copyFilesBlock.Completion.Wait();
return copiedFiles;
}
Many thanks in advance for any help.

Related

Download multiple files concurrently from FTP using FluentFTP with a maximum value

I would like to download multiple download files recursively from a FTP Directory, to do this I'm using FluentFTP library and my code is this one:
private async Task downloadRecursively(string src, string dest, FtpClient ftp)
{
foreach(var item in ftp.GetListing(src))
{
if (item.Type == FtpFileSystemObjectType.Directory)
{
if (item.Size != 0)
{
System.IO.Directory.CreateDirectory(Path.Combine(dest, item.Name));
downloadRecursively(Path.Combine(src, item.Name), Path.Combine(dest, item.Name), ftp);
}
}
else if (item.Type == FtpFileSystemObjectType.File)
{
await ftp.DownloadFileAsync(Path.Combine(dest, item.Name), Path.Combine(src, item.Name));
}
}
}
I know you need one FtpClient per download you want, but how can I make to use a certain number of connections as maximum, I guess that the idea is to create, connect, download and close per every file I find but just having a X number of downloading files at the same time. Also I'm not sure if I should create Task with async, Threads and my biggest problem, how to implement all of this.
Answer from #Bradley here seems pretty good, but the question does read every file thas has to download from an external file and it doesn't have a maximum concurrent download value so I'm not sure how to apply these both requirements.
Use:
ConcurrentBag class to implement a connection pool;
Parallel class to parallelize the operation;
ParallelOptions.MaxDegreeOfParallelism to limit number of the concurrent threads.
var clients = new ConcurrentBag<FtpClient>();
var opts = new ParallelOptions { MaxDegreeOfParallelism = maxConnections };
Parallel.ForEach(files, opts, file =>
{
file = Path.GetFileName(file);
string thread = $"Thread {Thread.CurrentThread.ManagedThreadId}";
if (!clients.TryTake(out var client))
{
Console.WriteLine($"{thread} Opening connection...");
client = new FtpClient(host, user, pass);
client.Connect();
Console.WriteLine($"{thread} Opened connection {client.GetHashCode()}.");
}
string remotePath = sourcePath + "/" + file;
string localPath = Path.Combine(destPath, file);
string desc =
$"{thread}, Connection {client.GetHashCode()}, " +
$"File {remotePath} => {localPath}";
Console.WriteLine($"{desc} - Starting...");
client.DownloadFile(localPath, remotePath);
Console.WriteLine($"{desc} - Done.");
clients.Add(client);
});
Console.WriteLine($"Closing {clients.Count} connections");
foreach (var client in clients)
{
Console.WriteLine($"Closing connection {client.GetHashCode()}");
client.Dispose();
}
Another approach is to start a fixed number of threads with one connection for each and have them pick files from a queue.
For an example of an implementation, see my article for WinSCP .NET assembly:
Automating transfers in parallel connections over SFTP/FTP protocol
A similar question about SFTP:
Processing SFTP files using C# Parallel.ForEach loop not processing downloads
Here is a TPL Dataflow approach. A BufferBlock<FtpClient> is used as a pool of FtpClient objects. The recursive enumeration takes a parameter of type IEnumerable<string> that holds the segments of one filepath. These segments are combined differently when constructing the local and the remote filepath. As a side effect of invoking the recursive enumeration, the paths of the remote files are sent to an ActionBlock<IEnumerable<string>>. This block handles the parallel downloading of the files. Its Completion property contains eventually all the exceptions that may have occurred during the whole operation.
public static Task FtpDownloadDeep(string ftpHost, string ftpRoot,
string targetDirectory, string username = null, string password = null,
int maximumConnections = 1)
{
// Arguments validation omitted
if (!Directory.Exists(targetDirectory))
throw new DirectoryNotFoundException(targetDirectory);
var fsLocker = new object();
var ftpClientPool = new BufferBlock<FtpClient>();
async Task<TResult> UsingFtpAsync<TResult>(Func<FtpClient, Task<TResult>> action)
{
var client = await ftpClientPool.ReceiveAsync();
try { return await action(client); }
finally { ftpClientPool.Post(client); } // Return to the pool
}
var downloader = new ActionBlock<IEnumerable<string>>(async path =>
{
var remotePath = String.Join("/", path);
var localPath = Path.Combine(path.Prepend(targetDirectory).ToArray());
var localDir = Path.GetDirectoryName(localPath);
lock (fsLocker) Directory.CreateDirectory(localDir);
var status = await UsingFtpAsync(client =>
client.DownloadFileAsync(localPath, remotePath));
if (status == FtpStatus.Failed) throw new InvalidOperationException(
$"Download of '{remotePath}' failed.");
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = maximumConnections,
BoundedCapacity = maximumConnections,
});
async Task Recurse(IEnumerable<string> path)
{
if (downloader.Completion.IsCompleted) return; // The downloader has failed
var listing = await UsingFtpAsync(client =>
client.GetListingAsync(String.Join("/", path)));
foreach (var item in listing)
{
if (item.Type == FtpFileSystemObjectType.Directory)
{
if (item.Size != 0) await Recurse(path.Append(item.Name));
}
else if (item.Type == FtpFileSystemObjectType.File)
{
var accepted = await downloader.SendAsync(path.Append(item.Name));
if (!accepted) break; // The downloader has failed
}
}
}
// Move on to the thread pool, to avoid ConfigureAwait(false) everywhere
return Task.Run(async () =>
{
// Fill the FtpClient pool
for (int i = 0; i < maximumConnections; i++)
{
var client = new FtpClient(ftpHost);
if (username != null && password != null)
client.Credentials = new NetworkCredential(username, password);
ftpClientPool.Post(client);
}
try
{
// Enumerate the files to download
await Recurse(new[] { ftpRoot });
downloader.Complete();
}
catch (Exception ex) { ((IDataflowBlock)downloader).Fault(ex); }
try
{
// Await the downloader to complete
await downloader.Completion;
}
catch (OperationCanceledException)
when (downloader.Completion.IsCanceled) { throw; }
catch { downloader.Completion.Wait(); } // Propagate AggregateException
finally
{
// Clean up
if (ftpClientPool.TryReceiveAll(out var clients))
foreach (var client in clients) client.Dispose();
}
});
}
Usage example:
await FtpDownloadDeep("ftp://ftp.test.com", "", #"C:\FtpTest",
"username", "password", maximumConnections: 10);
Note: The above implementation enumerates the remote directory lazily, following the tempo of the downloading process. If you prefer to enumerate it eagerly, gathering all info available about the remote listings ASAP, just remove the BoundedCapacity = maximumConnections configuration from the ActionBlock that downloads the files. Be aware that doing so could result in high memory consumption, in case the remote directory has a deep hierarchy of subfolders, containing cumulatively a huge number of small files.
I'd split this into three parts.
Recursively build a list of source and destination pairs.
Create the directories required.
Concurrently download the files.
It's the last part that is slow and should be done in parallel.
Here's the code:
private async Task DownloadRecursively(string src, string dest, FtpClient ftp)
{
/* 1 */
IEnumerable<(string source, string destination)> Recurse(string s, string d)
{
foreach (var item in ftp.GetListing(s))
{
if (item.Type == FtpFileSystemObjectType.Directory)
{
if (item.Size != 0)
{
foreach(var pair in Recurse(Path.Combine(s, item.Name), Path.Combine(d, item.Name)))
{
yield return pair;
}
}
}
else if (item.Type == FtpFileSystemObjectType.File)
{
yield return (Path.Combine(s, item.Name), Path.Combine(d, item.Name));
}
}
}
var pairs = Recurse(src, dest).ToArray();
/* 2 */
foreach (var d in pairs.Select(x => x.destination).Distinct())
{
System.IO.Directory.CreateDirectory(d);
}
/* 3 */
var downloads =
pairs
.AsParallel()
.Select(x => ftp.DownloadFileAsync(x.source, x.destination))
.ToArray();
await Task.WhenAll(downloads);
}
It should be clean, neat, and easy to reason about code.

How to get the individual API call status success response in C#

How to get the individual API call status success response in C#.
I am creating a mobile application using Xamarin Forms,
In my application, I need to prefetch certain information when app launches to use the mobile application.
Right now, I am calling the details like this,
public async Task<Response> GetAllVasInformationAsync()
{
var userDetails = GetUserDetailsAsync();
var getWageInfo = GetUserWageInfoAsync();
var getSalaryInfo = GetSalaryInfoAsync();
await Task.WhenAll(userDetails,
getWageInfo,
getSalaryInfo,
);
var resultToReturn = new Response
{
IsuserDetailsSucceeded = userDetails.Result,
IsgetWageInfoSucceeded = getWageInfo.Result,
IsgetSalaryInfoSucceeded = getSalaryInfo.Result,
};
return resultToReturn;
}
In my app I need to update details based on the success response. Something like this (2/5) completed. And the text should be updated whenever we get a new response.
What is the best way to implement this feature? Is it possible to use along with Task.WhenAll. Because I am trying to wrap everything in one method call.
In my app I need to update details based on the success response.
The proper way to do this is IProgress<string>. The calling code should supply a Progress<string> that updates the UI accordingly.
public async Task<Response> GetAllVasInformationAsync(IProgress<string> progress)
{
var userDetails = UpdateWhenComplete(GetUserDetailsAsync(), "user details");
var getWageInfo = UpdateWhenComplete(GetUserWageInfoAsync(), "wage information");
var getSalaryInfo = UpdateWhenComplete(GetSalaryInfoAsync(), "salary information");
await Task.WhenAll(userDetails, getWageInfo, getSalaryInfo);
return new Response
{
IsuserDetailsSucceeded = await userDetails,
IsgetWageInfoSucceeded = await getWageInfo,
IsgetSalaryInfoSucceeded = await getSalaryInfo,
};
async Task<T> UpdateWhenComplete<T>(Task<T> task, string taskName)
{
try { return await task; }
finally { progress?.Report($"Completed {taskName}"); }
}
}
If you also need a count, you can either use IProgress<(int, string)> or change how the report progress string is built to include the count.
So here's what I would do in C# 8 and .NET Standard 2.1:
First, I create the method which will produce the async enumerable:
static async IAsyncEnumerable<bool> TasksToPerform() {
Task[] tasks = new Task[3] { userDetails, getWageInfo, getSalaryInfo };
for (i = 0; i < tasks.Length; i++) {
await tasks[i];
yield return true;
}
}
So now you need to await foreach on this task enumerable. Every time you get a return, you know that a task has been finished.
int numberOfFinishedTasks = 0;
await foreach (var b in TasksToPerform()) {
numberOfFinishedTasks++;
//Update UI here to reflect the finished task number
}
No need to over-complicate this. This code will show how many of your tasks had exceptions. Your await task.whenall just triggers them and waits for them to finish. So after that you can do whatever you want with the tasks :)
var task = Task.Delay(300);
var tasks = new List<Task> { task };
var faultedTasks = 0;
tasks.ForEach(t =>
{
t.ContinueWith(t2 =>
{
//do something with a field / property holding ViewModel state
//that your view is listening to
});
});
await Task.WhenAll(tasks);
//use this to respond with a finished count
tasks.ForEach(_ => { if (_.IsFaulted) faultedTasks++; });
Console.WriteLine($"{tasks.Count() - faultedTasks} / {tasks.Count()} completed.");
.WhenAll() will allow you to determine if /any/ of the tasks failed, they you just count the tasks that have failed.
public async Task<Response> GetAllVasInformationAsync()
{
var userDetails = GetUserDetailsAsync();
var getWageInfo = GetUserWageInfoAsync();
var getSalaryInfo = GetSalaryInfoAsync();
await Task.WhenAll(userDetails, getWaitInfo, getSalaryInfo)
.ContinueWith((task) =>
{
if(task.IsFaulted)
{
int failedCount = 0;
if(userDetails.IsFaulted) failedCount++;
if(getWaitInfo.IsFaulted) failedCount++;
if(getSalaryInfo.IsFaulted) failedCount++;
return $"{failedCount} tasks failed";
}
});
var resultToReturn = new Response
{
IsuserDetailsSucceeded = userDetails.Result,
IsgetWageInfoSucceeded = getWageInfo.Result,
IsgetSalaryInfoSucceeded = getSalaryInfo.Result,
};
return resultToReturn;
}

Task.Whenall(List<Task>) outputs OutOfRangeException

I have two loops that use SemaphoreSlim and a array of strings "Contents"
a foreachloop:
var allTasks = new List<Task>();
var throttle = new SemaphoreSlim(10,10);
foreach (string s in Contents)
{
await throttle.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
rootResponse.Add(await POSTAsync(s, siteurl, src, target));
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
a for loop:
var allTasks = new List<Task>();
var throttle = new SemaphoreSlim(10,10);
for(int s=0;s<Contents.Count;s++)
{
await throttle.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
rootResponse[s] = await POSTAsync(Contents[s], siteurl, src, target);
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
the first foreach loop runs well, but the for loops Task.WhenAll(allTasks) returns a OutOfRangeException and I want the Contents[] index and List index to match.
Can I fix the for loop? or is there a better approach?
This would fix your current problems
for (int s = 0; s < Contents.Count; s++)
{
var content = Contents[s];
allTasks.Add(
Task.Run(async () =>
{
await throttle.WaitAsync();
try
{
rootResponse[s] = await POSTAsync(content, siteurl, src, target);
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
However this is a fairly messy and nasty piece of code. This looks a bit neater
public static async Task DoStuffAsync(Content[] contents, string siteurl, string src, string target)
{
var throttle = new SemaphoreSlim(10, 10);
// local method
async Task<(Content, SomeResponse)> PostAsyncWrapper(Content content)
{
await throttle.WaitAsync();
try
{
// return a content and result pair
return (content, await PostAsync(content, siteurl, src, target));
}
finally
{
throttle.Release();
}
}
var results = await Task.WhenAll(contents.Select(PostAsyncWrapper));
// do stuff with your results pairs here
}
There are many other ways you could do this, PLinq, Parallel.For,Parallel.ForEach, Or just tidying up your captures in your loops like above.
However since you have an IO bound work load, and you have async methods that run it. The most appropriate solution is the async await pattern which neither Parallel.For,Parallel.ForEach cater for optimally.
Another way is TPL DataFlow library which can be found in the System.Threading.Tasks.Dataflow nuget package.
Code
public static async Task DoStuffAsync(Content[] contents, string siteurl, string src, string target)
{
async Task<(Content, SomeResponse)> PostAsyncWrapper(Content content)
{
return (content, await PostAsync(content, siteurl, src, target));
}
var bufferblock = new BufferBlock<(Content, SomeResponse)>();
var actionBlock = new TransformBlock<Content, (Content, SomeResponse)>(
content => PostAsyncWrapper(content),
new ExecutionDataflowBlockOptions
{
EnsureOrdered = false,
MaxDegreeOfParallelism = 100,
SingleProducerConstrained = true
});
actionBlock.LinkTo(bufferblock);
foreach (var content in contents)
actionBlock.Post(content);
actionBlock.Complete();
await actionBlock.Completion;
if (bufferblock.TryReceiveAll(out var result))
{
// do stuff with your results pairs here
}
}
Basically this creates a BufferBlock And TransformBlock, You pump your work load into the TransformBlock, it has degrees of parallel in its options, and it pushes them into the BufferBlock, you await completion and get your results.
Why Dataflow? because it deals with the async await, it has MaxDegreeOfParallelism, it designed for IO bound or CPU Bound workloads, and its extremely simple to use. Additionally, as most data is generally processed in many ways (in a pipeline), you can then use it to pipe and manipulate streams of data in sequence and parallel or in any way way you choose down the line.
Anyway good luck

How do I optimize calculating the hash of thousands of files?

I have a .Net program that runs through a directory containing tens of thousands of relatively small files (around 10MB each), calculates their MD5 hash and stores that data in an SQLite database. The whole process works fine, however it takes a relatively long time (1094353ms with around 60 thousand files) and I'm looking for ways to optimize it. Here are the solutions I've thought of:
Use additional threads and calculate the hash of more than one file simultaneously. Not sure how I/O speed would limit me with this one.
Use a better hashing algorithm. I've looked around and the one I'm currently using seems to be the fastest one (on C# at least).
Which would be the best approach, and are there any better ones?
Here's my current code:
private async Task<string> CalculateHash(string file, System.Security.Cryptography.MD5 md5) {
Task<string> MD5 = Task.Run(() =>
{
{
using (var stream = new BufferedStream(System.IO.File.OpenRead(file), 1200000))
{
var hash = md5.ComputeHash(stream);
var fileMD5 = string.Concat(Array.ConvertAll(hash, x => x.ToString("X2")));
return fileMD5;
}
};
});
return await MD5;
}
public async Main() {
using (var md5 = System.Security.Cryptography.MD5.Create()) {
foreach (var file in Directory.GetFiles(path)) {
var hash = await CalculateHash(file, md5);
// Adds `hash` to the database
}
}
}
Create a pipeline of work, the easiest way I know how to create a pipeline that uses both parts of the code that must be single threaded and parts that must be multi-threaded is to use TPL Dataflow
public static class Example
{
private class Dto
{
public Dto(string filePath, byte[] data)
{
FilePath = filePath;
Data = data;
}
public string FilePath { get; }
public byte[] Data { get; }
}
public static async Task ProcessFiles(string path)
{
var getFilesBlock = new TransformBlock<string, Dto>(filePath => new Dto(filePath, File.ReadAllBytes(filePath))); //Only lets one thread do this at a time.
var hashFilesBlock = new TransformBlock<Dto, Dto>(dto => HashFile(dto),
new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = Environment.ProcessorCount, //We can multi-thread this part.
BoundedCapacity = 50}); //Only allow 50 byte[]'s to be waiting in the queue. It will unblock getFilesBlock once there is room.
var writeToDatabaseBlock = new ActionBlock<Dto>(WriteToDatabase,
new ExecutionDataflowBlockOptions {BoundedCapacity = 50});//MaxDegreeOfParallelism defaults to 1 so we don't need to specifiy it.
//Link the blocks together.
getFilesBlock.LinkTo(hashFilesBlock, new DataflowLinkOptions {PropagateCompletion = true});
hashFilesBlock.LinkTo(writeToDatabaseBlock, new DataflowLinkOptions {PropagateCompletion = true});
//Queue the work for the first block.
foreach (var filePath in Directory.EnumerateFiles(path))
{
await getFilesBlock.SendAsync(filePath).ConfigureAwait(false);
}
//Tell the first block we are done adding files.
getFilesBlock.Complete();
//Wait for the last block to finish processing its last item.
await writeToDatabaseBlock.Completion.ConfigureAwait(false);
}
private static Dto HashFile(Dto dto)
{
using (var md5 = System.Security.Cryptography.MD5.Create())
{
return new Dto(dto.FilePath, md5.ComputeHash(dto.Data));
}
}
private static async Task WriteToDatabase(Dto arg)
{
//Write to the database here.
}
}
This creates a pipeline with 3 segments.
One that is single threaded that reads the files from the hard drive in to memory and stored as a byte[].
A second one that can use up to Enviorement.ProcessorCount threads to hash the files, it will only allow 50 items to be sitting on it's inbound queue, when the first block tries to add it will stop processing new items until the next block is ready to accept new items.
And a third one that is single threaded and adds the data to the database, it allows only 50 items in it's inbound queue at a time.
Because of the two 50 limits there will be at most 100 byte[] in memory (50 in hashFilesBlock queue, 50 in the writeToDatabaseBlock queue, items currently being processed count toward the BoundedCapacity limit.
Update: for fun I wrote a version that reports progress too, it's untested though and uses C# 7 features.
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
public static class Example
{
private class Dto
{
public Dto(string filePath, byte[] data)
{
FilePath = filePath;
Data = data;
}
public string FilePath { get; }
public byte[] Data { get; }
}
public static async Task ProcessFiles(string path, IProgress<ProgressReport> progress)
{
int totalFilesFound = 0;
int totalFilesRead = 0;
int totalFilesHashed = 0;
int totalFilesUploaded = 0;
DateTime lastReported = DateTime.UtcNow;
void ReportProgress()
{
if (DateTime.UtcNow - lastReported < TimeSpan.FromSeconds(1)) //Try to fire only once a second, but this code is not perfect so you may get a few rapid fire.
{
return;
}
lastReported = DateTime.UtcNow;
var report = new ProgressReport(totalFilesFound, totalFilesRead, totalFilesHashed, totalFilesUploaded);
progress.Report(report);
}
var getFilesBlock = new TransformBlock<string, Dto>(filePath =>
{
var dto = new Dto(filePath, File.ReadAllBytes(filePath));
totalFilesRead++; //safe because single threaded.
return dto;
});
var hashFilesBlock = new TransformBlock<Dto, Dto>(inDto =>
{
using (var md5 = System.Security.Cryptography.MD5.Create())
{
var outDto = new Dto(inDto.FilePath, md5.ComputeHash(inDto.Data));
Interlocked.Increment(ref totalFilesHashed); //Need the interlocked due to multithreaded.
ReportProgress();
return outDto;
}
},
new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = Environment.ProcessorCount, BoundedCapacity = 50});
var writeToDatabaseBlock = new ActionBlock<Dto>(arg =>
{
//Write to database here.
totalFilesUploaded++;
ReportProgress();
},
new ExecutionDataflowBlockOptions {BoundedCapacity = 50});
getFilesBlock.LinkTo(hashFilesBlock, new DataflowLinkOptions {PropagateCompletion = true});
hashFilesBlock.LinkTo(writeToDatabaseBlock, new DataflowLinkOptions {PropagateCompletion = true});
foreach (var filePath in Directory.EnumerateFiles(path))
{
await getFilesBlock.SendAsync(filePath).ConfigureAwait(false);
totalFilesFound++;
ReportProgress();
}
getFilesBlock.Complete();
await writeToDatabaseBlock.Completion.ConfigureAwait(false);
ReportProgress();
}
}
public class ProgressReport
{
public ProgressReport(int totalFilesFound, int totalFilesRead, int totalFilesHashed, int totalFilesUploaded)
{
TotalFilesFound = totalFilesFound;
TotalFilesRead = totalFilesRead;
TotalFilesHashed = totalFilesHashed;
TotalFilesUploaded = totalFilesUploaded;
}
public int TotalFilesFound { get; }
public int TotalFilesRead{ get; }
public int TotalFilesHashed{ get; }
public int TotalFilesUploaded{ get; }
}
As far as I understand, Task.Run will instantiate a new thread for every file you have there, which leads to lots of threads and context switching between them. The case like you describe, sounds like a good case for using Parallel.For or Parallel.Foreach, something like this:
public void CalcHashes(string path)
{
string GetFileHash(System.Security.Cryptography.MD5 md5, string fileName)
{
using (var stream = new BufferedStream(System.IO.File.OpenRead(fileName), 1200000))
{
var hash = md5.ComputeHash(stream);
var fileMD5 = string.Concat(Array.ConvertAll(hash, x => x.ToString("X2")));
return fileMD5;
}
}
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 8;
Parallel.ForEach(filenames, options, fileName =>
{
using (var md5 = System.Security.Cryptography.MD5.Create())
{
GetFileHash(md5, fileName);
}
});
}
EDIT: Seems Parallel.ForEach does not actually do the partitioning automatically. Added max degree of parallelism limit to 8. As a result:
107005 files
46628 ms

Run Async every x number of times in a for loop

I'm downloading 100K+ files and want to do it in patches, such as 100 files at a time.
static void Main(string[] args) {
Task.WaitAll(
new Task[]{
RunAsync()
});
}
// each group has 100 attachments.
static async Task RunAsync() {
foreach (var group in groups) {
var tasks = new List<Task>();
foreach (var attachment in group.attachments) {
tasks.Add(DownloadFileAsync(attachment, downloadPath));
}
await Task.WhenAll(tasks);
}
}
static async Task DownloadFileAsync(Attachment attachment, string path) {
using (var client = new HttpClient()) {
using (var fileStream = File.Create(path + attachment.FileName)) {
var downloadedFileStream = await client.GetStreamAsync(attachment.url);
await downloadedFileStream.CopyToAsync(fileStream);
}
}
}
Expected
Hoping it to download 100 files at a time, then download next 100;
Actual
It downloads a lot more at the same time. Quickly got an error Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host
Running tasks in "batch" is not a good idea in terms of performance. A long running task would make whole batch block. A better approach would be starting a new task as soon as one is finished.
This can be implemented with a queue as #MertAkcakaya suggested. But I will post another alternative based on my other answer Have a set of Tasks with only X running at a time
int maxTread = 3;
System.Net.ServicePointManager.DefaultConnectionLimit = 50; //Set this once to a max value in your app
var urls = new Tuple<string, string>[] {
Tuple.Create("http://cnn.com","temp/cnn1.htm"),
Tuple.Create("http://cnn.com","temp/cnn2.htm"),
Tuple.Create("http://bbc.com","temp/bbc1.htm"),
Tuple.Create("http://bbc.com","temp/bbc2.htm"),
Tuple.Create("http://stackoverflow.com","temp/stackoverflow.htm"),
Tuple.Create("http://google.com","temp/google1.htm"),
Tuple.Create("http://google.com","temp/google2.htm"),
};
DownloadParallel(urls, maxTread);
async Task DownloadParallel(IEnumerable<Tuple<string,string>> urls, int maxThreads)
{
SemaphoreSlim maxThread = new SemaphoreSlim(maxThreads);
var client = new HttpClient();
foreach(var url in urls)
{
await maxThread.WaitAsync();
DownloadFile(client, url.Item1, url.Item2)
.ContinueWith((task) => maxThread.Release() );
}
}
async Task DownloadFile(HttpClient client, string url, string fileName)
{
var stream = await client.GetStreamAsync(url);
using (var fileStream = File.Create(fileName))
{
await stream.CopyToAsync(fileStream);
}
}
PS: DownloadParallel will return as soon as it starts the last download. So don't await it. If you really want to await it you should add for (int i = 0; i < maxThreads; i++) await maxThread.WaitAsync(); at the end of the method.
PS2: Don't forget to add exception handling to DownloadFile

Categories

Resources