I'm trying to copy a large set of files from one S3 location to another using asynchronous methods. To do that, the set of files is split into batches, and each batch is handed to an async method. The issue is that each async method processes no more than one file of its batch, even though each batch contains more than 1k files. I'm not sure why the async method doesn't go back to process the remaining files.
Here is the code:
public void CreateAndExecuteSpawn(string srcBucket, List<List<string>> pdfFileList, IAmazonS3 s3client)
{
int i = 0;
List<Action> actions = new List<Action>();
LambdaLogger.Log("PDF Set count: " + pdfFileList.Count.ToString());
foreach (var list in pdfFileList)
actions.Add(() => RenameFilesAsync(srcBucket, list, s3client));
foreach (var method in actions)
{
method.Invoke();
LambdaLogger.Log("Method invoked: "+ i++.ToString());
}
}
public async void RenameFilesAsync(string srcBucket, List<string> pdfFiles, IAmazonS3 s3client)
{
LambdaLogger.Log("In RenameFileAsync method");
CopyObjectRequest copyRequest = new CopyObjectRequest
{
SourceBucket = srcBucket,
DestinationBucket = srcBucket
};
try
{
foreach (var file in pdfFiles)
{
if (!file.Contains("index.xml"))
{
string[] newFilename = file.Split('{');
string[] destKey = file.Split('/');
copyRequest.SourceKey = file;
copyRequest.DestinationKey = destKey[0] + "/" + destKey[1] + "/Renamed/" + newFilename[1];
LambdaLogger.Log("About to rename File: " + file);
//Here after copying one file, function doesn't return to foreach loop
CopyObjectResponse response = await s3client.CopyObjectAsync(copyRequest);
//await s3client.CopyObjectAsync(copyRequest);
LambdaLogger.Log("Rename done: ");
}
}
}
catch(Exception ex)
{
LambdaLogger.Log(ex.Message);
LambdaLogger.Log(copyRequest.DestinationKey);
}
}
public void FunctionHandler(S3Event evnt, ILambdaContext context)
{
//Some code here
CreateAndExecuteSpawn(bucket, pdfFileSet, s3client);
}
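First, fix the batch handling so that the batches are processed one at a time and each one is actually awaited. Avoid async void: the caller cannot await it, so CreateAndExecuteSpawn returns as soon as every batch has reached its first await, which is most likely why only one file per batch is copied before the Lambda handler returns. FunctionHandler also has to return Task and await the call. Use async Task instead: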
public async Task CreateAndExecuteSpawnAsync(string srcBucket, List<List<string>> pdfFileList, IAmazonS3 s3client)
{
int i = 0;
List<Func<Task>> actions = new();
LambdaLogger.Log("PDF Set count: " + pdfFileList.Count.ToString());
foreach (var list in pdfFileList)
actions.Add(() => RenameFilesAsync(srcBucket, list, s3client));
foreach (var method in actions)
{
await method();
LambdaLogger.Log("Method invoked: "+ i++.ToString());
}
}
Then you can add asynchronous concurrency within each batch. The current code is just a foreach loop that awaits each copy before starting the next, so of course it only processes one file at a time. You can make it asynchronously concurrent by using Select to start a task per file and then awaiting Task.WhenAll at the end. Each copy should also get its own CopyObjectRequest, because sharing one mutable request object across concurrent copies is not safe:
public async Task RenameFilesAsync(string srcBucket, List<string> pdfFiles, IAmazonS3 s3client)
{
LambdaLogger.Log("In RenameFileAsync method");
try
{
var tasks = pdfFiles
.Where(file => !file.Contains("index.xml"))
.Select(async file =>
{
string[] newFilename = file.Split('{');
string[] destKey = file.Split('/');
// One request per file, so concurrent copies never share mutable state.
CopyObjectRequest copyRequest = new CopyObjectRequest
{
SourceBucket = srcBucket,
DestinationBucket = srcBucket,
SourceKey = file,
DestinationKey = destKey[0] + "/" + destKey[1] + "/Renamed/" + newFilename[1]
};
LambdaLogger.Log("About to rename File: " + file);
CopyObjectResponse response = await s3client.CopyObjectAsync(copyRequest);
LambdaLogger.Log("Rename done: " + file);
})
.ToList();
await Task.WhenAll(tasks);
}
catch (Exception ex)
{
LambdaLogger.Log(ex.Message);
}
}
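With more than 1k files per batch, starting every copy at once may be more concurrency than you want. A minimal sketch of one way to cap it, assuming a SemaphoreSlim as the gate; Throttled, ForEachAsync and maxConcurrency are names made up for this illustration and are not part of the AWS SDK:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs work(item) for every item, with at most maxConcurrency
    // operations in flight at any moment.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = items.Select(async item =>
        {
            // Wait for a free slot without blocking a thread.
            await gate.WaitAsync();
            try
            {
                await work(item);
            }
            finally
            {
                // Always free the slot, even if work(item) throws.
                gate.Release();
            }
        }).ToList();

        // Completes once every item has been processed (or has failed).
        await Task.WhenAll(tasks);
    }
}
```

Inside RenameFilesAsync this could be called as await Throttled.ForEachAsync(pdfFiles, 16, file => CopyOneAsync(file)), where CopyOneAsync is a hypothetical method holding the per-file copy logic and 16 is an arbitrary limit to tune.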
I have a loop creating three tasks:
List<Task> tasks = new List<Task>();
foreach (DSDevice device in validdevices)
{
var task = Task.Run(() =>
{
var conf = PrepareModasConfig(device, alternativconfig);
//CHECK-Point1
string config = ModasDicToConfig(conf);
//CHECK-Point2
if (config != null)
{
//Do Stuff
}
else
{
//Do other Stuff
}
});
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
it calls this method, where some data of a dictionary of a default-config gets overwritten:
private Dictionary<string, Dictionary<string, string>> PrepareModasConfig(DSDevice device, string alternativeconfig)
{
try
{
Dictionary<string, Dictionary<string, string>> config = new Dictionary<string, Dictionary<string, string>>(Project.project.ModasConfig.Config);
if (config.ContainsKey("[Main]"))
{
if (config["[Main]"].ContainsKey("DevName"))
{
config["[Main]"]["DevName"] = device.ID;
}
}
return config;
}
catch
{
return null;
}
}
and after that, it gets converted into a string with this method:
private string ModasDicToConfig(Dictionary<string, Dictionary<string, string>> dic)
{
string s = string.Empty;
try
{
foreach (string key in dic.Keys)
{
s = s + key + "\n";
foreach (string k in dic[key].Keys)
{
s = s + k + "=" + dic[key][k] + "\n";
}
s = s + "\n";
}
return s;
}
catch
{
return null;
}
}
But every task gets the exact same string back.
On //CHECK-Point1 I check the Dic for the changed value: Correct Value for each Task
On //CHECK-Point2 I check the String: Same String on all 3 Tasks (Should be of course different)
Default-Dictionary looks like this: (shortened)
{
{"[Main]",
{"DevName", "Default"},
...
},
...
}
The resulting string looks like this:
[Main]
DevName=003 <--This should be different (from Device.ID)
...
[...]
EDIT:
I moved the methods to execute outside the Task. Now I get the correct Results. So I guess it has something to do with the Task?
List<Task> tasks = new List<Task>();
foreach (DSDevice device in validdevices)
{
var conf = PrepareModasConfig(device, alternativconfig);
//CHECK-Point1
string config = ModasDicToConfig(conf);
//CHECK-Point2
var task = Task.Run(() =>
{
if (config != null)
{
//Do Stuff
}
else
{
//Do other Stuff
}
});
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
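The problem isn't caused by tasks as such. The lambda passed to Task.Run captures the loop variable device, so if that variable is shared between iterations, every task sees whatever it contains by the time the task runs. The same problem would occur even without tasks, as this SO question shows. Note that C# 5 changed foreach so that each iteration gets a fresh variable; if you compile with C# 5 or later, the capture problem only affects for loops, and it is also worth checking that new Dictionary<...>(Project.project.ModasConfig.Config) makes only a shallow copy, so all tasks write DevName into the same inner "[Main]" dictionary. The following code prints 10 ten times: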
List<Action> actions = new List<Action>();
for (int i = 0; i < 10; ++i )
actions.Add(()=>Console.WriteLine(i));
foreach (Action a in actions)
a();
------
10
10
10
10
10
10
10
10
10
10
If the question's code used an Action without Task.Run it would still result in bad results.
One way to fix this is to copy the loop variable into a local variable and use only that in the lambda :
for (int i = 0; i < 10; ++i )
{
var ii=i;
actions.Add(()=>Console.WriteLine(ii));
}
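A small sketch to make the difference between for and foreach visible, assuming a compiler that targets C# 5 or later. CaptureDemo and its method names are made up for this illustration:

```csharp
using System;
using System.Collections.Generic;

public static class CaptureDemo
{
    // for: a single variable i is shared by all lambdas, so every lambda
    // sees the value i has after the loop has finished.
    public static List<int> CaptureInFor()
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < 3; ++i)
            actions.Add(() => i);
        return actions.ConvertAll(a => a());
    }

    // foreach: since C# 5 each iteration gets its own variable, so every
    // lambda keeps the value of its own iteration.
    public static List<int> CaptureInForeach()
    {
        var actions = new List<Func<int>>();
        foreach (int i in new[] { 0, 1, 2 })
            actions.Add(() => i);
        return actions.ConvertAll(a => a());
    }
}
```

string.Join(",", CaptureDemo.CaptureInFor()) gives "3,3,3", while string.Join(",", CaptureDemo.CaptureInForeach()) gives "0,1,2".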
The question's code can be fixed by copying the loop variable into a local variable inside the loop:
foreach (DSDevice dev in validdevices)
{
var device=dev;
var task = Task.Run(() =>
{
var conf = PrepareModasConfig(device, alternativconfig);
Another way is to use Parallel.ForEach to process all items in parallel, using all available cores, without creating tasks explicitly:
Parallel.ForEach(validdevices,device=>{
var conf = PrepareModasConfig(device, alternativconfig);
string config = ModasDicToConfig(conf);
...
});
Parallel.ForEach allows limiting the number of worker tasks through the MaxDegreeOfParallelism option. It's a blocking call because it uses the current thread to process data along with any worker tasks.
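A sketch of how that could look with MaxDegreeOfParallelism set. It also gives every iteration its own deep copy of the defaults: new Dictionary<K, V>(other) copies only the outer dictionary, so with a nested dictionary like the one in the question the inner dictionaries stay shared between all copies. ConfigBuilder, DeepCopy and BuildAll are hypothetical names, and plain device ID strings stand in for DSDevice:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class ConfigBuilder
{
    // Copies the outer dictionary and every inner dictionary.
    public static Dictionary<string, Dictionary<string, string>> DeepCopy(
        Dictionary<string, Dictionary<string, string>> source)
    {
        return source.ToDictionary(
            section => section.Key,
            section => new Dictionary<string, string>(section.Value));
    }

    // Builds one config string per device ID, running at most
    // maxParallelism iterations at the same time.
    public static IDictionary<string, string> BuildAll(
        Dictionary<string, Dictionary<string, string>> defaults,
        IEnumerable<string> deviceIds,
        int maxParallelism)
    {
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        Parallel.ForEach(deviceIds, options, id =>
        {
            // Private copy: no other iteration can see this write.
            var config = DeepCopy(defaults);
            if (config.TryGetValue("[Main]", out var main) && main.ContainsKey("DevName"))
                main["DevName"] = id;

            var sb = new StringBuilder();
            foreach (var section in config)
            {
                sb.Append(section.Key).Append('\n');
                foreach (var entry in section.Value)
                    sb.Append(entry.Key).Append('=').Append(entry.Value).Append('\n');
                sb.Append('\n');
            }
            results[id] = sb.ToString();
        });

        return results;
    }
}
```

The defaults are only read inside the loop, which is safe for a Dictionary as long as nothing modifies it at the same time.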
The following code is meant to copy files asynchronously but it causes deadlock in my app. It uses a task combinator helper method called 'Interleaved(..)' found here to return tasks in the order they complete.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_DEADLOCK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
List<Task<StorageFile>> copyTasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copyTask = file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copyTasks.Add(copyTask);
}
// Serve up each task as it completes
foreach (var bucket in Interleaved(copyTasks))
{
var copyTask = await bucket;
var copiedFile = await copyTask;
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFiles.Count() * 100.0));
}
return copiedFiles;
}
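For readers who don't have the linked helper at hand: the exact Interleaved implementation used here is not shown, so the following is only a sketch of how such a combinator is commonly written, assuming one TaskCompletionSource "bucket" per input task. TaskCombinators is a made-up class name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class TaskCombinators
{
    // Returns one "bucket" task per input task. The buckets complete in the
    // order in which the input tasks finish, not in the order they were passed in.
    public static Task<Task<T>>[] Interleaved<T>(IEnumerable<Task<T>> tasks)
    {
        var inputTasks = tasks.ToList();
        var buckets = new TaskCompletionSource<Task<T>>[inputTasks.Count];
        var results = new Task<Task<T>>[buckets.Length];
        for (int i = 0; i < buckets.Length; i++)
        {
            buckets[i] = new TaskCompletionSource<Task<T>>();
            results[i] = buckets[i].Task;
        }

        int nextTaskIndex = -1;
        Action<Task<T>> continuation = completed =>
        {
            // Each finished task claims the next free bucket.
            var bucket = buckets[Interlocked.Increment(ref nextTaskIndex)];
            bucket.TrySetResult(completed);
        };

        foreach (var inputTask in inputTasks)
        {
            inputTask.ContinueWith(
                continuation,
                CancellationToken.None,
                TaskContinuationOptions.ExecuteSynchronously,
                TaskScheduler.Default);
        }

        return results;
    }
}
```

Awaiting a bucket yields the completed task itself, which is why the loop in the question awaits twice: once for the bucket and once for the copy task.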
I originally created a simpler 'CopyFiles(...)' which processes the tasks in the order they were supplied (as opposed to completed) and this works fine, but I can't figure out why this one deadlocks frequently. Particularly, when there are many files to process.
Here is the simpler 'CopyFiles' code that works:
public static async Task<List<StorageFile>> CopyFiles_RUNS_OK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
int sourceFilesCount = sourceFiles.Count();
List<Task<StorageFile>> tasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copiedFile = await file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount *100.0));
}
return copiedFiles;
}
EDIT:
In an attempt to find out what's going on I've changed the implementation of CopyFiles(...) to use TPL Dataflow. I am aware that this code will return items in the order they were supplied, which is not what I want, but it removes the Interleaved dependency as a start. Anyway, despite this the app still hangs. It seems as if it's not returning from the file.CopyAsync(..) call. There is of course the possibility I'm just doing something wrong here.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_HANGING_ALSO(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
int sourceFilesCount = sourceFiles.Count();
List<StorageFile> copiedFiles = new List<StorageFile>();
// Store for input files.
BufferBlock<StorageFile> inputFiles = new BufferBlock<StorageFile>();
//
Func<StorageFile, Task<StorageFile>> copyFunc = sf => sf.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
TransformBlock<StorageFile, Task<StorageFile>> copyFilesBlock = new TransformBlock<StorageFile, Task<StorageFile>>(copyFunc);
inputFiles.LinkTo(copyFilesBlock, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (var file in sourceFiles)
{
inputFiles.Post(file);
}
inputFiles.Complete();
while (await copyFilesBlock.OutputAvailableAsync())
{
Task<StorageFile> file = await copyFilesBlock.ReceiveAsync();
copiedFiles.Add(await file);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount * 100.0));
}
copyFilesBlock.Completion.Wait();
return copiedFiles;
}
Many thanks in advance for any help.
I have tried many different ways to get this to work, and I am sure that is not the proper way to wire up async/await for multi threading. Here is what I have so far. It is a directory walker that I attempted to make async. I know that you don't see any async or await keywords and that is because I was unsuccessful, but that is what I am trying to do. Right now it runs in a console application but I will abstract and refactor later once I get a working POC. Any guidance is appreciated.
static void RunProgram(CancellationToken ct)
{
try
{
foreach (var dir in _directoriesToProcess)
{
var newTask = CreateNewTask(dir, ct);
_tasks.Add(newTask);
}
while (_tasks.Count > 0)
{
lock (_collectionLock)
{
var t = _tasks.Where(x => x.IsCompleted == true).ToList();
if (t != null)
foreach (var task in t)
{
_tasks.Remove(task);
}
}
}
OutputFiles();
StopAndCleanup();
}
catch (Exception ex)
{
Log(LogColor.Red, "Error: " + ex.Message, false);
_cts.Cancel();
}
}
static Task CreateNewTask(string Path, CancellationToken ct)
{
return Task.Factory.StartNew(() => GetDirectoryFiles(Path, ct), ct);
}
static void GetDirectoryFiles(string Path, CancellationToken ct)
{
if (!ct.IsCancellationRequested)
{
List<string> subDirs = new List<string>();
int currentFileCount = 0;
try
{
currentFileCount = Directory.GetFiles(Path, _fileExtension).Count();
subDirs = Directory.GetDirectories(Path).ToList();
lock (_objLock)
{
_overallFileCount += currentFileCount;
Log(LogColor.White, "- Current path: " + Path);
Log(LogColor.Yellow, "-- Sub directory count: " + subDirs.Count);
Log(LogColor.Yellow, "-- File extension: " + _fileExtension);
Log(LogColor.Yellow, "-- Current count: " + currentFileCount);
Log(LogColor.Red, "-- Running total: " + _overallFileCount);
_csvBuilder.Add(string.Format("{0},{1},{2},{3}", Path, subDirs.Count, _fileExtension, currentFileCount));
Console.Clear();
Log(LogColor.White, "Running file count: " + _overallFileCount, false, true);
}
foreach (var dir in subDirs)
{
lock (_collectionLock)
{
var newTask = CreateNewTask(dir, ct);
_tasks.Add(newTask);
}
}
}
catch (Exception ex)
{
Log(LogColor.Red, "Error: " + ex.Message, false);
_cts.Cancel();
}
}
}
I don't think there's any issue with what you're trying to do. Just be cautious about uncontrolled concurrency, e.g. reading too many directories at once on different threads; context switching could end up making it slower.
Instead of doing things as side effects in your methods, try returning the collected values, e.g.
static async Task<IEnumerable<DirectoryStat>> GetDirectoryFiles(string path, string fileExtension, CancellationToken ct)
{
var thisDirectory = await Task.Run(() => /* Get directory file count and return a DirectoryStat object */);
// Task.WhenAll yields one result sequence per subdirectory
var subDirectoriesResults = await Task.WhenAll(Directory.GetDirectories(path).Select(dir => GetDirectoryFiles(dir, fileExtension, ct)));
// Flatten the per-subdirectory sequences before appending them
return (new[] { thisDirectory }).Concat(subDirectoriesResults.SelectMany(results => results));
}
You can then iterate them later and pull the data you need from DirectoryStat (and sum your file counts as per _overallFileCount etc)
NOTE: Untested :)
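Filled in, the idea could look roughly like the sketch below. DirectoryStat is not defined in the answer, so the record here, the DirectoryWalker class and the method name GetDirectoryStatsAsync are assumptions for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type: one entry per visited directory.
public sealed record DirectoryStat(string Path, int SubDirectoryCount, int FileCount);

public static class DirectoryWalker
{
    public static async Task<IReadOnlyList<DirectoryStat>> GetDirectoryStatsAsync(
        string path, string searchPattern, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();

        // The Directory APIs are synchronous, so run them on the thread pool.
        var (stat, subDirs) = await Task.Run(() =>
        {
            string[] dirs = Directory.GetDirectories(path);
            int files = Directory.GetFiles(path, searchPattern).Length;
            return (new DirectoryStat(path, dirs.Length, files), dirs);
        }, ct);

        // Recurse into all subdirectories concurrently; each call returns its own list.
        var nested = await Task.WhenAll(
            subDirs.Select(dir => GetDirectoryStatsAsync(dir, searchPattern, ct)));

        // This directory first, then the flattened results of its subdirectories.
        return new[] { stat }.Concat(nested.SelectMany(r => r)).ToList();
    }
}
```

The caller can then sum FileCount over the returned list instead of updating a shared counter under a lock.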
You can run synchronous code asynchronously with Task.Run(() => { /* code */ });
Also change your return type to Task so you can await it.
I would rewrite your code as follows:
static async Task RunProgram(CancellationToken ct)
{
try
{
foreach (var dir in _directoriesToProcess)
{
var newTask = CreateNewTask(dir, ct);
lock (_collectionLock)
{
_tasks.Add(newTask);
}
}
//await the tasks one by one instead of spinning in a busy loop
while (true)
{
Task tsk;
lock (_collectionLock)
{
tsk = _tasks.FirstOrDefault();
}
if (tsk == null)
break;
//await is not allowed inside a lock block, so wait outside of it
await tsk;
lock (_collectionLock)
{
_tasks.Remove(tsk);
}
}
OutputFiles();
StopAndCleanup();
}
catch (Exception ex)
{
Log(LogColor.Red, "Error: " + ex.Message, false);
_cts.Cancel();
}
}
static Task CreateNewTask(string Path, CancellationToken ct)
{
//Task.Run unwraps the inner Task, so the returned task completes only when GetDirectoryFiles has finished
return Task.Run(() => GetDirectoryFiles(Path, ct), ct);
}
//always use Task (or Task<T>) as return so you can await the process
static async Task GetDirectoryFiles(string Path, CancellationToken ct)
{
if (!ct.IsCancellationRequested)
{
//Insert Magic
await Task.Run(() => {
List<string> subDirs = new List<string>();
int currentFileCount = 0;
try
{
currentFileCount = Directory.GetFiles(Path, _fileExtension).Count();
subDirs = Directory.GetDirectories(Path).ToList();
lock (_objLock)
{
_overallFileCount += currentFileCount;
Log(LogColor.White, "- Current path: " + Path);
Log(LogColor.Yellow, "-- Sub directory count: " + subDirs.Count);
Log(LogColor.Yellow, "-- File extension: " + _fileExtension);
Log(LogColor.Yellow, "-- Current count: " + currentFileCount);
Log(LogColor.Red, "-- Running total: " + _overallFileCount);
_csvBuilder.Add(string.Format("{0},{1},{2},{3}", Path, subDirs.Count, _fileExtension, currentFileCount));
Console.Clear();
Log(LogColor.White, "Running file count: " + _overallFileCount, false, true);
}
foreach (var dir in subDirs)
{
lock (_collectionLock)
{
var newTask = CreateNewTask(dir, ct);
_tasks.Add(newTask);
}
}
}
catch (Exception ex)
{
Log(LogColor.Red, "Error: " + ex.Message, false);
_cts.Cancel();
}
});
}
}
I'm trying to download multiple files from an FTP server using WebClient.DownloadFileTaskAsync and repeatedly have the issue that several files end up being 0KB.
I've tried different suggested solutions but I just don't manage to get all files. What am I doing wrong?
class Program
{
static void Main()
{
Setup(); // This only sets up working folders etc
Task t = ProcessAsync();
t.ContinueWith(bl =>
{
if (bl.Status == TaskStatus.RanToCompletion)
Logger.Info("All done.");
else
Logger.Warn("Something went wrong.");
});
t.Wait();
}
private static void Setup() {...}
static async Task<bool> ProcessAsync()
{
var c = new Catalog();
var maxItems = Settings.GetInt("maxItems");
Logger.Info((maxItems == 0 ? "Processing all items" : "Processing first {0} items"), maxItems);
await c.ProcessCatalogAsync(maxItems);
return true; // This is not really used atm
}
}
public class Catalog
{
public async Task ProcessCatalogAsync(int maxItems)
{
var client = new Client();
var d = await client.GetFoldersAsync(_remoteFolder, maxItems);
var list = d as IList<string> ?? d.ToList();
Logger.Info("Found {0} folders", list.Count());
await ProcessFoldersAsync(list);
}
private async Task ProcessFoldersAsync(IEnumerable<string> list)
{
var client = new Client();
foreach (var mFolder in list.Select(folder => _folder + "/" + folder))
{
var items = await client.GetItemsAsync(mFolder);
var file = items.FirstOrDefault(n => n.ToLower().EndsWith(".xml"));
if (string.IsNullOrEmpty(file))
{
Logger.Warn("No metadata file found in {0}", mFolder);
continue;
}
await client.DownloadFileAsync(mFolder, file);
// Continue processing the received file...
}
}
}
public class Client
{
public async Task<IEnumerable<string>> GetItemsAsync(string subfolder)
{
return await GetFolderItemsAsync(subfolder, false);
}
public async Task<IEnumerable<string>> GetFoldersAsync(string subfolder, int maxItems)
{
var folders = await GetFolderItemsAsync(subfolder, true);
return maxItems == 0 ? folders : folders.Take(maxItems);
}
private async Task<IEnumerable<string>> GetFolderItemsAsync(string subfolder, bool onlyFolders)
{
// Downloads folder contents using WebRequest
}
public async Task DownloadFileAsync(string path, string file)
{
var remote = new Uri("ftp://" + _hostname + path + "/" + file);
var local = _workingFolder + @"\" + file;
using (var ftpClient = new WebClient { Credentials = new NetworkCredential(_username, _password) })
{
ftpClient.DownloadFileCompleted += (sender, e) => DownloadFileCompleted(sender, e, file);
await ftpClient.DownloadFileTaskAsync(remote, local);
}
}
public void DownloadFileCompleted(object sender, AsyncCompletedEventArgs e, string file)
{
if (e.Error != null)
{
Logger.Warn("Failed downloading\n\t{0}\n\t{1}", file, e.Error.Message);
return;
}
Logger.Info("Downloaded \n\t{0}", file);
}
}
It seems like some of your tasks haven't completed. Try it like this (I do the same when uploading a bunch of files to FTP).
Create an array of tasks, one for every file being downloaded, and start them in a loop like this:
Task[] tArray = new Task[DictToUpload.Count];
int i = 0;
foreach (var pair in DictToUpload)
{
tArray[i++] = Task.Factory.StartNew(()=>{/* some stuff */});
}
await Task.WhenAll(tArray);
Use await Task.WhenAll(tArray) instead of awaiting each task separately. It completes only when all of your tasks have completed.
Learn a bit of TPL ;)
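When the downloads are already asynchronous, there is no need for Task.Factory.StartNew; the tasks returned by the async method can be collected directly. Awaiting Task.WhenAll rethrows only the first failure, so a sketch like the one below checks every task afterwards, which may help to find out which files ended up empty. DownloadBatch, RunAllAsync and the downloadAsync delegate are hypothetical names; the delegate stands in for something like file => client.DownloadFileAsync(folder, file):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DownloadBatch
{
    // Starts all downloads, waits for all of them, and returns the ones that failed.
    public static async Task<IReadOnlyList<(string Name, Exception Error)>> RunAllAsync(
        IEnumerable<string> names, Func<string, Task> downloadAsync)
    {
        // Start every download first; nothing is awaited while starting them.
        var pending = names
            .Select(name => (Name: name, Work: downloadAsync(name)))
            .ToList();

        try
        {
            await Task.WhenAll(pending.Select(p => p.Work));
        }
        catch
        {
            // await rethrows only the first failure; each task is inspected below.
        }

        return pending
            .Where(p => p.Work.IsFaulted || p.Work.IsCanceled)
            .Select(p => (p.Name,
                Error: p.Work.Exception?.GetBaseException()
                       ?? (Exception)new TaskCanceledException(p.Work)))
            .ToList();
    }
}
```

An empty result means every download ran to completion; otherwise each entry names the file and the exception that stopped it.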
I have a zip file creator that takes in a String[] of Urls, and returns a zip file with all of the files in the String[]
I figured there would be a number of example of this, but I cannot seem to find an answer to "How to download many files asynchronously and return when done"
How do I download {n} files at once, and return the Dictionary only when all downloads are complete?
private static Dictionary<string, byte[]> ReturnedFileData(IEnumerable<string> urlList)
{
var returnList = new Dictionary<string, byte[]>();
using (var client = new WebClient())
{
foreach (var url in urlList)
{
client.DownloadDataCompleted += (sender1, e1) => returnList.Add(GetFileNameFromUrlString(url), e1.Result);
client.DownloadDataAsync(new Uri(url));
}
}
return returnList;
}
private static string GetFileNameFromUrlString(string url)
{
var uri = new Uri(url);
return System.IO.Path.GetFileName(uri.LocalPath);
}
First, you tagged your question with async-await without actually using it. There really is no reason anymore to use the old asynchronous paradigms.
To wait asynchronously for all concurrent async operations to complete, you should use Task.WhenAll, which means that you need to keep all the tasks in some construct (e.g. a dictionary) before actually extracting their results.
At the end, when you have all the results in hand you just create the new result dictionary by parsing the uri into the file name, and extracting the result out of the async tasks.
async Task<Dictionary<string, byte[]>> ReturnFileData(IEnumerable<string> urls)
{
var dictionary = urls.ToDictionary(
url => new Uri(url),
url => new WebClient().DownloadDataTaskAsync(url));
await Task.WhenAll(dictionary.Values);
return dictionary.ToDictionary(
pair => Path.GetFileName(pair.Key.LocalPath),
pair => pair.Value.Result);
}
public string JUST_return_dataURL_by_URL(string URL, int interval, int max_interval)
{
var client = new WebClient { Proxy = proxy };
client.Headers = _headers;
string downloaded_from_URL = "false"; //default - until downloading
client.DownloadDataCompleted += (sender, e) =>
{
Console.WriteLine("Done!");
string dataURL = Convert.ToBase64String(e.Result);
string filename = Guid.NewGuid().ToString().Trim('{', '}')+".png";
downloaded_from_URL =
"Image Downloaded from " + URL
+ "<br>"
+ "<a href=\""+dataURL+"\" download=\""+filename+"\">"
+ "<img src=\"data:image/png;base64," + dataURL + "\"/>"+filename
+ "</a>"
;
return;
};
client.DownloadDataAsync(new System.Uri(URL));
int i = 0;
do{
// Console.WriteLine(
// "(interval > 10): "+(interval > 10)
// +"\n(downloaded_from_URL == \"false\"): " + (downloaded_from_URL == "false")
// +"\ninterval: "+interval
// );
Thread.Sleep(interval);
i+=interval;
}
while( (downloaded_from_URL == "false") && (i < max_interval) );
return downloaded_from_URL;
}
You'd want the Task.WaitAll method...
msdn link
Create each download as a separate task, then pass them in as a collection.
A shortcut to this might be to wrap your download method in a task that is started right away:
return Task.Run<DownloadResult>(() => { /* method body */ });
Apologies for vagueness, working on iPad sucks for coding.
EDIT:
Another implementation of this that may be worth considering is wrapping the downloads using the parallel framework.
Since your tasks all do the same thing taking a parameter, you could instead use Parallel.ForEach and wrap that into a single task:
public System.Threading.Tasks.Task<System.Collections.Generic.IDictionary<string, byte[]>> DownloadTask(System.Collections.Generic.IEnumerable<string> urlList)
{
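// Task.Run starts the task; a task created with "new Task(...)" never runs unless Start() is called.
return System.Threading.Tasks.Task.Run<System.Collections.Generic.IDictionary<string, byte[]>>(() =>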
{
var r = new System.Collections.Concurrent.ConcurrentDictionary<string, byte[]>();
System.Threading.Tasks.Parallel.ForEach<string>(urlList, (url, s, l) =>
{
using (System.Net.WebClient client = new System.Net.WebClient())
{
var bytedata = client.DownloadData(url);
r.TryAdd(url, bytedata);
}
});
var results = new System.Collections.Generic.Dictionary<string, byte[]>();
foreach (var value in r)
{
results.Add(value.Key, value.Value);
}
return results;
});
}
This leverages a concurrent collection to support parallel access within the method before converting back to IDictionary.
This method returns a task so can be called with an await.
Hope this provides a helpful alternative.