I'm downloading 100K+ files and want to do it in patches, such as 100 files at a time.
static void Main(string[] args) {
Task.WaitAll(
new Task[]{
RunAsync()
});
}
// each group has 100 attachments.
static async Task RunAsync() {
foreach (var group in groups) {
var tasks = new List<Task>();
foreach (var attachment in group.attachments) {
tasks.Add(DownloadFileAsync(attachment, downloadPath));
}
await Task.WhenAll(tasks);
}
}
static async Task DownloadFileAsync(Attachment attachment, string path) {
using (var client = new HttpClient()) {
using (var fileStream = File.Create(path + attachment.FileName)) {
var downloadedFileStream = await client.GetStreamAsync(attachment.url);
await downloadedFileStream.CopyToAsync(fileStream);
}
}
}
Expected
Hoping it to download 100 files at a time, then download next 100;
Actual
It downloads a lot more at the same time. Quickly got an error Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host
Running tasks in "batch" is not a good idea in terms of performance. A long running task would make whole batch block. A better approach would be starting a new task as soon as one is finished.
This can be implemented with a queue as #MertAkcakaya suggested. But I will post another alternative based on my other answer Have a set of Tasks with only X running at a time
int maxTread = 3;
System.Net.ServicePointManager.DefaultConnectionLimit = 50; //Set this once to a max value in your app
var urls = new Tuple<string, string>[] {
Tuple.Create("http://cnn.com","temp/cnn1.htm"),
Tuple.Create("http://cnn.com","temp/cnn2.htm"),
Tuple.Create("http://bbc.com","temp/bbc1.htm"),
Tuple.Create("http://bbc.com","temp/bbc2.htm"),
Tuple.Create("http://stackoverflow.com","temp/stackoverflow.htm"),
Tuple.Create("http://google.com","temp/google1.htm"),
Tuple.Create("http://google.com","temp/google2.htm"),
};
DownloadParallel(urls, maxTread);
async Task DownloadParallel(IEnumerable<Tuple<string,string>> urls, int maxThreads)
{
SemaphoreSlim maxThread = new SemaphoreSlim(maxThreads);
var client = new HttpClient();
foreach(var url in urls)
{
await maxThread.WaitAsync();
DownloadFile(client, url.Item1, url.Item2)
.ContinueWith((task) => maxThread.Release() );
}
}
async Task DownloadFile(HttpClient client, string url, string fileName)
{
var stream = await client.GetStreamAsync(url);
using (var fileStream = File.Create(fileName))
{
await stream.CopyToAsync(fileStream);
}
}
PS: DownloadParallel will return as soon as it starts the last download. So don't await it. If you really want to await it you should add for (int i = 0; i < maxThreads; i++) await maxThread.WaitAsync(); at the end of the method.
PS2: Don't forget to add exception handling to DownloadFile
Related
I'm trying to combine several resources in a single collection (s variable below). GetLocations2 returns a Task and I'm expecting to be able to add that to task results collection (again, the s variable). However, it's complaining that I can't add the task results to the collection because of the following error:
Error CS1503 Argument 1: cannot convert from 'string' to
'System.Threading.Tasks.Task'
Called
public static async Task<string> WebRequest2(Uri uri)
{
using (var client = new WebClient())
{
var result = await client.OpenReadTaskAsync(uri).ConfigureAwait(false);
using (var sr = new StreamReader(result))
{
return sr.ReadToEnd();
}
}
}
Caller
private static async Task GetLocations2()
{
var s = new List<Task<string>>();
foreach (var lob in _lobs)
{
var r = await Helper.WebRequest2(new Uri(lob));
var x = Helper.DeserializeResponse<SLICS>(r);
s.Add(r); //Getting red squiggly here
}
//var w = await Task.WhenAll();
}
You do not need to await before adding task to list, it basically unwraps Task created by Helper.WebRequest2 (also it waits for Task to finish, so tasks in your original code will be executed sequentially):
private static async Task GetLocations2()
{
var s = new List<Task<string>>();
foreach (var lob in _lobs)
{
var r = Helper.WebRequest2(new Uri(lob));
// var x = Helper.DeserializeResponse<SLICS>(r); this should be done after Task.WhenAll
// or using `ContinueWith`
s.Add(r);
}
var w = await Task.WhenAll(s);
}
You can also wrap both operations in an async lambda to avoid having to iterate twice:
private static async Task GetLocations2()
{
IEnumerable<SLICS> w = await Task.WhenAll(_lobs.Select(async lob =>
{
var r = await Helper.WebRequest2(new Uri(lob));
return Helper.DeserializeResponse<SLICS>(r);
}));
}
Using Select will enumerate all the Tasks returned by the lambda expressions, then Task.WhenAll will wait for them all to complete, and unwrap the results.
I have two loops that use SemaphoreSlim and a array of strings "Contents"
a foreachloop:
var allTasks = new List<Task>();
var throttle = new SemaphoreSlim(10,10);
foreach (string s in Contents)
{
await throttle.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
rootResponse.Add(await POSTAsync(s, siteurl, src, target));
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
a for loop:
var allTasks = new List<Task>();
var throttle = new SemaphoreSlim(10,10);
for(int s=0;s<Contents.Count;s++)
{
await throttle.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
rootResponse[s] = await POSTAsync(Contents[s], siteurl, src, target);
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
the first foreach loop runs well, but the for loops Task.WhenAll(allTasks) returns a OutOfRangeException and I want the Contents[] index and List index to match.
Can I fix the for loop? or is there a better approach?
This would fix your current problems
for (int s = 0; s < Contents.Count; s++)
{
var content = Contents[s];
allTasks.Add(
Task.Run(async () =>
{
await throttle.WaitAsync();
try
{
rootResponse[s] = await POSTAsync(content, siteurl, src, target);
}
finally
{
throttle.Release();
}
}));
}
await Task.WhenAll(allTasks);
However this is a fairly messy and nasty piece of code. This looks a bit neater
public static async Task DoStuffAsync(Content[] contents, string siteurl, string src, string target)
{
var throttle = new SemaphoreSlim(10, 10);
// local method
async Task<(Content, SomeResponse)> PostAsyncWrapper(Content content)
{
await throttle.WaitAsync();
try
{
// return a content and result pair
return (content, await PostAsync(content, siteurl, src, target));
}
finally
{
throttle.Release();
}
}
var results = await Task.WhenAll(contents.Select(PostAsyncWrapper));
// do stuff with your results pairs here
}
There are many other ways you could do this, PLinq, Parallel.For,Parallel.ForEach, Or just tidying up your captures in your loops like above.
However since you have an IO bound work load, and you have async methods that run it. The most appropriate solution is the async await pattern which neither Parallel.For,Parallel.ForEach cater for optimally.
Another way is TPL DataFlow library which can be found in the System.Threading.Tasks.Dataflow nuget package.
Code
public static async Task DoStuffAsync(Content[] contents, string siteurl, string src, string target)
{
async Task<(Content, SomeResponse)> PostAsyncWrapper(Content content)
{
return (content, await PostAsync(content, siteurl, src, target));
}
var bufferblock = new BufferBlock<(Content, SomeResponse)>();
var actionBlock = new TransformBlock<Content, (Content, SomeResponse)>(
content => PostAsyncWrapper(content),
new ExecutionDataflowBlockOptions
{
EnsureOrdered = false,
MaxDegreeOfParallelism = 100,
SingleProducerConstrained = true
});
actionBlock.LinkTo(bufferblock);
foreach (var content in contents)
actionBlock.Post(content);
actionBlock.Complete();
await actionBlock.Completion;
if (bufferblock.TryReceiveAll(out var result))
{
// do stuff with your results pairs here
}
}
Basically this creates a BufferBlock And TransformBlock, You pump your work load into the TransformBlock, it has degrees of parallel in its options, and it pushes them into the BufferBlock, you await completion and get your results.
Why Dataflow? because it deals with the async await, it has MaxDegreeOfParallelism, it designed for IO bound or CPU Bound workloads, and its extremely simple to use. Additionally, as most data is generally processed in many ways (in a pipeline), you can then use it to pipe and manipulate streams of data in sequence and parallel or in any way way you choose down the line.
Anyway good luck
The following code is meant to copy files asynchronously but it causes deadlock in my app. It uses a task combinator helper method called 'Interleaved(..)' found here to return tasks in the order they complete.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_DEADLOCK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
List<Task<StorageFile>> copyTasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copyTask = file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copyTasks.Add(copyTask);
}
// Serve up each task as it completes
foreach (var bucket in Interleaved(copyTasks))
{
var copyTask = await bucket;
var copiedFile = await copyTask;
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFiles.Count() * 100.0));
}
return copiedFiles;
}
I originally created a simpler 'CopyFiles(...)' which processes the tasks in the order they were supplied (as opposed to completed) and this works fine, but I can't figure out why this one deadlocks frequently. Particularly, when there are many files to process.
Here is the simpler 'CopyFiles' code that works:
public static async Task<List<StorageFile>> CopyFiles_RUNS_OK(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
List<StorageFile> copiedFiles = new List<StorageFile>();
int sourceFilesCount = sourceFiles.Count();
List<Task<StorageFile>> tasks = new List<Task<StorageFile>>();
foreach (var file in sourceFiles)
{
// Create the copy tasks and add to list
var copiedFile = await file.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
copiedFiles.Add(copiedFile);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount *100.0));
}
return copiedFiles;
}
EDIT:
In an attempt to find out what's going on I've changed the implementation of CopyFiles(...) to use TPL Dataflow. I am aware that this code will return items in the order they were supplied, which is not what I want, but it removes the Interleaved dependency as a start. Anyway, despite this the app still hangs. It seems as if it's not returning from the file.CopyAsync(..) call. There is of course the possibility I'm just doing something wrong here.
public static async Task<List<StorageFile>> CopyFiles_CAUSES_HANGING_ALSO(IEnumerable<StorageFile> sourceFiles, IProgress<int> progress, StorageFolder destinationFolder)
{
int sourceFilesCount = sourceFiles.Count();
List<StorageFile> copiedFiles = new List<StorageFile>();
// Store for input files.
BufferBlock<StorageFile> inputFiles = new BufferBlock<StorageFile>();
//
Func<StorageFile, Task<StorageFile>> copyFunc = sf => sf.CopyAsync(destinationFolder, Guid.NewGuid().ToString()).AsTask();
TransformBlock<StorageFile, Task<StorageFile>> copyFilesBlock = new TransformBlock<StorageFile, Task<StorageFile>>(copyFunc);
inputFiles.LinkTo(copyFilesBlock, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (var file in sourceFiles)
{
inputFiles.Post(file);
}
inputFiles.Complete();
while (await copyFilesBlock.OutputAvailableAsync())
{
Task<StorageFile> file = await copyFilesBlock.ReceiveAsync();
copiedFiles.Add(await file);
progress.Report((int)((double)copiedFiles.Count / sourceFilesCount * 100.0));
}
copyFilesBlock.Completion.Wait();
return copiedFiles;
}
Many thanks in advance for any help.
Hi i'm new in async programming. How can I run my method checkAvaible to run async? I would like to download 2500 pages at once if it's possible, dont wait to complete one download and start another. How can I make it?
private static void searchForLinks()
{
string url = "http://www.xxxx.pl/xxxx/?action=xxxx&id=";
for (int i = 0; i < 2500; i++)
{
string tmp = url;
tmp += Convert.ToString(i);
checkAvaible(tmp); // this method run async, do not wait while one page is downloading
}
Console.WriteLine(listOfUrls.Count());
Console.ReadLine();
}
private static async void checkAvaible(string url)
{
using (WebClient client = new WebClient())
{
string htmlCode = client.DownloadString(url);
if (htmlCode.IndexOf("Brak takiego obiektu w naszej bazie!") == -1)
listOfUrls.Add(url);
}
}
You would not want to download 2500 pages at the same time since this will be a problem for both your client and the server. Instead, I have added a concurrent download limitation (of 10 by default). The web pages will be downloaded 10 page at a time. (Or you can change it to 2500 if you are running a super computer :))
Generic Lists (I think it is a List of strings in your case) is not thread safe by default therefore you should synchronize access to the Add method. I have also added that.
Here is the updated source code to download pages asynhcronously with a configurable amount of concurrent calls
private static List<string> listOfUrls = new List<string>();
private static void searchForLinks()
{
string url = "http://www.xxxx.pl/xxxx/?action=xxxx&id=";
int numberOfConcurrentDownloads = 10;
for (int i = 0; i < 2500; i += numberOfConcurrentDownloads)
{
List<Task> allDownloads = new List<Task>();
for (int j = i; j < i + numberOfConcurrentDownloads; j++)
{
string tmp = url;
tmp += Convert.ToString(i);
allDownloads.Add(checkAvaible(tmp));
}
Task.WaitAll(allDownloads.ToArray());
}
Console.WriteLine(listOfUrls.Count());
Console.ReadLine();
}
private static async Task checkAvaible(string url)
{
using (WebClient client = new WebClient())
{
string htmlCode = await client.DownloadStringTaskAsync(new Uri(url));
if (htmlCode.IndexOf("Brak takiego obiektu w naszej bazie!") == -1)
{
lock (listOfUrls)
{
listOfUrls.Add(url);
}
}
}
}
It's best to convert code to async by working from the inside and proceeding out. Follow best practices along the way, such as avoiding async void, using the Async suffix, and returning results instead of modifying shared variables:
private static async Task<string> checkAvaibleAsync(string url)
{
using (var client = new HttpClient())
{
string htmlCode = await client.GetStringAsync(url);
if (htmlCode.IndexOf("Brak takiego obiektu w naszej bazie!") == -1)
return url;
else
return null;
}
}
You can then start off any number of these concurrently using Task.WhenAll:
private static async Task<string[]> searchForLinksAsync()
{
string url = "http://www.xxxx.pl/xxxx/?action=xxxx&id=";
var tasks = Enumerable.Range(0, 2500).Select(i => checkAvailableAsync(url + i));
var results = await Task.WhenAll(tasks);
var listOfUrls = results.Where(x => x != null).ToArray();
Console.WriteLine(listOfUrls.Length);
Console.ReadLine();
}
There're Task.WaitAll method which waits for all tasks and Task.WaitAny method which waits for one task. How to wait for any N tasks?
Use case: search result pages are downloaded, each result needs a separate task to download and process it. If I use WaitAll to wait for the results of the subtasks before getting next search result page, I will not use all available resources (one long task will delay the rest). Not waiting at all can cause thousands of tasks to be queued which isn't the best idea either.
So, how to wait for a subset of tasks to be completed? Or, alternatively, how to wait for the task scheduler queue to have only N tasks?
This looks like an excellent problem for TPL Dataflow, which will allow you to control parallelism and buffering to process at maximum speed.
Here's some (untested) code to show you what I mean:
static void Process()
{
var searchReader =
new TransformManyBlock<SearchResult, SearchResult>(async uri =>
{
// return a list of search results at uri.
return new[]
{
new SearchResult
{
IsResult = true,
Uri = "http://foo.com"
},
new SearchResult
{
// return the next search result page here.
IsResult = false,
Uri = "http://google.com/next"
}
};
}, new ExecutionDataflowBlockOptions
{
BoundedCapacity = 8, // restrict buffer size.
MaxDegreeOfParallelism = 4 // control parallelism.
});
// link "next" pages back to the searchReader.
searchReader.LinkTo(searchReader, x => !x.IsResult);
var resultActor = new ActionBlock<SearchResult>(async uri =>
{
// do something with the search result.
}, new ExecutionDataflowBlockOptions
{
BoundedCapacity = 64,
MaxDegreeOfParallelism = 16
});
// link search results into resultActor.
searchReader.LinkTo(resultActor, x => x.IsResult);
// put in the first piece of input.
searchReader.Post(new SearchResult { Uri = "http://google/first" });
}
struct SearchResult
{
public bool IsResult { get; set; }
public string Uri { get; set; }
}
I think you should independently limit the number of parallel download tasks and the number of concurrent result processing tasks. I would do it using two SemaphoreSlim objects, like below. This version doesn't use the synchronous SemaphoreSlim.Wait (thanks #svick for making the point). It was only slightly tested, the exception handling can be improved; substitute your own DownloadNextPageAsync and ProcessResults:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Console_21666797
{
partial class Program
{
// the actual download method
// async Task<string> DownloadNextPageAsync(string url) { ... }
// the actual process methods
// void ProcessResults(string data) { ... }
// download and process all pages
async Task DownloadAndProcessAllAsync(
string startUrl, int maxDownloads, int maxProcesses)
{
// max parallel downloads
var downloadSemaphore = new SemaphoreSlim(maxDownloads);
// max parallel processing tasks
var processSemaphore = new SemaphoreSlim(maxProcesses);
var tasks = new HashSet<Task>();
var complete = false;
var protect = new Object(); // protect tasks
var page = 0;
// do the page
Func<string, Task> doPageAsync = async (url) =>
{
bool downloadSemaphoreAcquired = true;
try
{
// download the page
var data = await DownloadNextPageAsync(
url).ConfigureAwait(false);
if (String.IsNullOrEmpty(data))
{
Volatile.Write(ref complete, true);
}
else
{
// enable the next download to happen
downloadSemaphore.Release();
downloadSemaphoreAcquired = false;
// process this download
await processSemaphore.WaitAsync();
try
{
await Task.Run(() => ProcessResults(data));
}
finally
{
processSemaphore.Release();
}
}
}
catch (Exception)
{
Volatile.Write(ref complete, true);
throw;
}
finally
{
if (downloadSemaphoreAcquired)
downloadSemaphore.Release();
}
};
// do the page and save the task
Func<string, Task> queuePageAsync = async (url) =>
{
var task = doPageAsync(url);
lock (protect)
tasks.Add(task);
await task;
lock (protect)
tasks.Remove(task);
};
// process pages in a loop until complete is true
while (!Volatile.Read(ref complete))
{
page++;
// acquire download semaphore synchrnously
await downloadSemaphore.WaitAsync().ConfigureAwait(false);
// do the page
var task = queuePageAsync(startUrl + "?page=" + page);
}
// await completion of the pending tasks
Task[] pendingTasks;
lock (protect)
pendingTasks = tasks.ToArray();
await Task.WhenAll(pendingTasks);
}
static void Main(string[] args)
{
new Program().DownloadAndProcessAllAsync("http://google.com", 10, 5).Wait();
Console.ReadLine();
}
}
}
Something like this should work. There might be some edge cases, but all in all it should ensure a minimum of completions.
public static async Task WhenN(IEnumerable<Task> tasks, int n, CancellationTokenSource cts = null)
{
var pending = new HashSet<Task>(tasks);
if (n > pending.Count)
{
n = pending.Count;
// or throw
}
var completed = 0;
while (completed != n)
{
var completedTask = await Task.WhenAny(pending);
pending.Remove(completedTask);
completed++;
}
if (cts != null)
{
cts.Cancel();
}
}
Usage:
static void Main(string[] args)
{
var tasks = new List<Task>();
var completed = 0;
var cts = new CancellationTokenSource();
for (int i = 0; i < 100; i++)
{
tasks.Add(Task.Run(async () =>
{
await Task.Delay(temp * 100, cts.Token);
Console.WriteLine("Completed task {0}", i);
completed++;
}, cts.Token));
}
Extensions.WhenN(tasks, 30, cts).Wait();
Console.WriteLine(completed);
Console.ReadLine();
}
Task[] runningTasks = MyTasksFactory.StartTasks();
while(runningTasks.Any())
{
int finished = Task.WaitAny(runningTasks);
Task.Factory.StareNew(()=> {Consume(runningTasks[Finished].Result);})
runningTasks.RemoveAt(finished);
}