I am trying to learn how to write asynchronous methods and have been banging my head against getting asynchronous calls to work. The code always seems to hang on the await until it eventually times out and crashes, taking the loading form in the same method with it.
There are two main reasons this is strange:
The code works flawlessly when it is not asynchronous and is just a simple loop.
I copied the MSDN code almost verbatim to convert the code to asynchronous calls: https://msdn.microsoft.com/en-us/library/mt674889.aspx
I know there are already a lot of questions about this on the forums, but I have gone through most of them and tried many other approaches (with the same result). Now that even the MSDN code isn't working, I suspect something is fundamentally wrong.
Here is the main method that is called by a background worker:
// this method loads data from each individual webPage
async Task LoadSymbolData(DoWorkEventArgs _e)
{
int MAX_THREADS = 10;
int tskCntrTtl = dataGridView1.Rows.Count;
Dictionary<string, string> newData_d = new Dictionary<string, string>(tskCntrTtl);
// we need to make copies of things that can change in a different thread
List<string> links = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
.Select(r => r.Cells[dbIndexs_s.url].Value.ToString()).ToList());
List<string> symbols = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
.Select(r => r.Cells[dbIndexs_s.symbol].Value.ToString()).ToList());
// we need to create a cancelation token once this is working
// TODO
using (LoadScreen loadScreen = new LoadScreen("Querying stock servers..."))
{
// we can't use the delegate because of the async keywords
this.loaderScreens.Add(loadScreen);
// wait until the form is loaded so we don't get exceptions when writing to controls on that form
while ( !loadScreen.IsLoaded() );
// load the total number of operations so we can simplify incrementing the progress bar
// on separate form instances
loadScreen.LoadProgressCntr(0, tskCntrTtl);
// try to run all async tasks since they are non-blocking threaded operations
for (int i = 0; i < tskCntrTtl; i += MAX_THREADS)
{
List<Task<string[]>> ProcessURL = new List<Task<string[]>>();
List<int> taskList = new List<int>();
// Make a list of task indexs
for (int task = i; task < i + MAX_THREADS && task < tskCntrTtl; task++)
taskList.Add(task);
// ***Create a query that, when executed, returns a collection of tasks.
IEnumerable<Task<string[]>> downloadTasksQuery =
from task in taskList select QueryHtml(loadScreen, links[task], symbols[task]);
// ***Use ToList to execute the query and start the tasks.
List<Task<string[]>> downloadTasks = downloadTasksQuery.ToList();
// ***Add a loop to process the tasks one at a time until none remain.
while (downloadTasks.Count > 0)
{
// Identify the first task that completes.
Task<string[]> firstFinishedTask = await Task.WhenAny(downloadTasks); // <---- CODE HANGS HERE
// ***Remove the selected task from the list so that you don't
// process it more than once.
downloadTasks.Remove(firstFinishedTask);
// Await the completed task.
string[] data = await firstFinishedTask;
if (!newData_d.ContainsKey(data.First()))
newData_d.Add(data.First(), data.Last());
}
}
// now we have the dictionary with all the information gathered from the websites
// now we can add the columns if they don't already exist and load the information
// TODO
loadScreen.UpdateProgress(100);
this.loaderScreens.Remove(loadScreen);
}
}
And here is the async method for querying web pages:
async Task<string[]> QueryHtml(LoadScreen _loadScreen, string _link, string _symbol)
{
string data = String.Empty;
try
{
HttpClient client = new HttpClient();
var doc = new HtmlAgilityPack.HtmlDocument();
var html = await client.GetStringAsync(_link); // <---- CODE HANGS HERE
doc.LoadHtml(html);
string percGrn = doc.FindInnerHtml(
"//span[contains(@class,'time_rtq_content') and contains(@class,'up_g')]//span[2]");
string percRed = doc.FindInnerHtml(
"//span[contains(@class,'time_rtq_content') and contains(@class,'down_r')]//span[2]");
// create something we'll understand later
if ((String.IsNullOrEmpty(percGrn) && String.IsNullOrEmpty(percRed)) ||
(!String.IsNullOrEmpty(percGrn) && !String.IsNullOrEmpty(percRed)))
throw new Exception();
// adding string to empty gives string
string perc = percGrn + percRed;
bool isNegative = String.IsNullOrEmpty(percGrn);
double percDouble;
if (double.TryParse(Regex.Match(perc, @"\d+([.])?(\d+)?").Value, out percDouble))
data = (isNegative ? 0 - percDouble : percDouble).ToString();
}
catch (Exception ex) { }
finally
{
// update the progress bar...
_loadScreen.IncProgressCntr();
}
return new string[] { _symbol, data };
}
I could really use some help. Thanks!
In short, when you mix async/await with code that blocks on tasks synchronously (for example with .Wait() or .Result), you can get a deadlock:
http://olitee.com/2015/01/c-async-await-common-deadlock-scenario/
The solution is to use ConfigureAwait(false):
var html = await client.GetStringAsync(_link).ConfigureAwait(false);
You need this because the original calling thread does not await the tasks you start here:
// ***Create a query that, when executed, returns a collection of tasks.
IEnumerable<Task<string[]>> downloadTasksQuery = from task in taskList select QueryHtml(loadScreen, links[task], symbols[task]);
What's happening here is that you mix the await paradigm with the regular task-handling paradigm, and those don't mix (or rather, you have to use ConfigureAwait(false) for this to work).
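To make the ConfigureAwait(false) suggestion concrete, here is a minimal sketch. It assumes a console app, which has no SynchronizationContext and so will not deadlock either way; it only shows where the call goes. FakeDownloadAsync and QueryLengthAsync are made-up names standing in for client.GetStringAsync and QueryHtml.

```csharp
using System;
using System.Threading.Tasks;

public static class ConfigureAwaitSketch
{
    // Hypothetical stand-in for client.GetStringAsync(_link); no network access.
    public static async Task<string> FakeDownloadAsync(string link)
    {
        // ConfigureAwait(false): the continuation does not need to resume
        // on the caller's captured context (for example the UI thread).
        await Task.Delay(50).ConfigureAwait(false);
        return "<html>" + link + "</html>";
    }

    // Helper code that never touches UI controls can use ConfigureAwait(false) throughout.
    public static async Task<int> QueryLengthAsync(string link)
    {
        string html = await FakeDownloadAsync(link).ConfigureAwait(false);
        return html.Length;
    }

    public static async Task Main()
    {
        // Awaiting all the way up avoids blocking with .Result or .Wait().
        int length = await QueryLengthAsync("http://example.com");
        Console.WriteLine(length);
    }
}
```

Code that updates controls after the await still needs the UI context, so ConfigureAwait(false) belongs in the non-UI helper methods rather than in event handlers.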
I am using the ArcGIS SDK in C# to search for addresses. I would like to search for multiple addresses and then execute a method once all of them have been found, most likely using a loop.
Here is the code to find an address on the map:
public async void searchSubjectAddress(string sAddress)
{
var uri = new Uri("http://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer");
var token = string.Empty;
var locator = new OnlineLocatorTask(uri, token);
var info = await locator.GetInfoAsync();
var singleAddressFieldName = info.SingleLineAddressField.FieldName;
var address = new Dictionary<string, string>();
address.Add(singleAddressFieldName, sAddress);
var candidateFields = new List<string> { "Score", "Addr_type", "Match_addr", "Side" };
var task = locator.GeocodeAsync(address, candidateFields, MyMapView.SpatialReference, new CancellationToken());
IList<LocatorGeocodeResult> results = null;
try
{
results = await task;
if (results.Count > 0)
{
var firstMatch = results[0];
var matchLocation = firstMatch.Location as MapPoint;
Console.WriteLine($"Found point: {matchLocation.ToString()}");
MyMapView.SetView(matchLocation);
}
}
catch (Exception ex)
{
Console.WriteLine("Could not find point");
var msg = $"Exception from geocode: {ex.Message} At address: {sAddress}";
Console.WriteLine(msg);
}
}
I am currently following this tutorial:
https://developers.arcgis.com/net/10-2/desktop/guide/search-for-places.htm
I can find a single address, but the asynchronous tasks are a bit confusing. The code has to run asynchronously to work, so I can't change that.
To use this in an example:
I want to get the distance between one property and several others. I only have access to the street addresses, so I use the above code to find each address and fetch its geo-coordinates. Then I save those coordinates in a list for later use.
My problem is that when I want to execute the later methods, the async tasks are still running, and my program executes the later methods regardless of whether the async methods have completed. When I change the method to return Task instead of void, I usually end up with an endless wait and no tasks being completed.
I would like to know how I can run the above method sequentially over a list of addresses (each new task starting only when the previous one has finished) and then run a method when all the async tasks are finished. It would also be nice if the async tasks stopped once a matching address is found.
Help would be much appreciated!
Change
public async void searchSubjectAddress(string sAddress)
to
public async Task searchSubjectAddress(string sAddress)
Then whatever is calling searchSubjectAddress can await this call, so you know when it's complete.
You might also consider making it return Task<SearchResult> instead, so the address can be sent back to the caller.
Generally, when you write "async void" you're setting yourself up for problems, so try to always make things "async Task" or "async Task<T>" instead.
On a side note: The v10.x releases have long been deprecated/unsupported, and you really should move to v100.8
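As a rough sketch of the "async Task<T>" advice applied to a list of addresses: FindAddressAsync below is a hypothetical stand-in for the geocoding call (it only simulates a delay and does not use the ArcGIS API), and the address strings are invented. The loop awaits each lookup before starting the next, and the code after the loop runs only once every lookup has finished.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SequentialGeocodeSketch
{
    // Hypothetical stand-in for the geocoder; returns null when nothing is found.
    public static async Task<string> FindAddressAsync(string address)
    {
        await Task.Delay(20); // simulate the network round trip
        return address.Contains("Unknown") ? null : "coords(" + address + ")";
    }

    // Looks up each address in order; the next lookup starts only after the previous one finishes.
    public static async Task<List<string>> FindAllAsync(IEnumerable<string> addresses)
    {
        var found = new List<string>();
        foreach (string address in addresses)
        {
            string result = await FindAddressAsync(address);
            if (result != null)
                found.Add(result);
        }
        return found;
    }

    public static async Task Main()
    {
        var results = await FindAllAsync(new[] { "1 Main St", "Unknown Rd", "2 Oak Ave" });
        // This line runs only after every lookup has completed.
        Console.WriteLine(results.Count + " addresses located");
    }
}
```

Because the caller awaits FindAllAsync, there is no need to guess when the lookups are done, which is what "async void" makes impossible.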
I have a method in which I'm retrieving a list of deployments. For each deployment I want to retrieve an associated release. Because all calls are made to an external API, I now have a foreach-loop in which those calls are made.
public static async Task<List<Deployment>> GetDeployments()
{
try
{
var depjson = await GetJson($"{BASEURL}release/deployments?deploymentStatus=succeeded&definitionId=2&definitionEnvironmentId=5&minStartedTime={MinDateTime}");
var deployments = (JsonConvert.DeserializeObject<DeploymentWrapper>(depjson))?.Value?.OrderByDescending(x => x.DeployedOn)?.ToList();
foreach (var deployment in deployments)
{
var reljson = await GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}");
deployment.Release = JsonConvert.DeserializeObject<Release>(reljson);
}
return deployments;
}
catch (Exception)
{
throw;
}
}
This all works perfectly fine. However, I do not like the await in the foreach-loop at all. I also believe this is not considered good practice. I just don't see how to refactor this so the calls are made parallel, because the result of each call is used to set a property of the deployment.
I would appreciate any suggestions on how to make this method faster and, whenever possible, avoid the await-ing in the foreach-loop.
There is nothing wrong with what you are doing now. But there is a way to call all tasks at once instead of waiting for a single task, then processing it and then waiting for another one.
This is how you can turn this:
wait for one -> process -> wait for one -> process ...
into
wait for all -> process -> done
Convert this:
foreach (var deployment in deployments)
{
var reljson = await GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}");
deployment.Release = JsonConvert.DeserializeObject<Release>(reljson);
}
To:
var deplTasks = deployments.Select(d => GetJson($"{BASEURL}release/releases/{d.ReleaseId}"));
var reljsons = await Task.WhenAll(deplTasks);
for(var index = 0; index < deployments.Count; index++)
{
deployments[index].Release = JsonConvert.DeserializeObject<Release>(reljsons[index]);
}
First you build a collection of unfinished tasks. Then you await it and get an array of results (reljsons). Then you deserialize each result and assign it to Release.
By using await Task.WhenAll() you wait for all the tasks at the same time, so you should see a performance boost from that.
Let me know if there are typos, I didn't compile this code.
Fcin suggested starting all the tasks, awaiting until they have all finished, and only then deserializing the fetched data.
However, if the first task has already finished while the second is still awaiting internally, the first task's data could already be deserialized. This would shorten the time your process spends idly waiting.
So instead of:
var deplTasks = deployments.Select(d => GetJson($"{BASEURL}release/releases/{d.ReleaseId}"));
var reljsons = await Task.WhenAll(deplTasks);
for(var index = 0; index < deployments.Count; index++)
{
deployments[index].Release = JsonConvert.DeserializeObject<Release>(reljsons[index]);
}
I'd suggest the following slight change:
// async fetch the Release data of Deployment:
private async Task<Release> FetchReleaseDataAsync(Deployment deployment)
{
var reljson = await GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}");
return JsonConvert.DeserializeObject<Release>(reljson);
}
// async fill the Release data of Deployment:
private async Task FillReleaseDataAsync(Deployment deployment)
{
deployment.Release = await FetchReleaseDataAsync(deployment);
}
Then your procedure is similar to the solution that Fcin suggested:
IEnumerable<Task> tasksFillDeploymentWithReleaseData = deployments
    .Select(deployment => FillReleaseDataAsync(deployment))
    .ToList();
await Task.WhenAll(tasksFillDeploymentWithReleaseData);
Now if the first task has to wait while fetching the release data, the second task begins, then the third, and so on. If the first task has already finished fetching its release data while the other tasks are still awaiting theirs, the first task starts deserializing right away and assigns the result to deployment.Release, after which the first task is complete.
If, for instance, the 7th task has its data but the 2nd task is still waiting, the 7th task can deserialize and assign the data to deployment.Release. Task 7 is then complete.
This continues until all tasks are complete. With this method there is less waiting time, because as soon as a task has its data it is scheduled to start deserializing.
If I understand you correctly, you want to run the var reljson = await GetJson calls in parallel.
Try this:
Parallel.ForEach(deployments, (deployment) =>
{
    // Parallel.ForEach does not await async lambdas, so block on the call here
    var reljson = GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}").GetAwaiter().GetResult();
    deployment.Release = JsonConvert.DeserializeObject<Release>(reljson);
});
You might limit the number of parallel executions, for example:
Parallel.ForEach(
    deployments,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    (deployment) =>
    {
        // Parallel.ForEach does not await async lambdas, so block on the call here
        var reljson = GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}").GetAwaiter().GetResult();
        deployment.Release = JsonConvert.DeserializeObject<Release>(reljson);
    });
You might also want to be able to break out of the loop:
Parallel.ForEach(deployments, (deployment, state) =>
{
    // Parallel.ForEach does not await async lambdas, so block on the call here
    var reljson = GetJson($"{BASEURL}release/releases/{deployment.ReleaseId}").GetAwaiter().GetResult();
    deployment.Release = JsonConvert.DeserializeObject<Release>(reljson);
    if (noFurtherProcessingRequired) state.Break();
});
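Since Parallel.ForEach takes synchronous delegates and does not await async lambdas, one alternative for capping the number of concurrent async calls is a SemaphoreSlim. This is only a sketch under assumptions: FakeGetJsonAsync is a hypothetical stand-in for GetJson that simulates a delay, and the integer ids replace the Deployment objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledFetchSketch
{
    // Hypothetical stand-in for GetJson; no network access.
    public static async Task<string> FakeGetJsonAsync(int releaseId)
    {
        await Task.Delay(30);
        return "{\"id\":" + releaseId + "}";
    }

    // Starts one task per id, but lets at most maxConcurrency calls run at a time.
    public static async Task<string[]> FetchAllAsync(IEnumerable<int> releaseIds, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = releaseIds.Select(async id =>
            {
                await gate.WaitAsync(); // wait for a free slot without blocking a thread
                try
                {
                    return await FakeGetJsonAsync(id);
                }
                finally
                {
                    gate.Release(); // free the slot even if the call throws
                }
            }).ToList();

            // Results come back in the same order as the input ids.
            return await Task.WhenAll(tasks);
        }
    }

    public static async Task Main()
    {
        string[] jsons = await FetchAllAsync(Enumerable.Range(1, 10), 4);
        Console.WriteLine(jsons.Length + " responses");
    }
}
```

This keeps the calls asynchronous end to end, so no thread sits blocked while a request is in flight.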
I have a series of Tasks in an array. If a Task is "Good", it returns a string. If it's "Bad", it returns null.
I want to be able to run all the Tasks in parallel, and once the first one comes back that is "Good", then cancel the others and get the "Good" result.
I am doing this now, but the problem is that all the tasks need to run, then I loop through them looking for the first good result.
List<Task<string>> tasks = new List<Task<string>>();
Task.WaitAll(tasks.ToArray());
I want to be able to run all the Tasks in parallel, and once the first one comes back that is "Good", then cancel the others and get the "Good" result.
This is a misunderstanding. Cancellation in the TPL is cooperative, so once a Task has started there is no way to cancel it from the outside. A CancellationToken can stop a Task before it starts, or the running code can check the token later and throw when cancellation has been requested, which is how you initiate any necessary action, such as throwing a custom exception from your logic.
Check the following question; it has many interesting answers, but none of them cancel. The following is also a possible option:
public static class TaskExtension<T>
{
public static async Task<T> FirstSuccess(IEnumerable<Task<T>> tasks, T goodResult)
{
// Create a List<Task<T>>
var taskList = new List<Task<T>>(tasks);
// Placeholder for the First Completed Task
Task<T> firstCompleted = default(Task<T>);
// Looping till the Tasks are available in the List
while (taskList.Count > 0)
{
// Fetch first completed Task
var currentCompleted = await Task.WhenAny(taskList);
// Compare Condition
if (currentCompleted.Status == TaskStatus.RanToCompletion
&& EqualityComparer<T>.Default.Equals(currentCompleted.Result, goodResult)) // null-safe: a "bad" task returns null
{
// Assign Task and Clear List
firstCompleted = currentCompleted;
break;
}
else
// Remove the Current Task
taskList.Remove(currentCompleted);
}
return (firstCompleted != default(Task<T>)) ? firstCompleted.Result : default(T);
}
}
Usage:
var t1 = new Task<string>(()=>"bad");
var t2 = new Task<string>(()=>"bad");
var t3 = new Task<string>(()=>"good");
var t4 = new Task<string>(()=>"good");
var taskArray = new []{t1,t2,t3,t4};
foreach(var tt in taskArray)
tt.Start();
var finalTask = TaskExtension<string>.FirstSuccess(taskArray,"good");
Console.WriteLine(finalTask.Result);
You may even return Task<Task<T>>, instead of Task<T> for necessary logical processing
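Because cancellation is cooperative, the remaining tasks only stop if they observe the token themselves. Here is a minimal sketch of that idea, assuming each task is created from a delegate that accepts a CancellationToken; WorkAsync, FirstGoodAsync, and the delays are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class FirstGoodSketch
{
    // Hypothetical worker: returns `result` after `delayMs`, unless cancelled first.
    public static async Task<string> WorkAsync(string result, int delayMs, CancellationToken token)
    {
        // Task.Delay observes the token; this is the cooperative part.
        await Task.Delay(delayMs, token);
        return result;
    }

    // Returns the first non-null result and asks the remaining tasks to cancel.
    public static async Task<string> FirstGoodAsync(Func<CancellationToken, Task<string>>[] starters)
    {
        using (var cts = new CancellationTokenSource())
        {
            List<Task<string>> pending = starters.Select(start => start(cts.Token)).ToList();
            while (pending.Count > 0)
            {
                Task<string> finished = await Task.WhenAny(pending);
                pending.Remove(finished);
                if (finished.Status == TaskStatus.RanToCompletion && finished.Result != null)
                {
                    cts.Cancel(); // request cancellation of everything still running
                    return finished.Result;
                }
            }
            return null; // no task produced a "good" result
        }
    }

    public static async Task Main()
    {
        string good = await FirstGoodAsync(new Func<CancellationToken, Task<string>>[]
        {
            token => WorkAsync(null, 10, token),     // "bad": finishes first with null
            token => WorkAsync("good", 50, token),   // first "good" result
            token => WorkAsync("late", 2000, token), // still running when cancellation is requested
        });
        Console.WriteLine(good);
    }
}
```

A task whose body never checks the token would simply keep running after cts.Cancel(), which is the point made above.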
You can achieve your desired results using following example.
List<Task<string>> tasks = new List<Task<string>>();
// ***Use ToList to execute the query and start the tasks.
List<Task<string>> goodBadTasks = tasks.ToList();
// ***Add a loop to process the tasks one at a time until none remain.
while (goodBadTasks.Count > 0)
{
// Identify the first task that completes.
Task<string> firstFinishedTask = await Task.WhenAny(goodBadTasks);
// ***Remove the selected task from the list so that you don't
// process it more than once.
goodBadTasks.Remove(firstFinishedTask);
// Await the completed task.
string firstFinishedTaskResult = await firstFinishedTask;
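if (firstFinishedTaskResult == "good") // == is null-safe; a "bad" task returns null
{
// do something with the good result, then stop waiting for the rest
break;
}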
}
EDIT : If you want to terminate all the tasks you can use CancellationToken.
For more detail read the docs.
I was looking into Task.WhenAny() which will trigger on the first "completed" task. Unfortunately, a completed task in this sense is basically anything... even an exception is considered "completed". As far as I can tell there is no other way to check for what you call a "good" value.
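A small sketch of that behaviour, assuming nothing beyond the standard Task API: awaiting Task.WhenAny does not throw when the winning task has faulted, so the returned task has to be inspected. The class and method names are made up.

```csharp
using System;
using System.Threading.Tasks;

public static class WhenAnyFaultSketch
{
    public static async Task<string> DescribeFirstAsync()
    {
        // Already faulted, so it counts as "completed" immediately.
        Task<string> failing = Task.FromException<string>(new InvalidOperationException("boom"));
        // Would produce a real value, but only after 200 ms.
        Task<string> slow = Task.Delay(200).ContinueWith(_ => "good");

        // WhenAny hands back whichever task completed first, faulted or not.
        Task<string> first = await Task.WhenAny(failing, slow);

        if (first.IsFaulted)
            return "faulted: " + first.Exception.InnerException.Message;
        return first.Result;
    }

    public static async Task Main()
    {
        Console.WriteLine(await DescribeFirstAsync());
    }
}
```

Checking first.Status (or IsFaulted and IsCanceled) before reading Result is what lets a loop skip failed tasks and keep waiting for a "good" one.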
While I don't believe there is a satisfactory answer for your question I think there may be an alternative solution to your problem. Consider using Parallel.ForEach.
Parallel.ForEach(tasks, (task, state) =>
{
if (task.Result != null)
state.Stop();
});
The state.Stop() will cease the execution of the Parallel loop when it finds a non-null result.
Besides having the ability to cease execution when it finds a "good" value, it will perform better under many (but not all) scenarios.
Use Task.WhenAny. It returns the task that finished. Check whether its result is null. If it is, remove the task from the list and call Task.WhenAny again.
If the result is good, cancel all the tasks in the list (they should all have a CancellationTokenSource.Token).
Edit:
All Tasks should use the same CancellationTokenSource.Token. Then you only need to cancel once.
Here is some code to clarify:
private async void button1_Click(object sender, EventArgs e)
{
CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
List<Task<string>> tasks = new List<Task<string>>();
tasks.Add(Task.Run<string>(() => // run your tasks
{
while (true)
{
if (cancellationTokenSource.Token.IsCancellationRequested)
{
return null;
}
return "Result"; //string or null
}
}));
while (tasks.Count > 0)
{
Task<string> resultTask = await Task.WhenAny(tasks);
string result = await resultTask;
if (result == null)
{
tasks.Remove(resultTask);
}
else
{
// success
cancellationTokenSource.Cancel(); // will cancel all tasks
break; // stop waiting; otherwise this loop never ends
}
}
}
I am using Async await with Task.Factory method.
public async Task<JobDto> ProcessJob(JobDto jobTask)
{
try
{
var T = Task.Factory.StartNew(() =>
{
JobWorker jobWorker = new JobWorker();
jobWorker.Execute(jobTask);
});
await T;
}
This method I am calling inside a loop like this
for(int i=0; i < jobList.Count(); i++)
{
tasks[i] = ProcessJob(jobList[i]);
}
What I notice is that new tasks show up in Process Explorer and they also start working (based on the log file). However, out of 10, sometimes only 8 or 7 finish. The rest just never come back.
Why would that be happening?
Are they timing out? Where can I set a timeout for my tasks?
UPDATE
Basically, I would like each task to start running as soon as it is called and wait for the response at the await T statement. I am assuming that once each of them finishes, it will come back at await T and do the next action. I am already seeing this result for 7 out of 10 tasks, but 3 of them are not coming back.
Thanks
It is hard to say what the issue is without the rest of the code, but your code can be simplified by making ProcessJob synchronous and then calling Task.Run with it.
public JobDto ProcessJob(JobDto jobTask)
{
JobWorker jobWorker = new JobWorker();
return jobWorker.Execute(jobTask);
}
Start tasks and wait for all tasks to finish. Prefer using Task.Run rather than Task.Factory.StartNew as it provides more favourable defaults for pushing work to the background. See here.
for (int i = 0; i < jobList.Count(); i++)
{
var job = jobList[i]; // copy the element so the lambda does not capture the loop variable i
tasks[i] = Task.Run(() => ProcessJob(job));
}
try
{
await Task.WhenAll(tasks);
}
catch(Exception ex)
{
// handle exception
}
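Here is a self-contained sketch of the same pattern, using int jobs instead of JobDto so it runs on its own. ProcessJob is a made-up stand-in that fails for one input, to show that awaiting Task.WhenAll surfaces the failure instead of a job silently "never coming back".

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllSketch
{
    // Hypothetical synchronous job; job 3 fails on purpose.
    public static int ProcessJob(int job)
    {
        if (job == 3)
            throw new InvalidOperationException("job 3 failed");
        return job * 2;
    }

    public static async Task<string> RunAllAsync(int[] jobList)
    {
        var tasks = new Task<int>[jobList.Length];
        for (int i = 0; i < jobList.Length; i++)
        {
            int job = jobList[i]; // copy, so the lambda does not capture the loop variable
            tasks[i] = Task.Run(() => ProcessJob(job));
        }

        try
        {
            int[] results = await Task.WhenAll(tasks);
            return "sum=" + results.Sum();
        }
        catch (InvalidOperationException ex)
        {
            // await rethrows the first exception from the failed tasks
            return "failed: " + ex.Message;
        }
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAllAsync(new[] { 1, 2 }));
        Console.WriteLine(await RunAllAsync(new[] { 1, 2, 3 }));
    }
}
```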
First, let's make a reproducible version of your code. This is NOT the best way to achieve what you are doing, but to show you what is happening in your code!
I'll keep the code almost same as your code, except I'll use simple int rather than your JobDto and on completion of the job Execute() I'll write in a file that we can verify later. Here's the code
public class SomeMainClass
{
public void StartProcessing()
{
var jobList = Enumerable.Range(1, 10).ToArray();
var tasks = new Task[10];
//[1] start 10 jobs, one-by-one
for (int i = 0; i < jobList.Count(); i++)
{
tasks[i] = ProcessJob(jobList[i]);
}
//[4] here we have 10 awaitable Task in tasks
//[5] do all other unrelated operations
Thread.Sleep(1500); //assume it works for 1.5 sec
// Task.WaitAll(tasks); //[6] wait for tasks to complete
// The PROCESS IS COMPLETE here
}
public async Task ProcessJob(int jobTask)
{
try
{
//[2] start job in a ThreadPool, Background thread
var T = Task.Factory.StartNew(() =>
{
JobWorker jobWorker = new JobWorker();
jobWorker.Execute(jobTask);
});
//[3] await here will keep context of calling thread
await T; //... and release the calling thread
}
catch (Exception) { /*handle*/ }
}
}
public class JobWorker
{
static object locker = new object();
const string _file = @"C:\YourDirectory\out.txt";
public void Execute(int jobTask) //on complete, writes in file
{
Thread.Sleep(500); //let's assume does something for 0.5 sec
lock(locker)
{
File.AppendAllText(_file,
Environment.NewLine + "Writing the value-" + jobTask);
}
}
}
After running just the StartProcessing(), this is what I get in the file
Writing the value-4
Writing the value-2
Writing the value-3
Writing the value-1
Writing the value-6
Writing the value-7
Writing the value-8
Writing the value-5
So, 8/10 jobs have completed. Obviously, every time you run this, the number and order might change. But the point is that not all the jobs completed!
Now, if I un-comment the step [6] Task.WaitAll(tasks);, this is what I get in my file
Writing the value-2
Writing the value-3
Writing the value-4
Writing the value-1
Writing the value-5
Writing the value-7
Writing the value-8
Writing the value-6
Writing the value-9
Writing the value-10
So, all my jobs completed here!
Why the code behaves like this is already explained in the code comments. The main thing to note is that your tasks run on ThreadPool-based background threads. So, if you do not wait for them, they are killed when the main process ends and the main thread exits!
If you still don't want to await the tasks there, you can return the list of tasks from this first method and await the tasks at the very end of the process, something like this
public Task[] StartProcessing()
{
...
for (int i = 0; i < jobList.Count(); i++)
{
tasks[i] = ProcessJob(jobList[i]);
}
...
return tasks;
}
//in the MAIN METHOD of your application/process
var tasks = new SomeMainClass().StartProcessing();
// do all other stuffs here, and just at the end of process
Task.WaitAll(tasks);
Hope this clears all confusion.
It's possible your code is swallowing exceptions. I would add a ContinueWith call to the end of the code that starts the new task. Something like this untested code:
var T = Task.Factory.StartNew(() =>
{
JobWorker jobWorker = new JobWorker();
jobWorker.Execute(jobTask);
}).ContinueWith(tsk =>
{
var flattenedException = tsk.Exception.Flatten();
Console.WriteLine("Exception! " + flattenedException);
return true;
}, TaskContinuationOptions.OnlyOnFaulted); // only call if the task faulted
Another possibility is that something in one of the tasks is timing out (like you mentioned) or deadlocking. To track down whether a timeout (or maybe deadlock) is the root cause, you could add some timeout logic (as described in this SO answer):
int timeout = 1000; //set to something much greater than the time it should take your task to complete (at least for testing)
var task = TheMethodWhichWrapsYourAsyncLogic(cancellationToken);
if (await Task.WhenAny(task, Task.Delay(timeout, cancellationToken)) == task)
{
// Task completed within timeout.
// Consider that the task may have faulted or been canceled.
// We re-await the task so that any exceptions/cancellation is rethrown.
await task;
}
else
{
// timeout/cancellation logic
}
Check out the documentation on exception handling in the TPL on MSDN.
Attempting to write an HTML crawler using the Async CTP, I have gotten stuck on how to write a recursion-free method for accomplishing this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration I have only 1 item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.Net 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the links from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}
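To see the block mechanics in isolation, here is a small sketch that is not the crawler itself: it has no feedback loop, uses plain strings instead of LinkItem, and simulates the download with a delay. It finishes through Complete() and PropagateCompletion rather than through cancellation. It assumes System.Threading.Tasks.Dataflow is available (on .NET Framework that means adding the NuGet package); the page names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowSketch
{
    public static async Task<List<string>> RunAsync(IEnumerable<string> pages)
    {
        var found = new List<string>();

        // Stage 1: "download" a page and emit two links per page (simulated, no network).
        Func<string, Task<IEnumerable<string>>> findLinks = async page =>
        {
            await Task.Delay(10);
            return new[] { page + "/a", page + "/b" };
        };
        var finder = new TransformManyBlock<string, string>(
            findLinks,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: store the links. Default parallelism is 1, so a plain List is safe here.
        var store = new ActionBlock<string>(link => found.Add(link));

        // PropagateCompletion lets the store block finish once the finder is done.
        finder.LinkTo(store, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (string page in pages)
            finder.Post(page);

        finder.Complete();      // no more input will arrive
        await store.Completion; // wait until every link has been stored
        return found;
    }

    public static async Task Main()
    {
        List<string> links = await RunAsync(new[] { "site1", "site2" });
        Console.WriteLine(links.Count + " links stored");
    }
}
```

In the crawler above, the loop from broadcastBlock back to linkFinderBlock means the pipeline never runs out of input on its own, which is why it relies on cancellation to stop.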