How to achieve parallelism and asynchrony in web crawling using BFS - C#

This question is going to be quite long but I want to explain my code and thought process as thoroughly as possible, so here goes...
I am coding a web crawler in C# which is supposed to search through Wikipedia from a given source link and find a way to a destination link. For example, you can give it the link to the toaster Wiki page and the link to the pancake Wiki page, and it should output a route that takes you from toaster to pancake. In other words, I want to find the shortest path between two Wiki articles.
I think I have coded that up correctly. I created two classes: one is called CrawlerPage, and here is its body:
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
namespace Wikipedia_Crawler
{
internal class CrawlerPage
{
public string mainLink;
private List<CrawlerPage> _pages = new();
public CrawlerPage(string mainLink)
{
this.mainLink = mainLink;
}
public async Task<List<CrawlerPage>> GetPages()
{
var pagesLinks = await Task.Run(() => GetPages(this));
foreach(var page in pagesLinks)
{
_pages.Add(new CrawlerPage(page));
}
return _pages;
}
private HashSet<string> GetPages(CrawlerPage page)
{
string result = "";
using (HttpClient client = new HttpClient())
{
using (HttpResponseMessage response = client.GetAsync(page.mainLink).Result)
{
using (HttpContent content = response.Content)
{
result = content.ReadAsStringAsync().Result;
}
}
}
var wikiLinksList = ParseLinks(result)
.Where(x => x.Contains("/wiki/") && !x.Contains("https://") && !x.Contains(".jpg") &&
!x.Contains(".png"))
.AsParallel()
.ToList();
var wikiLinksHashSet = new HashSet<string>();
foreach(var wikiLink in wikiLinksList)
{
wikiLinksHashSet.Add("https://en.wikipedia.org" + wikiLink);
}
HashSet<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
return nodes == null ? new HashSet<string>() : nodes.AsParallel().ToList().ConvertAll(
r => r.Attributes.AsParallel().ToList().ConvertAll(
i => i.Value)).SelectMany(j => j).AsParallel().ToHashSet();
}
return wikiLinksHashSet;
}
}
}
The class above is supposed to represent a Wiki article page. It contains its own link (the mainLink field) and a list of every other page that is linked from that page (the _pages field). The GetPages() methods basically read the page's HTML and parse it into a HashSet containing only the links I am interested in (links to other articles), so any other junk links are discarded.
Second class is a Crawler class that performs BFS (Breadth-first search). Code below:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace Wikipedia_Crawler
{
internal class Crawler
{
private int _maxDepth;
private int _currDepth;
public Crawler(int maxDepth)
{
_currDepth = 0;
_maxDepth = maxDepth;
}
public async Task CrawlParallelAsync(string sourceLink, string destinationLink)
{
var sourcePage = new CrawlerPage(sourceLink);
var destinationPage = new CrawlerPage(destinationLink);
var visited = new HashSet<string>();
Queue <CrawlerPage> queue = new();
queue.Enqueue(sourcePage);
while (queue.Count > 0)
{
var currPage = queue.Dequeue();
Console.WriteLine(currPage.mainLink);
var currPageSubpages = await Task.Run(() => currPage.GetPages());
if (currPage.mainLink == destinationPage.mainLink || _currDepth == _maxDepth)
{
visited.Add(currPage.mainLink);
break;
}
if (visited.Contains(currPage.mainLink))
continue;
visited.Add(currPage.mainLink);
foreach (var page in currPageSubpages)
{
if (!visited.Contains(page.mainLink))
{
queue.Enqueue(page);
}
}
}
foreach (var visitedPage in visited)
{
Console.WriteLine(visitedPage);
}
}
}
}
Note that I am not incrementing _currDepth yet - the idea is that if the search goes too deep, it should stop because the route would be too long.
The class above works as follows: it enqueues the page with sourceLink and performs standard BFS: it dequeues a page, checks if it has been visited, checks if it is the destination page, and then gets every subpage of that page (using currPage.GetPages()) and adds them to the queue. I believe that the algorithm works fine, although it is extremely sluggish and therefore not of much practical use.
My conclusion: it absolutely needs to be done asynchronously and in parallel in order to be efficient. I have tried with Tasks, as you can tell, but that doesn't improve the performance at all. My intuition tells me that every time we read the subpages of a page we should do that asynchronously and in parallel, and every time we start crawling a page we should do that asynchronously and in parallel as well. I have no idea how to achieve that - do I need to completely refactor my code? Should I create a new crawler every time I enqueue a subpage?
I'm lost, can you help me figure it out?

You could consider using the new (.NET 6) API Parallel.ForEachAsync. This method accepts an enumerable sequence, and invokes an asynchronous delegate for each element in the sequence, with a specific degree of parallelism. One overload of this method is particularly interesting, because it accepts an IAsyncEnumerable<T> as input, which is essentially an asynchronous stream of data. You could create such a stream dynamically with an iterator method (a method that yields), but it is probably easier to use a Channel<T> that exposes its contents as IAsyncEnumerable<T>. Here is a rough demonstration of this idea:
var channel = Channel.CreateUnbounded<CrawlerPage>();
channel.Writer.TryWrite(new CrawlerPage(sourceLink));
var cts = new CancellationTokenSource();
var options = new ParallelOptions()
{
MaxDegreeOfParallelism = 10,
CancellationToken = cts.Token
};
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
CrawlerPage[] subpages = await GetPagesAsync(page);
foreach (var subpage in subpages) channel.Writer.TryWrite(subpage);
});
The parallel loop will continue crunching pages until the channel.Writer.Complete() method is called and then all remaining pages in the channel are consumed, or until the CancellationTokenSource is canceled.
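One detail the snippet above leaves open is who calls channel.Writer.Complete(). Here is a minimal sketch of one way to handle it, reusing the channel, options and cts from the snippet above (GetPagesAsync and destinationLink are assumed: the hypothetical download method from the snippet and the destination URL from your Crawl method). It keeps a counter of pages still in flight, and completes the writer when the counter drops to zero or cancels when the destination is found:
int pending = 1; // the source page has already been written to the channel
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
    if (page.mainLink == destinationLink)
    {
        cts.Cancel(); // stop crawling; the loop will throw OperationCanceledException
        return;
    }
    CrawlerPage[] subpages = await GetPagesAsync(page);
    foreach (var subpage in subpages)
    {
        Interlocked.Increment(ref pending);
        channel.Writer.TryWrite(subpage);
    }
    // This page is done. If no other page is pending, no more items can ever
    // arrive, so complete the writer and let the loop finish gracefully.
    if (Interlocked.Decrement(ref pending) == 0) channel.Writer.Complete();
});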

Calling client.GetAsync(page.mainLink).Result makes your code wait synchronously. Use await client.GetAsync(page.mainLink) instead. Once you do that, you should not use Task.Run: Task.Run is meant for offloading synchronous work so that it executes asynchronously on the thread pool, not for wrapping code that is already asynchronous.
If you want parallelism you can await several Tasks using Task.WhenAll.
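For example, here is a minimal sketch of downloading several pages concurrently with a single shared HttpClient and Task.WhenAll (the helper names GetPageHtmlAsync and GetSubpagesHtmlAsync are only illustrative):
// One HttpClient for the whole crawler, instead of a new one per request.
private static readonly HttpClient _client = new HttpClient();

private static async Task<string> GetPageHtmlAsync(string link)
{
    using HttpResponseMessage response = await _client.GetAsync(link);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}

// Start all downloads, then await them together.
private static async Task<string[]> GetSubpagesHtmlAsync(IEnumerable<string> links)
{
    IEnumerable<Task<string>> downloads = links.Select(link => GetPageHtmlAsync(link));
    return await Task.WhenAll(downloads);
}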

Related

Observable with backpressure in C#

Is there a way in C# rx to handle backpressure?
I'm trying to call a web API with the results of a paged query. This web API is very fragile and I must not have more than, say, 3 concurrent calls, so the program should be something like:
Fetch a page from the db
Call the web API for each record on the page, with a maximum of three concurrent calls
Save the results back to the db
Fetch another page and repeat until there are no more results.
I'm not really getting the sequence that I'm after; basically the db query fetches all the records regardless of whether they can be processed yet or not.
I've tried a variety of things, including tweaking the ObserveOn operator, implementing a semaphore, and a few other things. Could I get a little bit of guidance on implementing something like this?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading;
using System.Threading.Tasks;
using Castle.Core.Internal;
using Xunit;
using Xunit.Abstractions;
namespace ProductValidation.CLI.Tests.Services
{
public class Example
{
private readonly ITestOutputHelper output;
public Example(ITestOutputHelper output)
{
this.output = output;
}
[Fact]
public async Task RunsObservableToCompletion()
{
var repo = new Repository(output);
var client = new ServiceClient(output);
var results = repo.FetchRecords()
.Select(x => client.FetchMoreInformation(x).ToObservable())
.Merge(1)
.Do(async x => await repo.Save(x));
await results.LastOrDefaultAsync();
}
}
public class Repository
{
private readonly ITestOutputHelper output;
public Repository(ITestOutputHelper output)
{
this.output = output;
}
public IObservable<int> FetchRecords()
{
return Observable.Create<int>(async (observer) =>
{
var page = 1;
var products = await FetchPage(page);
while (!products.IsNullOrEmpty())
{
foreach (var product in products)
{
observer.OnNext(product);
}
page += 1;
products = await FetchPage(page);
}
observer.OnCompleted();
})
.ObserveOn(SynchronizationContext.Current);
}
private async Task<IEnumerable<int>> FetchPage(int page)
{
// Simulate fetching a paged query.
await Task.Delay(500).ToObservable().ObserveOn(new TaskPoolScheduler(new TaskFactory()));
output.WriteLine("Fetching page {0}", page);
if (page >= 4) return Enumerable.Empty<int>();
return Enumerable.Range(1, 3).Select(_ => page);
}
public async Task Save(string id)
{
await Task.Delay(50); //Simulates latency
}
}
public class ServiceClient
{
private readonly ITestOutputHelper output;
private readonly SemaphoreSlim semaphore;
public ServiceClient(ITestOutputHelper output)
{
this.output = output;
this.semaphore = new SemaphoreSlim(2);
}
public async Task<string> FetchMoreInformation(int id)
{
try
{
output.WriteLine("Calling the web client for {0}", id);
await semaphore.WaitAsync(); // Protection for the webapi not sending too many calls
await Task.Delay(1000); //Simulates latency
return id.ToString();
}
finally
{
semaphore.Release();
}
}
}
}
The Rx does not support backpressure, so there is no easy way to fetch the records from the DB at the same tempo that the records are processed. Maybe you could use a Subject<Unit> as a signaling mechanism, push a value every time a record is processed, and devise a way to use these signals at the producing side to fetch a new record from the DB when a signal is received. But it would be a messy and unidiomatic solution. The TPL Dataflow is a more suitable tool than the Rx for doing this kind of work. It natively supports the BoundedCapacity configuration option.
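As a rough sketch of what that could look like with TPL Dataflow, assuming the repository exposes its FetchPage method directly instead of wrapping it in the FetchRecords observable (the capacity and parallelism numbers below are only illustrative):
var callApiBlock = new TransformBlock<int, string>(
    id => client.FetchMoreInformation(id),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 3, // at most 3 concurrent web API calls
        BoundedCapacity = 10        // backpressure: stop accepting items when 10 are buffered
    });

var saveBlock = new ActionBlock<string>(
    result => repo.Save(result),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 10 });

callApiBlock.LinkTo(saveBlock, new DataflowLinkOptions { PropagateCompletion = true });

// The producer awaits SendAsync, so it is suspended while the pipeline is full.
var page = 1;
var products = await repo.FetchPage(page);
while (products.Any())
{
    foreach (var product in products)
        await callApiBlock.SendAsync(product);
    products = await repo.FetchPage(++page);
}
callApiBlock.Complete();
await saveBlock.Completion;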
Some comments regarding the code you've posted, that are not directly related to the backpressure issue:
The Merge operator with a maxConcurrent parameter imposes a limit on the concurrent subscriptions to the inner sequences, but this will have no effect in case the inner sequences are already up and running. So you have to ensure that the inner sequences are cold, and a handy way to do this is the Defer operator:
.Select(x => Observable.Defer(() =>
client.FetchMoreInformation(x).ToObservable()))
A more common way to convert asynchronous methods to deferred observable sequences is the FromAsync operator:
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
Btw the Do operator does not understand async delegates, so instead of:
.Do(async x => await repo.Save(x));
...which creates async void lambdas, it's better to do this:
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1);
Update: Here is an example of how you could use a SemaphoreSlim in order to implement backpressure in Rx:
const int boundedCapacity = 10;
using var semaphore = new SemaphoreSlim(boundedCapacity, boundedCapacity);
IObservable<int> results = repo
.FetchRecords(semaphore)
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
.Merge(1)
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1)
.Do(_ => semaphore.Release());
await results.DefaultIfEmpty();
And inside the FetchRecords method:
//...
await semaphore.WaitAsync();
observer.OnNext(product);
//...
This is a fragile solution, because it depends on propagating all elements through the pipeline. If in the future you decide to include filtering or throttling inside the pipeline, then the one-to-one relationship between WaitAsync and Release will be violated, with the most probable outcome being a deadlocked pipeline.

Better way to download html content from multiple pages [c#]

I'm writing a web scraping program, and here is the situation: I have 10 links on one page, and for every link I need to download the HTML text to scrape data from it, then move on to the next page and repeat the whole process. When I do this synchronously it takes 5-10 sec to download the HTML text for one link (the page is also slow when I open it in a browser). So I looked for an asynchronous way to implement this, and for 10 links it takes 5-10 sec to download the HTML text. I have to loop through 100 pages, and it took 30 minutes to process all the data.
I don't have too much experience with Tasks in C#, so I made this code and it works, but I'm not sure whether it is good or whether a better solution exists.
class Program
{
public static List<Task> TaskList = new List<Task>();
public static List<Data> webData = new List<Data>();
public static async Task<string> GetHtmlText(string link)
{
using (HttpClient client = new HttpClient())
{
return await client.GetStringAsync(link);
}
}
public static void Main(string[] args)
{
for(int i = 0; i < 100; i++)
{
List<string> links = GetLinksFromPage(i); // returns 10 links from page //replaced with edit solution >>>
foreach (var link in links)
{
Task<string> task= Task.Run(() => GetHtmlText(link));
TaskList.Add(task);
}
Task.WaitAll(TaskList.ToArray()); // replaced with edit solution <<<
foreach(Task<string> task in TaskList)
{
string html = task.Result;
Data data = GetDataFromHtml(html);
webData.Add(data);
}
...
}
}
EDIT:
This made my day: setting DefaultConnectionLimit to 50.
ServicePointManager.DefaultConnectionLimit = 50;
var concurrentBag = new ConcurrentBag<string>();
var t = linksFromPage.Select(async link =>
{
var response = await GetLinkStringTaskAsync(link);
concurrentBag.Add(response);
});
await Task.WhenAll(t);

Parallel.For and httpclient crash the application C#

I want to avoid the application crashing problem caused by a parallel for loop together with HttpClient, but I am unable to apply the solutions that are provided elsewhere on the web due to my limited knowledge of programming. My code is pasted below.
class Program
{
public static List<string> words = new List<string>();
public static int count = 0;
public static string output = "";
private static HttpClient Client = new HttpClient();
public static void Main(string[] args)
{
//input path strings...
List<string> links = new List<string>();
links.AddRange(File.ReadAllLines(input));
List<string> longList = new List<string>(File.ReadAllLines(@"a.txt"));
words.AddRange(File.ReadAllLines(output1));
System.Net.ServicePointManager.DefaultConnectionLimit = 8;
count = longList.Count;
//for (int i = 0; i < longList.Count; i++)
Task.Run(() => Parallel.For(0, longList.Count, new ParallelOptions { MaxDegreeOfParallelism = 5 }, (i, loopState) =>
{
Console.WriteLine(i);
string link = @"some link" + longList[i] + "/";
try
{
if (!links.Contains(link))
{
Task.Run(async () => { await Download(link); }).Wait();
}
}
catch (System.Exception e)
{
}
}));
//}
}
public static async Task Download(string link)
{
HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
document.LoadHtml(await getURL(link));
//...stuff with html agility pack
}
public static async Task<string> getURL(string link)
{
string result = "";
HttpResponseMessage response = await Client.GetAsync(link);
Console.WriteLine(response.StatusCode);
if(response.IsSuccessStatusCode)
{
HttpContent content = response.Content;
var bytes = await response.Content.ReadAsByteArrayAsync();
result = Encoding.UTF8.GetString(bytes);
}
return result;
}
}
There are solutions, for example this one, but I don't know how to put the await keyword in my Main method, and currently the program simply exits because there is no await before Task.Run(). As you can see, I have already applied a workaround to the async Download() method so it can be called from the Main method.
I also have doubts regarding using the same instance of HttpClient in different parallel threads. Please advise me whether I should create a new instance of HttpClient each time.
You're right that you have to block tasks somewhere in a console application, otherwise the program will just exit before it's complete. But you're doing this more than you need to. Aim for just blocking the main thread and delegating the rest to an async method. A good practice is to create a method with a signature like private async Task MainAsync(args), put the "guts" of your program logic there, and call it from Main like this:
MainAsync(args).Wait();
In your example, move everything from Main to MainAsync. Then you're free to use await as much as you want. Task.Run and Parallel.For explicitly consume new threads, which is unnecessary for I/O-bound work in the async world. Use Task.WhenAll instead. The last part of your MainAsync method should end up looking something like this:
await Task.WhenAll(longList.Select(async s => {
Console.WriteLine(s);
string link = @"some link" + s + "/";
try
{
if (!links.Contains(link))
{
await Download(link);
}
}
catch (System.Exception e)
{
}
}));
There is one little wrinkle here though. Your example is throttling the parallelism at 5. If you find you still need this, TPL Dataflow is a great library for throttled parallelism in the async world. Here's a simple example.
Regarding HttpClient, using a single instance across threads is completely safe and highly encouraged.
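If you find you do need the throttling but don't want to pull in TPL Dataflow, a minimal sketch with a SemaphoreSlim around the shared HttpClient could look like this (the limit of 5 matches your original MaxDegreeOfParallelism; the helper name DownloadThrottled is only illustrative):
private static readonly HttpClient Client = new HttpClient();
private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(5); // at most 5 concurrent downloads

private static async Task DownloadThrottled(string link)
{
    await Throttle.WaitAsync();
    try
    {
        await Download(link); // your existing Download method
    }
    finally
    {
        Throttle.Release();
    }
}
In MainAsync you would then await Task.WhenAll(longList.Select(s => DownloadThrottled(@"some link" + s + "/"))); and the semaphore keeps no more than 5 downloads running at any moment.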

Best practice for task/await in a foreach loop

I have some time consuming code in a foreach that uses task/await.
It includes pulling data from the database, generating HTML, POSTing that to an API, and saving the replies to the DB.
A mock-up looks like this:
List<label> labels = db.labels.ToList();
foreach (var x in list)
{
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
List<response> res = await api.sendMessage(object); //POST
//put all the responses in the db
foreach (var r in res)
{
db.responses.add(r);
}
db.SaveChanges();
}
Time wise, generating the Html and posting it to the API seem to be taking most of the time.
Ideally it would be great if I could generate the HTML for the next item while waiting for the POST of the current item to finish, and only then post the next one.
Other ideas are also welcome.
How would one go about this?
I first thought of adding a Task above the foreach and waiting for it to finish before making the next POST, but then how do I process the last iteration of the loop... it feels messy...
You can do it in parallel, but you will need a different context in each Task.
Entity Framework is not thread safe, so you can't use one context in parallel tasks.
var tasks = myLabels.Select( async label=>{
using(var db = new MyDbContext ()){
// do processing...
var response = await api.getresponse();
db.Responses.Add(response);
await db.SaveChangesAsync();
}
});
await Task.WhenAll(tasks);
In this case, all tasks will appear to run in parallel, and each task will have its own context.
If you don't create a new context per task, you will get the error mentioned in this question: Does Entity Framework support parallel async queries?
It's more an architecture problem than a code issue here, imo.
You could split your work into two separate parts:
Get data from database and generate HTML
Send API request and save response to database
You could run them both in parallel and use a queue to coordinate them: whenever your HTML is ready it's added to the queue, and another worker picks it up from there and sends it to the API.
Both parts can be done in a multithreaded way too, e.g. you can process multiple items from the queue at the same time by having a set of workers looking for items to process in the queue.
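A minimal sketch of that queue-based split, using a bounded Channel<T> as the coordination queue (itemsToProcess, RenderHtml, PostToApi and SaveToDb are placeholders for your own list, rendering, API and database code):
// Bounded channel: the producer is suspended while the consumers fall behind.
var queue = Channel.CreateBounded<string>(capacity: 10);

// Producer: pull data from the database and generate the HTML.
var producer = Task.Run(async () =>
{
    foreach (var item in itemsToProcess)     // your "list" from the foreach
    {
        string html = RenderHtml(item);      // fast, CPU-bound work
        await queue.Writer.WriteAsync(html); // waits if the queue is full
    }
    queue.Writer.Complete();
});

// Consumers: a small set of workers that POST to the API and save the replies.
var consumers = Enumerable.Range(0, 3).Select(_ => Task.Run(async () =>
{
    await foreach (var html in queue.Reader.ReadAllAsync())
    {
        var responses = await PostToApi(html);
        await SaveToDb(responses);
    }
})).ToArray();

await Task.WhenAll(consumers.Append(producer));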
This screams for the producer/consumer pattern: a producer produces data at a speed different from the speed at which the consumer consumes it. Once the producer has nothing more to produce, it notifies the consumer that no more data is expected.
MSDN has a nice example of this pattern where several dataflow blocks are chained together: the output of one block is the input of another block.
Walkthrough: Creating a Dataflow Pipeline
The idea is as follows:
Create a class that will generate the HTML.
This class has an object of class System.Threading.Tasks.Dataflow.BufferBlock<T>
An async procedure creates all the HTML output and awaits SendAsync to hand the data to the bufferBlock
The buffer block implements interface ISourceBlock<T>. The class exposes this as a get property:
The code:
class MyProducer<T>
{
private System.Threading.Tasks.Dataflow.BufferBlock<T> bufferBlock = new BufferBlock<T>();
public ISourceBlock<T> Output { get { return this.bufferBlock; } }
public async Task ProcessAsync()
{
while (somethingToProduce)
{
T producedData = ProduceOutput(...);
await this.bufferBlock.SendAsync(producedData);
}
// no data to send anymore. Mark the output complete:
this.bufferBlock.Complete();
}
}
A second class takes this ISourceBlock. It will wait at the source block until data arrives and then process it.
do this in an async function
stop when no more data is available
The code:
public class MyConsumer<T>
{
ISourceBlock<T> Source {get; set;}
public async Task ProcessAsync()
{
while (await this.Source.OutputAvailableAsync())
{ // there is input of type T, read it:
var input = await this.Source.ReceiveAsync();
// process input
}
// if here, no more input expected. finish.
}
}
Now put it together:
private async Task ProduceOutput<T>()
{
var producer = new MyProducer<T>();
var consumer = new MyConsumer<T>() {Source = producer.Output};
var producerTask = Task.Run( () => producer.ProcessAsync());
var consumerTask = Task.Run( () => consumer.ProcessAsync());
// while both tasks are working you can do other things.
// wait until both tasks are finished:
await Task.WhenAll(new Task[] {producerTask, consumerTask});
}
For simplicity I've left out exception handling and cancellation. Stack Overflow has articles about exception handling and cancellation of tasks:
Keep UI responsive using Tasks, Handle AggregateException
Cancel an Async Task or a List of Tasks
This is what I ended up using: (https://stackoverflow.com/a/25877042/275990)
List<ToSend> sendToAPI = new List<ToSend>();
List<label> labels = db.labels.ToList();
foreach (var x in list) {
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
sendToAPI.add(the object with HTML);
}
int maxParallelPOSTs=5;
await TaskHelper.ForEachAsync(sendToAPI, maxParallelPOSTs, async i => {
using (NasContext db2 = new NasContext()) {
List<response> res = await api.sendMessage(i.object); //POST
//put all the responses in the db
foreach (var r in res)
{
db2.responses.add(r);
}
db2.SaveChanges();
}
});
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body) {
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext()) {
await body(partition.Current).ContinueWith(t => {
if (t.Exception != null) {
string problem = t.Exception.ToString();
}
//observe exceptions
});
}
}));
}
Basically this lets me generate the HTML synchronously, which is fine since it only takes a few seconds to generate thousands of items, but it lets me POST and save to the DB asynchronously, with as many parallel operations as I predefine. In this case I'm posting to the Mandrill API, and parallel posts are no problem.

Download HTML pages concurrently using the Async CTP

Attempting to write an HTML crawler using the Async CTP, I have gotten stuck on how to write a recursion-free method for accomplishing this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration, I have only 1 item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5 / .NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the link from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}
