I'm writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data (the data has the same format on each page).
I have it working right now with no async methods and no multithreading.
However, I need it to be faster. It only uses about 3%-6% of the CPU, I think because it spends most of its time waiting to download the HTML (WebClient.DownloadString(url)).
This is the basic flow of my program:
DataSet alldata;
foreach(var url in the8000urls)
{
// ScrapeData downloads the html from the url with WebClient.DownloadString
// and scrapes the data into several datatables which it returns as a dataset.
DataSet dataForOnePage = ScrapeData(url);
//merge each table in dataForOnePage into allData
}
// PushAllDataToSql(alldata);
I've been trying to multithread this but am not sure how to get started properly. I'm using .NET 4.5, and my understanding is that async and await in 4.5 are designed to make this much easier to program, but I'm still a little lost.
My idea was to just keep spawning new async tasks for this line:
DataSet dataForOnePage = ScrapeData(url);
and then as each one finishes, run
//merge each table in dataForOnePage into allData
Can anyone point me in the right direction on how to make that line async in .NET 4.5 C#, and then have my merge method run on completion?
Thank you.
Edit: Here is my ScrapeData method:
public static DataSet GetProperyData(CookieAwareWebClient webClient, string pageid)
{
var dsPageData = new DataSet();
// DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
string url = @"https://domain.com?&id=" + pageid + @"restofurl";
string html = webClient.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
// A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
return dsPageData;
}
If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task<T> instance using the async keyword, like so:
async Task<DataSet> ScrapeDataAsync(Uri url)
{
// Create the HttpClientHandler which will handle cookies.
var handler = new HttpClientHandler();
// Set cookies on handler.
// Await on an async call to fetch here, convert to a data
// set and return.
var client = new HttpClient(handler);
// Wait for the HttpResponseMessage.
HttpResponseMessage response = await client.GetAsync(url);
// Get the content, await on the string content.
string content = await response.Content.ReadAsStringAsync();
// Process content variable here into a data set and return.
DataSet ds = ...;
// Return the DataSet, it will return Task<DataSet>.
return ds;
}
Note that you'll probably want to move away from the WebClient class, as it doesn't support Task<T> inherently in its async operations. A better choice in .NET 4.5 is the HttpClient class. I've chosen to use HttpClient above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property which you'll use to send cookies with each request.
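For example, a minimal sketch of the cookie wiring (the cookie name, value, and domain here are placeholder assumptions):
var cookies = new CookieContainer();
// Hypothetical cookie; use whatever your CookieAwareWebClient was sending.
cookies.Add(new Uri("https://domain.com"), new Cookie("session", "value"));
var handler = new HttpClientHandler { CookieContainer = cookies };
var client = new HttpClient(handler);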
This means you will most likely have to await another async operation inside the method, which in this case would be the download of the page. You'll have to tailor the calls that download data to use the asynchronous versions and await those.
Once that is complete, you would normally await the call, but you can't do that here: if you awaited each call inside the loop, the downloads would run one at a time, which defeats the purpose. Instead, store each Task<T> in a list, like so:
DataSet allData = ...;
var tasks = new List<Task<DataSet>>();
foreach(var url in the8000urls)
{
// ScrapeDataAsync downloads the html from the url
// asynchronously and scrapes the data into several
// datatables which it returns as a dataset.
tasks.Add(ScrapeDataAsync(url));
}
There is the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task<T> instance returned and perform the task of adding the data to allData:
DataSet allData = ...;
var tasks = new List<Task<DataSet>>();
foreach(var url in the8000urls)
{
// ScrapeDataAsync downloads the html from the url
// asynchronously and scrapes the data into several
// datatables which it returns as a dataset.
tasks.Add(ScrapeDataAsync(url).ContinueWith(t => {
// Lock access to the data set, since this is
// async now.
lock (allData)
{
// Add the data.
}
}));
}
Then, you can wait on all the tasks using the WhenAll method on the Task class and await on that:
// After your loop.
await Task.WhenAll(tasks);
// Process allData
Note that you have a foreach, and WhenAll takes an IEnumerable<T>. That's a good indicator that LINQ is suitable here, and it is:
DataSet allData = ...;
var tasks =
from url in the8000Urls
select ScrapeDataAsync(url).ContinueWith(t => {
// Lock access to the data set, since this is
// async now.
lock (allData)
{
// Add the data.
}
});
await Task.WhenAll(tasks);
// Process allData
You can also choose not to use query syntax if you wish; it doesn't matter in this case.
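For example, the method-syntax equivalent would be:
var tasks = the8000Urls.Select(url => ScrapeDataAsync(url).ContinueWith(t =>
{
    // Lock access to the data set, since this is async now.
    lock (allData)
    {
        // Add the data.
    }
}));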
Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates) then you can simply call the Wait method on the Task returned when you call WhenAll:
// This will block, waiting for all tasks to complete, all
// tasks will run asynchronously and when all are done, then the
// code will continue to execute.
Task.WhenAll(tasks).Wait();
// Process allData.
The point is, you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.
However, I'd suggest trying to process the data before merging it into allData if you can; unless the processing requires the entire DataSet, you'll get even more performance gains by processing each piece of data as soon as it comes back, as opposed to waiting for all of it.
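For example, if each page's data can be persisted independently, you could do that work in the continuation instead of merging (PushDataForOnePageToSql is a hypothetical method here):
var tasks = the8000Urls.Select(url => ScrapeDataAsync(url).ContinueWith(t =>
{
    // Hypothetical: push each page's DataSet to SQL as soon as it arrives,
    // instead of accumulating everything into allData first.
    PushDataForOnePageToSql(t.Result);
}));
await Task.WhenAll(tasks);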
You could also use TPL Dataflow, which is a good fit for this kind of problem.
In this case, you build a "dataflow mesh" and then your data flows through it.
This one is actually more like a pipeline than a "mesh". I'm putting in three steps: Download the (string) data from the URL; Parse the (string) data into HTML and then into a DataSet; and Merge the DataSet into the master DataSet.
First, we create the blocks that will go in the mesh:
DataSet allData;
var downloadData = new TransformBlock<string, string>(
async pageid =>
{
// Create a new client per invocation; WebClient is not thread-safe.
var webClient = new System.Net.WebClient();
var url = "https://domain.com?&id=" + pageid + "restofurl";
return await webClient.DownloadStringTaskAsync(url);
},
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
});
var parseHtml = new TransformBlock<string, DataSet>(
html =>
{
var dsPageData = new DataSet();
var doc = new HtmlDocument();
doc.LoadHtml(html);
// HTML Agility parsing
return dsPageData;
},
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
});
var merge = new ActionBlock<DataSet>(
dataForOnePage =>
{
// merge dataForOnePage into allData
});
Then we link the three blocks together to create the mesh:
downloadData.LinkTo(parseHtml);
parseHtml.LinkTo(merge);
Next, we start pumping data into the mesh:
foreach (var pageid in the8000urls)
downloadData.Post(pageid);
And finally, we wait for each step in the mesh to complete (this will also cleanly propagate any errors):
downloadData.Complete();
await downloadData.Completion;
parseHtml.Complete();
await parseHtml.Completion;
merge.Complete();
await merge.Completion;
The nice thing about TPL Dataflow is that you can easily control how parallel each part is. For now, I've set both the download and parsing blocks to be Unbounded, but you may want to restrict them. The merge block uses the default maximum parallelism of 1, so no locks are necessary when merging.
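For example, to cap concurrent downloads at 10 (an arbitrary number for illustration), you would change the download block's options:
var downloadData = new TransformBlock<string, string>(
    async pageid =>
    {
        ... // same download code as above
    },
    new ExecutionDataflowBlockOptions
    {
        // At most 10 downloads in flight at a time.
        MaxDegreeOfParallelism = 10,
    });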
I recommend reading my reasonably-complete introduction to async/await.
First, make everything asynchronous, starting at the lower-level stuff:
public static async Task<DataSet> ScrapeDataAsync(string pageid)
{
CookieAwareWebClient webClient = ...;
var dsPageData = new DataSet();
// DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
string url = @"https://domain.com?&id=" + pageid + @"restofurl";
string html = await webClient.DownloadStringTaskAsync(url).ConfigureAwait(false);
var doc = new HtmlDocument();
doc.LoadHtml(html);
// A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
return dsPageData;
}
Then you can consume it as follows (using async with LINQ):
DataSet allData = new DataSet();
var tasks = the8000urls.Select(async url =>
{
var dataForOnePage = await ScrapeDataAsync(url);
//merge each table in dataForOnePage into allData
});
await Task.WhenAll(tasks);
PushAllDataToSql(allData);
And use AsyncContext from my AsyncEx library since this is a console app:
class Program
{
static int Main(string[] args)
{
try
{
return AsyncContext.Run(() => MainAsync(args));
}
catch (Exception ex)
{
Console.Error.WriteLine(ex);
return -1;
}
}
static async Task<int> MainAsync(string[] args)
{
...
}
}
That's it. No need for locking or continuations or any of that: AsyncContext.Run installs a single-threaded context, so the merge code after each await resumes on that one thread, one continuation at a time.
I believe you don't need async and await here. They help in desktop applications where you need to move work off the GUI thread. In my opinion, it would be better to use the Parallel.ForEach method in your case. Something like this:
var alldata = new DataSet();
var bag = new ConcurrentBag<DataSet>();
Parallel.ForEach(the8000urls, url =>
{
// ScrapeData downloads the html from the url with WebClient.DownloadString
// and scrapes the data into several datatables which it returns as a dataset.
DataSet dataForOnePage = ScrapeData(url);
// Add data for one page to temp bag
bag.Add(dataForOnePage);
});
// merge each DataSet in the bag into alldata
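// A minimal sketch of that merge, assuming DataSet.Merge suits your tables;
// Merge copies each table's rows into the matching table in alldata.
foreach (var dataForOnePage in bag)
{
    alldata.Merge(dataForOnePage);
}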
PushAllDataToSql(alldata);
Related
I'm making an app that shows some data collected from the web in a Windows Forms UI. Today I have to wait for all the data to download sequentially before showing it. How can I download in parallel with a limited queue (a maximum number of concurrent tasks) and refresh a DataGridView with results as they are downloaded?
what I have today is a method
internal async Task<string> RequestDataAsync(string uri)
{
var wb = new System.Net.WebClient();
var sourceAsync = wb.DownloadStringTaskAsync(uri);
string data = await sourceAsync;
return data;
}
that I call in a foreach(); after it ends, I parse the data into a list of custom objects, then convert that list to a DataTable and bind the DataGridView to it.
I'm not sure whether the best way is to use the LimitedConcurrencyLevelTaskScheduler from the example at https://msdn.microsoft.com/library/system.threading.tasks.taskscheduler.aspx (I'm not sure how it could report to the grid each time a resource is downloaded), or whether there is a better way to do this.
I don't want to start all the tasks at the same time, because sometimes I may have to request 100 downloads at once, and I'd like at most 10 tasks to execute at the same time, for example.
I know this is a question about controlling concurrent tasks and reporting progress while downloading, but I'm not sure what the best way to do that is nowadays.
I don't often recommend my book, but I think it would help you.
Concurrent asynchrony is done via Task.WhenAll (recipe 2.4 in my book):
List<string> uris = ...;
var tasks = uris.Select(uri => RequestDataAsync(uri));
string[] results = await Task.WhenAll(tasks);
To limit concurrency, use a SemaphoreSlim (recipe 11.5 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
await semaphore.WaitAsync();
try { return await RequestDataAsync(uri); }
finally { semaphore.Release(); }
});
string[] results = await Task.WhenAll(tasks);
To process data as it arrives, introduce another async method (recipe 2.6 in my book):
List<string> uris = ...;
var semaphore = new SemaphoreSlim(10);
var tasks = uris.Select(async uri =>
{
await semaphore.WaitAsync();
try { await RequestAndProcessDataAsync(uri); }
finally { semaphore.Release(); }
});
await Task.WhenAll(tasks);
async Task RequestAndProcessDataAsync(string uri)
{
var data = await RequestDataAsync(uri);
var myObject = Parse(data);
_listBoundToDataTable.Add(myObject);
}
So I am trying to learn how to write asynchronous methods and have been banging my head trying to get asynchronous calls to work. What always seems to happen is that the code hangs on an await instruction until it eventually seems to time out and crash the loading form created in the same method.
There are two main reason this is strange:
The code works flawlessly when not asynchronous and just a simple loop
I copied the MSDN code almost verbatim to convert the code to asynchronous calls here: https://msdn.microsoft.com/en-us/library/mt674889.aspx
I know there are a lot of questions already about this on the forums, but I have gone through most of them and tried a lot of other approaches (with the same result), and now I suspect something is fundamentally wrong since even the MSDN code isn't working.
Here is the main method that is called by a background worker:
// this method loads data from each individual webPage
async Task LoadSymbolData(DoWorkEventArgs _e)
{
int MAX_THREADS = 10;
int tskCntrTtl = dataGridView1.Rows.Count;
Dictionary<string, string> newData_d = new Dictionary<string, string>(tskCntrTtl);
// we need to make copies of things that can change in a different thread
List<string> links = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
.Select(r => r.Cells[dbIndexs_s.url].Value.ToString()).ToList());
List<string> symbols = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
.Select(r => r.Cells[dbIndexs_s.symbol].Value.ToString()).ToList());
// we need to create a cancellation token once this is working
// TODO
using (LoadScreen loadScreen = new LoadScreen("Querying stock servers..."))
{
// we can't use the delegate because of async keywords
this.loaderScreens.Add(loadScreen);
// wait until the form is loaded so we don't get exceptions when writing to controls on that form
while ( !loadScreen.IsLoaded() );
// load the total number of operations so we can simplify incrementing the progress bar
// on separate form instances
loadScreen.LoadProgressCntr(0, tskCntrTtl);
// try to run all async tasks since they are non-blocking threaded operations
for (int i = 0; i < tskCntrTtl; i += MAX_THREADS)
{
List<Task<string[]>> ProcessURL = new List<Task<string[]>>();
List<int> taskList = new List<int>();
// Make a list of task indexes
for (int task = i; task < i + MAX_THREADS && task < tskCntrTtl; task++)
taskList.Add(task);
// ***Create a query that, when executed, returns a collection of tasks.
IEnumerable<Task<string[]>> downloadTasksQuery =
from task in taskList select QueryHtml(loadScreen, links[task], symbols[task]);
// ***Use ToList to execute the query and start the tasks.
List<Task<string[]>> downloadTasks = downloadTasksQuery.ToList();
// ***Add a loop to process the tasks one at a time until none remain.
while (downloadTasks.Count > 0)
{
// Identify the first task that completes.
Task<string[]> firstFinishedTask = await Task.WhenAny(downloadTasks); // <---- CODE HANGS HERE
// ***Remove the selected task from the list so that you don't
// process it more than once.
downloadTasks.Remove(firstFinishedTask);
// Await the completed task.
string[] data = await firstFinishedTask;
if (!newData_d.ContainsKey(data.First()))
newData_d.Add(data.First(), data.Last());
}
}
// now we have the dictionary with all the information gathered from the websites
// now we can add the columns if they don't already exist and load the information
// TODO
loadScreen.UpdateProgress(100);
this.loaderScreens.Remove(loadScreen);
}
}
And here is the async method for querying web pages:
async Task<string[]> QueryHtml(LoadScreen _loadScreen, string _link, string _symbol)
{
string data = String.Empty;
try
{
HttpClient client = new HttpClient();
var doc = new HtmlAgilityPack.HtmlDocument();
var html = await client.GetStringAsync(_link); // <---- CODE HANGS HERE
doc.LoadHtml(html);
string percGrn = doc.FindInnerHtml(
"//span[contains(@class,'time_rtq_content') and contains(@class,'up_g')]//span[2]");
string percRed = doc.FindInnerHtml(
"//span[contains(@class,'time_rtq_content') and contains(@class,'down_r')]//span[2]");
// create something we'll understand later
if ((String.IsNullOrEmpty(percGrn) && String.IsNullOrEmpty(percRed)) ||
(!String.IsNullOrEmpty(percGrn) && !String.IsNullOrEmpty(percRed)))
throw new Exception();
// adding string to empty gives string
string perc = percGrn + percRed;
bool isNegative = String.IsNullOrEmpty(percGrn);
double percDouble;
if (double.TryParse(Regex.Match(perc, @"\d+([.])?(\d+)?").Value, out percDouble))
data = (isNegative ? 0 - percDouble : percDouble).ToString();
}
catch (Exception ex) { }
finally
{
// update the progress bar...
_loadScreen.IncProgressCntr();
}
return new string[] { _symbol, data };
}
I could really use some help. Thanks!
In short, when you combine async code with 'regular' blocking task calls, you get a deadlock:
http://olitee.com/2015/01/c-async-await-common-deadlock-scenario/
The solution is to use ConfigureAwait:
var html = await client.GetStringAsync(_link).ConfigureAwait(false);
The reason you need this is because you didn't await your original call.
// ***Create a query that, when executed, returns a collection of tasks.
IEnumerable<Task<string[]>> downloadTasksQuery = from task in taskList select QueryHtml(loadScreen,links[task], symbols[task]);
What's happening here is that you mix the await paradigm with the regular task-handling paradigm, and those don't mix (or rather, you have to use ConfigureAwait(false) for this to work).
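A minimal sketch of that deadlock, assuming the caller runs on a UI thread or another single-threaded synchronization context:
// The calling thread blocks here, holding the synchronization context.
string html = GetHtmlAsync().Result;

async Task<string> GetHtmlAsync()
{
    var client = new HttpClient();
    // Without ConfigureAwait(false), this continuation is posted back to the
    // blocked context and can never run, so .Result never completes.
    return await client.GetStringAsync("https://example.com");
}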
I have some time-consuming code in a foreach that uses task/await.
It includes pulling data from the database, generating HTML, POSTing that to an API, and saving the replies to the DB.
A mock-up looks like this
List<label> labels = db.labels.ToList();
foreach (var x in list)
{
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
List<response> res = await api.sendMessage(object); //POST
//put all the responses in the db
foreach (var r in res)
{
db.responses.Add(r);
}
db.SaveChanges();
}
Time wise, generating the Html and posting it to the API seem to be taking most of the time.
Ideally it would be great if I could generate the HTML for the next item, and wait for the post to finish, before posting the next item.
Other ideas are also welcome.
How would one go about this?
I first thought of adding a Task above the foreach and wait for that to finish before making the next POST, but then how do I process the last loop... it feels messy...
You can do it in parallel, but you will need a different context in each task.
Entity Framework is not thread-safe, so you can't use one context across parallel tasks.
var tasks = myLabels.Select(async label => {
using (var db = new MyDbContext()) {
// do processing...
var response = await api.getresponse();
db.Responses.Add(response);
await db.SaveChangesAsync();
}
});
await Task.WhenAll(tasks);
In this case, all tasks will appear to run in parallel, and each task will have its own context.
If you don't create a new context per task, you will get the error mentioned in this question: Does Entity Framework support parallel async queries?
It's more an architecture problem than a code issue here, imo.
You could split your work into two separate parts:
Get data from database and generate HTML
Send API request and save response to database
You could run them both in parallel and use a queue to coordinate: whenever your HTML is ready, it's added to the queue, and another worker proceeds from there, taking that HTML and sending it to the API.
Both parts can be done in multithreaded way too, e.g. you can process multiple items from the queue at the same time by having a set of workers looking for items to be processed in the queue.
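A minimal sketch of that coordination with BlockingCollection<T> (RenderHtml and PostToApiAsync are hypothetical stand-ins for the two parts):
var queue = new BlockingCollection<string>(boundedCapacity: 100);

// Producer: pull from the database and render HTML into the queue.
var producer = Task.Run(() =>
{
    foreach (var x in list)
        queue.Add(RenderHtml(x)); // hypothetical part 1
    queue.CompleteAdding();       // tell the consumer nothing more is coming
});

// Consumer: take rendered HTML from the queue and send it to the API.
var consumer = Task.Run(async () =>
{
    foreach (var html in queue.GetConsumingEnumerable())
        await PostToApiAsync(html); // hypothetical part 2
});

await Task.WhenAll(producer, consumer);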
This screams for the producer / consumer pattern: one producer produces data at a different speed than the consumer consumes it. Once the producer does not have anything to produce anymore, it notifies the consumer that no more data is expected.
MSDN has a nice example of this pattern where several dataflow blocks are chained together: the output of one block is the input of another block.
Walkthrough: Creating a Dataflow Pipeline
The idea is as follows:
Create a class that will generate the HTML.
This class has an object of class System.Threading.Tasks.Dataflow.BufferBlock<T>
An async method creates all the HTML output and awaits SendAsync to post the data to the BufferBlock.
The buffer block implements interface ISourceBlock<T>. The class exposes this as a get property:
The code:
class MyProducer<T>
{
private System.Threading.Tasks.Dataflow.BufferBlock<T> bufferBlock = new BufferBlock<T>();
public ISourceBlock<T> Output { get { return this.bufferBlock; } }
public async Task ProcessAsync()
{
while (somethingToProduce)
{
T producedData = ProduceOutput(...);
await this.bufferBlock.SendAsync(producedData);
}
// no data to send anymore. Mark the output complete:
this.bufferBlock.Complete();
}
}
}
A second class takes this ISourceBlock. It will wait at this source block until data arrives and processes it.
do this in an async function
stop when no more data is available
The code:
public class MyConsumer<T>
{
public ISourceBlock<T> Source { get; set; }
public async Task ProcessAsync()
{
while (await this.Source.OutputAvailableAsync())
{ // there is input of type T, read it:
var input = await this.Source.ReceiveAsync();
// process input
}
// if here, no more input expected. finish.
}
}
Now put it together:
private async Task ProduceOutput<T>()
{
var producer = new MyProducer<T>();
var consumer = new MyConsumer<T>() {Source = producer.Output};
var producerTask = Task.Run( () => producer.ProcessAsync());
var consumerTask = Task.Run( () => consumer.ProcessAsync());
// while both tasks are working you can do other things.
// wait until both tasks are finished:
await Task.WhenAll(new Task[] {producerTask, consumerTask});
}
For simplicity I've left out exception handling and cancellation. Stack Overflow has articles about exception handling and cancellation of tasks:
Keep UI responsive using Tasks, Handle AggregateException
Cancel an Async Task or a List of Tasks
This is what I ended up using: (https://stackoverflow.com/a/25877042/275990)
List<ToSend> sendToAPI = new List<ToSend>();
List<label> labels = db.labels.ToList();
foreach (var x in list) {
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
.Select(y => y.ID)
.Contains(q.id));
//Render the HTML
//do some fast stuff with objects
sendToAPI.Add(the object with HTML);
}
int maxParallelPOSTs=5;
await TaskHelper.ForEachAsync(sendToAPI, maxParallelPOSTs, async i => {
using (NasContext db2 = new NasContext()) {
List<response> res = await api.sendMessage(i.object); //POST
//put all the responses in the db
foreach (var r in res)
{
db2.responses.Add(r);
}
db2.SaveChanges();
}
});
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body) {
// Partition the source into dop partitions; each partition is consumed by
// one task, so at most dop invocations of body run at the same time.
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext()) {
await body(partition.Current).ContinueWith(t => {
if (t.Exception != null) {
string problem = t.Exception.ToString();
}
//observe exceptions
});
}
}));
}
Basically this lets me generate the HTML synchronously, which is fine since it only takes a few seconds to generate thousands, but lets me post and save to the DB asynchronously, with as many parallel requests as I predefine. In this case I'm posting to the Mandrill API, and parallel posts are no problem.
I have a method that uploads one file to a server. Right now it is working, aside from any bad coding in the method (I'm new to the Task library).
Here is the code that uploads a file to the server:
private async void UploadDocument()
{
var someTask = await Task.Run<bool>(() =>
{
// open input stream
using (System.IO.FileStream stream = new System.IO.FileStream(_cloudDocuments[0].FullName, System.IO.FileMode.Open, System.IO.FileAccess.Read))
{
using (StreamWithProgress uploadStreamWithProgress = new StreamWithProgress(stream))
{
uploadStreamWithProgress.ProgressChanged += uploadStreamWithProgress_ProgressChanged;
// start service client
SiiaSoft.Data.FieTransferWCF ws = new Data.FieTransferWCF();
// upload file
ws.UploadFile(_cloudDocuments[0].FileName, (long)_cloudDocuments[0].Size, uploadStreamWithProgress);
// close service client
ws.Close();
}
}
return true;
});
}
Then I have a ListBox where I can drag & drop multiple files, so what I want to do is loop over the ListBox files and call UploadDocument(); but I want to upload the first file in the ListBox, then when it completes continue with the second file, and so on...
Any clue on the best way to do it?
Thanks a lot.
You should make your UploadDocument return Task. Then you can await the task in a loop. For example:
private async Task UploadAllDocuments()
{
string[] documents = ...; // Fetch the document names
foreach (string document in documents)
{
await UploadDocument(document);
}
}
private async Task UploadDocument(string document)
{
// Code as before, but use document instead of _cloudDocuments[0]
}
In fact, your UploadDocument can be made simpler anyway:
private Task UploadDocument()
{
return Task.Run<bool>(() =>
{
// Code as before
});
}
Wrapping that in an async method isn't particularly useful.
(You may well want to change the type to not be string - it's not clear what _cloudDocuments is.)
In general, you should always make an async method return Task or Task<T> unless you have to make it return void to comply with an event-handling pattern.
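For example, a hypothetical button handler could be the single async void entry point, awaiting the Task-returning method above:
private async void uploadButton_Click(object sender, EventArgs e)
{
    // Event handlers must return void, so this is the one place async void is fine.
    await UploadAllDocuments();
}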
Attempting to write an HTML crawler using the Async CTP, I have gotten stuck on how to write a recursion-free method to accomplish this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();
Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
string html = await client.DownloadStringTaskAsync(uri);
return LinkFinder.Find(html, BaseURL);
};
Action<LinkItem> DownloadAndPush = async (o) =>
{
List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
{
this._LinkStack.PushRange(result.ToArray());
o.Processed = true;
}
};
Parallel.ForEach(this._LinkStack, (o) =>
{
DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration, I have only one item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the links from it:
var cts = new CancellationTokenSource();
Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
async link =>
{
// WebClient is not guaranteed to be thread-safe,
// so we shouldn't use one shared instance
var client = new WebClient();
string html = await client.DownloadStringTaskAsync(link.Href);
return LinkFinder.Find(html, link.BaseURL);
};
var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
downloadFromLink,
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It controls how many URLs at most can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
linkItem =>
{
links.Add(linkItem);
if (links.Count == maxSize)
cts.Cancel();
});
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
if (!(ex.InnerException is TaskCanceledException))
throw;
}