Download HTML pages concurrently using the Async CTP - C#

Attempting to write an HTML crawler using the Async CTP, I have gotten stuck as to how to write a recursion-free method to accomplish this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();

Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
    string html = await client.DownloadStringTaskAsync(uri);
    return LinkFinder.Find(html, BaseURL);
};

Action<LinkItem> DownloadAndPush = async (o) =>
{
    List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
    if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
    {
        this._LinkStack.PushRange(result.ToArray());
        o.Processed = true;
    }
};

Parallel.ForEach(this._LinkStack, (o) =>
{
    DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration, I have only one item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this, as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?

I think the best way to do something like this using C# 5/.NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the link from it:
var cts = new CancellationTokenSource();

Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
    async link =>
    {
        // WebClient is not guaranteed to be thread-safe,
        // so we shouldn't use one shared instance
        var client = new WebClient();
        string html = await client.DownloadStringTaskAsync(link.Href);
        return LinkFinder.Find(html, link.BaseURL);
    };

var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
    downloadFromLink,
    new ExecutionDataflowBlockOptions
    { MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want; it specifies at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
    linkItem =>
    {
        links.Add(linkItem);
        if (links.Count == maxSize)
            cts.Cancel();
    });
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
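One refinement worth noting (a sketch of mine, not part of the original answer): as written, the cycle will happily re-download URLs it has already visited. LinkTo has an overload that takes a filter predicate, so you could de-duplicate with a concurrent set, assuming LinkItem.Href uniquely identifies a page:

var seen = new ConcurrentDictionary<string, bool>();
// Only links we haven't seen before re-enter the download loop.
broadcastBlock.LinkTo(linkFinderBlock, li => seen.TryAdd(li.Href, true));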
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
    linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
    if (!(ex.InnerException is TaskCanceledException))
        throw;
}

Related

Multiple HttpClients with proxies, trying to achieve maximum download speed

I need to use proxies to download a forum. The problem with my code is that it uses only 10% of my internet bandwidth. Also, I have read that I should use a single HttpClient instance, but with multiple proxies I don't know how to do that. Changing MaxDegreeOfParallelism doesn't change anything.
public static IAsyncEnumerable<IFetchResult> FetchInParallelAsync(
    this IEnumerable<Url> urls, FetchContext context)
{
    var fetchBlock = new TransformBlock<Url, IFetchResult>(
        transform: url => url.FetchAsync(context),
        dataflowBlockOptions: new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 128
        }
    );
    foreach (var url in urls)
        fetchBlock.Post(url);
    fetchBlock.Complete();
    var result = fetchBlock.ToAsyncEnumerable();
    return result;
}
Every call to FetchAsync will create or reuse an HttpClient with a WebProxy.
public static async Task<IFetchResult> FetchAsync(this Url url, FetchContext context)
{
    var httpClient = context.ProxyPool.Rent();
    var result = await url.FetchAsync(httpClient, context.Observer, context.Delay,
        context.isReloadWithCookie);
    context.ProxyPool.Return(httpClient);
    return result;
}
public HttpClient Rent()
{
    lock (_lockObject)
    {
        if (_uninitiliazedDatacenterProxiesAddresses.Count != 0)
        {
            var proxyAddress = _uninitiliazedDatacenterProxiesAddresses.Pop();
            return proxyAddress.GetWebProxy(DataCenterProxiesCredentials).GetHttpClient();
        }
        return _proxiesQueue.Dequeue();
    }
}
I am a novice at software development, but downloading through hundreds or thousands of proxies asynchronously looks like a trivial task that many people must have faced and solved correctly. So far I have been unable to find any solutions to my problem on the internet. Any thoughts on how to achieve maximum download speed?
Let's take a look at what happens here:
var result = await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);
You are actually awaiting each fetch before you continue with the next item. That's why it is asynchronous but not parallel programming. From async in the Microsoft docs:
The await keyword is where the magic happens. It yields control to the caller of the method that performed await, and it ultimately allows a UI to be responsive or a service to be elastic.
In essence, it frees the calling thread to do other work, but the original calling code is suspended until the IO operation is done.
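To make the difference concrete, here is a minimal sketch (my illustration, reusing the question's FetchAsync and FetchContext names):

// Sequential: each download is awaited before the next one starts.
var results = new List<IFetchResult>();
foreach (var url in urls)
    results.Add(await url.FetchAsync(context));

// Concurrent: start every download first, then await them all together.
var tasks = urls.Select(url => url.FetchAsync(context)).ToList();
IFetchResult[] all = await Task.WhenAll(tasks);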
Now to your problem:
You can either use this excellent solution here: foreach async.
Or you can use the Parallel library to execute your code on different threads, something like the following, for example:
Parallel.For(0, urls.Count,
    index => fetchBlock.Post(urls[index]));

Starting Multiple Async Tasks and Process Them As They Complete (C#)

So I am trying to learn how to write asynchronous methods and have been banging my head trying to get asynchronous calls to work. What always seems to happen is that the code hangs on the "await" instruction until it eventually seems to time out and crash the loading form in the same method with it.
There are two main reasons this is strange:
The code works flawlessly when not asynchronous and just a simple loop
I copied the MSDN code almost verbatim to convert the code to asynchronous calls here: https://msdn.microsoft.com/en-us/library/mt674889.aspx
I know there are a lot of questions about this on the forums already, but I have gone through most of them and tried a lot of other approaches (with the same result), and now I think something is fundamentally wrong since even the MSDN code isn't working.
Here is the main method that is called by a background worker:
// this method loads data from each individual webPage
async Task LoadSymbolData(DoWorkEventArgs _e)
{
    int MAX_THREADS = 10;
    int tskCntrTtl = dataGridView1.Rows.Count;
    Dictionary<string, string> newData_d = new Dictionary<string, string>(tskCntrTtl);

    // we need to make copies of things that can change in a different thread
    List<string> links = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
        .Select(r => r.Cells[dbIndexs_s.url].Value.ToString()).ToList());
    List<string> symbols = new List<string>(dataGridView1.Rows.Cast<DataGridViewRow>()
        .Select(r => r.Cells[dbIndexs_s.symbol].Value.ToString()).ToList());

    // we need to create a cancellation token once this is working
    // TODO

    using (LoadScreen loadScreen = new LoadScreen("Querying stock servers..."))
    {
        // we can't use the delegate because of async keywords
        this.loaderScreens.Add(loadScreen);

        // wait until the form is loaded so we don't get exceptions when writing to controls on that form
        while (!loadScreen.IsLoaded());

        // load the total number of operations so we can simplify incrementing the progress bar
        // on separate form instances
        loadScreen.LoadProgressCntr(0, tskCntrTtl);

        // try to run all async tasks since they are non-blocking threaded operations
        for (int i = 0; i < tskCntrTtl; i += MAX_THREADS)
        {
            List<Task<string[]>> ProcessURL = new List<Task<string[]>>();
            List<int> taskList = new List<int>();

            // Make a list of task indexes
            for (int task = i; task < i + MAX_THREADS && task < tskCntrTtl; task++)
                taskList.Add(task);

            // ***Create a query that, when executed, returns a collection of tasks.
            IEnumerable<Task<string[]>> downloadTasksQuery =
                from task in taskList select QueryHtml(loadScreen, links[task], symbols[task]);

            // ***Use ToList to execute the query and start the tasks.
            List<Task<string[]>> downloadTasks = downloadTasksQuery.ToList();

            // ***Add a loop to process the tasks one at a time until none remain.
            while (downloadTasks.Count > 0)
            {
                // Identify the first task that completes.
                Task<string[]> firstFinishedTask = await Task.WhenAny(downloadTasks); // <---- CODE HANGS HERE

                // ***Remove the selected task from the list so that you don't
                // process it more than once.
                downloadTasks.Remove(firstFinishedTask);

                // Await the completed task.
                string[] data = await firstFinishedTask;
                if (!newData_d.ContainsKey(data.First()))
                    newData_d.Add(data.First(), data.Last());
            }
        }

        // now we have the dictionary with all the information gathered from the websites
        // now we can add the columns if they don't already exist and load the information
        // TODO
        loadScreen.UpdateProgress(100);
        this.loaderScreens.Remove(loadScreen);
    }
}
And here is the async method for querying web pages:
async Task<string[]> QueryHtml(LoadScreen _loadScreen, string _link, string _symbol)
{
    string data = String.Empty;
    try
    {
        HttpClient client = new HttpClient();
        var doc = new HtmlAgilityPack.HtmlDocument();
        var html = await client.GetStringAsync(_link); // <---- CODE HANGS HERE
        doc.LoadHtml(html);
        string percGrn = doc.FindInnerHtml(
            "//span[contains(@class,'time_rtq_content') and contains(@class,'up_g')]//span[2]");
        string percRed = doc.FindInnerHtml(
            "//span[contains(@class,'time_rtq_content') and contains(@class,'down_r')]//span[2]");

        // create something we'll understand later
        if ((String.IsNullOrEmpty(percGrn) && String.IsNullOrEmpty(percRed)) ||
            (!String.IsNullOrEmpty(percGrn) && !String.IsNullOrEmpty(percRed)))
            throw new Exception();

        // adding string to empty gives string
        string perc = percGrn + percRed;
        bool isNegative = String.IsNullOrEmpty(percGrn);
        double percDouble;
        if (double.TryParse(Regex.Match(perc, @"\d+([.])?(\d+)?").Value, out percDouble))
            data = (isNegative ? 0 - percDouble : percDouble).ToString();
    }
    catch (Exception ex) { }
    finally
    {
        // update the progress bar...
        _loadScreen.IncProgressCntr();
    }
    return new string[] { _symbol, data };
}
I could really use some help. Thanks!
In short, when you combine async with any 'regular' blocking task code, you get a deadlock.
http://olitee.com/2015/01/c-async-await-common-deadlock-scenario/
The solution is to use ConfigureAwait:
var html = await client.GetStringAsync(_link).ConfigureAwait(false);
The reason you need this is because you didn't await on your original thread.
// ***Create a query that, when executed, returns a collection of tasks.
IEnumerable<Task<string[]>> downloadTasksQuery = from task in taskList select QueryHtml(loadScreen,links[task], symbols[task]);
What's happening here is that you mix the await paradigm with the regular task-handling paradigm, and those don't mix (or rather, you have to use ConfigureAwait(false) for this to work).
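For context, a minimal sketch of the classic deadlock shape, assuming a UI framework such as WinForms:

private void Button_Click(object sender, EventArgs e)
{
    // Blocking on .Result holds the UI thread hostage...
    string html = GetHtmlAsync().Result;
}

private async Task<string> GetHtmlAsync()
{
    using (var client = new HttpClient())
    {
        // ...and without ConfigureAwait(false) the continuation would try
        // to resume on that same blocked UI thread: a deadlock.
        return await client.GetStringAsync("http://example.com/")
            .ConfigureAwait(false);
    }
}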

Execute 4 tasks simultaneously, auto-starting another as and when each completes

I have a list of 100 urls. I need to fetch the html content of those urls. Let's say I don't use the async version of DownloadString and instead do the following.
var task1 = Task.Factory.StartNew(() => new WebClient().DownloadString("url1"));
What I want to achieve is to get the html string for at max 4 urls at a time.
I start 4 tasks for the first four urls. Assume the 2nd url completes; I want to immediately start the 5th task for the 5th url. And so on. This way at most 4 urls will be downloading at a time, and for all practical purposes there will always be 4 urls being downloaded until all 100 are processed.
I can't seem to visualize how will I actually achieve this. There must be an established pattern for doing this. Thoughts?
EDIT:
Following up on @Damien_The_Unbeliever's comment to use Parallel.ForEach, I wrote the following:
var urls = new List<string>();
var results = new Dictionary<string, string>();
var lockObj = new object();

Parallel.ForEach(urls,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    url =>
    {
        var str = new WebClient().DownloadString(url);
        lock (lockObj)
        {
            results[url] = str;
        }
    });
I think the above reads better than creating individual tasks and using a semaphore to limit concurrency. That said having never used or worked with Parallel.ForEach, I am unsure if this correctly does what I need to do.
One way to do this is with a SemaphoreSlim that caps concurrency at 4:
SemaphoreSlim sem = new SemaphoreSlim(4);

foreach (var url in urls)
{
    sem.Wait();
    Task.Factory.StartNew(() => new WebClient().DownloadString(url))
        .ContinueWith(t => sem.Release());
}
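If you then need to block until every download has finished, one option (a sketch, not part of the original answer) is to re-acquire all four permits at the end:

// Once all 4 permits can be taken again, every download has completed.
for (int i = 0; i < 4; i++)
    sem.Wait();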
Actually, Task.WaitAny is much better for what you're trying to achieve than ContinueWith:
int tasksPerformedCount = 0;
Task[] tasks = //initial 4 tasks

while (tasksPerformedCount < 100)
{
    // returns the index of the first task to complete, as soon as it completes
    int index = Task.WaitAny(tasks);
    tasksPerformedCount++;

    // replace it with a new one
    tasks[index] = //new task
}
Edit:
Another example of Task.WaitAny from http://www.amazon.co.uk/Exam-Ref-70-483-Programming-In/dp/0735676828/ref=sr_1_1?ie=UTF8&qid=1378105711&sr=8-1&keywords=exam+ref+70-483+programming+in+c
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace Chapter1
{
    public static class Program
    {
        public static void Main()
        {
            Task<int>[] tasks = new Task<int>[3];
            tasks[0] = Task.Run(() => { Thread.Sleep(2000); return 1; });
            tasks[1] = Task.Run(() => { Thread.Sleep(1000); return 2; });
            tasks[2] = Task.Run(() => { Thread.Sleep(3000); return 3; });

            while (tasks.Length > 0)
            {
                int i = Task.WaitAny(tasks);
                Task<int> completedTask = tasks[i];
                Console.WriteLine(completedTask.Result);

                var temp = tasks.ToList();
                temp.RemoveAt(i);
                tasks = temp.ToArray();
            }
        }
    }
}

Changing from Synchronous mindset to Asynchronous

I'm busy with a Windows Phone application that of course uses Silverlight. This means that calling any webservice has to be done asynchronously, and while this is all good and well as best practice for preventing your entire app from hanging while waiting for a resource, I'm still stuck in the "synchronous mindset"...
Because the way I see it now is that you end up having 2 methods that need to handle one piece of functionality, e.g.:
1) The method that actually calls the webservice:
public void myAsyncWebService(DownloadStringCompletedEventHandler callback)
{
    // Url to webservice
    string servletUrl = "https://deangrobler.com/someService/etc/etc";

    // Calls Servlet
    WebClient client = new WebClient();
    client.DownloadStringCompleted += callback;
    client.DownloadStringAsync(new Uri(servletUrl, UriKind.Absolute));
}
2) and the method that handles the data when it eventually comes back:
private void serviceReturn(object sender, DownloadStringCompletedEventArgs e)
{
    var jsonResponse = e.Result;
    // and so on and so forth...
}
So instead of just creating and calling a single method that goes to the webservice, gets the result and sends it back to me, like this:
public string mySyncWebService()
{
    // Calls the webservice
    // ...waits for return
    // And returns result
}
I have to call myAsyncWebService from a class AND create another method in the calling class that will handle the result returned by myAsyncWebService. That, in my opinion, just creates messy code. With synchronous calls you could just call one method and be done with it.
Am I just using asynchronous calls wrong? Is my understanding wrong? I need some enlightenment here; I hate writing these messy async calls. They make my code too complex, and readability just goes to... hell.
Thanks for anyone willing to shift my mind!
You have to turn your mind inside out to program asynchronously. I speak from experience. :)
Am I just using Asynchronous calls wrong? Is my understanding wrong?
No. Asynchronous code is fairly difficult to write (don't forget error handling) and extremely difficult to maintain.
This is the reason that async and await were invented.
If you're able to upgrade to VS2012, then you can use Microsoft.Bcl.Async (currently in beta) to write your code like this:
string url1 = "https://deangrobler.com/someService/etc/etc";
string jsonResponse1 = await new WebClient().DownloadStringTaskAsync(url1);
string url2 = GetUriFromJson(jsonResponse1);
string jsonResponse2 = await new WebClient().DownloadStringTaskAsync(url2);
Easy to write. Easy to maintain.
Async is like when you make a telephone call and get an answering machine: if you want a return call, you leave your number. The first method is your call asking for data; the second is the "number" you've left for the return call.
It all becomes much easier and readable if you use lambdas instead. This also enables you to access variables declared in the "parent" method, like in the following example:
private void CallWebService()
{
    // Defined outside the callback
    var someFlag = true;

    var client = new WebClient();
    client.DownloadStringCompleted += (s, e) =>
    {
        // Using lambdas, we can access variables defined outside the callback
        if (someFlag)
        {
            // Do stuff with the result.
        }
    };
    client.DownloadStringAsync(new Uri("http://www.microsoft.com/"));
}
EDIT: Here is another example with two chained service calls. It still isn't very pretty, but IMHO it is a little more readable than the OP's original code.
private void CallTwoWebServices()
{
    var client = new WebClient();
    client.DownloadStringCompleted += (s, e) =>
    {
        // 1st call completed. Now make 2nd call.
        var client2 = new WebClient();
        client2.DownloadStringCompleted += (s2, e2) =>
        {
            // Both calls completed.
        };
        client2.DownloadStringAsync(new Uri("http://www.google.com/"));
    };
    client.DownloadStringAsync(new Uri("http://www.microsoft.com/"));
}
To avoid creating messy code when you can't use the async/await pattern because you are on an older framework, you will find it helpful to check out coroutines in their Caliburn Micro implementation. With this pattern you create an enumerable, yielding at each turn a new asynchronous segment to execute: from the reader's point of view the asynchronous steps appear as a sequence, but walking through the steps (i.e. yielding the next one) is driven externally by asynchronously awaiting each task. It is a nice pattern, easy to implement and really clear to read.
BTW, if you don't want to use Caliburn Micro as your MVVM tool because you are using something else, you can use just the coroutine facility; it is well isolated inside the framework.
Let me just post some code from an example in this blog post.
public IEnumerable<IResult> Login(string username, string password)
{
    _credential.Username = username;
    _credential.Password = password;

    var result = new Result();
    var request = new GetUserSettings(username);
    yield return new ProcessQuery(request, result, "Logging In...");

    if (result.HasErrors)
    {
        yield return new ShowMessageBox("The username or password provided is incorrect.", "Access Denied");
        yield break;
    }

    var response = result.GetResponse(request);
    if (response.Permissions == null || response.Permissions.Count < 1)
    {
        yield return new ShowMessageBox("You do not have permission to access the dashboard.", "Access Denied");
        yield break;
    }

    _context.Permissions = response.Permissions;
    yield return new OpenWith<IShell, IDashboard>();
}
Isn't it easy to read? But it is actually asynchronous: each yielded step is executed in an asynchronous manner, and execution flows on past the yield statement as soon as the previous task completes.
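For reference, the contract behind each yielded step looks roughly like this (a sketch from memory of Caliburn Micro's IResult; older releases name the context type differently, so check your version):

public interface IResult
{
    // Starts the asynchronous work for this step.
    void Execute(CoroutineExecutionContext context);

    // Raised when the step finishes; the coroutine then advances
    // the enumerator to the next yielded IResult.
    event EventHandler<ResultCompletionEventArgs> Completed;
}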
With synchronous calls you could just call one method and be done with it.
Sure, but if you do that from the UI thread you will block the entire UI. That is unacceptable in any modern application, particularly in Silverlight applications running in the browser or on the phone. A phone that is unresponsive for 30 seconds while a DNS lookup times out is not something anybody wants to use.
So on the UI thread, probably because the user did some action in the UI, you start an asynchronous call. When the call completes a method is called on a background thread to handle the result of the call. This method will most likely update the UI with the result of the asynchronous call.
With the introduction of async and await in .NET 4.5, some of this "split" code can be simplified. Luckily, async and await are now available for Windows Phone 7.5 in a beta version via the NuGet package Microsoft.Bcl.Async.
Here is a small (and somewhat silly) example demonstrating how you can chain two web service calls using async. This works with .NET 4.5 but using the NuGet package linked above you should be able to do something similar on Windows Phone 7.5.
async Task<String> GetCurrencyCode() {
    using (var webClient = new WebClient()) {
        var xml = await webClient.DownloadStringTaskAsync("http://freegeoip.net/xml/");
        var xElement = XElement.Parse(xml);
        var countryName = (String)xElement.Element("CountryName");
        return await GetCurrencyCodeForCountry(countryName);
    }
}

async Task<String> GetCurrencyCodeForCountry(String countryName) {
    using (var webClient = new WebClient()) {
        var outerXml = await webClient.DownloadStringTaskAsync("http://www.webservicex.net/country.asmx/GetCurrencyByCountry?CountryName=" + countryName);
        var outerXElement = XElement.Parse(outerXml);
        var innerXml = (String)outerXElement;
        var innerXElement = XElement.Parse(innerXml);
        var currencyCode = (String)innerXElement.Element("Table").Element("CurrencyCode");
        return currencyCode;
    }
}
However, you still need to bridge between the UI thread and the async GetCurrencyCode. You can't await in an event handler but you can use Task.ContinueWith on the task returned by the async call:
void OnUserAction() {
    GetCurrencyCode().ContinueWith(GetCurrencyCodeCallback);
}

void GetCurrencyCodeCallback(Task<String> task) {
    if (!task.IsFaulted)
        Console.WriteLine(task.Result);
    else
        Console.WriteLine(task.Exception);
}

c# .net 4.5 async / multithread?

I'm writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data (the same format of data on each page).
I have it working right now with no async methods and no multithreading.
However, I need it to be faster. It only uses about 3%-6% of the CPU, I think because it spends its time waiting to download the html (WebClient.DownloadString(url)).
This is the basic flow of my program
DataSet allData;

foreach (var url in the8000urls)
{
    // ScrapeData downloads the html from the url with WebClient.DownloadString
    // and scrapes the data into several datatables which it returns as a dataset.
    DataSet dataForOnePage = ScrapeData(url);

    // merge each table in dataForOnePage into allData
}

// PushAllDataToSql(allData);
// PushAllDataToSql(alldata);
I've been trying to multithread this but am not sure how to properly get started. I'm using .NET 4.5, and my understanding is that async and await in 4.5 were made to make this much easier to program, but I'm still a little lost.
My idea was to just keep making new async threads for this line
DataSet dataForOnePage = ScrapeData(url);
and then as each one finishes, run
//merge each table in dataForOnePage into allData
Can anyone point me in the right direction on how to make that line async in .net 4.5 c# and then have my merge method run on complete?
Thank you.
Edit: Here is my ScrapeData method:
public static DataSet GetProperyData(CookieAwareWebClient webClient, string pageid)
{
    var dsPageData = new DataSet();

    // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
    string url = @"https://domain.com?&id=" + pageid + @"restofurl";
    string html = webClient.DownloadString(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
    return dsPageData;
}
If you want to use the async and await keywords (although you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task<T> instance using the async keyword, like so:
async Task<DataSet> ScrapeDataAsync(Uri url)
{
    // Create the HttpClientHandler which will handle cookies.
    var handler = new HttpClientHandler();

    // Set cookies on handler.

    // Await on an async call to fetch here, convert to a data
    // set and return.
    var client = new HttpClient(handler);

    // Wait for the HttpResponseMessage.
    HttpResponseMessage response = await client.GetAsync(url);

    // Get the content, await on the string content.
    string content = await response.Content.ReadAsStringAsync();

    // Process content variable here into a data set and return.
    DataSet ds = ...;

    // Return the DataSet, it will return Task<DataSet>.
    return ds;
}
Note that you'll probably want to move away from the WebClient class, as it doesn't support Task<T> inherently in its async operations. A better choice in .NET 4.5 is the HttpClient class. I've chosen to use HttpClient above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property which you'll use to send cookies with each request.
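As a quick illustration of that last point, a sketch of wiring up cookies (the URL and cookie values here are placeholders of mine):

var cookies = new CookieContainer();
cookies.Add(new Uri("https://domain.com/"), new Cookie("session", "abc123"));

// Cookies in the container are sent automatically with each request.
var handler = new HttpClientHandler { CookieContainer = cookies };
var client = new HttpClient(handler);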
However, this means you will more than likely have to use the await keyword to wait on another async operation, which in this case would be the download of the page. You'll have to tailor your calls that download data to use the asynchronous versions and await those.
Once that is complete, you would normally await that call, but you can't do that in this scenario because you would await on a variable. In this scenario, you are running a loop, so the variable would be reset with each iteration. In this case, it's better to just store the Task<T> instances in a list, like so:
DataSet allData = ...;
var tasks = new List<Task<DataSet>>();

foreach (var url in the8000urls)
{
    // ScrapeData downloads the html from the url with
    // WebClient.DownloadString
    // and scrapes the data into several datatables which
    // it returns as a dataset.
    tasks.Add(ScrapeDataAsync(url));
}
There is the matter of merging the data into allData. To that end, you want to call ContinueWith on each Task<T> returned and perform the work of adding the data to allData:
DataSet allData = ...;
var tasks = new List<Task>();

foreach (var url in the8000urls)
{
    // ScrapeData downloads the html from the url with
    // WebClient.DownloadString
    // and scrapes the data into several datatables which
    // it returns as a dataset.
    tasks.Add(ScrapeDataAsync(url).ContinueWith(t =>
    {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data.
        }
    }));
}
Then, you can wait on all the tasks using the WhenAll method on the Task class and await on that:
// After your loop.
await Task.WhenAll(tasks);
// Process allData
However, note that you have a foreach, and WhenAll takes an IEnumerable<T> implementation. This is a good indicator that this is a good fit for LINQ, which it is:
DataSet allData;
var tasks =
    from url in the8000Urls
    select ScrapeDataAsync(url).ContinueWith(t =>
    {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data.
        }
    });

await Task.WhenAll(tasks);
// Process allData
You can also choose not to use query syntax if you wish, it doesn't matter in this case.
Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates) then you can simply call the Wait method on the Task returned when you call WhenAll:
// This will block, waiting for all tasks to complete, all
// tasks will run asynchronously and when all are done, then the
// code will continue to execute.
Task.WhenAll(tasks).Wait();
// Process allData.
The point is: collect your Task instances into a sequence, then wait on the entire sequence before you process allData.
However, I'd suggest trying to process the data before merging it into allData if you can. Unless the data processing requires the entire DataSet, you'll get even more performance gains by processing as much of the data as you can when you get it back, as opposed to waiting for it all to come back.
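A sketch of that suggestion (my illustration; ProcessOnePage is a hypothetical per-page handler):

var tasks = the8000urls.Select(url => ScrapeDataAsync(url)).ToList();

while (tasks.Count > 0)
{
    // Handle each page's data as soon as its task finishes,
    // instead of waiting for all 8000 downloads.
    Task<DataSet> finished = await Task.WhenAny(tasks);
    tasks.Remove(finished);
    ProcessOnePage(await finished); // hypothetical per-page processing
}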
You could also use TPL Dataflow, which is a good fit for this kind of problem.
In this case, you build a "dataflow mesh" and then your data flows through it.
This one is actually more like a pipeline than a "mesh". I'm putting in three steps: Download the (string) data from the URL; Parse the (string) data into HTML and then into a DataSet; and Merge the DataSet into the master DataSet.
First, we create the blocks that will go in the mesh:
DataSet allData;

var downloadData = new TransformBlock<string, string>(
    async pageid =>
    {
        var webClient = new System.Net.WebClient();
        var url = "https://domain.com?&id=" + pageid + "restofurl";
        return await webClient.DownloadStringTaskAsync(url);
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

var parseHtml = new TransformBlock<string, DataSet>(
    html =>
    {
        var dsPageData = new DataSet();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // HTML Agility parsing
        return dsPageData;
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

var merge = new ActionBlock<DataSet>(
    dataForOnePage =>
    {
        // merge dataForOnePage into allData
    });
Then we link the three blocks together to create the mesh:
downloadData.LinkTo(parseHtml);
parseHtml.LinkTo(merge);
Next, we start pumping data into the mesh:
foreach (var pageid in the8000urls)
downloadData.Post(pageid);
And finally, we wait for each step in the mesh to complete (this will also cleanly propagate any errors):
downloadData.Complete();
await downloadData.Completion;
parseHtml.Complete();
await parseHtml.Completion;
merge.Complete();
await merge.Completion;
The nice thing about TPL Dataflow is that you can easily control how parallel each part is. For now, I've set both the download and parsing blocks to be Unbounded, but you may want to restrict them. The merge block uses the default maximum parallelism of 1, so no locks are necessary when merging.
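Since this particular mesh is a straight pipeline with no cycle back into an earlier block, an equivalent sketch can let completion propagate through the links automatically, so only the last block needs to be awaited:

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
downloadData.LinkTo(parseHtml, linkOptions);
parseHtml.LinkTo(merge, linkOptions);

// Completing the head of the pipeline now flows through to the tail.
downloadData.Complete();
await merge.Completion;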
I recommend reading my reasonably-complete introduction to async/await.
First, make everything asynchronous, starting at the lower-level stuff:
public static async Task<DataSet> ScrapeDataAsync(string pageid)
{
    CookieAwareWebClient webClient = ...;
    var dsPageData = new DataSet();

    // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
    string url = @"https://domain.com?&id=" + pageid + @"restofurl";
    string html = await webClient.DownloadStringTaskAsync(url).ConfigureAwait(false);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
    return dsPageData;
}
Then you can consume it as follows (using async with LINQ):
DataSet allData;
var tasks = the8000urls.Select(async url =>
{
    var dataForOnePage = await ScrapeDataAsync(url);
    // merge each table in dataForOnePage into allData
});
await Task.WhenAll(tasks);
PushAllDataToSql(allData);
And use AsyncContext from my AsyncEx library since this is a console app:
class Program
{
    static int Main(string[] args)
    {
        try
        {
            return AsyncContext.Run(() => MainAsync(args));
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex);
            return -1;
        }
    }

    static async Task<int> MainAsync(string[] args)
    {
        ...
    }
}
That's it. No need for locking or continuations or any of that.
I believe you don't need the async and await stuff here; that helps in desktop applications where you need to move work off the GUI thread. In my opinion, it would be better to use the Parallel.ForEach method in your case. Something like this:
DataSet allData;
var bag = new ConcurrentBag<DataSet>();

Parallel.ForEach(the8000urls, url =>
{
    // ScrapeData downloads the html from the url with WebClient.DownloadString
    // and scrapes the data into several datatables which it returns as a dataset.
    DataSet dataForOnePage = ScrapeData(url);

    // Add data for one page to temp bag
    bag.Add(dataForOnePage);
});

// merge each table in dataForOnePage into allData from bag
PushAllDataToSql(allData);
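A sketch of the merge step that the comment above leaves open (my addition; it assumes DataSet.Merge is enough to combine the per-page tables):

// Combine every per-page DataSet collected in the bag into one.
var allData = new DataSet();
foreach (var dataForOnePage in bag)
    allData.Merge(dataForOnePage);

PushAllDataToSql(allData);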
