Better way to download HTML content from multiple pages [C#]

I'm writing a web scraping program. Each page contains 10 links, and for every link I need to download the HTML text and scrape data from it, then move on to the next page and repeat the whole process. Done synchronously, a single link takes 5-10 seconds to download (the site is slow even when I open it in a browser). So I looked for an asynchronous way to implement this, and with the code below all 10 links on a page take 5-10 seconds in total. I have to loop through 100 pages, and the whole run takes about 30 minutes.
I don't have much experience with Tasks in C#, so I put together this code. It works, but I'm not sure whether it's good or whether there is a better solution.
class Program
{
    public static List<Task<string>> TaskList = new List<Task<string>>();
    public static List<Data> webData = new List<Data>();

    public static async Task<string> GetHtmlText(string link)
    {
        using (HttpClient client = new HttpClient())
        {
            return await client.GetStringAsync(link);
        }
    }

    public static void Main(string[] args)
    {
        for (int i = 0; i < 100; i++)
        {
            List<string> links = GetLinksFromPage(i); // returns 10 links from page // replaced with edit solution >>>
            foreach (var link in links)
            {
                Task<string> task = Task.Run(() => GetHtmlText(link));
                TaskList.Add(task);
            }
            Task.WaitAll(TaskList.ToArray()); // replaced with edit solution <<<
            foreach (Task<string> task in TaskList)
            {
                string html = task.Result;
                Data data = GetDataFromHtml(html);
                webData.Add(data);
            }
            ...
        }
    }
}
EDIT:
This made my day: setting ServicePointManager.DefaultConnectionLimit to 50.
ServicePointManager.DefaultConnectionLimit = 50;
var concurrentBag = new ConcurrentBag<string>();
var t = linksFromPage.Select(async link =>
{
    var response = await GetLinkStringTaskAsync(link);
    concurrentBag.Add(response);
});
await Task.WhenAll(t);
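For reference, here is a minimal sketch of how the pieces above might fit together: one shared HttpClient, the connection limit raised before the first request, and one Task.WhenAll per page. GetLinksFromPage, GetDataFromHtml and Data are the question's own placeholders; this is an illustration, not a tested implementation.
// Sketch only: shared HttpClient + Task.WhenAll per page, assuming the question's helpers exist.
class Scraper
{
    private static readonly HttpClient Client = new HttpClient(); // one client for all requests

    public static async Task<List<Data>> RunAsync()
    {
        ServicePointManager.DefaultConnectionLimit = 50; // raise before the first request goes out

        var webData = new List<Data>();
        for (int i = 0; i < 100; i++)
        {
            List<string> links = GetLinksFromPage(i);

            // Start all downloads for this page, then wait for them together.
            string[] htmlPages = await Task.WhenAll(links.Select(link => Client.GetStringAsync(link)));

            webData.AddRange(htmlPages.Select(GetDataFromHtml));
        }
        return webData;
    }
}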

Related

How to achieve parallelism and asynchrony in Web crawling using BFS

This question is going to be quite long but I want to explain my code and thought process as thoroughly as possible, so here goes...
I am coding a web crawler in C# which is supposed to search through Wikipedia from a given source link and find a way to a destination link. For example, you can give it a toaster Wiki page link and a pancake Wiki page link, and it should output a route that takes you from toaster to pancake. In other words, I want to find the shortest path between two Wiki articles.
I think I have coded that up correctly. I created two classes: one is called CrawlerPage, and here is its body:
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class CrawlerPage
    {
        public string mainLink;
        private List<CrawlerPage> _pages = new();

        public CrawlerPage(string mainLink)
        {
            this.mainLink = mainLink;
        }

        public async Task<List<CrawlerPage>> GetPages()
        {
            var pagesLinks = await Task.Run(() => GetPages(this));
            foreach (var page in pagesLinks)
            {
                _pages.Add(new CrawlerPage(page));
            }
            return _pages;
        }

        private HashSet<string> GetPages(CrawlerPage page)
        {
            string result = "";
            using (HttpClient client = new HttpClient())
            {
                using (HttpResponseMessage response = client.GetAsync(page.mainLink).Result)
                {
                    using (HttpContent content = response.Content)
                    {
                        result = content.ReadAsStringAsync().Result;
                    }
                }
            }
            var wikiLinksList = ParseLinks(result)
                .Where(x => x.Contains("/wiki/") && !x.Contains("https://") && !x.Contains(".jpg") &&
                            !x.Contains(".png"))
                .AsParallel()
                .ToList();
            var wikiLinksHashSet = new HashSet<string>();
            foreach (var wikiLink in wikiLinksList)
            {
                wikiLinksHashSet.Add("https://en.wikipedia.org" + wikiLink);
            }

            HashSet<string> ParseLinks(string html)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
                return nodes == null ? new HashSet<string>() : nodes.AsParallel().ToList().ConvertAll(
                    r => r.Attributes.AsParallel().ToList().ConvertAll(
                        i => i.Value)).SelectMany(j => j).AsParallel().ToHashSet();
            }

            return wikiLinksHashSet;
        }
    }
}
The class above represents a Wiki article. It holds its own link (the mainLink field) and a list of every other page linked from it (the _pages field). The GetPages() methods read the page's HTML and parse it into a HashSet containing only the links I'm interested in (links to other articles), so any other junk links are discarded.
The second class is Crawler, which performs BFS (breadth-first search). Code below:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class Crawler
    {
        private int _maxDepth;
        private int _currDepth;

        public Crawler(int maxDepth)
        {
            _currDepth = 0;
            _maxDepth = maxDepth;
        }

        public async Task CrawlParallelAsync(string sourceLink, string destinationLink)
        {
            var sourcePage = new CrawlerPage(sourceLink);
            var destinationPage = new CrawlerPage(destinationLink);
            var visited = new HashSet<string>();
            Queue<CrawlerPage> queue = new();
            queue.Enqueue(sourcePage);
            while (queue.Count > 0)
            {
                var currPage = queue.Dequeue();
                Console.WriteLine(currPage.mainLink);
                var currPageSubpages = await Task.Run(() => currPage.GetPages());
                if (currPage.mainLink == destinationPage.mainLink || _currDepth == _maxDepth)
                {
                    visited.Add(currPage.mainLink);
                    break;
                }
                if (visited.Contains(currPage.mainLink))
                    continue;
                visited.Add(currPage.mainLink);
                foreach (var page in currPageSubpages)
                {
                    if (!visited.Contains(page.mainLink))
                    {
                        queue.Enqueue(page);
                    }
                }
            }
            foreach (var visitedPage in visited)
            {
                Console.WriteLine(visitedPage);
            }
        }
    }
}
Note that I am not incrementing _currDepth yet. The idea is that if the search goes too deep, it should stop because the route would be too long.
The class above works as follows: it enqueues the page with sourceLink and performs standard BFS: it dequeues a page, checks whether it has been visited, checks whether it is the destination page, and then gets every subpage of that page (using currPage.GetPages()) and adds them to the queue. I believe the algorithm itself works, but it is so sluggish that it is of no practical use.
My conclusion: it absolutely needs to be done asynchronously and in parallel to be efficient. I have tried with Tasks, as you can tell, but that doesn't improve the performance at all. My intuition tells me that every time we read the subpages of a page, we should do that asynchronously and in parallel, and every time we start crawling a page, we should do that asynchronously and in parallel as well. I have no idea how to achieve that. Do I need to completely refactor my code? Should I create a new crawler every time I enqueue a subpage?
I'm lost, can you help me figure it out?
You could consider using the new (.NET 6) API Parallel.ForEachAsync. This method accepts an enumerable sequence, and invokes an asynchronous delegate for each element in the sequence, with a specific degree of parallelism. One overload of this method is particularly interesting, because it accepts an IAsyncEnumerable<T> as input, which is essentially an asynchronous stream of data. You could create such a stream dynamically with an iterator method (a method that yields), but it is probably easier to use a Channel<T> that exposes its contents as IAsyncEnumerable<T>. Here is a rough demonstration of this idea:
var channel = Channel.CreateUnbounded<CrawlerPage>();
channel.Writer.TryWrite(new CrawlerPage(sourceLink));
var cts = new CancellationTokenSource();
var options = new ParallelOptions()
{
    MaxDegreeOfParallelism = 10,
    CancellationToken = cts.Token
};
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
    CrawlerPage[] subpages = await GetPagesAsync(page);
    foreach (var subpage in subpages) channel.Writer.TryWrite(subpage);
});
The parallel loop will continue crunching pages until the channel.Writer.Complete() method is called and then all remaining pages in the channel are consumed, or until the CancellationTokenSource is canceled.
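As an illustration of those stopping conditions (not part of the original snippet), the delegate could cancel the CancellationTokenSource once the destination link turns up, assuming destinationLink is in scope:
// Hypothetical termination logic inside the ForEachAsync delegate.
foreach (var subpage in subpages)
{
    if (subpage.mainLink == destinationLink)
    {
        cts.Cancel(); // the loop stops and ForEachAsync throws OperationCanceledException
        break;
    }
    channel.Writer.TryWrite(subpage);
}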
Calling client.GetAsync(page.mainLink).Result makes your code wait synchronously. Use await client.GetAsync(page.mainLink) instead. Once you do that, you should not need Task.Run; Task.Run is for having synchronous work executed asynchronously.
If you want parallelism, you can await several tasks at once using Task.WhenAll.
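A rough sketch of what that advice could look like applied to the question's GetPages method. GetPagesAsync is an assumed name, ParseLinks stands for the question's own parsing logic, and the shared client is a suggestion rather than part of the original code:
// Sketch only: fully asynchronous page fetch, no .Result and no Task.Run.
private static readonly HttpClient client = new HttpClient();

private async Task<HashSet<string>> GetPagesAsync(CrawlerPage page)
{
    string result = await client.GetStringAsync(page.mainLink); // await instead of .Result
    return ParseLinks(result)
        .Where(x => x.Contains("/wiki/") && !x.Contains("https://") &&
                    !x.Contains(".jpg") && !x.Contains(".png"))
        .Select(link => "https://en.wikipedia.org" + link)
        .ToHashSet();
}

// Fetching several subpages concurrently with Task.WhenAll:
// HashSet<string>[] all = await Task.WhenAll(pages.Select(p => GetPagesAsync(p)));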

Asynchronous parallel web requests in ASP.NET MVC(C#)

I have a mini-project that requires downloading the HTML documents of multiple websites using C#, as fast as possible. In my scenario I might need to switch IP using proxies based on certain conditions. I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
Here's the code I have so far.
public class HTMLDownloader
{
    public static List<string> URL_LIST = new List<string>();
    public static List<string> HTML_DOCUMENTS = new List<string>();

    public static void Main()
    {
        for (var i = 0; i < URL_LIST.Count; i++)
        {
            var html = Task.Run(() => Run(URL_LIST[i]));
            HTML_DOCUMENTS.Add(html.Result);
        }
    }

    public static async Task<string> Run(string url)
    {
        var client = new WebClient();
        // Handle proxy credentials: client.Proxy = new WebProxy();
        string html = "";
        try
        {
            html = await client.DownloadStringTaskAsync(new Uri(url));
            //if (condition == true)
            //{
            //    return html;
            //}
            //else
            //{
            //    Switch IP and try again
            //}
        }
        catch (Exception e)
        {
        }
        return html;
    }
}
The problem here is that I'm not really taking advantage of sending multiple web requests because each request has to finish in order for the next one to begin. Is there a better approach to this? For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
Thanks
I want to take advantage of C# Asynchronous Tasks to make it execute as many requests as possible in order for the whole process to be fast and efficient.
You can use Task.WhenAll to get asynchronous concurrency.
For example, send 10 web requests at a time and then send a new request when one of those requests is finished.
To throttle asynchronous concurrency, use SemaphoreSlim:
public static async Task Main()
{
    using var limit = new SemaphoreSlim(10); // 10 at a time
    var tasks = URL_LIST.Select(Process).ToList();
    var results = await Task.WhenAll(tasks);
    HTML_DOCUMENTS.AddRange(results);

    async Task<string> Process(string url)
    {
        await limit.WaitAsync();
        try { return await Run(url); }
        finally { limit.Release(); }
    }
}
One way is to use Task.WhenAll.
Creates a task that will complete when all of the supplied tasks have
completed.
The premise is: Select all the tasks into a list, await the list of tasks with Task.WhenAll, then Select the results.
public static async Task Main()
{
    var tasks = URL_LIST.Select(Run).ToList(); // materialize so each task is created only once
    await Task.WhenAll(tasks);
    var results = tasks.Select(x => x.Result);
}
Note: the result of awaiting WhenAll is the collection of results as well.
First change your Main to be async.
Then you can use LINQ Select to run the Tasks in parallel.
public static async Task Main()
{
    var tasks = URL_LIST.Select(Run);
    string[] documents = await Task.WhenAll(tasks);
    HTML_DOCUMENTS.AddRange(documents);
}
Task.WhenAll will unwrap the Task results into an array, once all the tasks are complete.

Parallel.For and httpclient crash the application C#

I want to avoid the application crashing problem caused by the parallel for loop and HttpClient, but I am unable to apply the solutions provided elsewhere on the web due to my limited programming knowledge. My code is pasted below.
class Program
{
    public static List<string> words = new List<string>();
    public static int count = 0;
    public static string output = "";
    private static HttpClient Client = new HttpClient();

    public static void Main(string[] args)
    {
        //input path strings...
        List<string> links = new List<string>();
        links.AddRange(File.ReadAllLines(input));
        List<string> longList = new List<string>(File.ReadAllLines(@"a.txt"));
        words.AddRange(File.ReadAllLines(output1));
        System.Net.ServicePointManager.DefaultConnectionLimit = 8;
        count = longList.Count;
        //for (int i = 0; i < longList.Count; i++)
        Task.Run(() => Parallel.For(0, longList.Count, new ParallelOptions { MaxDegreeOfParallelism = 5 }, (i, loopState) =>
        {
            Console.WriteLine(i);
            string link = @"some link" + longList[i] + "/";
            try
            {
                if (!links.Contains(link))
                {
                    Task.Run(async () => { await Download(link); }).Wait();
                }
            }
            catch (System.Exception e)
            {
            }
        }));
        //}
    }

    public static async Task Download(string link)
    {
        HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
        document.LoadHtml(await getURL(link));
        //...stuff with html agility pack
    }

    public static async Task<string> getURL(string link)
    {
        string result = "";
        HttpResponseMessage response = await Client.GetAsync(link);
        Console.WriteLine(response.StatusCode);
        if (response.IsSuccessStatusCode)
        {
            HttpContent content = response.Content;
            var bytes = await response.Content.ReadAsByteArrayAsync();
            result = Encoding.UTF8.GetString(bytes);
        }
        return result;
    }
}
There are solutions, for example this one, but I don't know how to put the await keyword in my Main method, and currently the program simply exits because there is no await before Task.Run(). As you can see, I have already applied a workaround of wrapping the async Download() method so it can be called from Main.
I also have doubts about using the same instance of HttpClient from different parallel threads. Please advise whether I should create a new instance of HttpClient each time.
You're right that you have to block on a task somewhere in a console application, otherwise the program will just exit before it's complete. But you're doing this more than you need to. Aim to block only the main thread and delegate the rest to an async method. A good practice is to create a method with a signature like private async Task MainAsync(args), put the "guts" of your program logic there, and call it from Main like this:
MainAsync(args).Wait();
In your example, move everything from Main to MainAsync. Then you're free to use await as much as you want. Task.Run and Parallel.For explicitly consume new threads for I/O-bound work, which is unnecessary in the async world. Use Task.WhenAll instead. The last part of your MainAsync method should end up looking something like this:
await Task.WhenAll(longList.Select(async s =>
{
    Console.WriteLine(s);
    string link = @"some link" + s + "/";
    try
    {
        if (!links.Contains(link))
        {
            await Download(link);
        }
    }
    catch (System.Exception e)
    {
    }
}));
There is one little wrinkle here though. Your example is throttling the parallelism at 5. If you find you still need this, TPL Dataflow is a great library for throttled parallelism in the async world. Here's a simple example.
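A rough sketch of what that could look like with an ActionBlock from TPL Dataflow; this is an illustration using the question's own names (links, longList, Download), not the answer's original example, and the limit of 5 mirrors the question's MaxDegreeOfParallelism:
// Sketch: throttled async work with TPL Dataflow.
// Requires the System.Threading.Tasks.Dataflow NuGet package.
using System.Threading.Tasks.Dataflow;

var downloadBlock = new ActionBlock<string>(async link =>
{
    if (!links.Contains(link))
    {
        await Download(link); // the question's own Download method
    }
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

foreach (var s in longList)
{
    downloadBlock.Post(@"some link" + s + "/");
}

downloadBlock.Complete();       // no more items will be posted
await downloadBlock.Completion; // wait for all posted items to finish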
Regarding HttpClient, using a single instance across threads is completely safe and highly encouraged.

Can this c# async process be more performant?

I am working on a program which makes multiple JSON calls to retrieve its data.
The data, however, is pretty big, and when running it without async it takes 17 hours to fully process.
The fetching of the data goes as follows:
Call to a service with a page number (2000 pages in total to be processed), which returns 200 records per page.
For each record it returns, another service needs to be called to retrieve the data for that record.
I'm new to the whole async functionality. I've made an attempt using async and await and already got a performance boost, but I was wondering whether this is the correct way of using it and whether there are any other ways to increase performance.
This is the code I currently have:
static void Main(string[] args)
{
    MainAsyncCall().Wait();
    Console.ReadKey();
}

public static async Task MainAsyncCall()
{
    ServicePointManager.DefaultConnectionLimit = 999999;
    List<Task> allPages = new List<Task>();
    for (int i = 0; i <= 10; i++)
    {
        var page = i;
        allPages.Add(Task.Factory.StartNew(() => processPage(page)));
    }
    Task.WaitAll(allPages.ToArray());
    Console.WriteLine("Finished all pages");
}

public static async Task processPage(Int32 page)
{
    List<Task> players = new List<Task>();
    using (var client = new HttpClient())
    {
        string url = "<Request URL>";
        var response = client.GetAsync(url).Result;
        var content = response.Content.ReadAsStringAsync().Result;
        dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
        dynamic data = item.data;
        var localPage = page;
        Console.WriteLine($"Processing Page: {localPage}");
        foreach (dynamic d in data)
        {
            players.Add(Task.Factory.StartNew(() => processPlayer(d, localPage)));
        }
    }
    Task.WaitAll(players.ToArray());
    Console.WriteLine($"Finished Page: {page}");
}

public static async Task processPlayer(dynamic player, int page)
{
    using (var client = new HttpClient())
    {
        string url = "<Request URL>";
        HttpResponseMessage response = null;
        response = client.GetAsync(url).Result;
        var content = await response.Content.ReadAsStringAsync();
        dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
        Console.WriteLine($"{page}: Processed {item.name}");
    }
}
Any suggestion is welcome!
This is what it should look like to me:
static void Main(string[] args)
{
    // it's okay here to use Wait because we're at the root of the application
    new AsyncServerCalls().MainAsyncCall().Wait();
    Console.ReadKey();
}

public class AsyncServerCalls
{
    // don't use static async methods
    public async Task MainAsyncCall()
    {
        ServicePointManager.DefaultConnectionLimit = 999999;
        List<Task> allPages = new List<Task>();
        for (int i = 0; i <= 10; i++)
        {
            var page = i;
            allPages.Add(processPage(page));
        }
        await Task.WhenAll(allPages.ToArray());
        Console.WriteLine("Finished all pages");
    }

    public async Task processPage(Int32 page)
    {
        List<Task> players = new List<Task>();
        using (var client = new HttpClient())
        {
            string url = "<Request URL>";
            var response = await client.GetAsync(url); // nope, not .Result
            var content = await response.Content.ReadAsStringAsync(); // again, never use .Result
            dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
            dynamic data = item.data;
            var localPage = page;
            Console.WriteLine($"Processing Page: {localPage}");
            foreach (dynamic d in data)
            {
                players.Add(processPlayer(d, localPage)); // no need to put the task unnecessarily on a different thread, let the current SynchronizationContext deal with that
            }
        }
        await Task.WhenAll(players.ToArray()); // always await a task in an async method
        Console.WriteLine($"Finished Page: {page}");
    }

    public async Task processPlayer(dynamic player, int page)
    {
        using (var client = new HttpClient())
        {
            string url = "<Request URL>";
            HttpResponseMessage response = null;
            response = await client.GetAsync(url); // don't use .Result
            var content = await response.Content.ReadAsStringAsync();
            dynamic item = Newtonsoft.Json.JsonConvert.DeserializeObject(content);
            Console.WriteLine($"{page}: Processed {item.name}");
        }
    }
}
So basically the point here is to make sure you let the SynchronizationContext do its job. Inside a console program it should use TaskScheduler.Default, which is a thread pool scheduler. You can always force this by doing:
static void Main(string[] args)
{
    Task.Run(() => new AsyncServerCalls().MainAsyncCall()).Wait();
    Console.ReadKey();
}
Reference to Task.Run forcing Default
One thing you need to remember, which I got into trouble with last week, is that you can fire-hose the thread pool, i.e. spawn so many tasks that your process just dies with insane CPU and memory usage. So you may need to use a semaphore to limit the number of tasks that are going to be created.
I created a solution that processes a single file in multiple parts at the same time (Parallel Read); it is still being worked on, but it shows the use of the async pieces.
Just to clarify the parallelism.
When you take a reference to all those tasks:
allPages.Add(processPage(page));
They will all be started.
When you do:
await Task.WhenAll(allPages);
This will block the current method execution until all those page processes have been executed (it won't block the current thread though, don't get these confused)
Danger Zone
If you don't want to block method execution on Task.WhenAll for each page, so that all page processes run in parallel across pages, you can add each page's Task to an overall List<Task> instead and await them all at the end.
However, the danger with this is the fire-hosing... You are going to have to limit the number of concurrent tasks at some point; where exactly is up to you, but just remember that it has to happen somewhere.
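A hedged sketch of that limiting idea applied to the code above; processPlayerThrottled is a made-up wrapper name and the limit of 20 is arbitrary:
// Sketch: cap the number of concurrent player requests with a SemaphoreSlim.
private static readonly SemaphoreSlim _playerLimit = new SemaphoreSlim(20);

public async Task processPlayerThrottled(dynamic player, int page)
{
    await _playerLimit.WaitAsync();        // wait for a free slot
    try
    {
        await processPlayer(player, page); // the answer's existing method
    }
    finally
    {
        _playerLimit.Release();            // free the slot even if the request fails
    }
}

// In processPage: players.Add(processPlayerThrottled(d, localPage));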

Download multiple files at once, no explicit controls

I wish to download around 100,000 files from a web site. The answers to this question have the same issues as what I tried.
I have tried two approaches, both of which use highly erratic amounts of bandwidth:
The first attempts to synchronously download the files:
ParallelOptions a = new ParallelOptions();
a.MaxDegreeOfParallelism = 30;
ServicePointManager.DefaultConnectionLimit = 10000;
Parallel.For(start, end, a, i =>
{
    using (var client = new WebClient())
    {
        ...
    }
});
This works, but my throughput looks like this: [erratic throughput graph omitted]
The second way involves using semaphore and async to do the parallelism more manually (without the semaphores it will obviously spawn too many work items):
Parallel.For(start, end, a, i =>
{
    list.Add(getAndPreprocess(/*get URL from somewhere*/));
});
...
static async Task getAndPreprocess(string url)
{
    var client = new HttpClient();
    sem.WaitOne();
    string content = "";
    try
    {
        var data = client.GetStringAsync(url);
        content = await data;
    }
    catch (Exception ex) { Console.WriteLine(ex.InnerException.Message); sem.Release(); return; }
    sem.Release();
    try
    {
        //try to use results from content
    }
    catch { return; }
}
My throughput now looks like this: [throughput graph omitted]
Is there a nice way to do this, such that it starts downloading another file when the speed falls and stops adding files when the aggregate speed is constant (like what you would expect a download manager to do)?
Additionally, even though the second form gives better results, I dislike that I have to use semaphores, as it is error prone.
What is the standard way to do this?
Note: these are all small files (<50KB)
