C# Crawler: Moving a single-threaded WebClient call to multithreading

We currently have a web crawler for clients that are not able to provide us with an XML file. The list of sites is growing, so I need a more efficient way to crawl them. The logic of the crawler is simply:
Pass in www.example.com/widgets
Store the HTML and pass it to the crawler function
Crawl widgets page 1
IF widgets page 1 is the end of their product list
    stop
ELSE
    go to widgets page 2
This repeats for every site in the queue. As you can see, if Site 1 has 5000 products, Site 2 cannot proceed until it is done. What would be the best way to multithread this so that I can limit how many requests I make to each site, but grab multiple sites at one time? I tried Parallel.ForEach but the results were very sporadic and unpredictable. Currently we handle this by having "groups" of stores fire off at the same time using Windows Task Manager. Here is some example code:
foreach (var site in ListofSites)
{
    int page = 1;
    bool continue_crawling = true;
    while (continue_crawling)
    {
        var htmlWeb = new HtmlWeb();
        htmlWeb.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36";
        HtmlDocument doc = htmlWeb.Load(site + page);
        string html = doc.DocumentNode.OuterHtml;
        continue_crawling = Parse(html);
        page++;
    }
}

private bool Parse(string html)
{
    // parse the file and see if we have enough data
    return endofproduct;
}

All C# HTTP requests go through the ServicePoint for the request URL's domain. The ServicePoint inherits its ConnectionLimit from ServicePointManager.DefaultConnectionLimit. The default value is 2, in accordance with the 'good client practice' of RFC 2616:
A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
What all that translates to is that launching a gazillion HTTP requests to URLs in the same domain will only send at most 2 HTTP requests at a time, and another one will not start until one of them finishes. To achieve higher concurrency you need to increase the ServicePoint connection limit for the domain of interest. Doing concurrent IO with threads (including the TPL) is rather primitive, but it should work once you fix that limitation. Doing async IO would be preferable, of course.
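For illustration, a minimal sketch of what that might look like, with the connection limit raised before any requests go out and a cap on how many sites are crawled in parallel. The specific limits, the site list, and the use of WebClient here are assumptions for the sketch, not values from the question:

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

class CrawlerSketch
{
    static void Main()
    {
        // Raise the global default before any requests are issued;
        // every new ServicePoint inherits this value.
        ServicePointManager.DefaultConnectionLimit = 10;

        // Or raise the limit only for a single domain of interest:
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri("http://www.example.com"));
        sp.ConnectionLimit = 4;

        // Hypothetical site list; each site is still crawled page by page,
        // but different sites now run in parallel.
        var sites = new List<string>
        {
            "http://www.example.com/widgets",
            "http://www.example2.com/widgets"
        };

        Parallel.ForEach(sites, new ParallelOptions { MaxDegreeOfParallelism = 4 }, site =>
        {
            int page = 1;
            bool continueCrawling = true;
            while (continueCrawling)
            {
                using (var client = new WebClient())
                {
                    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
                    string html = client.DownloadString(site + "?page=" + page);
                    continueCrawling = Parse(html); // same role as Parse in the question
                }
                page++;
            }
        });
    }

    // Placeholder: returns true while more pages remain for the site.
    static bool Parse(string html)
    {
        return false;
    }
}

An async version built on HttpClient with a SemaphoreSlim per site would achieve the same throttling without blocking threads, which is the preferable route mentioned above.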

Related

C# HttpClient gets new SessionID every request

The title explains it mostly. I have declared my HttpClient, HttpClientHandler, and CookieContainer as class variables.
private HttpClient client;
private HttpClientHandler handler;
private CookieContainer cookies;
Then in the form's constructor I initialize the variables like so:
public FrmMain()
{
    InitializeComponent();
    handler = new HttpClientHandler();
    cookies = new CookieContainer();
    handler.AllowAutoRedirect = true;
    handler.UseCookies = true;
    handler.CookieContainer = cookies;
    client = new HttpClient(handler);
    client.DefaultRequestHeaders.Connection.Clear();
    client.DefaultRequestHeaders.ConnectionClose = false;
    client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");
}
Later on in the program, when I call the requests, I am able to log in to the URL (in this case a device on my local net) just fine. As part of the troubleshooting for this, I started printing the cookie data to the console each time a request is made. When I initially log on, it gives me a single cookie, a sessionID. Any subsequent request that I make using the same client gives me a new sessionID. This causes my requests to get a return code of BadRequest, most likely because it is trying to route me back to the login page. I know that I am successfully logging in with the first request because printing the response content gives me the HTML of the index page that I am redirected to upon a successful login. I've tested all the data I'm sending via Postman, where I'm able to do a login request and then do whatever other requests I need without issue. The only difference between Postman and my program is that in my program I am getting a new sessionID for every request instead of it persisting. Does anyone know why my cookies are not persisting despite the client handler, client, and cookie container all being declared at class scope?
It turns out my issue was not a cookie issue. The cookie was supposed to change to a new sessionID after logging in. The sessionID never changes after that. HttpClient was saving the cookie persistently. The issue was with a hidden CSRFToken that I was submitting with the formdata. That changes with each request and while I was doing the steps to get the new one before each POST, I was not actually assigning it.
I'd like to thank Jonathan. If I hadn't jumped back in to try some tweaks to make sure the program was only loading the initializations once, I probably wouldn't have been looking in the area where I was neglecting to assign the new CSRFToken.
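For anyone who lands here with the same symptom, a rough sketch of the fix as described, assuming the token is exposed as a hidden input named CSRFToken and parsed with HtmlAgilityPack; the field name, URLs, and helper names are assumptions, not details from the original post:

// Fetch the page that carries the hidden CSRF token and pull out its value.
// 'client' is the class-level HttpClient from the question.
private async Task<string> GetCsrfTokenAsync(string pageUrl)
{
    string html = await client.GetStringAsync(pageUrl);
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // Hypothetical field name; use whatever the device actually emits.
    var node = doc.DocumentNode.SelectSingleNode("//input[@name='CSRFToken']");
    return node?.GetAttributeValue("value", string.Empty);
}

// The step that was missing: actually assign the freshly fetched token
// to the form data before every POST.
private async Task<HttpResponseMessage> PostWithTokenAsync(string formUrl, Dictionary<string, string> fields)
{
    fields["CSRFToken"] = await GetCsrfTokenAsync(formUrl);
    return await client.PostAsync(formUrl, new FormUrlEncodedContent(fields));
}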

Cookie for domain not being used for subdomains

I use HttpClient in my app to send my user/password to a service that returns some cookies that I can later use for all my other requests. The service is located at https://accounts.dev.example.com/login and returns two cookies that have Domain=.dev.example.com. The issue I'm finding is that, on some machines (Windows Domain Controllers), these cookies are not being used when I request resources on subdomains like https://accounts.dev.example.com/health-check, even though according to the MDN docs a cookie for a domain can be used when requesting resources from subdomains:
Domain= Optional
Specifies those hosts to which the cookie will be sent. If not specified, defaults to the host portion of the current document location (but not including subdomains). Contrary to earlier specifications, leading dots in domain names are ignored. If a domain is specified, subdomains are always included.
Do you know how to properly configure HttpClient to pass the domain cookies to subdomain requests?
A bit more detail:
The cookies returned by my authentication service at https://accounts.dev.example.com/login look like this in the HTTP headers:
Set-Cookie: AK=112233;Version=1;Domain=.dev.example.com;Path=/;Max-Age=5400;Secure;HttpOnly,
Set-Cookie: AS=445566;Version=1;Domain=.dev.example.com;Path=/;Max-Age=5400;Secure;HttpOnly,
Then I can query C#'s CookieContainer with either of these calls on normal workstations:
cookies.GetCookies(new Uri("https://accounts.dev.example.com"))
cookies.GetCookies(new Uri("https://dev.example.com"))
Both of which will return the 2 cookies like:
$Version=1; AK=112233; $Path=/; $Domain=.dev.example.com
$Version=1; AS=445566; $Path=/; $Domain=.dev.example.com
But on the other machines (the Domain Controllers) the first call will return an empty list, while the second will return the 2 cookies.
Why this difference in the behaviour of CookieContainer.GetCookies depending on which machine is running the code?
My workstations are using Microsoft Windows 10 Home Single Language (.Net 4.0.30319.42000) and the DCs are using Microsoft Windows Server 2012 R2 Datacenter (.Net 4.0.30319.36399).
The code
This is a modified version of my code:
public static async Task<string> DoAuth(CookieContainer cookies,
                                        Dictionary<string, string> postHeaders,
                                        StringContent postBody)
{
    try
    {
        using (var handler = new HttpClientHandler())
        {
            handler.CookieContainer = cookies;
            using (var client = new HttpClient(handler, true))
            {
                foreach (var key in postHeaders.Keys)
                    client.DefaultRequestHeaders.Add(key, postHeaders[key]);

                var response = await client.PostAsync("https://accounts.dev.example.com/login", postBody);
                response.EnsureSuccessStatusCode();

                // This line returns 0 on Domain Controllers, and 2 on all other machines
                Console.Write(cookies.GetCookies(new Uri("https://accounts.dev.example.com")).Count);

                return await response.Content.ReadAsStringAsync();
            }
        }
    }
    catch (HttpRequestException e)
    {
        ...
        throw;
    }
}
As I couldn't find an answer to this (not on TechNet either), I decided to go with the following solution, which works, though I'm not sure it is the proper way of solving the issue:
foreach (Cookie cookie in cookies.GetCookies(new Uri("https://dev.example.com")))
{
    cookies.Add(new Uri("https://accounts.dev.example.com"), new Cookie(cookie.Name, cookie.Value, cookie.Path, ".accounts.dev.example.com"));
}
So, I'm duplicating the cookie for each one of the subdomains that my app should send these cookies to.
The underlying issue seems to be a bug in how the Set-Cookie header is handled. The cause appears to be the Version= component in the Set-Cookie header. This makes the CookieContainer fall on its face and results in the strange $Version and $Domain cookies then being sent in subsequent client requests. As far as I can tell there is no way to remove these broken cookies either: iterating GetCookies() with the originating domain does not reveal the erroneous cookies.
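If duplicating cookies per subdomain feels too brittle, another possible workaround (purely a sketch, assuming only the name/value pairs matter and reusing the cookies container and response object from the code above) is to ignore what CookieContainer parsed and add clean cookies yourself from the raw Set-Cookie headers, leaving out the Version= attribute:

// After the login POST, copy the cookies out of the raw Set-Cookie headers,
// dropping the Version= attribute that trips up CookieContainer.
if (response.Headers.TryGetValues("Set-Cookie", out var setCookieHeaders))
{
    foreach (var header in setCookieHeaders)
    {
        // e.g. "AK=112233;Version=1;Domain=.dev.example.com;Path=/;Max-Age=5400;Secure;HttpOnly"
        var nameValue = header.Split(';')[0].Split(new[] { '=' }, 2);
        var clean = new Cookie(nameValue[0].Trim(), nameValue[1])
        {
            Domain = ".dev.example.com",
            Path = "/",
            Secure = true,
            HttpOnly = true
        };
        cookies.Add(new Uri("https://dev.example.com"), clean);
    }
}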

Why does my HttpWebRequest return a 503?

So I am just getting to learn HttpWebRequest and its functions.
I've gotten to the point where I want to learn how to capture cookies in a CookieContainer and parse through them.
The issue is that some websites return a 503 error and I am not sure why.
One of those websites is used in this example.
From what I've read online, a 503 error is this:
The HyperText Transfer Protocol (HTTP) 503 Service Unavailable server error response code indicates that the server is not ready to handle the request.
Common causes are a server that is down for maintenance or that is overloaded. This response should be used for temporary conditions, and the Retry-After HTTP header should, if possible, contain the estimated time for the recovery of the service.
Which doesn't seem to fit at all, since the website is up and running.
Why is my request returning a 503 status code, and what should I do to resolve this issue in a proper manner?
static void Main(string[] args)
{
    // 1. Create an HTTP request
    // Build the request
    Uri site = new Uri("https://ucp.nordvpn.com/login/");
    // Initialize a new instance of HttpWebRequest by calling WebRequest.Create
    // with our site as the parameter and casting the result.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(site);
    // Initialize a new instance of the CookieContainer
    CookieContainer cookies = new CookieContainer();
    // The request's CookieContainer is null by default, so we assign the newly
    // initialized instance of our CookieContainer to the request's CookieContainer.
    request.CookieContainer = cookies;
    // Print out the cookies before the response (of course it will be blank)
    Console.WriteLine(cookies.GetCookieHeader(site));
    // Get the response and print out the cookies again
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Console.WriteLine(cookies.GetCookieHeader(site));
    }
    Console.ReadKey();
}
The URL that you are trying to get to appears to be protected by CloudFlare. You can't use the basic HttpWebRequest for that type of request without some additional work. While I haven't tried this, it may be an option for you to get around that protection:
CloudFlareUtilities
The URL you are trying to access is using cloud hosting, which applies many security measures, including checking which browser is accessing the site.
For that to work, you need to change the UserAgent property of the HttpWebRequest:
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0";
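Applied to the code from the question, that just means setting the property before GetResponse() is called (no guarantee this alone satisfies the Cloudflare check, but it is the first thing to try):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(site);
CookieContainer cookies = new CookieContainer();
request.CookieContainer = cookies;
// Present a browser-like identity before sending the request.
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0";

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Console.WriteLine(cookies.GetCookieHeader(site));
}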

Retrieve web page content like a browser

After learning about some different technologies, I wanted to make a small project using UWP + NoSQL. I wanted to build a small UWP app that grabs the horoscope and displays it on my Raspberry Pi every morning.
So I took a WebClient and did the following:
WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
But it seems the site detects that this request isn't coming from a browser, since the interesting part is not in the content (and when I check with the browser, it is in the initial HTML, according to Fiddler).
I also tried with ScrapySharp but I got the same result. Any idea why?
(I've already done the UWP part, so I don't want to change the topic of my personal project just because it is detected as a "bot")
EDIT
It seems I wasn't clear enough. The issue is not that I'm unable to parse the HTML; the issue is that I don't receive the expected HTML when using ScrapySharp/WebClient.
EDIT2
Here is what I retrieve: http://pastebin.com/sXi4JJRG
And I don't get (for example) the "Star ratings by domain" plus the related images for each star.
You can read the entire content of the web page using the code snippet shown below:
internal static async Task<string> ReadText(string Url, int TimeOutSec)
{
    using (HttpClient _client = new HttpClient() { Timeout = TimeSpan.FromSeconds(TimeOutSec) })
    {
        _client.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("text/html"));
        using (HttpResponseMessage _responseMsg = await _client.GetAsync(Url))
        {
            using (HttpContent content = _responseMsg.Content)
            {
                return await content.ReadAsStringAsync();
            }
        }
    }
}
Or in a simple way:
public static void DownloadString(string address)
{
    WebClient client = new WebClient();
    string reply = client.DownloadString(address);
    Console.WriteLine(reply);
}
(re: https://msdn.microsoft.com/en-us/library/fhd1f0sw(v=vs.110).aspx)
Yes, WebClient won't give you the expected result. Many sites use scripts to load content, so to emulate a browser you also have to run the page's scripts.
I have never done anything similar, so my answer is purely theoretical.
To solve the problem you need a "headless browser".
I know of two projects for this (I have never tried either of them):
http://webkitdotnet.sourceforge.net/ - it seems to be outdated
http://www.awesomium.com/
Ok, I think I know what's going on: I compared the real output (no fancy user agent strings) to the output as supplied by your pastebin and found something interesting. On line 213, your pastebin has:
<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hov...ck">Forecast Tarot Readings</div>
Mind the data-hov...ck near the end. In the real output, this was:
<li class="dropdown">Astrology
followed by about 600 lines of code, including the aforementioned 'interesting part'. On line 814, it says:
<div class="bot-explore-col-subtitle f14 blocksubtitle black">Forecast Tarot Readings</div>
which, starting with the ck in black, matches up with the rest of the pastebin output. So either pastebin has condensed the output, or the original output already was.
I created a new console application, inserted your code, and got the result I expected, including the 600 lines of html you seem to miss:
static void Main(string[] args)
{
    WebClient client = new WebClient();
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
    File.WriteAllText(@"D:\Temp\source-mywebclient.html", downloadString);
}
My WebClient is from System.Net, and changing the UserAgent hardly has any effect; only a couple of links come out a bit different.
So, to sum it up: your problem has nothing to do with content that is inserted dynamically after the initial GET, but possibly with WebClient combined with UWP. There's another question on the site regarding WebClient and UWP, (UWP) WebClient and downloading data from URL, which states you should use HttpClient. Maybe that's a solution?
Some time ago I used http://www.nrecosite.com/phantomjs_wrapper_net.aspx; it worked well, and as Anton mentioned it is a headless browser. Maybe it will be of some help.
I'm wondering if all the 'interesting parts' you expect to see 'in the content' are images. Are you aware of the fact that you have to retrieve any images separately? The fact that an HTML page contains <img .../> tags does not magically display them as well. As you can see with Fiddler, after retrieving a page, the browser then retrieves all images, style sheets, JavaScript and all other items that are specified but not included in the page. (You might need to clear the browser cache to see this happen.)

ASP Classic VBScript to ASP.NET C# Conversion

I am familiar with ASP.NET, but not with Visual Basic.
Here is the Visual Basic code:
myxml = "http://api.ipinfodb.com/v3/ip-city/?key=" & api_key & "&ip=" & UserIPAddress & "&format=xml"
set xml = server.CreateObject("MSXML2.DOMDocument.6.0")
xml.async = "false"
xml.resolveExternals = "false"
xml.setProperty "ServerHTTPRequest", true
xml.load(myxml)
response.write "<p><strong>First result</strong><br />"
for i = 0 to 10
    response.write xml.documentElement.childNodes(i).nodename & " : "
    response.write xml.documentElement.childNodes(i).text & "<br/>"
next
response.write "</p>"
What is going on in this code?
How can I convert this to ASP.NET (C#)?
Based on a quick glance at the site you linked to in a comment, it looks like the intended functionality is to make a request to a URL and receive the response. The first example given on that site is:
http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100
You can probably use something like the System.Net.WebClient class to make an HTTP request and receive the response. The example on MSDN can be modified for your URL. Maybe something like this:
var client = new WebClient();
client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
var data = client.OpenRead(@"http://api.ipinfodb.com/v3/ip-city/?key=<your_api_key>&ip=74.125.45.100");
var reader = new StreamReader(data);
var result = reader.ReadToEnd();
data.Close();
reader.Close();
(There's also the WebRequest class, which appears to share roughly the same functionality.)
At that point the result variable contains the response from the API, which you can handle however you need to.
From the looks of the Visual Basic code, I think you should create two methods to "convert" this to an ASP.NET C# web page:
LoadXmlData method - use an XmlDocument to load from the URL via the XmlDocument's Load function. Read ASP.net load XML file from URL for an example.
BuildDisplay method - use an ASP.NET PlaceHolder or Panel to create a container to inject the paragraph tag and individual results into.
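A rough C# translation along those lines (a sketch only, folding both suggested methods into one for brevity; api_key and UserIPAddress are the same placeholders as in the VBScript, and Response.Write assumes a Web Forms code-behind):

using System.Xml;

protected void BuildDisplay(string api_key, string UserIPAddress)
{
    // Equivalent of the VBScript: load the XML returned by the API and write
    // out the name and text of the first eleven child nodes.
    string myxml = "http://api.ipinfodb.com/v3/ip-city/?key=" + api_key + "&ip=" + UserIPAddress + "&format=xml";

    var xml = new XmlDocument();
    xml.Load(myxml);  // XmlDocument.Load accepts a URL and fetches it over HTTP

    Response.Write("<p><strong>First result</strong><br />");
    XmlNodeList children = xml.DocumentElement.ChildNodes;
    for (int i = 0; i <= 10 && i < children.Count; i++)
    {
        Response.Write(children[i].Name + " : ");
        Response.Write(children[i].InnerText + "<br/>");
    }
    Response.Write("</p>");
}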
