HtmlAgilityPack not loading HTML of the webpage - C#

I am trying to crawl https://www.adecco.ch/en-us/job-results, but I am not able to load the HTML from this page; nothing ends up in the HTML document.
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants().ToList();

As mentioned in my comment, the content from the site is sent back compressed and was not being decompressed before you tried loading it, so you were essentially loading gibberish. This code should work fine:
var handler = new HttpClientHandler();
// this is the important bit
handler.AutomaticDecompression = System.Net.DecompressionMethods.All;
var httpClient = new HttpClient(handler);
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants().ToList();
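One caveat: `DecompressionMethods.All` only exists on .NET Core 3.0 and later. On .NET Framework, a minimal sketch of the equivalent is to combine the flags explicitly (Brotli is not available there, but GZip and Deflate cover most sites):

```csharp
using System.Net;
using System.Net.Http;

class Sketch
{
    static void Main()
    {
        // On .NET Framework, where DecompressionMethods.All does not exist,
        // combine the individual flags instead:
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        var httpClient = new HttpClient(handler);
        // use httpClient.GetStringAsync(url) exactly as in the answer above
    }
}
```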

Related

HTML Agility Pack is not working when filtering by class Name

I am trying to scrape some data from a commercial site by drilling down to a specific class, but I am having difficulty with the filtering.
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(Url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body");
var divs = nodes.Descendants("div").Where(x => x.HasClass("srp-main srp-main--isLarge"));
I am getting nothing back in "divs". Where am I going wrong?
My ultimate goal is to drill down to the div with class name = s_item__info clearfix, so help with that would be appreciated.
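For what it's worth, HtmlNode.HasClass matches a single class token, so passing the space-separated string "srp-main srp-main--isLarge" can never match; chaining two HasClass calls should work. A minimal sketch against an inline HTML stand-in (the class names are taken from the question, not verified against the live site):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class Sketch
{
    static void Main()
    {
        // Inline stand-in for the downloaded page, reusing the question's class names.
        var html = @"<body>
            <div class='srp-main srp-main--isLarge'>
                <div class='s_item__info clearfix'>item text</div>
            </div>
        </body>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HasClass checks one class token at a time, so test each class separately.
        var divs = doc.DocumentNode.Descendants("div")
            .Where(x => x.HasClass("srp-main") && x.HasClass("srp-main--isLarge"));

        // Drill further down to the s_item__info clearfix divs.
        var items = divs
            .SelectMany(d => d.Descendants("div"))
            .Where(d => d.HasClass("s_item__info") && d.HasClass("clearfix"))
            .ToList();

        Console.WriteLine(items.Count);
        Console.WriteLine(items[0].InnerText);
    }
}
```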

How to get the data programmatically using WebClient / HttpClient?

I want to download the data from https://eauction.ccmc.gov.in/frm_scduled_items.aspx using the dates listed in the dropdown.
private async Task Cbetest()
{
    using (var client = new HttpClient())
    {
        client.BaseAddress = new Uri("https://eauction.ccmc.gov.in");
        var content = new FormUrlEncodedContent(new[]
        {
            new KeyValuePair<string, string>("ctl00$ContentPlaceHolder1$gridedit$ctl14$ctl02", "17/02/2016")
        });
        var result = await client.PostAsync("/frm_scduled_items.aspx", content);
        string resultContent = await result.Content.ReadAsStringAsync();
        Console.WriteLine(resultContent);
    }
}
You need to do a little extra work to simulate a POST when scraping an ASP.NET WebForms application. Mostly, you need to pass along valid ViewState and EventValidation parameters, which you can retrieve from an initial GET request.
I'm using the HTML Agility Pack to help parse the initial response; I recommend looking into it if you're planning to scrape HTML.
The following seems to get the results you're looking for, though I haven't looked too deeply into the response HTML.
using (var client = new HttpClient())
{
    client.BaseAddress = new Uri("https://eauction.ccmc.gov.in");
    var initial = await client.GetAsync("/frm_scduled_items.aspx");
    var initialContent = await initial.Content.ReadAsStringAsync();

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(initialContent);
    var viewState = htmlDoc.DocumentNode.SelectSingleNode("//input[@id='__VIEWSTATE']").GetAttributeValue("value", string.Empty);
    var eventValidation = htmlDoc.DocumentNode.SelectSingleNode("//input[@id='__EVENTVALIDATION']").GetAttributeValue("value", string.Empty);

    var content = new FormUrlEncodedContent(new Dictionary<string, string>
    {
        {"__VIEWSTATE", viewState},
        {"__EVENTVALIDATION", eventValidation},
        {"ctl00$ContentPlaceHolder1$drp_auction_date", "17/02/2016"}
    });
    var res = await client.PostAsync("/frm_scduled_items.aspx", content);
    var resContent = await res.Content.ReadAsStringAsync();
    Console.WriteLine(resContent);
}
From there you'll want to parse the resulting table to get useful information. If you want to crawl through the DataGrid's pages, you're going to need to get updated EventValidation and ViewState values and simulate additional posts for each page.
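For the paging step, the usual trick with WebForms is to mimic the __doPostBack call that the pager links make, by posting __EVENTTARGET and __EVENTARGUMENT alongside fresh ViewState values. A rough sketch of building that form body (the grid ID "ctl00$ContentPlaceHolder1$grd_items" is a made-up placeholder; take the real one from the __doPostBack(...) call in the pager links of the response HTML):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

class Sketch
{
    static void Main()
    {
        // These must be re-read from the *previous* response --
        // WebForms rotates them on every post.
        var viewState = "...";       // value of //input[@id='__VIEWSTATE']
        var eventValidation = "..."; // value of //input[@id='__EVENTVALIDATION']

        var form = new Dictionary<string, string>
        {
            {"__VIEWSTATE", viewState},
            {"__EVENTVALIDATION", eventValidation},
            // Hypothetical grid ID -- read the real one from the pager's __doPostBack call.
            {"__EVENTTARGET", "ctl00$ContentPlaceHolder1$grd_items"},
            {"__EVENTARGUMENT", "Page$2"}, // conventional WebForms pager argument
            {"ctl00$ContentPlaceHolder1$drp_auction_date", "17/02/2016"}
        };
        var content = new FormUrlEncodedContent(form);
        // Then post exactly as before:
        // var res = await client.PostAsync("/frm_scduled_items.aspx", content);
    }
}
```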

Html Agility Pack, SelectSingleNode

This code works
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
html = client.DownloadString("http://www.imdb.com/chart/moviemeter?ref_=nv_mv_mpm_8");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
MessageBox.Show(doc.DocumentNode.SelectSingleNode("//*[@id='main']/div/span/div/div/div[3]/table/tbody/tr[1]/td[2]/a").InnerText);
The MessageBox shows the text, which is "Split". But look at this HTML:
<div class="summary_text" itemprop="description">
Three girls are kidnapped by a man with a diagnosed 23 distinct personalities, and must try and escape before the apparent emergence of a frightful new 24th.
</div>
I want the MessageBox to show the text starting with "Three girls are kidn...", so I wrote this code:
WebClient client2 = new WebClient();
client2.Encoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc2 = new HtmlAgilityPack.HtmlDocument();
doc2.LoadHtml(client2.DownloadString("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1"));
MessageBox.Show(doc2.DocumentNode.SelectSingleNode("//*[@id='title-overview-widget']/div[3]/div[1]/div[1]").InnerText);
When I run this code, an unhandled exception of type "System.NullReferenceException" occurs.
The XPaths are correct, I've checked a hundred times, so what should I do?
Can you try this?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1");
// the ?. avoids the same NullReferenceException if the node is missing
var desNodeText = doc.DocumentNode.Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "summary_text")?.InnerText;

HtmlAgilityPack HtmlWeb.Load returning empty Document

I have been using HtmlAgilityPack for the last 2 months in a Web Crawler Application with no issues loading a webpage.
Now when I try to load this particular webpage, the document's OuterHtml is empty, so this test fails:
var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);
I can load another page from the site with no problems, such as setting
url = "http://www.prettygreen.com/news/";
In the past I once had an issue with encodings; I played around with htmlWeb.OverrideEncoding and htmlWeb.AutoDetectEncoding with no luck. I have no idea what the issue could be with this webpage.
It seems this website requires cookies to be enabled. So creating a cookie container for your web request should solve the issue:
var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
htmlWeb.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);

HtmlAgilityPack - How to understand page redirected and load redirected page

Using HtmlAgilityPack and C# 4.0, how can you determine whether a page is being redirected? I am using this method to load the page:
HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);
And an example redirection result, I suppose; the returned inner HTML:
<meta http-equiv="refresh" content="0;URL=http://www.pratikev.com/fractalv33/pratikEv/pages/home.jsp">
For this case, parsing the HTML is the best way.
var page = "...";
var doc = new HtmlDocument();
doc.LoadHtml(page); // page holds the HTML string, so LoadHtml, not Load
var root = doc.DocumentNode;
var select = root.SelectNodes("//meta[contains(@content, 'URL')]");
if (select != null)
{
    Console.WriteLine("has redirect..");
    Console.WriteLine(select[0].Attributes["content"].Value.Split('=')[1]);
}
else
{
    Console.WriteLine("no redirect found in the HTML");
}
Assuming the document is relatively well-formed, I suppose you could do something like this:
static string GetMetaRefreshUrl(string sourceUrl)
{
var web = new HtmlWeb();
var doc = web.Load(sourceUrl);
var xpath = "//meta[@http-equiv='refresh' and contains(@content, 'URL')]";
var refresh = doc.DocumentNode.SelectSingleNode(xpath);
if (refresh == null)
return null;
var content = refresh.Attributes["content"].Value;
return Regex.Match(content, @"\s*URL\s*=\s*([^ ;]+)").Groups[1].Value.Trim();
}
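As a quick sanity check, the regex can be exercised offline against the content value from the question's meta tag:

```csharp
using System;
using System.Text.RegularExpressions;

class Sketch
{
    static void Main()
    {
        // content attribute value from the question's meta refresh tag
        var content = "0;URL=http://www.pratikev.com/fractalv33/pratikEv/pages/home.jsp";
        // capture everything after URL= up to the next space or semicolon
        var url = Regex.Match(content, @"\s*URL\s*=\s*([^ ;]+)").Groups[1].Value.Trim();
        Console.WriteLine(url); // http://www.pratikev.com/fractalv33/pratikEv/pages/home.jsp
    }
}
```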

Categories

Resources