HTML Agility Pack 2

HTML Agility Pack 2 - c#

I am tring to scrap This Website .
The below Xpath expression working fine with FirePath firebug extension
html/body/table/tbody/tr[3]/td
But using same xpath expression the below code gives me null :
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("http://www.edb.gov.hk/templates/sch_list_print.asp?district=cw");
var collection= doc.DocumentNode.SelectNodes("html/body/table/tbody/tr[3]/td");
Can anyone help me on this. Thanks.

this works, looking at the source of the page you are trying to scrape there is no tbody inside of table.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("http://www.edb.gov.hk/templates/sch_list_print.asp?district=cw");
var collection= doc.DocumentNode.SelectNodes("html/body/table/tr[3]/td");
change your xpath to
html/body/table/tr[3]/td

Related

HTML Agility Pack is not working when filtering by class Name

I am trying to scrape some data from a commercial site. I am trying to drill down to a specific class, However I am having difficulty drilling down and filtering.
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(Url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body");
var divs = nodes.Descendants("div").Where(x => x.HasClass("srp-main srp-main--isLarge"));
I am getting null for "divs". Where am I going wrong?
My ultimate goal is to drill down to div with class name = s_item__info clearfix. So help with that would be appreciated.

Html Agility Pack, SelectSingleNode

This code works
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
html = client.DownloadString("http://www.imdb.com/chart/moviemeter?ref_=nv_mv_mpm_8");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
MessageBox.Show(doc.DocumentNode.SelectSingleNode("//*[#id='main']/div/span/div/div/div[3]/table/tbody/tr[1]/td[2]/a").InnerText);
Html codes here:
Split
MessageBox shows the text which is "Split". But look this Html codes:
<div class="summary_text" itemprop="description">
Three girls are kidnapped by a man with a diagnosed 23 distinct personalities, and must try and escape before the apparent emergence of a frightful new 24th.
</div>
I want MessageBox to show the text which starts with "Three girls are kidn..." so i wrote this code:
WebClient client2 = new WebClient();
client2.Encoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc2 = new HtmlAgilityPack.HtmlDocument();
doc2.LoadHtml(client2.DownloadString("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1"));
MessageBox.Show(doc2.DocumentNode.SelectSingleNode("//*[#id='title - overview - widget']/div[3]/div[1]/div[1]").InnerText);
When i start this code,an unhandled exception of type "System.NullReferenceException" occurred
Xpaths are true, i've checked a hundred times so what should i do?

Can you try this?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt4972582/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2240084082&pf_rd_r=1QW31NGD6JSE46F79CKQ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=moviemeter&ref_=chtmvm_tt_1");
var desNodeText = doc.DocumentNode.Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "summary_text").InnerText;

Convert "iso-8859-1" to "utf-8" with HTML Agility Pack and xpath

I'm trying to get a piece of web page, but I have a problem with special characters. how to convert the data to obtain a correct reading? the website use ISO 8859-1 and i must use UTF 8.
string url = "http://www.ta-meteo.fr/troyes.htm";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[#id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);
thanks.

I solved the problem
string url = "http://www.ta-meteo.fr/troyes.htm";
Encoding iso = Encoding.GetEncoding("iso-8859-1");
HtmlWeb web = new HtmlWeb()
{
AutoDetectEncoding = false,
OverrideEncoding = iso,
};
HtmlDocument doc = web.Load(url);
HtmlNode bulletinMatin = doc.DocumentNode.SelectSingleNode("//*[#id='blockdetday0']/div[1]/p[1]");
MessageBox.Show(bulletinMatin.InnerText);

Can i read iframe through WebClient (i want the outer html)?

Well my program is reading a web target that somewhere in the body there is the iframe that i want to read.
My html source
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
in my program i have a method that is returning the source as a string
public static string get_url_source(string url)
{
using (WebClient client = new WebClient())
{
return client.DownloadString(url);
}
}
My problem is that i want to get the source of the iframe when it's reading the source, as it would do in normal browsing.
Can i do this only by using WebBrowser Class or there is a way to do it within WebClient or even another class?
The real question:
How can i get the outer html given a url? Any appoach is welcomed.

After getting the source of the site, you can use HtmlAgilityPack to get the url of the iframe
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
then make a second call to get_url_source

Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
doc.Load(url);
foreach (HtmlNode node in doc.DocumentElement.SelectNodes("//iframe"))
iframeSource.Add(get_url_source(mainiFrame.Attributes["src"]));
If you are targeting a single iframe, try to identify it using ID attribute or something else so you can only retrieve one source:
String iframeSource;
HtmlDocument doc = new HtmlDocument();
doc.Load(url);
foreach (HtmlNode node in doc.DocumentElement.SelectNodes("//iframe"))
{
// Just an example for check, but you could use different approaches...
if (node.Attributes["id"].Value == 'targetframe')
iframeSource = get_url_source(node.Attributes["src"].Value);
}

Well i found the answer after some search and this is what i wanted
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
//You can use here OuterHtml too.

HTMLAgilityPack stripping out html

I'm sure this question has asked before and i've looked before I cant find the answer, or maybe I am just doing something wrong.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(indivdualfix[0]);
HtmlWeb hwObject = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldocObject = hwObject.Load(indivdualfix[0]);
HtmlNode body = htmldocObject.DocumentNode.SelectSingleNode("//body");
body.Attributes.Remove("style");
foreach (var a in body.Attributes.ToArray())
a.Remove();
string bodywork = body.InnerHtml.ToString();
The string body still returns all the html coding. I might be missing something really small here. What needs to be doen to remove all the html coding basically.

Use body.InnerText not body.InnerHtml

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

HTML Agility Pack 2 - c#

Related

HTML Agility Pack is not working when filtering by class Name

Html Agility Pack, SelectSingleNode

Convert "iso-8859-1" to "utf-8" with HTML Agility Pack and xpath

Can i read iframe through WebClient (i want the outer html)?

HTMLAgilityPack stripping out html

Categories

Resources