I am developing an app in which I have to fetch data from a website. The format of the website is:
<div id="id1" class="class1">
<ol class="cls_ol">
<li>
<div class="class2">Content 1</div>
<div class="cls_img">
*** Code for some image ***
</div>
Content 2
</li>
<li> *** Same as the <li> above *** </li>
<li> *** Same as the <li> above *** </li>
</ol>
</div>
I use this code to fetch it:
protected void Button1_Click(object sender, EventArgs e)
{
    var obj = new HtmlWeb();
    var document = obj.Load(" ** url of a website ** ");
    var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
    foreach (var i in bold)
    {
        Response.Write(i.InnerHtml);
    }
}
But the problem with my code is that it also fetches the images inside <div class="cls_img"></div>. I don't need those images. So, how can I fetch all the content of <div id="id1" class="class1"> without fetching the image from <div class="cls_img">?
Step 1 - select and remove the images inside the <div class="cls_img"> within the <div class="class1"> tag:
var images = document.DocumentNode.SelectNodes(
    "//div[@class='class1']//div[@class='cls_img']//img"
);
// note that if no nodes are found, "images" will be null
if (images != null)
{
    foreach (var image in images)
    {
        image.Remove();
    }
}
Step 2 - select the <div class="class1"> elements (you have already done this) - now without those images:
var bold = document.DocumentNode.SelectNodes("//div[@class='class1']");
foreach (var node in bold)
{
    Response.Write(node.InnerHtml);
}
Loop through the nodes, find the node with the matching class="cls_img" attribute, and remove that node:
node.ParentNode.RemoveChild(node);
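A minimal sketch of that approach, assuming document is the HtmlDocument already loaded in the code above:
var imgDivs = document.DocumentNode.SelectNodes("//div[@class='class1']//div[@class='cls_img']");
// SelectNodes returns null when nothing matches
if (imgDivs != null)
{
    foreach (var imgDiv in imgDivs)
    {
        // detach the whole cls_img div so its image never appears in InnerHtml
        imgDiv.ParentNode.RemoveChild(imgDiv);
    }
}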
Hi, I want to process information on an HTML page. With the following code I can get the information.
This is the order in which it should be received:
new-link-1
new-link-2
new-link-3
But when it reaches the new-link-no-title section, it breaks and the order changes to:
new-link-3
new-link-1
new-link-2
And at the end, the program stops with an ArgumentOutOfRangeException.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
    var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
    var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
    MessageBox.Show(item.InnerText);
    MessageBox.Show(x);
    MessageBox.Show(xx.Attributes["href"].Value);
}
And the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title=حال پخش">
</i>
</div>
</li>
<li>
<divs="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>
This is what I found to be the issue with your code.
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> gives 4 indices for index
item.SelectNodes("//div[@class='new-link-2']") // -> this produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> this produces only 3 nodes
Issue:
When you search with //div, you search all nodes in the document, not just the descendants of the item you are currently on.
Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot instead, only the descendants of the current node will be considered. (Excerpt from here)
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
    try
    {
        var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
        var x = item.SelectSingleNode(".//div[@class='new-link-2']");
        var xx = item.SelectSingleNode(".//a");
        MessageBox.Show(x0.InnerText);
        MessageBox.Show(x.InnerText);
        if (xx.Attributes["href"] != null)
            MessageBox.Show(xx.Attributes["href"].Value);
    }
    catch { } // skip list items that are missing one of the divs or the <a> link
}
First of all, sorry about my bad English.
My question is: how can I scrape a div inside a div with HtmlAgilityPack in C#?
This is my test HTML code:
<html>
<div class="all_ads">
<div class="ads__item">
<div class="test">
test 1
</div>
</div>
<div class="ads__item">
<div class="test">
test 2
</div>
</div>
<div class="ads__item">
<div class="test">
test 3
</div>
</div>
</div>
</html>
How can I make a loop that gets all the ads, and then an inner loop that reads the test div inside each ad?
You can select all the ad nodes inside all_ads as follows:
var res = doc.DocumentNode.SelectNodes(".//div[@class='all_ads']/div[@class='ads__item']");
.//div[@class='all_ads']/div[@class='ads__item'] selects all the nodes inside all_ads which have the class ads__item.
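If you then want the nested loop from the question, a rough sketch (assuming doc is the loaded HtmlDocument) could iterate the ads and read the inner test div of each one:
var ads = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
if (ads != null)
{
    foreach (var ad in ads)
    {
        // the leading dot restricts the search to the current ad
        var test = ad.SelectSingleNode(".//div[@class='test']");
        if (test != null)
            Console.WriteLine(test.InnerText.Trim());
    }
}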
You have to use this path => //div[contains(@class, 'test')]
This selects the div(s) whose class contains the name test (the inner div of each ads__item), and then you read each selected div's inner HTML, like:
class Program
{
    static void Main(string[] args)
    {
        string html = File.ReadAllText(@"Path to your html file");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        var innerContent = doc.DocumentNode.SelectNodes("//div[contains(@class, 'test')]").Select(x => x.InnerHtml.Trim());
        foreach (var item in innerContent)
            Console.WriteLine(item);
        Console.ReadLine();
    }
}
Output:
test 1
test 2
test 3
I'm working on a page that loads dynamically; data gets added while scrolling. To identify the properties of an item, I identified the parent div; now, to identify the address, I have to locate the span elements with an XPath relative to that parent.
Below is my DOM structure:
<div class = "parentdiv">
<div class = "search">
<div class="header">
<div class="data"></div>
<div class="address-data">
<div class="address" itemprop="address">
<a itemprop="url" href="/search/Los-Angeles-CA-90025">
<span itemprop="streetAddress">
Avenue
</span>
<br>
<span itemprop="Locality">Los Angeles</span>
<span itemprop="Region">CA</span>
</a>
</div>
</div>
</div>
</div>
</div>
Here I want to locate the three spans; I'm currently at the parent div.
Can someone guide me on how to locate an element using XPath from a particular div?
You can try the following XPaths,
To locate the street address:
//div[#class="parentdiv"]/div/div/a/span[#itemprop="streetAddress"]
To locate the locality/city:
//div[#class="parentdiv"]/div/div/a/span[#itemprop="Locality"]
To locate the state:
//div[#class="parentdiv"]/div/div/a/span[#itemprop="Region"]
To print the list of <span> tagged WebElements with text such as Avenue within the div class="parentdiv" node, you can use the following block of code:
IList<IWebElement> myList = Driver.FindElements(By.CssSelector("div.parentdiv div.address > a[itemprop=url] > span"));
foreach (IWebElement element in myList)
{
    string my_add = element.GetAttribute("innerHTML");
    Console.WriteLine(my_add);
}
Your DOM might become fairly large, since it adds elements while scrolling, so using CSS selectors might be quicker.
To get all the span tags in the div, use:
div[class='address'] span
To get a specific span by using the itemprop attribute use:
div[class='address'] span[itemprop='streetAddress']
div[class='address'] span[itemprop='Locality']
div[class='address'] span[itemprop='Region']
You can store the elements in a variable like so:
var streetAddress = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='streetAddress']"));
var locality = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='Locality']"));
var region = driver.FindElement(By.CssSelector("div[class='address'] span[itemprop='Region']"));
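If you just need the visible text, a minimal follow-up using the variables above could be:
// print the text of each located span
Console.WriteLine(streetAddress.Text);
Console.WriteLine(locality.Text);
Console.WriteLine(region.Text);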
I'm having big trouble trying to parse this HTML content with the HtmlAgilityPack library.
In this piece of code, I would like to retrieve only the URL (href) that refers to uploaded.net, but I can't determine which URL refers to it.
<div class='downloads' id='download_block'>
<h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
</div>
This is how it looks on the webpage
And this is what I have:
nodes = myHtmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")
I can't just use an array-index to determine the position like:
nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node
...because the number of nodes and the position of each hosting site could change.
Note also that the URLs will not contain the hosting names; they are redirections like:
http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--
What could I do, in C# or VB.NET?
This should do it, though it's untested:
doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value
Also, use contains() because you never know whether the text contains extra spaces.
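Put together with HtmlAgilityPack, a sketch (assuming doc is the loaded HtmlDocument and that the real page has an <a> inside each <ul class='parts'>) could look like:
var link = doc.DocumentNode.SelectSingleNode(
    "//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a");
// SelectSingleNode returns null when nothing matches, so guard before reading the attribute
if (link != null)
    Console.WriteLine(link.GetAttributeValue("href", string.Empty));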
The only way I see this working is a two-fold approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this approach lets you do exactly that by grabbing the correct index dynamically.
void Main()
{
var xml = @"
<div class=""downloads"" id=""download_block"">
<h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
</div>";
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(xml);
var nav = xmlDocument.CreateNavigator();
var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
var text = xmlDocument.SelectSingleNode("//ul[" + index + "]//a/@href").InnerText;
Console.WriteLine(text);
}
Basically, it gets the index of the uploaded.net h4 and then uses that index to select the correct ul tag and pull the URL out of the underlying anchor tag.
Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.
Given the snippet you supplied, this will help you get started.
var page = "<div class=\"downloads\" id=\"download_block\"> <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5> <h4>uploadable.ch</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>uploaded.net</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>novafile.com</h4> <ul class=\"parts\"> <li> text here </li> </ul></div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploaded"));
foreach (var node in nodes)
{
    var attr = node.NextSibling.NextSibling.Descendants().Where(x => x.Name == "a").FirstOrDefault().Attributes["href"];
    Console.WriteLine(attr.Value);
}
I am using the WatiN DLL to browse through a web page: click a link in an li tag, go to the next page, fetch some data, go back to the previous page, and click the link in the next li tag.
I am able to do this with one link in an li tag. I want to get all the li tags under the ul <classname>, click each link, and perform the above procedure. How can I get all the li elements and loop through each page?
The HTML code of the page is like this:
<ul id="ul_classname" class="search-result-set">
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
<li class="">
<div class="Div_Classname">
<h3 class="standard_font">
<a class="a class_name" href="link to be clicked">text to be displayed</a>
</h3>
<p class="word-wrap"></p>
</div>
</li>
</ul>
HTH!
private void CrawlSite()
{
    int idx = 0;
    do
    {
        idx = this.ClickLink(idx);
    }
    while (idx != -1);
}

private int ClickLink(int idx)
{
    WatiN.Core.Browser browser = GetBrowser();
    ListItemCollection listItems = browser.List("ul_classname").ListItems;
    if (idx > listItems.Count - 1)
        return -1;
    Link lnk = listItems[idx].Link(Find.ByClass("a class_name"));
    lnk.Click();
    //TODO: get your data
    browser.Back();
    return idx + 1;
}
You can try this code (LINQ to XML):
var xdoc = XDocument.Load(yourFile);
var terms = from term in xdoc.Descendants("ul")
            select new
            {
                Class = term.Attribute("class").Value
            };
foreach (var li in terms)
{
    Console.Write(li.Class);
}
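If you also need the link inside each li, a rough sketch along the same LINQ to XML lines (assuming the page parses as well-formed XML) could drill down to the anchors:
var links = from ul in xdoc.Descendants("ul")
            where (string)ul.Attribute("id") == "ul_classname"
            from a in ul.Descendants("a")
            select (string)a.Attribute("href");
foreach (var href in links)
{
    Console.WriteLine(href);
}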
Try this:
LinkCollection links = ie.Links;
foreach (var link in links)
{
    link.Click();
    // Do something
    ie.Back();
}