I'm currently attempting to use HtmlAgilityPack to extract specific links from an HTML page. I tried using plain C# to force my way in, but that turned out to be a real pain. The links are all inside <div> tags that share the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
//not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's markup looks along the lines of this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are inside those <a> tags, which are nested inside the <div class="acTrigger"> tags. It would be simple if each <a> had a unique class, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, visit each page, and pull more information from it. I just need a nudge in the right direction to get over this hump; then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract every URL from a page, not specific ones. A link to an example or documentation would help; any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is your <a> tag instead of the div.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
// Get the value of the HREF attribute.
string hrefValue = node.GetAttributeValue("href", string.Empty);
// Then store hrefValue for later.
}
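The selection logic itself is language-agnostic, so you can sanity-check the XPath before wiring it into C#. Here is a minimal sketch using Python's standard library against a simplified copy of the sample markup (onclick attributes dropped for brevity); the idea maps one-to-one onto SelectNodes plus GetAttributeValue.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the question's markup.
html = """
<ul>
  <li><div class="acTrigger"><a href="/16014988/d/">Battery <em>(1)</em></a></div></li>
  <li><div class="acTrigger"><a href="/15568540/d/">Brakes <em>(2)</em></a></div></li>
</ul>
"""

root = ET.fromstring(html)

# Same shape as SelectNodes("//div[@class='acTrigger']/a"):
# select the <a> directly so each matched node *is* the link.
hrefs = [a.get("href", "")  # analogous to GetAttributeValue("href", string.Empty)
         for a in root.findall(".//div[@class='acTrigger']/a")]

print(hrefs)  # ['/16014988/d/', '/15568540/d/']
```

The key point is the trailing /a step: the div carries the class you can anchor on, and the extra step hands you the link node itself.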
Related
I am trying to get a link in a website which changes name on a daily basis. The structure is similar to this (but with many more levels):
<li>
<div class = "contentPlaceHolder1">
<div class="content">
<p>
<strong>Today's File Here:</strong>
</p>
</div>
</div>
</li>
<li>...</li>
<li>...</li>
<li>...</li>
<li>
<div class = "contentPlaceHolder1">
<div class="content">
<div class="DocLink">
<li>
<a class="txtLnk" href="...">Download</a>
</li>
</div>
</div>
</div>
</li>
<li>...</li>
etc...
If I find the text (which will remain constant) that is immediately above it in the page by using
IWebElement foundTextElement = chrome.FindElement(By.XPath("//p/strong[contains(., \"Today's File Here:\")]"));
How can I find the next link in the page by using XPath (or alternative solution)? I am unsure of how to search for the next element after this.
If I use
IWebElement link = chrome.FindElement(By.XPath("//a[@class='txtLnk']"));
then this finds the first link in the page. I only want the first occurrence of it after foundTextElement.
I have had it working by navigating up the tree to the parent above <li> and finding the 4th sibling using By.XPath("following-sibling::*[4]/div/div/div/li/a[@class='txtLnk']"), but that seems a little precarious to me.
I could parse the HTML until it finds the next occurrence in the html, but was wondering whether there is a more clever way of doing this?
Thanks.
You can try this XPath. It's complicated, as we can't see the rest of the page to optimize it:
//li[preceding-sibling::li[.//*[contains(text(),'File Here')]]][.//a[contains(@class,'txtLnk')]][1]
It selects the first li that contains an a tag with the txtLnk class and that follows an li element whose text contains 'File Here'.
By.XPath("//a[@class='txtLnk']")
is a very generic selector; there might be other elements on the page using the same class.
You can find this using a CssSelector, try this:
IWebElement aElement = chrome.FindElement(By.CssSelector("div.contentPlaceHolder1 div.content div.DocLink li a"));
Then you can get the href using:
string link = aElement.GetAttribute("href");
Introduction:
Suppose we have HTML like this:
<div class="search-result">
<h2>TV-Series</h2>
<ul>
<li>
<div class="title">
Prison Break : Sequel - First Season
</div>
<span class="subtle count">10 subtitles</span>
</li>
<li>
<div class="title">
Prison Break - Fourth Season
</div>
<span class="subtle count">1232 subtitles</span>
</li>
</ul>
<h2>Popular</h2>
<ul>
<li>
<div class="title">
Prison Break - Fourth Season (2008)
</div>
<div class="subtle count">
1232 subtitles
</div>
</li>
<li>
<div class="title">
Prison Break - Third Season (2007)
</div>
<div class="subtle count">
644 subtitles
</div>
</li>
</ul>
</div>
The page looks something like this, and you can see the original site here: SubScene.
I'm writing a C# desktop application that gets information from this site.
Before I learned HTML Agility Pack, I used regular expressions.
With the pattern <h2>[\s\S]+?</ul> I separated the sections (like TV-Series, Popular, and so on).
Then, with this regex pattern: <li>[\s\S]+?(.+)[\s\S]+?class="subtle count"[\s\S]+?(\d*)[\s\S]+?</li>, I extracted the categorized information from the site.
Using MatchCollection and groups (defined with parentheses), my regex method returned a two-dimensional list for each section, where each row describes a movie and the columns are: movie name, number of subtitles, and subtitle download link.
That two-dimensional list became something like a database.
Now I have learned HTML Agility Pack.
Question:
1. How can I create such a list with HTML Agility Pack and XPath?
2. With which XPath can I create groups, like the regex did before?
Thank you so much.
The comment by Martin Honnen is correct, there isn't really much functionality to provide 'grouping' via XPath. However it is possible to use a loop and run a set of XPaths on sets of elements to extract the data you want.
First, you extract each of the title elements; then you extract each of the list items after each title, and run one final XPath to pull out the values you want from each one.
Note: This code is written using XPaths against an XDocument instead of with HTML Agility Pack, but the XPath should be the same regardless.
var titleNodes = d.XPathSelectElements("/div[@class='search-result']/h2");
foreach (var titleNode in titleNodes)
{
string title = titleNode.Value;
var listItems = titleNode.XPathSelectElements("following-sibling::ul[1]/li");
foreach (var listItem in listItems)
{
var itemData = listItem.XPathEvaluate("div[@class='title']/a/text() | *[@class='subtle count']/text()");
}
}
Note the use of the XPath | operator in the last expression to select the values of multiple different children in a single XPath call. The values are kind of 'grouped' like you wanted.
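The loop-then-group idea can be checked outside of .NET as well. Since Python's standard-library ElementTree doesn't support following-sibling, this sketch walks the direct children in document order instead, which achieves the same grouping on a trimmed copy of the sample page; it is an illustration of the grouping idea, not HtmlAgilityPack code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page: one item per section is enough to show the idea.
html = """
<div class="search-result">
  <h2>TV-Series</h2>
  <ul>
    <li><div class="title">Prison Break - Fourth Season</div>
        <span class="subtle count">1232 subtitles</span></li>
  </ul>
  <h2>Popular</h2>
  <ul>
    <li><div class="title">Prison Break - Third Season (2007)</div>
        <div class="subtle count">644 subtitles</div></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
groups = {}
current = None
# Walk the direct children in document order: each <h2> opens a group,
# and the <ul> right after it supplies that group's rows.
for child in root:
    if child.tag == "h2":
        current = child.text.strip()
        groups[current] = []
    elif child.tag == "ul" and current is not None:
        for li in child.findall("li"):
            title = "".join(li.find("div[@class='title']").itertext()).strip()
            count = "".join(li.find("*[@class='subtle count']").itertext()).strip()
            groups[current].append((title, count))

print(groups)
```

Each h2 heading becomes a key, and the rows under it carry the (title, subtitle count) pairs, which is exactly the "two-dimensional list per section" the question asks for.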
Hi, this is what my HTML file looks like:
<div class="panel-body sozluk">
<ol>
<li>kitap <code>isim</code> </li>
</span> </ol>
</div>
I am required to get the values enclosed by the li tags.
This is my XPath:
//*[@id="wrap"]/div[2]/div[5]/div/div/div[1]/div[1]/div/div[1]/div[2]
This is what I have tried so far
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.Load("word.html");
var v = document.DocumentNode
.SelectNodes("//[@id='wrap']/div[2]/div[5]/div/div/div[1]/div[1]/div/div[1]/div[2]/ol")
.Select(x => x.ChildNodes["li"].InnerText);
The application crashes every time. How can I do this?
First things first, your XPath is invalid because it is missing the star symbol (*) at the beginning:
var v = document.DocumentNode
.SelectNodes("//[@id='wrap']/div[2]/div[5]/....")
^here, right after '//'
Such a verbose XPath is fragile; prefer selecting elements by id, class, or some other attribute. A possible example:
var v = document.DocumentNode
.SelectNodes("//*[@id='wrap']//div[@class='panel-body sozluk']/ol/li")
.Select(o => o.InnerText);
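One subtlety worth noting: the li here nests a <code> element, so "the value enclosed by the li tags" has to concatenate text across child elements, which is what InnerText does. A quick language-agnostic check of that behavior with Python's standard library:

```python
import xml.etree.ElementTree as ET

# The <li> from the question nests a <code> element, so the equivalent of
# InnerText must gather text from the element AND its descendants.
li = ET.fromstring('<li>kitap <code>isim</code> </li>')

inner_text = "".join(li.itertext())  # rough equivalent of HtmlNode.InnerText
print(inner_text.strip())
```

If you only read the element's own text ("kitap ") you would silently drop the content of the nested <code> tag.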
You need to look at your HTML first:
<div class="panel-body sozluk">
<ol>
<li>kitap <code>isim</code> </li>
</span> </ol>
</div>
This is invalid. You have a div, inside which you have an ol, inside which you have an li, inside which you have a code. However, you are closing a span inside your div. The span, if it was opened at all, was opened outside the div that contains its closing tag. Make sure you have valid HTML before you try to extract things from it. And structure your code; I am sure you would have noticed this problem if your code were structured.
Your HTML is kinda messy, but if you don't mind using another package,
use Fizzler for HTMLAgilityPack, that will allow you to use jquery-like selectors to get them instead of xpath.
var liList = document.DocumentNode.QuerySelectorAll("li");
I want to fetch data from a website. I am using HtmlAgilityPack. The website content is like this:
<div id="list">
<div class="list1">
<a href="example1.com" class="href1" >A1</a>
<a href="example4.com" class="href2" />
</div>
<div class="list2">
<a href="example2.com" class="href1" >A2</a>
<a href="example5.com" class="href2" />
</div>
<div class="list3">
<a href="example3.com" class="href1" >A3</a>
<a href="example6.com" class="href2" />
</div>
</div>
Now, I want to fetch the first two links which have class="href1". I am using this code:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='href1'][position()<3]");
But it is not working; it gives all three links. I want to fetch only the first two. How do I do this?
Hey! Now I also want to do one more thing. Above, I have only three links with class="href1". Suppose I have 10 links with class="href1" and I want to fetch only four of them, from the 6th link to the 9th. How can I fetch those particular four links?
Try wrapping the anchor selector in parentheses before applying the position() predicate:
var nodes = doc.DocumentNode.SelectNodes("(//a[@class='href1'])[position()<3]");
Why not just get them all and use the first two from the returned collection? Whatever XPath you would need to do this would ultimately be a lot less readable than using LINQ:
using System.Linq;
...
var firstTwoHrefs = doc.DocumentNode
    .SelectNodes("//a[@class='href1']").Take(2);
I am trying to replace the contents of a selected "div" element, and append it to the parent control. So far I am able to clone and append it to the parent, but I want to know how I can replace certain tags inside.
To be specific, here is the jQuery I use to clone the target control:
var x = $(parent).children('div[class="answer"]:first').children('div[class="ansitem"]:first').clone();
The HTML content inside the cloned div is like this:
<div id="ansthumb_anstext_anscontrols">
<div id="image" class="ansthumb">
replace 1
</div>
<div id="atext" class="anstext">
<p class="atext_para">
<span id="mainwrapper_QRep_ARep_0_UName_0" style="color: rgb(51, 102, 255); font-weight: bold;">Replace 2 </span>
Replace 3
</p>
<p id="answercontrols">
<input name="ctl00$mainwrapper$QRep$ctl01$ARep$ctl01$AnsID" id="mainwrapper_QRep_ARep_0_AnsID_0" value='replace 4' type="hidden">
<a id="mainwrapper_QRep_ARep_0_Like_0" title="Like this answer" href="#">Like</a>
<a id="mainwrapper_QRep_ARep_0_Report_0" title="Report question" href="#">Report</a>
<span id="mainwrapper_QRep_ARep_0_lblDatetime_0" class="date"> replace 5 </span>
</p>
</div>
</div>
Here I have marked all the areas I want replaced. The ids of the above elements look like this because they are generated within a repeater control.
I have gone through the jQuery API, and this function seems to be the thing I should be using, as far as I understand:
replaceWith(content)
The drawback of this approach is that I have to dump the entire HTML into a string variable and include replacement text wherever needed. I don't think that is the best way; selecting particular tags and changing their data seems like the way to do it. Any help appreciated, guys!
thanks
You could use .html() and a couple of other jQuery functions, using the surrounding elements as your selectors.
For example
<script type='text/javascript'>
$("#image").html("YourData1"); //replace 1
var secondSpan = $("#mainwrapper_QRep_ARep_0_UName_0");
$(secondSpan).html("YourData2"); //replace 2
$(secondSpan).after("YourData3"); //replace 3
$("#mainwrapper_QRep_ARep_0_AnsID_0").attr("value", "YourData4"); //replace 4
$("#mainwrapper_QRep_ARep_0_lblDatetime_0").html("YourData5"); //replace 5
</script>
Since these ids are defined by .NET, you can get the ClientID of the .NET control.
For example:
var secondSpan = $("#<%= UName.ClientID %>");
Hope this helps!