Grouping Results in XPath - c#

Introduction:
Suppose we have HTML like this:
<div class="search-result">
<h2>TV-Series</h2>
<ul>
<li>
<div class="title">
Prison Break : Sequel - First Season
</div>
<span class="subtle count">10 subtitles</span>
</li>
<li>
<div class="title">
Prison Break - Fourth Season
</div>
<span class="subtle count">1232 subtitles</span>
</li>
</ul>
<h2>Popular</h2>
<ul>
<li>
<div class="title">
Prison Break - Fourth Season (2008)
</div>
<div class="subtle count">
1232 subtitles
</div>
</li>
<li>
<div class="title">
Prison Break - Third Season (2007)
</div>
<div class="subtle count">
644 subtitles
</div>
</li>
</ul>
</div>
You can see the original site here: SubScene
I'm writing a C# desktop application that gets information from this site.
Before I learned HTML Agility Pack, I used regular expressions.
With this pattern: <h2>[\s\S]+?</ul> I separated the sections (TV-Series, Popular, and so on).
Then, with this regular expression: <li>[\s\S]+?(.+)[\s\S]+?class="subtle count"[\s\S]+?(\d*)[\s\S]+?</li> I extracted the categorized information from the site.
Using MatchCollection and groups (defined with parentheses), my regex method returned a two-dimensional list for each section, in which each row describes one movie and the columns are: movie name, number of subtitles, and subtitle download link.
That two-dimensional list then worked like a small database.
Now I have learned HTML Agility Pack.
Question:
1- How can I create such a list in HTML Agility Pack with XPath?
2- Which XPath expression lets me create groups like the regex shown above?
Thank you so much.

The comment by Martin Honnen is correct: XPath offers very little in the way of 'grouping'. However, you can use a loop and run a set of XPaths on sets of elements to extract the data you want.
First, you extract each of the title elements (the h2 headings), then extract the list items that follow each title, and run one final XPath to pull out the values you want from each item.
Note: This code runs XPaths against an XDocument instead of using HTML Agility Pack, but the XPath should be the same regardless.
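var titleNodes = d.XPathSelectElements("/div[@class='search-result']/h2");
foreach (var titleNode in titleNodes)
{
    // Dump() is LINQPad's output helper; outside LINQPad, use titleNode.Value directly.
    string title = titleNode.Value.Dump();
    // The first <ul> after this heading holds the heading's items.
    var listItems = titleNode.XPathSelectElements("following-sibling::ul[1]/li");
    foreach (var listItem in listItems)
    {
        // The /a step assumes each title is wrapped in a link, which the sample markup above does not show.
        var itemData = listItem.XPathEvaluate("div[@class='title']/a/text() | *[@class='subtle count']/text()");
    }
}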
Note the use of the XPath | operator in the last expression to select the values of multiple different children in a single XPath call. The values are kind of 'grouped' like you wanted.
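As a rough, self-contained sketch of the loop described above, the following program builds one list of rows per heading. It uses System.Xml.XPath on a trimmed, well-formed copy of the markup rather than HTML Agility Pack. The names GroupingSketch, SampleHtml and Extract are made up for this illustration, and the digit-stripping used to parse the count is an assumption about the "1232 subtitles" format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class GroupingSketch
{
    // Trimmed, well-formed copy of the markup from the question (illustrative only).
    public const string SampleHtml = @"<div class='search-result'>
  <h2>TV-Series</h2>
  <ul>
    <li><div class='title'>Prison Break - Fourth Season</div><span class='subtle count'>1232 subtitles</span></li>
  </ul>
  <h2>Popular</h2>
  <ul>
    <li><div class='title'>Prison Break - Third Season (2007)</div><div class='subtle count'>644 subtitles</div></li>
  </ul>
</div>";

    // Returns one list of (name, subtitle count) rows per <h2> heading.
    public static Dictionary<string, List<(string Name, int Subtitles)>> Extract(string html)
    {
        var doc = XDocument.Parse(html);
        var groups = new Dictionary<string, List<(string Name, int Subtitles)>>();
        foreach (var heading in doc.XPathSelectElements("/div[@class='search-result']/h2"))
        {
            var rows = new List<(string Name, int Subtitles)>();
            // Only the first <ul> after the heading belongs to this group.
            foreach (var item in heading.XPathSelectElements("following-sibling::ul[1]/li"))
            {
                string name = item.XPathSelectElement("div[@class='title']").Value.Trim();
                string countText = item.XPathSelectElement("*[@class='subtle count']").Value;
                // Assumption: the count is the only run of digits in the text.
                int count = int.Parse(new string(countText.Where(c => char.IsDigit(c)).ToArray()));
                rows.Add((name, count));
            }
            groups[heading.Value.Trim()] = rows;
        }
        return groups;
    }

    public static void Main()
    {
        foreach (var group in Extract(SampleHtml))
            foreach (var row in group.Value)
                Console.WriteLine($"{group.Key}: {row.Name} ({row.Subtitles})");
    }
}
```

Keying a dictionary by heading text is only one possible shape for the result; a list of lists would match the two-dimensional structure from the question just as well.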

Related

Selenium XPath Query - FindElement After Text

I am trying to get a link in a website which changes name on a daily basis. The structure is similar to this (but with many more levels):
<li>
<div class = "contentPlaceHolder1">
<div class="content">
<p>
<strong>'Today's File Here:</strong>
</p>
</div>
</div>
</li>
<li>...</li>
<li>...</li>
<li>...</li>
<li>
<div class = "contentPlaceHolder1">
<div class="content">
<div class="DocLink">
<li>
Download
</li>
</div>
</div>
</div>
</li>
<li>...</li>
etc...
If I find the text (which will remain constant) which is immediately above it in the page by using
IWebElement foundTextElement = chrome.FindElement(By.XPath("//p/strong['Today's File Here:']"));
How can I find the next link in the page by using XPath (or alternative solution)? I am unsure of how to search for the next element after this.
If I use
IWebElement link = chrome.FindElement(By.XPath("//a[@class='txtLnk']"));
then this finds the first link in the page. I only want the first occurrence of it after 'foundTextElement'.
I have had it working by navigating up the tree to the parent above <li>, and finding the 4th sibling using By.XPath("following-sibling::*[4]/div/div/div/li/a[@class='txtLnk']") but that seems a little precarious to me.
I could parse the HTML until I find the next occurrence, but I was wondering whether there is a cleverer way of doing this?
Thanks.
You can try this XPath. It is complicated because we cannot see the rest of the page to optimize it:
//li[preceding-sibling::li[.//*[contains(text(),'File Here')]]][.//a[contains(@class,'txtLnk')]][1]
It selects the first li that contains an a tag with the txtLnk class and that comes after an li element whose text contains File Here.
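To see how that expression behaves, here is a small sketch that evaluates it with System.Xml.XPath against a hand-made XML fragment instead of a live Selenium session. The <a class='txtLnk'> elements and their href values are assumptions (the markup in the question only shows the text "Download"), and FollowingLinkSketch, Sample and FindHrefAfterMarker are hypothetical names.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FollowingLinkSketch
{
    // Simplified, well-formed stand-in for the page.
    // Assumption: links look like <a class='txtLnk' href='...'>; the hrefs are invented.
    public const string Sample = @"<ul>
  <li><a class='txtLnk' href='/earlier.pdf'>Earlier link</a></li>
  <li><div class='content'><p><strong>Today's File Here:</strong></p></div></li>
  <li>no link here</li>
  <li><div class='DocLink'><a class='txtLnk' href='/today.pdf'>Download</a></div></li>
  <li><div class='DocLink'><a class='txtLnk' href='/later.pdf'>Download</a></div></li>
</ul>";

    public static string FindHrefAfterMarker(string markup)
    {
        // First <li> that follows the marker <li> and contains a txtLnk link.
        XElement item = XDocument.Parse(markup).XPathSelectElement(
            "//li[preceding-sibling::li[.//*[contains(text(),'File Here')]]]" +
            "[.//a[contains(@class,'txtLnk')]][1]");
        // The expression selects the <li>, so step down to the link inside it.
        XElement link = item?.XPathSelectElement(".//a[contains(@class,'txtLnk')]");
        return (string)link?.Attribute("href");
    }

    public static void Main()
    {
        Console.WriteLine(FindHrefAfterMarker(Sample)); // prints "/today.pdf"
    }
}
```

The link before the marker and the list item without a link are both skipped, which is the behaviour the question asks for. With Selenium the same string would be passed to By.XPath, followed by a second lookup for the a element inside the returned li.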
By.XPath("//a[@class='txtLnk']")
is a very generic selector; there might be other elements on the page using the same class.
You can find the link with a CssSelector instead. Try this:
IWebElement aElement = chrome.FindElement(By.CssSelector("div.contentPlaceHolder1 div.content div.DocLink li a"));
Then you can get the href using:
string link = aElement.GetAttribute("href");

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
// This should select only the <div> tags with the class acTrigger.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
    // Not sure how to dig further in to get the href values from each of the <a> tags.
}
and the site's markup looks along the lines of this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are contained inside those <a> tags, which are nested inside the <div class="acTrigger"> tags. It would be simple if each <a> had its own class, but unfortunately only the <div> tags have classes. What I need to do is grab each one of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump, then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to want to extract all the URLs from a page, not specific ones. I just need a link to an example or documentation; any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is the <a> tag instead of the div.
To store the links you can use GetAttributeValue.
// Note: SelectNodes returns null when nothing matches, so check before looping on real pages.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
    // Get the value of the href attribute, or an empty string if it is missing.
    string hrefValue = node.GetAttributeValue("href", string.Empty);
    // Then store hrefValue for later.
}
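A minimal sketch of the "store them for later" step, written with System.Xml.XPath so that it runs without any extra package; the XPath is the same one used with HtmlAgilityPack above. HrefSketch, Sample and CollectHrefs are hypothetical names, and the third list item (class 'other') is invented to show that links outside acTrigger are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class HrefSketch
{
    // Shortened copy of the markup from the question; the last <li> is made up.
    public const string Sample = @"<ul>
  <li><div class='acTrigger'><a href='/16014988/d/'>Battery <em> (1)</em></a></div></li>
  <li><div class='acTrigger'><a href='/15568540/d/'>Brakes <em> (2)</em></a></div></li>
  <li><div class='other'><a href='/ignored/'>Not wanted</a></div></li>
</ul>";

    // Collects the href of every <a> that is a direct child of <div class='acTrigger'>.
    public static List<string> CollectHrefs(string markup)
    {
        return XDocument.Parse(markup)
            .XPathSelectElements("//div[@class='acTrigger']/a")
            .Select(a => (string)a.Attribute("href") ?? string.Empty)
            .ToList();
    }

    public static void Main()
    {
        foreach (string href in CollectHrefs(Sample))
            Console.WriteLine(href);
    }
}
```

Keeping the hrefs in a List<string> makes it straightforward to loop over them later and load each page in turn.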

Getting text enclosed by <li> tags

Hi, this is what my HTML file looks like:
<div class="panel-body sozluk">
<ol>
<li>kitap <code>isim</code> </li>
</span> </ol>
</div>
I need to get the values enclosed by the "li" tags.
This is my XPath:
//*[@id="wrap"]/div[2]/div[5]/div/div/div[1]/div[1]/div/div[1]/div[2]
This is what I have tried so far:
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.Load("word.html");
var v = document.DocumentNode
    .SelectNodes("//[@id='wrap']/div[2]/div[5]/div/div/div[1]/div[1]/div/div[1]/div[2]/ol ")
    .Select(x => x.ChildNodes["li"].InnerText);
The application crashes every time. How can I do this?
First things first, your XPath is invalid because it is missing the star symbol (*) at the beginning:
var v = document.DocumentNode
    .SelectNodes("//[@id='wrap']/div[2]/div[5]/....")
                    ^ here, right after '//'
Such a verbose XPath is fragile. Always prefer selecting elements by id, class, or some other attribute. A possible example:
var v = document.DocumentNode
    .SelectNodes("//*[@id='wrap']//div[@class='panel-body sozluk']/ol/li")
    .Select(o => o.InnerText);
You need to look at your HTML first:
<div class="panel-body sozluk">
<ol>
<li>kitap <code>isim</code> </li>
</span> </ol>
</div>
This is invalid. You have a div containing an ol, which contains an li, which contains a code element. However, you are closing a span inside your div. The span, if it was opened at all, was opened outside the div that contains its closing tag. Make sure you have valid HTML before you try to extract things from it. And structure your code; I am sure you would have noticed this problem if your code had been structured.
Your HTML is kinda messy, but if you don't mind using another package,
use Fizzler for HtmlAgilityPack, which lets you use jQuery-like selectors instead of XPath.
var liList = document.DocumentNode.QuerySelectorAll("li");

Get First Input in first ul and its first li

Here's the markup:
<h3>Customers</h3>
<ul>
<li>
<label for="customer115">Some Customer Name</label>
<input id="customer115" type="checkbox" value="115" name="customer115">
</li>
.. there are more <li> here...and so on
</ul>
<h3>Dealers</h3>
<ul>
<li>
<label for="dealer100">Some DealerName</label>
<input id="dealer100" type="checkbox" value="115" name="dealer115">
</li>
.. there are more <li> here...and so on
</ul>
I'm trying to get reference to the customer checkbox for example so I can do a click() on it via XPath. I'm doing this in Selenium so something like:
string sXPath = string.Format(string.Format("//h3[text()='{0}']/ul/li/input[1]", "Customers"));
IWebElement firstCompanyCheckbox = GetElementByXPath(sXPath);
firstCompanyCheckbox.Click();
So far I can't figure out how to get to this reference, the above xPath does not find it. I want to click that checkbox.
The ul is not a child of h3; it is a sibling. Adjust your XPath to use the following-sibling:: axis:
//h3[text()='{0}']/following-sibling::ul/li/input[1]
If you want to ensure that you select the first ul and the first li, then add additional predicate filters:
//h3[text()='{0}']/following-sibling::ul[1]/li[1]/input[1]
In my opinion, you could also simplify that XPath by using an easier-to-read CSS selector. Note that :first is a jQuery extension rather than standard CSS, so with a plain CSS selector engine the equivalent is:
ul:first-of-type li:first-child input[type='checkbox']
I'm sure there will be a lot of debate as to which is preferable: CSS vs XPath. But typically, when I see my QA going the route of a complex XPath query, I try to find ways to add "id" attributes to the elements or simplify the DOM for selecting.
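The effect of the following-sibling:: axis and the [1] predicates can be checked outside Selenium with a short sketch like the one below. It uses System.Xml.XPath on a well-formed copy of the markup (the input tags are self-closed so the fragment parses as XML, and the second customer is invented); SiblingSketch, Sample and FirstCheckboxId are hypothetical names.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class SiblingSketch
{
    // Well-formed copy of the markup; customer116 is a made-up extra row.
    public const string Sample = @"<div>
  <h3>Customers</h3>
  <ul>
    <li><label for='customer115'>Some Customer Name</label><input id='customer115' type='checkbox' value='115' name='customer115'/></li>
    <li><label for='customer116'>Another Customer</label><input id='customer116' type='checkbox' value='116' name='customer116'/></li>
  </ul>
  <h3>Dealers</h3>
  <ul>
    <li><label for='dealer100'>Some DealerName</label><input id='dealer100' type='checkbox' value='115' name='dealer115'/></li>
  </ul>
</div>";

    // Returns the id of the first checkbox in the first <ul> after the given heading.
    public static string FirstCheckboxId(string heading)
    {
        string xpath = string.Format(
            "//h3[text()='{0}']/following-sibling::ul[1]/li[1]/input[1]", heading);
        XElement input = XDocument.Parse(Sample).XPathSelectElement(xpath);
        return (string)input?.Attribute("id");
    }

    public static void Main()
    {
        Console.WriteLine(FirstCheckboxId("Customers")); // prints "customer115"
        Console.WriteLine(FirstCheckboxId("Dealers"));   // prints "dealer100"
    }
}
```

Without ul[1], the Customers query would also reach the Dealers list, because every later ul is a following sibling of the first h3.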

Related to predicates in HtmlAgilityPack

I want to fetch data from a website using HtmlAgilityPack. The website content looks like this:
<div id="list">
<div class="list1">
<a href="example1.com" class="href1" >A1</a>
<a href="example4.com" class="href2" />
</div>
<div class="list2">
<a href="example2.com" class="href1" >A2</a>
<a href="example5.com" class="href2" />
</div>
<div class="list3">
<a href="example3.com" class="href1" >A3</a>
<a href="example6.com" class="href2" />
</div>
</div>
Now, I want to fetch the first two links that have class="href1". I am using this code:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='href1'][position()<3]");
But it is not working: it returns all three links. I want to fetch only the first two. How can I do this?
One more thing I would like to do:
Above, I have only three links with class="href1". Suppose I have 10 links with class="href1" and I want to fetch only four of them, the 6th through the 9th. How do I fetch these particular four links?
Try wrapping the anchor selector in parentheses before applying the position() function. Without the parentheses, position() is evaluated relative to each link's parent div, where every matching link is the first one:
var nodes = doc.DocumentNode.SelectNodes("(//a[@class='href1'])[position()<3]");
Why not just get them all and use the first two from the returned collection? Whatever XPath you would need to do this would ultimately be a hell of a lot less readable than using LINQ:
using System.Collections.Generic;
using System.Linq;
...
// Take returns IEnumerable<HtmlNode>, not HtmlNodeCollection.
IEnumerable<HtmlNode> firstTwoHrefs = doc.DocumentNode
    .SelectNodes("//a[@class='href1']").Take(2);
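For the follow-up about fetching the 6th through 9th links, the same parenthesized form can take a range condition. The sketch below is an illustration with System.Xml.XPath on generated data, not HtmlAgilityPack code: PositionRangeSketch, BuildSample and LinksSixToNine are hypothetical names, and the ten links are made up to match the "suppose I have 10 links" scenario.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class PositionRangeSketch
{
    // Builds <div id='list'> with ten <div> children, each holding one <a class='href1'>.
    public static XDocument BuildSample()
    {
        var root = new XElement("div", new XAttribute("id", "list"),
            Enumerable.Range(1, 10).Select(i =>
                new XElement("div", new XAttribute("class", "list" + i),
                    new XElement("a",
                        new XAttribute("href", "example" + i + ".com"),
                        new XAttribute("class", "href1"),
                        "A" + i))));
        return new XDocument(root);
    }

    // The parentheses make position() count across the whole document,
    // not within each parent <div>.
    public static string[] LinksSixToNine(XDocument doc)
    {
        return doc
            .XPathSelectElements("(//a[@class='href1'])[position() >= 6 and position() <= 9]")
            .Select(a => a.Value)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", LinksSixToNine(BuildSample())));
        // prints "A6, A7, A8, A9"
    }
}
```

The LINQ route from the answer above gives the same slice: select every matching link, then call Skip(5).Take(4).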
