HTML XPath Searching by class name - c#

I have a problem with XPath in C#.
I want to find all elements with the structure below.
The page has 10 links, and all of them have this structure:
<div class="PartialSearchResults-item" data-zen="true">
<div class="PartialSearchResults-item-title">
<a class="PartialSearchResults-item-title-link result-link"target="_blank" href='https://www.google.com/'> Google</a>
</div>
<p class="PartialSearchResults-item-url">www.google.com</p>
<p class="PartialSearchResults-item-abstract">Search the world.</p>
</div>
For example, from this sample I want to get "Google", "www.google.com", and "Search the world."
var titles = hd.DocumentNode.SelectNodes("//div[contains(@class, 'PartialSearchResults-item')]");
string link;
foreach (HtmlNode node in titles)
{
    string description = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-abstract')]").InnerText;
    link = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-url')]").InnerText;
    string title = node.SelectSingleNode(".//a[contains(@class,'PartialSearchResults-item-title-link result-link')]").InnerText;
}
But I get a NullReferenceException.

The problem is in the query that selects the titles. You are looking for div elements whose class attribute contains PartialSearchResults-item, which matches your item's root node. But other nodes also satisfy that query: for example, the div with class PartialSearchResults-item-title contains the same substring, so it is selected too. You then iterate over both divs and try to read some child nodes. The first iteration works because you have the right node. In the second iteration the node is the div with class PartialSearchResults-item-title, which contains only an a element, so SelectSingleNode returns null for the description and reading InnerText on it throws a NullReferenceException:
string description = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-abstract')]").InnerText;
I would suggest not using contains here. In your case the root node has exactly one class, PartialSearchResults-item, so you can query it like this:
var titles = hd.DocumentNode.SelectNodes("//div[@class='PartialSearchResults-item']");
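To make the difference concrete, here is a minimal sketch of the whole loop using the exact class match plus null checks. It assumes the HtmlAgilityPack NuGet package is referenced; the class name SearchResultParser and the method Parse are hypothetical names chosen for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class SearchResultParser
{
    // Hypothetical helper: returns one (title, url, description) tuple per result item.
    public static List<(string Title, string Url, string Description)> Parse(string html)
    {
        var results = new List<(string Title, string Url, string Description)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Exact match on the class attribute: only the item root divs are selected,
        // not the nested "-item-title" div that contains() would also match.
        var items = doc.DocumentNode.SelectNodes("//div[@class='PartialSearchResults-item']");
        if (items == null)
        {
            return results; // with default options, SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode item in items)
        {
            // "?." yields null instead of throwing when a child element is missing.
            string title = item.SelectSingleNode(".//a[contains(@class,'PartialSearchResults-item-title-link')]")?.InnerText.Trim();
            string url = item.SelectSingleNode(".//*[@class='PartialSearchResults-item-url']")?.InnerText.Trim();
            string description = item.SelectSingleNode(".//*[@class='PartialSearchResults-item-abstract']")?.InnerText.Trim();
            results.Add((title, url, description));
        }
        return results;
    }
}
```

Note that @class='...' compares the whole attribute value, so it only works for elements that carry exactly that one class; the link keeps contains because it has two classes.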

Related

Looping a node collection gives me unique nodes but selecting nodes inside from these give me the results of the first loop item

Context: Using the HtmlAgilityPack library, I'm looping over an HtmlNodeCollection. Printing the HTML of each node gives me the data that I need, but when I select nodes inside that HTML, every iteration gives me the result from the first item.
Writing out node.InnerHtml gives me the unique HTML of each node, all correct, but when I call SelectSingleNode, all of them return the same data.
Because of the project, I cannot disclose the website. What I can say is that there are 17 nodes, all of them a div with the class k-user-item. All items are unique, meaning their contents differ.
Thanks for the help!
Code:
var nodes = w.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]");
List<Sales> saleList = new List<Sales>();
foreach (HtmlNode node in nodes)
{
    //This line prints correct html, selecting single nodes gives me always the same data of the first item from the loop.
    //Debug.WriteLine(node.InnerHtml);
    string payout = node.SelectSingleNode("//*[@class=\"k-item--buy-date\"]").InnerText;
    string size = node.SelectSingleNode("//*[@class=\"k-panel-title\"]").SelectNodes("//span")[1].InnerText;
    var trNodes = node.SelectNodes("//tr");
    string status = trNodes[1].SelectSingleNode("//b").InnerText;
    string orderId = trNodes[2].SelectNodes("//td")[1].SelectSingleNode("//span").InnerHtml;
    string sellDate = node.SelectSingleNode("//*[@class=\"k-panel-heading\"]").SelectNodes("//small")[1].InnerHtml;
}
This issue was solved by adding a "." to the start of each XPath expression.
Without the leading dot, an expression that starts with // is evaluated against the whole document, not just the subtree of the node it is called on, so every iteration finds the first match in the page.
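A minimal sketch of the corrected pattern, assuming the HtmlAgilityPack NuGet package is referenced. The class name RelativeXPathDemo and the method BuyDates are hypothetical; only the k-user-item and k-item--buy-date class names come from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class RelativeXPathDemo
{
    // Hypothetical helper: collects the buy-date text of every k-user-item block.
    public static List<string> BuyDates(string html)
    {
        var dates = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = doc.DocumentNode.SelectNodes("//div[contains(@class, 'k-user-item')]");
        if (items == null)
        {
            return dates;
        }

        foreach (HtmlNode item in items)
        {
            // ".//" starts the search at the current item;
            // "//" alone would start at the document root on every iteration.
            HtmlNode date = item.SelectSingleNode(".//*[@class='k-item--buy-date']");
            if (date != null)
            {
                dates.Add(date.InnerText.Trim());
            }
        }
        return dates;
    }
}
```

The same applies to chained calls such as SelectNodes("//span") on an inner node: each of them needs the leading dot as well.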

How to find a link in HTML under a certain header AND parse it

I am currently attempting to parse a link from an HTML doc based off the header above it, but no matter what I try, the program is unable to find it.
Here is the method I have that isn't working:
public string findMajorURL(string collegeURL, string major)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(collegeURL);
    var root = doc.DocumentNode;
    var htmlNodes = root.Descendants();
    //Find html node containing the major heading
    foreach(HtmlNode node in htmlNodes)
    {
        if (node.InnerText == major)
        {
            HtmlNode target = node.NextSibling;
            List<string> links = target.Descendants("a").Select(a => a.Attributes["href"].Value).ToList();
            return links.First()+ "__IT WORKED__";
        }
    }
    return "Major not found";
}
This is what the HTML looks like that I am attempting to parse:
<div style="padding-left: 20px">
<h3 id="ent1629">Biological Sciences </h3>
<a href="preview_entity.php?catoid=5&ent_oid=1629&returnto=818">Go to information for this department.</a>
<br>
<p>...</p>
<div id="data_c_1629" style="display: none">...</div>
<!--script language="javascript">hideshow(data_c_1630)</script-->
The major the user inputs is supposed to match the heading, Biological Sciences. Based off of the header, I want to get the link under it, which in this case is preview_entity.php?catoid=5&ent_oid=1629&returnto=818
WARNING: I cannot use XPath with the version of Visual Studio that I have, so I'm assuming that using LINQ would be the best way to go, but I'm not sure.
EDIT It turns out that the Inner Text is not matching the major, however, I don't see how that's possible, as I took it directly from the html code. Any ideas as to what's wrong?
According to the HTML snippet posted, node inside your if block references the <h3> element, and target references the next sibling of <h3>, which is the <a>. That means you don't need target.Descendants("a"). Just get the href attribute from target directly:
if (node.InnerText == major)
{
HtmlNode target = node.NextSibling;
return target.GetAttributeValue("href", "")+ "__IT WORKED__";
}
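On the edit about InnerText not matching: in the snippet shown, the heading text is "Biological Sciences " with a trailing space, so an exact == comparison against "Biological Sciences" would fail. The sketch below is one possible LINQ-only approach under that assumption. It trims and entity-decodes the heading text before comparing, and it walks the siblings because the node right after the <h3> may be a whitespace text node rather than the <a>. The names MajorLinkFinder and FindLinkAfterHeading are hypothetical, and the html parameter is assumed to hold the page markup itself (LoadHtml parses a string of HTML; it does not download a URL).

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class MajorLinkFinder
{
    // Hypothetical helper: returns the href of the first <a> that follows the
    // matching <h3>, or null when no heading or link is found.
    public static string FindLinkAfterHeading(string html, string heading)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Compare decoded, trimmed text so trailing spaces or entities do not break the match.
        HtmlNode h3 = doc.DocumentNode.Descendants("h3")
            .FirstOrDefault(h => string.Equals(
                HtmlEntity.DeEntitize(h.InnerText).Trim(),
                heading.Trim(),
                StringComparison.OrdinalIgnoreCase));
        if (h3 == null)
        {
            return null;
        }

        // Walk the following siblings, skipping text nodes, until a link or the next heading.
        for (HtmlNode sibling = h3.NextSibling; sibling != null; sibling = sibling.NextSibling)
        {
            if (sibling.Name == "a")
            {
                return sibling.GetAttributeValue("href", "");
            }
            if (sibling.Name == "h3")
            {
                break; // reached the next department without finding a link
            }
        }
        return null;
    }
}
```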

Select a Node which has specified subnodes

I have to write a web scraper. My php page is:
<a href="Something.php">
<div class="SPECIFIEDCLASS" title="other something">
</div>
</a>
What I wrote so far is:
var diiv = doc.DocumentNode.SelectNodes("//a/div[@class='SPECIFIEDCLASS']");
var hrefLiist = diiv.Select(q => q.GetAttributeValue("href", "not found")).ToList();
but it's not working.
Your XPath expression selects div tags with the specified class within a tags.
But what you want are the a tags with div tags with the specified class. You should instead use this XPath expression:
var diiv = doc.DocumentNode.SelectNodes("//a[div[@class='SPECIFIEDCLASS']]");
For a more visual explanation:
Your XPath does this to each a tag:
Get a tag.
Get child div tag.
Select div tags with Class = "SPECIFIEDCLASS". So ultimately, the div tags are themselves selected
The correct XPath should do this:
Get a tag.
Select a tags where:
Child div tag has Class = "SPECIFIEDCLASS". Here the a tags are selected.
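Put together, the corrected query and the href extraction could look like the following sketch. It assumes the HtmlAgilityPack NuGet package is referenced; WrappingLinkFinder and Hrefs are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class WrappingLinkFinder
{
    // Hypothetical helper: returns the href of every <a> that has a direct child
    // <div class="SPECIFIEDCLASS">.
    public static List<string> Hrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate [div[...]] filters the <a> elements; the <a> itself is what gets selected.
        var anchors = doc.DocumentNode.SelectNodes("//a[div[@class='SPECIFIEDCLASS']]");
        if (anchors == null)
        {
            return new List<string>();
        }
        return anchors.Select(a => a.GetAttributeValue("href", "not found")).ToList();
    }
}
```

Because the selected nodes are now the a elements, GetAttributeValue("href", ...) reads the attribute from the right node.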

Inner text of Node ignoring inner text of children

Pardon me if it sounds too simple to be asked here but since this is my very first day with html-agility-pack, I am unable to sort out a way to select the inner text of a node which is the direct child of the node and ignoring inner text of the children nodes.
For example
<div id="div1">
<div class="h1"> this needs to be selected
<small> and not this</small>
</div>
</div>
currently I am trying this
HtmlDocument page = new HtmlWeb().Load(url);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']");
string selText = s.InnerText;
which returns the whole text (e.g- this needs to be selected and not this).
Any suggestions??
The div could have multiple text nodes if there is text both before and after its children. I think the best way to get all the direct text content of a node is to do something like:
HtmlDocument page = new HtmlWeb().Load(url);
var nodes = page.DocumentNode.SelectNodes("//div[@id='div1']//div[@class='h1']/text()");
StringBuilder sb = new StringBuilder();
foreach(var node in nodes)
{
sb.Append(node.InnerText);
}
string content = sb.ToString();
You can use the /text() option to get all text nodes directly under a specific tag. If you only need the first one, add [1] to it:
page.LoadHtml(text);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']/text()[1]");
string selText = s.InnerText;
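As an alternative to the /text() step, the same result can be obtained without XPath by filtering the node's direct children by node type. This is only a sketch of that variant, assuming the HtmlAgilityPack NuGet package is referenced; DirectTextReader and DirectText are hypothetical names.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class DirectTextReader
{
    // Hypothetical helper: concatenates only the text nodes that are direct
    // children of the given node, ignoring text inside child elements.
    public static string DirectText(HtmlNode node)
    {
        var sb = new StringBuilder();
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                sb.Append(child.InnerText);
            }
        }
        return sb.ToString().Trim();
    }
}
```

A possible call site would be DirectTextReader.DirectText(page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']")), after checking that the selected node is not null.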

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code does not do what you describe, but you have 2 options:
Once you have the node for the div, use .Descendants("a") (or walk its children) to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to select the a tag instead: //div[@class='novica']/p/a.
The first is the better choice if you also need the .InnerText of the div to get Some text..., while the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
    var links = node.Descendants("a").Select(n => n.GetAttributeValue("href", "")).ToList();
}
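For the first option, where both the text and the URL are needed, a sketch could return them together per div. It assumes the HtmlAgilityPack NuGet package is referenced and uses the class name content from the HTML sample; ContentLinkReader and Read are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class ContentLinkReader
{
    // Hypothetical helper: for each <div class="content">, returns its full text
    // and the href of the first link inside it (empty string when there is none).
    public static List<(string Text, string Href)> Read(string html)
    {
        var results = new List<(string Text, string Href)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var divs = doc.DocumentNode.SelectNodes("//div[@class='content']");
        if (divs == null)
        {
            return results;
        }

        foreach (HtmlNode div in divs)
        {
            HtmlNode link = div.Descendants("a").FirstOrDefault();
            string href = link == null ? "" : link.GetAttributeValue("href", "");
            results.Add((div.InnerText.Trim(), href));
        }
        return results;
    }
}
```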
