Select a Node which has specified subnodes - c#

I have to write a web scraper. My php page is:
<a href="Something.php">
<div class="SPECIFIEDCLASS" title="other something">
</div>
</a>
What I wrote so far is:
var diiv = doc.DocumentNode.SelectNodes("//a/div[@class='SPECIFIEDCLASS']");
var hrefLiist = diiv.Select(q => q.GetAttributeValue("href", "not found")).ToList();
but it's not working.

Your XPath expression selects div tags with the specified class within a tags.
But what you want are the a tags with div tags with the specified class. You should instead use this XPath expression:
var diiv = doc.DocumentNode.SelectNodes("//a[div[@class='SPECIFIEDCLASS']]");
For a more visual explanation:
Your XPath does this for each a tag:
Get the a tag.
Get its child div tags.
Select the div tags with class = "SPECIFIEDCLASS". So ultimately, the div tags themselves are selected, and they have no href attribute.
The correct XPath should do this:
Get the a tag.
Select the a tags where:
A child div tag has class = "SPECIFIEDCLASS". Here the a tags are selected.
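To make the difference concrete, here is a minimal sketch, assuming HtmlAgilityPack is referenced. The class name AnchorByChildDemo, the helper HrefsOfAnchorsWithDiv and the second sample link are made up for illustration; they are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorByChildDemo
{
    // Hypothetical helper: returns the href of every <a> that has a
    // direct child <div> whose class attribute equals cssClass.
    public static List<string> HrefsOfAnchorsWithDiv(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate [div[...]] filters the <a> elements; the result
        // set still consists of <a> nodes, not of the <div> nodes.
        var anchors = doc.DocumentNode.SelectNodes($"//a[div[@class='{cssClass}']]");

        // SelectNodes returns null rather than an empty collection when nothing matches.
        if (anchors == null)
            return new List<string>();

        return anchors.Select(a => a.GetAttributeValue("href", "not found")).ToList();
    }

    public static void Main()
    {
        // Sample markup: only the first link wraps a div with the wanted class.
        const string html =
            @"<a href=""Something.php""><div class=""SPECIFIEDCLASS"" title=""other something""></div></a>
<a href=""Other.php""><div class=""different""></div></a>";

        foreach (var href in HrefsOfAnchorsWithDiv(html, "SPECIFIEDCLASS"))
            Console.WriteLine(href);
    }
}
```

With the sample markup above, only Something.php should be printed, because the second link's div has a different class.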

Related

Only grab some innertext from a SelectNode with HtmlAgilityPack

I've been using HtmlAgilityPack in order to parse some html in a web page. The current html looks like this:
<div class="price__child price__price flex-child__auto tooltip-container">
<div class="price__min-order tooltip-container js-minOrder">
<i>⚠️</i>
<div class="price__min-order-tooltip tooltip">
Minimum order of $15.00.
</div>
</div>
$1.75
</div>
I only want to retrieve the text of the price at the very end, in this case the $1.75. Doing something like the code below returns that number, but also all of the other text within the larger div.
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');
Is there a way to exclude/not grab the innertext from the price__min-order tooltip-container js-minOrder and also the price__min-order-tooltip tooltip, and only grab the 1.75 from the larger div?
I found a way to do it. If you select the unwanted child node and call Remove() on it, its text no longer appears in the parent's InnerText.
var priceNode = node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
?.ChildNodes[1];
priceNode?.Remove();
return node
.SelectSingleNode(".//div[contains(@class, 'price__child price__price')]")
.InnerText
.Trim().Replace(" ", "")
.TrimStart('$');
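If you would rather not modify the document, another option is to read only the text nodes that are direct children of the price div. This is a rough sketch, assuming HtmlAgilityPack; DirectTextDemo and OwnText are hypothetical names, and the sample markup is a shortened version of the HTML in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DirectTextDemo
{
    // Hypothetical helper: joins only the text nodes that are direct
    // children of the node, skipping text inside nested elements.
    public static string OwnText(HtmlNode node)
    {
        var parts = node.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0);
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Shortened sample: the tooltip text sits in a nested div,
        // the price is a bare text node of the outer div.
        const string html =
            @"<div class=""price__child price__price"">
<div class=""price__min-order""><i>!</i><div class=""tooltip"">Minimum order of $15.00.</div></div>
$1.75
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var priceDiv = doc.DocumentNode
            .SelectSingleNode("//div[contains(@class, 'price__price')]");

        // Nothing is removed from the document; the nested tooltip text is simply ignored.
        Console.WriteLine(OwnText(priceDiv).TrimStart('$'));
    }
}
```

For the sample above this should print 1.75. Unlike the ChildNodes[1] approach, it does not depend on the position of the unwanted child.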

HTML XPath Searching by class name

I have a problem with XPath in C#.
I want to find all elements with the structure below.
I have 10 links, all of which have this structure:
<div class="PartialSearchResults-item" data-zen="true">
<div class="PartialSearchResults-item-title">
<a class="PartialSearchResults-item-title-link result-link"target="_blank" href='https://www.google.com/'> Google</a>
</div>
<p class="PartialSearchResults-item-url">www.google.com</p>
<p class="PartialSearchResults-item-abstract">Search the world.</p>
</div>
For example, from this sample I want to get "Google", "www.google.com" and "Search the world."
var titles = hd.DocumentNode.SelectNodes("//div[contains(@class, 'PartialSearchResults-item')]");
string link;
foreach (HtmlNode node in titles)
{
string description = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-abstract')]").InnerText;
link = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-url')]").InnerText;
string title = node.SelectSingleNode(".//a[contains(@class,'PartialSearchResults-item-title-link result-link')]").InnerText;
}
But I get a null reference error.
The problem is in the query that selects titles. You are looking for div elements whose class attribute contains PartialSearchResults-item, which is meant to match each item's root node. But other nodes satisfy that query too: for example, the div with class PartialSearchResults-item-title also contains that substring. So both divs are selected, and you iterate over both. The first iteration works because node is the item's root. In the second iteration node is the PartialSearchResults-item-title div, which contains only the a element, so SelectSingleNode returns null for the description and reading InnerText on it throws a NullReferenceException:
string description = node.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-abstract')]").InnerText;
I would suggest not using contains here. In your case the root node has only one class, PartialSearchResults-item, so you can query it like this:
var titles = hd.DocumentNode.SelectNodes("//div[@class='PartialSearchResults-item']");
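If the root div might carry additional classes, an exact @class comparison stops matching. A common XPath 1.0 idiom is to compare against the space-padded class list so that only whole class tokens match. Below is a sketch of that idea with null-safe lookups, assuming HtmlAgilityPack; SearchResultDemo, Parse and the tuple field names are illustrative, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SearchResultDemo
{
    // Hypothetical helper: extracts (title, url, summary) for each result item.
    public static List<(string Title, string Url, string Summary)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var results = new List<(string Title, string Url, string Summary)>();

        // Padding the class list with spaces makes contains() match the whole
        // token only, so PartialSearchResults-item-title is not selected.
        var items = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' PartialSearchResults-item ')]");
        if (items == null)
            return results;

        foreach (var item in items)
        {
            // ?. yields null instead of throwing when a child element is missing.
            string title = item.SelectSingleNode(".//a[contains(@class,'PartialSearchResults-item-title-link')]")?.InnerText.Trim();
            string url = item.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-url')]")?.InnerText.Trim();
            string summary = item.SelectSingleNode(".//*[contains(@class,'PartialSearchResults-item-abstract')]")?.InnerText.Trim();
            results.Add((title, url, summary));
        }
        return results;
    }

    public static void Main()
    {
        const string html =
            @"<div class=""PartialSearchResults-item"" data-zen=""true"">
<div class=""PartialSearchResults-item-title"">
<a class=""PartialSearchResults-item-title-link result-link"" target=""_blank"" href='https://www.google.com/'> Google</a>
</div>
<p class=""PartialSearchResults-item-url"">www.google.com</p>
<p class=""PartialSearchResults-item-abstract"">Search the world.</p>
</div>";

        foreach (var r in Parse(html))
            Console.WriteLine($"{r.Title} | {r.Url} | {r.Summary}");
    }
}
```

For the sample item this should yield a single result with the title Google, the URL www.google.com and the abstract text.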

How to extract text inside a div tag using htmlagilitypack

I want to extract the text "Some text goes here" between the div class.
I am using html agility pack, and c#
<div class="productDescriptionWrapper">
Some Text Goes here...
<div class="emptyClear"> </div>
</div>
this is what I have :
Description = doc.DocumentNode.SelectNodes("//div[@class=\"productDescriptionWrapper\").Descendants("div").Select(x => x.InnerText).ToList();
I get this error :
An unhandled exception of type 'System.NullReferenceException'
I know how to extract the text if it is between <h1> or <p> tags: instead of "div" in Descendants I would pass "h1" or "p".
Could somebody please assist?
Use single quotes, such as
//div[@class='productDescriptionWrapper']
To get all descendants of all types, use:
//div[@class='productDescriptionWrapper']//*
To get all descendants of a specific type, such as a p, use //div[@class='productDescriptionWrapper']//p.
To get all descendants that are either a div or a p:
//div[@class='productDescriptionWrapper']//*[self::div or self::p]
Say you wanted to get all non-blank descendant text nodes; then use:
//div[@class='productDescriptionWrapper']//text()[normalize-space()]
There is no way you can get a NullReferenceException, given that doc is created from the HTML snippet you posted. Anyway, if you meant to get the text within the outer <div> but not from the inner one, use the XPath step /text(), which means: get the direct child text nodes.
For example, given this HTML snippet:
var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Don't get this one</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
...this expression returns the text from the outer <div> only:
var Description = doc.DocumentNode
.SelectNodes("//div[@class='productDescriptionWrapper']/text()")
.Select(x => x.InnerText.Trim())
.First();
//Description :
//"Some Text Goes here..."
...while, in contrast, the following returns all the text:
var Description = doc.DocumentNode
.SelectNodes("//div[@class='productDescriptionWrapper']")
.Select(x => x.InnerText.Trim())
.First();
//Description :
//"Some Text Goes here...
//Don't get this one"

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element or not?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = font.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy, and I'd rather just always remove the font element, if present, so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't managed to make it work. Any ideas?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();

Select link inside div tag

I would like to get a link (URL to be specific) inside a div class. This is the code I have that gets me the text inside div class (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code is not valid based on what you wrote; however, you have two options:
Once you have the node for the div, use .Descendants("a") or its child nodes to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to get the a tag instead: //div[@class='novica']/p/a.
The first is obviously better if you also need the .InnerText of that element to get Some text...; however, the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
var links = node.Descendants("a").Select(n => n.InnerText).ToList();
}
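Note that InnerText gives the link's caption (LINK), not its URL. To read the URL itself you would take the href attribute. A minimal sketch of the second option, assuming HtmlAgilityPack; LinkUrlDemo and LinkUrls are hypothetical names, and the sample markup assumes the link in the question is a plain a element with an href.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkUrlDemo
{
    // Hypothetical helper: returns the href of every <a> found anywhere
    // inside a <div> whose class attribute equals divClass.
    public static List<string> LinkUrls(string html, string divClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // [@href] skips anchors that have no href attribute at all.
        var anchors = doc.DocumentNode.SelectNodes($"//div[@class='{divClass}']//a[@href]");

        // SelectNodes returns null when nothing matches, so guard before iterating.
        if (anchors == null)
            return new List<string>();

        return anchors.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }

    public static void Main()
    {
        // Assumed shape of the markup from the question.
        const string html =
            @"<div class=""content"">
<p>Some text...
<a href=""www.google.com"">LINK</a>
</p>
</div>";

        foreach (var url in LinkUrls(html, "content"))
            Console.WriteLine(url);
    }
}
```

The class name passed in has to match the one in the page (content in the posted HTML, novica in the posted XPath), otherwise SelectNodes finds nothing.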
