XPath: how to get child nodes - C#

I have the following HTML:
<ul class="enh-toggle">
<li>
Design<sup>1</sup><span class="accordion"></span>
<ul id="design">
<li>
<strong>Dimensions</strong>
<ul><li>length:12.3cm</li></ul>
</li>
</ul>
</li>
</ul>
I use the following code to get ul[@id='design']:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");
This works perfectly.
Now my question is: how can I get the text of the strong tag? I use the following code, but it doesn't work:
string text = node.SelectSingleNode("/li/strong").InnerText;

A variation on the "li/strong" answers:
string text = node.SelectSingleNode("./li/strong").InnerText;

A leading single slash in XPath means the root of the document. You just want to select the children of the node you already have, so use a relative path with no leading slash:
string text = node.SelectSingleNode("li/strong").InnerText;

I think it should just be:
string text = node.SelectSingleNode("li/strong").InnerText;
...without the leading /.
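To make the difference concrete, here is a minimal sketch using the markup from the question. The RelativeXPathDemo class and its member names are hypothetical, made up for illustration; the HtmlAgilityPack calls are the same ones used in the question.

```csharp
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Markup copied from the question.
    public const string Html = @"<ul class=""enh-toggle"">
<li>
Design<sup>1</sup><span class=""accordion""></span>
<ul id=""design"">
<li>
<strong>Dimensions</strong>
<ul><li>length:12.3cm</li></ul>
</li>
</ul>
</li>
</ul>";

    // Returns the <strong> text under ul#design, or null if nothing matches.
    public static string GetStrongText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//ul[@class='enh-toggle']//ul[@id='design']");
        if (node == null) return null;

        // Relative path: evaluated against 'node', so it finds the <li>
        // children of ul#design and then their <strong> children.
        HtmlNode strong = node.SelectSingleNode("li/strong");
        return strong == null ? null : strong.InnerText;
    }

    // True when the absolute path from the question matches nothing.
    public static bool AbsolutePathFindsNothing(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@id='design']");

        // Absolute path: starts at the document root, whose only element
        // child is the outer <ul>, so "/li/strong" does not match.
        return node.SelectSingleNode("/li/strong") == null;
    }
}
```

Calling RelativeXPathDemo.GetStrongText(RelativeXPathDemo.Html) should give "Dimensions", while the absolute path returns null, which is why the original line failed when it accessed InnerText.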

Related

Find link with multiple keywords in C# with HTML Agility Pack

I am writing a program that parses a website.
I managed to find a link on the site, but I had to pass the exact InnerText to find it.
I'm looking for a way to do the same thing but match on partial inner text.
Example:
the inner text is: "hi my name is"
I want to be able to find it by passing only
"hi my"
foreach (var title in htmlNodes)
{
    if (keywords == title.SelectSingleNode("div/h1").InnerText)
    {
        if (color == title.SelectSingleNode("div/p").InnerText)
        {
            Console.WriteLine(title.SelectSingleNode("div/p/a").GetAttributeValue("href", "pas d'addresse"));
        }
    }
}
Here keywords has to match the inner text of div/h1 exactly. I want it to be a partial match.
Here is the HTML:
<article>
<div class="inner-article">
<a style = "height:150px;" href="/shop/shirts/c712g63kx/p1us9bkh7">
<img width = "150" height="150" src="//assets.supremenewyork.com/146319/vi/qW2Nur88W30.jpg" alt="Qw2nur88w30">
</a>
<h1>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Tiger Stripe Rayon Shirt</a>
</h1>
<p>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Teal</a>
</p>
</div>
</article>
Thank you all for your answers!
I found out how to resolve my problem. It was actually quite simple. Here is the code:
if ((title.SelectSingleNode("div/h1").InnerText).Contains(keywords))
Now the problem is making it case-insensitive.
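For the case-insensitive part, one option is string.IndexOf with StringComparison.OrdinalIgnoreCase, which works on any .NET version. This is only a sketch: the PartialMatch class and the ContainsIgnoreCase name are hypothetical helpers, not something from the question.

```csharp
using System;

public static class PartialMatch
{
    // True when 'text' contains 'keywords', ignoring case.
    // Null input is treated as "no match" instead of throwing.
    public static bool ContainsIgnoreCase(string text, string keywords)
    {
        if (text == null || keywords == null) return false;
        return text.IndexOf(keywords, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In the loop above, the condition would then read: if (PartialMatch.ContainsIgnoreCase(title.SelectSingleNode("div/h1").InnerText, keywords)).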

XPath retrieving values from multiple tags inside a node

I'm currently creating a crawler and I'm at the point where I need to extract data as a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program. So far it correctly goes to each page and retrieves the corresponding URL:
int tempflag = 0;
// linkValueList is full of sub-URLs previously crawled in the program
foreach (string str in linkValueList)
{
    string tempURL = baseURL + str;
    HtmlWeb tempWeb = new HtmlWeb();
    HtmlDocument tempHtml = tempWeb.Load(tempURL);
    foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
    {
        // get the category from the linkNameList
        string tempCategory = linkNameList.ElementAt(tempflag);
        // grab url
        string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
        // grab image url
        // grab brand
        // grab name
        // grab price
        // send to database via INSERT
    }
    tempflag++;
}
Here is the site markup I am working with. This is an example of one item; each item looks similar:
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see, I have already used XPath to get inside the <article> tag and read data-itemurl for the item's URL. My question is: now that I am already inside the <article> tag, is there an easy way to access the other tags nested inside it?
I need to get to the <img> tag for the image's URL, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one pass so I can write it to the database with a single INSERT statement at the end of each loop iteration.
Sure, you can use another XPath to query within a given element. One thing to note, which has tripped up many people: never start a relative XPath with /, because it will be evaluated from the document root instead of the current element. Start with ./ (or .// for descendants at any depth) if you need to. For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check that the result is not null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
    img = node.SelectSingleNode(".//img").GetAttributeValue("src", "");
    brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
    .....
}
You can also use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price")).
Hope it helps.
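Putting the pieces together with the null checks mentioned above, a rough sketch could look like this. ProductRow, ProductExtractor and TextOrEmpty are hypothetical names introduced only for this example, and the database INSERT is left out; the XPath expressions follow the markup shown in the question.

```csharp
using HtmlAgilityPack;

// Hypothetical container for one database row.
public sealed class ProductRow
{
    public string Url { get; set; }
    public string ImageUrl { get; set; }
    public string Brand { get; set; }
    public string Name { get; set; }
    public string Price { get; set; }
}

public static class ProductExtractor
{
    // Trimmed inner text, or an empty string when the node was not found.
    private static string TextOrEmpty(HtmlNode node)
    {
        return node == null ? string.Empty : node.InnerText.Trim();
    }

    // 'article' is one node from SelectNodes("//article[@itemprop='product']").
    public static ProductRow Extract(HtmlNode article)
    {
        // ".//" keeps each query inside this <article> only.
        HtmlNode img = article.SelectSingleNode(".//img");

        return new ProductRow
        {
            Url = article.GetAttributeValue("data-itemurl", string.Empty),
            ImageUrl = img == null
                ? string.Empty
                : img.GetAttributeValue("src", string.Empty),
            Brand = TextOrEmpty(article.SelectSingleNode(".//div[@itemprop='brand']")),
            Name = TextOrEmpty(article.SelectSingleNode(".//div[@itemprop='name']")),
            Price = TextOrEmpty(article.SelectSingleNode(".//div[@itemprop='price']"))
        };
    }
}
```

Inside the loop you would call ProductExtractor.Extract(node) once per article and build the INSERT from the returned row. Note that SelectNodes itself returns null rather than an empty collection when nothing matches, so the outer foreach may need a null check as well.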

How to get <li> tags from an HTML string

string HTML = @"<div class=""two-clm-listing""><h4>IAS Test Guide</h4>
<ul>
<li>IAS Time Table</li>
<li>IAS Subjects</li>
<li>IAS Books & Material</li>
</ul>
";
I have a string of HTML. I need to get only the <li> tags from that string,
and after getting those <li> tags I need to add some HTML classes to them.
You can try this (where HtmlStr is your HTML string):
HtmlStr = HtmlStr.Replace("<li>","<li class=\"NewClass\">");
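The string replace only catches <li> tags written exactly like that, with no attributes. If that is too fragile, an alternative is to parse the string with HTML Agility Pack (the library used in the other questions on this page) and set the attribute on each node. A minimal sketch; ListItemClassAdder and AddClassToListItems are hypothetical names:

```csharp
using HtmlAgilityPack;

public static class ListItemClassAdder
{
    // Returns 'html' with class="className" set on every <li> element.
    public static string AddClassToListItems(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null) return html;

        foreach (HtmlNode li in items)
        {
            // Adds the attribute, or overwrites an existing class value.
            li.SetAttributeValue("class", className);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because SetAttributeValue replaces any class the <li> already has, you would need to merge the old and new values yourself if existing classes must be kept.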

How to get a particular text inside HTML using c#?

How do I get the text "Attractions" from the HTML below?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually do this with the code below when I need the text inside a span, but I need some help with the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
    Result = selectNode.InnerHtml;
}
How can I do this?
Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong").InnerText.Trim();
You can also use a foreach over SelectNodes, like you did above.

How to unwrap an element if it exists with CsQuery?

I'm using CsQuery to read values of HTML elements.
I don't know in advance whether the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = font.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy and I'd rather just always remove the font element, if present, so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't managed to make it work. Any ideas?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();
