Computing multiple node sets - c#

var cats = doc.DocumentNode.SelectNodes("xpath1 | xpath2");
I use the | operator to select multiple node sets, and HtmlAgilityPack puts them in a single HtmlNodeCollection containing all the results. How do I know whether a node is a result of xpath1 or xpath2?
Example:
var cats = doc.DocumentNode.SelectNodes("//*[contains(@name,'thename')]/../../div/ul/li/a | //*[contains(@name,'thename')]/../../div/ul/li/a/../div/ul/li/a");
I am trying to build a tree-like structure from this. The first XPath returns a single element, and the second returns one or more elements. The first result is a main tree node and the second results are the children of that node. I want to build something like a Dictionary<string, List<string>> from the inner text of the results.
To simplify, consider the following HTML:
<ul>
<li>
<h1>Node1</h1>
<ul>
<li>Node1_1</li>
<li>Node1_2</li>
<li>Node1_3</li>
<li>Node1_4</li>
</ul>
</li>
<li>
<h1>Node2</h1>
<ul>
<li>Node2_1</li>
<li>Node2_2</li>
</ul>
</li>
<li>
<h1>Node3</h1>
<ul>
<li>Node3_1</li>
<li>Node3_2</li>
<li>Node3_3</li>
</ul>
</li>
</ul>
var cats = doc.DocumentNode.SelectNodes("//ul/li/h1 | //ul/li/ul/li");

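Why not just select the head node first and then query its children relative to it?
var head = doc.DocumentNode.SelectSingleNode("xpath1");
var children = head.SelectNodes("xpath2"); // xpath2 written relative to head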
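For the HTML in the example, you would do:
// Only the li elements that have an h1 child; a plain //ul/li would also match the nested items.
var containerNodes = doc.DocumentNode.SelectNodes("//ul/li[h1]");
foreach (var n in containerNodes)
{
    var headNode = n.SelectSingleNode("h1"); // the main tree node
    var subNodes = n.SelectNodes("ul/li");   // its children
}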
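The loop above can be extended into the dictionary the question asks for. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (TreeBuilder, Build) are hypothetical and only used for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TreeBuilder
{
    // Hypothetical helper: maps each <h1> text to the texts of the <li> items in its sibling <ul>.
    public static Dictionary<string, List<string>> Build(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var tree = new Dictionary<string, List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var containers = doc.DocumentNode.SelectNodes("//ul/li[h1]");
        if (containers == null) return tree;

        foreach (var container in containers)
        {
            // The [h1] predicate above guarantees that this node exists.
            var head = container.SelectSingleNode("h1");
            var children = container.SelectNodes("ul/li");

            tree[head.InnerText.Trim()] = children == null
                ? new List<string>()
                : children.Select(c => c.InnerText.Trim()).ToList();
        }
        return tree;
    }
}
```

Calling TreeBuilder.Build with the sample HTML from the question should give one entry per h1 (Node1, Node2, Node3), each holding the texts of its own list items. Because every child query runs relative to its container, there is no need to work out afterwards which half of a union expression a node came from.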

Related

HTML Agility Pack - Select node after particular paragraph

I have various files containing the following HTML. I need to retrieve only the list after the "targetWord" paragraph (its position varies across the pages I need to parse). How can I do this with HTML Agility Pack?
<p>Word1</p>
<ul>
<li>listobject1</li>
<li>listobject2</li>
<li>listobject3</li>
</ul>
<p>targetWord</p>
<ul>
<li>listobject4</li>
<li>listobject5</li>
<li>listobject6</li>
</ul>
<p>Word2</p>
<ul>
<li>listobject7</li>
<li>listobject8</li>
<li>listobject9</li>
</ul>
I want my code to output only the list items after targetWord:
foreach (var node in retrievedNodes)
{
    s[i] = node.InnerText;
    Console.WriteLine(s[i]);
    i++;
}
OUTPUT:
listobject4
listobject5
listobject6
You need to craft an XPath expression that matches your requirement.
Assuming your snippet is loaded into an HtmlAgilityPack HtmlDocument named htmlSnippet, then
htmlSnippet.DocumentNode.SelectNodes("//p[text()='targetWord']/following-sibling::ul[1]//li")
will return the li descendants of the first ul that follows your targetWord p tag.
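As a minimal sketch of the following-sibling approach, assuming the HtmlAgilityPack NuGet package is referenced: the class and method names (ListAfterParagraph, ItemsAfter) are hypothetical, and normalize-space() is swapped in for text() so that whitespace around the paragraph text does not prevent a match.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ListAfterParagraph
{
    // Hypothetical helper: returns the <li> texts of the first <ul> after the <p> whose text equals `word`.
    public static List<string> ItemsAfter(string html, string word)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumes `word` contains no single quote, since it is spliced into the XPath string literal.
        var xpath = "//p[normalize-space()='" + word + "']/following-sibling::ul[1]/li";
        var items = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null when nothing matches, so fall back to an empty list.
        return items == null
            ? new List<string>()
            : items.Select(li => li.InnerText.Trim()).ToList();
    }
}
```

With the HTML from the question, ItemsAfter(html, "targetWord") should yield listobject4, listobject5 and listobject6, wherever the targetWord paragraph sits on the page.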

How to get specific data using HtmlAgilityPack

I am using HtmlAgilityPack to scrape data from a page (the URL is in the code below).
The structure is something like this:
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two values from it: the text "(dhaarmika)" and the id "bn7483". I am using this code:
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[@class='in']");
I was able to get the first value, "(dhaarmika)", but I couldn't get the second.
Could anyone tell me how to get the second value?
Another possible way is to select the preceding sibling of the <span> you already found:
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[@class='in']");
var id = span.SelectSingleNode("preceding-sibling::i[@id]")
             .Attributes["id"]
             .Value;
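The chained calls above throw a NullReferenceException if either node is missing. A null-safe variant might look like the sketch below, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (SpeakIcon, FindIconId) are hypothetical, and the HTML is passed in as a string rather than downloaded.

```csharp
using HtmlAgilityPack;

public static class SpeakIcon
{
    // Hypothetical helper: returns the id of the <i> that precedes <span class="in">, or null if absent.
    public static string FindIconId(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var span = doc.DocumentNode.SelectSingleNode("//span[@class='in']");
        if (span == null) return null;

        // Relative path: only siblings that come before the span and carry an id attribute.
        var icon = span.SelectSingleNode("preceding-sibling::i[@id]");

        // GetAttributeValue returns the supplied default when the attribute is missing.
        return icon == null ? null : icon.GetAttributeValue("id", "");
    }
}
```

For markup shaped like the snippet in the question, FindIconId should return "bn7483"; it returns null when the span or the preceding i element cannot be found.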

How to get a particular text inside HTML using c#?

How can I get the text "Attractions" from the HTML below?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually do this with the code below when I need the text inside a span, but I need some help with the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
    Result = selectNode.InnerHtml;
}
How can I do this?
Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong").InnerText.Trim();
You can also loop over the results of SelectNodes with foreach, as in your own code.

XPATH: how to get child nodes

I have the following HTML:
<ul class="enh-toggle">
<li>
Design<sup>1</sup><span class="accordion"></span>
<ul id="design">
<li>
<strong>Dimensions</strong>
<ul><li>length:12.3cm</li></ul>
</li>
</ul>
</li>
</ul>
I use the following code to get ul[id='design']:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");
This works perfectly.
Now my question is: how can I get the text of the strong tag? I use the following code, but it doesn't work:
string text = node.SelectSingleNode("/li/strong").InnerText;
A variation on the "li/strong" answers:
string text = node.SelectSingleNode("./li/strong").InnerText;
A leading slash in XPath starts the path at the root of the document. You only want the children of the current node, so use a relative path with no leading slash:
string text = node.SelectSingleNode("li/strong").InnerText;
I think it should just be:
string text = node.SelectSingleNode("li/strong").InnerText;
without the leading /.
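The difference between the three path forms can be checked with a small sketch like the one below, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (RelativePaths, TextFromDesignList) are hypothetical.

```csharp
using HtmlAgilityPack;

public static class RelativePaths
{
    // Hypothetical helper: evaluates `xpath` against the <ul id="design"> node
    // and returns the trimmed text of the first match, or null if nothing matches.
    public static string TextFromDesignList(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var design = doc.DocumentNode.SelectSingleNode("//ul[@id='design']");
        if (design == null) return null;

        // A path starting with "/" is resolved from the document root even though
        // it is called on `design`; "li/strong" and "./li/strong" start at `design`.
        var match = design.SelectSingleNode(xpath);
        return match == null ? null : match.InnerText.Trim();
    }
}
```

With the HTML from the question, "li/strong" and "./li/strong" should both give "Dimensions", while "/li/strong" should give null, because the document root has no li child. That null is what makes the original .InnerText call fail.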

li in htmlagilitypack c#

I want to get the label and strong values from the following li elements:
<div class="property-summary">
<h3>Listing summary</h3>
<ul>
<li>
<label>Reference</label>
<strong>BR-S-4301</strong>
</li>
<li>
<label>Type</label>
<strong>Apartment</strong>
</li>
<li>
<label>City</label>
<strong>Dubai</strong>
</li>
<li>
<label>Community</label>
<strong>Palm Jumeirah</strong>
</li>
<li>
<label>Subcommunity</label>
<strong>Tiara Residences</strong>
</li>
</ul>
</div>
Here is my C# code:
var dataNode = rootNode.SelectNodes("//div[normalize-space(@class)='property-summary']");
How do I get them now? The line below is not working for me:
var Node = dataNode.SelectSingleNode(".//li/strong");
There are a couple of ways to do it.
1.
var labelNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/label");
var strongNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/strong");
foreach (var node in labelNodes)
{
    Debug.WriteLine(node.InnerText.Trim());
}
foreach (var node in strongNodes)
{
    Debug.WriteLine(node.InnerText.Trim());
}
2.
var liNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li");
foreach (var node in liNodes)
{
    Debug.WriteLine(node.SelectSingleNode("label").InnerText.Trim());
    Debug.WriteLine(node.SelectSingleNode("strong").InnerText.Trim());
}
Check that the nodes exist before using them in real code: SelectNodes and SelectSingleNode return null when nothing matches.
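The second approach, combined with the existence checks, could be sketched as follows, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (PropertySummary, Parse) are hypothetical, and labels are assumed to be unique within the list.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PropertySummary
{
    // Hypothetical helper: maps each <label> text to the text of the <strong> in the same <li>.
    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        var items = doc.DocumentNode.SelectNodes(
            "//div[normalize-space(@class)='property-summary']/ul/li");
        if (items == null) return result; // no matching list items

        foreach (var li in items)
        {
            var label = li.SelectSingleNode("label");
            var value = li.SelectSingleNode("strong");

            // Skip items that lack either element instead of throwing.
            if (label == null || value == null) continue;

            // A repeated label overwrites the earlier value.
            result[label.InnerText.Trim()] = value.InnerText.Trim();
        }
        return result;
    }
}
```

For the HTML in the question, the result should contain entries such as "Reference" mapped to "BR-S-4301" and "City" mapped to "Dubai". Querying each li for its own label and strong keeps the pairs together, which two separate node lists do not guarantee when an item is incomplete.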
If you want to get all the label tags, you can use:
IEnumerable<HtmlNode> labels = dataNode.Descendants("label");
And the same for the strong tags:
IEnumerable<HtmlNode> strongs = dataNode.Descendants("strong");
You can also use:
var dataNode = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']")[0];
HtmlNodeCollection strongs = dataNode.SelectNodes(".//li/strong");
HtmlNodeCollection labels = dataNode.SelectNodes(".//li/label");
To get the text from strongs or labels, use:
foreach (var strong in strongs)
{
    string strongText = strong.InnerText.Trim();
}
You may also consider switching to one of these HTML parsing libraries, which provide jQuery-like selectors:
http://nsoup.codeplex.com/
http://github.com/jamietre/csquery
