HtmlAgilityPack selecting childNodes not as expected - c#

I am using the HtmlAgilityPack library to parse some links in a page, but the methods are not returning the results I expect. In the following code I have an HtmlNodeCollection of links. For each link I want to check whether it contains an image node and then parse its attributes, but the SelectNodes and SelectSingleNode methods of linkNode seem to search the whole document rather than the child nodes of linkNode. What gives?
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlNode linkNode in linkNodes)
{
    string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
    if (linkTitle == string.Empty)
    {
        HtmlNode imageNode = linkNode.SelectSingleNode("/img[@alt]");
    }
}
Is there another way to get the alt attribute of the image child node of linkNode, if it exists?

You should remove the leading forward slash from "/img[@alt]", because it means the search starts at the root of the document.
HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");

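In an XPath query you can also use "." to indicate that the search should start at the current node. Note that ".//img" matches img elements at any depth below linkNode, not only its direct children.
HtmlNode imageNode = linkNode.SelectSingleNode(".//img[@alt]");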

Also, remember to check for null: SelectNodes returns null rather than an empty collection when nothing matches.
HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
        }
    }
}
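
To make the difference between the three path forms concrete, here is a minimal sketch. The sample markup, the class name and the Describe helper are made up for illustration; only SelectNodes, SelectSingleNode and GetAttributeValue come from the answers above.

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathDemo
{
    // Hypothetical helper: returns the alt text, or a marker when no node was found.
    static string Describe(HtmlNode node)
    {
        return node == null ? "(none)" : node.GetAttributeValue("alt", string.Empty);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Illustrative markup: the first image is a direct child of its link,
        // the second one is nested inside a span.
        doc.LoadHtml(
            "<a href='/one'><img alt='first' src='1.png'></a>" +
            "<a href='/two'><span><img alt='second' src='2.png'></span></a>");

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode link in links)
        {
            // "img": direct children of the current link only.
            HtmlNode child = link.SelectSingleNode("img[@alt]");
            // ".//img": any descendant of the current link.
            HtmlNode descendant = link.SelectSingleNode(".//img[@alt]");
            // "//img": evaluated from the document root, regardless of the link.
            HtmlNode fromRoot = link.SelectSingleNode("//img[@alt]");

            Console.WriteLine("{0}: child={1}, descendant={2}, fromRoot={3}",
                link.GetAttributeValue("href", string.Empty),
                Describe(child), Describe(descendant), Describe(fromRoot));
        }
    }
}
```

Comparing the three values for the second link is the quickest way to see which form matches your intent: use "img" when the image must be a direct child, and ".//img" when it may be wrapped in other elements.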

Related

How to get div by class in HtmlAgilityPack?

I'm following this tutorial, but I have a problem: I don't know how to get an HtmlNode by class name.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(e.Result);
HtmlNode divContainer = htmlDoc.GetElementbyId("directoryItems"); // My problem is here: I want to select by class name instead
if (divContainer != null)
{
    HtmlNodeCollection nodes = divContainer.SelectNodes("//table/tr");
    ....
}
Try this:
HtmlNodeCollection divContainer = htmlDoc.DocumentNode.SelectNodes("//div[@class='myClass']");
This returns a collection of div nodes with class="myClass".
Assuming that you want to select a <div> element whose class attribute value equals "directoryItems", and you know that only one element meets the criteria (or you simply want the first occurrence if there is more than one), you can use the .SelectSingleNode() method with the following XPath query:
HtmlNode divContainer = htmlDoc.DocumentNode
    .SelectSingleNode("//div[@class='directoryItems']");
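
One caveat with the queries above: @class='directoryItems' compares the whole attribute value, so an element with several classes will not match. The following sketch shows a common XPath 1.0 idiom for matching a single class token instead. The sample markup and the class name ClassTokenDemo are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

class ClassTokenDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Illustrative markup: one div carries two classes, the other has a
        // class name that merely starts with the same text.
        doc.LoadHtml(
            "<div class='directoryItems wide'>A</div>" +
            "<div class='directoryItemsOld'>B</div>");

        // Exact comparison: matches only when the entire attribute equals the string.
        HtmlNode exact = doc.DocumentNode
            .SelectSingleNode("//div[@class='directoryItems']");

        // Token comparison: pad the normalized class list with spaces and look
        // for the class name surrounded by spaces.
        string tokenXPath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' directoryItems ')]";
        HtmlNode byToken = doc.DocumentNode.SelectSingleNode(tokenXPath);

        Console.WriteLine(exact == null ? "exact: no match" : "exact: " + exact.InnerText);
        Console.WriteLine(byToken == null ? "token: no match" : "token: " + byToken.InnerText);
    }
}
```

A plain contains(@class, 'directoryItems') would also match directoryItemsOld, which is why the idiom pads both sides with spaces.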

HtmlAgilityPack HtmlNodeCollection returning null when it shouldn't

I made a simple program for fetching the names of YouTube users from the comments.
This is the code:
string html;
using (var client = new WebClient())
{
    html = client.DownloadString("http://www.youtube.com/watch?v=ER5EnjskCvE");
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
List<string> data = new List<string>();
HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//*[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img");
foreach (HtmlNode node in nodeCollection)
{
    data.Add(node.GetAttributeValue("alt", null));
}
But I have a problem: my nodeCollection is null.
For the XPath I used the "Copy XPath" option in Chrome's developer tools (F12).
Try replacing "*" with "div":
"/html/body//div[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img"

How to get text between two div tags with some class attribute with HTMLagility

I want to get some text from two HTML divs in an HTML file.
After some searching I decided to use HtmlAgilityPack for this.
I wrote this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*div[@class='item']");
string value = node.InnerText;
'result' is the content of my file.
But I get this exception: 'Expression must evaluate to a node-set'
And this is part of my file's content:
<div class="Clear" style="height:15px;"></div>
<div class='Container Select' id="Container_1">
<div class='Item'><div class='Part Lable'>موضوع : </div><div class='Part ...
Try either
"//*/div[@class='item']"
or simply
"//div[@class='item']"
Have you tried using XPath?
For example, if I wanted to find out whether a node is selected in my example, I would do the following:
string xpath = null;
XmlNode configNode = configDom.DocumentElement;
// collect selected nodes in node list
XmlNodeList nodeList =
    configNode.SelectNodes(@"//*[@status='checked']");
In your case you would do the following:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*/div[@class='item']");
string value = node.InnerText;

How to use HtmlAgilityPack to get these two values?

Html code:
<div>
<div>Name</div>
Date
</div>
How to use HtmlAgilityPack to get the Name and Date values?
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div"))
{
    string parent = div.InnerText; // all text inside this div, nested elements included
    foreach (HtmlNode child in div.ChildNodes)
    {
        string childDiv = child.InnerText; // text of each child node, e.g. Name or Date
    }
}
var doc = new HtmlDocument();
doc.LoadHtml("<div><div>Name</div>Date</div>");
var nodes = doc.DocumentNode
    .DescendantNodes()
    .Where(n => n.NodeType == HtmlNodeType.Text);
foreach (var node in nodes)
{
    var value = node.InnerText;
}
Hard-coded:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<div><div>Name</div>Date</div>");
HtmlNode outerDiv = doc.DocumentNode.ChildNodes[0];
Console.WriteLine(outerDiv.ChildNodes[0].InnerText); // inner <div>: Name
Console.WriteLine(outerDiv.ChildNodes[1].InnerText); // trailing text node: Date
If what you're looking for is all the text nodes, see Kevin Babcock's answer. Remember that in XML (and in HtmlAgilityPack) text is represented as nodes, too.
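
When the markup is formatted across several lines, as in the question, the whitespace between tags shows up as text nodes as well. The sketch below extends the LINQ approach above by trimming each value and skipping whitespace-only nodes. The class name TextNodeDemo and the inline sample string are assumptions for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class TextNodeDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Same structure as in the question, including line breaks between tags.
        doc.LoadHtml("<div>\n<div>Name</div>\nDate\n</div>");

        var values = doc.DocumentNode
            .DescendantNodes()
            .Where(n => n.NodeType == HtmlNodeType.Text) // keep text nodes only
            .Select(n => n.InnerText.Trim())             // strip surrounding whitespace
            .Where(text => text.Length > 0);             // drop whitespace-only nodes

        foreach (string value in values)
        {
            Console.WriteLine(value);
        }
    }
}
```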

How to delete a node if it has no parent node

I'm using the HTML agility pack to clean up input to a WYSIWYG. This might not be the best way to do this but I'm working with developers who explode on contact with regex so it will have to suffice.
My WYSIWYG content looks something like this (for example):
<p></p>
<p></p>
<p><span><input id="textbox" type="text" /></span></p>
I need to strip the empty paragraph tags. Here's how I'm doing it at the moment:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
foreach (HtmlNode node in nodes)
{
    node.InnerHtml = node.InnerHtml.Trim();
    if (node.InnerHtml == string.Empty)
        node.ParentNode.RemoveChild(node);
}
However, because the HTML is not a complete document, the paragraph tags do not have a parent node, so RemoveChild fails because ParentNode is null.
I can't find another way to remove the tag, though. Can anyone point me to an alternative method?
Technically, first-level elements are children of the document root, so the following code should work:
if (node.InnerHtml == String.Empty) {
    HtmlNode parent = node.ParentNode;
    if (parent == null) {
        parent = doc.DocumentNode;
    }
    parent.RemoveChild(node);
}
You want to remove from the collection, right?
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
if (nodes == null)
    return;
// Iterate backwards so that removing an item does not shift the indexes still to be visited.
for (int i = nodes.Count - 1; i >= 0; i--)
{
    nodes[i].InnerHtml = nodes[i].InnerHtml.Trim();
    if (nodes[i].InnerHtml == string.Empty)
        nodes.Remove(i);
}
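
Putting the pieces from both answers together, a minimal sketch of the clean-up might look like the following. It assumes the fragment from the question (with one whitespace-only paragraph added for illustration), falls back to the document root when ParentNode is null as suggested above, and iterates over a copy of the collection so that removals cannot disturb the loop. The class name is hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class RemoveEmptyParagraphs
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Fragment modeled on the question; it is not a complete HTML document.
        doc.LoadHtml(
            "<p></p>\n<p>  </p>\n" +
            "<p><span><input id=\"textbox\" type=\"text\" /></span></p>");

        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null) return; // no <p> elements at all

        // ToList() takes a snapshot, so the document can be modified while looping.
        foreach (HtmlNode p in paragraphs.ToList())
        {
            if (p.InnerHtml.Trim().Length > 0) continue; // paragraph has content, keep it

            // Fall back to the document root if the node reports no parent.
            HtmlNode parent = p.ParentNode ?? doc.DocumentNode;
            parent.RemoveChild(p);
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Unlike removing entries from the HtmlNodeCollection returned by SelectNodes, this removes the nodes from the document tree itself, which is what the serialized output reflects.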
