Why does this select all of the <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you'd expect SelectNodes to search only within the div with id "myTrips"; however, if you then call SelectNodes("//li"), it performs another search from the top of the document.
I fixed this by combining the statements into one, but that only works on a page with a single div with the id "myTrips". The query would look like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies entirely on XPath syntax, but the result is non-intuitive, because these queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
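As a minimal sketch of the difference (using an inline HTML string instead of a live page; the ids and items here are made up for illustration):

```csharp
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
    <div id='myTrips'><ul><li>Paris</li><li>Rome</li></ul></div>
    <div id='other'><ul><li>Berlin</li></ul></div>");

var trips = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");

// '//li' restarts the search at the document root: all three items match.
var fromRoot = trips.SelectNodes("//li");

// './/li' is relative to the context node: only the two items inside the div match.
var fromDiv = trips.SelectNodes(".//li");
```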
Creating a new node can be beneficial in some situations and lets you use XPath more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a LINQ query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n => n.Name == "div" && n.Id == "myTrips"))
{
    travelList.AddRange(matchingDiv.DescendantNodes().Where(n => n.Name == "li"));
}
I hope it helps
This seems counter-intuitive to me as well; if you run the SelectNodes method on a particular node, I assumed it would only search beneath that node, not in the document in general.
Anyway, OP, if you change this line:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
TO:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK; I've just had the same issue and that fixed it for me. Note that the relative path "li" only matches <li> elements that are direct children of the selected node; use ".//li" if they can be nested deeper.
Related
I am using HtmlAgilityPack in C#. When I try to select the images in a div at the URL below, the selector finds nothing, although I think I have written the selector correctly.
The code is in the fiddle. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//I am selecting all images in div.col-big, but nothing is found.
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
Your XPath is wrong because there is no div tag that has a class attribute with the value 'col-big'. There is, however, a div tag with a class attribute of 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
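Exact @class comparison is brittle whenever an element carries multiple classes. A more robust sketch (assuming 'col-big' is the class token you actually care about) uses contains() with the usual normalize-space idiom:

```csharp
// Matches any div whose class attribute contains the token 'col-big',
// whether or not other classes such as 'pull-left' are also present.
// The concat/normalize-space padding prevents false hits like 'col-bigger'.
var htmlNodes = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//img");
```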
For example, I want to extract the first definition from http://www.urbandictionary.com/define.php?term=potato . It's raw text, though.
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.urbandictionary.com/define.php?term=potato"));
var root = html.DocumentNode;
var p = root.Descendants()
.Where(n => n.GetAttributeValue("class", "").Equals("meaning"))
.Single()
.Descendants("")
.Single();
var content = p.InnerText;
This is the code I use to try and extract the meaning class. This doesn't work at all, though... How do I extract the class from Urban Dictionary?
If you change your code as below, it works:
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.urbandictionary.com/define.php?term=potato"));
var root = html.DocumentNode;
var p = root.SelectNodes("//div[@class='meaning']").First();
var content = p.InnerText;
The text I'm using in SelectNodes is XPath and means all div elements with a class of 'meaning'. You need to use First or FirstOrDefault because the page contains multiple div elements with that class name, so Single would throw an exception.
Alternatively you can use, if you wanted to use the same "style" as the link you were using.
var p = root.Descendants()
.Where(n => n.GetAttributeValue("class", "").Equals("meaning"))
.FirstOrDefault();
Tone's answer is more elegant, though; one-liners are usually better.
Here is the Google Chrome dev tool showing the element I'm looking for.
Here are all the different ways I have tried to get the nodes:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webObject.Html);
// HtmlNode footer = doc.DocumentNode.Descendants().SingleOrDefault(y => y. == "boardPickerInner");
// "//div[@class='boardPickerInner']"
//var y = (from HtmlNode node in doc.DocumentNode.SelectNodes("//")
// where node.InnerText == "boardPickerInner"
// select node.InnerHtml);
HtmlAgilityPack.HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//nameAndIcons");
var xq = doc.DocumentNode.SelectSingleNode("//td[@class='nameAndIcons']");
var x = doc.DocumentNode.SelectSingleNode("");
HtmlNode nodes = doc.DocumentNode.SelectSingleNode("//[@class='nameAndIcons']");
var boards = nodes.SelectNodes("//*[@class='nameAndIcons']");
Can someone explain what I am doing wrong..?
It looks like you have multiple span elements with class="nameAndIcons". So in order to get them all you could use the SelectNodes function:
var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");
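A small sketch of using the result; note that SelectNodes returns null (not an empty collection) when nothing matches, so a guard is worth adding:

```csharp
var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");
if (nodes != null)
{
    foreach (var span in nodes)
    {
        // InnerText strips the markup, leaving just the visible text.
        Console.WriteLine(span.InnerText.Trim());
    }
}
```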
I am trying to parse this page, but there isn't much unique information with which to identify the sections I want.
Basically, I am trying to get most of the data shown next to the flash video. So:
Alternating Floor Press
Type: Strength
Main Muscle Worked: Chest
Other Muscles: Abdominals, Shoulders, Triceps
Equipment: Kettlebells
Mechanics Type: Compound
Level: Beginner
Sport: No
Force: N/A
And also the image links that shows before and after states.
Right now I use this:
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a");
foreach (var link in threadLinks)
{
    string str = link.InnerHtml;
    Console.WriteLine(str);
}
This gives me a lot of stuff I don't need but also prints what I need. Should I be parsing this printed data by trying to see where my goal data might be inside it?
You can select by the id of the nodes you are interested in:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.bodybuilding.com/exercises/detail/view/name/alternating-floor-press");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.SelectNodes("//*[@id=\"exerciseDetails\"]");
foreach (var link in threadLinks)
{
string str = link.InnerText;
Console.WriteLine(str);
}
Console.ReadKey();
For a given <a> node, to get the text shown, try .InnerText.
Right now you are using the contents of all <a> tags within the document. Try narrowing down to only the ones you need. Look for other elements which contain the particular <a> tags you are after. For example, do they all sit inside a <div> with a certain class?
E.g. if you find the <a> tags you are interested in all sit within <div class="foolinks"> then you can do something like:-
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
    .First(dn => dn.GetAttributeValue("class", "") == "foolinks").Descendants("a");
--UPDATE--
Given the information in your comment, I would try:-
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("div")
.First(dn => dn.Id == "exerciseDetails").Descendants("a");
--UPDATE--
If you are having trouble getting it to work, try splitting it up into variable assignments and stepping through the code, inspecting each variable to see if it holds what you expect.
E.g,
var divs = doc.DocumentNode.Descendants("div");
var div = divs.FirstOrDefault(dn => dn.Id == "exerciseDetails");
if (div == null)
{
// couldn't find the node - do whatever is appropriate, e.g. throw an exception
}
IEnumerable<HtmlNode> threadLinks = div.Descendants("a");
BTW - I'm not sure if the .Id property maps to the id attribute of the node as you suggest it does. If not, you could try dn => dn.GetAttributeValue("id", "") == "exerciseDetails" instead (note that dn.Attributes["id"] returns an HtmlAttribute, not a string, so comparing it directly to "exerciseDetails" would always be false).
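As a sketch, GetAttributeValue sidesteps the question entirely: it reads the attribute as a string and returns a fallback when the attribute is missing, so there is no HtmlAttribute-vs-string comparison to get wrong:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// GetAttributeValue("id", "") returns "" when the attribute is absent,
// so the predicate never throws and compares plain strings.
var div = doc.DocumentNode
    .Descendants("div")
    .FirstOrDefault(dn => dn.GetAttributeValue("id", "") == "exerciseDetails");

IEnumerable<HtmlNode> threadLinks = div != null
    ? div.Descendants("a")
    : Enumerable.Empty<HtmlNode>();
```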
I want to parse an HTML page to get some data.
First, I convert it to an XML document using SgmlReader.
Then I load the result into an XmlDocument and navigate through XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:
var node = doc.SelectSingleNode(".//*[@id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
Ok, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0
This also:
var childNode = node.SelectSingleNode("thead");
// childNode = null
And even this:
var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead")
What could be wrong with these XPath queries?
I've just tried to parse that HTML page with Html Agility Pack, and my XPath queries work fine there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but the XPath queries still don't work:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps the web page contains errors (not all tags are closed, and so on), but in spite of this I can see the child nodes (through Quick Watch in Visual Studio); I just cannot access them through XPath.
My XPath queries work correctly in Firefox with the FirePath and XPather plugins, but they don't work in the .NET XmlDocument :(
I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
This code is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for, or any of its parents, has a namespace, you must create an XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into a XmlDocument. Then, you won't need to fool with XmlNamespaceManager!
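A sketch of that stripping idea (the regex is deliberately crude, and it assumes the document does not actually use prefixed element or attribute names, since removing only the declarations would then make it invalid XML):

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

// Remove xmlns="..." and xmlns:prefix="..." declarations so every element
// ends up in the empty namespace and plain XPath works without a manager.
string xml = File.ReadAllText(@"C:\temp\X.loadtest.xml");
string stripped = Regex.Replace(xml, @"\s+xmlns(:\w+)?\s*=\s*""[^""]*""", "");

XmlDocument doc = new XmlDocument();
doc.LoadXml(stripped);

// No XmlNamespaceManager needed now.
XmlNodeList nodes = doc.SelectNodes("//TestProfile");
```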
XmlDocument doc = new XmlDocument();
doc.Load(#"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
To be honest, when I am trying to get information from a website, I use regex.
Kore Nordmann (in his PHP blog) thinks this is not a good idea, but some of the comments there disagree:
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
It's PHP, sorry about that =) Hope it helps anyway.