HTMLAgilityPack - Get element in class by class - c#

I want to get the value of the H2 (highlighted) element inside the div with the 'listicle-page' class shown below. Currently the code returns the text of everything in the DIV element, but I only need the value of the H2 contained in that div.
Consider the following HTML:
Please see code below -
private void getFact()
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.rd.com/culture/interesting-facts/");
    var headerNames = doc.DocumentNode.SelectNodes("//div[@class='listicle-page']").ToList();
    foreach (var item in headerNames)
    {
        MessageBox.Show(item.InnerText);
    }
}

Your XPath //div[@class='listicle-page'] matches the div node itself, so InnerText returns the text of the div and all of its descendants. If you need to select only the child h2 node, specify it explicitly by adding /h2:
//div[@class='listicle-page']/h2
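
A minimal sketch of the corrected query, assuming the HtmlAgilityPack NuGet package is referenced. The inline HTML is a made-up stand-in for the page markup (which is not shown in the question), and `H2Demo` and `GetHeadings` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class H2Demo
{
    // Returns the trimmed text of every h2 that is a direct child of
    // a div whose class attribute is exactly "listicle-page".
    public static List<string> GetHeadings(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='listicle-page']/h2");
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the real page.
        const string html =
            "<div class='listicle-page'><h2>Fact one</h2><p>Body text</p></div>" +
            "<div class='listicle-page'><h2>Fact two</h2><p>More text</p></div>";

        foreach (string heading in GetHeadings(html))
        {
            Console.WriteLine(heading);
        }
    }
}
```

The null check matters because calling .ToList() directly on the result, as in the question, throws when the XPath matches nothing.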

Related

XmlElement InnerText property

I'm delving into the world of XmlDocument building and thought I'd try to re-build (at least, in part) the Desktop tree given by Microsoft's program UISpy.
So far I am able to grab a child of the desktop and write that to a XML document, and then grab each child of that and write those to an XML document.
So far the code looks like this...
using System.Windows.Automation;
using System.Xml;

namespace MyTestApplication
{
    internal class TestXmlStuff
    {
        public static void Main(string[] args)
        {
            XmlDocument xDocument = new XmlDocument();
            AutomationElement rootElement = AutomationElement.RootElement;
            TreeWalker treeWalker = TreeWalker.ContentViewWalker;
            XmlNode rootXmlElement = xDocument.AppendChild(xDocument.CreateElement("Desktop"));
            AutomationElement autoElement = rootElement.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.NameProperty, "GitHub"));
            string name = autoElement.Current.Name;
            while (autoElement != null)
            {
                string lct = autoElement.Current.LocalizedControlType.Replace(" ", "");
                lct = (lct.Equals("") ? "Custom" : lct);
                XmlElement temp = (XmlElement)rootXmlElement.AppendChild(xDocument.CreateElement(lct));
                //temp.InnerText = lct;
                string outerXML = temp.OuterXml;
                rootXmlElement = temp;
                autoElement = treeWalker.GetNextSibling(autoElement);
            }
        }
    }
}
...and the resulting XML file...
Now, when I add a line to set the InnerText property of each XML element, like temp.InnerText = lct, I get an oddly formatted XML file.
What I expected from this was that each InnerText would be on the same line as the start and end tags of the XML element, but instead all but the last element's InnerText is located on a new line.
So my question is, why is that? Is there something else I could be doing with my XML elements to have their InnerText appear on the same line?
As I said in a comment, XML isn't a display format, so it gets formatted however IE chooses to do so.
To get closer to what you were expecting, you might want to consider using an attribute rather than InnerText:
XmlElement temp = (XmlElement)rootXmlElement.AppendChild(xDocument.CreateElement(lct));
var attr = xDocument.CreateAttribute("type");
attr.Value = lct;
temp.Attributes.Append(attr);
IE displays the attributes within the opening element, which may be good enough for your purposes.
From the XML perspective, what you're currently creating is called Mixed Content - you have an element that contains both text and other elements. From a hierarchical perspective, those text nodes and other elements occupy the same position within the hierarchy - so I'd assume that this is why IE is displaying them as "equals" - both nested under their parent element and at the same indentation level.
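
To make the mixed-content point concrete, here is a small sketch that builds the same nested structure both ways using only System.Xml. The element names "Window" and "Pane" and the names `MixedContentDemo` and `Build` are assumptions made for illustration; they are not taken from the question's output.

```csharp
using System;
using System.Xml;

public static class MixedContentDemo
{
    // Builds Desktop > Window > Pane, labelling each element either with
    // text content (which produces mixed content) or with a "type" attribute.
    public static string Build(bool useAttribute)
    {
        var doc = new XmlDocument();
        XmlNode parent = doc.AppendChild(doc.CreateElement("Desktop"));

        foreach (string name in new[] { "Window", "Pane" })
        {
            var element = (XmlElement)parent.AppendChild(doc.CreateElement(name));
            if (useAttribute)
            {
                element.SetAttribute("type", name);
            }
            else
            {
                // Adds a text node; the next element appended to this one
                // becomes that text node's sibling, so the content is "mixed".
                element.InnerText = name;
            }
            parent = element;
        }
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Build(useAttribute: false));
        Console.WriteLine(Build(useAttribute: true));
    }
}
```

In the text variant the Window element ends up holding both the text "Window" and the Pane element as children, which is why a viewer may put them on separate lines. In the attribute variant every element contains only elements or nothing at all.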

How to get div by class in HtmlAgilityPack?

I'm following this tutorial, but I have a problem: I don't know how to get an HtmlNode by class name.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(e.Result);
HtmlNode divContainer = htmlDoc.GetElementbyId("directoryItems"); // My problem is here: I want to get the node by class name
if (divContainer != null)
{
    HtmlNodeCollection nodes = divContainer.SelectNodes("//table/tr");
    ....
}
Try this:
HtmlNodeCollection divContainer = htmlDoc.DocumentNode.SelectNodes("//div[@class='myClass']");
This returns a collection of div nodes with class="myClass".
Assuming that you want to select a <div> element whose class attribute value equals "directoryItems", and you know that only one element meets the criteria (or you simply want the first occurrence if there is more than one), you can use the .SelectSingleNode() method with the following XPath query:
HtmlNode divContainer = htmlDoc.DocumentNode
    .SelectSingleNode("//div[@class='directoryItems']");
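
One caveat: [@class='directoryItems'] compares the whole attribute value, so it does not match an element such as class="box directoryItems". The sketch below shows a common XPath 1.0 idiom for matching a single class token, assuming the HtmlAgilityPack NuGet package is referenced. The names `ClassLookupDemo` and `FindDivByClass` and the sample markup are hypothetical, and the helper assumes className is a plain token with no quotes or spaces.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassLookupDemo
{
    // Returns the first div whose class list contains className, or null.
    public static HtmlNode FindDivByClass(HtmlDocument doc, string className)
    {
        // Padding the attribute with spaces lets "directoryItems" match
        // class="box directoryItems" but not class="directoryItemsOld".
        string xpath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
            + className + " ')]";
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup for illustration only.
        doc.LoadHtml(
            "<div class='box directoryItems'><table><tr><td>Row</td></tr></table></div>");

        HtmlNode div = FindDivByClass(doc, "directoryItems");
        if (div != null)
        {
            // The leading "." keeps the search inside this div; a path that
            // starts with "//" would search the whole document instead.
            HtmlNodeCollection rows = div.SelectNodes(".//tr");
            Console.WriteLine(rows?.Count ?? 0);
        }
    }
}
```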

How do I populate child elements in XML?

I have an XML template (Foo.xml) which is defined as follows:
<Parent:Request xmlns:user="http://xxx.com/">
<Parent:ElemA></Parent:ElemA>
<Parent:ChildNode>
<ElemB></ElemB>
<ElemC></ElemC>
</Parent:ChildNode>
<Parent:ParentName></Parent:ParentName>
</Parent:Request>
In my code, I am able to set the parent elements in the xmltemplate as follows:
private readonly XNamespace myNS = "http://ANameSpace.com/";

public void FooA(MyDomainObject DoM)
{
    XElement fooRequestDoc = XElement.Load("Templates/Foo.xml");
    XElement ElemA_El = fooRequestDoc.Descendants(myNS + "ElemA").FirstOrDefault();
    ElemA_El.SetValue(DoM.ElemA);
}
In this case, if DoM.ElemA has a value of "ElementA", then the ElemA_El element is set to this value.
My question is, how do I set specific ChildNode elements such as ElemB or ElemC?
I've tried using "Element" (since I understand it's used to retrieve child elements) as follows:
XElement ElemB_El = fooRequestDoc.Element(myNS + "ChildNode");
But it's returning the entire block rather than just ElemB which I seek.
If you know the name of the tag you could do something like this:
XElement ElemB_El = (from node in fooRequestDoc.Descendants() where node.Name == myNS + "ElemB" select node).FirstOrDefault();
If you don't know the name of the tag, you can take all the child elements of ChildNode like this:
var nodes = (from node in fooRequestDoc.Descendants(myNS + "ChildNode").Elements() select node).ToList();
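
A self-contained sketch of setting ElemB, using only System.Xml.Linq. It rests on two assumptions that differ from the template as posted: the Parent prefix is declared (the posted template declares xmlns:user instead, and an undeclared prefix will not parse), and the namespace URI in the code matches the one in the XML. Note that ElemB and ElemC carry no prefix, so in this sketch they are in no namespace and are looked up by plain name. `ChildElementDemo` and `SetElemB` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ChildElementDemo
{
    // Loads a stand-in for Foo.xml, sets ElemB, and returns the whole request.
    public static XElement SetElemB(string value)
    {
        // Assumed to be the URI bound to the "Parent" prefix.
        XNamespace ns = "http://xxx.com/";

        XElement request = XElement.Parse(
            @"<Parent:Request xmlns:Parent=""http://xxx.com/"">
                <Parent:ElemA></Parent:ElemA>
                <Parent:ChildNode>
                  <ElemB></ElemB>
                  <ElemC></ElemC>
                </Parent:ChildNode>
                <Parent:ParentName></Parent:ParentName>
              </Parent:Request>");

        // Element() returns only a direct child, so step down one level at a time.
        XElement childNode = request.Element(ns + "ChildNode");

        // ElemB is unprefixed and there is no default namespace, so no ns here.
        XElement elemB = childNode?.Element("ElemB");
        elemB?.SetValue(value);

        return request;
    }

    public static void Main()
    {
        XElement request = SetElemB("ElementB");
        Console.WriteLine(request.Descendants("ElemB").First().Value);
    }
}
```

Stepping through Element() twice avoids scanning the whole tree with Descendants() and makes it explicit which ChildNode the ElemB belongs to.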

how do i get the text from a nested <p> tag in an external html using html agility pack?

I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag inside a div. The div has a class value.
html code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
current c# code:
public String getDiscription(string url)
{
    var web = new HtmlWeb();
    var doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//div[@class='discription']");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            string Description = node.InnerHtml;
            return Description;
        }
    }
    // Reached when no matching node is found.
    string error = "could not find text";
    return error;
}
What I don't understand is the syntax of the XPath //div[@class='discription']. I know it is wrong; what should the XPath be?
Use //div[@class='discription']/p.
Breakdown:
//div - All div elements
[@class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements

get html node inner text segmented?

I am trying to parse an HTML page and I am facing a problem: I want to get the inner text of a node in segments, i.e. iterate over the node's children, treating each text segment as a child:
<node1>
This text I WANT on iterate#1
<innernode>This text I WANT on iterate#2</innernode>
This text I WANT on iterate#3
<innernode>This text I WANT on iterate#4</innernode>
This text I WANT on iterate#5
</node1>
I am using HtmlAgilityPack as a parser, but I think I would face this problem with any other HTML parser.
Depending on your .NET version, you could use an extension method that works on the node you want.
I haven't used the HTML Agility Pack, so this is a mix of C# and pseudo-code.
e.g.
public static IEnumerable<string> GetTextSegments(this HtmlNode node)
{
    string nodesText = ... // get the node's text
    yield return nodesText;
    List<HtmlNode> innerNodes = ... // get the list of inner nodes with a
                                    // query like node.SelectNodes("//innerNodes")
    foreach (HtmlNode iNode in innerNodes)
    {
        string iNodeText = ... // get iNode's text
        yield return iNodeText;
    }
}
You could then call this like so:
HtmlNode nodeOfTypeNode1 = ... //
foreach (string text in nodeOfTypeNode1.GetTextSegments())
{
    Console.WriteLine(text);
}
To achieve this, use SelectNodes with an XPath expression that selects the text nodes in document order:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content); // content is the variable containing your HTML.
var items = doc.DocumentNode.SelectNodes("/node1//text()");
foreach (var item in items)
{
    Console.WriteLine(item.OuterHtml.Replace("\r\n", ""));
}
