How to access child nodes from a node in HtmlAgilityPack (C#)

<html>
<body>
<div class="main">
<div class="submain"><h2></h2><p></p><ul></ul>
</div>
<div class="submain"><h2></h2><p></p><ul></ul>
</div>
</div>
</body>
</html>
I loaded the HTML into an HtmlDocument and selected the submain divs with XPath, but I don't know how to access each tag (h2, p, and so on) separately.
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class=\"submain\"]");
foreach (HtmlAgilityPack.HtmlNode node in nodes) {}
If I use node.InnerText I get all the text at once, and InnerHtml is not useful either. How do I select the individual tags?

The following will help:
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class=\"submain\"]");
foreach (HtmlAgilityPack.HtmlNode node in nodes) {
    // To access the <h2> and <p> elements here, you can do:
    HtmlNode h2Node = node.SelectSingleNode("./h2"); // gets the first direct-child <h2>
    HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); // searches at any depth; null if nothing matches
    // You can also look at the children without XPath (like walking a tree):
    HtmlNode h2Child = node.ChildNodes["h2"];
}

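You are looking for Descendants:
// Requires: using System.Linq;
var firstSubmainText = doc
    .DocumentNode
    .Descendants()
    // GetAttributeValue avoids a NullReferenceException on nodes that have no class attribute
    .Where(n => n.GetAttributeValue("class", "") == "submain")
    .First()
    .InnerText;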

From memory, I believe that each node has its own ChildNodes collection, so within your foreach block you should be able to inspect node.ChildNodes.
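A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The sample markup and its text content are made up for illustration, and the element-type filter is an addition: ChildNodes can also contain text and comment nodes, so the loop skips anything that is not an element.

```csharp
using System;
using HtmlAgilityPack;

class ChildNodesSketch
{
    static void Main()
    {
        // Hypothetical sample markup modeled on the question's structure
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class=\"main\"><div class=\"submain\"><h2>Title</h2><p>Body</p><ul></ul></div></div>");

        var nodes = doc.DocumentNode.SelectNodes("//div[@class='submain']");
        if (nodes == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in nodes)
        {
            foreach (HtmlNode child in node.ChildNodes)
            {
                // Skip text and comment nodes, keeping only element children
                if (child.NodeType != HtmlNodeType.Element) continue;
                Console.WriteLine(child.Name + ": " + child.InnerText);
            }
        }
    }
}
```

Each element child is visited in document order, so the h2, p, and ul of every submain div can be handled separately by checking child.Name.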

Related

How can I get this text from h4?

(Sorry about my English, I'm Brazilian.)
I'm trying to get the InnerText of an h4 tag using HtmlAgilityPack. I managed to get this kind of value from 3 of the 4 tags I need on the website, but the last one is the most important, and it just returns an empty value.
Is it possible that the way the website is built requires a different approach to get this value?
This is the specific h4 whose InnerText I'm trying to extract ("356.386.496,02"):
<h4 class="text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3">
<span class="align-middle fs-12 fs-lg-12 pr-4">R$</span>
"356.386.496,02"
</h4>
I've tried this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
var nodes = htmlDocument.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
//Result in console:
//=>
Note that SelectNodes doesn't return null; it finds the h4 node perfectly, but the InnerText value is "".
Try replacing "356.386.496,02" with 356.386.496,02 or with ""356.386.496,02"".
This solution should work:
public static void Main()
{
var html =
@"<h4 class=""text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3"">
<span class=""align-middle fs-12 fs-lg-12 pr-4"">R$</span>
""56.386.496,02""
</h4>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerText);
}
}

Html Agility Pack get specific content from a div

I'm trying to pull the text from a "div" and exclude everything else. Can you help me, please?
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull the "article" class I get everything, and I don't know how to exclude class="date", class="news-type", and everything inside them.
Here is the code I use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
name_text.text += node.InnerHtml.Trim();
}
Thank you!
Another way would be to use the XPath text()[normalize-space()] to get the non-empty, direct-child text nodes of the div elements:
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
Output:
"Here is the location of the text i would like to pull"
You want the ChildNodes that are of type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}

Select link inside div tag

I would like to get a link (the URL, to be specific) inside a div with a given class. This is the code I have, which gets me the text inside the div (Some text...).
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='content']"))
{
//saves text (node.InnerText) in array
}
This is the HTML from the site. I would like to get www.google.com
<div class="content">
<p>Some text...
<a href="www.google.com">LINK</a>
</p>
</div>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='novica']/p/a[@href='www.google.com']"))
{
//saves text (node.InnerText) in array
}
That code does not match the HTML as you have written it; however, you have two options:
Once you have the node for the div, use .Descendants("a") (or walk its children) to pull out the link, then get its href attribute.
Amend your SelectNodes() XPath to select the a tag instead: //div[@class='novica']/p/a.
The first is the better choice if you also need the .InnerText of that element to get Some text...; the second would be faster.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='novica']"))
{
    // Collect the href attribute (the URL) of every <a> under the div
    var links = node.Descendants("a").Select(n => n.GetAttributeValue("href", "")).ToList();
}

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.
I would like to write a program that prints all the li elements inside a <div> after a certain <h4> header. Could anyone give me some help or sample code?
<div id="content">
<h4>Header</h4>
<ul>
<li><a href...></a> THIS IS WHAT I WANT TO GET</li>
</ul>
</div>
When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!
What parts are constant:
The 'id' in the DIV?
The h4
Searching a complete HTML document and reacting on H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);
// SelectNodes returns null when no <div> is found
var allDivs = doc.DocumentNode.SelectNodes("//div");
if (allDivs != null)
{
    var divs = allDivs
        .Where(div => div.Descendants().Any(d => d.Name == "h4"));
    // You now have all of the divs with an 'h4' inside of them.
    // The rest of the element structure, if constant, needs to be examined to get
    // the rest of the content you're after.
}
If it's your own web page, why would you need to parse the HTML? Wouldn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to your UL and LI elements (with the runat="server" attribute) and they would be available in the code-behind.
Could you explain your scenario and what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.
EDIT
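I think this should work:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
foreach (HtmlNode p in doc.DocumentNode.SelectNodes("//div"))
{
    if (p.GetAttributeValue("id", "") == "content")
    {
        foreach (HtmlNode ul in p.Elements("ul"))
        {
            // Step back over whitespace text nodes to the preceding element
            HtmlNode previous = ul.PreviousSibling;
            while (previous != null && previous.NodeType != HtmlNodeType.Element)
                previous = previous.PreviousSibling;
            if (previous != null && previous.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li is each <li> of the list that follows the header
                }
            }
        }
    }
}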
If all you want is the stuff that's between all <li></li> tags underneath the <div id="content"> tag and comes right after a <h4> tag, then this should suffice:
//Load your document first.
//Load() accepts a Stream, a TextReader, or a string path to the file on your computer
//If the entire document is loaded into a string, then use .LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.Load(@"c:\foobar.html");
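//Select all the <li> nodes that are inside of an element with the id of "content"
// and come directly after an <h4> tag.
// The leading '.' keeps the search relative to the "content" element;
// an XPath starting with '//' would search the whole document.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
    .SelectNodes(".//h4/following-sibling::*[1]//li");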
//Iterate through each <li> node and print the inner text to the console
foreach (HtmlNode listElement in processMe)
{
Console.WriteLine(listElement.InnerText);
}

How to get all input elements in a form with HtmlAgilityPack without getting a null reference error

Example HTML:
<html><body>
<form id="form1">
<input name="foo1" value="bar1" />
<!-- Other elements -->
</form>
<form id="form2">
<input name="foo2" value="bar2" />
<!-- Other elements -->
</form>
</body></html>
Test code:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
{
Console.WriteLine(node.Attributes["value"].Value);
}
The statement doc.GetElementbyId("form2").SelectNodes(".//input") gives me a null reference.
Did I do anything wrong? Thanks.
You can do the following:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
HtmlNode secondForm = doc.GetElementbyId("form2");
foreach (HtmlNode node in secondForm.Elements("input"))
{
HtmlAttribute valueAttribute = node.Attributes["value"];
if (valueAttribute != null)
{
Console.WriteLine(valueAttribute.Value);
}
}
By default, HTML Agility Pack parses forms as empty nodes because they are allowed to overlap other HTML elements. The first line (HtmlNode.ElementsFlags.Remove("form");) disables this behavior, allowing you to get the input elements inside the second form.
Update:
Example of form elements overlap:
<table>
<form>
<!-- Other elements -->
</table>
</form>
The form element begins inside a table but is closed outside the table element. This is allowed in the HTML specification and HTML Agility Pack has to deal with it.
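A minimal sketch that combines the ElementsFlags change with explicit null guards, assuming the HtmlAgilityPack NuGet package is referenced. The markup is the question's example inlined as a string, and the class name and the "No inputs found" message are made up for illustration. GetElementbyId returns null when the id is absent and SelectNodes returns null when the XPath matches nothing, so both results are checked before the loop.

```csharp
using System;
using HtmlAgilityPack;

class FormInputsSketch
{
    static void Main()
    {
        // Let <form> keep its child elements instead of being parsed as an empty node
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body>" +
            "<form id=\"form1\"><input name=\"foo1\" value=\"bar1\" /></form>" +
            "<form id=\"form2\"><input name=\"foo2\" value=\"bar2\" /></form>" +
            "</body></html>");

        HtmlNode form = doc.GetElementbyId("form2");
        // form is null if the id is missing; SelectNodes is null if nothing matches
        HtmlNodeCollection inputs = form?.SelectNodes(".//input");
        if (inputs == null)
        {
            Console.WriteLine("No inputs found");
            return;
        }

        foreach (HtmlNode input in inputs)
        {
            // GetAttributeValue returns the fallback instead of throwing when the attribute is missing
            Console.WriteLine(input.GetAttributeValue("value", ""));
        }
    }
}
```

Guarding the collection rather than wrapping the loop in a try/catch keeps the "no match" case explicit, which is usually the case that matters when scraping pages whose structure can change.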
Just get them in a collection:
HtmlNodeCollection resultCollection = doc.DocumentNode.SelectNodes("//*[@type='text']");
