How can I get this text from h4? - c#

(Sorry about my english, I'm brazilian)
I'm trying to get the InnerText from a h4 tag using the HtmlAgilityPack, I managed to get that type of value in 3 of 4 tags in the web site that I need. But the last one is the most important and it just returns an empty value.
Is it possible, that the structure of how the website was build requires a different way to get this value?
This is the specific h4 that I'm trying to extract InnetText ("356.386.496,02"):
<h4 class="text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3">
<span class="align-middle fs-12 fs-lg-12 pr-4">R$</span>
"356.386.496,02"
</h4>
I've tried this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
var nodes = htmlDocument.DocumentNode.SelectNodes("//h4[#class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
//Result in console:
//=>
Note that the SelectNodes method doesn't return null, it find the h4 node perfectly, but the InnerText value is "".

try to replace "356.386.496,02" with 356.386.496,02 or with ""356.386.496,02""
this solution should be work
public static void Main()
{
var html =
#"<h4 class=""text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3"">
<span class=""align-middle fs-12 fs-lg-12 pr-4"">R$</span>
""56.386.496,02""
</h4>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//h4[#class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerText);
}
}

Related

Get innerText from <div class> with an <a href> child

I am working with a webBrowser in C# and I need to get the text from the link. The link is just a href without a class.
its like this
<div class="class1" title="myfirstClass">
<a href="link.php">text I want read in C#
<span class="order-level"></span>
Shouldn't it be something like this?
HtmlElementCollection theElementCollection = default(HtmlElementCollection);
theElementCollection = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement curElement in theElementCollection)
{
if (curElement.GetAttribute("className").ToString() == "class1")
{
HtmlElementCollection childDivs = curElement.Children.GetElementsByName("a");
foreach (HtmlElement childElement in childDivs)
{
MessageBox.Show(childElement.InnerText);
}
}
}
This is how you get the element by tag name:
String elem = webBrowser1.Document.GetElementsByTagName("div");
And with this you should extract the value of the href:
var hrefLink = XElement.Parse(elem)
.Descendants("a")
.Select(x => x.Attribute("href").Value)
.FirstOrDefault();
If you have more then 1 "a" tag in it, you could also put in a foreach loop if that is what you want.
EDIT:
With XElement:
You can get the content including the outer node by calling element.ToString().
If you want to exclude the outer tag, you can call String.Concat(element.Nodes()).
To get innerHTML with HtmlAgilityPack:
Install HtmlAgilityPack from NuGet.
Use this code.
HtmlWeb web = new HtmlWeb();
HtmlDocument dc = web.Load("Your_Url");
var s = dc.DocumentNode.SelectSingleNode("//a[#name="a"]").InnerHtml;
I hope it helps!
Here I created the console app to extract the text of anchor.
static void Main(string[] args)
{
string input = "<div class=\"class1\" title=\"myfirstClass\"><a href=\"link.php\">text I want read in C#<span class=\"order-level\"></span>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
foreach (HtmlNode item in doc.DocumentNode.Descendants("div"))
{
var link = item.Descendants("a").First();
var text = link.InnerText.Trim();
Console.Write(text);
}
Console.ReadKey();
}
Note this is htmlagilitypack question so tag the question properly.

How to extract data from webpage using C#

I'm trying to extract text from this HTML tag
<span id="example1">sometext</span>
And I have this code:
using System;
using System.Net;
using HtmlAgilityPack;
namespace GC_data_console
{
class Program
{
public static void Main(string[] args)
{
using (var client = new WebClient())
{
// Download the HTML
string html =
client.DownloadString("https://www.requestedwebsite.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in
doc.DocumentNode.SelectNodes("//span"))
{
HtmlAttribute href = link.Attributes["id='example1'"];
if (href != null)
{
Console.WriteLine(href.Value.ToString());
Console.ReadLine();
}
}
}
}
}
}
But I am still not getting the text sometext.
But if I insert:
HtmlAttribute href = link.Attributes["id"];
I'll get all the IDs names. What am I doing wrong?
You need to first understand difference between HTML Node and HTMLAttribute. You code is nowhere near to solve the problem.
HTMLNode represents the tags used in HTML such as span,div,p,a and lot other. HTMLAttribute represents attribute which are used for the HTMLNodes such as href attribute is used for a, and style,class, id, name etc. attributes are used for almost all the HTML tags.
In below HTML
<span id="firstName" style="color:#232323">Some Firstname</span>
span is HTMLNode while id and style are the HTMLAttributes. and you can get value Some FirstName by using HtmlNode.InnerText property.
Also selecting HTMLNodes from HtmlDocument is not that straight forward. You need to provide proper XPath to select node you want.
Now in your code if you want to get the text written in <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span>, which is part of HTML of someurl.com, you need to write following code.
using (var client = new WebClient())
{
string html = client.DownloadString("https://www.someurl.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
//Selecting all the nodes with tagname `span` having "id=ctl00_ContentBody_CacheName".
var nodes = doc.DocumentNode.SelectNodes("//span")
.Where(d => d.Attributes.Contains("id"))
.Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
The above code will select all the span tags which are directly under the document node of the HTML. Tags which are located deep inside the hierarchy you need to use different XPath.
This should help you resolve your issue.

How to get text between two div tags with some class attribute with HTMLagility

I want to get some text from two html div from HTML file.
After some searches i decided to use HTMLAgility Pack for doing this.
I wrote this code :
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*div[#class='item']");
string value = node.InnerText;
'result' is my content of the File.
But i get this exception : 'Expression must evaluate to a node-set'
And this is some of mt file's content :
<div class="Clear" style="height:15px;"></div>
<div class='Container Select' id="Container_1">
<div class='Item'><div class='Part Lable'>موضوع : </div><div class='Part ...
try either
"//*/div[#class='item']"
or simply
"//div[#class='item']"
have you tried using XPath
for example if I wanated to find a if a node is selected in my example I would do the following
string xpath = null;
XmlNode configNode = configDom.DocumentElement;
// collect selected nodes in node list
XmlNodeList nodeList =
configNode.SelectNodes(#"//*[#status='checked']");
in your case you would do the following
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*/div[#class='item']");
string value = node.InnerText;

how do i get the text from a nested <p> tag in an external html using html agility pack?

I am trying to get some text from an external site. The text I am trying to get is nested in a paragraph tag. The div has has a class value
html code snippet:
<div class="discription"><p>this is the text I want to grab</p></div>
current c# code:
public String getDiscription(string url)
{
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//div[#class='discription']");
if (nodes != null)
{
foreach (var node in nodes)
{
string Description = node.InnerHtml;
return Description;
}
} else
{
string error = "could not find text";
return error;
}
}
what I dont understand is the syntax of the xpath //div[#class='discription'] I know it is wrong what should the xpath be?
use //div[#class='discription']/p.
Breakdown:
//div - All div elements
[#class='discription'] - With a class attribute whose value is discription
/p - Select the child p elements

how to access child node from node in htmlagility pack

<html>
<body>
<div class="main">
<div class="submain"><h2></h2><p></p><ul></ul>
</div>
<div class="submain"><h2></h2><p></p><ul></ul>
</div>
</div>
</body>
</html>
I loaded the html into an HtmlDocument. Then I selected the XPath as submain. Then I dont know how to access to each tags i.e h2, p separately.
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[#class=\"submain\"]");
foreach (HtmlAgilityPack.HtmlNode node in nodes) {}
If I Use node.InnerText I get all the texts and InnerHtml is also not useful. How to select separate tags?
The following will help:
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[#class=\"submain\"]");
foreach (HtmlAgilityPack.HtmlNode node in nodes) {
//Do you say you want to access to <h2>, <p> here?
//You can do:
HtmlNode h2Node = node.SelectSingleNode("./h2"); //That will get the first <h2> node
HtmlNode allH2Nodes= node.SelectNodes(".//h2"); //That will search in depth too
//And you can also take a look at the children, without using XPath (like in a tree):
HtmlNode h2Node = node.ChildNodes["h2"];
}
You are looking for Descendants
var firstSubmainNodeName = doc
.DocumentNode
.Descendants()
.Where(n => n.Attributes["class"].Value == "submain")
.First()
.InnerText;
From memory, I believe that each Node has its own ChildNodes collection, so within your for…each block you should be able to inspect node.ChildNodes.

Categories

Resources