How to extract values from nested nodes with HtmlAgilityPack in C#

I need to extract some text from within a few nested divs using HtmlAgilityPack with C#:
<div class = "content">
<div data-type = "container">
<div class = "level1">
<div class = "level2">
<span>some_text</span>
</div>
</div>
</div>
</div>
The text I need to get is "some_text". I have tried everything but still can't get my head around this.

var doc = new HtmlDocument();
doc.Load("YOUR_HTML_FILENAME.html");
var node = doc.DocumentNode.SelectSingleNode("//span");
string someText = string.Empty;
if (node != null)
someText = node.InnerText; //result >> some_text
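The `//span` query above matches the first span anywhere in the document. If the page contains other spans, a more specific path may be safer. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and using an inline copy of the markup from the question; the class name `NestedSpanExample` is only illustrative.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class NestedSpanExample
{
    static void Main()
    {
        // Inline copy of the markup from the question.
        const string html = @"
<div class=""content"">
  <div data-type=""container"">
    <div class=""level1"">
      <div class=""level2"">
        <span>some_text</span>
      </div>
    </div>
  </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the outer and inner class names instead of matching any <span>.
        var span = doc.DocumentNode.SelectSingleNode(
            "//div[@class='content']//div[@class='level2']/span");

        // SelectSingleNode returns null when nothing matches.
        string someText = span?.InnerText.Trim() ?? string.Empty;
        Console.WriteLine(someText); // some_text
    }
}
```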

Related

How can I get this text from h4?

(Sorry about my English, I'm Brazilian.)
I'm trying to get the InnerText of an h4 tag using HtmlAgilityPack. I managed to get that kind of value for 3 of the 4 tags I need from the website, but the last one is the most important and it just returns an empty value.
Is it possible that the way the website was built requires a different way to get this value?
This is the specific h4 whose InnerText ("356.386.496,02") I'm trying to extract:
<h4 class="text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3">
<span class="align-middle fs-12 fs-lg-12 pr-4">R$</span>
"356.386.496,02"
</h4>
I've tried this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
var nodes = htmlDocument.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
//Result in console:
//=>
Note that the SelectNodes method doesn't return null; it finds the h4 node perfectly, but the InnerText value is "".
Try replacing "356.386.496,02" with 356.386.496,02 or with ""356.386.496,02"" (doubled quotes, as required inside a verbatim string).
This solution should work:
public static void Main()
{
var html =
@"<h4 class=""text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3"">
<span class=""align-middle fs-12 fs-lg-12 pr-4"">R$</span>
""56.386.496,02""
</h4>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//h4[@class='text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3']");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerText);
}
}
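If only the number is wanted, without the "R$" from the nested span, one option is to read the h4's own text nodes rather than its full InnerText. This is a sketch under two assumptions: that the quotation marks shown in the question are just how the browser inspector displays a text node, and that the value is actually present in the downloaded HTML. HtmlAgilityPack does not execute JavaScript, so a value filled in by a script would not appear in the parsed document at all. The class name `OwnTextExample` and the shortened `contains(@class, ...)` test are illustrative choices.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class OwnTextExample
{
    static void Main()
    {
        // Markup from the question, with the amount as a plain text node.
        const string html = @"
<h4 class=""text-black--opacity-60 fs-20 fs-sm-42 fs-lg-40 w-100 mt-3"">
  <span class=""align-middle fs-12 fs-lg-12 pr-4"">R$</span>
  356.386.496,02
</h4>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match on one class token instead of the full attribute string.
        var h4 = doc.DocumentNode.SelectSingleNode("//h4[contains(@class, 'fs-sm-42')]");
        if (h4 == null) return;

        // Keep only text nodes that are direct children of the h4,
        // which skips the text inside the nested <span>.
        var ownText = h4.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        Console.WriteLine(string.Join(" ", ownText));
    }
}
```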

Cannot get content of specific div with html agility pack

I'm using Html Agility Pack to take some data from a website, and I have run into a small problem. I want to get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
The problem is that there are two divs with the classes container middle and details clearfix, and I want to get the content of only the specific div pasted above. This div has a dl tag containing the dt/dd pairs.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[@class='container middle']");
This returns the wrong result, namely:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
This is the complete source code.
Well, for this particular web page you could do the following:
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and then work with the HtmlNode at matchDetails[1]. To retrieve other data you can use similar XPath queries, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
var name = dt[i].InnerHtml;
var value = dd[i].InnerHtml;
Console.WriteLine(name + ": " + value);
}
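Of course, you need to add null checks: SelectSingleNode and SelectNodes return null when nothing matches, so any of the calls above can lead to a NullReferenceException.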
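A null-safe version of the dt/dd loop might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and runs against a hypothetical, trimmed-down stand-in for the match page rather than the live site; the class name `NullSafeDetailsExample` is illustrative.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class NullSafeDetailsExample
{
    static void Main()
    {
        // Hypothetical, trimmed-down stand-in for the match page.
        const string html = @"
<div class=""container middle"">
  <dl>
    <dt>Gara</dt><dd>Super League</dd>
    <dt>Game week</dt><dd>15</dd>
  </dl>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dl = doc.DocumentNode.SelectSingleNode("//div[@class='container middle']//dl");
        if (dl == null)
        {
            Console.WriteLine("No <dl> found; the page layout may have changed.");
            return;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var dt = dl.SelectNodes("./dt");
        var dd = dl.SelectNodes("./dd");
        if (dt == null || dd == null) return;

        // Guard against a mismatched number of <dt> and <dd> elements.
        int count = Math.Min(dt.Count, dd.Count);
        for (int i = 0; i < count; i++)
        {
            Console.WriteLine(dt[i].InnerText.Trim() + ": " + dd[i].InnerText.Trim());
        }
    }
}
```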
Querying for the div with class details clearfix should return the target div element. There is one crucial detail you need to be aware of, though: a . before the // is needed to make the XPath relative to the context element referenced by infoDiv. Otherwise the XPath is evaluated against the root document context (as if it were called on doc.DocumentNode instead of on infoDiv):
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[@class='details clearfix']");
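The difference between `//` and `.//` can be seen on a small document. This is a sketch with hypothetical markup (two divs share the class container middle, only one of them inside the info block), assuming the HtmlAgilityPack NuGet package is referenced; the class name `RelativeXPathExample` and the shortened `block_match_info` class are illustrative.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class RelativeXPathExample
{
    static void Main()
    {
        // Hypothetical markup: the first "container middle" is outside the info block.
        const string html = @"
<div class=""container middle""><h3>0 - 0</h3></div>
<div class=""block_match_info"">
  <div class=""container middle""><span>match details</span></div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info']");

        // "//" starts at the document root, even though it is called on infoDiv.
        var fromRoot = infoDiv.SelectSingleNode("//div[@class='container middle']");

        // ".//" starts at infoDiv and only looks at its descendants.
        var fromContext = infoDiv.SelectSingleNode(".//div[@class='container middle']");

        Console.WriteLine(fromRoot.InnerText.Trim());    // expected: 0 - 0
        Console.WriteLine(fromContext.InnerText.Trim()); // expected: match details
    }
}
```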

How to get specific data using HtmlAgilityPack

I am using HtmlAgilityPack for scraping data.
Here is the link that I am scraping data from:
This Link
The structure is something like this:
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two pieces of data from there: one is "(dhaarmika)" and the other is the id of the i element, "bn7483". Using this code
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[@class='in']");
I was able to get the first value, "(dhaarmika)", but I couldn't get the second one.
Could anyone tell me how to get the second value?
Another possible way is to select the preceding sibling of the <span> you already found:
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[@class='in']");
// Start from the span found above and step back to the <i> that carries an id.
var i = span.SelectSingleNode("preceding-sibling::i[@id]")
.Attributes["id"]
.Value;
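The same lookup can be tried offline with null handling added. This is a sketch, assuming the HtmlAgilityPack NuGet package is referenced; the inline markup is a simplified copy of the structure from the question (with the i element closed explicitly), and the class name `SiblingAttributeExample` is illustrative. `GetAttributeValue` returns the supplied default when the attribute is missing, which avoids dereferencing a null `Attributes["id"]`.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class SiblingAttributeExample
{
    static void Main()
    {
        // Simplified copy of the structure shown in the question.
        const string html = @"
<div id=""left"">
  <h2>
    <i id=""bn7483"" class=""fa fa-volume-up fa-lg in au"" title=""Speak!""></i>
    <span class=""in"">(dhaarmika) </span>
  </h2>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var span = doc.DocumentNode.SelectSingleNode("//span[@class='in']");
        if (span == null) return;

        // Step from the span back to the sibling <i> element that has an id.
        var icon = span.SelectSingleNode("preceding-sibling::i[@id]");

        // GetAttributeValue falls back to the default instead of throwing.
        string id = icon?.GetAttributeValue("id", string.Empty) ?? string.Empty;

        Console.WriteLine(span.InnerText.Trim());
        Console.WriteLine(id);
    }
}
```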

HTML Agility Pack - Get div without class or id (C#)

I've been trying to follow some solutions here on StackOverflow but I need some help.
This is the source HTML:
<div class="myclass">
<div style="font-size:2em;"> STRING_N1 </div>
<div> STRING_N2 </div>
</div>
And this is my current code:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlcode);
var res = doc.DocumentNode.SelectNodes("//div[@class='myclass']");
foreach (var item in res)
{
var firstDiv = item.SelectSingleNode("div");
var content1 = firstDiv.ChildNodes[0].InnerText.Trim();
richTextBox1.AppendText(content1.ToString());
}
So far so good: I can extract "STRING_N1" without a problem. However, I can't figure out how to extract STRING_N2, since its div has no class or id.
Thank you.
You can use LINQ to get descendant divs:
var divs = doc.DocumentNode.SelectNodes("//div[@class='myclass']")
.SelectMany(x => x.Descendants("div"));
var contents = divs.Select(x => x.InnerText.Trim());
richTextBox1.AppendText(string.Join(Environment.NewLine, contents));
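When only the second div is needed, an XPath positional predicate is another option, since `div[2]` counts only element children named div. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and using the markup from the question inline; the class name `SecondDivExample` is illustrative.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class SecondDivExample
{
    static void Main()
    {
        // Markup from the question.
        const string html = @"
<div class=""myclass"">
  <div style=""font-size:2em;""> STRING_N1 </div>
  <div> STRING_N2 </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // div[2] is the second <div> child; XPath positions are 1-based.
        var second = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']/div[2]");

        Console.WriteLine(second?.InnerText.Trim() ?? "(not found)"); // STRING_N2
    }
}
```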

Html Agility Pack: how to parse a webresponse and get a specified html element in c#

I googled my problem and found Html Agility Pack for parsing HTML in C#, but I could not find good examples and can't work out how to use it for my purpose. I have an HTML document that contains a part like this:
<div class="pray-times-holder">
<div class="pray-time">
<div class="labels">
Time1:</div>
04:28:24
</div>
<div class="pray-time">
<div class="labels">
Time2:</div>
06:04:41
</div>
</div>
I want to get the values for Time1 and Time2: in this example, Time1 has the value 04:28:24 and Time2 has the value 06:04:41. Can you help me please?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode
.Descendants("div")
.Where(n => n.Attributes["class"] != null && n.Attributes["class"].Value == "pray-time")
.Select(n => n.InnerText.Replace("\r\n","").Trim())
.ToArray();
This console application code:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml); // Load expects a file path or stream; use doc.LoadHtml(...) for an HTML string
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class = 'labels']"))
{
Console.WriteLine(node.NextSibling.InnerText.Trim());
}
will output this:
04:28:24
06:04:41
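To keep each label together with its time, the two approaches above can be combined into a dictionary. This is a sketch, assuming the HtmlAgilityPack NuGet package is referenced and that each value is the text node directly after its labels div, as in the markup from the question; the class name `PrayTimesExample` and the `times` dictionary are illustrative.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class PrayTimesExample
{
    static void Main()
    {
        // Markup from the question.
        const string html = @"
<div class=""pray-times-holder"">
  <div class=""pray-time"">
    <div class=""labels"">
      Time1:</div>
    04:28:24
  </div>
  <div class=""pray-time"">
    <div class=""labels"">
      Time2:</div>
    06:04:41
  </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var times = new Dictionary<string, string>();
        var labels = doc.DocumentNode.SelectNodes(
            "//div[@class='pray-time']/div[@class='labels']");

        // SelectNodes returns null when there are no matches.
        if (labels != null)
        {
            foreach (HtmlNode label in labels)
            {
                // "Time1:" becomes "Time1"; the value is the text node after the label.
                string name = label.InnerText.Trim().TrimEnd(':');
                string value = label.NextSibling?.InnerText.Trim() ?? string.Empty;
                times[name] = value;
            }
        }

        foreach (var pair in times)
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```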
