HTML Agility Pack - Get div without class or id (C#)

I've been trying to follow some solutions here on StackOverflow but I need some help.
This is the source HTML:
<div class="myclass">
<div style="font-size:2em;"> STRING_N1 </div>
<div> STRING_N2 </div>
</div>
And this is my current code:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlcode);
var res = doc.DocumentNode.SelectNodes("//div[@class='myclass']");
foreach (var item in res)
{
    var firstDiv = item.SelectSingleNode("div");
    var content1 = firstDiv.ChildNodes[0].InnerText.Trim();
    richTextBox1.AppendText(content1);
}
So far so good, I can extract "STRING_N1" without a problem. However, I can't figure out how to extract STRING_N2, since its div has no class or id.
Thank you.

You can use LINQ to get descendant divs:
var divs = doc.DocumentNode.SelectNodes("//div[@class='myclass']")
    .SelectMany(x => x.Descendants("div"));
var contents = divs.Select(x => x.InnerText.Trim());
richTextBox1.AppendText(string.Join(Environment.NewLine, contents));
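If only the second div is needed, a positional XPath predicate is another option. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name SecondDivExample and the helper GetSecondDivText are hypothetical names introduced for illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

static class SecondDivExample
{
    // Hypothetical helper: returns the trimmed text of the second child div
    // of a div whose class is exactly "myclass", or null if it is missing.
    public static string GetSecondDivText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // div[2] is a positional predicate: the second div child of its parent.
        var second = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']/div[2]");
        return second?.InnerText.Trim();
    }

    static void Main()
    {
        const string html =
            "<div class=\"myclass\">" +
            "<div style=\"font-size:2em;\"> STRING_N1 </div>" +
            "<div> STRING_N2 </div>" +
            "</div>";

        Console.WriteLine(GetSecondDivText(html) ?? "(not found)"); // prints STRING_N2
    }
}
```

Selecting by position is brittle if the page layout changes, so a predicate such as div[not(@style)] may be a better fit when the target div is distinguished by a missing attribute rather than by its order.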

Related

How to extract values from nested nodes with HtmlAgilityPack c#

I have the following situation where I need to extract some text within a few nested divs using HtmlAgilityPack with C#:
<div class = "content">
<div data-type = "container">
<div class = "level1">
<div class = "level2">
<span>some_text</span>
</div>
</div>
</div>
</div>
The text I need to get is "some_text". I have tried everything but still can't get my head around this.
var doc = new HtmlDocument();
doc.Load("YOUR_HTML_FILENAME.html");
var node = doc.DocumentNode.SelectSingleNode("//span");
string someText = string.Empty;
if (node != null)
    someText = node.InnerText; // result: some_text

Html Agility Pack get specific content from a div

I'm trying to pull text from a "div" and exclude everything else. Can you help me, please?
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull the "article" class I get everything, and I don't know how to exclude class="date", class="news-type", and everything inside them.
Here is the code I use:
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
    name_text.text += node.InnerHtml.Trim();
}
Thank you!
Another way would be to use the XPath text()[normalize-space()] to get the non-empty, direct-child text nodes of the div elements:
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
    var node = div.SelectSingleNode("text()[normalize-space()]");
    Console.WriteLine(node.InnerText.Trim());
}
Output:
"Here is the location of the text i would like to pull"
You want the ChildNodes that are of type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
    name_text.text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}

Cannot get content of specific div with html agility pack

I'm using Html Agility Pack to take some data from a website, but I've run into a problem. I want to get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
The problem is that there are two divs with the classes container middle and details clearfix, and I want the content of only the specific div pasted above. This div contains a dl element holding the entries.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[@class='container middle']");
This returns the wrong result, in particular this:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
This is the complete source code.
Well, you could do the following, for this particular web-page:
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and then work with the HtmlNode via matchDetails[1]. To retrieve other data you can use similar XPath queries, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
    var name = dt[i].InnerHtml;
    var value = dd[i].InnerHtml;
    Console.WriteLine(name + ": " + value);
}
Of course, you need to add null checks, since SelectSingleNode and SelectNodes return null when nothing matches.
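The null checks mentioned above could be sketched roughly as follows. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package and a small inline HTML sample in place of the downloaded page; the class NullCheckExample and the helper ReadDefinitionList are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

static class NullCheckExample
{
    // Hypothetical helper: pairs each dt with the dd at the same index and
    // returns an empty list instead of throwing when a node is missing.
    public static List<string> ReadDefinitionList(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var dl = doc.DocumentNode.SelectSingleNode("//div[@class='details clearfix']//dl");
        if (dl == null) return result;

        // SelectNodes also returns null (not an empty collection) on no match.
        var dt = dl.SelectNodes("./dt");
        var dd = dl.SelectNodes("./dd");
        if (dt == null || dd == null) return result;

        // Guard against a mismatched number of dt and dd elements.
        int count = Math.Min(dt.Count, dd.Count);
        for (int i = 0; i < count; i++)
        {
            result.Add(dt[i].InnerText.Trim() + ": " + dd[i].InnerText.Trim());
        }
        return result;
    }

    static void Main()
    {
        const string html =
            "<div class=\"details clearfix\"><dl>" +
            "<dt>Gara</dt><dd>Super League</dd>" +
            "<dt>Game week</dt><dd>15</dd>" +
            "</dl></div>";

        foreach (var line in ReadDefinitionList(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Returning an empty list is only one possible design; logging or throwing a descriptive exception may be more appropriate when a missing node indicates that the page layout has changed.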
Querying for the div with class details clearfix should return the target div element. There is one crucial detail to be aware of, though: a . before the / is needed to make the XPath relative to the context element referenced by infoDiv. Otherwise the XPath is evaluated against the root document context (as if it were called on doc.DocumentNode instead of on infoDiv):
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[@class='details clearfix']");

li in htmlagilitypack c#

I want to get the label and strong values from the following li elements:
<div class="property-summary">
<h3>Listing summary</h3>
<ul>
<li>
<label>Reference</label>
<strong>BR-S-4301</strong>
</li>
<li>
<label>Type</label>
<strong>Apartment</strong>
</li>
<li>
<label>City</label>
<strong>Dubai</strong>
</li>
<li>
<label>Community</label>
<strong>Palm Jumeirah</strong>
</li>
<li>
<label>Subcommunity</label>
<strong>Tiara Residences</strong>
</li>
</ul>
</div>
Here is my C# code:
var dataNode = rootNode.SelectNodes("//div[normalize-space(@class)='property-summary']");
Now how do I get the values? The line below is not working for me:
var Node = dataNode.SelectSingleNode(".//li/strong");
There are a couple of ways to do it.
1
var labelNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/label");
var strongNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/strong");
foreach (var node in labelNodes)
{
    Debug.WriteLine(node.InnerText.Trim());
}
foreach (var node in strongNodes)
{
    Debug.WriteLine(node.InnerText.Trim());
}
2
var liNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li");
foreach (var node in liNodes)
{
    Debug.WriteLine(node.SelectSingleNode("label").InnerText.Trim());
    Debug.WriteLine(node.SelectSingleNode("strong").InnerText.Trim());
}
Check that the nodes exist (that the results are not null) before using them in any real code.
If you want to get all the label tags, you can use:
IEnumerable<HtmlNode> labels = dataNode.Descendants("label");
And the same for the strong tags:
IEnumerable<HtmlNode> strongs = dataNode.Descendants("strong");
You can also use:
var dataNode = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']")[0];
HtmlNodeCollection strongs = dataNode.SelectNodes(".//li/strong");
HtmlNodeCollection labels = dataNode.SelectNodes(".//li/label");
To get the text from strongs or labels use:
foreach (var strong in strongs)
{
    string strongText = strong.InnerText.Trim();
}
You may also consider switching to one of these HTML parsing libraries, which provide excellent jQuery-like selector features:
http://nsoup.codeplex.com/
http://github.com/jamietre/csquery

Html Agility Pack: how to parse a webresponse and get a specified html element in c#

I googled my problem and found Html Agility Pack for parsing HTML in C#. But there are no good examples and I can't get it to work for my purpose. I have an HTML document that has a part like this:
<div class="pray-times-holder">
<div class="pray-time">
<div class="labels">
Time1:</div>
04:28:24
</div>
<div class="pray-time">
<div class="labels">
Time2:</div>
06:04:41
</div>
</div>
I want to get the values for Time1 and Time2, e.g. Time1 has the value 04:28:24 and Time2 has the value 06:04:41. Can you help me please?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode
    .Descendants("div")
    .Where(n => n.Attributes["class"] != null && n.Attributes["class"].Value == "pray-time")
    .Select(n => n.InnerText.Replace("\r\n", "").Trim())
    .ToArray();
This console application code:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class = 'labels']"))
{
    Console.WriteLine(node.NextSibling.InnerText.Trim());
}
will output this:
04:28:24
06:04:41
