How to get value of nested img src with Html Agility Pack? - c#

I'm trying to get nested img src values with Html Agility Pack and have tried multiple things with no success. There are 17 of these images I need to grab, all of them nested, and I can't figure it out for the life of me. Here is the bare-bones HTML; I need the value of src on the last line of each img tag:
<div class="largeTitle">
  <article class="articleItem" data-id="0000">
    <a href="#blank_link" class="img">
      <img class=" lazyloaded" data-src="#blank_link" alt="test" onerror="script"
           src="image_link.jpg">
    </a>
  </article>
  <article class="articleItem" data-id="0001">
    <a href="#blank_link" class="img">
      <img class=" lazyloaded" data-src="#blank_link" alt="test" onerror="script"
           src="image_link.jpg">
    </a>
  </article>
</div>

With the url you mentioned in comments, you can do:
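var web = new HtmlWeb();
var doc = web.Load("https://www.investing.com/");
// SelectNodes returns null when nothing matches, so guard before iterating.
var images = doc.DocumentNode.SelectNodes("//*[contains(@class,'js-articles')]//a[@class='img']//img");
if (images != null)
{
    foreach (var image in images)
    {
        // GetAttributeValue returns the fallback instead of throwing when the attribute is missing.
        string source = image.GetAttributeValue("data-src", "");
        string label = image.GetAttributeValue("alt", "");
        Console.WriteLine($"\"{label}\" {source}");
    }
}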
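The same idea can be tried offline against the markup from the question. This is only a minimal sketch: it assumes the HtmlAgilityPack NuGet package, C# 9+ top-level statements, and a simplified copy of the question's HTML (placeholder values, anchor tag repaired). Unlike the answer above, it reads src rather than data-src.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Simplified markup modeled on the question; all values are placeholders.
var html = @"
<div class='largeTitle'>
  <article class='articleItem' data-id='0000'>
    <a href='#blank_link' class='img'>
      <img class=' lazyloaded' data-src='#blank_link' alt='test' src='image_link.jpg'>
    </a>
  </article>
  <article class='articleItem' data-id='0001'>
    <a href='#blank_link' class='img'>
      <img class=' lazyloaded' data-src='#blank_link' alt='test' src='image_link.jpg'>
    </a>
  </article>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var sources = new List<string>();
// Every <img> nested at any depth inside an <article class="articleItem">.
var images = doc.DocumentNode.SelectNodes("//article[@class='articleItem']//img");
if (images != null) // SelectNodes returns null when nothing matches
{
    foreach (var image in images)
    {
        // Falls back to an empty string if the attribute is absent.
        sources.Add(image.GetAttributeValue("src", ""));
    }
}

foreach (var src in sources)
{
    Console.WriteLine(src);
}
```

On the real page the class names and nesting may differ, so the XPath expression would need adjusting.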

Related

How to get content of <div> with HtmlAgilityPack - C#

I have html source:
<div class="lit-plot">
<b class="red">خلاصه داستان :</b>
Content
</div>
I want to get the value of <div> (not <b> and only the string "Content") with HtmlAgilityPack. What is the best way to do this?
Here is what I am doing. movieDesHTMLSource holds the given HTML source. I don't know how to access the InnerHtml!
string movieDes;
// Extract the movie's description HTML source
var movieDesHTMLSource = new HtmlAgilityPack.HtmlDocument();
movieDesHTMLSource.LoadHtml(postPageHTMLDes[95].InnerHtml);
var src = movieDesHTMLSource.DocumentNode.SelectNodes("//div[contains(@class,'lit-plot')]");
Use the XPath text() node test to retrieve just the text that sits directly inside the div tag.
var html = @"<body>
<div class='lit-plot'>
<b class='red'>خلاصه داستان :</b>
Content
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class,'lit-plot')]/text()");
foreach (HtmlNode node in htmlNodes)
{
Console.WriteLine(node.InnerText.Trim());
}
Fiddle here : https://dotnetfiddle.net/mXFs8k
If you control the markup, I recommend wrapping the content in its own tag, such as <p> or <span>; then you can target it directly with HtmlAgilityPack.

I've been trying to get data from a website with HtmlAgilityPack

I have tried a lot of approaches but couldn't solve my problem. I don't know how to write the node path for the SelectSingleNode() method. I built an HTML path to reach my node in my C# code, but when I run it I get a NullReferenceException because of that path. How should I build the path, or is there another solution?
This is example of html code:
<html>
<body>
<div id="container">
<div id="box">
<div class="box">
<div class="boxContent">
<div class="userBox">
<div class="userBoxContent">
<div class="userBoxElement">
<ul id="namePart">
<li>
<span class="namePartContent">
</span>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
And this my C# code:
namespace AgilityTrial
{
class Program
{
static void Main(string[] args)
{
Uri url = new Uri("https://....");
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string path = @"//html/body/div[@id='container']/div[@id='classifiedDetail']"+
"/div[@class='classifiedDetail']/div[@class='classifiedDetailContent']"+
"/div[@class='classifiedOtherBoxes']/div[@class='classifiedUserBox']"+
"/div[@class='classifiedUserContent']/ul[@id='phoneInfoPart']/li"+
"/span[@class='pretty-phone-part show-part']";
var tds = doc.DocumentNode.SelectSingleNode(path);
var date = tds.InnerHtml;
Console.WriteLine(date);
}
}
}
Take your namePartContent span node as an example. To fetch that data you would simply do this:
doc.DocumentNode.SelectSingleNode(".//span[@class='namePartContent']")?.InnerText;
It searches for a single span node whose class is namePartContent, beginning at the root node, in your case <html>. The ?. operator yields null instead of throwing when no node matches.
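The NullReferenceException in the question comes from SelectSingleNode returning null when the path matches nothing. Here is a minimal sketch of a shorter, null-safe lookup. It assumes the HtmlAgilityPack NuGet package and C# 9+ top-level statements; the span text "Jane Doe" and the class name noSuchClass are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

// Reduced version of the question's markup; the span text is hypothetical.
var html = @"
<html><body>
<div id='container'>
  <ul id='namePart'>
    <li><span class='namePartContent'>Jane Doe</span></li>
  </ul>
</div>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Anchoring on the id is less brittle than spelling out every ancestor from <html>.
var found = doc.DocumentNode.SelectSingleNode(
    "//ul[@id='namePart']/li/span[@class='namePartContent']");
// ?. short-circuits to null when no node matched; ?? supplies a fallback.
string name = found?.InnerText.Trim() ?? "(not found)";

// A path that matches nothing yields null rather than throwing.
var missing = doc.DocumentNode.SelectSingleNode("//span[@class='noSuchClass']");
string other = missing?.InnerText.Trim() ?? "(not found)";

Console.WriteLine(name);
Console.WriteLine(other);
```

Keeping the expression short also means fewer places where a renamed wrapper div can break the match.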

Html nodes issue with HtmlAgilityPack

I'm having a lot of trouble parsing this HTML with the HtmlAgilityPack library.
In this piece of code, I would like to retrieve only the URL (href) that refers to uploaded.net, but I can't determine which URL refers to it.
<div class='downloads' id='download_block'>
<h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
</div>
This is how it looks on the webpage
And this is what I have:
nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")
I can't just use an array-index to determine the position like:
nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node
...because the number of nodes and the positions of the hosts could change.
Note also that the URLs do not contain the hosting names; they are redirects like:
http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--
What could I do, in C# or VB.NET?
This should do it, though it is untested:
doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value
Also, use contains() because you never know whether the text includes surrounding spaces.
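The following-sibling idea can be checked offline with a minimal sketch. It assumes the HtmlAgilityPack NuGet package and C# 9+ top-level statements. The anchors and their example.com hrefs are hypothetical, since the question's snippet shows only "text here" inside each li. The [1] predicate is an addition that limits the match to the first ul after the heading.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical links: the <a> elements and URLs are assumptions for illustration.
var html = @"
<div class='downloads' id='download_block'>
  <h4>uploadable.ch</h4>
  <ul class='parts'><li><a href='http://example.com/r/one'>text here</a></li></ul>
  <h4>uploaded.net</h4>
  <ul class='parts'><li><a href='http://example.com/r/two'>text here</a></li></ul>
  <h4>novafile.com</h4>
  <ul class='parts'><li><a href='http://example.com/r/three'>text here</a></li></ul>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// ul[1] restricts the match to the first <ul> that follows the matching <h4>,
// so links belonging to later hosts are never considered.
var link = doc.DocumentNode.SelectSingleNode(
    "//h4[contains(text(),'uploaded.net')]/following-sibling::ul[1]//a[@href]");

// GetAttributeValue returns the fallback when the attribute is missing;
// ?. covers the case where no link was found at all.
string href = link?.GetAttributeValue("href", "") ?? "";
Console.WriteLine(href);
```

Because the lookup keys on the heading text rather than a position, it keeps working if hosts are added, removed, or reordered.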
The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this process lets you do so by computing the correct index dynamically.
void Main()
{
var xml = @"
<div class=""downloads"" id=""download_block"">
<h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
</div>";
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(xml);
var nav = xmlDocument.CreateNavigator();
var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
var text = xmlDocument.SelectSingleNode("//ul["+index +"]//a/@href").InnerText;
Console.WriteLine(text);
}
Basically, it gets the index of the uploaded.net h4 and then uses that index to select the correct ul tag and pull the URL out of the underlying anchor tag.
Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.
Given the snippet you supplied, this will help you get started.
var page = "<div class=\"downloads\" id=\"download_block\"> <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5> <h4>uploadable.ch</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>uploaded.net</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>novafile.com</h4> <ul class=\"parts\"> <li> text here </li> </ul></div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
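// The question asks for the uploaded.net link, so match that heading.
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploaded.net"));
foreach (var node in nodes)
{
    // NextSibling is the whitespace text node; the sibling after it is the <ul>.
    var anchor = node.NextSibling.NextSibling.Descendants("a").FirstOrDefault();
    var attr = anchor?.Attributes["href"];
    if (attr != null) Console.WriteLine(attr.Value);
}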

Parsing images out of a list using HtmlAgilityPack

There's a html page like this
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
...
I want to get all the hrefs from the lis, but in a way that keeps the relation between each li and its a tag.
So the first li goes with the first a tag, the second with the second, and so on.
I have this code, but it always returns the same href:
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
    String href = node.SelectNodes("//a[@class='first aclass']")[0].Attributes["href"].Value;
}
How can I improve that code?
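You probably want to collect all the hrefs. Note the leading dot in .//a: it makes the search relative to the current li, whereas a bare //a searches the whole document and always returns the first match.
string href="";
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
    href+= node.SelectNodes(".//a[@class='first aclass']")[0].Attributes["href"].Value+",";
}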
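The reason the question's loop keeps returning the same href is that an XPath expression starting with // is evaluated from the document root, even when it is called on a child node. Here is a minimal sketch of the relative form. It assumes the HtmlAgilityPack NuGet package and C# 9+ top-level statements; the href values first, second, and third are hypothetical, and the a tags are closed here, unlike in the question's snippet.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical href values so the three items can be told apart.
var html = @"
<ul>
  <li class='liclass'>some html <a href='first' class='first aclass'>link</a> some other html</li>
  <li class='liclass'>some html <a href='second' class='first aclass'>link</a> some other html</li>
  <li class='liclass'>some html <a href='third' class='first aclass'>link</a> some other html</li>
</ul>";

var docu = new HtmlDocument();
docu.LoadHtml(html);

var hrefs = new List<string>();
var items = docu.DocumentNode.SelectNodes("//li[@class='liclass']");
if (items != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in items)
    {
        // ".//" searches below the current <li>; "//" would restart at the document root.
        var anchor = node.SelectSingleNode(".//a[@class='first aclass']");
        if (anchor != null)
        {
            hrefs.Add(anchor.GetAttributeValue("href", ""));
        }
    }
}

Console.WriteLine(string.Join(",", hrefs));
```

Collecting into a list keeps each href paired with the li it came from, which a concatenated string does not.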

YouTube HTML Agility Pack C#

I am trying to retrieve all the video ids off the YouTube's search results page.
Each result has this code:
<a href="/watch?v=aYIC-ebAD3o" class="ux-thumb-wrap result-item-thumb">
<span class="video-thumb ux-thumb-128 ">
<span class="clip">
<img onload="tn_load(5)" alt="Thumbnail" src="//i2.ytimg.com/vi/aYIC-ebAD3o/default.jpg" >
</span>
</span>
<span class="video-time">4:16</span>
<span dir="ltr" class="yt-uix-button-group addto-container short video-actions" data-video-ids="aYIC-ebAD3o" data-feature="thumbnail">
<button type="button" class="start master-sprite yt-uix-button yt-uix-button-short yt-uix-tooltip" onclick=";return false;" title="" data-button-action="yt.www.addtomenu.add" role="button" aria-pressed="false">
<img class="yt-uix-button-icon yt-uix-button-icon-addto" src="//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="">
<span class="yt-uix-button-content">
<span class="addto-label">Add to</span>
</span>
</button>
<button type="button" class="end yt-uix-button yt-uix-button-short yt-uix-tooltip yt-uix-button-empty" onclick=";return false;" title="" data-button-menu-id="shared-addto-menu" data-button-action="yt.www.addtomenu.load" role="button" aria-pressed="false">
<img class="yt-uix-button-arrow" src="//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="">
</button>
</span>
<span class="video-in-quicklist">Added to queue </span>
</a>
<div class="result-item-main-content">
And I am trying to parse out the "data-video-ids" class data. What's the best way to do this with the HTML Agility Pack?
I have tried this:
foreach(HtmlNode node in doc.DocumentNode.
    SelectNodes("//span[@class='data-video-ids']"))
{
string text = node.InnerText;
lblTest2.Text += text + Environment.NewLine;
}
Any ideas?
I think you will be better off in the long run if you use one of YouTube's APIs.
I would only use web requests and HtmlAgilityPack as a last resort, when no API exists. The main reason is that if YouTube ever changes their page, your code breaks. Open APIs are generally designed to be backwards compatible, so your application should keep working in most cases.
Here is a code example from YouTube's API:
YouTubeQuery query = new YouTubeQuery(YouTubeQuery.DefaultVideoUri);
//order results by the number of views (most viewed first)
query.OrderBy = "viewCount";
// search for puppies and include restricted content in the search results
// query.SafeSearch could also be set to YouTubeQuery.SafeSearchValues.Moderate
query.Query = "puppy";
query.SafeSearch = YouTubeQuery.SafeSearchValues.None;
Feed<Video> videoFeed = request.Get<Video>(query);
printVideoFeed(videoFeed);
Looks simple, right?
The 'data-video-ids' you're trying to filter on is not a class but an attribute. Please try the following expression in SelectNodes:
"//span[@data-video-ids]"
To retrieve the attribute value you could try this approach (since HtmlAgilityPack doesn't support attribute selection, you have to get the element first and then read the actual attribute):
foreach(HtmlNode node in doc.DocumentNode.
    SelectNodes("//span[@data-video-ids]"))
{
var videoIds = node.Attributes["data-video-ids"];
if (videoIds == null) continue;
string text = videoIds.Value;
lblTest2.Text += text + Environment.NewLine;
}
