I am not able to get the XPath right. I am trying to get the poster image of any IMDb movie, but it just does not seem to work. This is my code:
// Getting the node
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"title-overview-widget\"]/div[2]/div[3]/div[1]/a/img");
// Getting the attribute data
HtmlAttributeCollection attr = node.Attributes;
The attribute collection is null every time; the XPath does not work and I don't know why. It looks correct to me.
You can use a simpler XPath:
var url = "http://www.imdb.com/title/tt0816692/";

using (var client = new HttpClient())
{
    var html = await client.GetStringAsync(url);
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    var img = doc.DocumentNode.SelectSingleNode("//img[@title='Trailer']")
                ?.Attributes["src"]?.Value;

    // or
    var poster = doc.DocumentNode.SelectSingleNode("//div[@class='poster']//img")
                   ?.Attributes["src"]?.Value;
}
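An attribute-based selector like this is also far less brittle than a long, browser-copied absolute path, which breaks as soon as IMDb changes its page structure, and the null-conditional operators (?.) keep the lookup from throwing when the node or attribute is missing.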
Related
I am trying to get a table from a website using the Html Agility Pack in C#, but it always returns null and I don't understand why.
This is my code:
using (var httpClient = new HttpClient())
{
    var response = await httpClient.GetAsync("some website");
    var htmlBody = await response.Content.ReadAsStringAsync();

    var doc = new HtmlDocument();
    doc.LoadHtml(htmlBody);

    var table = doc.DocumentNode.SelectSingleNode("/html/body/div/div/div/div[2]/div[5]/div/div/table");
}
I have also tried this XPath but it still doesn't work:
var table = doc.DocumentNode.SelectSingleNode("//*[@id=\"__layout\"]/div/div[2]/div[5]/div/div/table");
The variable table is always null after I run this. Is there something wrong with my code or is it an issue with the XPath I'm using?
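One thing worth checking (an assumption, since the actual site isn't named): an id of __layout usually indicates a page that is rendered client-side by JavaScript, in which case the table is simply not present in the HTML that HttpClient downloads and no XPath will ever find it. A minimal diagnostic sketch, reusing the code from the question, is to test whether any table markup came back at all:

using (var httpClient = new HttpClient())
{
    var response = await httpClient.GetAsync("some website"); // placeholder URL from the question
    var htmlBody = await response.Content.ReadAsStringAsync();

    var doc = new HtmlDocument();
    doc.LoadHtml(htmlBody);

    // If this is null, the table is built by script in the browser and is not in the raw HTML,
    // so the data would have to come from the site's underlying API (or a headless browser) instead.
    var anyTable = doc.DocumentNode.SelectSingleNode("//table");
    Console.WriteLine(anyTable == null ? "No <table> in the downloaded HTML" : "Table markup is present");
}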
I am trying to scrape some data from a commercial site by drilling down to a specific class, but I am having difficulty with the drilling down and filtering.
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(Url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body");
var divs = nodes.Descendants("div").Where(x => x.HasClass("srp-main srp-main--isLarge"));
I am getting null for "divs". Where am I going wrong?
My ultimate goal is to drill down to the div with class name s_item__info clearfix, so help with that would be appreciated too.
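The likely problem (hedged, since the page markup isn't shown): HasClass checks a single class name, so passing the space-separated string "srp-main srp-main--isLarge" never matches anything. A sketch that filters on each class separately, reusing the class names from the question:

// HasClass matches one class at a time, so test each name on its own.
var divs = doc.DocumentNode.Descendants("div")
    .Where(x => x.HasClass("srp-main") && x.HasClass("srp-main--isLarge"));

// Same idea for the target items; the class names are taken verbatim from the question.
var items = doc.DocumentNode.Descendants("div")
    .Where(x => x.HasClass("s_item__info") && x.HasClass("clearfix"));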
I'm trying to load an RSS feed with XDocument.
The url is:
http://www.ft.com/rss/home/uk
XDocument doc = XDocument.Load(url);
But I'm getting an error:
Cannot open 'http://www.ft.com/rss/home/uk'. The Uri parameter must be a file system relative or absolute path.
XDocument.Load does not take web URLs, only file paths, as stated in the documentation.
Try something like the following code which I totally did not test:
using (var httpclient = new HttpClient())
{
    var response = await httpclient.GetAsync("http://www.ft.com/rss/home/uk");
    var xDoc = XDocument.Load(await response.Content.ReadAsStreamAsync());
}
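Once it is loaded this way, the entries can be read with ordinary LINQ to XML inside the same using block. A minimal sketch, assuming the feed uses the standard RSS item/title/link elements:

// Enumerate the feed entries from the XDocument loaded above.
foreach (var item in xDoc.Descendants("item"))
{
    var title = item.Element("title")?.Value;
    var link = item.Element("link")?.Value;
    Console.WriteLine($"{title} - {link}");
}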
In my application I need to get the URL of the image of a blog post. In order to do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
    string imageUrl = string.Empty;
    using (WebClient client = new WebClient())
    {
        string htmlString = client.DownloadString(postUrl);
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(htmlString);
        string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
        HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
        imageUrl = node.GetAttributeValue("src", string.Empty);
    }
    return imageUrl;
}
The problem is that this is too slow: when I did some tests I noticed that it takes about three seconds to extract the URL of the image on a given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried using the absolute XPath of the element I want, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"].Value;
My program reads a web page whose body contains an iframe that I want to read.
My HTML source:
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the page source as a string:
public static string get_url_source(string url)
{
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
My problem is that I also want to get the source of the iframe while reading the page source, as would happen in normal browsing.
Can I do this only by using the WebBrowser class, or is there a way to do it with WebClient or even another class?
The real question:
How can I get the outer HTML, given a URL? Any approach is welcome.
After getting the source of the site, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
Then make a second call to get_url_source with that URL:
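A minimal sketch of that second call (the null check is just illustrative), reusing the get_url_source helper defined in the question:

// src comes from the SelectSingleNode call above.
if (!string.IsNullOrEmpty(src))
{
    // Fetch the HTML that the iframe itself would display.
    string iframeHtml = get_url_source(src);
}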
Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
    iframeSource.Add(get_url_source(node.Attributes["src"].Value));
If you are targeting a single iframe, try to identify it by its ID attribute (or something else) so you retrieve only one source:
String iframeSource = String.Empty;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
    // Just an example check, but you could use different approaches...
    if (node.Attributes["id"]?.Value == "targetframe")
        iframeSource = get_url_source(node.Attributes["src"].Value);
}
Well, I found the answer after some searching, and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.