Html Agility Pack XPath id error - Yahoo Finance - c#

I am trying to get a company's sector on Yahoo Finance using HTML Agility Pack, but I keep getting an "object reference not set to an instance of an object" exception. Why does my code throw this exception? I have already double-checked the XPath id numerous times.
string Url = "http://www.finance.yahoo.com/q/pr?s=MSFT+Profile";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string xpathid = "//*[@id=\"yfncsumtab\"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td[2]/a";
string sector = doc.DocumentNode.SelectNodes(xpathid)[0].InnerText;
Console.WriteLine(sector);
this is the line that is throwing the exception:
string sector = doc.DocumentNode.SelectNodes(xpathid)[0].InnerText;

Probably because SelectNodes is returning null... but you are trying to index into it anyway.
You need to state which line is throwing the exception.
Jamming several operations into one line of code makes debugging more difficult than it needs to be.
[edit] Your updated post confirms what I suggested.
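The fix is to split the chained call and check for null before indexing. A minimal sketch of the asker's snippet (with `#id` corrected to `@id` in the XPath):

```csharp
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.finance.yahoo.com/q/pr?s=MSFT+Profile");

string xpathid = "//*[@id=\"yfncsumtab\"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td[2]/a";

// SelectNodes returns null when nothing matches, so test before indexing.
var nodes = doc.DocumentNode.SelectNodes(xpathid);
if (nodes == null || nodes.Count == 0)
{
    Console.WriteLine("XPath matched nothing - the served markup may differ from what the browser shows.");
}
else
{
    Console.WriteLine(nodes[0].InnerText);
}
```

One common cause: browser dev tools insert `tbody` elements that may not exist in the raw HTML the server actually sends, so a browser-copied XPath containing `/tbody/` often matches nothing in HtmlAgilityPack.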

Related

C# Best Buy Web Scraping - Can't get add to cart element

I'm writing a simple web scraping application to retrieve information on certain PC components.
I'm using Best Buy as my test website and I'm using the HTMLAgilityPack as my scraper.
I'm able to retrieve the title and the price; however, I can't seem to get the availability.
So, I'm trying to read the Add to Cart button element's text. If it's available, it'll read "Add to Cart", otherwise, it'll read "Unavailable".
But when I copy the XPath and try to save the matching node to a variable, it returns null. Can someone please help me out?
Here's my code.
var url = "https://www.bestbuy.com/site/pny-nvidia-geforce-gt-710-verto-2gb-ddr3-pci-express-2-0-graphics-card-black/5092306.p?skuId=5092306";
HtmlWeb web = new HtmlWeb();
HtmlDocument pageDocument = web.Load(url);
string titleXPath = "/html/body/div[3]/main/div[2]/div[3]/div[1]/div[1]/div/div/div[1]/h1";
string priceXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[1]/div/div/div/div/div[2]/div/div/div/span[1]";
string availabilityXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div/div/div[1]/button";
var title = pageDocument.DocumentNode.SelectSingleNode(titleXPath);
var price = pageDocument.DocumentNode.SelectSingleNode(priceXPath);
bool availability = pageDocument.DocumentNode.SelectSingleNode(availabilityXPath) != null;
Console.WriteLine(title.InnerText);
Console.WriteLine(price.InnerText);
Console.WriteLine(availability);
It correctly outputs the title and price, but the availability node is always null (so availability is always false).
Try string availabilityXPath = "//button[. = 'Add to Cart']";
In web scraping, a long auto-generated XPath will work on the exact static page it was copied from, but when you're dealing with multiple pages across the same store, the location of elements can drift and break your XPaths. Yours breaks at /html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div, and I suspect that's what is happening here.
Learning to write one from scratch will be invaluable (and much easier to debug!).
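For instance, a hand-written XPath anchored on the button's text survives layout drift that breaks a position-based path. A sketch against the asker's variables:

```csharp
// Match the button by its text rather than by its position in the tree.
var addToCartNode = pageDocument.DocumentNode
    .SelectSingleNode("//button[contains(., 'Add to Cart')]");
bool available = addToCartNode != null;
Console.WriteLine(available ? "Add to Cart" : "Unavailable");
```

`contains(., 'Add to Cart')` tolerates surrounding whitespace or nested spans inside the button, which an exact `=` comparison would not.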

Html Agility Pack xpath throws null exception

I am trying to parse this page.
To select the nodes I need I use XPath. My XPath works fine in my browser, but when I use it in my project it throws a NullReferenceException.
The XPath for title works fine, but the one for description does not.
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://nl.aliexpress.com/item/4000646776468.html?spm=a2g0o.productlist.0.0.531f7aa3iGAnCb&algo_pvid=0b20aa21-fd7f-4826-81a5-c9aac5254da8&algo_expid=0b20aa21-fd7f-4826-81a5-c9aac5254da8-0&btsid=8849a0ec-e95d-447f-a6f9-34dcd58f1381&ws_ab_test=searchweb0_0,searchweb201602_4,searchweb201603_53");
ProductModel product = new ProductModel {
Title = document.DocumentNode.SelectSingleNode("//head/title").InnerText,
Description = document.DocumentNode.SelectSingleNode("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div/div[2]/div[1]/div/div/div/div[1]/p[2]").InnerText};
return View(product);
It indeed turned out to be a problem with the content being dynamically rendered.
For those who come across the same problem: take a look at Selenium if you are using C#.
I switched to Node using the Puppeteer library.
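A minimal Selenium sketch for such dynamically rendered pages (assumes the Selenium.WebDriver NuGet package and a matching ChromeDriver on the PATH; the CSS selector is hypothetical, not AliExpress's actual markup):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--headless");

using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("http://nl.aliexpress.com/item/4000646776468.html");

// Wait until the JavaScript-rendered description exists in the live DOM,
// instead of reading the initial HTML payload as HtmlWeb.Load does.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
var description = wait.Until(d => d.FindElement(By.CssSelector(".product-description")));
Console.WriteLine(description.Text);
```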

c# Html agility pack HtmlDocument does not contain all Elements from the website

I'm currently making a chatbot. The bot should be able to define a word, so I tried getting the span element from Google (https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test) where the definition is written, which didn't work. It turns out that the HtmlDocument does not contain the whole website.
string Url = "https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='uid_0']/div[1]/div/div[1]/div[2]/div/ol/li[1]/div/div/div[2]/div/div[1]/span");
if (!String.IsNullOrEmpty(node.InnerText))
output += node.InnerText;
The exception is "node is not set to an instance of an object".
I dumped the InnerHtml of the document into a gist: https://gist.github.com/MarcelBulpr/bb44a527d8202eb7fffb4e21fb8b4fed
It seems that the website does not load the results of the search request.
Does anyone know how to work around this?
Thanks in advance

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access some nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one.
I am confused about how to access that secondary HTML document and then parse through it for the data I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using Html Agility Pack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to the information needed to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
Console.WriteLine(presentedBy.InnerText);
}
As an example, scraping the Presented By field:
Remarks:
I use ScrapySharp nuget package along with HtmlAgilityPack, so I can scrape using css selectors instead of xpath expressions - something I find easier to do.
The URL you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which I found by using the Firefox developer tools to analyze the site's traffic/network requests and responses.
I could not identify what ultimately triggers this HTTP request (it may be JavaScript code, or it may be one of the frame HTML documents requested by the main, frame-enabled document).
If you only have a couple of URLs like this to scrape, then even manually extracting the correct URL is an option.

text returning as NULL using htmlagility pack + xpath

I'm currently playing around with Html Agility Pack; however, I don't seem to be getting any data back from the following URL:
http://cloud.tfl.gov.uk/TrackerNet/LineStatus
This is the code i'm using:
var url = @"http://cloud.tfl.gov.uk/TrackerNet/LineStatus";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
However, when I check the contents of doc, the text value is null. I've tried other URLs and I receive the HTML used on the site. Is it just this particular URL, or am I doing something wrong? Any help would be appreciated.
HtmlAgilityPack is an HTML parser, so you won't have much success pointing it at a non-HTML endpoint such as the XML document this URL returns.
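Since the endpoint returns XML, System.Xml.Linq is the more natural tool. A sketch (the TrackerNet element and attribute names here are assumptions and may need adjusting against the actual feed):

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Xml.Linq;

using var client = new HttpClient();
string xml = await client.GetStringAsync("http://cloud.tfl.gov.uk/TrackerNet/LineStatus");

XDocument doc = XDocument.Parse(xml);

// The feed is namespaced, so match on the local name rather than the raw name.
foreach (var line in doc.Descendants().Where(e => e.Name.LocalName == "Line"))
    Console.WriteLine(line.Attribute("Name")?.Value);
```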
