I'm writing a simple web scraping application to retrieve information on certain PC components.
I'm using Best Buy as my test website and I'm using the HTMLAgilityPack as my scraper.
I'm able to retrieve the title and the price; however, I can't seem to get the availability.
So I'm trying to read the Add to Cart button's text: if the item is available, it reads "Add to Cart"; otherwise it reads "Unavailable".
But when I copy the button's XPath and use it to select the node, I get null back. Can someone please help me out?
Here's my code.
var url = "https://www.bestbuy.com/site/pny-nvidia-geforce-gt-710-verto-2gb-ddr3-pci-express-2-0-graphics-card-black/5092306.p?skuId=5092306";
HtmlWeb web = new HtmlWeb();
HtmlDocument pageDocument = web.Load(url);
string titleXPath = "/html/body/div[3]/main/div[2]/div[3]/div[1]/div[1]/div/div/div[1]/h1";
string priceXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[1]/div/div/div/div/div[2]/div/div/div/span[1]";
string availabilityXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div/div/div[1]/button";
var title = pageDocument.DocumentNode.SelectSingleNode(titleXPath);
var price = pageDocument.DocumentNode.SelectSingleNode(priceXPath);
bool availability = pageDocument.DocumentNode.SelectSingleNode(availabilityXPath) != null;
Console.WriteLine(title.InnerText);
Console.WriteLine(price.InnerText);
Console.WriteLine(availability);
It correctly outputs the title and price, but the availability lookup always returns null (so availability is always false).
Try string availabilityXPath = "//button[. = 'Add to Cart']";
In web scraping, a long auto-generated XPath will keep working on the same static page, but across multiple pages of the same store the position of elements drifts, breaking positional paths. Yours is breaking at /html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div, and I suspect that's what's happening here.
Learning to write one from scratch will be invaluable (and much easier to debug!).
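For completeness, here is a sketch of that fix dropped into the original code; contains() is used to be forgiving of surrounding whitespace, and the exact button text is an assumption:
var addToCartButton = pageDocument.DocumentNode.SelectSingleNode("//button[contains(., 'Add to Cart')]");
bool availability = addToCartButton != null; // true when the button exists
Console.WriteLine(availability);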
Related
I am trying to parse a Google Play Store HTML page in C# (.NET Core). Unfortunately, Google does not provide an API to get mobile application info (such as version, last update, ...), while Apple does. This is why I am trying to parse the HTML page and then get the info needed.
However, it seems they published a new version recently, where the user has to press an arrow button to see the app's info, which is displayed in a popup.
In order to understand more, consider the example of WhatsApp application: https://play.google.com/store/apps/details?id=com.whatsapp&hl=en
In order to get the info of this app (like release date, version, ...), the user now has to press the arrow near "About this app".
Previously, the below code was working perfectly:
var id = "com.whatsapp";
var language = "en";
var url = string.Format("https://play.google.com/store/apps/details?id={0}&hl={1}", id, language);
string result;
WebClient client = new WebClient();
client.Encoding = System.Text.UTF8Encoding.UTF8;
result = client.DownloadString(url);
MatchCollection matches = Regex.Matches(result, "<div class=\"hAyfc\">.*?<span class=\"htlgb\"><div class=\"IQ1z0d\"><span class=\"htlgb\">(?<content>.*?)</span></div></span></div>");
objAndroidDetails.updated = matches[0].Groups["content"].Value;
objAndroidDetails.version = matches[3].Groups["content"].Value;
...
But now this no longer works, for two reasons:
The regular expression is no longer valid.
client.DownloadString(url) downloads only the HTML as served, before the button is triggered to display the info, so I cannot extract the info because it is simply not in the downloaded page.
So, can anybody help me solve issue #2? I need to somehow trigger the button so that the HTML I need is present, and I can then match and extract it.
Thanks
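One possible approach to issue #2 is to let a real browser render the page and press the button, then hand the resulting HTML to your parser. A minimal sketch, assuming the Selenium.WebDriver NuGet package and a matching ChromeDriver are installed; the button selector is a hypothetical placeholder you would need to find with the browser's inspector:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless");
using (var driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://play.google.com/store/apps/details?id=com.whatsapp&hl=en");
    // Hypothetical selector: locate the real arrow button with the inspector.
    driver.FindElement(By.CssSelector("button[aria-label*='About this app']")).Click();
    string renderedHtml = driver.PageSource; // now contains the popup content
    // renderedHtml can be parsed with a regex or HtmlAgilityPack as before.
}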
I am trying to access certain nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document embedded within the initial one.
I am confused about how to access that secondary HTML document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack, and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link");
// Guard against both a null result and an empty match set
if (presentedBy != null && presentedBy.Any())
{
    Console.WriteLine(presentedBy.First().InnerText);
}
As an example, the code above scrapes the Presented By field.
Remarks:
I use ScrapySharp nuget package along with HtmlAgilityPack, so I can scrape using css selectors instead of xpath expressions - something I find easier to do.
The URL you are scraping from is your problem. I am scraping from the last GET request that is performed after the page loads, which you can find by using Firefox developer tools to analyze the site's traffic/network requests and responses.
I could not identify what ultimately triggers this HTTP request (it may be JavaScript code, or one of the frame HTML documents requested by the main, frame-enabled document).
If you only have a couple of URLs like this to scrape, even extracting the correct URL manually is an option.
I want to make a desktop weather application in C# that pulls the weather from weather.com. I am very new to this subject. I am using the HtmlAgilityPack.dll. I have tried the following code to pull today's weather (degrees):
string webUrl = "http://www.weather.com/weather/today/l/90025:4:US";
HtmlWeb HTMLweb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = HTMLweb.Load(webUrl);
string degrees = doc.DocumentNode.SelectNodes("//*[@id=\"wx-local-wrap\"]/div[2]/div[2]/div/div/div/div/section/div/div/div[1]/div/section/section[1]/div[2]/span[1]/span")[0].InnerText;
MessageBox.Show(string.Format("{0}°F", degrees));
However, when I run this code it throws the NullReferenceException. What am I doing wrong and how can I fix it?
Thank you.
Handling webpages like this is an exhausting task, and any change to the webpage by its developers will render your application useless.
Therefore, use XML or an API to retrieve weather data instead. This can be a good place to start:
http://openweathermap.org/current
It supports XML and JSON: you provide parameters such as a city ID, a city name, or geographic coordinates, and it returns results as clear, structured XML that is easy to parse using XmlReader.
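As a rough sketch of what that looks like (this assumes you have signed up for an OpenWeatherMap API key; the endpoint and XML element names follow their documentation and may change):
using System;
using System.Net;
using System.Xml;

string apiKey = "YOUR_API_KEY"; // placeholder
string url = "http://api.openweathermap.org/data/2.5/weather?q=Los%20Angeles&mode=xml&units=imperial&appid=" + apiKey;

using (var client = new WebClient())
using (var reader = XmlReader.Create(client.OpenRead(url)))
{
    while (reader.Read())
    {
        // e.g. <temperature value="72.5" unit="fahrenheit" ... />
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "temperature")
        {
            Console.WriteLine("{0}°F", reader.GetAttribute("value"));
            break;
        }
    }
}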
Hope that helped :)
I am trying to get a company's sector from Yahoo Finance using HTML Agility Pack, but I keep getting an "Object reference not set to an instance of an object" exception. Why does my code throw this exception? I have already double-checked the XPath ID numerous times.
string Url = "http://www.finance.yahoo.com/q/pr?s=MSFT+Profile";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string xpathid = "//*[@id=\"yfncsumtab\"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td[2]/a";
string sector = doc.DocumentNode.SelectNodes(xpathid)[0].InnerText;
Console.WriteLine(sector);
this is the line that is throwing the exception:
string sector = doc.DocumentNode.SelectNodes(xpathid)[0].InnerText;
Probably because SelectNodes is returning null... but you are trying to index into it anyway.
You need to state which line is throwing the exception.
Jamming several operations into one line of code makes debugging more difficult than it needs to be.
[edit] Your updated post confirms what I suggested.
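For instance, splitting the chain makes the failing step obvious (a sketch):
var nodes = doc.DocumentNode.SelectNodes(xpathid);
if (nodes == null)
{
    // XPath matched nothing - the served HTML may differ from what the browser's inspector shows.
    Console.WriteLine("No node found for: " + xpathid);
}
else
{
    Console.WriteLine(nodes[0].InnerText);
}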
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have they want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do: (using ASP.NET and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks
This small example uses HtmlAgilityPack, with XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);
string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
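For example, these two lines are equivalent ways of embedding a copied XPath that contains double quotes:
string escaped = "//*[@id=\"main\"]/div/span";    // backslash-escaped quotes
string verbatim = @"//*[@id=""main""]/div/span";  // verbatim string: quotes are doubled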
You have to make sure you use some error handling, because scraping code can throw errors if the site changes the HTML structure of the page.
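A minimal sketch of such a guard:
try
{
    var web = new HtmlAgilityPack.HtmlWeb();
    var doc = web.Load(url);
    var node = doc.DocumentNode.SelectSingleNode("//*[@id=\"main\"]");
    Console.WriteLine(node != null ? node.InnerText : "layout changed?");
}
catch (System.Net.WebException ex)
{
    // Network failures (timeouts, DNS, 4xx/5xx) are routine when scraping.
    Console.WriteLine("Request failed: " + ex.Message);
}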
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System.Net;
using System.IO;
using System.Text;
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
response = request.GetResponse();
reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
result = reader.ReadToEnd();
}
catch (Exception ex)
{
// handle error
MessageBox.Show(ex.Message);
}
finally
{
if (reader != null)
reader.Close();
if (response != null)
response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: <meta name="og:title" content="In a World...">
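A sketch of pulling one of those tags out of the downloaded string with HtmlAgilityPack (the tag names come from the list above; the page markup itself may of course change):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(result); // 'result' from the snippet above
var meta = doc.DocumentNode.SelectSingleNode("//meta[@name='og:title']");
if (meta != null)
{
    Console.WriteLine(meta.GetAttributeValue("content", ""));
}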
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it is familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser - it's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
[Selector("#BirdthDate")]
[Converter(typeof(DateTimeConverter))]
public DateTime BirdthDate { get; set; }
}
// ...
PersonModel person = WebContentParser.Parse<PersonModel>(html);
NuGet link