Trouble Scraping .HTM File


I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores from rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL's game summary pages. I think this is an interesting problem, so I thought I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
At first glance, it seems like basic text with no AJAX or anything else that would trip up a basic scraper. Then I realized I can't right-click because of some JavaScript, so I worked around that. I right-clicked in Firefox, got the XPath of the home team using XPather, and ended up with:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node's inner text, HtmlAgilityPack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how sites might stop me from scraping; any tips or tricks are greatly appreciated!
P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.

OK, so it appears that my XPaths have tbody steps in them. When I remove these tbody steps manually from the XPath, HtmlAgilityPack handles it fine.
I'd still like to know why I am getting invalid XPaths, but for now I have answered my question.

I think, unless my XPath knowledge is heaps flawed (probably), the problem is with the /tbody node in your XPath expression.
When I do
// Assumes a local copy of the page saved as C:\gs.htm
string test = string.Empty;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (StreamReader sr = new StreamReader(@"C:\gs.htm"))
{
    doc.Load(sr);
}
string xpath = @"//table[@id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
that works fine and returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the HTML, I couldn't find a tbody element.
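On the open question of where the tbody steps come from: browsers insert a tbody element into the DOM for tables whose markup has none, and tools such as XPather build the XPath from that DOM, while HtmlAgilityPack parses the source as written. Below is a minimal sketch of removing those steps automatically instead of by hand. The names XPathCleaner and StripTbody are made up for illustration, and the sketch assumes the page's source really contains no tbody elements.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleaner
{
    // Removes every "/tbody" step, with or without a predicate such as [1],
    // from an XPath that was generated from a browser's DOM.
    public static string StripTbody(string xpath)
    {
        // The lookahead keeps element names that merely start with "tbody" intact.
        return Regex.Replace(xpath, @"/tbody(\[[^\]]*\])?(?=/|$)", string.Empty);
    }
}
```

Calling XPathCleaner.StripTbody on the XPath from the question removes all four tbody steps, and the result can then be passed to SelectSingleNode.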

Related

Webclient.DownloadString() does not retrieve current whole page

I know there is another question with a practically identical title here: Webclient.DownloadString does not retrieve the whole page
But its solution doesn't help me; maybe somebody else has the same problem.
I'm trying to get the html code of this URL:
https://cubebrush.co/?freebies=true
To achieve that, I'm using the following code in C#:
WebClient webClient = new WebClient();
string webString = webClient.DownloadString("https://cubebrush.co/?freebies=true");
But the retrieved HTML lacks some information, for example all the button tags in the page. This can be quickly checked with the HtmlAgilityPack library by collecting every tag name in the document with the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webString);
HashSet<string> hs = new HashSet<string>();
foreach (var dec in doc.DocumentNode.Descendants())
{
    hs.Add(dec.Name);
}
If we run this, it will show 26 tags, but none of them will be a button tag. This makes sense, since the initial webString also lacks that button information.
I've tried copying webString into a file to check whether, as the comment on the other question suggests, it was a problem with the visualizer, but it isn't: the visualizer and the file look exactly the same.
Can somebody tell me what I'm doing wrong? Thanks!
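One way to confirm that the tags are missing from the server response itself, rather than lost during parsing, is to count them in the raw string. This is a minimal sketch using only the base class library; RawHtmlCheck and CountOpeningTags are hypothetical names. If the count is zero for elements that are visible in the browser's developer tools, a common cause (not confirmed for this site) is that the page builds them with JavaScript after loading, which WebClient does not execute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RawHtmlCheck
{
    // Counts opening tags with the given name in the raw markup,
    // before any DOM parsing takes place. Closing tags are not counted.
    public static int CountOpeningTags(string html, string tagName)
    {
        // Require whitespace, '>' or '/' after the name so "button" does not match "buttons".
        string pattern = "<" + Regex.Escape(tagName) + @"(?=[\s>/])";
        return Regex.Matches(html, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```

For the code in the question, this would be called as RawHtmlCheck.CountOpeningTags(webString, "button").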

Get data from the website open in the WebBrowser

I am in the same situation as the person who asked this question. I need to get some data from a website and save it as a string.
My problem is that the website I need to get the data from requires the user to be logged in to view it.
So my plan was to have the user go to the website using the WebBrowser control, log in, and, once on the right page, click a button that automatically saves the data.
I want to use a method similar to the one in the top answer to the other question I linked at the start:
string data = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
I tried doing things like this:
string data = webBrowser1.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
But you can't call webBrowser1.DocumentNode.SelectNodes.
I also saw that the answer to the other question uses HtmlAgilityPack, but I tried to download it and have no idea what to do with it.
I'm not the best with C#, so please don't post overly complicated answers, or at least try to make them understandable.
Thanks in advance :)

Here is an example of HtmlAgilityPack usage. Pass it the HTML of the page loaded in the control, for example webBrowser1.DocumentText:
public string GetData(string htmlContent)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(htmlContent);
    // SelectSingleNode returns the first match, or null when nothing matches the XPath
    HtmlAgilityPack.HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]");
    if (node != null && !string.IsNullOrEmpty(node.InnerText))
        return node.InnerText;
    return null;
}
Edit: If you want to emulate user actions in a browser, I would suggest using Selenium instead of the regular WebBrowser control. You can download it from http://www.seleniumhq.org/ or install it through NuGet. This is a good question on how to use it: How do I use Selenium in C#?

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access the nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one.
I am confused about how to access the secondary HTML document and then parse through it for these nodes.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);

You will be able to scrape the data if you access the following URL:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the Presented By field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect comes from the ScrapySharp package; take the first match, if any
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions, which I find easier.
The URL you are scraping from is your problem. I am scraping from the last GET request performed after the page is loaded, which I found by using the Firefox developer tools to analyze the site's network requests and responses.
I have not yet identified what triggers this HTTP request (it may be JavaScript code, or it may be one of the frame documents requested by the main, frame-enabled document).
If you only have a couple of URLs like this to scrape, then manually extracting the correct URL is also an option.
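If the data does live in a frame, a sketch like the following lists the frame documents referenced by the outer page so that each can be loaded directly. It assumes the HtmlAgilityPack NuGet package is installed and that the frames are declared in the markup as frame or iframe elements with a src attribute; FrameFinder and GetFrameUrls are hypothetical names. It will not find a request that is issued by JavaScript.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FrameFinder
{
    // Returns the absolute URLs of all frame and iframe documents declared in the page.
    public static List<Uri> GetFrameUrls(string pageUrl)
    {
        var result = new List<Uri>();
        var baseUri = new Uri(pageUrl);
        HtmlDocument doc = new HtmlWeb().Load(pageUrl);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection frames = doc.DocumentNode.SelectNodes("//frame[@src] | //iframe[@src]");
        if (frames == null)
            return result;

        foreach (HtmlNode frame in frames)
        {
            // Attribute values may contain entities such as &amp; in query strings.
            string src = HtmlEntity.DeEntitize(frame.GetAttributeValue("src", ""));

            // Resolve relative src values against the outer page's URL.
            if (Uri.TryCreate(baseUri, src, out Uri absolute))
                result.Add(absolute);
        }
        return result;
    }
}
```

Each returned URL can then be passed to HtmlWeb.Load in the same way as the Report.aspx URL above.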

HTMLAgilityPack missing child nodes that exist on website being scraped

I'm running the following piece of code. It returns the correct number of divs for 'callTable', but they are all empty: the InnerHtml is empty and no children are found for any of them, even though the elements have children when you inspect them on the actual site.
I thought it might have to do with having a table within a div, so I tested it by looking at the 'box-content' divs. Those seem to load correctly, though. Is it possible that it has to do with callTable having 'table-layout: fixed'?
Anyway, I can't find anyone else having this problem after poking around. Does anyone have some thoughts? Much appreciated!
// id is not shown in the question
string Url = "https://malwr.com/analysis/MWI5MThhZWZhNDI0NDEyYThmOWMxMjc3MzRmZjQ1MDg" + id;
HtmlWeb web = new HtmlWeb();
HtmlDocument webpage = web.Load(Url);
HtmlNodeCollection callTable = webpage.DocumentNode.SelectNodes("//div[@class='calltable']"); // alternative tried: //div[contains(@class, 'calltable')]
// Just a test
HtmlNodeCollection boxContentTest = webpage.DocumentNode.SelectNodes("//div[@class='box-content']");

asp.net C# get final page source of a webpage

I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, a script runs and collects the data that I need in a different format, so my method ends up looking for something that has been completely changed.
What I am talking about is exactly like the difference between:
1. right-click on a webpage -> select view source
2. access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how any of the email addresses is displayed using each option. I think what is happening is that the first shows the initial load of the page and the second shows the final page HTML. WebClient only lets me do option #1.
Here is the code, which only returns option #1. Oh, and I need to do this from a console application. Thank you!
private static string GetReader(string site)
{
    try
    {
        // Dispose the client, stream and reader once the content has been read
        using (WebClient client = new WebClient())
        using (Stream data = client.OpenRead(site))
        using (StreamReader reader = new StreamReader(data))
        {
            return reader.ReadToEnd();
        }
    }
    catch
    {
        return "";
    }
}

I've found a solution to my problem.
I ended up using the Selenium WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is easy to learn, and it helps with testing as well as with this!
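For anyone landing here, this is roughly what that looks like. It is a minimal sketch, not the exact code I used: it assumes the Selenium.WebDriver NuGet package, a local Chrome installation with a matching chromedriver, and the names RenderedPage and GetFinalSource are made up for illustration.

```csharp
using System;
using OpenQA.Selenium.Chrome;

public static class RenderedPage
{
    // Loads the page in a real browser and returns the HTML after scripts have run.
    public static string GetFinalSource(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // no visible window, which suits a console application

        // Disposing the driver closes the browser and stops the driver process.
        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            // PageSource is read from the browser after the page has loaded,
            // so it includes changes made by the page's scripts up to this point.
            return driver.PageSource;
        }
    }
}
```

From a console application this could be called as string html = RenderedPage.GetFinalSource("http://www.augsburg.edu/history/fac_listing.html"); and the result passed to whatever parsing code was previously fed by WebClient. Content that a page loads some time after the initial load may need an explicit wait before reading PageSource.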
