Im trying to scrape a webpage using HTMLAgilityPack in a c# webforms project.
All the solutions Ive seen for doing this use a WebBrowser control. However, from what I can determine, this is only available in WinForms projects.
At present Im calling the required page via this code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[#class=\"nav\"]");
An example bit of code that Ive seen saying to use the WebBrowser control:
if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
_htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);
Any suggestions / pointers as to how to grab the page once AJAX has been loaded, will be appreciated.
It seems that using HTMLAgilityPack it is only possible to scrape content that is loaded via the html itself. Thus anything loaded via AJAX will not be visible to HTMLAgilityPack.
Perhaps the easiest option -where feasible- is to use a browser based tool such as Firebug to determine the source of the data loaded by AJAX. Then manipulate the source data directly. An added advantage of this might be the ability to scrape a larger dataset.
I struggled all day to get this right so here is a FedEx tracking example of what the accepted answer is referring to (I think):
Dim body As String
body = "data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":" & Chr(34) & "YOUR TRACKING NUMBER HERE" & Chr(34) & ",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
body = body & "&action=trackpackages&locale=en_US&version=1&format=json"
With CreateObject("MSXML2.XMLHTTP")
.Open("POST", "https://www.fedex.com/trackingCal/track", False)
.setRequestHeader("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE")
.setRequestHeader("User-Agent", "Mozilla/5.0")
.setRequestHeader("X-Requested-With", "XMLHttpRequest")
.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
.send(body)
Dim Reply = .responseText
End With
Alternatively have you considered building a browser into your application using Cefsharp.net and then using Dev Tools through the .net interface?
You may have noticed that even dynamically AJAX/JS generated HTML can be found using e.g. Inspect Element option in Firefox. So that code is sitting on your computer even if you can't scrape it using traditional HTML scraping methods.
Another option to consider.
https://cefsharp.github.io/
Related
This question already has answers here:
C# .NET: Scraping dynamic (JS) websites
(1 answer)
htmlagilitypack and dynamic content issue
(3 answers)
Closed 4 years ago.
UPDATE #2 Continuing on this effort (see original and update #1 below). ScrapySharp had potential but no matter what I tried, the process consumed all available memory and didn't produce anything. I did find that, due to jQuery, in a test WebBrowser control, the correct web page does not load. It seems the site has a function that determines how you got to the page requested, validates something about your browser, and redirects you to a generic sign-up page.
Thoughts on how to appease the gate keeper?
UPDATE #1 (details of original, not unique question below):
First of all - THANK YOU! For the suggestions!
I tried to use ScrapySharp as #John pointed out as:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
However, it resulted in a memory leak. To get a sense of how it works, I tried:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri(url));
HtmlNode rawHTML = PageResult.Html;
var imgNodes = rawHTML.SelectNodes("//img");
Which also created a memory leak. What am I missing with my implementation of it?
ORIGINAL QUESTION:
I'm trying to get my application to grab specific images from a web site. So far I've been using HtmlAgilityPack but it only grabs the basic HTML. I don't know how to explain the tags from the missing elements other than they show up when using Inspect in Chrome (but Regex and HtmlAgilityPack can't seem to access/see them), and they have a "data-v-??????" identifier inside the tag. Here's an example:
<div data-v-1a7a6550="" class="product-extra-images"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_1Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_2Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"></div>
Please let me know if you need additional details. Here's a sample of my latest code that couldn't extract the elements (in case it helps):
var htmlDoc = new HtmlAgilityPack.HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true,
OptionReadEncoding = false
};
var imgNodes = htmlDoc.DocumentNode.SelectNodes("//div");
foreach (var imgNode in imgNodes)
{
//decode the string
var img = HttpUtility.HtmlDecode(imgNode.InnerText).Trim();
imagesouces.Add(img);
}
File.WriteAllLines(#"C:\Users\user\Desktop\WriteText.txt", imagesouces);
I am trying to access these nodes
on this website.
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
however they appear to be in a secondary Html document within the initial one.
I am confused how I access the secondary html path and then parse through for the
this is an example of one of the nodes.
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using htmlAgility pack and I recieve null whenever I try to access Div.
I tried working my way down the nodes but It didn't work.
Any help or a place to look up the necessary information to figure this out would be appreciated
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95- 623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link");
if (presentedBy != null)
{
Console.WriteLine(presentedBy.FirstOrDefault().InnerText);
}
As an example, scraping the Presented By field:
Remarks:
I use ScrapySharp nuget package along with HtmlAgilityPack, so I can scrape using css selectors instead of xpath expressions - something I find easier to do.
The url you are scraping from is your problem. I am scraping from the last get request that is performed after the page is loaded, as you can see in the screenshot below, using Firefox developer tools to analyze the site traffic/network requests/responses:
I could not yet identify who/what triggers this http request in the end (may be by javascript code, may be via one of the frame htmls that are requested in the main document (the frame-enabled one).
If you only have a couple of urls like this to scrape, then even manually extracting the correct url will be an option.
I;m using Selenium with PhantomJsdriver
using (IWebDriver driver = new PhantomJSDriver())
{
driver.Navigate().GoToUrl("http://www.google.com");
var content = driver.PageSource; >> wrong content return
}
content always get "<html><head></head><body></body></html>"
but driver.PageSource is properly get the full site content.
What's wrong that? Really strange behavior.
You are experiencing an issue with timing. The content is retrieved before the entire DOM content is loaded. The easiest way to check for that is to add Thread.Sleep(2000) before retrieving content. This however is not a good practice so utilize the events that the driver provides you with before retrieving content, or wait for a specific DOM element to be loaded before retrieving content.
I'm trying to get the FINAL source of a webpage. I am using webclient openRead method, but this method is only returning the initial page source. After the source downloads, there is a javascript that runs and collect the data that I need in a different format and my method will be looking for something that got completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to know what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how any of the email is displayed using each option. I think what happening is that the first will show you the initial load of the page. The second will show you the final page html. The webclient only lets me do option #1.
here is the code that will only return option #1. Oh I need to do this from a console application. Thank you!
private static string GetReader(string site)
{
WebClient client = new WebClient();
try
{
data = client.OpenRead(site);
reader = new StreamReader(data);
}
catch
{
return "";
}
return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using Selenium-WebDriver PageSource property. It worked beautifully!
Learn about Selenium and Webdriver. It is an easy thing to learn. It helps for testing and on this!
I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing so why not scrape something interesting?) but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem so I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[#id='MainTable']/tbody/tr[1]/td/table[#id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[#id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
Ok so it appears that my xpaths have tbody's in them. When I remove these tbodys manually from the xpath, HTMLAgilityPack can handle it fine.
I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
I think unless my xpath knowledge is heaps flawed(probably) the problem is with the /tbody node in your xpath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(#"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = #"//table[#id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine.. returns a
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the html I couldn't find a /tbody.