I'm using Selenium with the PhantomJSDriver:
using (IWebDriver driver = new PhantomJSDriver())
{
    driver.Navigate().GoToUrl("http://www.google.com");
    var content = driver.PageSource; // wrong content returned
}
content always comes back as "<html><head></head><body></body></html>",
but driver.PageSource does properly contain the full site content.
What is wrong here? Really strange behavior.
You are experiencing a timing issue: the content is retrieved before the DOM has finished loading. The easiest way to confirm this is to add Thread.Sleep(2000) before retrieving the content. That is not good practice, however, so instead use the waits the driver provides, or wait for a specific DOM element to be present before retrieving the content.
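For example, here is a minimal sketch of an explicit wait using WebDriverWait from the Selenium support package; the element waited for (Google's search box, name "q") is only an illustration:
// Sketch: explicit wait with WebDriverWait (Selenium.Support package).
// The element waited for ("q", Google's search box) is just an illustration.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Support.UI;

using (IWebDriver driver = new PhantomJSDriver())
{
    driver.Navigate().GoToUrl("http://www.google.com");

    // Wait up to 10 seconds for a known element to appear before reading the source.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => d.FindElements(By.Name("q")).Count > 0);

    var content = driver.PageSource;
}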
I have not found a solution for downloading a whole webpage.
All I want is to navigate to https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?
and download it. Is it possible to download the page with Selenium?
I used the following code to navigate to the page:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
    driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
}
You can retrieve the page source with the driver.PageSource property and save it to a file:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
    driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
    // Note: await requires an async context (for example, an async Main method).
    await File.WriteAllTextAsync("PageSource.html", driver.PageSource);
}
For downloading JSON this works well.
For HTML pages, however, note:
If the page has been modified after loading (for example, by JavaScript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server.
References
https://www.selenium.dev/selenium/docs/api/dotnet/html/P_OpenQA_Selenium_IWebDriver_PageSource.htm
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/file-system/how-to-write-to-a-text-file
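If you want the exact response the server sent (such as the raw JSON in this case), a simpler alternative, not part of the original answer, is to skip the browser entirely and download the URL directly. A minimal HttpClient sketch follows; the endpoint may of course refuse non-browser requests:
// Alternative sketch: fetch the raw response directly, without a browser.
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class DirectDownload
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var json = await client.GetStringAsync(
                "https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
            await File.WriteAllTextAsync("PageSource.json", json);
        }
    }
}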
I am trying to access some nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one.
I am confused about how to access that secondary HTML document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the information needed to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect comes from ScrapySharp (ScrapySharp.Extensions namespace).
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
As an example, the code above scrapes the Presented By field.
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions, which I find easier.
The URL you are scraping from is your problem. I am scraping from the last GET request that is performed after the page has loaded, which you can identify by using the Firefox developer tools to analyze the site's traffic (network requests and responses).
I could not yet identify what triggers this HTTP request in the end (it may be JavaScript code, or one of the frame HTML documents requested by the main, frame-enabled document).
If you only have a couple of URLs like this to scrape, then even extracting the correct URL manually is an option.
Context:
I'm developing a desktop application in C# to scrape / analyse product information from individual web pages in a small number of domains. I use HtmlAgilityPack to capture and parse pages to fetch the data needed. I code different parsing rules for different domains.
Issue:
Pages from one particular domain can show perhaps 60-80 products when displayed in a browser. However, when I parse them with HtmlAgilityPack I get at most 20 products. Looking at the raw HTML in Firefox's "View Page Source", there also appear to be only 20 of the relevant product divs present. I conclude that the remaining products must be loaded in via a script, perhaps to ease the load on the server. Indeed, I can sometimes see this happening in the browser: there is a short pause while 20 more products load, then another 20, and so on.
Question:
How can I access, through HtmlAgilityPack or otherwise, the full set of product divs present once all the scripting is complete?
You could use the WebBrowser control from System.Windows.Forms to load the data and HtmlAgilityPack to parse it. It would look something like this:
// Requires references to System.Windows.Forms, Microsoft.mshtml (for IHTMLDocument3)
// and HtmlAgilityPack, plus an [STAThread] entry point for the WebBrowser control.
var browser = new WebBrowser();
browser.Navigate("http://whatever.com");
while (true)
{
    if (browser.ReadyState == WebBrowserReadyState.Complete && !browser.IsBusy)
    {
        break;
    }
    // Not for production: pump the message loop and poll until the page has loaded.
    Application.DoEvents();
    Thread.Sleep(1000);
}
var doc = new HtmlAgilityPack.HtmlDocument();
// Grab the live DOM (including script-generated content) rather than the raw response.
var dom = (IHTMLDocument3)browser.Document.DomDocument;
var reader = new StringReader(dom.documentElement.outerHTML);
doc.Load(reader);
see here for more details
Ok, I've got something working using the Selenium package (available via NuGet). The code looks like this:
private HtmlDocument FetchPageWithSelenium(string url)
{
    IWebDriver driver = new FirefoxDriver();
    IJavaScriptExecutor js = (IJavaScriptExecutor)driver;

    driver.Navigate().GoToUrl(url);

    // Scroll to the bottom of the page and pause for more products to load.
    // Do it four times as there may be 4x20 products to retrieve.
    for (int i = 0; i < 4; i++)
    {
        js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
        Thread.Sleep(2000);
    }

    HtmlDocument webPage = new HtmlDocument();
    webPage.LoadHtml(driver.PageSource);

    driver.Quit();
    return webPage;
}
This returns an HtmlAgilityPack HtmlDocument ready for further analysis, having first forced the page to load fully by repeatedly scrolling to the bottom. Two issues remain outstanding:
The code launches Firefox and then closes it again when complete. That's a bit clumsy and I'd rather it all happened invisibly. It has been suggested that you can avoid this by using a PhantomJS driver instead of the Firefox driver, but that didn't help as it just pops up a Windows console window instead (see the headless sketch below).
It's a bit slow due to the time taken to load the browser and the pauses while the scripting loads the supplementary content. I can probably live with that, though.
I'll also try to rework swestner's code to get it running in a WPF app and see which is the tidier solution.
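As a possible fix for the first issue (not part of the original answer), newer Selenium setups can run Chrome headlessly, which avoids both the visible browser window and the PhantomJS console window. A minimal sketch:
// Sketch: running Chrome headlessly so no browser window appears.
// Requires the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver packages.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless");   // no visible browser window

using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("http://whatever.com");
    var html = driver.PageSource;
}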
I'm trying to scrape a webpage using HtmlAgilityPack in a C# WebForms project.
All the solutions I've seen for doing this use a WebBrowser control. However, from what I can determine, this is only available in WinForms projects.
At present I'm calling the required page via this code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");
An example bit of code that I've seen, which uses the WebBrowser control:
if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);
Any suggestions or pointers on how to grab the page once the AJAX content has loaded would be appreciated.
It seems that with HtmlAgilityPack it is only possible to scrape content that is present in the HTML itself, so anything loaded via AJAX will not be visible to HtmlAgilityPack.
Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to determine the source of the data loaded by AJAX, and then request that source data directly. An added advantage of this can be the ability to scrape a larger dataset.
I struggled all day to get this right, so here is a FedEx tracking example of what the accepted answer is referring to (I think):
Dim body As String
body = "data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":" & Chr(34) & "YOUR TRACKING NUMBER HERE" & Chr(34) & ",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
body = body & "&action=trackpackages&locale=en_US&version=1&format=json"
With CreateObject("MSXML2.XMLHTTP")
    .Open("POST", "https://www.fedex.com/trackingCal/track", False)
    .setRequestHeader("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE")
    .setRequestHeader("User-Agent", "Mozilla/5.0")
    .setRequestHeader("X-Requested-With", "XMLHttpRequest")
    .setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
    .send(body)
    Dim Reply = .responseText
End With
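For anyone working in C#, here is a rough equivalent of the call above using HttpClient. This is a sketch based on the VB example: the FedEx endpoint, headers and form body are copied from it and may have changed since.
// Sketch: the same FedEx tracking POST as above, done with HttpClient in C#.
// Endpoint, headers and form body are taken from the VB example and may be outdated.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class FedExTrackingExample
{
    static async Task Main()
    {
        var body =
            @"data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP""," +
            @"""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{}," +
            @"""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":""YOUR TRACKING NUMBER HERE""," +
            @"""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}" +
            "&action=trackpackages&locale=en_US&version=1&format=json";

        using (var client = new HttpClient())
        {
            var request = new HttpRequestMessage(HttpMethod.Post, "https://www.fedex.com/trackingCal/track");
            request.Headers.TryAddWithoutValidation("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE");
            request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0");
            request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
            request.Content = new StringContent(body, System.Text.Encoding.UTF8,
                "application/x-www-form-urlencoded");

            var response = await client.SendAsync(request);
            var reply = await response.Content.ReadAsStringAsync();
            Console.WriteLine(reply);
        }
    }
}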
Alternatively, have you considered building a browser into your application using CefSharp and then using the DevTools through the .NET interface?
You may have noticed that even dynamically AJAX/JS-generated HTML can be found using, for example, the Inspect Element option in Firefox. So that markup is sitting on your computer even if you can't scrape it using traditional HTML scraping methods.
Another option to consider:
https://cefsharp.github.io/
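A rough sketch of what that could look like with the CefSharp.OffScreen package follows; this is not from the original answer, and the exact API varies between CefSharp versions:
// Sketch using the CefSharp.OffScreen NuGet package (API details may differ by version).
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

class CefSharpExample
{
    static async Task Main()
    {
        Cef.Initialize(new CefSettings());

        using (var browser = new ChromiumWebBrowser("http://whatever.com"))
        {
            var tcs = new TaskCompletionSource<bool>();

            // Signal once the page (including script-driven loading) has finished.
            browser.LoadingStateChanged += (sender, args) =>
            {
                if (!args.IsLoading)
                {
                    tcs.TrySetResult(true);
                }
            };

            await tcs.Task;

            // Grab the rendered DOM rather than the raw server response.
            string html = await browser.GetSourceAsync();
            Console.WriteLine(html.Length);
        }

        Cef.Shutdown();
    }
}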
I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, a JavaScript runs and renders the data I need in a different format, so my method ends up looking for something that has completely changed.
What I am talking about is exactly the difference between:
1. right-clicking on a webpage -> selecting View Source
2. looking at the page in the developer tools
Look at this site to see what I mean: http://www.augsburg.edu/history/fac_listing.html and notice how the email addresses are displayed with each option. I think what is happening is that the first shows you the initial load of the page, while the second shows you the final page HTML. WebClient only gives me option #1.
Here is the code, which only returns option #1. Oh, and I need to do this from a console application. Thank you!
// Requires: using System.IO; using System.Net;
private static string GetReader(string site)
{
    WebClient client = new WebClient();
    StreamReader reader;
    try
    {
        Stream data = client.OpenRead(site);
        reader = new StreamReader(data);
    }
    catch
    {
        return "";
    }
    return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using the Selenium WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver; it is an easy thing to learn, it helps with testing, and it solves problems like this one.
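For illustration, here is a minimal sketch of that approach; the driver choice (ChromeDriver) is an assumption, and the URL is just the example page from the question:
// Sketch: reading the post-JavaScript page source with Selenium WebDriver.
// ChromeDriver is an illustrative choice; any WebDriver implementation works.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class FinalSourceExample
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.augsburg.edu/history/fac_listing.html");

            // PageSource reflects the DOM after scripts have run,
            // unlike the raw response returned by WebClient.OpenRead.
            string finalHtml = driver.PageSource;
            Console.WriteLine(finalHtml.Length);
        }
    }
}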