How do I download a page with Selenium - c#

I have not found a solution for downloading a whole webpage.
All I want is to navigate to https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?
and download it. Is it possible to download the page with Selenium?
I used the following code to navigate to the page:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
    driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
}

You can retrieve the page source with the driver.PageSource property and save it to a file:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
    driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
    await File.WriteAllTextAsync("PageSource.html", driver.PageSource);
}
For downloading JSON this works well. For HTML pages, however, note:
If the page has been modified after loading (for example, by JavaScript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server.
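If the page you are saving is modified by JavaScript after it loads, one way to reduce that risk (a minimal sketch, not from the original answer) is to wait until the browser reports the document as fully loaded before reading PageSource:
using System;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;   // WebDriverWait lives in the Selenium.Support package

var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
    driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");

    // Wait up to 10 seconds until the browser reports the document as fully loaded.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => ((IJavaScriptExecutor)d)
        .ExecuteScript("return document.readyState").Equals("complete"));

    await File.WriteAllTextAsync("PageSource.html", driver.PageSource);
}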
References
https://www.selenium.dev/selenium/docs/api/dotnet/html/P_OpenQA_Selenium_IWebDriver_PageSource.htm
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/file-system/how-to-write-to-a-text-file

Related

Click on HTML elements with Scrapy (WebScraping)

I'm writing a program in C# using ScrapySharp or HtmlAgilityPack, but part of the information I need only appears after clicking on an HTML element (a button or a link).
In some forums it was mentioned that Selenium lets you manipulate HTML elements, so I tried the following:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Defines the interface with the Chrome browser
IWebDriver driver = new ChromeDriver();
// Auxiliary variable to store the element found via its href
IWebElement element;
// Go to the website
driver.Url = url;
// Click on the download button
driver.FindElement(By.Id("Download button")).Click();
However, since this is browser automation, it opens a browser window and the website to perform the clicks, which does not work for me, because I have to inspect several websites internally.
Although I could keep using Selenium, I am looking for a way to avoid opening the browser and still perform the clicks.
Does anyone know how to click a link or button for web scraping without opening a browser?
Hope this is helpful to anyone with the same requirements.
If you want to avoid opening the browser, you can use the following settings with the ChromeDriver:
// Settings to avoid opening a visible browser window
var options = new ChromeOptions();
options.AddArgument("headless");
var service = ChromeDriverService.CreateDefaultService();
service.HideCommandPromptWindow = true;
// URL to access and scrape
var url = "https://example.com";
using (var driver = new ChromeDriver(service, options))
{
    // Access the URL
    driver.Navigate().GoToUrl(url);
    // Click on the download button - copied from your code above
    driver.FindElement(By.Id("Download button")).Click();
}
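If you then need to pull the data that appears after the click out of the page, one option (a sketch, not part of the original answer; the XPath is a placeholder) is to hand driver.PageSource to HtmlAgilityPack, which you are already using:
using System;
using HtmlAgilityPack;

// Still inside the using block above, after the Click():
var doc = new HtmlDocument();
doc.LoadHtml(driver.PageSource);   // the HTML as rendered by the headless browser

// Placeholder XPath - replace with a selector that matches the data you need.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='result']");
if (nodes != null)
{
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText.Trim());
}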
In addition to the above, you may also find these links useful:
can-selenium-webdriver-open-browser-windows-silently-in-background
running-webdriver-without-opening-actual-browser-window

How to detect the origin of a webpage's GET requests programmatically? (C#)

In short, I need to detect a webpage's GET requests programmatically.
The long story is that my company is currently trying to write a small installer for a piece of proprietary software that installs another piece of software.
To get this other piece of software, I realize it's as simple as calling the download link through C#'s lovely WebClient class (Dir is just the Temp directory in AppData/Local):
using (WebClient client = new WebClient())
{
    client.DownloadFile("[download link]", Dir.FullName + "\\setup.exe");
}
However, the page the installer comes from is not a direct download page. The actual download link is subject to change (our company's specific installer might be hosted on a different download server another time around).
To get around this, I realized that I can just monitor the GET requests the page makes and dynamically grab the URL from there.
So I know what I'm going to do, but I was just wondering: is there a built-in part of the language that allows you to see what requests a page has made? Or do I have to write this functionality myself, and if so, what would be a good starting point?
I think I'd do it like this. First download the HTML contents of the download page (the page that contains the link to download the file). Then scrape the HTML to find the download link URL. And finally, download the file from the scraped address.
using (WebClient client = new WebClient())
{
    // Get the website HTML.
    string html = client.DownloadString("http://[website that contains the download link]");
    // Scrape the HTML to find the download URL (see below).
    // Download the desired file.
    client.DownloadFile(downloadLink, Dir.FullName + "\\setup.exe");
}
For scraping the download URL from the website I'd recommend using the HTML Agility Pack. See here for getting started with it.
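As a rough sketch of that scraping step (not from the original answer; the XPath is an assumption, since the real page structure is not shown), you could grab the first anchor whose href ends in .exe and feed it into the DownloadFile call above:
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);   // the HTML downloaded with DownloadString above

// Placeholder XPath: the first <a> element whose href ends with ".exe".
var linkNode = doc.DocumentNode.SelectSingleNode(
    "//a[substring(@href, string-length(@href) - 3) = '.exe']");
string downloadLink = linkNode?.GetAttributeValue("href", "");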
I think you have to write your own "media handler", which returns an HttpResponseMessage.
For example, with Web API 2:
[HttpGet]
[AllowAnonymous]
[Route("route")]
public HttpResponseMessage GetFile([FromUri] string path)
{
    var result = new HttpResponseMessage(HttpStatusCode.OK);
    // Stream the file from disk into the response body
    result.Content = new StreamContent(new FileStream(path, FileMode.Open, FileAccess.Read));
    string fileName = Path.GetFileNameWithoutExtension(path);
    string disposition = "attachment";
    result.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue(disposition)
    {
        FileName = fileName + Path.GetExtension(path)
    };
    result.Content.Headers.ContentType = new MediaTypeHeaderValue(MimeMapping.GetMimeMapping(Path.GetExtension(path)));
    return result;
}
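Calling such an endpoint from the installer would then look roughly like the WebClient code from the question (a sketch; the host name and query string are placeholders):
using (WebClient client = new WebClient())
{
    // Placeholder host and route: point this at wherever the handler above is hosted.
    client.DownloadFile("https://your-server/route?path=installers/setup.exe",
                        Dir.FullName + "\\setup.exe");
}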

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access some nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one, and I am not sure how to access that secondary document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes but it didn't work.
Any help, or a pointer to the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;   // provides the CssSelect extension method

HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link");
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.FirstOrDefault()?.InnerText);
}
As an example, the code above scrapes the Presented By field.
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions - something I find easier to do.
The url you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which you can find by using the Firefox developer tools to analyze the site's traffic (network requests and responses).
I have not yet been able to identify what triggers this HTTP request in the end (it may be JavaScript code, or it may come via one of the frame HTML documents requested by the main, frame-enabled document).
If you only have a couple of urls like this to scrape, then even extracting the correct url manually is an option.
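Applying the same approach to the example node from the question is straightforward (a sketch; I have not checked which of the .mls29 divs on that report holds the value you are after):
// Select every div carrying the "mls29" class used in the question's example node.
foreach (var node in hd.DocumentNode.CssSelect(".mls29"))
{
    Console.WriteLine(node.InnerText);
}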

PhantomJS pass HTML string and return page source

For a web crawler project in C#, I am trying to execute JavaScript and Ajax to retrieve the full page source of a crawled page.
I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object, so I cannot simply use the driver.Navigate().GoToUrl() method to retrieve the page source.
The crawler downloads the page source and I want to execute the existing Javascript/Ajax inside the source.
In a sample project I tried the following without success:
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
File.WriteAllText(tmpPath, content);
var driverService = PhantomJSDriverService.CreateDefaultService();
var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri(tmpPath));
string renderedContent = driver.PageSource;
driver.Quit();
You need the following nuget packages to run the sample:
https://www.nuget.org/packages/phantomjs.exe/
http://www.nuget.org/packages/selenium.webdriver
The problem is that the code hangs at GoToUrl() and it takes several minutes until the program terminates, without ever giving me driver.PageSource.
Doing this returns the correct HTML:
driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string renderedContent = driver.PageSource;
But I don't want to download the data twice. The crawler (Abot) downloads the HTML and I just want to parse/render the javascript and ajax.
Thank you!
Without running it, I would bet you need file:/// prior to tmpPath. That is:
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
File.WriteAllText(tmpPath, content);
var driverService = PhantomJSDriverService.CreateDefaultService();
var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
string renderedContent = driver.PageSource;
driver.Quit();
You probably need to allow PhantomJS to make arbitrary requests. Requests are blocked when the domain/protocol doesn't match as is the case when a local file is opened.
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;
driverService.WebSecurity = false; // may not be necessary
var driver = new PhantomJSDriver(driverService);
You might need to combine this with the solution of Dave Bush:
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
Some of the resources have URLs that begin with // which means that the protocol of the page is used when the browser retrieves those resources. When a local file is read, this protocol is file:// in which case none of those resources will be found. The protocol must be added to the local file in order to download all those resources.
File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));
It is apparent from your output that you use PhantomJS 1.9.8. It may be the case that a newly introduced bug is responsible for this sort of thing. You should use PhantomJS 1.9.7 and set driverService.SslProtocol = "tlsv1";
You should also enable the disk cache if you do this multiple times for the same domain. Otherwise, the resources are downloaded each time you try to scrape it. This can be done with driverService.DiskCache = true;
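Putting the settings from this answer together, a combined setup might look like this (a sketch; whether WebSecurity must be disabled and whether the SslProtocol line is needed depends on your page and PhantomJS version):
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;   // let the locally loaded file request remote resources
driverService.WebSecurity = false;             // may not be necessary
driverService.DiskCache = true;                // cache resources across repeated scrapes of the same domain
driverService.SslProtocol = "tlsv1";           // relevant if you downgrade to PhantomJS 1.9.7 as suggested above

var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
string renderedContent = driver.PageSource;
driver.Quit();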

Strange selenium value return

I'm using Selenium with the PhantomJSDriver:
using (IWebDriver driver = new PhantomJSDriver())
{
    driver.Navigate().GoToUrl("http://www.google.com");
    var content = driver.PageSource; // wrong content returned here
}
content always contains "<html><head></head><body></body></html>", even though driver.PageSource itself does return the full site content.
What's wrong here? Really strange behavior.
You are experiencing a timing issue: the content is retrieved before the entire DOM has loaded. The easiest way to confirm this is to add Thread.Sleep(2000) before retrieving the content. That is not good practice, however, so instead use the events the driver provides, or wait for a specific DOM element to be present before retrieving the content.
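A minimal sketch of the "wait for a specific DOM element" approach (the element name q is just Google's search box, used here as an example):
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Support.UI;   // WebDriverWait lives in the Selenium.Support package

using (IWebDriver driver = new PhantomJSDriver())
{
    driver.Navigate().GoToUrl("http://www.google.com");

    // Wait up to 10 seconds for a known element to appear before reading the page source.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => d.FindElements(By.Name("q")).Count > 0);

    var content = driver.PageSource;   // now reflects the loaded DOM
}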
