bypassing javascript to get page source

bypassing javascript to get page source - c#

1)I am building an app for downloading all my pictures from my instagram profile.
2)I don't want to use instagram api so i searched and i find Selenium Webdriver and i add this package to my project
3)I am using c# to build android app (Xamarin) check below Code.
string url = "https://www.instagram.com/" + _theUsername;
IWebDriver driver = new FirefoxDriver();
driver.Url = url;
string source = driver.PageSource;
4)By Using above code i am getting below error :-
OpenQA.Selenium.DriverServiceNotFoundException: The geckodriver file does not exist in the current directory or in a directory on the PATH environment variable
5)Is it possible to use this in xamarin? i need to find a way to do this
any help will be greatly appreciated

I am not familiar with Selenium, but getting the page source for a url is easy with HttpClient:
string url = "https://www.instagram.com/" + _theUsername;
using (var httpClient = new HttpClient()){
var response = await httpClient.GetStringAsync(url);
Debug.WriteLine(response);
}
and response will have your page source. And if you do not want to show the web page then there is no need for a WebView anyway.

Related

How do I download a page with Selenium

I did not find a solution how to download a whole Webpage
All I want is to navigate to https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?
and download it. Is it possible to download the page with Selenium?
I used the following Code to navigate to the page:
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
}

You can retrieve the page source content with driver.PageSource command. And save it to the file.
var options = new ChromeOptions();
using (var driver = new ChromeDriver(".", options))
{
driver.Navigate().GoToUrl("https://api.tracker.gg/api/v2/rocket-league/standard/profile/epic/ManuelNotManni?");
await File.WriteAllTextAsync("PageSource.html", driver.PageSource);
}
For downloading json it will work well.
But for html pages, note:
If the page has been modified after loading (for example, by JavaScript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server.
References
https://www.selenium.dev/selenium/docs/api/dotnet/html/P_OpenQA_Selenium_IWebDriver_PageSource.htm
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/file-system/how-to-write-to-a-text-file

Click on HTML elements with Scrapy (WebScraping)

I'm doing a program in c # using scrapySharp or HtmlAgilityPack. But I have the disadvantage of that part of the information that I need, to appear when I click on an HTML element (Button, link ).
In some forums it was commented that when using Selenium you could manipulate the html elements, so I tried the following
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
// Defines the interface with the Chrome browser
IWebDriver driver = new ChromeDriver ();
// Auxiliary to store the label element in href
Element IWebElement;
// Go to the website
driver.Url = url;
// Click on the download button
driver.FindElement (By.Id ("Download button")). Click ();
but being a web automation test, it opens a browser and the website to perform the selection process (clicks), so it is not of my use, since I have to perform the inspection on several websites internally.
Although I can continue using Selenium, I am looking for ways to avoid using the browser and instead click without it.
Does anyone know how to achieve the click of the link or button, without the need to open a browser for web scraping?

Hope this would be helpful to anyone who has the same requirements.
If you want to avoid opening the browser, you could use below settings in the ChromeDriver.
// settings for avoid opening browser
var options = new ChromeOptions();
options.AddArgument("headless");
var service = ChromeDriverService.CreateDefaultService();
service.HideCommandPromptWindow = true;
// url to access and scrape
var url = "https://example.com";
using (var driver = new ChromeDriver(service, options))
{
// access the url
driver.Navigate().GoToUrl(url);
// Click on the download button - copied from your code above
driver.FindElement (By.Id ("Download button")). Click ();
}
In addition to above below links also, you may find useful,
can-selenium-webdriver-open-browser-windows-silently-in-background
running-webdriver-without-opening-actual-browser-window

How to detect the origin of a webpage's GET requests programmatically? (C#)

In short, I need to detect a webpage's GET requests programmatically.
The long story is that my company is currently trying to write a small installer for a piece of proprietary software that installs another piece of software.
To get this other piece of software, I realize it's as simple as calling the download link through C#'s lovely WebClient class (Dir is just the Temp directory in AppData/Local):
using (WebClient client = new WebClient())
{
client.DownloadFile("[download link]", Dir.FullName + "\\setup.exe");
}
However, the page which the installer comes from does is not a direct download page. The actual download link is subject to change (our company's specific installer might be hosted on a different download server another time around).
To get around this, I realized that I can just monitor the GET requests the page makes and dynamically grab the URL from there.
So, I know I'm going to do, but I was just wondering, is there was a built-in part of the language that allows you to see what requests a page has made? Or do I have to write this functionality myself, and what would be a good starting point?

I think I'd do it like this. First download the HTML contents of the download page (the page that contains the link to download the file). Then scrape the HTML to find the download link URL. And finally, download the file from the scraped address.
using (WebClient client = new WebClient())
{
// Get the website HTML.
string html = client.DownloadString("http://[website that contains the download link]");
// Scrape the HTML to find the download URL (see below).
// Download the desired file.
client.DownloadFile(downloadLink, Dir.FullName + "\\setup.exe");
}
For scraping the download URL from the website I'd recommend using the HTML Agility Pack. See here for getting started with it.

I think you have to write your own "mediahandler", which returns a HttpResponseMessage.
e.g. with webapi2
[HttpGet]
[AllowAnonymous]
[Route("route")]
public HttpResponseMessage GetFile([FromUri] string path)
{
HttpResponseMessage result = new HttpResponseMessage(HttpStatusCode.OK);
result.Content = new StreamContent(new FileStream(path, FileMode.Open, FileAccess.Read));
string fileName = Path.GetFileNameWithoutExtension(path);
string disposition = "attachment";
result.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue(disposition) { FileName = fileName + Path.GetExtension(absolutePath) };
result.Content.Headers.ContentType = new MediaTypeHeaderValue(MimeMapping.GetMimeMapping(Path.GetExtension(path)));
return result;
}

PhantomJS pass HTML string and return page source

for a web crawler project in C# I try to execute Javascript and Ajax to retrieve the full page source of a crawled page.
I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object. Therefore I cannot simply use driver.Navigate().GoToUrl() method to retrieve the page source.
The crawler downloads the page source and I want to execute the existing Javascript/Ajax inside the source.
In a sample project I tried the following without success:
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
File.WriteAllText(tmpPath, content);
var driverService = PhantomJSDriverService.CreateDefaultService();
var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri(tmpPath));
string renderedContent = driver.PageSource;
driver.Quit();
You need the following nuget packages to run the sample:
https://www.nuget.org/packages/phantomjs.exe/
http://www.nuget.org/packages/selenium.webdriver
Problem here is that the code stops at GoToUrl() and it takes several minutes until program terminates without even giving me the driver.PageSource.
Doing this returns the correct HTML:
driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string renderedContent = driver.PageSource;
But I don't want to download the data twice. The crawler (Abot) downloads the HTML and I just want to parse/render the javascript and ajax.
Thank you!

Without running it, I would bet you need file:/// prior to tmpPath. That is:
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
File.WriteAllText(tmpPath, content);
var driverService = PhantomJSDriverService.CreateDefaultService();
var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
string renderedContent = driver.PageSource;
driver.Quit();

You probably need to allow PhantomJS to make arbitrary requests. Requests are blocked when the domain/protocol doesn't match as is the case when a local file is opened.
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;
driverService.WebSecurity = false; // may not be necessary
var driver = new PhantomJSDriver(driverService);
You might need to combine this with the solution of Dave Bush:
driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
Some of the resources have URLs that begin with // which means that the protocol of the page is used when the browser retrieves those resources. When a local file is read, this protocol is file:// in which case none of those resources will be found. The protocol must be added to the local file in order to download all those resources.
File.WriteAllText(tmpPath, content.Replace('"//', '"http://'));
It is apparent from your output that you use PhantomJS 1.9.8. It may be the case that a newly introduced bug is responsible for this sort of thing. You should user PhantomJS 1.9.7 with driverService.SslProcotol = 'tlsv1'.
You should also enable the disk cache if you do this multiple times for the same domain. Otherwise, the resources are downloaded each time you try to scrape it. This can be done with driverService.DiskCache = true;

Automatically download an xml file from URI with download popup

ASP.NET MVC 4 Razor:
I've been working at this for a bit, so I apologize if I'm missing something obvious, but I will truly appreciate any assistance that could be offered.
In a nutshell, what I'm looking to do is download an XML file from a URI using C#. It ought to be pretty straightforward, but the URI leads to a blank page with a download prompt popup populated with a dynamically created filename.
I can't provide the URI due to its confidential nature, but here is the code I've been toying with. (Forgive my ignorance on this matter, it's the first time I've tried anything like this)
byte[] data;
using (WebClient Client = new WebClient())
{
data = Client.DownloadData(uriString + fileString);
}
File.WriteAllBytes(dirString + fileString, data);
I've also tried:
using (WebClient Client = new WebClient())
{
Client.DownloadFile(uriString + fileString, dirString + fileString);
}
To be honest, this code doesn't really work for me. The downloaded files aren't correct. The XML files appear to contain the code from the webpage they've been downloaded from, and if I try something like an image, the image is broken. So, again, any assistance would be appreciated.
Thanks in advance!

The URI that you are using is probably wrong. You are using the URI that opens the popup page. The popup page should be doing another GET to the dynamically generated file.
To automate this process, you should use a WebRequest to get the contents of the popup page. Scrape the contents of the page to get the actual URL to download the file. Then use the code you have written to download the file.
var request = WebRequest.Create("PopupUrl");
var response = request.GetResponse();
string url = GetUrlFromResponseByRegExOrXMLParsing();
var client = new WebClient();
webClient.DownloadFile(url, filePath);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

bypassing javascript to get page source - c#

Related

How do I download a page with Selenium

Click on HTML elements with Scrapy (WebScraping)

How to detect the origin of a webpage's GET requests programmatically? (C#)

PhantomJS pass HTML string and return page source

Automatically download an xml file from URI with download popup

Categories

Resources