I have a web browser, and a label in Visual Studio, and basically what I'm trying to do is grab a section from another webpage.
I tried using WebClient.DownloadString and WebClient.DownloadFile, and both of them give me the source code of the web page before the JavaScript loads the content. My next idea was to use a web browser tool and just call webBrowser.DocumentText after the page loaded and that did not work, it still gives me the original source of the page.
Is there a way I can grab the page post JavaScript load?
The problem is that the browser normally executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past; I used Selenium and PhantomJS to render the page. After it rendered the page, I would use the WebDriver client to navigate the DOM and retrieve the content I needed, post-AJAX.
At a high-level, these are the steps:
Installed Selenium: http://docs.seleniumhq.org/
Started the Selenium hub as a service
Downloaded PhantomJS (a headless browser that can execute JavaScript): http://phantomjs.org/
Started PhantomJS in WebDriver mode, pointing at the Selenium hub
In my scraping application, installed the WebDriver client NuGet package: Install-Package Selenium.WebDriver
Here is an example usage of the phantomjs webdriver:
var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled",true);
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
options.ToCapabilities(),
TimeSpan.FromSeconds(3)
);
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
driver.Navigate();
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
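Since the content arrives via AJAX, reading PageSource immediately can race the script. A hedged sketch that continues the snippet above, waiting for the element to appear first (assumes the Selenium.Support NuGet package; "some-id" is the placeholder id from the example):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Wait up to 10 seconds for the AJAX-rendered element to appear
// before reading the source, instead of racing the script.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
IWebElement loaded = wait.Until(d => d.FindElement(By.Id("some-id")));
var renderedSource = driver.PageSource; // now includes the AJAX content
```

WebDriverWait ignores NotFoundException while polling, so the lambda can simply call FindElement until it succeeds or the timeout elapses.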
More info on selenium, phantomjs and webdriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):
Install web driver:
Install-Package Selenium.WebDriver
Install embedded exe:
Install-Package phantomjs.exe
Updated code:
var driver = new PhantomJSDriver();
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
driver.Navigate();
//the driver can now provide you with what you need (it will execute the script)
//get the source of the page
var source = driver.PageSource;
//fully navigate the dom
var pathElement = driver.FindElementById("some-id");
Thanks to wbennet, I discovered PhantomJSCloud.com. The free tier is enough to scrape pages through web API calls.
public static string GetPagePhantomJs(string url)
{
using (var client = new System.Net.Http.HttpClient())
{
client.DefaultRequestHeaders.ExpectContinue = false;
var pageRequestJson = new System.Net.Http.StringContent
(@"{'url':'" + url + "','renderType':'html','outputAsJson':false }");
var response = client.PostAsync
("https://PhantomJsCloud.com/api/browser/v2/{YOUR_API_KEY}/",
pageRequestJson).Result;
return response.Content.ReadAsStringAsync().Result;
}
}
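Note that the request body above is built by string concatenation, which breaks if the URL itself contains quotes. A safer sketch, assuming the Newtonsoft.Json package (the field names mirror the snippet above):

```csharp
using Newtonsoft.Json;

public static class PhantomJsCloudRequest
{
    // Build the PhantomJsCloud request body safely; the field names
    // ("url", "renderType", "outputAsJson") come from the snippet above.
    public static string BuildPageRequest(string url)
    {
        var request = new { url = url, renderType = "html", outputAsJson = false };
        return JsonConvert.SerializeObject(request);
    }
}
```

The serialized string can be passed to StringContent in place of the concatenated literal.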
OK, I will show you how to enable JavaScript using PhantomJS and Selenium with C#.
Create a new console project and name it as you want.
Go to Solution Explorer on the right-hand side.
Right-click References and click Manage NuGet Packages.
A window will show; click Browse, then install Selenium.WebDriver.
Download PhantomJS from here: PhantomJS.
In your Main function, type this code:
var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);
IWebDriver driver = new PhantomJSDriver("phantomjs Folder Path", options);
driver.Navigate().GoToUrl("https://www.yourwebsite.com/");
try
{
string pagesource = driver.PageSource;
driver.FindElement(By.Id("yourelement"));
Console.Write("yourelement found");
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
Console.Read();
Don't forget to put your website, the element you're looking for, and the phantomjs.exe path on your machine in the code above.
Have a great time coding, and thanks wbennett.
Related
My C# .NET Core console application is a simple web crawler. On pages where the needed data is contained in the source code, I am able to access the needed data. In pages where the data can be copied from the window, viewed in the browser's Page Inspector, but NOT in the source code, I'm stuck.
Please provide code examples of how I can acquire this data.
My current capture code is below:
var htmlCode = string.Empty;
using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
{
// Get the file content without saving it
htmlCode = client.DownloadString("https://www.wedj.com/dj-photo-video.nsf/firstdance.html");
}
Using the above code, you receive the source code as seen here:
The data shown in image 1, as seen from the browser inspector, is hidden inside of
<div class="entry row">
There are a few ways to implement what you need (considering a C# console application).
Maybe the easiest one is to use tools that interact with an instance of a browser, e.g. Selenium (also used for UI tests).
So:
Install the Selenium.WebDriver NuGet package
Install a browser where your application will run (let's suppose Chrome)
Download the browser driver (chromedriver)
Write something like:
IWebDriver driver = null;
try
{
ChromeOptions options = new ChromeOptions();
options.AddArguments("--incognito");
driver = new ChromeDriver(options);
driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(5);
driver.Url = "https://www.wedj.com/dj-photo-video.nsf/firstdance.html";
var musicTable = driver.FindElement(By.Id("musicTable"));
// interact with driver to get data from the page.
}
finally
{
if (driver != null)
driver.Dispose();
}
Otherwise, you need to investigate a little more how the webpage works.
As far as I can see, the page loads a JavaScript file, https://www.wedj.com/dj-photo-video.nsf/musiclist.js, which is responsible for loading the list of music from the server.
This script basically loads data from the following URL: https://www.wedj.com/gbmusic.nsf/musicList?open&wedj=1&list=category_firstdance&count=100 (you can also open it in a browser). Excluding "(" and ")", the result is JSON you can parse (maybe using the Newtonsoft.Json package):
{
"more": "yes",
"title": "<h1>Most Requested Wedding First Dance Songs<\/h...",
"event": "<table class='musicTable g6-table-all g6-small' id='musicTable' borde..."
}
The event property contains the data you need (you can use HtmlAgilityPack nuget package to parse it).
Pro Selenium:
easy to interact with
the behavior is the same of what you see by the browser
Cons Selenium:
you need Chrome or another browser installed
the browser is running while you interact with it
the browser downloads the full page (images, HTML, JS, CSS...)
Pro manual:
you load only what you need
no dependencies to external programs (i.e. browsers)
Cons manual:
you need to understand how html/js works
you need to manually parse the json/html
In this specific case, I prefer the second option.
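As a rough sketch of that second option, assuming the Newtonsoft.Json and HtmlAgilityPack packages (the parenthesis wrapper and the event property follow the JSON shown above; extracting the first row is just an illustration):

```csharp
using Newtonsoft.Json.Linq;
using HtmlAgilityPack;

public static class MusicListParser
{
    // The endpoint wraps its JSON in parentheses, e.g. "({...})",
    // so trim them before parsing; "event" holds the music table HTML.
    public static string ExtractFirstRow(string rawResponse)
    {
        var json = rawResponse.Trim().TrimStart('(').TrimEnd(')');
        var payload = JObject.Parse(json);
        var html = (string)payload["event"];

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Return the text of the first table row as a simple demonstration.
        var firstRow = doc.DocumentNode.SelectSingleNode("//tr");
        return firstRow?.InnerText.Trim();
    }
}
```

The raw response itself would come from a plain HttpClient GET against the URL above; only the parsing is shown here.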
Read about the Selenium automation tool for C#; it will open every web page that you want to scrape and can then, e.g., return the source code or perform some actions on that page.
Generally this tool is not (AFAIK) meant for web crawlers, but it can be good at the beginning, especially if your .NET Core app is sitting on some virtual machine / Docker container.
But be careful: it may be risky to open unsafe pages via a browser.
You might want to try Puppeteer Sharp.
It allows you to get the current HTML state.
// Download a matching Chromium build (needed once), then launch it headless.
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true }))
using (var page = await browser.NewPageAsync())
{
    await page.GoToAsync("http://www.spapage.com");
    var result = await page.GetContentAsync();
}
https://github.com/kblok/puppeteer-sharp
Upon reaching navigate.GoToUrl("http://www.example.com/"), chromedriver.exe stops working, but it works when the FirefoxDriver is used:
using (IWebDriver driver = new ChromeDriver(DRIVER_PATH))
{
// driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));
INavigation navigate = driver.Navigate();
navigate.GoToUrl("http://www.example.com/");
}
The Chrome browser will open successfully.
Then, after a few seconds, "chromedriver.exe has stopped working" will appear.
Here's my debug.log file:
[0508/115012.911:ERROR:process_reader_win.cc(114)] NtOpenThread: {Access Denied} A process has requested access to an object, but has not been granted those access rights. (0xc0000022)
[0508/115012.912:ERROR:exception_snapshot_win.cc(87)] thread ID 7968 not found in process
[0508/115012.912:WARNING:crash_report_exception_handler.cc(60)] ProcessSnapshotWin::Initialize failed
ChromeDriver v2.9.248315 (chromedriver_win32.zip)
Google Chrome Version 58.0.3029.96 (64-bit)
Can anyone guess how to make it work in C#?
To work with Selenium 3.4.0 you need the latest ChromeDriver 2.29.x from here & the latest Google Chrome 58.0.
I don't see any issues in your code as such.
You might need to check whether Navigate has the GoToUrl method implemented in C#.
In Java we do it like this:
WebDriver driver1 = new ChromeDriver(c1);
Navigation navigate = driver1.navigate();
navigate.to("https://gmail.com");
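For comparison, the C# binding exposes the same call chain, so GoToUrl should be available; a minimal sketch (Chrome here is just an example browser):

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// C# equivalent of the Java snippet above: INavigation comes from
// driver.Navigate() and exposes GoToUrl().
IWebDriver driver = new ChromeDriver();
INavigation navigation = driver.Navigate();
navigation.GoToUrl("https://gmail.com");
driver.Quit();
```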
I am using Selenium WebDriver in C# for cross-browser testing, but I'm facing a strange problem: whenever I run my test using NUnit, a Firefox window opens first, and then my desired browser window opens and runs the test.
As far as I know, if a system does not have Firefox installed, the script fails.
So is there any way to change this default browser in Selenium?
I am able to run tests on different browsers; my problem is only that before opening my desired browser, the system opens Firefox by default, which creates issues for me and my tests.
public void SetupTest()
{
driver = new SafariDriver();
baseURL = "http://google.com/";
verificationErrors = new StringBuilder();
}
Most probably, somewhere in your code you are initializing the FirefoxDriver. Search for this within your code:
new FirefoxDriver();
You could also debug to the line
driver = new SafariDriver();
and see if it has an assigned value already.
But I am also pretty certain that you are initializing a FirefoxDriver somewhere.
I used IWebDriver to control IE for testing before, but the methods supported by IWebDriver and IWebElement are quite limited. I find that ISelenium/DefaultSelenium, which belong to the Selenium namespace, are very useful. How can I use them to control IE without installing the Selenium Server?
Here's the constructor of DefaultSelenium:
ISelenium sele = new DefaultSelenium(**serveraddr**, **serverport**, browser, url2test);
sele.Start();
sele.Open();
...
It seems that I have to install Selenium Server before I create an ISelenium object.
My case is that I'm trying to build a .exe application with C# + Selenium which can run on different PCs, and it's impossible to install Selenium Server on all of them (you never know which one is the next to run the app).
Does anyone know how to use ISelenium/DefaultSelenium without installing the server?
Thanks!
There are some solutions in Java that don't use the RC Server:
1) For the selenium browser startup:
DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setBrowserName("safari");
CommandExecutor executor = new SeleneseCommandExecutor(new URL("http://localhost:4444/"), new URL("http://www.google.com/"), capabilities);
WebDriver driver = new RemoteWebDriver(executor, capabilities);
2) For selenium commands:
// You may use any WebDriver implementation. Firefox is used here as an example
WebDriver driver = new FirefoxDriver();
// A "base url", used by selenium to resolve relative URLs
String baseUrl = "http://www.google.com";
// Create the Selenium implementation
Selenium selenium = new WebDriverBackedSelenium(driver, baseUrl);
// Perform actions with selenium
selenium.open("http://www.google.com");
selenium.type("name=q", "cheese");
selenium.click("name=btnG");
// Get the underlying WebDriver implementation back. This will refer to the
// same WebDriver instance as the "driver" variable above.
WebDriver driverInstance = ((WebDriverBackedSelenium) selenium).getWrappedDriver();
//Finally, close the browser. Call stop on the WebDriverBackedSelenium instance
//instead of calling driver.quit(). Otherwise, the JVM will continue running after
//the browser has been closed.
selenium.stop();
Described here: http://seleniumhq.org/docs/03_webdriver.html
Google for something similar in C#.
There's no other way to achieve that.
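For the C# side, older Selenium .NET bindings shipped a WebDriverBackedSelenium class mirroring the Java one; a hedged sketch (class and namespace availability depend on your Selenium version):

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using Selenium; // ISelenium / WebDriverBackedSelenium in the older bindings

// Mirror of the Java example: wrap a WebDriver in the Selenium RC API
// without running a standalone Selenium Server.
IWebDriver driver = new FirefoxDriver();
ISelenium selenium = new WebDriverBackedSelenium(driver, "http://www.google.com");
selenium.Start();
selenium.Open("http://www.google.com");
selenium.Type("name=q", "cheese");
selenium.Click("name=btnG");
// Stop() shuts down both the wrapper and the underlying browser,
// so there is no need to call driver.Quit() separately.
selenium.Stop();
```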