I am trying to use Selenium and ChromeDriver to get the complete DOM of a page as quickly as possible. When my important Ajax requests have finished, I inject a class into the DOM and use a WebDriverWait to wait for that class before continuing.
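The wait itself is nothing special; roughly this (a sketch, with "dom-ready" standing in for the marker class my code injects, and WebDriverWait coming from the Selenium.Support package):

using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
// Block until the Ajax-complete hook has tagged the body element.
wait.Until(d => d.FindElements(By.CssSelector("body.dom-ready")).Count > 0);
string dom = driver.PageSource;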
When I test the responses from the API (the Ajax calls) directly, they are very consistent. I have also removed any requests to ad servers or anything outside the website and API. I have inspected the requests from server logs and Wireshark, and again they are very consistent.
The time it takes for ChromeDriver to get the full DOM varies wildly. Below are the arguments and switches I am using. I am getting anything from 700 ms to 4 seconds to get the full DOM for the same page. Are there switches here that are impeding ChromeDriver? What should I use if I simply want the DOM, nothing else? How can I optimise for speed?
Using Selenium.WebDriver.3.5.0 and Chrome 60.
chromeOptions.AddArguments("headless", "disable-gpu", "renderer");
chromeOptions.AddArgument("disable-translate");
chromeOptions.AddArgument("no-default-browser-check");
chromeOptions.AddArgument("site-per-process");
chromeOptions.AddArgument("disable-3d-apis");
chromeOptions.AddArgument("disable-background-mode");
chromeOptions.AddArgument("site-per-process");
chromeOptions.AddUserProfilePreference("profile.managed_default_content_settings.images", 2);
chromeOptions.AddUserProfilePreference("credentials_enable_service", false);
Thanks
My use case was to use Chrome WebDriver to generate a full DOM to be sent to crawlers, as a replacement for prerender.io.
I have seen some other developers want to try this too.
My advice is not to do this. It isn't designed for it, and in my experience there is nothing you can do to make the response times more reliable.
If you are using Angular, use Angular Universal (https://universal.angular.io/); we tested this and it works great.
For other JS frameworks, https://prerender.io is still an option.
Thanks.
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some JavaScript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server returns - the same as a web browser would. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you want to run the scripts, you need a web browser.

The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, minus the rendering. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
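To make the WebBrowser route concrete, here is a minimal WinForms sketch (the URL is a placeholder; note that DocumentCompleted fires when the initial load finishes, so pages that keep fetching data afterwards may need an additional wait):

using System.Windows.Forms;

var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (s, e) =>
{
    // The page's scripts have had a chance to run; read the live DOM.
    string renderedHtml = browser.DocumentText;
};
browser.Navigate("http://example.com/page-with-scripts");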
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well but has no x64 support and is not thread safe. I'm using it to scan some websites 24x7, and it runs fine for at least a couple of days in a row, but then it usually crashes.
There is a website named "www.localbanya.com" and I want to grab the HTML information from it. They list products, and the structure of their display is:
First it displays around 8-10 products on page load, and
later, when the user scrolls down, it generates more products.
As this happens via JavaScript, I am not able to get the whole page source using WebClient.
I want to know whether there is any way I can update the page source while using the WebClient class in .NET to retrieve the whole page, or any other alternative I can use to get the whole page HTML at once.
You can refer to the localbanya product page for reference.
Any help will be appreciated.
WebClient obviously doesn't run the JavaScript, so you are going to need some sort of headless browser to do it.
There are many options for this, though I don't know of any C# or .NET implementation.
You may look into PhantomJS and other headless browsers, which replicate what a normal browser does and which you can write scripts for.
Also refer to this question
Headless browser for C# (.NET)?
You can also run something like Fiddler to see what requests are made from the page when scrolling down, reverse engineer how the data is retrieved, and replicate that with a WebClient if possible.
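For example, if Fiddler shows the scroll triggering a request like the one below, you can call it directly (a sketch; the URL and parameters are invented - substitute whatever you actually see in the capture):

using System.Net;

using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
    // e.g. the second "page" of products, as captured from the scroll event
    string fragment = client.DownloadString(
        "http://www.localbanya.com/products/load?page=2");
    // Parse the returned HTML/JSON fragment instead of the full page.
}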
Hope this helps.
Could anyone please advise me on the best framework/library for web browser automation? The task is to open a web browser page, sign in, perform some long searches, and save the gathered information to Excel. Right now I'm using IE references in C#, but at work I can only use IE8. When I upgraded to IE9, some scripts on the target sites started producing errors.
I tried to use Awesomium, but as far as I understand I couldn't parse the page with it. Are there any ways to do this at high speed? The size of the libraries doesn't matter.
If there are any solutions compatible with Scala it would be great.
As om-nom-nom hinted already, your best bet is probably a WebDriver implementation like Selenium WebDriver. It has bindings for C# and Java and can drive IE, Firefox, Chrome, PhantomJS (great if you want to go headless), and others.
Note that it might not be the best idea to also gather the information directly with the WebDriver, especially if the site content changes fast. In such cases it can be useful to save the HTML page source with WebDriver and then switch to a more efficient library for static content, like jsoup.
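A minimal sketch with the C# bindings (the URL, element IDs, and credentials are placeholders, not a real site):

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://example.com/login");
    driver.FindElement(By.Id("username")).SendKeys("user");
    driver.FindElement(By.Id("password")).SendKeys("secret");
    driver.FindElement(By.Id("login-button")).Click();

    // Save the rendered HTML and hand it to a static parser for the
    // actual extraction, per the note above.
    string html = driver.PageSource;
}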
I'm trying to scrape a page. Everything is OK, but when values are updated, the source code of the page stays the same for a minute. Even when I refresh the page on a slow internet connection, I first see old data, and only after the page fully loads are the values current.
I guess JavaScript updates them. But it still has to download them somehow.
How can I get the current values?
I'm writing my program in C#, but if you have ideas/advice/examples, the language doesn't really matter.
Thank you.
You're right - JavaScript is probably updating the data after load.
I can think of three ways to handle this:
Use a WebBrowser control - I guess you're using the HttpWebRequest object to retrieve values from the site. That won't work if you need to let the JavaScript run. You can use the WebBrowser control, let the JavaScript run, and retrieve values from the DOM. The only thing I don't like about this approach is that it feels like a hack and is probably too clunky for production applications. You also need to know when to read the contents of the DOM (an update might be in progress in the background). Google "C# WebBrowser Control Read DOM Programmatically" or you can read more about that here.
Reverse engineer the background requests - I personally prefer this over the previous approach, but it doesn't work all the time. First, inspect the website with Firebug or similar and see which URLs are called in the background. Say, for example, the site updates stock quotes using JavaScript. Most likely it uses an asynchronous request to retrieve the updated information from a web service. In Firebug, you can view this under Net > XHR. Now comes the hard part: look at the request and the values returned. The idea is that you can retrieve the values yourself and parse the contents (see the sketch after this list), which can be a lot easier than scraping a page. The problem is that you need to do a bit of reverse engineering to get it right, and you might also run into problems with authentication and/or encryption.
Lastly, and my most preferred solution: ask the owner of the site you are scraping directly.
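A sketch of the second approach (the URL and response format here are invented; use whatever the Net > XHR tab actually shows):

using System.IO;
using System.Net;

var request = (HttpWebRequest)WebRequest.Create(
    "http://example.com/quotes?symbol=MSFT");
request.Accept = "application/json";

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    // Parse the JSON/XML payload here instead of scraping the page.
    string payload = reader.ReadToEnd();
}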
I think the WebBrowser control approach is probably OK and doesn't depend on third-party libraries. Here is what I intend to use; it solves the problem of waiting for the page to finish loading:
private string ReadPage(string link)
{
    this.wbrwPages.Navigate(link);

    // Spin until the WebBrowser control reports the page fully loaded.
    while (this.wbrwPages.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }

    return this.wbrwPages.DocumentText;
}
I will get information out of the HTML through some form of DOM or XPath treatment. I am curious whether others have comments about entering the 'while' loop and depending on the 'Complete' state to get out of it. I may put a timer of some sort in there as well, just to be safe.
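For instance, the loop could be guarded like this (a sketch; the 30-second limit is an arbitrary choice):

var deadline = DateTime.UtcNow.AddSeconds(30);
while (this.wbrwPages.ReadyState != WebBrowserReadyState.Complete)
{
    if (DateTime.UtcNow > deadline)
        throw new TimeoutException("Page did not finish loading in time.");
    Application.DoEvents();
}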
I have been given a task to crawl, parse, and index the available books on many library web pages. I usually use Html Agility Pack and C# to parse website content. One of them is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for * (all books), it returns many lists of books, paginated at 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to do this either, mostly due to some 404 errors I get (although I am certain that the generated links are correct).
The site relies on JavaScript to generate content and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler; then you can base your crawling on those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using it, you can monitor the requests made in the WebBrowser control and then save them for crawling later.
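Something along these lines (a rough sketch against the classic FiddlerCore API; check the exact names in the version you install):

using Fiddler;

FiddlerApplication.AfterSessionComplete += oSession =>
{
    // Log every request the WebBrowser control makes, for later replay.
    Console.WriteLine(oSession.fullUrl);
};

FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.RegisterAsSystemProxy);
// ... drive the WebBrowser control through the site here ...
FiddlerApplication.Shutdown();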
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs (Username/Password: Public). It doesn't have the request/rendering limitations that the other two have when run out of process.
This is headless, so none of the controls are rendered (which makes it faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page, you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this, so I can't say.
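A sketch of the ObjectForScripting wiring (the class and method names here are illustrative, not part of any API):

[System.Runtime.InteropServices.ComVisible(true)]
public class ScriptBridge
{
    // Page script (or script you inject) calls window.external.ReportLink(href)
    // to hand discovered links back to the host application.
    public void ReportLink(string href) { Console.WriteLine(href); }
}

webBrowser1.ObjectForScripting = new ScriptBridge();
webBrowser1.Navigate("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");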
If the site generates content with JavaScript, then you are out of luck: you need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture their output.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
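For a taste of what embedding looks like, Jint (https://github.com/sebastienros/jint) is one pure-.NET JavaScript engine. A toy example; note there is no DOM unless you build one yourself:

var engine = new Jint.Engine();
engine.Execute("var total = [1, 2, 3].reduce(function (a, b) { return a + b; });");
double total = engine.GetValue("total").AsNumber();  // 6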
AbotX does JavaScript rendering for you. It's not free, though.