Web browser automation (robot) - C#

Could anyone please advise me on the best framework/library for web browser automation? The task is to open a web page in a browser, sign in, perform some long searches, and save the gathered information to Excel. Currently I'm using IE references in C#, but at work I can only use IE8. When I upgraded to IE9, some scripts on the target sites started producing errors.
I tried Awesomium, but as far as I understand, I couldn't parse the page with it. Are there any options that can do this at high speed? The size of the libraries doesn't matter.
If there are any solutions compatible with Scala, that would be great.

As om-nom-nom hinted already, your best bet is probably a WebDriver implementation like Selenium WebDriver. It has bindings for C# and Java and can drive IE, Firefox, Chrome, PhantomJS (great if you want to go headless), and others.
Note that it might not be the best idea to also gather the information directly with the WebDriver, especially if the site content changes quickly. In such cases it can be useful to save the HTML page source with WebDriver and then switch to a more efficient library for static content, like jsoup.
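For illustration, here is a minimal C# sketch of that workflow, assuming the Selenium WebDriver, ChromeDriver, and Html Agility Pack NuGet packages (Html Agility Pack being a rough C# counterpart to jsoup). The URL and XPath below are placeholders.

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using HtmlAgilityPack;

class SourceDumpExample
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            // Let the browser handle login, navigation, JavaScript, etc.
            driver.Navigate().GoToUrl("https://example.com/search?q=books");

            // Snapshot the rendered DOM as HTML...
            string html = driver.PageSource;

            // ...and parse it offline, without keeping the browser busy.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                    Console.WriteLine(link.GetAttributeValue("href", ""));
            }
        }
    }
}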

Related

Add Google Chrome extensions to your browser (CefSharp or any WebBrowser)

Is there an example of adding plugins to the Google Chrome or IE browser with support for C# or .NET? I want to run js / json / html popup files.
Microsoft just released Blazor, which allows browsers to run Razor / C# code on the client side. You could research and experiment with that.
However, you don't need that to run js / json / html popups. Use something like ASP.NET MVC to run any C# code on the server. The server generates your HTML, JavaScript, and so on; browsers know how to work with all of that without any plugins.
This site will work better when you have specific questions and you can show your code and what you have tried. For a question like this, Google will work better for you! Good luck.

Using Selenium in C# without opening a browser in Visual Studio 2013

I have been using Selenium alongside C# in Visual Studio 2013. I will make a call to:
driver.Navigate().GoToUrl("http://<insert webpage>");
...Which will open whichever WebDriver I choose to use.
From here, I will make calls to links/text boxes/menus as I need to.
However, I was wondering if there is a way to get the information from webpages without having to actually open a browser, and if so, could someone perhaps explain or link me to the right direction? It would save time and speed up a lot of my programs. I know applications can get information remotely without actually opening a browser, I just do not know how the process works or if Selenium alone will give that ability.
I apologize if this is the wrong place to ask this question.
It is not clear whether or not you need to interact with the web page (like clicking on links or editing text), but here are two options:
You can use PhantomJS. It is a headless browser, and since there is no UI, execution may be faster. There is a Selenium driver for it.
You can use Html Agility Pack to parse the page and WebClient to download it. No Selenium is required in that case. Html Agility Pack will let you make XPath queries and find elements by class name or ID. But: you won't be able to manipulate the DOM structure as you can with a real browser. It is only for parsing and navigating a static HTML page.
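As a rough sketch of the second option (assuming the HtmlAgilityPack NuGet package; the URL and XPath are placeholders), it could look something like this. Keep in mind that no JavaScript runs here, so anything the page builds client-side will be missing.

using System;
using System.Net;
using HtmlAgilityPack;

class StaticScrapeExample
{
    static void Main()
    {
        // Download the raw HTML without any browser involved.
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString("https://example.com/catalog");
        }

        // Query the static markup with XPath.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = doc.DocumentNode.SelectNodes("//*[@class='title']");
        if (titles != null)
        {
            foreach (var node in titles)
                Console.WriteLine(node.InnerText.Trim());
        }
    }
}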

c# ways to render a webpage and navigate/manipulate its DOM?

I have a good understanding of DOM + HTML etc., but I'm new to C#. What's currently the best way of downloading, then rendering (executing all JavaScript, DOM changes, etc.), and simulating user interaction with a webpage in C#?
I've seen Html Agility Pack mentioned quite a few times, but it doesn't look like it's been updated since August 2012? Has anyone used it recently and encountered any problems? Does C# have anything built in for this?
Thanks!
First of all, HtmlAgilityPack is not for simulating user interaction with a web page. HtmlAgilityPack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...).
HtmlAgilityPack does not support JavaScript. That is a very important point, because many developers run into trouble with the difference between the fully loaded page in the browser and the response returned to HtmlAgilityPack (or any other library you use to make the request).
For user interaction, fully loading the web page, and web testing, I strongly recommend Selenium; Selenium automates browsers. It has support for several programming languages (Java, C#, Ruby, Python, etc.); you can read more at the link above, which has very good documentation.
The only drawback of Selenium is that it opens a browser to do the work, but in some environments it can run a headless browser; you can read more about this in the following links:
Selenium Headless Automated Testing in Ubuntu
Headless Browser and scraping - solutions
I hope this helps you.
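As a rough illustration of the headless option, here is a minimal sketch assuming a recent Selenium setup with ChromeDriver (older environments would typically use the PhantomJS driver instead); the URL and element locators are placeholders.

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class HeadlessExample
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");   // no visible browser window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("https://example.com/login");

            // Interact with the page exactly as with a visible browser.
            driver.FindElement(By.Name("username")).SendKeys("user");
            driver.FindElement(By.Name("password")).SendKeys("secret");
            driver.FindElement(By.CssSelector("button[type='submit']")).Click();

            Console.WriteLine(driver.Title);
        }
    }
}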

get HTML page sources from multiple sites

There are several websites that use AJAX to update their contents periodically, and I would like to monitor them. That's why it is necessary to keep multiple webpage windows open at all times and to grab the page sources periodically.
I am searching for an approach for getting HTML sources from these webpages! Could you recommend something? I need it for statistical analysis.
Here are my thoughts so far:
Approach 1: Opening separate Chrome windows manually and using window handles to find them. The problem is that it is nearly impossible to grab the HTML of the webpage this way (except the rich text).
Approach 2: Writing an extension for Chrome/Firefox plus a C# program. The program would send requests to the extension, and the extension would return the HTML contents of the webpage. That's the theory; Google didn't raise my hopes much, so I am not sure whether that is possible...
Approach 3: The most realistic one. Using an embedded browser such as CefSharp, Awesomium, etc. But as I mentioned, it has to support multiple open windows! Any problems here?
So, these are my thoughts after hours of study.
Personally I would love to implement approach 2 because it is the most awesome.. but others will do too. What would be the easiest and most bulletproof?
Additionally, I would love to be able to do some input operations in these windows, e.g. login/navigate.
If the IE browser is an option, look at implementing a managed add-on that will let you hook into notifications when the document is loaded, get access to the live DOM of the document, possibly get notifications when the DOM changes, and so on. The same can be done in FF/Chrome. With IE, look into the IObjectWithSite COM interface. This article seems to be a decent tutorial, though I'm not vouching for its accuracy.

Web page crawling in C#

I have been given a task to crawl/parse and index the available books on many library web pages. I usually use Html Agility Pack and C# to parse website content. One of them is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to do this either, mostly due to some 404 errors that I get (although I am certain that the links generated are correct).
The site relies on javascript to generate content, and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
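As a rough sketch of replaying a request observed in Fiddler (the URL and form fields below are hypothetical; copy the real ones, plus any required cookies and headers, from the captured session):

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class ReplayExample
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Hypothetical form fields; take the real names and values from
            // the POST captured in Fiddler.
            var form = new NameValueCollection();
            form.Add("in_language_id", "en_GB");
            form.Add("search_term", "*");

            byte[] response = client.UploadValues(
                "https://example.com/pkg_www_misc.search", "POST", form);

            string html = Encoding.UTF8.GetString(response);
            Console.WriteLine(html.Length);
        }
    }
}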
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through Javascript, you might look at the ObjectForScripting property which should allow you to interface with the HTML page through Javascript. The rest then becomes a Javascript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
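For illustration, a minimal WinForms sketch along those lines (the URL is a placeholder and error handling is omitted):

using System;
using System.Windows.Forms;

class BrowserForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();

    public BrowserForm()
    {
        browser.Dock = DockStyle.Fill;
        browser.ScriptErrorsSuppressed = true;
        browser.DocumentCompleted += OnDocumentCompleted;
        Controls.Add(browser);
        browser.Navigate("https://example.com/catalog");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Plain HTML links are available through the HtmlDocument wrapper.
        foreach (HtmlElement link in browser.Document.Links)
            Console.WriteLine(link.GetAttribute("href"));

        // Script-generated content can be reached via InvokeScript /
        // ObjectForScripting, which is where it becomes a JavaScript problem.
    }

    [STAThread]
    static void Main() => Application.Run(new BrowserForm());
}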
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
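As a toy illustration, one such engine is Jint (just one option among those discussed in that question); the sketch below only shows executing a script in isolation, and wiring the engine up to a live DOM is where the "serious" effort comes in. The exact API varies between Jint versions.

using System;
using Jint;

class JsEngineExample
{
    static void Main()
    {
        // Expose a C# callback to the script, then run some JavaScript.
        var engine = new Engine()
            .SetValue("log", new Action<object>(Console.WriteLine));

        engine.Execute(@"
            function buildLink(id) { return '/book?id=' + id; }
            log(buildLink(42));   // prints /book?id=42
        ");
    }
}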
AbotX does JavaScript rendering for you. It's not free, though.
