Get HTML page sources from multiple sites - C#

There are several websites that use AJAX to update their contents periodically, and I would like to monitor them. That's why I need to keep multiple web page windows open at all times and grab their page sources periodically.
I am searching for an approach to get the HTML sources of these web pages. Could you recommend something? I need it for statistical analysis.
Here are my thoughts so far:
Approach 1: Open separate Chrome windows manually and use window handles to find them. The problem is that it is nearly impossible to grab the HTML of the web page this way (only the rich text).
Approach 2: Write an extension for Chrome/Firefox plus a C# program. The program would send requests to the extension, and the extension would return the HTML contents of the web page. That's the theory; my searches didn't raise my hopes, so I am not sure it is even possible.
Approach 3: The most realistic one. Use an embedded browser such as CefSharp, Awesomium, etc. But as I mentioned, it has to support multiple open windows. Any problems here? (See the CefSharp sketch after the answer below.)
So, these are my thoughts after hours of study.
Personally, I would love to implement approach 2 because it seems the most elegant, but the others would do too. Which would be the easiest and most bulletproof?
Additionally, I would love to be able to perform input operations in these windows, e.g. logging in and navigating.

If the IE browser is an option, look at implementing a managed add-on that lets you hook into notifications when a document is loaded, access the live DOM of the document, and possibly receive notifications when the DOM changes. The same can be done in Firefox/Chrome. With IE, look into the IObjectWithSite COM interface. This article seems to be a decent tutorial, though I'm not vouching for its accuracy.
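For approach 3, here is a minimal sketch of grabbing a page's source with CefSharp's offscreen browser (assuming the CefSharp.OffScreen NuGet package; the URL is a placeholder). Each monitored site gets its own ChromiumWebBrowser instance, so multiple "windows" are not a problem:

// Minimal sketch (assumes the CefSharp.OffScreen NuGet package).
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

class SourceGrabber
{
    static async Task Main()
    {
        Cef.Initialize(new CefSettings());

        // One ChromiumWebBrowser per monitored site; the URL is a placeholder.
        var browser = new ChromiumWebBrowser("https://example.com");

        // Wait until the initial load (and its scripts) have finished.
        var loaded = new TaskCompletionSource<bool>();
        browser.LoadingStateChanged += (s, e) =>
        {
            if (!e.IsLoading) loaded.TrySetResult(true);
        };
        await loaded.Task;

        // GetSourceAsync serializes the current DOM, so AJAX-injected
        // content is included, not just the original server response.
        string html = await browser.GetSourceAsync();
        Console.WriteLine(html.Length);

        Cef.Shutdown();
    }
}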

Related

Full browser simulation with DOM tree

I want to simulate a browser fully and programmatically, without an interface, and I need access to every aspect of it, such as the DOM tree, JS execution, and so on.
I've read the PhantomJS and CasperJS documentation, and it seems they don't support the DOM tree.
What do you recommend?
As others have said, CasperJS handles the DOM tree any way you would like: you can list all the anchors, titles, headers, and so on. In the old days you would just use something like Python's BeautifulSoup or Perl's HTML::TokeParser, but neither of those stands up to the heavily AJAX-driven sites we see nowadays.
I would check out the CasperJS documentation. It's one of the best tools for scraping the modern web.

Multi-Select File Uploader

I'm trying to come up with a viable (and the simpler the better) solution for a multi-select file upload control. Normally this would be a breeze except for a few things...
The user needs to be able to literally select multiple files in the dialog, NOT one by one.
Can't use open-source code. (But JavaScript/jQuery is OK.)
Can't use a third-party library that Microsoft doesn't support.
(Please don't bother with "Why can't you?" comments.)
I don't have a lot of experience making my own controls. (And I'd assume that if there were a simple way to do this just by modifying the "Open" control, it would be an easily found tutorial.)
Thanks.
EDIT: To answer some questions...
I haven't tried much of anything outside of researching. Not really sure of where to start with all these limitations.
I can't use HTML5. In fact, I need IE7 compatibility. So no multiple attribute.
How about Telerik's multi-file upload? I believe they are an MS certified partner.
-J
If you want to make a customized multi-file uploader control yourself, you have to build a rich file picker in JavaScript on the client side and then upload the files using AJAX. I think all multi-file upload components use this method. If you can't use open-source or third-party components, it seems you have to build it from scratch.
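Whichever client-side picker you end up with, the ASP.NET side can receive several files from a single request; here is a minimal sketch (the handler name and save folder are hypothetical):

// Minimal sketch: an ASP.NET handler accepting several uploaded files.
// The handler name and the save folder are hypothetical.
using System.IO;
using System.Web;

public class MultiUploadHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Request.Files holds every file in the multipart POST,
        // regardless of how the client-side picker was built.
        for (int i = 0; i < context.Request.Files.Count; i++)
        {
            HttpPostedFile file = context.Request.Files[i];
            string target = Path.Combine(@"C:\uploads", Path.GetFileName(file.FileName));
            file.SaveAs(target);
        }
        context.Response.Write("OK");
    }

    public bool IsReusable
    {
        get { return false; }
    }
}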

Web page crawling in C#

I have been given a task to crawl/parse and index the available books on several library web pages. I usually use HTML Agility Pack and C# to parse website content. One of the pages is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for * (all books), it returns many lists of books, paginated at 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to get this to work either, mostly due to 404 errors (although I am certain that the generated links are correct).
The site relies on JavaScript to generate content and uses a mix of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler, and then you can base your crawling on those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling later.
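A minimal sketch of that idea (assuming the FiddlerCore package; the port number is arbitrary):

// Minimal sketch (assumes the FiddlerCore package; the port is arbitrary).
using System;
using Fiddler;

class RequestLogger
{
    static void Main()
    {
        // Fires once per completed request/response pair.
        FiddlerApplication.AfterSessionComplete += session =>
        {
            Console.WriteLine("{0} {1}", session.RequestMethod, session.fullUrl);
            // session.GetRequestBodyAsString() holds any POST data,
            // which you can save and replay from your crawler later.
        };

        FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default);
        Console.WriteLine("Proxying on port 8877; press Enter to stop.");
        Console.ReadLine();
        FiddlerApplication.Shutdown();
    }
}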
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page, you should be able to access the DOM through its HtmlDocument. That would work for the HTML links.
As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this, so I can't say.
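A minimal sketch of the HtmlDocument idea (WinForms; the URL is the library page from the question):

// Minimal sketch: enumerating links via the WebBrowser control's DOM.
using System;
using System.Windows.Forms;

class CrawlerForm : Form
{
    public CrawlerForm()
    {
        var browser = new WebBrowser();
        Controls.Add(browser);

        // DocumentCompleted fires once the document has finished loading.
        browser.DocumentCompleted += (s, e) =>
        {
            foreach (HtmlElement link in browser.Document.Links)
                Console.WriteLine(link.GetAttribute("href"));
        };

        browser.Navigate("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new CrawlerForm());
    }
}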
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
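Jint is one such engine. A minimal sketch of raw script evaluation (assuming the Jint NuGet package; note that a bare engine has no DOM, which is where the "serious" effort comes in):

// Minimal sketch (assumes the Jint NuGet package).
using System;
using Jint;

class ScriptDemo
{
    static void Main()
    {
        var engine = new Engine();

        // Evaluate runs the script and returns its completion value.
        var result = engine.Evaluate("var x = 6 * 7; x;");
        Console.WriteLine(result.AsNumber()); // 42

        // A real page's scripts also expect document/window objects,
        // which you would have to provide yourself.
    }
}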
AbotX does JavaScript rendering for you. It's not free, though.

Programmatically changing the destination printer for a WinForms WebBrowser control

I'm trying to use an invisible WebBrowser control to print a very simple HTML document. Our application requires that we be able to print several documents this way, and that they can all be sent to different printers. Unfortunately, I haven't been very successful in making the output go to the right printer.
The way it works right now is that before printing a document, the application determines which printer is to receive it, and sets the default printer accordingly. To do this it uses SetDefaultPrinter() imported from WinSpool.drv. If I step the code in debug mode I can clearly see that the default printer changes (and this change is reflected in the control panel UI), but the WebBrowser still insists on using the original default printer.
The MSDN documentation, from what I've seen, doesn't really provide a solution for this scenario. I would greatly appreciate some input on how I can accomplish this programmatically.
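For reference, the P/Invoke declaration described above looks roughly like this (a sketch; the question's actual code is not shown):

// Sketch of the SetDefaultPrinter P/Invoke the question describes.
using System.Runtime.InteropServices;

static class NativeMethods
{
    [DllImport("winspool.drv", CharSet = CharSet.Auto, SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool SetDefaultPrinter(string printerName);
}

// Usage: NativeMethods.SetDefaultPrinter("My Label Printer");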
Given what you've said, perhaps you could restart the process that contains the web browser control (or the process that is the web browser control) after you change the default printer? That's the kind of thing I see happening here, for example.
I suppose it would be possible to fork off a background process that does the actual printing, but I'm really hoping for a simpler solution.
Forking was my first thought towards what is probably the simplest solution.
Some other alternatives are as follows.
1). IE, which the WebBrowser control wraps, exposes APIs via ActiveX. One of those APIs might let you specify the destination printer.
2). Some executables (I don't know about IE) have printto entries in the registry. For example, Acrobat Reader has an entry whose value is as follows:
""C:\Program Files\Adobe\Reader 9.0\Reader\AcroRd32.exe"" /t "%1" "%2" "%3" "%4"
That specifies the syntax of the command line you can use to print and name a (non-default) printer. You can also Google for printto; see e.g. PrintTo command in the ShellExecute. A C# sketch of invoking the printto verb appears after this list.
3). I have implemented my own HTML control for .NET, which doesn't depend on IE. You say that your HTML (and CSS, I presume) is simple, so perhaps it could render it, either out of the box or with only a little extra development effort. It doesn't support printing yet, but printing is quite easy for a user control to implement. Getting me to implement that for you would cost you several hundred, but, who knows, maybe it's worth it to you. It would be quite a lightweight solution, and perhaps well supported. You could email me if you want to discuss that further.
4). You might also find other controls similar to mine, more or less well-known/expensive, or other applications, e.g. OpenOffice, etc.
5). You could consider converting the HTML (somehow) to another format (e.g. PDF) for which you have an application with better printing support.
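Here is a sketch of option 2 from C#, using the printto verb via ShellExecute (the file path and printer name are hypothetical, and the file type's registered application must provide a printto entry):

// Sketch: invoking a file's "printto" verb (option 2 above).
// The path and printer name are hypothetical.
using System.Diagnostics;

class PrintToExample
{
    static void Main()
    {
        var psi = new ProcessStartInfo(@"C:\temp\page.html")
        {
            Verb = "printto",
            Arguments = "\"My Other Printer\"",
            CreateNoWindow = true,
            UseShellExecute = true // verbs require ShellExecute
        };
        Process.Start(psi);
    }
}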
I've had the exact same problem, and incorporated this control instead of the standard .NET WebBrowser to work around it.
This works on .NET 3.5, if not earlier:
this.webBrowser1.ShowPrintDialog();

Web Crawling Sites with JavaScript or Web Forms

I have a web crawler application. It successfully crawls most common and simple sites. Now I have run into some types of websites whose HTML documents are dynamically generated through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not show the actual HTML page: if I view the page's source in IE or Firefox, the HTML code does not match what is actually rendered in the browser. These sites contain textboxes, checkboxes, and so on, so I believe they are what are called "web forms". Actually, I am not very familiar with web development, so correct me if I'm wrong.
My question is: has anyone been in a similar situation and successfully solved these types of "challenges"? Does anyone know of a book or article about web crawling that covers these more advanced types of websites?
Thanks.
There are two separate issues here.
Forms
As a rule of thumb, crawlers do not touch forms.
It might be appropriate to write something for a specific website that submits predetermined (or semi-random) data (particularly when writing automated tests for your own web applications), but generic crawlers should leave them well alone.
The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13, there may be a library for C# that will help.
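If you do need to submit a specific form, here is a minimal sketch in C# (the action URL and field names are hypothetical; read the real ones from the form's HTML):

// Minimal sketch: POSTing predetermined form data.
// The action URL and the field names are hypothetical.
using System;
using System.IO;
using System.Net;
using System.Text;

class FormSubmit
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/search");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        byte[] body = Encoding.UTF8.GetBytes("query=*&page=1");
        request.ContentLength = body.Length;
        using (Stream s = request.GetRequestStream())
            s.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());
    }
}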
JavaScript
JavaScript is a rather complicated beast.
There are three common ways you can deal with it:
Write your crawler so it duplicates the JS functionality of specific websites that you care about.
Automate a web browser
Use something like Rhino with env.js
I found an article that tackles the deep web, and it's very interesting; I think it answers my questions above.
http://www.trycatchfail.com/2008/11/10/creating-a-deep-web-crawler-with-net-background/
Gotta love this.
AbotX handles JavaScript out of the box. It's not free, though.
