I have a web crawler application. It has successfully crawled most common, simple sites. Now I have run into some types of websites where the HTML documents are generated dynamically through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not show the actual HTML page: if I browse the page in IE or Firefox, the HTML source does not match what is actually displayed in the browser. These sites contain textboxes, checkboxes, etc., so I believe they are what are called "Web Forms". I am not very familiar with web development, so correct me if I'm wrong.
My question is: has anyone been in a similar situation and successfully solved these types of "challenges"? Does anyone know of a book or article about web crawling, specifically one that covers these more advanced types of websites?
Thanks.
There are two separate issues here.
Forms
As a rule of thumb, crawlers do not touch forms.
It might be appropriate to write something for a specific website, that submits predetermined (or semi-random) data (particularly when writing automated tests for your own web applications), but generic crawlers should leave them well alone.
The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13. There may be a library for C# that will help.
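As a rough illustration of what that spec describes (sketched in Python rather than C#, purely to keep it short): by default a form is submitted as name/value pairs encoded as application/x-www-form-urlencoded. The field names and the example.com URL below are made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical form fields; a real crawler would read the names from the
# <input> elements of the form it is targeting.
form_fields = {"query": "web crawler", "lang": "en", "page": "2"}

# application/x-www-form-urlencoded: name=value pairs joined by '&',
# with spaces sent as '+' and reserved characters percent-escaped.
body = urlencode(form_fields)
print(body)

# For a GET form the encoded string is appended to the action URL after '?';
# for a POST form the same string is sent as the request body instead.
get_url = "http://example.com/search?" + body
print(get_url)
```

Whether a given site accepts such a request without cookies or hidden fields is something you would have to check per site.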
JavaScript
JavaScript is a rather complicated beast.
There are three common ways you can deal with it:
Write your crawler so it duplicates the JS functionality of specific websites that you care about.
Automate a web browser
Use something like Rhino with env.js
I found an article that tackles the deep web. It's very interesting, and I think it answers my questions above.
http://www.trycatchfail.com/2008/11/10/creating-a-deep-web-crawler-with-net-background/
Gotta love this.
AbotX handles JavaScript out of the box. It's not free, though.
Related
I want to create an application that basically searches for something, with some filters, across various websites (I don't need to log in to those third-party websites, so the data is publicly available) and shows the results in my application. I have a few questions:
1. Is it legal?
2. Is this web scraping or a meta search engine?
3. Can I get more information (any web links/articles) to learn more about it? How can it be achieved technically? One way I know of is the XPath technique for scraping, but I am wondering if there are more ways.
I am NOT asking for the entire code. Just how to start / Any guidance?
Thank You in Advance !
Firstly you need to understand how search engines work!
Search engines like Google have special programs, called "spiders", designed to mine information from the web. What a spider basically does is crawl over web pages and find the information that matches a search query. However, that's a really complex thing to work on! It takes really good coding and algorithm expertise to develop a spider yourself. If you can master that you'll be earning a tidy sum of money, but that's really rare unless you're a genius!
So I am just beginning to learn C#, and one of my main goals is to be able to 'navigate' a website. I have done minimal research and have found that the two primary ways to do this would be HTTPClient and Requests, and I would like to learn this through HTTPClient.
Now what I mean by navigate is to essentially bot a website for practice. This is like clicking buttons, putting text into fields, etc.
If anyone can give me an idea on where to start with this it would be much appreciated! Not looking for code specifically, just looking for what I should learn in HTTPClient to make this happen. Thanks!
I think that you are a little confused about the concepts. HTTPClient sends requests to a site, but you cannot click buttons or "navigate" inside the site with it.
If you're looking for a way to test a site, I recommend you learn about cypress.io. You can add text to textboxes, click buttons or navigate in any site, all with a few lines of JavaScript. It's free.
Otherwise, if you need to save values to a database depending on your "navigation", you have to research scraping tools. I recommend Selenium or any other scraping tool.
Usually HTTPClient is used when you have to consume a REST API.
Basically you have to think about how a program could ‘see’ a website. You cannot expect to say to the HTTPClient: ‘Open page www.google.com and search for something.’ If you want to do this programmatically you have to exactly specify what your program should do.
For your purpose I recommend the HTML Agility Pack. It can be used to get the navigation elements of an HTML document. This way you can parse the HTML delivered by a website into your program and do further work with it.
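To make "parse the HTML and get the navigation elements" a bit more concrete, here is a minimal sketch of the idea. It uses Python's standard-library parser as a stand-in, not the HTML Agility Pack API; the class name, helper function, and sample markup are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs the program can request.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Sample markup standing in for a page that was already downloaded.
sample = '<ul><li><a href="/about">About</a></li><li><a href="news.html">News</a></li></ul>'
links = extract_links(sample, "http://example.com/index.html")
print(links)
```

The program "navigates" by requesting one of the collected URLs next, which is the part an HTTP client handles.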
Kind regards :)
I'm trying to create a WPF application, a movie library, because I would like to manage and sort my movies with a pretty interface.
I'd like to build the library by getting information about all my movies from the web, but I don't really know how.
I thought of getting the information from a website, for example IMDb, but I don't know if it's legal to capture the HTML of a page to extract the information in it.
It's my first desktop application, and I would also like to know if it is necessary to create a database within the project and then create a setup project with a specific script to deploy it.
Sorry for the confusion, but there are too many things I would like to know :)
Thanks a lot in advance.
The legality of web scraping is a grey area. See my question, "Legality of Web Scraping vs Normal Use" and the corresponding answers for some insight.
Even if the legality is not a problem, web scraping is a flimsy approach because the webpage structure may change without notice, making your application suddenly useless until you update it to the new format. You are much better off using some sort of web API (if the site providing the information offers it).
Whether you need a database or not depends entirely on what your application will be doing and how you design it - it's not something any of us can tell you.
Same goes for the setup project - in fact I wouldn't worry about that until you actually have a working application. Take it step by step and keep the scope within control.
Yes, I did not think about an API.
It's a great idea; maybe I'll use "themoviedb".
But if I create an application based on it that has to show all the movies stored on your HDD and get, for example, the posters, the description and the ranking, do I have to create a database, in your opinion?
Thanks a lot.
There are several websites that use AJAX to update their contents periodically, and I would like to monitor them. That's why it is necessary to keep multiple webpage windows always open and to grab the page sources periodically.
I am searching for an approach for getting HTML sources from these webpages! Could you recommend something? I need it for statistical analysis.
Here are my thoughts so far:
Approach 1. Opening separate Chrome windows manually and using handles to find the windows. The problem is that it is nearly impossible to grab the HTML of the webpage (except the rich text).
Approach 2. Writing an extension for Chrome/Firefox and a C# program. The program will send requests to the extension, and the extension will return the HTML contents of the webpage. That's the theory. Googling didn't raise my hopes, so I am not sure if that is possible...
Approach 3. The most realistic one: using an embedded browser such as CefSharp, Awesomium, etc. But as I mentioned, it has to support multiple open windows! Any problems here?
So, these are my thoughts after hours of study.
Personally I would love to implement approach 2 because it is the most awesome.. but others will do too. What would be the easiest and most bulletproof?
Additionally, I would love a feature to do some input operations in these windows, e.g. login/navigate.
If IE is an option, look at implementing a managed add-on that will give you notifications when a document is loaded, access to the live DOM of the document, possibly notifications when the DOM changes, and so on. The same can be done in FF/Chrome. With IE, look into the IObjectWithSite COM interface. This article seems to be a decent tutorial, though I'm not vouching for its accuracy.
I have been given the task of crawling, parsing and indexing the available books on many library web pages. I usually use HTML Agility Pack and C# to parse website content. One of the sites is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to do this either, mostly due to some 404 errors that I get (although I am certain that the generated links are correct).
The site relies on javascript to generate content, and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
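A minimal sketch of what "basing your crawling off of those requests" could look like once you have seen a request in Fiddler. It is written in Python for brevity, not C#, and the URL, the `q` and `page` parameter names, and the helper function are placeholders, not the real parameters of the site in question. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_page_request(url, params, page):
    """Rebuild an observed POST request, varying only the page number."""
    # Copy the captured parameters and override the (hypothetical) page field.
    fields = dict(params, page=str(page))
    data = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values standing in for whatever the captured request contained.
req = build_page_request("http://example.com/search", {"q": "*"}, 3)
print(req.get_method(), req.full_url, req.data)
```

Passing the result to `urllib.request.urlopen` would actually send it; cookies or session tokens that the site expects would also have to be copied from the captured traffic.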
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through Javascript, you might look at the ObjectForScripting property which should allow you to interface with the HTML page through Javascript. The rest then becomes a Javascript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
AbotX does JavaScript rendering for you. It's not free, though.