I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs such as "/extra/url/to/base.html" and "#" links), but I also need to handle PHP, Javascript, etc. For some sites, the links point to PHP pages, and when my web crawler tries to navigate to these, it fails. One example is a PHP/Javascript accordion link page. How would I go about navigating/parsing these links?
Let's see if I understood your question correctly. I'm aware that this answer is probably inadequate, but if you need a more specific answer I'd need more details.
You're trying to program a web crawler but it cannot crawl URLs that end with .php?
If that's the case you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on a URI scheme.
In most cases these URLs just return normal HTML, but they could also be a generated image (like a captcha) or a download link for a 700 MB ISO file - and there's no way to be certain without checking the header of the HTTP response from that URL.
Note: If you're writing your own crawler from scratch, you're going to need a good understanding of HTTP.
The first thing your crawler is going to see when it requests a URL is the response header, which contains a MIME content type - it tells a browser/crawler how to process and open the data (is it HTML, plain text, an .exe, etc.). You'll probably want to download pages based on the MIME type instead of the URL scheme. The MIME type for HTML is text/html, and you should check for it using whatever HTTP library you're using before downloading the rest of the content of a URL.
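As a minimal sketch of that check (using HttpClient on a recent .NET; the URL is a placeholder), you can ask for the headers first and only pull the body down when the content type turns out to be text/html:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class MimeCheck
    {
        static async Task Main()
        {
            const string url = "https://example.com/page.php"; // placeholder URL

            using var client = new HttpClient();

            // Read only the headers first; the body is not downloaded yet.
            using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
            string mediaType = response.Content.Headers.ContentType?.MediaType;

            if (mediaType == "text/html")
            {
                // It's HTML, so now it's worth downloading and parsing the body.
                string html = await response.Content.ReadAsStringAsync();
                Console.WriteLine($"Got {html.Length} characters of HTML from {url}");
            }
            else
            {
                Console.WriteLine($"Skipping {url}: content type is {mediaType ?? "unknown"}");
            }
        }
    }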
The JavaScript problem
Same as above, except that running JavaScript in the crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need JavaScript?
A different solution
If you're willing to learn Python (or already know it) I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use and a lot of problems have already been solved so it could be a good starting point if you're trying to learn more about the technology.
Related
So I am just beginning to learn C#, and one of my main goals is to be able to 'navigate' a website. I have done minimal research and have found that the two primary ways to do this would be HttpClient and Requests, and I would like to learn this through HttpClient.
Now what I mean by navigate is to essentially bot a website for practice. This is like clicking buttons, putting text into fields, etc.
If anyone can give me an idea on where to start with this it would be much appreciated! Not looking for code specifically, just looking for what I should learn in HTTPClient to make this happen. Thanks!
I think that you are a little confused about the concepts. HttpClient sends requests to a site, but you cannot click buttons or "navigate" inside the site with it.
If you're looking for a way to test a site, I recommend you learn about cypress.io. You can add text to textboxes, click buttons, or navigate in any site, all with a few lines of JavaScript code. It's free.
Otherwise, if you need to save values to a database depending on your "navigation", you have to research scraping tools. I recommend Selenium or any other scraping tool.
Usually HttpClient is used when you have to consume a REST API.
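To make that concrete, here is a minimal, hypothetical HttpClient sketch of what "navigating" looks like at the HTTP level - a GET for a page and a POST of form fields. The URL and field names are placeholders:

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;

    class HttpClientBasics
    {
        static async Task Main()
        {
            using var client = new HttpClient();

            // GET a page - what the browser does when you type in a URL.
            string html = await client.GetStringAsync("https://example.com/");
            Console.WriteLine($"Downloaded {html.Length} characters");

            // POST form fields - what the browser does when you submit a form.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["username"] = "demo",     // placeholder field names and values
                ["password"] = "secret"
            });
            using var response = await client.PostAsync("https://example.com/login", form);
            Console.WriteLine(response.StatusCode);
        }
    }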
Basically you have to think about how a program could 'see' a website. You cannot expect to say to the HttpClient: 'Open page www.google.com and search for something.' If you want to do this programmatically, you have to specify exactly what your program should do.
For your purpose I recommend the HTML Agility Pack. It can be used to get the navigation elements of an HTML document. This way you can parse the HTML delivered by a website into your program and do further work with it.
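A minimal sketch of that idea (the URL is a placeholder): load a page and pull out its anchor elements with the HTML Agility Pack.

    using System;
    using HtmlAgilityPack;

    class NavigationLinks
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("https://example.com/"); // placeholder URL

            // Every anchor that actually has an href attribute.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors != null) // SelectNodes returns null when nothing matches
            {
                foreach (var a in anchors)
                    Console.WriteLine($"{a.InnerText.Trim()} -> {a.GetAttributeValue("href", "")}");
            }
        }
    }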
Kind regards :)
I want to "simulate" navigation through a website and parse the responses.
I just want to make sure I am doing something reasonable before I start. I saw two options to do so:
Using the WebBrowser class.
Using the HttpWebRequest class.
So my initial thought was to use HttpWebRequest and just parse the response.
What do you guys think?
Also wanted to ask: I use C# because it's my strongest language, but what languages are commonly used for this kind of mining from websites?
If you start doing it manually, you will probably end up hard-coding lots of cases. Try the Html Agility Pack or something else that supports XPath expressions.
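For example, a small sketch with the Html Agility Pack and an XPath query (the file name and selector are made up):

    using System;
    using HtmlAgilityPack;

    class XPathExample
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.Load("page.html"); // placeholder: a previously downloaded page

            // For example: every cell of the table rows marked with class="product".
            var cells = doc.DocumentNode.SelectNodes("//tr[@class='product']/td");
            if (cells != null)
            {
                foreach (var cell in cells)
                    Console.WriteLine(cell.InnerText.Trim());
            }
        }
    }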
There are a lot of mining and ETL tools out there for serious data mining needs.
For "user simulation" I would suggest using Selenum web driver or PhantomJS, which is much faster but has some limitations in browser emulation, while Selenium provides almost 100% browser features support.
If you're going to mine data from a website, there is something you must do first in order to be 'polite' to the websites you are mining from: you have to obey the rules set in that website's robots.txt, which is almost always located at www.example.com/robots.txt.
Then use HTML Agility Pack to traverse the website.
Or convert the HTML document to XHTML using html2xhtml, then use an XML parser to traverse the website.
Remember to:
Check for duplicate pages (the general idea is to hash the HTML doc at each URL; look up (super)shingles)
Respect the robots.txt
Get the absolute URL from each page
Filter duplicate URLs from your queue
Keep track of the URLs you have visited (i.e. with a timestamp)
Parse your HTML doc and keep your queue updated.
Keywords: robots.txt, absolute URL, HTML parser, URL normalization, Mercator scheme.
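As a rough, hypothetical sketch of the queue bookkeeping above (absolute URLs, duplicate filtering, visit timestamps) - fetching, content hashing and robots.txt handling are deliberately left out:

    using System;
    using System.Collections.Generic;

    class CrawlFrontier
    {
        private readonly Queue<Uri> _queue = new Queue<Uri>();
        private readonly Dictionary<string, DateTime> _seen = new Dictionary<string, DateTime>();

        // Resolve a link found on pageUri and queue it if we haven't seen it before.
        public void Add(Uri pageUri, string href)
        {
            // Get the absolute URL (handles "/relative/path.html" and "#fragment" links).
            if (!Uri.TryCreate(pageUri, href, out Uri absolute))
                return;

            // Crude normalization: drop the fragment so "page#a" and "page#b" collapse.
            string key = absolute.GetLeftPart(UriPartial.Query);

            // Filter duplicate URLs / anything already seen.
            if (_seen.ContainsKey(key))
                return;

            _seen[key] = DateTime.UtcNow; // timestamp of when the URL was first seen
            _queue.Enqueue(new Uri(key));
        }

        public bool TryGetNext(out Uri next)
        {
            if (_queue.Count > 0) { next = _queue.Dequeue(); return true; }
            next = null;
            return false;
        }
    }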
Have fun.
I'm working on an existing large site that uses querystring IDs for the different sections (representing physical stores) of the website.
I'd like to be able to implement pathinfo requests for SEO purposes, so I'm looking at URLs like:
http://www.domain.com/cooking-classes.aspx?ID=5 (where 5 would be the ID of the local store)
Is there a way to make this type of URL work?
http://www.domain.com/cooking-classes.aspx?ID=5/chocolate? I can get the content to work without the querystring; however, the existing infrastructure needs the ID to run. I tried:
http://www.domain.com/cooking-classes.aspx/chocolate?ID=5 however the ID comes back incorrectly.
Using http://www.domain.com/cooking-classes.aspx/5/chocolate means a rewrite of the page handling engine.
Am I clutching at straws here? No real way to get PathInfo and Querystring to play nicely with each other?
I'd like to stay away from any IIS mods as we don't have access.
Your last URL is going to yield the best result for search engines; however, you may want to drop the .aspx. You will need to write an HttpHandler or HttpModule to accomplish this. It's actually not as much work as it may seem, and you don't have to change your page at all. Your HttpHandler can do a behind-the-scenes redirect that preserves the URL. Check out this article on MSDN:
http://msdn.microsoft.com/en-us/library/ms972974.aspx
If you don't need anything super specific, you could use an existing HttpModule like the one mentioned in the post on ScottGu's blog:
http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
He mentions UrlRewriter.net which is open source:
http://urlrewriter.net/
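If you end up rolling your own, here is a hedged sketch of what such an HttpModule could look like (registered in web.config, so no IIS changes are needed). The URL pattern is only an example; it maps a friendly path back onto the existing page, its PathInfo and the ID querystring the page already expects:

    using System;
    using System.Text.RegularExpressions;
    using System.Web;

    public class StorePathRewriteModule : IHttpModule
    {
        // Example pattern only: matches /cooking-classes/5/chocolate
        private static readonly Regex Pretty =
            new Regex(@"^/cooking-classes/(\d+)/([^/]+)/?$", RegexOptions.IgnoreCase);

        public void Init(HttpApplication app)
        {
            app.BeginRequest += (sender, e) =>
            {
                HttpContext context = ((HttpApplication)sender).Context;
                Match match = Pretty.Match(context.Request.Path);
                if (match.Success)
                {
                    // Behind-the-scenes rewrite: the browser keeps the friendly URL,
                    // while the page still receives ID=5 exactly as before.
                    context.RewritePath(
                        "/cooking-classes.aspx",
                        "/" + match.Groups[2].Value,       // becomes PathInfo
                        "ID=" + match.Groups[1].Value);    // becomes the querystring
                }
            };
        }

        public void Dispose() { }
    }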
I have been given a task to crawl/parse and index the available books on many library web pages. I usually use the HTML Agility Pack and C# to parse website content. One of them is the following:
http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB
If you search for a * (all books) it will return many lists of books, paginated by 10 books per page.
Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to do this either, mostly due to some 404 errors that I get (although I am certain that the links generated are correct).
The site relies on javascript to generate content, and uses a mixed mode of GET and POST variable submission.
I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler; then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling later.
Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.
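To illustrate the replay idea, here is a hedged sketch of re-sending a captured POST with HttpWebRequest. The endpoint and form body below are placeholders; the real values come straight out of the Fiddler capture.

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class ReplayCapturedPost
    {
        static void Main()
        {
            // Placeholder endpoint and form body - copy the real ones from the Fiddler capture.
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/captured/endpoint");
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";

            byte[] body = Encoding.UTF8.GetBytes("in_language_id=en_GB&search_term=*&page=2");
            request.ContentLength = body.Length;
            using (Stream stream = request.GetRequestStream())
                stream.Write(body, 0, body.Length);

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd().Length + " characters of HTML to parse");
            }
        }
    }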
Good luck.
FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.
Using SHDocVw is faster, but is also semaphore limited.
Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs Username/Password: Public (doesn't have the request/rendering limitations that the other two have when run out of process...)
This is headless, so none of the controls are rendered. (Faster).
Thanks,
Mike
If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.
As for the links that are generated through Javascript, you might look at the ObjectForScripting property which should allow you to interface with the HTML page through Javascript. The rest then becomes a Javascript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
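As a rough sketch of the first part (reading links out of the DOM once the page, including any script-generated markup, has loaded); the URL is a placeholder and the control needs a message loop to run:

    using System;
    using System.Windows.Forms;

    class BrowserHost
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // Script-generated links are visible here too, once the page has run.
                foreach (HtmlElement a in browser.Document.Links)
                    Console.WriteLine(a.GetAttribute("href"));
                Application.ExitThread();
            };
            browser.Navigate("https://example.com/"); // placeholder URL
            Application.Run(); // message loop so the control can actually load the page
        }
    }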
If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.
Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
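For what it's worth, one embeddable engine is Jint; here's a tiny hedged sketch, assuming its SetValue/Evaluate API, of running a snippet the way a page's link-building script might:

    using System;
    using Jint;

    class JsSnippet
    {
        static void Main()
        {
            // API as I recall it from Jint's README; verify against the current version.
            var engine = new Engine();
            engine.SetValue("baseUrl", "https://example.com");

            var result = engine.Evaluate("baseUrl + '/page?id=' + (40 + 2)");
            Console.WriteLine(result.ToObject()); // https://example.com/page?id=42
        }
    }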
AbotX does JavaScript rendering for you. It's not free though.
I need to write C# code for grabbing the contents of a web page. The steps look like the following:
Browse to login page
I have a user name and a password; provide them programmatically and log in
Then you are on the detail page
You have to get some information there, like (product ID, description, etc.)
Then you need to click (by code) on Detail View
Then you can get the price for that product from there.
Now it is done, so we can write a detail line into a text file like this...
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK, I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
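A hedged sketch of that login flow with WatiN (the URL and element names like "username" and "btnLogin" are made up, and the API is written from memory - check it against WatiN's documentation):

    using System;
    using WatiN.Core;

    class LoginAndRead
    {
        [STAThread] // WatiN drives Internet Explorer, which needs an STA thread
        static void Main()
        {
            using (var browser = new IE("https://example.com/login")) // placeholder URL
            {
                browser.TextField(Find.ByName("username")).TypeText("myUser");
                browser.TextField(Find.ByName("password")).TypeText("myPassword");
                browser.Button(Find.ByName("btnLogin")).Click();

                // Now on the detail page; the page title is just a stand-in for
                // reading product ID, description, price, etc. from known elements.
                Console.WriteLine(browser.Title);
            }
        }
    }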
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
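For example, a minimal sketch of that combination - the URL and XPath are placeholders:

    using System;
    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    class FetchAndParse
    {
        static void Main()
        {
            // Placeholder URL and XPath - substitute the real product page and markup.
            var request = (HttpWebRequest)WebRequest.Create("https://example.com/products/42");
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(reader.ReadToEnd());

                var price = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
                Console.WriteLine(price?.InnerText.Trim() ?? "price element not found");
            }
        }
    }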
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with that library. The site I want to get data from has a "captcha" on the login page.
I can enter that value if it can show the image and wait for my input.
Can we achieve that with this library? If so, I would like to have a sample.
You should be able to achieve this by using two classes in C#, HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text wherein patterns exist.
How to: Send Data Using the WebRequest Class
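A small hedged sketch combining the two (the URL and the <title> pattern are only examples; real HTML usually deserves a proper parser rather than a regex):

    using System;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    class RegexScrape
    {
        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create("https://example.com/"); // placeholder
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();

                // Pull out the page title as a tiny example of pattern-based extraction.
                Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
                if (m.Success)
                    Console.WriteLine(m.Groups[1].Value);
            }
        }
    }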