Writing my first web crawler - C#

I've tried to find a good how-to or an example that is suitable for beginners writing their first web crawler. I would like to write it in C#. Does anybody have good example code to share, or tips on sites where I can find information about C# and basic web crawling?
Thanks

HtmlAgilityPack is your friend.

Yes, HtmlAgilityPack is a good tool to parse the HTML, but that is definitely not enough.
There are 3 elements to crawling:
1) Crawling itself, i.e. looping through websites: this could be done by sending requests to random IP addresses, but it does not work well because many websites share an IP address and rely on the HTTP Host header, so hitting the IP alone does not reach them. On top of that, far too many IP addresses are unused or not hosting a web server, so this approach gets you nowhere.
I suggest you send requests to Google (searching for words from a dictionary) and crawl the results that come back.
2) Rendering the content: many websites generate their HTML with JavaScript when the page loads, so a simple request will not capture the content the way a user would see it. You need to render the page as a browser does, which can be done with WebKit.NET, an open-source tool that is still in beta.
3) Comprehending and parsing the HTML: use HtmlAgilityPack; there are tons of examples online. It can be used to crawl the site as well.
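A minimal sketch of points 1) and 3) combined, assuming the HtmlAgilityPack NuGet package is installed; the start URL is a placeholder and the queueing/visited-set bookkeeping of a real crawler is left out:

using System;
using System.Net;
using HtmlAgilityPack;

class CrawlSketch
{
    static void Main()
    {
        // Download one page and pull out every link it contains.
        var client = new WebClient();
        string html = client.DownloadString("http://example.com/");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                // In a real crawler these hrefs would be queued for the next round.
                Console.WriteLine(anchor.GetAttributeValue("href", string.Empty));
            }
        }
    }
}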

A while ago I also wanted to write a custom web crawler, and found this document:
Web Crawler
It has some great info, and is very well written IMO.


Abot Web Crawler Performance

I have built a robots.txt crawler which extracts the URLs from robots.txt and then loads each page, running some post-processing once the page is done. This all happens quite fast; I can extract information from 5 pages per second.
If a website doesn't have a robots.txt, I use Abot Web Crawler instead. The problem is that Abot is far slower than the direct robots.txt crawler. When Abot hits a page with lots of links, it seems to schedule each link very slowly, with some pages taking 20+ seconds to queue everything and run the post-processing mentioned above.
I use the PoliteWebCrawler, configured not to crawl external pages. Should I instead be crawling multiple websites at once, or is there another, faster alternative to Abot?
Thanks!
Added a patch to Abot to fix issues like this one. It should be available in NuGet version 1.5.1.42. See issue #134 for more details. Can you verify this fixed your issue?
Is it possible that the site you are crawling cannot handle lots of concurrent requests? A quick test would be to open a browser and start clicking around the site while Abot is crawling it. If the browser is noticeably slower, then the server is showing signs of load.
If that is the issue, you need to slow the crawl down through the configuration settings.
If not, can you give the URL of a site or page that is being crawled slowly? Abot's full configuration would also be helpful.
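For reference, tuning the crawl speed through configuration looks roughly like this; a sketch assuming Abot 1.x, where the property values are only illustrative:

using System;
using Abot.Crawler;
using Abot.Poco;

class AbotConfigSketch
{
    static void Main()
    {
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 5,                  // lower this if the target server struggles
            MinCrawlDelayPerDomainMilliSeconds = 1000, // politeness delay between requests
            IsExternalPageCrawlingEnabled = false,     // stay on the seed site
            MaxPagesToCrawl = 1000
        };

        // The null arguments tell Abot to use its default implementations.
        var crawler = new PoliteWebCrawler(config, null, null, null, null, null, null, null, null);

        crawler.PageCrawlCompletedAsync += (sender, e) =>
            Console.WriteLine("Crawled: " + e.CrawledPage.Uri);

        crawler.Crawl(new Uri("http://example.com/"));
    }
}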

Retrieving XML and/or JSON data from another site

Good morning everyone,
I recently got a request asking whether it's possible to retrieve data from another site's search results. I tried searching, but didn't know exactly how to word my search.
Best explained by example.
Visit: https://bcbst.vitalschoice.com/professional?search_specialty_id=29&ci=DFT&geo_location=33688&network_id=39&sort=relevancy&radius=any&page=1
You'll see a list of doctors.
I'm looking for a way to programmatically get the list of doctors, such as the name, address, and phone number.
I just need some direction as I will probably be doing this for multiple sites.
I program in C# and JS.
In the case of the website you linked, it has an API available for use. What you can do is make an AJAX request (if using jQuery) or a WebRequest (if using C#) to one of the endpoints, and then convert the JSON you get back from the website into whatever structure you need.
You can test what you'll be getting back from the server by typing the URL into the browser; see the example.
As for the search parameters, you'll have to add those to the URL. I'd advise taking a look at their API to see what functions they support.
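A rough sketch of the C# route, assuming Json.NET (Newtonsoft.Json) for deserialization; the endpoint URL and the Doctor properties below are placeholders for illustration, so a real call would use whatever endpoint and fields the site's API actually exposes:

using System;
using System.Collections.Generic;
using System.Net;
using Newtonsoft.Json;

// Hypothetical shape of one search result; adjust to match the real JSON.
class Doctor
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Phone { get; set; }
}

class ApiSketch
{
    static void Main()
    {
        // Placeholder endpoint; substitute the real API URL and its query-string parameters.
        string url = "https://example.com/api/professionals?search_specialty_id=29&geo_location=33688";

        using (var client = new WebClient())
        {
            string json = client.DownloadString(url);
            var doctors = JsonConvert.DeserializeObject<List<Doctor>>(json);

            foreach (Doctor doctor in doctors)
                Console.WriteLine(doctor.Name + " - " + doctor.Phone);
        }
    }
}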
Hope this helps!

Need help building a "robot" that extracts data from an HTTP request

I am building a website in ASP.NET and C#. One of its components involves logging in, on behalf of the user, to a website where the user has an account (for example, a cellular phone company), taking information from that site, and storing it in our database.
I think this action is called "scraping".
Are there any products that already do this that I could integrate with my software?
I don't need software that does it for me; I need some sort of SDK that I can integrate with my C# code.
Thanks,
Koby
Use the HtmlAgilityPack to parse the HTML that you get from a web request once you've logged in.
See here for logging in: Login to website, via C#
I haven't found any product that does this right so far.
One way to handle this is to:
- make the requests yourself
- use http://htmlagilitypack.codeplex.com/ to extract the important information from the downloaded HTML
- save the extracted information yourself
The thing is, depending on the context there are so many things to tune/configure that you would need a very large product, and it still wouldn't reach the performance/accuracy of a custom solution:
a) multithreading control
b) extraction rules (see the sketch after this list)
c) persistence control
d) web spidering (i.e. how the next link to parse is chosen)
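To illustrate b), the extraction rules in a custom solution often boil down to a name-to-XPath map applied with HtmlAgilityPack. A minimal sketch, where the XPath expressions are made up for illustration:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class ExtractionRulesSketch
{
    // Field name -> XPath; in practice these rules differ for every target site.
    static readonly Dictionary<string, string> Rules = new Dictionary<string, string>
    {
        { "Title", "//h1" },
        { "Price", "//span[@class='price']" }
    };

    static Dictionary<string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        foreach (var rule in Rules)
        {
            // SelectSingleNode returns null when the rule matches nothing.
            HtmlNode node = doc.DocumentNode.SelectSingleNode(rule.Value);
            result[rule.Key] = node != null ? node.InnerText.Trim() : null;
        }
        return result;
    }

    static void Main()
    {
        var fields = Extract("<html><body><h1>Demo</h1><span class='price'>42</span></body></html>");
        foreach (var field in fields)
            Console.WriteLine(field.Key + ": " + field.Value);
    }
}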
Check the Web Scraping Wikipedia entry.
However, since what you need to acquire via web scraping is application-specific most of the time, it may be more efficient to scrape whatever you need directly from the web response stream.

Logging in to phpBB forum programmatically through C#

I'm working on a C# application that needs to scrape some data from a phpBB forum. The forum scraping requires logging in. The application will prompt the user for their login credentials to connect.
I've scraped websites before with C#, but what I'm not sure how to do is log in to phpBB and keep a session open for the duration of the screen scraping. I've done some searching and haven't had much luck. Is there a good way to do something like this programmatically?
You don't say what you've tried, but if you used an HttpWebRequest object to retrieve pages and/or log on, then you need to assign a new CookieContainer to the HttpWebRequest to store any cookies returned by the website. Share this container amongst your HttpWebRequest objects to remain logged in.
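A minimal sketch of that pattern; the forum URLs and form field names here are placeholders, so check the real login form with View Source or the browser's developer tools:

using System;
using System.IO;
using System.Net;
using System.Text;

class PhpBbLoginSketch
{
    static void Main()
    {
        // One CookieContainer shared by every request keeps the session alive.
        var cookies = new CookieContainer();

        // 1) POST the login form (URL and field names are placeholders).
        var loginRequest = (HttpWebRequest)WebRequest.Create("http://forum.example.com/ucp.php?mode=login");
        loginRequest.Method = "POST";
        loginRequest.ContentType = "application/x-www-form-urlencoded";
        loginRequest.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret&login=Login");
        using (Stream requestStream = loginRequest.GetRequestStream())
            requestStream.Write(body, 0, body.Length);
        using (loginRequest.GetResponse()) { }   // the response sets the session cookies

        // 2) Subsequent requests reuse the same container and stay logged in.
        var pageRequest = (HttpWebRequest)WebRequest.Create("http://forum.example.com/viewtopic.php?t=1");
        pageRequest.CookieContainer = cookies;

        using (WebResponse response = pageRequest.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string pageHtml = reader.ReadToEnd();   // hand this off to HtmlAgilityPack
            Console.WriteLine(pageHtml.Length);
        }
    }
}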
Look for the names of the username and password fields using Firebug or Chrome (or even View Source), then use my answer here: Programmatically logging into a site, replacing 'session_key' and 'session_password' as appropriate. That should work.
...and then translate it to C#!
I would recommend using the WatiN API for screen scraping. I have done screen scraping with it and it does a good job.
Check it out!
I recommend using HTML Agility Pack.

Replicate steps in downloading file

I'm trying to automate the download of a file from a website. Normally, to download the file, I log in with a username and password, navigate to a particular screen, and then click a button.
I've been trying to watch the sequence of POSTs using Chrome's developer tools and then replicate all the steps using the .NET WebClient class, but with no success. I've derived from the WebClient class and added cookie handling, which seems to be working. I go to the login page and post using WebClient.UploadValues. About half the time it seems to work. The next step appears to be another POST to a reporting URL. Once again I use WebClient.UploadValues, but the response from the server is a page showing an internal error.
I have a couple of questions.
1) Are there better tools than hand-coding C# to replicate a bunch of web browser interactions? I really only care about being able to download the file at a particular time each day onto a Windows box.
2) WebClient does not seem to be the best class to use for this; perhaps it's a bit too simplistic. I tried using HttpWebRequest, but it has no built-in facilities for encoding POST requests. Any other recommendations?
3) Although Chrome's developer tools appear to show all the interaction, I find them a bit cumbersome to use. I'd be interested in seeing all of the raw communication (unencrypted, though; the site is only accessed via HTTPS) so I can check whether I'm really replicating all of the steps.
I can even post the exact code I'm using. The site I'm pulling data from is the Standard and Poor's website. They provide the ability to create custom reports for downloading historical data, which I need for reporting, not republishing.
Using IE to download the file would be much easier than writing C# / Perl / Java code to replicate the HTTP requests.
The reason is that even a slight change in the JavaScript code can break such a flow.
With IE, you can automate it using COM. The following VBA example opens IE and performs a Google search:
Sub Search_Google()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Navigate "http://www.google.com"    'load the web page google.com

    While IE.Busy
        DoEvents    'wait until IE is done loading the page
    Wend

    IE.Document.all("q").Value = "what you want to put in the text box"
    IE.Document.all("btnG").Click    'clicks the button named "btnG", Google's "Google Search" button

    While IE.Busy
        DoEvents    'wait until IE is done loading the page
    Wend
End Sub
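Translated to C#, the same automation looks roughly like this; a sketch, assuming a COM reference to "Microsoft Internet Controls" (SHDocVw) has been added to the project, and reusing the "q" and "btnG" element names from the VBA example:

using System.Threading;

class IeAutomationSketch
{
    static void Main()
    {
        var ie = new SHDocVw.InternetExplorer { Visible = true };
        ie.Navigate("http://www.google.com");
        while (ie.Busy) Thread.Sleep(100);   // wait until IE is done loading the page

        dynamic document = ie.Document;      // late-bound access to the DOM
        document.all.item("q").value = "what you want to put in the text box";
        document.all.item("btnG").click();   // press the "Google Search" button

        while (ie.Busy) Thread.Sleep(100);
    }
}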
Regarding point 3, seeing all of the raw communication: you can use Fiddler to view all of the interaction going on and the raw data going back and forth. To make it work with HTTPS, you will need to install Fiddler's certificate to enable decryption of the traffic.
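For reference, the cookie-handling WebClient derivation mentioned in the question is usually written something like this (a sketch):

using System;
using System.Net;

// The stock WebClient discards cookies between requests, which is a common
// reason login flows only work intermittently.
public class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; private set; }

    public CookieAwareWebClient()
    {
        Cookies = new CookieContainer();
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
            httpRequest.CookieContainer = Cookies;   // every request shares one cookie jar
        return request;
    }
}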
