Abot Web Crawler Performance - C#

I have built a robots.txt crawler which extracts the URLs from robots.txt and then loads each page, running some post-processing once the page is done. This all happens quite fast, and I can extract information from 5 pages per second.
In the event a website doesn't have a robots.txt, I use Abot Web Crawler instead. The problem is that Abot is far slower than the direct robots.txt crawler. When Abot hits a page with lots of links, it seems to schedule each link very slowly, with some pages taking 20+ seconds to queue everything and run the post-processing mentioned above.
I use the PoliteWebCrawler, which is configured not to crawl external pages. Should I instead be crawling multiple websites at once, or is there another, faster approach to Abot?
Thanks!

Added a patch to Abot to fix issues like this one. It should be available in NuGet version 1.5.1.42. See issue #134 for more details. Can you verify this fixed your issue?

Is it possible that the site you are crawling cannot handle lots of concurrent requests? A quick test would be to open a browser and start clicking around the site while Abot is crawling it. If the browser is noticeably slower, then the server is showing signs of strain under the load.
If that is the issue, you need to slow the crawl down through the configuration settings - see the sketch below.
If not, can you give a URL of a site or page that is being crawled slowly? Abot's full configuration would also be helpful.
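If throttling does turn out to be needed, a minimal sketch of the configuration (assuming Abot 1.x; double-check the property names and constructor against the version you have installed):

using System;
using Abot.Crawler;
using Abot.Poco;

class CrawlExample
{
    static void Main()
    {
        CrawlConfiguration config = new CrawlConfiguration();
        config.IsExternalPageCrawlingEnabled = false;     // stay on the start site
        config.MaxConcurrentThreads = 1;                  // fewer parallel requests
        config.MinCrawlDelayPerDomainMilliSeconds = 1000; // 1s pause between requests to a domain
        config.MaxPagesToCrawl = 1000;

        // The trailing nulls fall back to Abot's default implementations.
        PoliteWebCrawler crawler = new PoliteWebCrawler(
            config, null, null, null, null, null, null, null, null);

        crawler.PageCrawlCompletedAsync += (sender, e) =>
            Console.WriteLine("Crawled: " + e.CrawledPage.Uri);

        crawler.Crawl(new Uri("http://example.com")); // placeholder start URL
    }
}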

Related

Why is Chrome flooding my site with GET requests?

I'm getting a periodic issue with my IIS-hosted website whereby one of my clients' browsers (Google Chrome 77/78 or higher) suddenly begins submitting dozens of requests per second to my website for the same page.
The user is always a valid, authenticated user of my application. The requests also don't seem to follow any standard pattern that I can determine from our logs. It's not an authorization-redirect issue, for instance; it's almost as if the browser is sending dozens of requests that were somehow initiated by the user, such as opening a bookmarked version of our page dozens of times.
Looking at the request details I can see the following fetch headers:
HTTP_SEC_FETCH_USER: ?1
HTTP_SEC_FETCH_SITE: none
HTTP_SEC_FETCH_MODE: navigate
From what I can understand, this combination means that the action was user-initiated and that it did not come from my own application, in terms of an AJAX request or page refresh. I can only reproduce the above combination of fetch headers when I open my page in a new tab in Chrome, for instance.
Could this actually be related to the Chrome browser itself? I cannot replicate the issue in development, but it's happened a few times now and I'm not sure where to start in determining a cause.
As other users have pointed out in the comments, this can in fact be caused by Chrome for Android's predictive-loading mechanism.
A recent version of Chrome for Android (78.0.3924.108) has experimented with predictive loading, changing the rules by which links are selected for prefetching. This can cause arbitrary links to be "loaded" (issuing a GET request, distorting stats, and triggering any side effect that action has) without any user input when someone visits your website.
This has been rolling out over the past week and has caused many issues in many different scenarios (logging users out, clicking on paid or aggregator links, etc.).
More info on the Chromium issue tracker:
https://bugs.chromium.org/p/chromium/issues/detail?id=1027991
Requests made by prefetching include a Purpose: prefetch header - at least in Chrome; other browsers might send other headers.
This has since been fixed, as of the morning of 25 November 2019.
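Until the fix reaches all your users, one defensive option is to detect these headers on the server and skip any side effects for speculative loads. A rough ASP.NET (System.Web) sketch - the Purpose header is what Chrome sends, X-Purpose is an older variant used by some browsers, and the 204 response is my own choice, not something from the Chromium thread:

using System;
using System.Web.UI;

public partial class TrackedPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string purpose = Request.Headers["Purpose"] ?? Request.Headers["X-Purpose"];
        if (string.Equals(purpose, "prefetch", StringComparison.OrdinalIgnoreCase))
        {
            // Don't log the user out, record a click, etc. for a prefetch.
            Response.StatusCode = 204;          // no content for the speculative load
            Response.SuppressContent = true;
            Response.End();
            return;
        }
        // ...normal, side-effecting page processing continues here...
    }
}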

Disable browser caching of pages and JavaScript running in Azure

Whenever we deploy an application and the client reviews it, sometimes the JavaScript doesn't work (or only partially works). But when the browser is refreshed, the page works as intended.
I suspect it has something to do with the cache. Is there a way to disable caching of pages? I'm using Azure with .NET 4.0.
Thank you in advance!
The only way I know of to reliably stop caching of files and links in most browsers is to append a random number or timestamp to the file URL, e.g.
http://www.domain.com/js/script.js?date=20120409120003
This means the URL is new each time the page is loaded, so the next time the browser goes to get the file it won't find it in its cache.
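A sketch of how you might emit such a link from .NET 4.0 server code (the script path is a placeholder; note that a per-request timestamp disables caching entirely, so keying on your deployment version is usually the better trade-off):

using System;

class CacheBusterExample
{
    static void Main()
    {
        // Build a script tag whose query string changes, forcing a refetch.
        string cacheBuster = DateTime.UtcNow.ToString("yyyyMMddHHmmss");
        string scriptTag = String.Format(
            "<script type=\"text/javascript\" src=\"/js/script.js?date={0}\"></script>",
            cacheBuster);
        Console.WriteLine(scriptTag);
        // A steadier alternative: key the query string on your assembly or deploy
        // version, so the URL only changes when you actually ship new files.
    }
}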

Replicating the steps to download a file

I'm trying to automate the download of a file from a website. Normally, to download the file, I log in with a username and password, navigate to a particular screen, then click a button.
I've been trying to watch the sequence of POSTs using Chrome's developer tools and then replicate all the steps using the .NET WebClient class, but without success. I've derived from the WebClient class and added cookie handling, which seems to be working. I go to the login page and post using WebClient.UploadValues; about half the time it seems to work. The next step appears to be another POST to a reporting URL. Once again I use WebClient.UploadValues, but the response from the server is a page showing an internal error.
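The derived class follows the usual cookie-container pattern; roughly this (a sketch of the common approach, not necessarily the exact code in use):

using System;
using System.Net;

// WebClient with a shared cookie jar, so the login session survives across requests.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
            httpRequest.CookieContainer = cookies; // reuse the same cookies every call
        return request;
    }
}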
I have a couple of questions.
1) Are there better tools than hand-coding C# to replicate a bunch of web browser interactions? I really only care about being able to download the file at a particular time each day onto a Windows box.
2) WebClient does not seem to be the best class to use for this; perhaps it's a bit too simplistic. I tried using HttpWebRequest, but it has no built-in facilities for encoding POST requests. Any other recommendations?
3) Although Chrome's developer tools appear to show all the interaction, I find them a bit cumbersome to use. I'd be interested in seeing all of the raw communication (unencrypted, though; the site is only accessed via HTTPS) so I can see whether I'm really replicating all of the steps.
I can even post the exact code I'm using. The site I'm pulling data from, specifically, is the Standard and Poor's website. They have the ability to create custom reports for downloading historical data, which I need for reporting, not republishing.
Using IE to download the file would be much easier than writing C# / Perl / Java code to replicate the HTTP requests.
The reason is that even a slight change in the site's JavaScript can break the flow.
With IE, you can automate it using COM. The following VBA example opens IE and performs a Google search:
Sub Search_Google()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Navigate "http://www.google.com"  'load the google.com web page
    While IE.Busy
        DoEvents  'wait until IE is done loading the page
    Wend
    IE.Document.all("q").Value = "what you want to put in the text box"
    'click the button named "btnG", which is Google's "Google Search" button
    IE.Document.all("btnG").Click
    While IE.Busy
        DoEvents  'wait until IE is done loading the page
    Wend
End Sub
Regarding point 3 - wanting to see all of the raw, decrypted communication so you can check you're replicating every step: you can use Fiddler to view all the interaction and the raw data going back and forth. To make it work with HTTPS, you will need to install Fiddler's root certificate to enable decryption of the traffic.

Writing my first web crawler

I've tried to find some good how-to guides or examples suitable for beginners writing their first web crawler. I would like to write it in C#. Does anybody have good example code to share, or tips on sites where I can find information on C# and basic web crawling?
Thanks
HtmlAgilityPack is your friend.
Yes, HtmlAgilityPack is a good tool for parsing the HTML, but that alone is definitely not enough.
There are 3 elements to crawling:
1) Crawling itself, i.e. looping through websites: This could be done by sending requests to random IP addresses, but it does not work well, since many websites are virtual-hosted on shared IP addresses and selected by the HTTP Host header, so a request by IP alone won't reach them. On the other hand, there are far too many IP addresses that are unused or not hosting a web server, so this does not get you anywhere.
I suggest you send requests to Google (searching for words from a dictionary) and crawl the results that come back.
2) Rendering the content: Many websites generate their HTML content with JavaScript when the page loads, so a simple request will not capture the content the way a user sees it. You need to render the page as a browser does, which can be done with WebKit.NET, an open-source tool that is still in beta.
3) Comprehending and parsing the HTML: use HtmlAgilityPack; there are tons of examples online. It can be used to extract the links for crawling the site as well - see the sketch below.
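As a starting point for steps 1 and 3, a minimal HtmlAgilityPack sketch that loads a page and extracts the links you would feed back into the crawl loop (the URL is a placeholder):

using System;
using HtmlAgilityPack;

class LinkExtractor
{
    static void Main()
    {
        // Fetch and parse the page in one step.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com");

        // SelectNodes returns null when nothing matches, so guard against it.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}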
A while ago I also wanted to write a custom web crawler, and found this document:
Web Crawler
It has some great info, and is very well written IMO.

How Do Systems Like AdSense and Web Stats Work?

I am thinking about working with remote data, sending and receiving data from external websites. There are many working examples on the web, such as free online web-stats tools or Google's AdSense: the service generates some code for publishers, the publisher puts that generated code in the BODY of their web page document (HTML file), and the system then works - counting visits to home pages, clicks on advertisements, and so on. Now this is my question: how do such systems work, and how can I investigate them to find out how to program them? Can you suggest some keywords or topics I should look for, and which technologies are relevant to this kind of programming? I want to find some relevant references so I can learn and start experimenting with these systems. If my question is not clear, I will explain it more if you want.
Note that I am a programmer who wants to build such systems, not just use them.
There are a few different ways to track clicks.
Redirection Tracking
One is to link the advertisement (or any link) to a redirection script. You would normally pass it some sort of ID so it knows which URL to forward to. But before redirecting the user to that page, it can first record the click in a database, storing the user's IP, timestamp, browser information, etc. It then forwards the user (without them really noticing) to the specified URL.
Advertisement ---> Redirection Script (records click) ---> Landing Page
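A rough sketch of such a redirection script as an ASP.NET handler - the logging and URL-lookup helpers here are hypothetical placeholders, not a real API:

using System.Web;

// Record the click, then forward the visitor to the landing page.
public class ClickTracker : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string id = context.Request.QueryString["id"];

        // Record IP, timestamp, user agent, etc. (hypothetical helper):
        // ClickLog.Record(id, context.Request.UserHostAddress);

        // Look up the real destination for this ad ID (hypothetical lookup;
        // a fixed example URL stands in for it here).
        string destination = "http://example.com/landing?ref=" + id;

        context.Response.Redirect(destination, false);
    }

    public bool IsReusable { get { return true; } }
}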
Pixel Tracking
Another way to do it is pixel tracking. This is where you put a "pixel" or a piece of JavaScript code into the body of a webpage. The pixel is just an image (or a script posing as an image) which is then requested by the user's browser when they visit the page. The tracker that hosts the pixel can record the relevant information from that image request. Some systems use JavaScript instead of an image (or both) to track clicks, which may allow them to gather slightly more information through JavaScript's functions.
Advertisement ---> Landing Page ---> User requests pixel (records click)
Here is an example of a pixel: <img src="http://tracker.mydomain.com?id=55&type=png" />
I threw in the png at the end because some systems might require a valid image file type.
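Server-side, the pixel can be a handler that logs the hit and answers with a tiny image. A sketch (the logging helper is hypothetical; the byte string is the standard 1x1 transparent GIF):

using System;
using System.Web;

public class PixelTracker : IHttpHandler
{
    // The smallest valid transparent GIF, decoded once.
    private static readonly byte[] Gif = Convert.FromBase64String(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7");

    public void ProcessRequest(HttpContext context)
    {
        // Record the visit - IP, referrer, query-string ID, etc. (hypothetical helper):
        // VisitLog.Record(context.Request.QueryString["id"],
        //                 context.Request.UserHostAddress);

        context.Response.ContentType = "image/gif";
        context.Response.OutputStream.Write(Gif, 0, Gif.Length);
    }

    public bool IsReusable { get { return true; } }
}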
Hidden Tracking
If you do not want the user to know about the tracker, you can put code on your landing page that passes data to the tracker. This is done on the backend (server side), so it is invisible to the user. Essentially you "request" the tracker URL from your own server while passing the relevant data via GET parameters. The tracker then records that data with very limited extra load on the landing page's server.
Advertisement ---> Landing Page requests tracker URL and concurrently renders page
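A sketch of that server-side call, reusing the tracker domain from the pixel example (the endpoint and parameters are hypothetical; in practice you would fire this asynchronously so page rendering never waits on the tracker):

using System.Net;

static class HiddenTracker
{
    // Ping the tracker from the landing page's backend; the visitor sees nothing.
    public static void NotifyTracker(string adId, string visitorIp)
    {
        using (WebClient client = new WebClient())
        {
            string url = string.Format(
                "http://tracker.mydomain.com/record?id={0}&ip={1}", adId, visitorIp);
            client.DownloadString(url); // fire the GET; the response body is ignored
        }
    }
}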
Your question really isn't clear, I'm afraid.
Are you trying to find out information about who uses your site, how many clicks you get, and so on? Something like Google Analytics might be what you are after - take a look here: http://www.google.com/analytics/
EDIT: Adding more info in response to comment.
Ah, OK, so you want to know how Google tracks clicks on sites when those sites use Google ads? Well, a full discussion of how Google AdSense works is well beyond me, I'm afraid - you'll probably find some useful info on Google itself and on Wikipedia.
In a nutshell, and at a very basic level, Google ads work by directing the click to Google first - if you look at the URL of a Google ad (on this site, for example) you will see the URL starts with "http://googleads.g.doubleclick.net..." (Google owns DoubleClick). The URL also contains a lot of other information which allows Google to detect where the click came from and where to redirect you so you reach the actual website being advertised.
Google Analytics is slightly different in that it is a small chunk of JavaScript you run in your page, but it too basically reports back to Google that the page was visited, when you landed there, and how long you spent on the page.
Like I said, a full discussion of this is beyond me I'm afraid, sorry.
