Download thousands of URLs - C#

I'm developing a service that has to visit my client's website and process its content. As you probably understand, my service downloads thousands of URLs every hour, and some of those URLs are from the same domain.
To make the process faster, my application uses 100 threads. Each thread downloads one URL and processes its content.
I've noticed that after some time of downloading webpages, my "WebRequest.GetResponse()" calls get stuck. After the timeout period, the WebRequest throws timeout exceptions (from all the threads doing the same work). The URLs are valid and downloadable (checked).
So I suspect the server detects that a robot is doing this work and stops responding to its requests.
One solution for this situation is to use the TOR network. This makes the requested web server think the request comes from a different client. The downside is that TOR IPs are public and some servers block those IPs; therefore, for those specific servers the solution won't work.
I'm looking for a better solution. Any ideas?
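For reference, here is a minimal sketch of the kind of per-thread download loop described above, under my own assumptions (the shared URL queue, the thread count, and the ProcessContent placeholder are illustrative, not from the original post):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class BulkDownloader
{
    // Thread-safe queue of URLs to fetch (illustrative; fill it from your own source).
    static readonly ConcurrentQueue<string> Urls = new ConcurrentQueue<string>();

    static void Main()
    {
        // Without this, .NET allows only 2 concurrent connections per host by default,
        // so most of the 100 threads would just wait for a free connection.
        ServicePointManager.DefaultConnectionLimit = 100;

        var threads = new Thread[100];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(Worker);
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
    }

    static void Worker()
    {
        string url;
        while (Urls.TryDequeue(out url))
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Timeout = 30000; // fail fast instead of hanging indefinitely

                // Always dispose the response: leaked responses keep connections open
                // and eventually make later GetResponse() calls block until timeout.
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    ProcessContent(url, reader.ReadToEnd());
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine("Failed {0}: {1}", url, ex.Status);
            }
        }
    }

    static void ProcessContent(string url, string html)
    {
        // Placeholder for whatever processing the service does with the page.
    }
}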

If you have permission from the site owner, ask them to add your IP to the firewall / DDoS protection.
If they have set this functionality up, they should be able to add your IP to the allow list.

Related

Get response back from unpingable websites (C# ASP.NET MVC)

I'm not a network expert, but for one of my projects, I need to ensure that the website I'm sending the request to is alive. Some websites do not respond to ping; basically, their configuration prevents response to ping requests.
I was trying to use arping instead of pinging websites, but arping only works on the local network and will not go beyond the network segment.
I could download the whole or part of the webpage and confirm that the content is the same as the previous state, but I'd rather have one more level of confirmation before downloading the HTML.
Is there any other method that enables the app to get a response back from non-pingable websites outside the network?
Based on common practice, you can use ping, telnet, and tracert from your client machine against the requested server (in this case the website or service you want to connect to) and make sure all three commands work from your side. You can also try to access it in your browser.
If it's an API, you can also try POSTMAN and call the service.
Good luck and happy coding :)
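A programmatic equivalent of the telnet check above, offered as one possible approach rather than a definitive answer: open a TCP connection to the site's HTTP(S) port, or send an HTTP HEAD request, which tells you the site is alive without downloading the whole page (the host, port, and timeouts below are placeholders, and some servers also reject HEAD):

using System;
using System.Net;
using System.Net.Sockets;

class LivenessCheck
{
    // TCP-level check: does the host accept connections on the given port?
    static bool CanConnect(string host, int port, int timeoutMs)
    {
        try
        {
            using (var client = new TcpClient())
            {
                IAsyncResult result = client.BeginConnect(host, port, null, null);
                return result.AsyncWaitHandle.WaitOne(timeoutMs) && client.Connected;
            }
        }
        catch (SocketException)
        {
            return false;
        }
    }

    // HTTP-level check: a HEAD request returns status and headers without the body.
    static bool RespondsToHead(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            request.Timeout = 5000;
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode < 400;
            }
        }
        catch (WebException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(CanConnect("example.com", 443, 3000));
        Console.WriteLine(RespondsToHead("https://example.com/"));
    }
}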

How to change my IP frequently in a Windows Forms application?

I have a security course project. It asks me to visit a given website and download its information 20 times (the site has 20 subpages), then parse it, etc. I am using C#'s DownloadString to download and parse each page. However, after the fifth time, the website figures out that I am doing those downloads as a robot (programmatically).
What I created as a program works until the sixth request.
I download the content and parse the desired information, but when I reach the sixth subpage, my PC is blocked.
It is not related to the time interval: I used randomly generated delays of 6-12 seconds between requests, and that did not help. It is definitely related to the site's request counter, something like "deny access after 5 requests in 30 minutes; if the limit is exceeded, block for a day (or more)". Since I have been blocked many times, I am now using my phone's hotspot.
I found a possible solution while searching the internet: people change their IP via netsh and similar methods. However, I think my IP is static (WiFi), and I could not figure out how to change it programmatically in a C# Windows Forms app.
Because of that, I would like to hear your thoughts.
Your ISP most likely gives you a single Dynamic IP Address, which is the IP Address of your computer's access point to the Internet (i.e. the WAN). If so, they control the IP and not you. Even if you have a home network with multiple computers all on different local IP Addresses (LAN), you still aren't changing your WAN IP address which is the address that is effectively blocked.
Also, I am not going to judge, but I would say that if this is for an actual course project, then ethically speaking your instructor most likely would not want you to hammer an innocent website any more than the website's owner wishes you to hammer it, hence the blocking. My suggestion would be to set your sights on another website that does not do the blocking to complete your coursework. Maybe you can do this against Google.com?
If you really need to make a request through a different IP address, you could link your application up to several different proxies and switch between them at intervals.
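A rough sketch of that proxy-rotation idea, assuming you have a list of HTTP proxies you are allowed to use (the addresses below are placeholders):

using System;
using System.Net;

class ProxyRotationExample
{
    static readonly string[] Proxies =
    {
        // Placeholder proxy addresses; substitute real ones you are allowed to use.
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080"
    };

    static string DownloadVia(string url, int requestNumber)
    {
        using (var client = new WebClient())
        {
            // Pick a different proxy at regular intervals.
            string proxyAddress = Proxies[requestNumber % Proxies.Length];
            client.Proxy = new WebProxy(proxyAddress);
            return client.DownloadString(url);
        }
    }
}
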
Also, you mention that your IP is static, but there is a difference between your local IP and your external IP address. The IP address given to your WiFi connection is local; the external IP address, which is the one seen by Internet sites, is not the same.
If you have a dynamic external IP address, one option might be to programmatically connect to your router and restart it. This is one way to trigger an IP address update, if you actually have access to the router.
Overall, what you are doing is difficult to achieve for what sounds to be a simple assignment.
Here's a rather involved and eccentric solution that would, however, get around the problem nicely. Create 4 Amazon EC2 t2.micro instances (Windows) and issue 5 requests each from the EC2 instances. You can store the result to S3 buckets. It would take you a lot of work to get this working, but you'd come out the other end also having your first experience of working in the cloud. And each of those instances would have a different IP.
Also, if you spin the same instance up and down a few times, it's unlikely to get the same IP in any case, so you could easily get away with one instance.
In a more serious vein: experiment with changing your user-agent string and adding a much heftier amount of time (minutes, hours) between requests. Also, turn your hotspot off and on after every five requests, which will likely give you a new IP each time.
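Since the question uses DownloadString, a minimal sketch of the "change the user-agent and slow down" suggestion might look like this (the user-agent string and delay range are illustrative only):

using System;
using System.Net;
using System.Threading;

class SlowPoliteDownloader
{
    static readonly Random Rng = new Random();

    static string DownloadPage(string url)
    {
        using (var client = new WebClient())
        {
            // Present a browser-like user-agent instead of the default .NET one.
            client.Headers[HttpRequestHeader.UserAgent] =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0";

            string html = client.DownloadString(url);

            // Wait several minutes between requests rather than a few seconds.
            Thread.Sleep(TimeSpan.FromMinutes(Rng.Next(2, 6)));
            return html;
        }
    }
}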

How to request an ASP.NET page manually?

I have a project that contains 2 pages: test1.aspx and test2.aspx. From test1.aspx I want to manually request test2.aspx and get the HTML out of it. I could do this using HttpClient or HttpWebRequest. The problem is that there is a firewall and I suspect it won't work. Is there any other way to download the content of the webpage without actually using HttpWebRequest?
Thanks in advance.
I don't really like what you are trying to do ;) Anyway, since your page doesn't seem to be a static page (.aspx), you must make a request to your web server, whatever method you use (HttpClient or HttpWebRequest).
Usually, a request made on the same machine does not pass through the network; if the DNS alias points to the machine's IP address, a loopback occurs.
In this case:
if your firewall is somewhere on your network, you don't care about it: the request will not leave your host
if you mean firewall software on your machine, it may block the request; you may have to authorize such requests, or force the DNS locally in your hosts file to 127.0.0.1 (a true localhost), which works with most firewall software
if you are on Windows Server and your site requires authentication, you may have to deal with the Loopback Check
NB: loopbacks are usually considered a security risk and are not recommended.
You should think about another solution, such as Ajax web services, or web or user controls (as already said), etc.
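For completeness, the HttpWebRequest approach the question mentions looks roughly like the sketch below; whether it works through your firewall depends on the loopback points above (the localhost URL is a placeholder, not the actual application path):

using System;
using System.IO;
using System.Net;

class PageFetcher
{
    static string GetPageHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // From test1.aspx you would call this with the absolute URL of test2.aspx,
        // e.g. built from the current request's host; the literal below is a placeholder.
        string html = GetPageHtml("http://localhost/MyApp/test2.aspx");
        Console.WriteLine(html.Length);
    }
}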

Masking your web scraping activities to look like normal browser surfing activities?

I'm using the Html Agility Pack and I keep getting this error on certain pages: "The remote server returned an error: (500) Internal Server Error."
Now I'm not sure what this is, as I can use Firefox to get to these pages without any problems.
I have a feeling the website itself is blocking requests and not sending a response. Is there a way I can make my Html Agility Pack call look more like a call coming from Firefox?
I've already set a timer in there so it only sends a request to the website every 20 seconds.
Is there any other method I can use?
Set a User-Agent similar to a regular browser's. A user agent is an HTTP header passed by the HTTP client (browser) to identify itself to the server.
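A minimal sketch of that idea, assuming you load the page yourself with HttpWebRequest and then hand the HTML to Html Agility Pack (the user-agent and referrer strings are examples only):

using System.Net;
using HtmlAgilityPack;

class BrowserLikeScraper
{
    static HtmlDocument LoadWithBrowserHeaders(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Headers a normal browser would send; values here are illustrative.
        request.UserAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Referer = "https://www.example.com/";

        var doc = new HtmlDocument();
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            doc.Load(response.GetResponseStream());
        }
        return doc;
    }
}
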
There are a lot of ways servers can detect scraping, and it's really just an arms race between the scraper and the site being scraped, depending on how badly one or the other wants to access or protect the data. Some of the things that help you go undetected are:
Make sure all HTTP headers sent are the same as a normal browser's, especially the user agent and the URL referrer.
Download all images and CSS scripts like a normal browser would, in the order a browser would.
Make sure any cookies that are set are sent back with each subsequent request (see the cookie sketch after this list).
Make sure requests are throttled according to the site's robots.txt.
Make sure you aren't following any no-follow links, because the server could be setting up a honeypot where it stops serving requests from your IP.
Get a bunch of proxy servers to vary your IP address.
Make sure the site hasn't started sending you CAPTCHAs because it thinks you are a robot.
Again, the list could go on depending on how sophisticated the server setup is.
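For the cookie point above, a hedged sketch: a single shared CookieContainer makes every HttpWebRequest in a session send back whatever cookies the site set earlier.

using System.Net;

class CookieAwareScraper
{
    // One container shared across requests keeps the session's cookies.
    static readonly CookieContainer Cookies = new CookieContainer();

    static HttpWebResponse GetWithCookies(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies; // cookies set by earlier responses are sent back
        return (HttpWebResponse)request.GetResponse();
    }
}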

Comet and simultaneous Ajax request

I am trying to implement a Comet solution using ASP.NET.
The trouble is that I want to implement the sending and the notification parts in the same page.
On IE7, whenever I try to send a request, it just gets queued up.
After reading pages on the internet and Stack Overflow, I found that I can only make 2 simultaneous async Ajax requests per page.
So until I close my Comet Ajax request, my second request doesn't complete; it doesn't even leave the browser. And when I checked with Firefox, I saw just one Comet Ajax request running the whole time, so doesn't that leave me one more Ajax request?
Also, the solution uses IRequiresSessionState on the asynchronous HTTP handler, which I removed, but it still causes problems with multiple instances of IE7.
I had one workaround, which is described here: http://support.microsoft.com/kb/282402
It means we can increase the request limit (the default is 2) via the registry: by changing the "MaxConnectionsPer1_0Server" value under the "HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings" hive we can increase the number of requests.
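As a hedged illustration of that registry workaround, the same change can be made programmatically; the KB article also mentions a "MaxConnectionsPerServer" value for HTTP 1.1, and the value 8 below is just an example:

using Microsoft.Win32;

class ConnectionLimitTweak
{
    static void RaiseIeConnectionLimits()
    {
        const string keyPath =
            @"HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings";

        // Raise the per-server connection limits used by IE/WinInet
        // (2 is the HTTP 1.1 default the question refers to).
        Registry.SetValue(keyPath, "MaxConnectionsPerServer", 8, RegistryValueKind.DWord);
        Registry.SetValue(keyPath, "MaxConnectionsPer1_0Server", 8, RegistryValueKind.DWord);
    }
}
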
Basically, I want to broadcast information to multiple clients connected to a server using Comet, and the clients can also send messages to the server.
Broadcasting works, but the send request back to the server doesn't.
I'm using IIS 6 and ASP.NET.
Are there any more workarounds or ways to send more requests?
References:
How many concurrent AJAX (XmlHttpRequest) requests are allowed in popular browsers?
AJAX, PHP Sessions and simultaneous requests
jquery .ajax request blocked by long running .ajax request
jQuery: Making simultaneous ajax requests, is it possible?
You are limited to 2 connections, but typically that's all you need - 1 to send, 1 to receive, even in IE.
That said, you can totally do this; we do it all the time in WebSync. The solution lies in subdomains.
The thing to note is that IE (and other browsers, although they typically limit to 6 requests, not 2) limits requests per domain - but that limitation is for the entire domain excluding subdomains. So for example, you can have 2 requests open to "www.stackoverflow.com" and 2 more requests open to "static.stackoverflow.com", all at the same time.
Now, you've got to be somewhat careful with this approach, because if you make a request from the www subdomain to the static subdomain, that's considered a cross-domain request, so you're immediately limited to not using direct XHR calls, but at that point you have nevertheless bypassed the 2 connection limit; JSONP, HTML5, etc, are all your friend for bypassing the cross-domain limitations.
Edit
Dealing with more than one instance of IE comes back to the same problem. The limitation applies across all instances. So, if you have two browsers open, and they're both using Comet, you're stuck with 2 long-polling connections open. If you've maximized your options, you're going to be connecting those long-polling requests to something like "comet.mysite.com", and your non-long-polling requests will go to "mysite.com". That's the best you'll get without going into wildcard DNS.
Check out some of our WebSync Demos; they work in 2 instances of IE without a problem. If you check out the source, you'll see that the DNS for the streaming connection is different from the main page; we use JSONP to bypass the cross-domain limitation.
The main idea in COMET is to keep one client-to-server request open, until a response is necessary.
If you design your code properly, then you don't need more than 2 requests to be open simultaneously. Here's how it works:
client uses a central message send-receive loop to send out a request to the server
server receives the request and keeps it open.
at some point, the server responds to the client.
the client (browser) receives the response, handles it in its central message loop.
immediately the client sends out another request.
repeat
The key is to centralize and asynchronize all communications in the client. So you will never need to have 2 open requests.
But to answer your question directly, no, there are no additional workarounds.
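The client side of that loop is browser Ajax in the question, but purely as an illustration of the same shape, here is a long-polling loop sketched in C# as a console client (the handler URL is a placeholder): send one request, let the server hold it open, handle the response, and immediately send the next one.

using System;
using System.IO;
using System.Net;

class LongPollLoop
{
    static void Main()
    {
        // Placeholder endpoint; the real Comet handler URL depends on your application.
        const string url = "http://localhost/MyApp/comet.ashx";

        while (true)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = System.Threading.Timeout.Infinite; // the server holds the request open

            try
            {
                // This call blocks until the server decides to respond (a new message arrives).
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    Console.WriteLine("Received: " + reader.ReadToEnd());
                }
            }
            catch (WebException)
            {
                // On errors, back off briefly before reconnecting.
                System.Threading.Thread.Sleep(2000);
            }
            // The loop immediately issues the next request, keeping exactly one poll open.
        }
    }
}
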
Raise the connection limit or reduce the number of connections you use.
