Is there a way to spoof a web request from C# code so it doesn't look like a bot or spam hitting the site? I am trying to web scrape my own website, but keep getting blocked after a certain number of calls. I want to act like a real browser. I am using this code from the HTML Agility Pack.
var web = new HtmlWeb();
web.UserAgent =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
I do way too much web scraping, but here are the options:
I have a default list of headers I add as all of these are expected from a browser:
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
(wc is my WebClient instance.)
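As a side note, a small helper that re-applies these defaults before each call can be handy, since WebClient can drop some of these headers between requests. A minimal sketch, assuming the same wc instance as above (the SetDefaultHeaders name is just for illustration):
static void SetDefaultHeaders(WebClient wc)
{
    // Re-apply the browser-like defaults before every request.
    wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
    wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
    wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
    wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
    wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
}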
As a further help, here is my WebClient subclass that stores cookies across requests, which is also a massive help:
public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    protected override WebRequest GetWebRequest(Uri address)
    {
        try
        {
            ServicePointManager.DefaultConnectionLimit = 1000000;
            WebRequest request = base.GetWebRequest(address);
            request.Proxy = proxy;
            HttpWebRequest webRequest = request as HttpWebRequest;
            if (webRequest != null)
            {
                // Only HTTP(S) requests expose these options and the cookie container.
                webRequest.Pipelined = true;
                webRequest.KeepAlive = true;
                webRequest.CookieContainer = m_container;
            }
            return request;
        }
        catch
        {
            return null;
        }
    }
}
Here is how I usually use it. Add a static copy to the base site class that holds all the parsing functions you likely have:
protected static CookieWebClient wc = new CookieWebClient();
And call it as such:
public HtmlDocument Download(string url)
{
    HtmlDocument hdoc = new HtmlDocument();
    HtmlNode.ElementsFlags.Remove("option");
    HtmlNode.ElementsFlags.Remove("select");
    Stream read = null;
    try
    {
        read = wc.OpenRead(url);
    }
    catch (ArgumentException)
    {
        read = wc.OpenRead(HttpHelper.HTTPEncode(url));
    }
    hdoc.Load(read, true);
    return hdoc;
}
The other main reason you may be crashing out is that the server is closing the connection because it has been open for too long. You can prove this by wrapping the download in a try/catch as above; if it fails, reset the WebClient and try the download again:
HtmlDocument d = new HtmlDocument();
try
{
    d = this.Download(prp.PropertyUrl);
}
catch (WebException e)
{
    this.Msg(Site.ErrorSeverity.Severe, "Error connecting to " + this.URL + " : Resubmitting..");
    wc = new CookieWebClient();
    d = this.Download(prp.PropertyUrl);
}
This saves my ass all the time. Even if it was the server rejecting you, this can re-jig the lot: cookies are cleared and you're free to roam again. If worst truly comes to worst, add proxy support and switch to a new proxy every 50 or so requests (a sketch follows below).
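A minimal sketch of that proxy-rotation idea, assuming you already have your own list of proxy addresses to cycle through (proxyAddresses, requestsPerProxy and RotateProxyIfNeeded are illustrative names, not part of the original code):
private static readonly string[] proxyAddresses = { "http://proxy1:8080", "http://proxy2:8080" }; // your own list
private const int requestsPerProxy = 50;
private static int requestCount = 0;
private static int proxyIndex = 0;

private static void RotateProxyIfNeeded(CookieWebClient client)
{
    // Every ~50 requests, point the client at the next proxy in the list.
    if (++requestCount % requestsPerProxy == 0)
    {
        proxyIndex = (proxyIndex + 1) % proxyAddresses.Length;
        client.proxy = new WebProxy(proxyAddresses[proxyIndex]);
    }
}
Call RotateProxyIfNeeded(wc) before each Download call; the proxy field is picked up by GetWebRequest in the CookieWebClient above.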
That should be more than enough for you to kick your own and any other site's arse.
Use a regular browser and Fiddler (if the developer tools are not up to scratch) and take a look at the request and response headers.
Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference).
Regarding "getting blocked after a certain number of calls": throttle your calls. Only make one call every x seconds (a minimal sketch follows below). Behave nicely towards the site and it will behave nicely towards you.
Chances are good that they simply look at the number of calls from your IP address per second and if it passes a threshold, the IP address gets blocked.
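A minimal throttling sketch along those lines, assuming a single-threaded scraping loop and using System.Threading (the interval value is whatever the site tolerates):
private static DateTime lastCall = DateTime.MinValue;
private static readonly TimeSpan minInterval = TimeSpan.FromSeconds(5); // tune to the site

private static void Throttle()
{
    // Enforce a minimum gap between consecutive requests.
    TimeSpan elapsed = DateTime.UtcNow - lastCall;
    if (elapsed < minInterval)
        Thread.Sleep(minInterval - elapsed);
    lastCall = DateTime.UtcNow;
}
Call Throttle() immediately before each request so that no two requests are ever closer together than minInterval.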
I am trying to make some Bitbucket API requests using the csharp.bitbucket library. I have some code which fetches a request token and then builds up an authenticate URL. The authenticate URL looks something like
https://bitbucket.org/api/1.0/oauth/authenticate/?oauth_token=xxxxxx
where xxxxxx is the token that I have already retrieved via the Bitbucket API.
The issue I am having is that when I try to download the URL using WebClient, I always get the Bitbucket login page even though I am passing an Authorization header. When I hit the authenticate URL using Postman and pass through the same token and Authorization header, it all works. My code looks like this:
using (var wc = new CookieWebClient(_username, _password))
{
    pageText = wc.DownloadString(url);
}
The CookieWebClient class looks like this:
public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    public CookieWebClient(string authenticationUser, string authenticationPassword)
    {
        string credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes(authenticationUser + ":" + authenticationPassword));
        Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        try
        {
            ServicePointManager.DefaultConnectionLimit = 1000000;
            WebRequest request = base.GetWebRequest(address);
            request.Proxy = proxy;
            var webRequest = request as HttpWebRequest;
            if (webRequest != null)
            {
                // User-Agent is a restricted header on HttpWebRequest, so set it via the property.
                webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
                webRequest.PreAuthenticate = true;
                webRequest.AllowAutoRedirect = true;
                webRequest.Pipelined = true;
                webRequest.KeepAlive = true;
                webRequest.CookieContainer = m_container;
            }
            return request;
        }
        catch
        {
            return null;
        }
    }
}
It looks like the authentication part via WebClient is not working, because when I make the DownloadString call I get the Bitbucket login page.
Anyone seen this before?
Thanks in advance
Ismail
So, in answer to my own question: after looking at Fiddler and Postman I could see that the authenticate call was doing a 301 redirect and losing the Authorization header, so I updated my code to hit the URL it was trying to redirect to.
So instead of authenticate I go to authorise directly, while passing my token and Authorization header, and now it all works. This all used to work, so I think something has changed at Bitbucket's end, hence the breakage.
So the issue is the 301 redirect losing the Authorization header that has been set. Hope this helps someone. A sketch of the workaround is below.
Ismail
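For completeness, a minimal sketch of the workaround described above. The exact authorise endpoint and the token variable are assumptions; the key point is to call the redirect target directly so the Authorization header is not dropped:
using (var wc = new CookieWebClient(_username, _password))
{
    // Assumed endpoint: replace with the exact URL Fiddler shows as the 301 target.
    string authoriseUrl = "https://bitbucket.org/api/1.0/oauth/authorize/?oauth_token=" + token;
    // No redirect happens now, so the Basic Authorization header set in the
    // CookieWebClient constructor reaches the server intact.
    pageText = wc.DownloadString(authoriseUrl);
}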
I have tried just about all the related solutions found on the web, but they all refused to work for some reason. This does not work either: C# - HttpWebRequest POST (Login to Facebook), since we are using different methods.
I am not using the POST method but the GET method, which is what the request uses. The site I am using does not need any login credentials to get the image. (Most of the other root domains the site has do not require a cookie.)
The code below is part of what I figured out to make the program get the image the way the web-based versions do, but with a few problems.
Before, I was trying to use a normal WebClient to download the image, since it refused to show up in any way that the PictureBox control would accept. But then I switched to HttpWebRequest.
The particular root domain of the site where I am trying to get the image from requires a cookie, though.
Below is a code snippet which basically tries to get an image from a site. The only trouble is, it is almost impossible to get the image from the site unless you pass a few things in the HttpWebRequest, along with a cookie.
For now, I am using a static cookie as a temporary workaround.
HttpWebRequest _request = (HttpWebRequest)HttpWebRequest.Create(_URL);
_request.Method = WebRequestMethods.Http.Get;
_request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
_request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate,sdch");
_request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
_request.Headers.Set(HttpRequestHeader.CacheControl, "max-age=0");
_request.Host = "www.habbo" + _Country;
_request.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36";
using (WebResponse _response = _request.GetResponse())
using (Stream _stream = _response.GetResponseStream())
{
    Image _image = Image.FromStream(_stream);
    _bitmap = new Bitmap(_image);
    string contentType = _response.ContentType;
    _PictureBox.Image = _bitmap;
}
Assume the following variable values:
_URL = "http://www.habbo.com/habbo-imaging/avatarimage?hb=img&user=aa&direction=2&head_direction=2&size=m&img_format=gif";
_Country = ".com";
Most of the things I am passing into the HttpWebRequest are obtained from looking at the Network tab of Google Chrome's Developer Tools.
The web-based versions of the Habbo Imager seem to just direct people to the page where they can find the image, and their browsers seem to add the cookie somehow. What I am doing is different: all they do is display the site where the image is located, whereas I want to locate the image's true location and then read it into an Image.
Apparently the site needs the user to "visit" it, according to what I read in this thread: Click here
What I would like to know is, is there a better way to get a valid cookie that the server will happily accept every time?
Or do I need to somehow trick the site into thinking the user has visited and seen the page, so that it returns the cookie we might need, even though the user never actually sees the page?
I am not too sure whether this would mean I need to generate the cookies dynamically, though.
I also do not understand how to truly create or get the cookies (and set stored cookies) using C#, so if it is possible, please use some examples.
I would prefer not to use any third-party libraries, or to change the code I am using too much. Nor should the program send two GET requests just to get what it could get with one. Thus, this wouldn't work: Passing cookie with HttpWebRequest in winforms?
I am using .NET 4.0.
It is a little bit more complicated than expected at first sight. The browser actually makes two calls. The first one returns an HTML page with a small piece of JavaScript that, when executed, sets a cookie and reloads the page. In your C# code you have to mimic that.
In your form class, add an instance variable to hold all the cookies across multiple HttpWebRequest calls:
readonly CookieContainer cookiecontainer = new CookieContainer();
I have created a Builder method that creates the HttpWebRequest and returns an HttpWebResponse. It takes a NameValueCollection to add any cookies to the CookieContainer.
private HttpWebResponse Builder(string url, string host, NameValueCollection cookies)
{
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
    request.Method = WebRequestMethods.Http.Get;
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    // _request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate,sdch");
    request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
    request.Headers.Set(HttpRequestHeader.CacheControl, "max-age=0");
    request.Host = host;
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36";
    request.CookieContainer = cookiecontainer;
    if (cookies != null)
    {
        foreach (var cookiekey in cookies.AllKeys)
        {
            request.CookieContainer.Add(
                new Cookie(
                    cookiekey,
                    cookies[cookiekey],
                    "/",
                    host));
        }
    }
    return (HttpWebResponse) request.GetResponse();
}
If the incoming stream turns out to be of the text/html content type, we need to parse its content and return the cookie name and value. The Parse method does just that:
// find in the html and return the three parameters in a string array
// setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '127.0.0.1', 10);
private static string[] Parse(Stream _stream, string encoding)
{
    const string setCookieCall = "setCookie('";
    // copy html as string
    var ms = new MemoryStream();
    _stream.CopyTo(ms);
    var html = Encoding.GetEncoding(encoding).GetString(ms.ToArray());
    // find setCookie call
    var findFirst = html.IndexOf(
        setCookieCall,
        StringComparison.InvariantCultureIgnoreCase) + setCookieCall.Length;
    var last = html.IndexOf(");", findFirst, StringComparison.InvariantCulture);
    var setCookieStatementCall = html.Substring(findFirst, last - findFirst);
    // take the parameters
    var parameters = setCookieStatementCall.Split(new[] {','});
    for (int x = 0; x < parameters.Length; x++)
    {
        // cleanup
        parameters[x] = parameters[x].Replace("'", "").Trim();
    }
    return parameters;
}
Now that our building blocks are complete, we can call our methods from the Click handler. We use a loop to call Builder (at most twice) to obtain a result from the given URL. Based on the received content type we either Parse the HTML or create the Image from the stream.
private void button1_Click(object sender, EventArgs e)
{
    var cookies = new NameValueCollection();
    for (int tries = 0; tries < 2; tries++)
    {
        using (var response = Builder(_URL, "www.habbo" + _Country, cookies))
        {
            using (var stream = response.GetResponseStream())
            {
                string contentType = response.ContentType.ToLowerInvariant();
                if (contentType.StartsWith("text/html"))
                {
                    var parameters = Parse(stream, response.CharacterSet);
                    cookies.Add(parameters[0], parameters[1]);
                }
                if (contentType.StartsWith("image"))
                {
                    pictureBox1.Image = Image.FromStream(stream);
                    break; // we're done, get out
                }
            }
        }
    }
}
Words of caution
This code works for the url in your question. I didn't take any measures to handle other patterns, and/or exceptions. It is up to you to add that. Also when doing this kind of scraping make sure the owner of the website does allow this.
I am working on a site ripper / screen scraper for looking up tracking information on the Royal Mail website. Unfortunately Royal Mail does not provide an API, so this is the way to do it.
I keep getting the same exception no matter what I do.
(The remote server returned an error: (500) Internal Server Error.)
My base code is:
class Program
{
    static void Main(string[] args)
    {
        string url = "http://track.royalmail.com/portal/rm/track?catId=22700601&gear=authentication&forcesegment=SG-Personal";
        byte[] response;
        WebClient webClient = new WebClient();
        response = webClient.DownloadData(url);
    }
}
I have used Fiddler to investigate the data transactions made by my browser, in order to mimic them in my code. I can see Royal Mail uses cookies, so I have tried to implement a WebClient that supports cookies by adding a cookie handler to it:
public class CookieAwareWebClient : WebClient
{
    private CookieContainer m_container = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest)
        {
            (request as HttpWebRequest).CookieContainer = m_container;
        }
        return request;
    }
}
But that didn't help either :-(
I have also tried to look up the tracking information through Royal Mail's SSL-protected site (https://www.royalmail.com/portal/sme/track?catId=62200738&mediaId=63900708) and implementing credentials in my C# program, but no luck there.
I have now hit the wall, and I keep bumping into the same tutorials / threads that don't seem to help me any further.
I hope there is a brilliant brain out there :-)
If you send all the headers, you should stop getting the 500 error:
string url = "http://track.royalmail.com/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage&keyname=track_blank&_requestid=17931";
using (WebClient webClient = new WebClient())
{
    webClient.Headers["User-Agent"] = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 (.NET CLR 3.5.30729)";
    webClient.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    webClient.Headers["Accept-Language"] = "en-us,en;q=0.5";
    webClient.Headers["Accept-Encoding"] = "gzip,deflate";
    webClient.Headers["Accept-Charset"] = "ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    byte[] response = webClient.DownloadData(url);
}
I have url like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download source code from that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";

static string GetSource(string guid)
{
    try
    {
        Uri url = new Uri(urlTemplate + guid);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Method = "GET";
        HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
        Stream responseStream = webResponse.GetResponseStream();
        StreamReader responseStreamReader = new StreamReader(responseStream);
        String result = responseStreamReader.ReadToEnd();
        return result;
    }
    catch (Exception)
    {
        return null;
    }
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
OK, that I understand, so I added these lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I'm not getting this message when I try to visit the page using a web browser. What can I do to get the source code? Some cookies or HTTP headers?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
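For context, a sketch of where that line could go in the GetSource method from the question, together with the CookieContainer already added there (imports and variable names as in the question; the exact User-Agent string is just an example):
static string GetSource(string guid)
{
    Uri url = new Uri(urlTemplate + guid);
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "GET";
    webRequest.CookieContainer = new CookieContainer();   // the cookie fix from the question
    webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; // the suggestion above
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}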
It looks like you're doing something that the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser you might get up to one page per second. Using an application you can fetch several pages per second, and that is probably what their web server is detecting, hence the "excessive use" message. A tiny throttling sketch follows below.
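For example, a crude way to stay around one page per second between calls to the GetSource method from the question (the guids list and the sleep value are illustrative):
foreach (string guid in guids)              // 'guids' is whatever list you are iterating over
{
    string html = GetSource(guid);
    // ... process html ...
    System.Threading.Thread.Sleep(1000);    // roughly one request per second
}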
I requested 100 pages that all return 404. I wrote:
{
    var s = DateTime.Now;
    for (int i = 0; i < 100; i++)
        DL.CheckExist("http://google.com/lol" + i.ToString() + ".jpg");
    var e = DateTime.Now;
    var d = e - s;
    d = d; // no-op, kept as a breakpoint anchor
    Console.WriteLine(d);
}
static public bool CheckExist(string url)
{
    HttpWebRequest wreq = null;
    HttpWebResponse wresp = null;
    bool ret = false;
    try
    {
        wreq = (HttpWebRequest)WebRequest.Create(url);
        wreq.KeepAlive = true;
        wreq.Method = "HEAD";
        wresp = (HttpWebResponse)wreq.GetResponse();
        ret = true;
    }
    catch (System.Net.WebException)
    {
    }
    finally
    {
        if (wresp != null)
            wresp.Close();
    }
    return ret;
}
Two runs show it takes 00:00:30.7968750 and 00:00:26.8750000. Then I tried Firefox and used the following code:
<html>
<body>
<script type="text/javascript">
for(var i=0; i<100; i++)
document.write("<img src=http://google.com/lol" + i + ".jpg><br>");
</script>
</body>
</html>
Using my computer's clock and counting, it was roughly 4 seconds. 4 seconds is 6.5-7.5x faster than my app. I plan to scan through thousands of files, so taking 3.75 hours instead of 30 minutes would be a big problem. How can I make this code faster? I know someone will say Firefox caches the images, but I want to say: 1) it still needs to check the headers from the remote server to see if they have been updated (which is what I want my app to do), and 2) I am not receiving the body; my code should only be requesting the header. So, how do I solve this?
I noticed that an HttpWebRequest hangs on the first request. I did some research and what seems to be happening is that the request is configuring or auto-detecting proxies. If you set
request.Proxy = null;
on the web request object, you might be able to avoid an initial delay.
With proxy auto-detect:
using (var response = (HttpWebResponse)request.GetResponse()) //6,956 ms
{
}
Without proxy auto-detect:
request.Proxy = null;
using (var response = (HttpWebResponse)request.GetResponse()) //154 ms
{
}
Change your code to use an asynchronous GetResponse:
public override WebResponse GetResponse() {
•••
IAsyncResult asyncResult = BeginGetResponse(null, null);
•••
return EndGetResponse(asyncResult);
}
Async Get
Probably Firefox issues multiple requests at once whereas your code does them one by one. Perhaps adding threads will speed up your program.
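A minimal sketch of that idea on .NET 4.0, issuing the HEAD checks in parallel with Parallel.ForEach (uses System.Linq, System.Collections.Concurrent and System.Threading.Tasks; the degree of parallelism is an illustrative value, so be considerate of the target server):
// Check many URLs concurrently instead of one by one.
// You may also need to raise ServicePointManager.DefaultConnectionLimit,
// since the default limit per host is very low for client apps.
var urls = Enumerable.Range(0, 100).Select(i => "http://google.com/lol" + i + ".jpg");
var results = new ConcurrentDictionary<string, bool>();

Parallel.ForEach(
    urls,
    new ParallelOptions { MaxDegreeOfParallelism = 8 }, // illustrative limit
    url => results[url] = DL.CheckExist(url));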
The answer is changing HttpWebRequest/HttpWebResponse to WebRequest/WebResponse only. That fixed the problem.
Have you tried opening the same URL in IE on the machine that your code is deployed to? If it is a Windows Server machine, then sometimes it's because the URL you're requesting is not in IE's list of secure sites (which HttpWebRequest works off). You'll just need to add it.
Do you have more info you could post? I've been doing something similar and have run into tons of problems with HttpWebRequest before, all unique. So more info would help.
BTW, calling it using the async methods won't really help in this case. It doesn't shorten the download time; it just doesn't block your calling thread, that's all.
Close the response stream when you are done: in your CheckExist(), add wresp.Close() after wresp = (HttpWebResponse)wreq.GetResponse();
OK, if you are getting status code 404 for all web pages, then it is due to not specifying credentials. So you need to add:
wreq.Credentials = CredentialCache.DefaultCredentials;
Then you may also come across status code 500; for that you need to specify a User-Agent, which looks something like the line below:
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
"A WebClient instance does not send optional HTTP headers by default. If your request requires an optional header, you must add the header to the Headers collection. For example, to retain queries in the response, you must add a user-agent header. Also, servers may return 500 (Internal Server Error) if the user agent header is missing."
reference: https://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.110).aspx
To improve the performance of the HttpWebRequest you need to add
wreq.Proxy = null;
Now the code will look like this:
static public bool CheckExist(string url)
{
    HttpWebRequest wreq = null;
    HttpWebResponse wresp = null;
    bool ret = false;
    try
    {
        wreq = (HttpWebRequest)WebRequest.Create(url);
        wreq.Credentials = CredentialCache.DefaultCredentials;
        wreq.Proxy = null;
        wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
        wreq.KeepAlive = true;
        wreq.Method = "HEAD";
        wresp = (HttpWebResponse)wreq.GetResponse();
        ret = true;
    }
    catch (System.Net.WebException)
    {
    }
    finally
    {
        if (wresp != null)
            wresp.Close();
    }
    return ret;
}
Setting the cookie is what matters, and you must add AspxAutoDetectCookieSupport=1, as in this code:
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });
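A minimal sketch of those two lines in context, assuming `target` is the Uri you are requesting (the URL here is purely illustrative):
Uri target = new Uri("http://www.example.com/page.aspx"); // illustrative URL
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(target);
req.CookieContainer = new CookieContainer();
// Pre-set the cookie ASP.NET looks for when it auto-detects cookie support.
req.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });

using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}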