Exception when downloading data from HTTPS site - c#

I am working on a site ripper / screen scraper for looking up tracking information on the Royal Mail website. Unfortunately, Royal Mail does not provide an API, so this is the way to do it.
I keep getting the same exception no matter what I do.
(The remote server returned an error: (500) Internal Server Error.)
My base code is:
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        string url = "http://track.royalmail.com/portal/rm/track?catId=22700601&gear=authentication&forcesegment=SG-Personal";
        WebClient webClient = new WebClient();
        byte[] response = webClient.DownloadData(url);
    }
}
I have used Fiddler to investigate the data transactions made by my browser, in order to mimic them in my code. I can see Royal Mail uses cookies, so I have tried to implement a WebClient that supports cookies by adding a cookie handler to it:
public class CookieAwareWebClient : WebClient
{
    private CookieContainer m_container = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest)
        {
            (request as HttpWebRequest).CookieContainer = m_container;
        }
        return request;
    }
}
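I use it in place of the plain WebClient above, along these lines (a minimal sketch):

CookieAwareWebClient webClient = new CookieAwareWebClient();
byte[] response = webClient.DownloadData(url); // same tracking URL as before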
But that didn't help either :-(
I have also tried to look up the tracking information through Royal Mail's SSL-protected site (https://www.royalmail.com/portal/sme/track?catId=62200738&mediaId=63900708), and implementing credentials in my C# program, but no luck there.
I have now hit the wall, and I keep bumping into the same tutorials / threads that don't seem to help me any further.
I hope there is a brilliant brain out there :-)

If you send all the headers, you should stop getting the 500 error:
string url = "http://track.royalmail.com/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage&keyname=track_blank&_requestid=17931";
using(WebClient webClient = new WebClient()) {
webClient.Headers["User-Agent"] = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 (.NET CLR 3.5.30729)";
webClient.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
webClient.Headers["Accept-Language"] = "en-us,en;q=0.5";
webClient.Headers["Accept-Encoding"] = " gzip,deflate";
webClient.Headers["Accept-Charset"] = "ISO-8859-1,utf-8;q=0.7,*;q=0.7";
byte[] response = webClient.DownloadData(url);
}
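One caveat: if you advertise gzip/deflate like this, WebClient hands you the compressed bytes as-is. A small subclass can turn on automatic decompression (a sketch, untested against the Royal Mail site):

public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        if (request != null)
        {
            // Let the framework inflate gzip/deflate response bodies.
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}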

Related

Download string from bitbucket with basic authentication

I am trying to make some Bitbucket API requests using the csharp.bitbucket library. I have some code which fetches a request token and then builds up an authenticate URL. The authenticate URL looks something like:
https://bitbucket.org/api/1.0/oauth/authenticate/?oauth_token=xxxxxx
Where xxxxx is my token that I have already retrieved via the Bitbucket API.
The issue I am having is that when I try to download the URL using WebClient, I always get the Bitbucket login page, even though I am passing an authorisation header. When I hit the authenticate URL using Postman and pass through the same token and authorisation header, it all works. My code looks like this:
using (var wc = new CookieWebClient(_username, _password))
{
    pageText = wc.DownloadString(url);
}
The CookieWebClient class looks like this:
public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    public CookieWebClient(string authenticationUser, string authenticationPassword)
    {
        string credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes(authenticationUser + ":" + authenticationPassword));
        Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        ServicePointManager.DefaultConnectionLimit = 1000000;
        WebRequest request = base.GetWebRequest(address);
        request.Proxy = proxy;
        var webRequest = request as HttpWebRequest;
        if (webRequest != null)
        {
            // User-Agent is a restricted header: it must be set via the property,
            // not through the Headers collection, or HttpWebRequest throws.
            webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
            webRequest.PreAuthenticate = true;
            webRequest.AllowAutoRedirect = true;
            webRequest.Pipelined = true;
            webRequest.KeepAlive = true;
            webRequest.CookieContainer = m_container;
        }
        return request;
    }
}
It looks like the authenticate part via WebClient is not working, because when I make the DownloadString call I get the Bitbucket login page.
Anyone seen this before?
Thanks in advance
Ismail
So, in answer to my own question: after looking at Fiddler and Postman I could see that when calling authenticate it was doing a 301 redirect and losing the authorisation header, so I updated my code to hit the URL it was trying to 301 to.
So instead of authenticate I go to authorise directly, passing my token and authorisation header, and now it all works. This all used to work, so I think Bitbucket changed something at their end, hence the breakage.
So the issue is the 301 redirect losing the authorisation header that has been set. Hope this helps someone.
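If you can't switch URLs, one workaround is to follow the redirect by hand and re-apply the header. A rough sketch (url and credentials here are placeholders, not csharp.bitbucket API):

var request = (HttpWebRequest)WebRequest.Create(url);
request.Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;
request.AllowAutoRedirect = false; // stop the framework from dropping the header on the hop

using (var response = (HttpWebResponse)request.GetResponse())
{
    if ((int)response.StatusCode >= 300 && (int)response.StatusCode < 400)
    {
        // Resolve the Location header (it may be relative) and re-send the header.
        Uri location = new Uri(new Uri(url), response.Headers["Location"]);
        var redirected = (HttpWebRequest)WebRequest.Create(location);
        redirected.Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;
        // ...read redirected.GetResponse() as usual...
    }
}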
Ismail

WebClient login failure

My Code:
class MyWebClient : WebClient
{
    private CookieContainer _cookieContainer = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest)
        {
            (request as HttpWebRequest).CookieContainer = _cookieContainer;
        }
        return request;
    }
}
using (var client = new MyWebClient())
{
    var data = new NameValueCollection
    {
        { "username", "myUser" },
        { "password", "myPw" }
    };
    client.UploadValues("http://www..tv/takelogin.php", data);
}
MNM 3.4 capture and response: (not reproduced here)
Building my app I use three sites; with two of them everything works fine, but with this one it doesn't.
Passing a CookieContainer usually does the trick, but you're already sending it. Can you confirm the field names?
Also, for some websites you'll need to post back the hidden fields. I usually perform a GET to the login page and, using an HTML parser (like HtmlAgilityPack), locate the appropriate form and POST the login request with all the INPUT/SELECT fields I find (see the sketch below).
I think the best advice here is to use a debugging proxy like Fiddler, perform the login from the browser, and inspect the generated traffic.
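A rough sketch of that approach with HtmlAgilityPack (the URLs and the //form[1] XPath are placeholders — point them at the real login form, and reuse the cookie-aware client from the question):

// GET the login page, collect every named INPUT in the form, then POST them all back.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(client.DownloadString("http://example.com/login"));

var data = new NameValueCollection();
var inputs = doc.DocumentNode.SelectNodes("//form[1]//input");
if (inputs != null)
{
    foreach (var input in inputs)
    {
        string name = input.GetAttributeValue("name", "");
        if (name.Length > 0)
            data[name] = input.GetAttributeValue("value", "");
    }
}
data["username"] = "myUser"; // overwrite the credential fields
data["password"] = "myPw";

client.UploadValues("http://example.com/takelogin.php", data);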
I found the problem...
client.UploadValues("http://www..tv/takelogin.php", data);
changed to:
client.UploadValues("http://.tv/takelogin.php", data);
That means:
http://www.MY_SITE.tv
doesn't work, but
http://MY_SITE.tv
works fine.

Html Agility Pack, Web Scraping, and spoofing in C#

Is there a way to spoof a web request from C# code so it doesn't look like a bot or spam hitting the site? I am trying to web scrape my website, but keep getting blocked after a certain number of calls. I want to act like a real browser. I am using this code, from HTML Agility Pack:
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
I do way too much web scraping, but here are the options:
I have a default list of headers I add as all of these are expected from a browser:
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
(wc is my WebClient.)
As a further help - here is my webclient class that keeps cookies stored - which is also a massive help:
public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    protected override WebRequest GetWebRequest(Uri address)
    {
        ServicePointManager.DefaultConnectionLimit = 1000000;
        WebRequest request = base.GetWebRequest(address);
        request.Proxy = proxy;
        HttpWebRequest webRequest = request as HttpWebRequest;
        if (webRequest != null)
        {
            webRequest.Pipelined = true;
            webRequest.KeepAlive = true;
            webRequest.CookieContainer = m_container;
        }
        return request;
    }
}
Here is my usual use for it. Add a static copy to your base site class, alongside the parsing functions you likely have:
protected static CookieWebClient wc = new CookieWebClient();
And call it as such:
public HtmlDocument Download(string url)
{
    HtmlDocument hdoc = new HtmlDocument();
    HtmlNode.ElementsFlags.Remove("option");
    HtmlNode.ElementsFlags.Remove("select");
    Stream read = null;
    try
    {
        read = wc.OpenRead(url);
    }
    catch (ArgumentException)
    {
        read = wc.OpenRead(HttpHelper.HTTPEncode(url));
    }
    hdoc.Load(read, true);
    return hdoc;
}
The other main reason you may be crashing out is that the connection is being closed by the server because you have had an open connection for too long. You can prove this by adding a try/catch around the download part as above; if it fails, reset the WebClient and try the download again:
HtmlDocument d = new HtmlDocument();
try
{
    d = this.Download(prp.PropertyUrl);
}
catch (WebException e)
{
    this.Msg(Site.ErrorSeverity.Severe, "Error connecting to " + this.URL + " : Resubmitting..");
    wc = new CookieWebClient();
    d = this.Download(prp.PropertyUrl);
}
This saves my ass all the time: even if it was the server rejecting you, this can re-jig the lot. Cookies are cleared and you're free to roam again. If worst truly comes to worst, add proxy support and get a new proxy applied per 50-ish requests, as sketched below.
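Proxy rotation can be as simple as swapping the proxy field on the client every N requests (the proxy addresses are yours to supply; this is just a sketch using the wc and CookieWebClient from above):

private static string[] proxyAddresses = { /* your proxies here */ };
private static int requestCount = 0;

private static void RotateProxyIfNeeded()
{
    if (proxyAddresses.Length == 0) return;
    if (++requestCount % 50 == 0)
    {
        wc = new CookieWebClient(); // a fresh client clears the cookies too
        wc.proxy = new WebProxy(proxyAddresses[(requestCount / 50) % proxyAddresses.Length]);
    }
}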
That should be more than enough for you to kick your own and any other site's arse.
RATE ME!
Use a regular browser and Fiddler (if the developer tools are not up to scratch) and take a look at the request and response headers.
Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference).
In regards to "getting blocked after a certain number of calls": throttle your calls. Only make one call every x seconds (see the sketch below). Behave nicely to the site and it will behave nicely to you.
Chances are good that they simply look at the number of calls from your IP address per second, and if it passes a threshold, the IP address gets blocked.
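A minimal way to space the calls out (the two-second interval is a guess; tune it to the site):

private static DateTime lastRequest = DateTime.MinValue;
private static readonly TimeSpan minGap = TimeSpan.FromSeconds(2);

private static void Throttle()
{
    // Sleep until at least minGap has passed since the previous request.
    TimeSpan wait = lastRequest + minGap - DateTime.UtcNow;
    if (wait > TimeSpan.Zero)
        System.Threading.Thread.Sleep(wait);
    lastRequest = DateTime.UtcNow;
}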

Problems authenticating to website from code

I am trying to write code that will authenticate to the website wallbase.cc. I've looked at what it does using Firebug/Chrome developer tools, and it seems fairly easy:
POST "usrname=$USER&pass=$PASS&nopass_email=Type+in+your+e-mail+and+press+enter&nopass=0" to the page "http://wallbase.cc/user/login", store the returned cookies, and use them on all future requests.
Here is my code:
private CookieContainer _cookies = new CookieContainer();
//......
HttpPost("http://wallbase.cc/user/login", string.Format("usrname={0}&pass={1}&nopass_email=Type+in+your+e-mail+and+press+enter&nopass=0", Username, assword));
//......
private string HttpPost(string url, string parameters)
{
    try
    {
        System.Net.WebRequest req = System.Net.WebRequest.Create(url);
        // Add these, as we're doing a POST
        req.ContentType = "application/x-www-form-urlencoded";
        req.Method = "POST";
        ((HttpWebRequest)req).Referer = "http://wallbase.cc/home/";
        ((HttpWebRequest)req).CookieContainer = _cookies;
        // We need to count how many bytes we're sending. POSTed forms should be name=value& pairs.
        byte[] bytes = System.Text.Encoding.ASCII.GetBytes(parameters);
        req.ContentLength = bytes.Length;
        System.IO.Stream os = req.GetRequestStream();
        os.Write(bytes, 0, bytes.Length); // Push it out there
        os.Close();
        // Get the response
        using (System.Net.WebResponse resp = req.GetResponse())
        {
            if (resp == null) return null;
            using (Stream st = resp.GetResponseStream())
            {
                System.IO.StreamReader sr = new System.IO.StreamReader(st);
                return sr.ReadToEnd().Trim();
            }
        }
    }
    catch (Exception)
    {
        return null;
    }
}
After calling HttpPost with my login parameters, I would expect all future calls using this same method to be authenticated (assuming a valid username/password). I do get a session cookie in my cookie collection, but for some reason I'm not authenticated. I get a session cookie regardless of which page I visit, so I tried loading the home page first to get the initial session cookie and then logging in, but there was no change.
To my knowledge this Python version works: https://github.com/sevensins/Wallbase-Downloader/blob/master/wallbase.sh (line 336)
Any ideas on how to get authentication working?
Update #1
When using a correct user/password pair, the response automatically redirects to the referrer, but when an incorrect user/pass pair is submitted it does not redirect and returns a bad user/pass error. Based on this it seems as though authentication is happening, but maybe not all the key pieces of information are being saved?
Update #2
I am using .NET 3.5. When I tried the above code in .NET 4, with the added line of System.Net.ServicePointManager.Expect100Continue = false (which was in my code, just not shown here), it works with no changes necessary. The problem seems to stem directly from some pre-.NET 4 issue.
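For completeness, that line is set once, before any requests are made:

// Stops HttpWebRequest from sending "Expect: 100-continue" on POSTs,
// which some servers answer with 417 Expectation Failed.
System.Net.ServicePointManager.Expect100Continue = false;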
This is based on code from one of my projects, as well as code found from various answers here on stackoverflow.
First we need to set up a cookie-aware WebClient that is going to use HTTP 1.0.
public class CookieAwareWebClient : WebClient
{
    private CookieContainer cookie = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        request.ProtocolVersion = HttpVersion.Version10;
        request.CookieContainer = cookie;
        return request;
    }
}
Next we set up the code that handles the Authentication and then finally loads the response.
var client = new CookieAwareWebClient();
client.UseDefaultCredentials = true;
client.BaseAddress = @"http://wallbase.cc";

var loginData = new NameValueCollection();
loginData.Add("usrname", "test");
loginData.Add("pass", "123");
loginData.Add("nopass_email", "Type in your e-mail and press enter");
loginData.Add("nopass", "0");

var result = client.UploadValues(@"http://wallbase.cc/user/login", "POST", loginData);
string response = System.Text.Encoding.UTF8.GetString(result);
We can try this out using the HTML Visualizer built into Visual Studio while staying in debug mode, and use it to confirm that we were able to authenticate and load the home page while staying authenticated.
The key here is to set up a CookieContainer and use HTTP 1.0 instead of 1.1. I am not entirely sure why forcing it to use 1.0 allows you to authenticate and load the page successfully, but part of the solution is based on this answer:
https://stackoverflow.com/a/10916014/408182
I used Fiddler to make sure that the request sent by the C# client was the same as the one from my web browser (Chrome). It also allowed me to confirm that the C# client was being redirected correctly. In this case we can see that with HTTP 1.0 we get HTTP/1.0 302 Found, which then redirects us to the home page as intended. If we switch back to HTTP 1.1, we get an HTTP/1.1 417 Expectation Failed message instead.
There is some information on this error message in this Stack Overflow thread:
HTTP POST Returns Error: 417 "Expectation Failed."
Edit: Hack/Fix for .NET 3.5
I have spent a lot of time trying to figure out the difference between 3.5 and 4.0, but I seriously have no clue. It looks like 3.5 is creating a new cookie after the authentication and the only way I found around this was to authenticate the user twice.
I also had to make some changes to the WebClient based on information from this post:
http://dot-net-expertise.blogspot.fr/2009/10/cookiecontainer-domain-handling-bug-fix.html
public class CookieAwareWebClient : WebClient
{
    public CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.ProtocolVersion = HttpVersion.Version10;
            httpRequest.CookieContainer = cookies;

            // Work around the CookieContainer domain-handling bug (see the post above):
            // reach into the container's private domain table via reflection and duplicate
            // each entry without its leading dot, so cookies match on both forms of the domain.
            var table = (Hashtable)cookies.GetType().InvokeMember("m_domainTable",
                System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField | System.Reflection.BindingFlags.Instance,
                null, cookies, new object[] { });
            var keys = new ArrayList(table.Keys);
            foreach (var key in keys)
            {
                var newKey = (key as string).Substring(1);
                table[newKey] = table[key];
            }
        }
        return request;
    }
}
var client = new CookieAwareWebClient();
var loginData = new NameValueCollection();
loginData.Add("usrname", "test");
loginData.Add("pass", "123");
loginData.Add("nopass_email", "Type in your e-mail and press enter");
loginData.Add("nopass", "0");

// Hack: authenticate the user twice!
client.UploadValues(@"http://wallbase.cc/user/login", "POST", loginData);
var result = client.UploadValues(@"http://wallbase.cc/user/login", "POST", loginData);
string response = System.Text.Encoding.UTF8.GetString(result);
You may need to add the following:
// Get the response and harvest its cookies
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    foreach (Cookie c in resp.Cookies)
        _cookies.Add(c);
    // Do other stuff with the response....
}
Another thing you might have to do: if the server responds with a 302 (redirect), the .NET web request will automatically follow it, and in the process you might lose the cookie you're after. You can turn off this behaviour with the following code:
req.AllowAutoRedirect = false;
The script you reference uses a different referrer (http://wallbase.cc/start/). It is also followed by another POST, to http://wallbase.cc/user/adult_confirm/1. Try the other referrer and follow up with this POST (see the sketch below).
I think you are authenticating correctly, but the site needs more info/assertions from you before proceeding.
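Something like this, reusing the HttpPost helper from the question (this assumes its Referer is changed to "http://wallbase.cc/start/", and that the confirm POST needs no body):

// Log in using the script's referrer, then perform the follow-up confirm POST.
HttpPost("http://wallbase.cc/user/login",
    string.Format("usrname={0}&pass={1}&nopass_email=Type+in+your+e-mail+and+press+enter&nopass=0", Username, Password));
HttpPost("http://wallbase.cc/user/adult_confirm/1", "");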

matweb.com: How to get source of page?

I have a URL like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download the source code from that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";

static string GetSource(string guid)
{
    try
    {
        Uri url = new Uri(urlTemplate + guid);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Method = "GET";
        HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
        Stream responseStream = webResponse.GetResponseStream();
        StreamReader responseStreamReader = new StreamReader(responseStream);
        string result = responseStreamReader.ReadToEnd();
        return result;
    }
    catch (Exception)
    {
        return null;
    }
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
OK, that I understand, so I added these lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I'm not getting this message when I visit the page using a web browser. What can I do to get the source code? Some cookies or HTTP headers?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
It looks like you're doing something that the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser, you might fetch up to one page per second. Using an application you can fetch several pages per second, and that's probably what their web server is detecting. Hence the excessive usage.
