I have the following code for getting a website, and it works fine. The problem comes up when I try to get a web page developed in Angular.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
request.Method = "GET";
request.Timeout = 30000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream flujo = response.GetResponseStream();
Encoding encode = Encoding.GetEncoding("utf-8");
StreamReader readStream = new StreamReader(flujo, encode);
String html;
try
{
html = readStream.ReadToEnd();
} catch(System.IO.IOException)
{
return;
}
response.Close();
readStream.Close();
HtmlAgilityPack.HtmlDocument DOM = new HtmlAgilityPack.HtmlDocument();
DOM.LoadHtml(html);
I know that Angular first supplies the skeleton of the page and then, on the client side, fetches the info and displays it.
When I try to get some info using HtmlAgilityPack, I get nothing.
My question is whether it's possible to set up HttpWebRequest, HttpWebResponse, or any other class so that it waits until the JavaScript has finished before getting the content, or something similar.
Also, I tried to get the content using WebBrowser and its LoadCompleted event, but I have the same problem.
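For reference, the WebBrowser attempt looked roughly like this (a sketch rather than my exact code; I'm showing the WinForms control here, whose DocumentCompleted event corresponds to LoadCompleted, and it needs an STA thread with a running message loop):
// WinForms WebBrowser sketch; requires an STA thread and a message loop.
var browser = new System.Windows.Forms.WebBrowser();
browser.ScriptErrorsSuppressed = true;
browser.DocumentCompleted += (s, e) =>
{
// DocumentCompleted fires when the initial document has loaded; for an Angular page
// that can still be just the skeleton, before the client-side fetch has run.
string renderedHtml = browser.DocumentText;
var dom = new HtmlAgilityPack.HtmlDocument();
dom.LoadHtml(renderedHtml);
};
browser.Navigate(url);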
Any help?
Thanks.
Related
I'm trying to fetch the HTML of a page through code:
WebRequest r = WebRequest.Create(szPageURL);
WebClient client = new WebClient();
string szHTML = null;
try
{
WebResponse resp = r.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
szHTML = sr.ReadToEnd();
}
catch (WebException ex)
{
// r.GetResponse() throws here for www.epa.gov; ex.Status is ProtocolError (403 Forbidden)
}
This code works when I use URLs like www.microsoft.com, www.google.com, or www.nasa.gov. However, when I put in www.epa.gov (using either 'http' or 'https' in the URL parameter), I get a 403 exception when executing r.GetResponse(), yet I can easily fetch the page manually in a browser. The exception is 403 (Forbidden) and the exception's status member says "ProtocolError". What does that mean, and why am I getting it on a page that actually is available? Anyone have any ideas? Thanks!
BTW - I also tried this way:
string downloadString = client.DownloadString(szPageURL);
I got the exact same exception.
Try this code; it works:
string Url = "https://www.epa.gov/";
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
request.CookieContainer = cookieJar;
request.Accept = @"text/html, application/xhtml+xml, */*";
request.Referer = @"https://www.epa.gov/";
request.Headers.Add("Accept-Language", "en-GB");
request.UserAgent = @"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)";
request.Host = @"www.epa.gov";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
String htmlString;
using (var reader = new StreamReader(response.GetResponseStream()))
{
htmlString = reader.ReadToEnd();
}
I have tried many ways to log in to an HTTPS website programmatically, but I am having issues. Every time, I get an error stating that my login and password are incorrect. I am sure they are correct, because I can log in to the site via the browser using the same credentials.
Failing Code
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://www.majesticseo.com/account/login?EmailAddress=myemail&Password=mypass&RememberMe=1");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UnsafeAuthenticatedConnectionSharing = true;
request.Method = "POST";
request.KeepAlive = true;
request.ContentType = "application/x-www-form-urlencoded";
request.AllowAutoRedirect = true;
CookieContainer container = new CookieContainer(); // declared here so the snippet compiles
request.CookieContainer = container;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
//String tmp;
foreach(Cookie cookie1 in response.Cookies)
{
container.Add(cookie1);
}
Stream stream = response.GetResponseStream();
string html = new StreamReader(stream).ReadToEnd();
Console.WriteLine("" + html);
That site uses HTTP POST for login, and does not send the username and password in the URL.
The correct login URL is https://www.majesticseo.com/account/login
You need to create a string of data to post, convert it to a byte array, set the content length, and then make your request. It is very important that the Content-Length header is sent; without it, the POST will not work.
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://www.majesticseo.com/account/login");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0";
request.Referer = "https://www.majesticseo.com/account/login";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UnsafeAuthenticatedConnectionSharing = true;
request.Method = "POST";
request.KeepAlive = true;
request.ContentType = "application/x-www-form-urlencoded";
request.AllowAutoRedirect = true;
// the post string for login form
string postData = "redirect=&EmailAddress=EMAIL&Password=PASS";
byte[] postBytes = System.Text.Encoding.ASCII.GetBytes(postData);
request.ContentLength = postBytes.Length;
System.IO.Stream str = request.GetRequestStream();
str.Write(postBytes, 0, postBytes.Length);
str.Close();
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
System.IO.Stream stream = response.GetResponseStream();
string html = new System.IO.StreamReader(stream).ReadToEnd();
Console.WriteLine("" + html);
You are trying to post something (I can't tell what from your code), but not credentials. I guess that your web page shows a web form where you enter a username (email address?) and password, and the browser then posts this form. Consequently, you need to replicate the browser's behavior: encode the form contents and send them in your POST request. Use the developer tools of a popular browser to see exactly what the client sends to the server and how it encodes the form data. Next, it's very likely that your request requires special cookies, which you can collect by visiting another page (e.g. the login page) first. Sending preset cookies (like you do in the commented code) won't work for most sites.
In other words, the proper mechanism is (see the sketch after this list):
GET the login web page
collect cookies
POST form data and pass collected cookies in the request.
collect other cookies, which could have been sent after login.
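Here is a rough sketch of that flow in code. The login URL and the form field names EmailAddress, Password, and redirect are taken from the answer above; verify them against what your browser's developer tools actually show, since your form may differ.
// 1. GET the login page first so the server can set its initial cookies.
CookieContainer cookies = new CookieContainer();
HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create("https://www.majesticseo.com/account/login");
getRequest.CookieContainer = cookies;
using (getRequest.GetResponse()) { } // the server's cookies end up in the container
// 2. POST the encoded form data, sending the collected cookies back.
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create("https://www.majesticseo.com/account/login");
postRequest.Method = "POST";
postRequest.ContentType = "application/x-www-form-urlencoded";
postRequest.CookieContainer = cookies; // same container: cookies from step 1 are sent, new ones collected
byte[] postBytes = System.Text.Encoding.ASCII.GetBytes("redirect=&EmailAddress=EMAIL&Password=PASS");
postRequest.ContentLength = postBytes.Length;
using (System.IO.Stream requestStream = postRequest.GetRequestStream())
{
requestStream.Write(postBytes, 0, postBytes.Length);
}
// 3. Any cookies set after login are now in the container for subsequent requests.
using (HttpWebResponse loginResponse = (HttpWebResponse)postRequest.GetResponse())
using (System.IO.StreamReader reader = new System.IO.StreamReader(loginResponse.GetResponseStream()))
{
string htmlAfterLogin = reader.ReadToEnd();
}
The important part is reusing the same CookieContainer instance for every request, so whatever the server sets in one step is sent back automatically in the next.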
I have a URL like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download the source code of that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";
static string GetSource(string guid)
{
try
{
Uri url = new Uri(urlTemplate + guid);
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "GET";
HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
Stream responseStream = webResponse.GetResponseStream();
StreamReader responseStreamReader = new StreamReader(responseStream);
String result = responseStreamReader.ReadToEnd();
return result;
}
catch (Exception ex)
{
return null;
}
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
OK, I understand that, so I added these lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I don't get this message when I visit the page in a web browser. What can I do to get the source code? Some cookies or HTTP headers?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
It looks like you're doing something that the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser, you might fetch up to one page per second. Using an application, you can fetch several pages per second, and that's probably what their web server is detecting; hence the "excessive use" message.
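If that's the cause, throttling your crawler is the usual fix. A minimal sketch, assuming a hypothetical list of GUIDs and reusing the GetSource method from the question:
// guidsToFetch is a hypothetical list of material GUIDs to download.
foreach (string guid in guidsToFetch)
{
string source = GetSource(guid);
// Pause between requests so the request rate looks closer to a human browsing.
System.Threading.Thread.Sleep(2000);
}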
I'm trying to fetch some webpages using the code below:
public static string FetchPage(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; sv-SE; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)";
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
req.Headers.Add("Accept-Language", "sv-se,sv;q=0.8,en-us;q=0.5,en;q=0.3");
req.Headers.Add("Accept-Encoding", "gzip,deflate");
req.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
req.Headers.Add("Keep-Alive", "115");
req.Headers.Add("Cache-Control: max-age=0");
req.AllowAutoRedirect = true;
req.IfModifiedSince = DateTime.Now;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
using (Stream resStream = resp.GetResponseStream())
{
StreamReader reader = new StreamReader(resStream);
return reader.ReadToEnd();
}
}
}
Some pages work (W3C, example.com), while most others I've tried do not (BBC.co.uk, CNN.com, etc.). Wireshark shows that I'm getting a proper response.
I've tried setting the encoding of the reader to the expected encoding of the response (UTF-8 for CNN), as well as every other possible combination, but I have had no luck.
What am I missing out on here?
The first bytes of my response are always "1f ef bf bd", in case you can tell something from that.
I suspect the most likely explanation is that you are getting compressed data and not uncompressing it. (The gzip magic number starts with 1f 8b, and "ef bf bd" is the UTF-8 replacement character, which matches the bytes you're seeing.) Try using a stream filter to deflate/unzip it. See Rick Strahl's blog article for more info.
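For example, the simplest fix in your FetchPage method is to let HttpWebRequest do the decompression for you (a sketch; the manual alternative is to wrap the response stream in a GZipStream/DeflateStream from System.IO.Compression, keyed off the Content-Encoding header):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
// Sends Accept-Encoding for you and transparently decompresses the body, so the gzip
// bytes (1f 8b ...) never reach the StreamReader.
req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
return reader.ReadToEnd(); // already uncompressed here
}
If you go this route, drop your manual Accept-Encoding header; the framework adds it itself when AutomaticDecompression is enabled.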
Loading http://bbc.co.uk worked for me when leaving out the "Accept-Encoding" header:
req.Headers.Add("Accept-Encoding", "gzip,deflate");
I have the following code that sends an HttpWebRequest to Bing. When I request the URL below, though, it returns what appears to be an empty response, when it should return a list of results.
var response = string.Empty;
var httpWebRequest = WebRequest.Create("http://www.bing.com/search?q=stackoverflow&count=100") as HttpWebRequest;
httpWebRequest.Method = WebRequestMethods.Http.Get;
httpWebRequest.Headers.Add("Accept-Language", "en-US");
httpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Win32)";
httpWebRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
using (var httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
{
Stream stream = null;
using (stream = httpWebResponse.GetResponseStream())
{
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
stream = new GZipStream(stream, CompressionMode.Decompress);
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
stream = new DeflateStream(stream, CompressionMode.Decompress);
var streamReader = new StreamReader(stream, Encoding.UTF8);
response = streamReader.ReadToEnd();
}
}
It's pretty standard code for requesting and receiving a web page. Any ideas why the response is empty? Thanks in advance.
EDIT: I had left a query string parameter off the URL, and I also had &count=100, which I have now corrected. It seems to work for values of 50 and below but returns nothing when larger. This works fine in the browser, but not for this web request.
That makes me think the issue is that the response is large and HttpWebResponse is not handling it the way I have things set up. Just a guess, though.
This works just fine on my machine. Perhaps you are IP banned from Bing?
Your code works fine on my machine.
I suggest you get yourself a copy of Fiddler and examine the actual HTTP session that occurs. It may be a proxy or firewall issue.
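If the request from your application doesn't show up in Fiddler, one common trick is to route it through Fiddler's proxy explicitly (a sketch; Fiddler's default listening port of 8888 is assumed):
// Route this request through Fiddler so the session is captured for inspection.
httpWebRequest.Proxy = new WebProxy("127.0.0.1", 8888);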