So I have a list of URLs that I'm processing through and in a couple of cases I run into an argument exception due to gzip encoding issues, so I drew up this code to resolve the gzip encoding problems.
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(uri);
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (WebResponse webResponse = req.GetResponse())
// On the second iteration we never get beyond this line
{
HttpWebResponse httpWebResponse = webResponse as HttpWebResponse;
using (StreamReader reader = new StreamReader(httpWebResponse.GetResponseStream()))
{
source = reader.ReadToEnd();
}
httpWebResponse.Close();
}
req.Abort();
This works for the first URL that needs this processing. however the second URL that needs processing times out. I'm not sure what it is that I'm missing to get this to work consistently.
Now I do have the URLs being sent to the above method inside a foreach loop.
foreach (string uri in _UriAddresses)
{
ProcessListItem(uri);
}
Let me know if there's anything that's not shown that would shed light on this issue.
Related
I have the following code for getting a website and it works fine. The problem come up when I try to get a web page developed in Angular.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
request.Method = "GET";
request.Timeout = 30000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream flujo = response.GetResponseStream();
Encoding encode = Encoding.GetEncoding("utf-8");
StreamReader readStream = new StreamReader(flujo, encode);
String html;
try
{
html = readStream.ReadToEnd();
} catch(System.IO.IOException)
{
return;
}
response.Close();
readStream.Close();
HtmlAgilityPack.HtmlDocument DOM = new HtmlAgilityPack.HtmlDocument();
DOM.LoadHtml(html);
I know Angular first supply the skeleton of the page and in client side, fecth for info and display it.
When I try to get some info using HtmlAgilityPack, I get nothing.
My question is if it's possible to setup HttpWebRequest or HttpWebResponse or any other class to indicate to wait for javascript is done before getting the content or something similar.
Also, I tried to get content using WebBrowser and used the loadCompleted event and the same problem.
Any help?
Thanks.
I'm trying to fetch the HTML of a page through code:
WebRequest r = WebRequest.Create(szPageURL);
WebClient client = new WebClient();
try
{
WebResponse resp = r.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
szHTML = sr.ReadToEnd();
}
This code works when I use URLs like www.microsoft.com, www.google.com, or www.nasa.gov. However, when I put in www.epa.gov (using either 'http' or 'https' in the URL parameter), I get a 403 exception when executing r.GetResponse(). Yet I can easily fetch the page manually in a browser. The exception I'm getting is 403 (Forbidden) and the exception status member says "ProtocolError". What does that mean? Why I am I getting this on a page that actually is available? Anyone have any ideas? Thanks!
BTW - I also tried this way:
string downloadString = client.DownloadString(szPageURL);
Got exact same exception.
try this code, it works:
string Url = "https://www.epa.gov/";
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
request.CookieContainer = cookieJar;
request.Accept = #"text/html, application/xhtml+xml, */*";
request.Referer = #"https://www.epa.gov/";
request.Headers.Add("Accept-Language", "en-GB");
request.UserAgent = #"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)";
request.Host = #"www.epa.gov";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
String htmlString;
using (var reader = new StreamReader(response.GetResponseStream()))
{
htmlString = reader.ReadToEnd();
}
I send a POST request by this way:
HttpWebResponse res = null;
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "POST";
req.CookieContainer = cookieContainer;
req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3";
req.ContentType = "application/x-www-form-urlencoded";
ASCIIEncoding encoding = new ASCIIEncoding();
byte[] loginDataBytes = encoding.GetBytes(requestCommand);
req.ContentLength = loginDataBytes.Length;
Stream stream = req.GetRequestStream();
stream.Write(loginDataBytes, 0, loginDataBytes.Length);
stream.Close();
//line below is my way to send request, but it return a response.
res = (HttpWebResponse)req.GetResponse();
I just want to send request to WebServer and I don't want to receive reponse. It wastes time and bandwidth. Are there anyway to do that?
Thank you very much!
No, there's no way. The HTTP protocol is two way. It is a request/response protocol. You need to call the GetResponse method if you want to send the request that you prepared earlier.
I'm trying to fetch some webpages using the code below:
public static string FetchPage(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; sv-SE; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729";
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
req.Headers.Add("Accept-Language", "sv-se,sv;q=0.8,en-us;q=0.5,en;q=0.3");
req.Headers.Add("Accept-Encoding", "gzip,deflate");
req.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
req.Headers.Add("Keep-Alive", "115");
req.Headers.Add("Cache-Control: max-age=0");
req.AllowAutoRedirect = true;
req.IfModifiedSince = DateTime.Now;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
using (Stream resStream = resp.GetResponseStream())
{
StreamReader reader = new StreamReader(resStream);
return reader.ReadToEnd();
}
}
}
Some pages work (W3C, example.com) while most others I've tried do not (BBC.co.uk, CNN.com, etc). Wireshark shows that I'm getting a proper reponse.
I've tried setting the encoding of the reader to the expected encoding of the response (CNN - utf8) as well as every possible combination but I have had no luck.
What am I missing out on here?
The first bytes of my response are always "1f ef bf bd" if you're able to tell something based on that.
I suspect the most likely explanation is that you are getting compressed data and not uncompressing it. Try using a stream filter to deflate/unzip it. See Rick Strahl's blog article for more info.
Loading http://bbc.co.uk worked for me when leaving out the "Accept-Encoding" header:
req.Headers.Add("Accept-Encoding", "gzip,deflate");
I have the following code that sends a HttpWebRequest to Bing. When I request the url below though it returns what appears to be an empty response when it should be returning a list of results.
var response = string.Empty;
var httpWebRequest = WebRequest.Create("http://www.bing.com/search?q=stackoverflow&count=100") as HttpWebRequest;
httpWebRequest.Method = WebRequestMethods.Http.Get;
httpWebRequest.Headers.Add("Accept-Language", "en-US");
httpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Win32)";
httpWebRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
using (var httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
{
Stream stream = null;
using (stream = httpWebResponse.GetResponseStream())
{
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
stream = new GZipStream(stream, CompressionMode.Decompress);
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
stream = new DeflateStream(stream, CompressionMode.Decompress);
var streamReader = new StreamReader(stream, Encoding.UTF8);
response = streamReader.ReadToEnd();
}
}
Its pretty standard code for requesting and receiving a web page. Any ideas why the response is empty? Thanks in advance.
EDIT I left off a query string parameter in the url. I also had &count=100 which I have now corrected. It seems to work for values of 50 and below but returns nothing when larger. This works ok when in the browser, but not for this web request.
It makes me think the issue is that the response is large and HttpWebResponse is not handling that for me the way I have it set up. Just a guess though.
This works just fine on my machine. Perhaps you are IP banned from Bing?
Your code works fine on my machine.
I suggest you get yourself a copy of Fiddler and examine the actual HTTP sesssion occuring. May be a proxy or firewall thing.