I have tried just about every related solution I could find on the Web, but none of them worked for me. C# - HttpWebRequest POST (Login to Facebook) does not apply either, since it uses a different method.
I am not using the POST method but the GET method for the request, and the site I am using does not need any login credentials to get the image. (Most of the site's other root domains do not require a cookie.)
The code below is part of what I put together to make the program fetch the image the way the web-based versions do, but it still has a few problems.
I originally tried a plain WebClient to download the image, but it never came back in a form the PictureBox control would accept, so I switched to HttpWebRequest.
The particular root domain I am trying to get the image from does require a cookie, though.
Below is a code snippet that tries to fetch an image from the site. The trouble is that the site will not return the image unless you pass a few extra things in the HttpWebRequest, along with a cookie.
For now, I am using a static cookie as a temporary workaround.
HttpWebRequest _request = (HttpWebRequest)HttpWebRequest.Create(_URL);
_request.Method = WebRequestMethods.Http.Get;
_request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
_request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate,sdch");
_request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
_request.Headers.Set(HttpRequestHeader.CacheControl, "max-age=0");
_request.Host = "www.habbo" + _Country;
_request.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36";
using (WebResponse _response = _request.GetResponse())
using (Stream _stream = _response.GetResponseStream())
{
Image _image = Image.FromStream(_stream);
_bitmap = new Bitmap(_image);
string contentType = _response.ContentType;
_PictureBox.Image = _bitmap;
}
Assume the following variable values:
_URL = "http://www.habbo.com/habbo-imaging/avatarimage?hb=img&user=aa&direction=2&head_direction=2&size=m&img_format=gif";
_Country = ".com";
Most of the values I am passing into the HttpWebRequest are taken from the Network tab of Google Chrome's Developer Tools.
The web-based versions of the Habbo Imager seem to simply send people to the page where the image can be found, and their browsers somehow add the cookie. What I am doing is different: instead of just displaying the page where the image is located, I want to find the image's true location and read it into an Image.
Apparently the site needs the user to "visit" it, according to what I read in this thread: Click here
What I would like to know is: is there a better way to get a valid cookie that the server will happily accept every time?
Or do I need to trick the site into thinking the user has visited and viewed the page, so that it returns the cookie I need, even though the user never actually sees the page?
I am not sure whether that would mean I need to generate the cookies dynamically.
I also do not really understand how to create, retrieve, and store cookies in C#, so please include examples if possible.
I would prefer not to use any third-party libraries or to change my code too much. The program also should not send two GET requests just to obtain what it could get with one, so this wouldn't work: Passing cookie with HttpWebRequest in winforms?
I am using .NET 4.0.
It is a little more complicated than it looks at first sight. The browser actually makes two calls. The first one returns an HTML page with a small piece of JavaScript that, when executed, sets a cookie and reloads the page. Your C# code has to mimic that.
In your form class, add an instance variable to hold all the cookies across multiple HttpWebRequest calls:
readonly CookieContainer cookiecontainer = new CookieContainer();
I have created a Builder method that creates the HttpWebRequest and returns an HttpWebResponse. It takes a NameValueCollection whose entries are added as cookies to the CookieContainer.
private HttpWebResponse Builder(string url, string host, NameValueCollection cookies)
{
HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
request.Method = WebRequestMethods.Http.Get;
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
// _request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate,sdch");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
request.Headers.Set(HttpRequestHeader.CacheControl, "max-age=0");
request.Host = host;
request.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36";
request.CookieContainer = cookiecontainer;
if (cookies != null)
{
foreach (var cookiekey in cookies.AllKeys)
{
request.CookieContainer.Add(
new Cookie(
cookiekey,
cookies[cookiekey],
#"/",
host));
}
}
return (HttpWebResponse) request.GetResponse();
}
If the incoming response turns out to have a text/html content type, we need to parse its content and return the cookie name and value. The Parse method does just that:
// find in the html and return the three parameters in a string array
// setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '127.0.0.1', 10);
private static string[] Parse(Stream _stream, string encoding)
{
const string setCookieCall = "setCookie('";
// copy html as string
var ms = new MemoryStream();
_stream.CopyTo(ms);
var html = Encoding.GetEncoding(encoding).GetString(ms.ToArray());
// find setCookie call
var findFirst = html.IndexOf(
setCookieCall,
StringComparison.InvariantCultureIgnoreCase) + setCookieCall.Length;
var last = html.IndexOf(");", findFirst, StringComparison.InvariantCulture);
var setCookieStatementCall = html.Substring(findFirst, last - findFirst);
// take the parameters
var parameters = setCookieStatementCall.Split(new[] {','});
for (int x = 0; x < parameters.Length; x++)
{
// cleanup
parameters[x] = parameters[x].Replace("'", "").Trim();
}
return parameters;
}
Now that our building blocks are complete, we can start calling our methods from the Click handler. We use a loop to call Builder up to twice to obtain a result from the given URL. Based on the received content type, we either Parse the response or create the Image from the stream.
private void button1_Click(object sender, EventArgs e)
{
var cookies = new NameValueCollection();
for (int tries = 0; tries < 2; tries++)
{
using (var response = Builder(_URL, "www.habbo" + _Country, cookies))
{
using (var stream = response.GetResponseStream())
{
string contentType = response.ContentType.ToLowerInvariant();
if (contentType.StartsWith("text/html"))
{
var parameters = Parse(stream, response.CharacterSet);
cookies.Add(parameters[0], parameters[1]);
}
if (contentType.StartsWith("image"))
{
pictureBox1.Image = Image.FromStream(stream);
break; // we're done, get out
}
}
}
}
}
Words of caution
This code works for the URL in your question. I didn't take any measures to handle other patterns or exceptions; it is up to you to add that, for example along the lines of the sketch below. Also, when doing this kind of scraping, make sure the owner of the website allows it.
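A minimal sketch of such handling (the exact policy is up to you; the message text is just an example):
// Sketch only: surface request failures from Builder/GetResponse instead of
// letting them crash the form. Wrap the body of button1_Click like this:
try
{
    // for (int tries = 0; tries < 2; tries++) { ... as above ... }
}
catch (WebException ex)
{
    // covers 4xx/5xx responses, DNS failures and timeouts
    MessageBox.Show("Could not fetch the avatar image: " + ex.Message);
}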
Related
Basically, I am making a chat app for my university's students only, and for that I have to make sure they are genuine by checking their details on the UMS (university management system) and getting their basic details, so the chat stays genuine. I am nearly done with my chat app; only the login is left.
So I want to log in to the UMS page via my website from a generic handler,
and then navigate to another page in it to access their basic info, keeping the session alive.
I did some research on HttpWebRequest but failed to log in with my credentials.
https://ums.lpu.in/lpuums
(made in asp.net)
I have tried code from other posts for the login.
I am a novice at this part, so bear with me; any help will be appreciated.
Without an actual handshake with UMS via a defined API, you would end up scraping UMS HTML, which is bad for various reasons.
I would suggest you read up on Single Sign On (SSO).
A few articles on SSO and ASP.NET -
1. Codeproject
2. MSDN
3. asp.net forum
Edit 1
Although I think this is a bad idea, since you say you are out of options, here is a link that shows how Html Agility Pack can help with scraping web pages.
Beware of the drawbacks of screen scraping: changes on the UMS side will not be communicated to you, and your application may suddenly stop working.
public string Scrap(string Username, string Password)
{
string Url1 = "https://www.example.com";//first url
string Url2 = "https://www.example.com/login.aspx";//secret url to post request to
//first request
CookieContainer jar = new CookieContainer();
HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(Url1);
request1.CookieContainer = jar;
//Get the response from the server and save the cookies from the first request..
HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse();
//second request
string postData = "***viewstate here***";//VIEWSTATE
HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(Url2);
request2.CookieContainer = jar;
request2.KeepAlive = true;
request2.Referer = Url2;
request2.Method = WebRequestMethods.Http.Post;
request2.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request2.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
request2.ContentType = "application/x-www-form-urlencoded";
request2.AllowWriteStreamBuffering = true;
request2.ProtocolVersion = HttpVersion.Version11;
request2.AllowAutoRedirect = true;
byte[] byteArray = Encoding.ASCII.GetBytes(postData);
request2.ContentLength = byteArray.Length;
Stream newStream = request2.GetRequestStream(); //open connection
newStream.Write(byteArray, 0, byteArray.Length); // Send the data.
newStream.Close();
HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
string responseData;
using (StreamReader sr = new StreamReader(response2.GetResponseStream()))
{
    responseData = sr.ReadToEnd();
}
return responseData;
}
This is the code that works for me. Anyone can plug in their own URLs and ViewState to scrape ASP.NET websites; you need to take care of the cookies too.
Other (non-ASP.NET) websites don't require a ViewState.
Use Fiddler to find what needs to go into the headers, the ViewState, or the cookies, as sketched below.
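If you would rather extract the ViewState from the first response instead of pasting a hard-coded value, here is a rough sketch (it assumes the standard ASP.NET hidden-field markup; the login field names are placeholders you would take from Fiddler; requires using System.Text.RegularExpressions;):
// Sketch: read the HTML of the first GET (response1) and build the POST
// body from the __VIEWSTATE/__EVENTVALIDATION fields it contains.
string html;
using (var reader = new StreamReader(response1.GetResponseStream()))
    html = reader.ReadToEnd();

string viewState = Regex.Match(html,
    "id=\"__VIEWSTATE\" value=\"([^\"]*)\"").Groups[1].Value;
string eventValidation = Regex.Match(html,
    "id=\"__EVENTVALIDATION\" value=\"([^\"]*)\"").Groups[1].Value;

string postData =
    "__VIEWSTATE=" + Uri.EscapeDataString(viewState) +
    "&__EVENTVALIDATION=" + Uri.EscapeDataString(eventValidation) +
    "&txtUsername=" + Uri.EscapeDataString(Username) +   // placeholder field name
    "&txtPassword=" + Uri.EscapeDataString(Password);    // placeholder field name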
Hope this helps if someone is having the same problem. :)
Is there a way to spoof a web request from C# code so it doesn't look like a bot or spam hitting the site? I am trying to web scrape my website, but keep getting blocked after a certain amount of calls. I want to act like a real browser. I am using this code, from HTML Agility Pack.
var web = new HtmlWeb();
web.UserAgent =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
I do way too much web scraping, but here are the options:
I have a default list of headers I add as all of these are expected from a browser:
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
(wc is my WebClient instance.)
As further help, here is my WebClient class that keeps cookies stored, which is also a massive help:
public class CookieWebClient : WebClient
{
public CookieContainer m_container = new CookieContainer();
public WebProxy proxy = null;
protected override WebRequest GetWebRequest(Uri address)
{
try
{
ServicePointManager.DefaultConnectionLimit = 1000000;
WebRequest request = base.GetWebRequest(address);
request.Proxy = proxy;
HttpWebRequest webRequest = request as HttpWebRequest;
if (webRequest != null)
{
    webRequest.Pipelined = true;
    webRequest.KeepAlive = true;
    webRequest.CookieContainer = m_container;
}
return request;
}
catch
{
return null;
}
}
}
Here is how I usually use it. Add a static copy to the base site class that holds your parsing functions:
protected static CookieWebClient wc = new CookieWebClient();
And call it as such:
public HtmlDocument Download(string url)
{
HtmlDocument hdoc = new HtmlDocument();
HtmlNode.ElementsFlags.Remove("option");
HtmlNode.ElementsFlags.Remove("select");
Stream read = null;
try
{
read = wc.OpenRead(url);
}
catch (ArgumentException)
{
read = wc.OpenRead(HttpHelper.HTTPEncode(url));
}
hdoc.Load(read, true);
return hdoc;
}
The other main reason you may be failing is that the server is closing the connection because it has been open for too long. You can prove this by adding a try/catch around the download part as above; if it fails, reset the WebClient and try the download again:
HtmlDocument d = new HtmlDocument();
try
{
d = this.Download(prp.PropertyUrl);
}
catch (WebException e)
{
this.Msg(Site.ErrorSeverity.Severe, "Error connecting to " + this.URL + " : Resubmitting..");
wc = new CookieWebClient();
d = this.Download(prp.PropertyUrl);
}
This saves me all the time: even if the server was rejecting you, it resets everything. Cookies are cleared and you are free to roam again. If worst comes to worst, add proxy support and apply a new proxy every 50 or so requests, along the lines of the sketch below.
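A rough sketch of that kind of proxy rotation, reusing the proxy field of the CookieWebClient above (the addresses are placeholders):
// Hypothetical proxy rotation: swap the proxy on the shared CookieWebClient
// roughly every 50 requests. The addresses below are placeholders.
private static readonly string[] proxyAddresses =
{
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080"
};
private static int requestCount;

private static void RotateProxyIfNeeded()
{
    requestCount++;
    if (requestCount % 50 == 0)
    {
        var next = proxyAddresses[(requestCount / 50) % proxyAddresses.Length];
        wc.proxy = new WebProxy(next); // picked up in GetWebRequest on the next call
    }
}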
That should be more than enough for you to scrape your own and most other sites.
Use a regular browser and Fiddler (if the developer tools are not up to scratch) and take a look at the request and response headers.
Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference).
Regarding "getting blocked after a certain amount of calls": throttle your calls. Only make one call every x seconds. Behave nicely towards the site and it will behave nicely towards you.
Chances are good that they simply look at the number of calls from your IP address per second and if it passes a threshold, the IP address gets blocked.
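A minimal throttling sketch (the two-second interval is an arbitrary example; requires using System; using System.Threading;):
// Simple rate limit: make sure consecutive requests are at least
// 'minInterval' apart before letting the next one go out.
private static DateTime lastRequest = DateTime.MinValue;
private static readonly TimeSpan minInterval = TimeSpan.FromSeconds(2);

private static void Throttle()
{
    var elapsed = DateTime.UtcNow - lastRequest;
    if (elapsed < minInterval)
        Thread.Sleep(minInterval - elapsed);
    lastRequest = DateTime.UtcNow;
}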
Scope:
I am developing a C# application to simulate queries against this site. I am quite familiar with simulating web requests to reproduce the same steps a human would take, but in code instead.
If you want to try it yourself, just type this number into the CNPJ box:
08775724000119, then fill in the captcha and click Confirmar.
I've dealt with the captcha already, so it's not a problem anymore.
Problem:
As soon as I execute the POST request for a "CNPJ", an exception is thrown:
The remote server returned an error: (403) Forbidden.
Fiddler Debugger Output:
Link for Fiddler Download
This is the request generated by my browser, not by my code
POST https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco HTTP/1.1
Host: www.sefaz.rr.gov.br
Connection: keep-alive
Content-Length: 208
Cache-Control: max-age=0
Origin: https://www.sefaz.rr.gov.br
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco
Accept-Encoding: gzip,deflate,sdch
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: GX_SESSION_ID=gGUYxyut5XRAijm0Fx9ou7WnXbVGuUYoYTIKtnDydVM%3D; JSESSIONID=OVuuMFCgQv9k2b3fGyHjSZ9a.undefined
// PostData :
_EventName=E%27CONFIRMAR%27.&_EventGridId=&_EventRowId=&_MSG=&_CONINSEST=&_CONINSESTG=08775724000119&cfield=rice&_VALIDATIONRESULT=1&BUTTON1=Confirmar&sCallerURL=http%3A%2F%2Fwww.sintegra.gov.br%2Fnew_bv.html
Code samples and References used:
I'm using a self-developed library to handle/wrap the POST and GET requests.
The request object has the same parameters (Host, Origin, Referer, Cookies, ...) as the one issued by the browser (logged by Fiddler above).
I've also managed to set the certificate validation callback by using:
ServicePointManager.ServerCertificateValidationCallback =
new RemoteCertificateValidationCallback (delegate { return true; });
After all that configuration, I am still getting the Forbidden exception.
Here is how I simulate the request where the exception is thrown:
try
{
this.Referer = Consts.REFERER;
// PARAMETERS: URL, POST DATA, ThrownException (bool)
response = Post (Consts.QUERYURL, postData, true);
}
catch (Exception ex)
{
string s = ex.Message;
}
Thanks in advance for any help / solution to my problem
Update 1:
I was missing the request for the homepage, which generates the cookies (thanks @W0lf for pointing that out).
Now there's another weird thing: Fiddler is not showing my cookies on the request, but here they are:
I made a successful request using the browser and recorded it in Fiddler.
The only things that differ from your request are:
my browser sent no value for the sCallerURL parameter (I have sCallerURL= instead of sCallerURL=http%3A%2F%2Fwww....)
the session ids are different (obviously)
I have other Accept-Language: values (I'm pretty sure this is not important)
the Content-Length is different (obviously)
Update
OK, I thought the Fiddler trace was from your application. In case you are not setting cookies on your request, do this:
before posting data, do a GET request to https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco. If you examine the response, you'll notice the website sends two session cookies.
when you do the POST request, make sure to attach the cookies you got at the previous step
If you don't know how to store the cookies and use them in the other request, take a look here.
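A minimal sketch of that cookie hand-off (the same approach the full example further down uses): share one CookieContainer between the two requests.
// Sketch: one CookieContainer shared by the GET and the POST, so the
// session cookies issued by the GET are sent automatically with the POST.
var cookies = new CookieContainer();

var get = (HttpWebRequest)WebRequest.Create("https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco");
get.CookieContainer = cookies;
using (get.GetResponse()) { } // cookies are now stored in 'cookies'

var post = (HttpWebRequest)WebRequest.Create("https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco");
post.CookieContainer = cookies; // same container: cookies are attached
post.Method = "POST";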
Update 2
The problems
OK, I managed to reproduce the 403, figured out what caused it, and found a fix.
What happens in the POST request is that:
the server responds with status 302 (temporary redirect) and the redirect location
the browser redirects (basically does a GET request) to that location, also posting the two cookies.
.NET's HttpWebRequest attempts to do this redirect seamlessly, but in this case there are two issues (that I would consider bugs in the .NET implementation):
the GET request after the POST(redirect) has the same content-type as the POST request (application/x-www-form-urlencoded). For GET requests this shouldn't be specified
cookie handling issue (the most important issue) - The website sends two cookies: GX_SESSION_ID and JSESSIONID. The second has a path specified (/sintegra), while the first does not.
Here's the difference: the browser assigns the first cookie a default path of / (root), while .NET assigns it the request URL path (/sintegra/servlet/hwsintco).
Because of this, the last GET request (after the redirect) to /sintegra/servlet/hwsintpe... does not get the first cookie, since its path does not match.
The fixes
For the redirect problem (GET with content-type), the fix is to do the redirect manually, instead of relying on .NET for this.
To do this, tell it to not follow redirects:
postRequest.AllowAutoRedirect = false
and then read the redirect location from the POST response and manually do a GET request on it.
The cookie problem (that has happened to others as well)
For this, the fix I found was to take the misplaced cookie from the CookieContainer, set its path correctly, and add it back to the container in the correct location.
This is the code to do it:
private void FixMisplacedCookie(CookieContainer cookieContainer)
{
var misplacedCookie = cookieContainer.GetCookies(new Uri(Url))[0];
misplacedCookie.Path = "/"; // instead of "/sintegra/servlet/hwsintco"
// place the cookie in the right place...
cookieContainer.SetCookies(
new Uri("https://www.sefaz.rr.gov.br/"),
misplacedCookie.ToString());
}
Here's all the code to make it work:
using System;
using System.IO;
using System.Net;
using System.Text;
namespace XYZ
{
public class Crawler
{
const string Url = "https://www.sefaz.rr.gov.br/sintegra/servlet/hwsintco";
public void Crawl()
{
var cookieContainer = new CookieContainer();
/* initial GET Request */
var getRequest = (HttpWebRequest)WebRequest.Create(Url);
getRequest.CookieContainer = cookieContainer;
ReadResponse(getRequest); // nothing to do with this, because captcha is f##%ing dumb :)
/* POST Request */
var postRequest = (HttpWebRequest)WebRequest.Create(Url);
postRequest.AllowAutoRedirect = false; // we'll do the redirect manually; .NET does it badly
postRequest.CookieContainer = cookieContainer;
postRequest.Method = "POST";
postRequest.ContentType = "application/x-www-form-urlencoded";
var postParameters =
"_EventName=E%27CONFIRMAR%27.&_EventGridId=&_EventRowId=&_MSG=&_CONINSEST=&" +
"_CONINSESTG=08775724000119&cfield=much&_VALIDATIONRESULT=1&BUTTON1=Confirmar&" +
"sCallerURL=";
var bytes = Encoding.UTF8.GetBytes(postParameters);
postRequest.ContentLength = bytes.Length;
using (var requestStream = postRequest.GetRequestStream())
requestStream.Write(bytes, 0, bytes.Length);
var webResponse = postRequest.GetResponse();
ReadResponse(postRequest); // not interested in this either
var redirectLocation = webResponse.Headers[HttpResponseHeader.Location];
var finalGetRequest = (HttpWebRequest)WebRequest.Create(redirectLocation);
/* Apply fix for the cookie */
FixMisplacedCookie(cookieContainer);
/* do the final request using the correct cookies. */
finalGetRequest.CookieContainer = cookieContainer;
var responseText = ReadResponse(finalGetRequest);
Console.WriteLine(responseText); // Hooray!
}
private static string ReadResponse(HttpWebRequest getRequest)
{
using (var responseStream = getRequest.GetResponse().GetResponseStream())
using (var sr = new StreamReader(responseStream, Encoding.UTF8))
{
return sr.ReadToEnd();
}
}
private void FixMisplacedCookie(CookieContainer cookieContainer)
{
var misplacedCookie = cookieContainer.GetCookies(new Uri(Url))[0];
misplacedCookie.Path = "/"; // instead of "/sintegra/servlet/hwsintco"
// place the cookie in the right place...
cookieContainer.SetCookies(
new Uri("https://www.sefaz.rr.gov.br/"),
misplacedCookie.ToString());
}
}
}
Sometimes HttpWebRequest needs proxy initialization:
request.Proxy = new WebProxy();//in my case it doesn't need parameters, but you can set it to your proxy address
I have url like:
http://www.matweb.com/search/DataSheet.aspx?MatGUID=849e2916ab1541be9ff6a17b78f95c82
I want to download source code from that page using this code:
private static string urlTemplate = @"http://www.matweb.com/search/DataSheet.aspx?MatGUID=";
static string GetSource(string guid)
{
try
{
Uri url = new Uri(urlTemplate + guid);
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "GET";
HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
Stream responseStream = webResponse.GetResponseStream();
StreamReader responseStreamReader = new StreamReader(responseStream);
String result = responseStreamReader.ReadToEnd();
return result;
}
catch (Exception ex)
{
return null;
}
}
When I do so I get:
You do not seem to have cookies enabled. MatWeb Requires cookies to be enabled.
Ok, that I understand, so I added lines:
CookieContainer cc = new CookieContainer();
webRequest.CookieContainer = cc;
I got:
Your IP Address has been restricted due to excessive use. The problem may be compounded when an IP address may be shared by many people in a company or through an internet service provider. We apologize for any inconvenience.
I can understand this, but I do not get this message when I visit the page in a web browser. What can I do to get the source code? Some cookies or HTTP headers?
It probably doesn't like your UserAgent. Try this:
webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"; //maybe substitute your own in here
It looks like you're doing something that the company doesn't like, if you got an "excessive use" response.
You are downloading pages too fast.
When you use a browser you might fetch at most one page per second. With an application you can fetch several pages per second, and that is probably what their web server is detecting. Hence the excessive usage.
I requested 100 pages that all return 404. I wrote:
{
var s = DateTime.Now;
for(int i=0; i < 100;i++)
DL.CheckExist("http://google.com/lol" + i.ToString() + ".jpg");
var e = DateTime.Now;
var d = e-s;
d=d;
Console.WriteLine(d);
}
static public bool CheckExist(string url)
{
HttpWebRequest wreq = null;
HttpWebResponse wresp = null;
bool ret = false;
try
{
wreq = (HttpWebRequest)WebRequest.Create(url);
wreq.KeepAlive = true;
wreq.Method = "HEAD";
wresp = (HttpWebResponse)wreq.GetResponse();
ret = true;
}
catch (System.Net.WebException)
{
}
finally
{
if (wresp != null)
wresp.Close();
}
return ret;
}
Two runs show it takes 00:00:30.7968750 and 00:00:26.8750000. Then I tried Firefox and used the following code:
<html>
<body>
<script type="text/javascript">
for(var i=0; i<100; i++)
document.write("<img src=http://google.com/lol" + i + ".jpg><br>");
</script>
</body>
</html>
Timing it by hand, it took roughly 4 seconds, which is 6.5-7.5x faster than my app. I plan to scan through thousands of files, so taking 3.75 hours instead of 30 minutes would be a big problem. How can I make this code faster? I know someone will say Firefox caches the images, but 1) it still needs to check the headers from the remote server to see whether they have been updated (which is what I want my app to do), and 2) I am not receiving the body; my code should only be requesting the header. So, how do I solve this?
I noticed that an HttpWebRequest hangs on the first request. I did some research and what seems to be happening is that the request is configuring or auto-detecting proxies. If you set
request.Proxy = null;
on the web request object, you might be able to avoid an initial delay.
With proxy auto-detect:
using (var response = (HttpWebResponse)request.GetResponse()) //6,956 ms
{
}
Without proxy auto-detect:
request.Proxy = null;
using (var response = (HttpWebResponse)request.GetResponse()) //154 ms
{
}
Change your code to use the asynchronous GetResponse:
public override WebResponse GetResponse() {
    // ...
    IAsyncResult asyncResult = BeginGetResponse(null, null);
    // ...
    return EndGetResponse(asyncResult);
}
Async Get
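As a rough sketch of how that pattern could be applied to the HEAD checks above (the URL and callback body are illustrative, not the poster's exact code):
// Issue the HEAD request asynchronously so many checks can overlap instead
// of waiting for each response in turn.
string url = "http://google.com/lol0.jpg"; // example URL from the question
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "HEAD";
request.BeginGetResponse(ar =>
{
    try
    {
        using (var response = (HttpWebResponse)request.EndGetResponse(ar))
        {
            // reaching here means the resource exists
        }
    }
    catch (WebException)
    {
        // 404 or other failure: treat as "does not exist"
    }
}, null);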
Probably Firefox issues multiple requests at once whereas your code does them one by one. Perhaps adding threads will speed up your program.
The answer is changing HttpWebRequest/HttpWebResponse to WebRequest/WebResponse only. That fixed the problem.
Have you tried opening the same URL in IE on the machine your code is deployed to? If it is a Windows Server machine, it is sometimes because the URL you are requesting is not in IE's list of trusted sites (which HttpWebRequest works off). You'll just need to add it.
Do you have more info you could post? I've been doing something similar and have run into tons of problems with HttpWebRequest before, all unique, so more info would help.
BTW, calling it using the async methods won't really help in this case. It doesn't shorten the download time; it just doesn't block your calling thread.
Close the response stream when you are done: in your CheckExist(), add wresp.Close() after wresp = (HttpWebResponse)wreq.GetResponse();
OK, if you are getting status code 404 for all web pages, it is due to not specifying credentials. So you need to add:
wreq.Credentials = CredentialCache.DefaultCredentials;
Then you may also come across status code 500; for that you need to specify a user agent, which looks something like the line below:
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
"A WebClient instance does not send optional HTTP headers by default. If your request requires an optional header, you must add the header to the Headers collection. For example, to retain queries in the response, you must add a user-agent header. Also, servers may return 500 (Internal Server Error) if the user agent header is missing."
reference: https://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.110).aspx
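For example, a minimal sketch of adding the user-agent header to a WebClient (the URL and UA string are placeholders):
// WebClient sends no optional headers by default, so add the user-agent
// explicitly before making the request.
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)";
    string html = client.DownloadString("http://example.com/");
}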
To improve the performance of the HttpWebRequest you need to add:
wreq.Proxy = null;
Now the code will look like this:
static public bool CheckExist(string url)
{
HttpWebRequest wreq = null;
HttpWebResponse wresp = null;
bool ret = false;
try
{
wreq = (HttpWebRequest)WebRequest.Create(url);
wreq.Credentials = CredentialCache.DefaultCredentials;
wreq.Proxy=null;
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
wreq.KeepAlive = true;
wreq.Method = "HEAD";
wresp = (HttpWebResponse)wreq.GetResponse();
ret = true;
}
catch (System.Net.WebException)
{
}
finally
{
if (wresp != null)
wresp.Close();
}
return ret;
}
Setting the cookie matters, and you must add AspxAutoDetectCookieSupport=1, like this:
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });