When trying to grab the html of a webpage, very occasionally I get the exception "Too many redirections were attempted".
An example of such a website is http://www.magicshineuk.co.uk/
Normally I would set the timeout to something like 6 seconds, but even with 30 seconds, and the maximum allowed redirections set to something crazy like 200, it will still throw either the "Too many redirections" exception or a timeout will occur.
How can I get around this problem?
My code is below...
try
{
System.Net.WebRequest request = System.Net.WebRequest.Create("http://www.magicshineuk.co.uk/");
var hwr = ((HttpWebRequest)request);
hwr.UserAgent ="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0";
hwr.Headers.Add("Accept-Language", "en-US,en;q=0.5");
hwr.Headers.Add("Accept-Encoding", "gzip, deflate");
hwr.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"; // this is an Accept value; a GET has no request body, so ContentType does not apply
hwr.KeepAlive = true;
hwr.Timeout = 30000; // 30 seconds... normally set to 6000
hwr.Method = "GET";
hwr.AllowAutoRedirect = true;
hwr.CookieContainer = new System.Net.CookieContainer();
// Setting this Makes no difference... normally I would like to keep to a sensible maximum but I will leave as the default of 50 if needs be...
// Either way, the Too Many Redirections exception occurs
hwr.MaximumAutomaticRedirections = 200;
using (var response = (HttpWebResponse)hwr.GetResponse())
{
Console.WriteLine(String.Format("{0} {1}", (int)response.StatusCode, response.StatusCode));
Console.WriteLine(response.ResponseUri);
Console.WriteLine("Last modified: {0}", response.LastModified);
Console.WriteLine("Server: {0}", response.Server);
Console.WriteLine("Supports Headers: {0}", response.SupportsHeaders);
Console.WriteLine("Headers: ");
// do something... e.g:
Dictionary<string, string> hc = new Dictionary<string, string>();
foreach (string hname in response.Headers.AllKeys)
{
hc.Add(hname, response.Headers[hname]);
}
foreach (var di in hc)
{
Console.WriteLine(" {0} = {1}", di.Key, di.Value);
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception: ");
Console.WriteLine(ex.Message);
}
I tried your code (I needed to comment out the line hwr.Host = Utils.GetSimpleUrl(url);) and it worked fine. If you are polling frequently, the target site, or something in between (a proxy, firewall, etc.), may be treating your polling as a denial of service and timing you out for a set duration. Alternatively, if you are behind a corporate firewall you may be getting similar treatment from an internal network appliance.
How often are you running this scraper?
Edited to add:
I tried this using .NET 4.5.2, Windows 7 x64, and Visual Studio 2015
The target site could also be unreliable (up and down)
There may be intermittent network problems between you and the target site
They may possibly expose an API which would be a more reliable integration
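If you want to see where the redirect loop on that site actually goes, one option is to turn off automatic redirects and follow the Location headers yourself. A minimal sketch, assuming the URL from the question; the 20-hop limit is arbitrary, and a relative Location value would need resolving against the current URL first:
    using System;
    using System.Net;

    class RedirectTrace
    {
        static void Main()
        {
            string url = "http://www.magicshineuk.co.uk/";
            var cookies = new CookieContainer();   // some redirect loops only end once cookies are echoed back

            for (int hop = 0; hop < 20 && url != null; hop++)
            {
                var req = (HttpWebRequest)WebRequest.Create(url);
                req.AllowAutoRedirect = false;      // handle redirects ourselves
                req.CookieContainer = cookies;
                req.UserAgent = "Mozilla/5.0";

                using (var resp = (HttpWebResponse)req.GetResponse())
                {
                    Console.WriteLine("{0} {1} -> {2}", (int)resp.StatusCode, url, resp.Headers["Location"]);
                    url = resp.Headers["Location"]; // null once there is no further redirect
                }
            }
        }
    }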
Related
I'm coding a multithreaded web crawler that performs a lot of concurrent HttpWebRequests every second using hundreds of threads. The application works great, but sometimes (randomly) one of the web requests hangs on GetResponseStream(), completely ignoring the timeout (this happens when I perform hundreds of requests concurrently), so the crawling process never ends. The strange thing is that with Fiddler this never happens and the application never hangs. It is really hard to debug because it happens randomly.
I've tried to set
Keep-Alive = false
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3;
but I still get the strange behavior. Any ideas?
Thanks
HttpWebRequest code:
public static string RequestHttp(string url, string referer, ref CookieContainer cookieContainer_0, IWebProxy proxy)
{
string str = string.Empty;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
request.UserAgent = randomuseragent();
request.ContentType = "application/x-www-form-urlencoded";
request.Accept = "*/*";
request.CookieContainer = cookieContainer_0;
request.Proxy = proxy;
request.Timeout = 15000;
request.Referer = referer;
//request.ServicePoint.MaxIdleTime = 15000;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (Stream responseStream = response.GetResponseStream())
{
List<byte> list = new List<byte>();
byte[] buffer = new byte[0x400];
int count = responseStream.Read(buffer, 0, buffer.Length);
while (count != 0)
{
list.AddRange(buffer.ToList<byte>().GetRange(0, count));
if (list.Count >= 0x100000)
{
break;
}
count = 0;
try
{
count = responseStream.Read(buffer, 0, buffer.Length); // <-- HERE IT HANGS SOMETIMES
continue;
}
catch
{
continue;
}
}
//responseStream.Close();
int num2 = 0x200 * 0x400;
if (list.Count >= num2)
{
list.RemoveRange((num2 * 3) / 10, list.Count - num2);
}
byte[] bytes = list.ToArray();
str = Encoding.Default.GetString(bytes);
Encoding encoding = Encoding.Default;
if (str.ToLower().IndexOf("charset=") > 0)
{
encoding = GetEncoding(str);
}
else
{
try
{
encoding = Encoding.GetEncoding(response.CharacterSet);
}
catch
{
}
}
str = encoding.GetString(bytes);
// response.Close();
}
}
return str.Trim();
}
The Timeout property "Gets or sets the time-out value in milliseconds for the GetResponse and GetRequestStream methods." The default value is 100,000 milliseconds (100 seconds).
The ReadWriteTimeout property, "Gets or sets a time-out in milliseconds when writing to or reading from a stream." The default is 300,000 milliseconds (5 minutes).
You're setting Timeout, but leaving ReadWriteTimeout at the default, so your reads can take up to five minutes before timing out. You probably want to set ReadWriteTimeout to a lower value. You might also consider limiting the size of data that you download. With my crawler, I'd sometimes stumble upon an unending stream that would eventually result in an out of memory exception.
Something else I noticed when crawling is that sometimes closing the response stream will hang. I found that I had to call request.Abort to reliably terminate a request if I wanted to quit before reading the entire stream.
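To make that concrete, here is a minimal sketch of those suggestions: cap both timeouts, stop reading after a fixed size, and Abort() the request if it fails mid-read. The URL handling, the 15-second values, and the 1 MB cap are illustrative choices, not from the question:
    using System;
    using System.IO;
    using System.Net;

    static class BoundedFetch
    {
        const int MaxBytes = 1024 * 1024; // stop after 1 MB so an unending stream cannot exhaust memory

        public static byte[] Fetch(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 15000;          // limit for GetResponse/GetRequestStream
            request.ReadWriteTimeout = 15000; // limit for each stream read; default is 300,000 ms
            try
            {
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var stream = response.GetResponseStream())
                using (var ms = new MemoryStream())
                {
                    var buffer = new byte[8192];
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        ms.Write(buffer, 0, read);
                        if (ms.Length >= MaxBytes)
                            break; // cap the download size
                    }
                    return ms.ToArray();
                }
            }
            catch
            {
                request.Abort(); // reliably tears the request down if we give up mid-read
                throw;
            }
        }
    }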
There is nothing apparent in the code you provided.
Why did you comment response.Close() out?
Documentation hints that connections may run out if not explicitly closed. Disposing the response may close the connection, but just releasing all the resources is not optimal, I think. Closing the response will also close the stream, so that is covered.
The system hanging without a timeout can simply be a network issue making the response object a dead duck, or the problem may be due to the high number of threads resulting in memory fragmentation.
Looking at anything that may produce a pattern may help find the source:
How many threads are typically running (can you bundle request sets into fewer threads)
What is the network performance like at the time a thread stops
Is there a specific count or range when it happens
What data was processed last when it happened (are there any specific control characters or sequences of data that can upset the stream)
I'd like to ask more questions, but I don't have enough reputation so I can only reply.
Good luck!
Below is some code that does something similar; it's also used to access multiple web sites, and each call is in a different task. The difference is that I only read the stream once and then parse the results. That might be a way to get around the stream reader locking up randomly, or at least make it easier to debug.
try
{
_webResponse = (HttpWebResponse)_request.GetResponse();
if(_request.HaveResponse)
{
if (_webResponse.StatusCode == HttpStatusCode.OK)
{
var _stream = _webResponse.GetResponseStream();
using (var _streamReader = new StreamReader(_stream))
{
string str = _streamReader.ReadToEnd();
// ... parse str here ...
}
}
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
I am building an application that relies heavily on the loading speed of a web page.
I am not getting good results with HttpWebResponse in C#. I am getting better results with web browsers like Chrome and IE.
Here are the stats that I collected:
HttpWebResponse (C#) = 17 Seconds / 20 Requests
Javascript/iFrame on Chrome = 9 seconds / 20 requests
Javascript/iFrame on IE = 11 seconds / 20 requests
Question #1
Is there anything I can do to optimize my code for better performance?
Question #2
I can click the start button twice and open two connections, so that I can get on par with browser performance. This works great; however, the website I send requests to has a limit. If I send a new request before the other one is completed, it blocks my connection for 10 minutes. Is there a way I can prevent this?
My Thread:
void DomainThreadNamecheapStart()
{
while (stop == false)
{
foreach (string FromDomainList in DomainList.Lines)
{
if (FromDomainList.Length > 1)
{
// I removed my api parameters from the string
string namecheapapi = "https://api.namecheap.com/foo" + FromDomainList + "bar";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(namecheapapi);
request.Proxy = null;
request.ServicePoint.Expect100Continue = false;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
status.Text = FromDomainList + "\n" + sr.ReadToEnd();
sr.Close();
}
}
}
}
My Button:
private void button2_Click(object sender, EventArgs e)
{
stop = false;
Thread DomainThread = new Thread(new ThreadStart(DomainThreadNamecheapStart));
DomainThread.Start();
}
My Old Question:
How do I increase the performance of HttpWebResponse?
You're creating a thread every time the button is pressed. Creating a thread is expensive and takes time by itself. Try using a thread from an existing thread pool (try QueueUserWorkItem) and see if that helps.
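As a minimal sketch of that suggestion (reusing DomainThreadNamecheapStart and the stop flag from the question), the click handler could queue the work instead of creating a thread:
    private void button2_Click(object sender, EventArgs e)
    {
        stop = false;
        // Borrow a pool thread instead of paying the cost of creating a new one per click.
        System.Threading.ThreadPool.QueueUserWorkItem(_ => DomainThreadNamecheapStart());
    }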
I'm doing some HTTP header testing to check whether a URL is alive or not. I'm doing this with randomly generated string URLs in a while loop which calls the HttpRequest function. The problem is that, as long as HttpWebRequest is async, the while loop keeps running, spawning lots of requests and checking an awful lot of links at the same time. What I would like to do is either delay the while loop for some seconds/milliseconds or simply have HttpWebRequest handle only about 3 requests at a time. I'm just lost here and I don't know how to do it.
My while loop looks like this
String Episode = textBox1.Text;
String Rand = newInt(16);
String Url = "http://someurl.com?_" + Episode + "paradisehotel_" + Rand + ".wmv";
while (checkUrl(Url) == false)
{
Rand = newInt(16);
while (isInList(Rand, list))
{
Rand = newInt(16);
}
list.Add(Rand);
Url = "someurl.com_" + Episode + "paradisehotel_" + Rand + ".wmv";
}
My CheckUrl function looks like this
private bool checkUrl(String url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
try
{
using (WebResponse response = req.GetResponse())
{
return true;
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
return false;
}
}
I hope someone way more clever than me has a solution.
Thank you,
Jonas
Take the example here (too much code to copy and paste it all here), which uses an async callback, and increment a static counter inside the callback after you have loaded the response. Then all you need to do is check, in each iteration of the while loop before executing the next request, that the counter isn't over a maximum value, sleeping with Thread.Sleep while you wait.
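A minimal sketch of that throttling idea, restructured slightly so the counter tracks in-flight requests and is decremented in the callback; the CheckUrlAsync name, the 100 ms sleep, and the limit of 3 are illustrative:
    using System;
    using System.Net;
    using System.Threading;

    static class ThrottledChecker
    {
        const int MaxInFlight = 3;   // at most 3 requests at a time
        static int inFlight = 0;     // incremented before issuing a request, decremented in the callback

        public static void CheckUrlAsync(string url)
        {
            // Wait until a slot frees up before issuing the next request.
            while (Thread.VolatileRead(ref inFlight) >= MaxInFlight)
                Thread.Sleep(100);

            Interlocked.Increment(ref inFlight);
            var req = (HttpWebRequest)WebRequest.Create(url);
            req.BeginGetResponse(ar =>
            {
                try
                {
                    using (var resp = (HttpWebResponse)req.EndGetResponse(ar))
                        Console.WriteLine("{0} -> {1}", url, resp.StatusCode);
                }
                catch (WebException ex)
                {
                    Console.WriteLine("{0} -> {1}", url, ex.Message);
                }
                finally
                {
                    Interlocked.Decrement(ref inFlight); // free the slot
                }
            }, null);
        }
    }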
I have a console app that makes a single request to a web page and returns its server status eg. 200, 404, etc..
I'd like to apply following changes:
List of User Inputs:
Url to request
How many parallel connections to use(concurrent users)
How long (in seconds) to keep submitting as many requests as it can
List of Outputs:
Show Total Fetches
Show Fetches per Second
Show Average Response Time (ms)
I imagine the best way to do it is to run multiple HTTP fetches in parallel within a single process, so it doesn't bog down the client machine.
I really like C# but I'm still new to it. I've researched other articles about this but I don't fully understand them so any help would be greatly appreciated.
My Code:
static void Main(string[] args)
{
try
{
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("http://10.10.1.6/64k.html");
webRequest.AllowAutoRedirect = false;
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
//Returns "MovedPermanently", not 301 which is what I want.
int i_goodResponse = (int)response.StatusCode;
string s_goodResponse = response.StatusCode.ToString();
Console.WriteLine("Normal Response: " + i_goodResponse + " " + s_goodResponse);
Console.ReadLine();
}
catch (WebException we)
{
int i_badResponse = (int)((HttpWebResponse)we.Response).StatusCode;
string s_badResponse = ((HttpWebResponse)we.Response).StatusCode.ToString();
Console.WriteLine("Error Response: " + i_badResponse + " " + s_badResponse);
Console.ReadLine();
}
}
Some possible code that I found:
void StartWebRequest()
{
HttpWebRequest webRequest = ...;
webRequest.BeginGetResponse(new AsyncCallback(FinishWebRequest), webRequest);
}
void FinishWebRequest(IAsyncResult result)
{
HttpWebResponse response = (result.AsyncState as HttpWebRequest).EndGetResponse(result) as HttpWebResponse;
}
This is actually a good place to make use of the Task Parallel Library in .NET 4.0. I have wrapped your code in a Parallel.For block which will execute a number of sets of requests in parallel, collate the total times in each parallel branch, and then calculate the overall result afterwards.
int n = 16;
int reqs = 10;
var totalTimes = new long[n];
Parallel.For(0, n, i =>
{
for (int req = 0; req < reqs; req++)
{
Stopwatch w = new Stopwatch();
try
{
w.Start();
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("http://localhost:42838/Default.aspx");
webRequest.AllowAutoRedirect = false;
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
w.Stop();
totalTimes[i] += w.ElapsedMilliseconds;
//Returns "MovedPermanently", not 301 which is what I want.
int i_goodResponse = (int)response.StatusCode;
string s_goodResponse = response.StatusCode.ToString();
Console.WriteLine("Normal Response: " + i_goodResponse + " " + s_goodResponse);
}
catch (WebException we)
{
w.Stop();
totalTimes[i] += w.ElapsedMilliseconds;
int i_badResponse = (int)((HttpWebResponse)we.Response).StatusCode;
string s_badResponse = ((HttpWebResponse)we.Response).StatusCode.ToString();
Console.WriteLine("Error Response: " + i_badResponse + " " + s_badResponse);
}
}
});
var grandTotalTime = totalTimes.Sum();
var reqsPerSec = (double)(n * reqs * 1000) / (double)grandTotalTime;
Console.WriteLine("Requests per second: {0}", reqsPerSec);
The TPL is very useful here, as it abstracts away the detail of creating multiple threads of execution within your process and running each parallel branch on those threads.
Note that you still have to be careful here - we cannot share state which is updated during the tasks between threads, hence the totalTimes array, which collates the totals for each parallel branch and is only summed up at the very end, once the parallel execution is complete. If we didn't do this, we would be open to the possibility of a race condition - where two separate threads attempt to update the total count simultaneously, potentially corrupting the result.
I hope this makes sense and is useful as a start for you (I only calculate requests per second here, the other stats should be relatively easy to add). Add comments if you need further clarifications.
You have already answered your own question: you can use BeginGetResponse to start an async request.
Another, more convenient method might be to use the WebClient class, if you are more familiar with events than with AsyncResult:
DownloadDataCompletedEventHandler
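A rough sketch of the WebClient approach, using the URL from the question purely for illustration:
    using System;
    using System.Net;
    using System.Text;

    class WebClientSketch
    {
        static void Main()
        {
            var client = new WebClient();
            // The handler fires when the download completes instead of blocking the caller.
            client.DownloadDataCompleted += (sender, e) =>
            {
                if (e.Error == null)
                    Console.WriteLine(Encoding.UTF8.GetString(e.Result));
                else
                    Console.WriteLine("Error: " + e.Error.Message);
            };
            client.DownloadDataAsync(new Uri("http://10.10.1.6/64k.html"));
            Console.ReadLine(); // keep the console alive while the async download runs
        }
    }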
I am using VSTS 2008 + C# + .NET 3.5 to develop a console application that sends requests to another server (IIS 7.0 on Windows Server 2008). I find that when the number of request threads is large (e.g. 2000 threads), the client receives the error "Unable to connect to remote server" when invoking response = (HttpWebResponse)request.GetResponse(). My confusion is this: I have set the timeout to a large value, but I get this failure within a minute. I think that even if there are more connections than IIS can serve, the client should not fail so soon; it should only fail after the timeout period. Any comments? Any idea what is wrong? Any ideas on how to get IIS 7.0 to serve a larger number of concurrent connections?
Here is my code,
class Program
{
private static int ClientCount = 2000;
private static string TargetURL = "http://labtest/abc.wmv";
private static int Timeout = 3600;
static void PerformanceWorker()
{
Stream dataStream = null;
HttpWebRequest request = null;
HttpWebResponse response = null;
StreamReader reader = null;
try
{
request = (HttpWebRequest)WebRequest.Create(TargetURL);
request.Timeout = Timeout * 1000;
request.Proxy = null;
response = (HttpWebResponse)request.GetResponse();
dataStream = response.GetResponseStream();
reader = new StreamReader(dataStream);
// read 10,000 characters at a time
char[] c = new char[1000 * 10];
while (reader.Read(c, 0, c.Length) > 0)
{
Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message + "\n" + ex.StackTrace);
}
finally
{
if (null != reader)
{
reader.Close();
}
if (null != dataStream)
{
dataStream.Close();
}
if (null != response)
{
response.Close();
}
}
}
static void Main(string[] args)
{
Thread[] workers = new Thread[ClientCount];
for (int i = 0; i < ClientCount; i++)
{
workers[i] = new Thread((new ThreadStart(PerformanceWorker)));
}
for (int i = 0; i < ClientCount; i++)
{
workers[i].Start();
}
for (int i = 0; i < ClientCount; i++)
{
workers[i].Join();
}
return;
}
}
Kev answered your question already; I just want to add that creating so many threads is not really a good design solution (the context-switching overhead alone is a big minus), plus it won't scale well.
The quick answer would be: use asynchronous operations to read data instead of creating a bunch of threads, or at least use the thread pool (and lower the worker thread count). Remember that more connections to one source will only speed things up to a point. Try benchmarking it and you will probably see that 3-5 connections work faster than the 2000 you are using now.
You can read more about asynchronous client/server architecture (IOCP - input/output completion ports) and its advantages here. You can start from here:
MSDN - Using an Asynchronous Server Socket
MSDN - Asynchronous Server Socket Example
CodeProject - Multi-threaded .NET TCP Server Examples
All of these examples use lower-level TCP objects, but the same pattern can be applied to WebRequest/WebResponse as well.
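As a rough sketch of applying that asynchronous pattern to HttpWebRequest (reusing the TargetURL from your code), BeginGetResponse/EndGetResponse avoid tying up a dedicated thread per request:
    using System;
    using System.IO;
    using System.Net;
    using System.Threading;

    class AsyncFetchSketch
    {
        // TargetURL is taken from the code in the question.
        private const string TargetURL = "http://labtest/abc.wmv";

        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create(TargetURL);
            request.Proxy = null;
            // Kick off the request without blocking a dedicated thread.
            request.BeginGetResponse(OnResponse, request);
            Console.ReadLine(); // keep the process alive while the callback runs
        }

        static void OnResponse(IAsyncResult ar)
        {
            var request = (HttpWebRequest)ar.AsyncState;
            try
            {
                using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    // Consume the body; a real client would process it in chunks instead.
                    string body = reader.ReadToEnd();
                    Console.WriteLine("{0} characters read on thread {1}",
                        body.Length, Thread.CurrentThread.ManagedThreadId);
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }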
UPDATE
To try thread pool version, you can do something like this:
ManualResetEvent[] events = new ManualResetEvent[ClientCount];
for (int cnt = 0; cnt < events.Length; cnt++)
{
var evt = events[cnt] = new ManualResetEvent(false);
// Signal the event when the worker finishes, otherwise WaitAll never returns.
ThreadPool.QueueUserWorkItem(obj => { PerformanceWorker(); evt.Set(); });
}
// Note: WaitHandle.WaitAll accepts at most 64 handles, so with a large
// ClientCount you would need to wait in batches or count down instead.
WaitHandle.WaitAll(events);
Not tested, may need some adjustment.
I reckon you've maxed out the web site's application pool queue. The default is 1000 requests, and you're flooding the server with 2000 requests more or less all at once. Increasing the timeout isn't going to solve this.
Try increasing the Queue Length for the application pool the site resides in.
You should try and capture the underlying HTTP status, that'll give you a clue as to what is really going on.
Update:
When I run your code and try to download a sizeable file (200MB), I get (503) Server Unavailable. Increasing the size of the Application Pool's request queue solves this (I set mine to 10000).
Only once did I see Unable to connect to remote server, and sadly I have been unable to replicate it. This error sounds like there's something broken at the TCP/IP layer. Can you post the full exception?
Go to Smart Thread Pool and download the code. It is an instance thread pool that constrains the number of threads. The .NET thread pool can be problematic in applications that connect to web servers and SQL servers.
Change the loop to this
static void Main(string[] args)
{
var stp = new SmartThreadPool((int) TimeSpan.FromMinutes(5).TotalMilliseconds,
Environment.ProcessorCount - 1, Environment.ProcessorCount - 1);
stp.Start();
for (var i = 0; i < ClientCount; i++)
{
stp.QueueWorkItem(PerformanceWorker);
}
stp.WaitForIdle();
stp.Shutdown();
return;
}
This constrains the thread pool to use one thread per processor. Adjust this upwards until performance starts to degrade; too many threads are worse than too few. You may find that this is optimal.
Also add this to your config. The value of 100 is a default I use. There is a way to do this in code as well; see the sketch after the config below.
<system.net>
<connectionManagement>
<add address="*" maxconnection="100" />
</connectionManagement>
</system.net>
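For reference, the in-code equivalent of that setting is ServicePointManager.DefaultConnectionLimit; a small sketch (the value 100 mirrors the config above):
    using System.Net;

    static class ConnectionConfig
    {
        public static void Apply()
        {
            // Same effect as maxconnection="100" above; set it at startup,
            // before the first request creates its ServicePoint.
            ServicePointManager.DefaultConnectionLimit = 100;
        }
    }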
I am using Visual Studio 2005. How do I send an SMS? Here is my code:
IPHostEntry host;
host = Dns.GetHostEntry(Dns.GetHostName());
UriBuilder urlbuilder = new UriBuilder();
urlbuilder.Host = host.HostName;
urlbuilder.Port = 4719;
string PhoneNumber = "9655336272";
string message = "Just a simple text";
string subject = "MMS subject";
string YourChoiceofName = "victoria";
urlbuilder.Query = string.Format("PhoneNumber=%2B" + PhoneNumber + "&MMSFrom=" + YourChoiceofName + "&MMSSubject=" + subject + "&MMSText=" + message);//+ "&MMSFile=http://127.0.0.1/" + fileName
HttpWebRequest httpReq = (HttpWebRequest)WebRequest.Create(new Uri(urlbuilder.ToString(), false));
HttpWebResponse httpResponse = (HttpWebResponse)(httpReq.GetResponse());