I have a class that creates multiple WebClient instances with different proxies on multiple threads simultaneously.
Unfortunately, some of those WebClient instances take a long time to finish. I spawn hundreds of threads and most of them finish quickly, but I usually end up with about 20 threads that take a few minutes to finish.
I tried extending the WebClient class and setting the Timeout property to 20 seconds (as posted here), but it didn't change anything.
I'm not showing the whole code, because there would be quite a lot of it (WebClient is wrapped in another class). Still, I know the bottleneck is WebClient.DownloadString(url), because all of the remaining worker threads are on this specific line whenever I pause the debugger during that last phase of execution.
Here's how I use the extended WebClient:
public string GetHtml(string url)
{
this.CheckValidity(url);
var html = "";
using (var client = new WebDownload())
{
client.Proxy = this.Proxy;
client.Headers[HttpRequestHeader.UserAgent] = this.UserAgent;
client.Timeout = this.Timeout;
html = client.DownloadString(url);
}
return html;
}
EDIT
I have just run a few tests, and some of the threads take up to 7 minutes to finish, all of them stuck on the WebClient.DownloadString() statement.
Furthermore, I have tried setting ServicePointManager.DefaultConnectionLimit to int.MaxValue, unfortunately to no avail.
Here's what I ended up doing.
I realized that I simply needed to cancel WebClient.DownloadString() once it reached the specified timeout. Since I haven't found anything in WebClient that would help with that, I called WebClient.DownloadStringTaskAsync() instead. This way, I could use Task.WaitAll with a timeout to wait for WebClient to finish downloading the string, and then check whether the task had completed (to detect a timeout).
Here's the code:
public string GetHtml(string url)
{
var html = "";
using (var client = new WebClient())
{
// Assign all the important stuff
client.Proxy = this.Proxy;
client.Headers[HttpRequestHeader.UserAgent] = this.UserAgent;
// Run DownloadString() as a task.
var task = client.DownloadStringTaskAsync(url);
// Wait for the task to finish, or timeout
Task.WaitAll(new Task<string>[] { task }, this.Timeout);
// If timeout was reached, cancel task and throw an exception.
if (task.IsCompleted == false)
{
client.CancelAsync();
throw new TimeoutException();
}
// Otherwise, happy. :)
html = task.Result;
}
return html;
}
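A non-blocking variant of the same idea can be sketched with Task.WhenAny and Task.Delay. The helper name WithTimeout and its onTimeout parameter are hypothetical (they are not part of WebClient), and this is a minimal sketch under those assumptions rather than a definitive implementation.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutHelper
{
    // Hypothetical helper: awaits 'task', but gives up after 'timeout'.
    // 'onTimeout' is an optional hook, e.g. to cancel the underlying operation.
    public static async Task<T> WithTimeout<T>(Task<T> task, TimeSpan timeout, Action onTimeout = null)
    {
        // Task.WhenAny completes as soon as either the work or the delay finishes.
        Task completed = await Task.WhenAny(task, Task.Delay(timeout)).ConfigureAwait(false);
        if (completed != task)
        {
            // The delay won the race: run the hook and report the timeout.
            onTimeout?.Invoke();
            throw new TimeoutException();
        }
        // The work finished first; awaiting it returns the result or rethrows its exception.
        return await task.ConfigureAwait(false);
    }
}
```

With the WebClient from the code above, a call might look like `await TimeoutHelper.WithTimeout(client.DownloadStringTaskAsync(url), TimeSpan.FromSeconds(20), client.CancelAsync);`. Unlike Task.WaitAll, this does not block the calling thread while waiting.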
Problem summary: I need to make a call to HTTP resource A while reusing the name resolution from a previous HTTP request to resource B on the same host.
CASE 1. Consecutive calls to same resource produce faster result after 1st call.
Profiler tells me that the difference between 1st and 2nd call goes to DNS name resolution (GetHostAddresses)
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
var request2 = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request2.GetResponse()) {}
CASE 2. Consecutive calls to different resources on the same host produce same delay.
Profiler tells me that they both incur calls to DNS name resolution.
var request = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/a.txt");
using (var response = (HttpWebResponse)request.GetResponse()) {}
var request2 = (HttpWebRequest)WebRequest.Create("https://www.somehost.com/resources/b.txt");
using (var response = (HttpWebResponse)request2.GetResponse()) {}
I wonder why, in case 2, the second call can't use the DNS cache from the first call. It's the same host.
And the main question: how can I change that?
EDIT: the behaviour above also applies to the HttpClient class. It turned out that this is specific to the few web servers I use, and the issue does not happen on other servers. I can't figure out what specifically happens, but I suspect the web servers in question (Amazon CloudFront and Akamai) force-close the connection after the response has been served, ignoring my requests' keep-alive headers. I am going to close this for now, as it is not possible to formulate a concise question.
Your problem doesn't exist for System.Net.Http.HttpClient, try it instead. It can reuse the existing connections (no DNS cache needed for such calls). Looks like that is exactly what you want to achieve. As a bonus it supports HTTP/2 (can be enabled with Property assignment at HttpClient instance creation).
WebRequest is ancient and not recommended by Microsoft for new development. In .NET 5, HttpClient is also noticeably faster (twice?).
Create the HttpClient instance once per application (link).
private static readonly HttpClient client = new HttpClient();
The equivalent of your request. Note that await is available only in methods marked as async.
string text = await client.GetStringAsync("https://www.somehost.com/resources/b.txt");
You may also do multiple requests at once without spawning concurrent Threads.
string[] urls = new string[]
{
"https://www.somehost.com/resources/a.txt",
"https://www.somehost.com/resources/b.txt"
};
List<Task<string>> tasks = new List<Task<string>>();
foreach (string url in urls)
{
tasks.Add(client.GetStringAsync(url));
}
string[] results = await Task.WhenAll(tasks);
If you're not familiar with Asynchronous programming e.g. async/await, start with this article.
You can also limit how many requests are processed at once. Let's make the same request 1000 times, with at most 10 requests in flight at a time.
static async Task Main(string[] args)
{
Stopwatch sw = new Stopwatch();
string url = "https://www.somehost.com/resources/a.txt";
using SemaphoreSlim semaphore = new SemaphoreSlim(10);
List<Task<string>> tasks = new List<Task<string>>();
sw.Start();
for (int i = 0; i < 1000; i++)
{
await semaphore.WaitAsync();
tasks.Add(GetPageAsync(url, semaphore));
}
string[] results = await Task.WhenAll(tasks);
sw.Stop();
Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds}ms");
}
private static async Task<string> GetPageAsync(string url, SemaphoreSlim semaphore)
{
try
{
return await client.GetStringAsync(url);
}
finally
{
semaphore.Release();
}
}
You may measure the time.
The following situation is given:
1. A new job is sent to an API via a POST request. The API returns a JobID and the HTTP response code 202.
2. This JobID is then used to query a status endpoint. Once the response body has a "Finished" property set, you can continue with step 3.
3. The results are queried via a result endpoint using the JobID and can then be processed.
My question is how I can solve this elegantly and cleanly. Are there perhaps already ready-to-use libraries that implement exactly this functionality? I could not find such functionality for RestSharp or another HttpClient.
The current solution looks like this:
async Task<string> PostNewJob()
{
var restClient = new RestClient("https://baseUrl/");
var restRequest = new RestRequest("jobs");
//add headers
var response = await restClient.ExecutePostTaskAsync(restRequest);
string jobId = JsonConvert.DeserializeObject<string>(response.Content);
return jobId;
}
async Task WaitTillJobIsReady(string jobId)
{
string jobStatus = string.Empty;
var request= new RestRequest(jobId) { Method = Method.GET };
do
{
if (!String.IsNullOrEmpty(jobStatus))
Thread.Sleep(5000); //wait for next status update
var response = await restClient.ExecuteGetTaskAsync(request, CancellationToken.None);
jobStatus = JsonConvert.DeserializeObject<string>(response.Content);
} while (jobStatus != "finished");
}
async Task<List<dynamic>> GetJobResponse(string jobID)
{
var restClient = new RestClient(@"Url/bulk/" + jobID);
var restRequest = new RestRequest(){Method = Method.GET};
var response = await restClient.ExecuteGetTaskAsync(restRequest, CancellationToken.None);
dynamic downloadResponse = JsonConvert.DeserializeObject(response.Content);
var responseResult = new List<dynamic>() { downloadResponse?.ToList() };
return responseResult;
}
async Task Main()
{
var jobId = await PostNewJob();
await WaitTillJobIsReady(jobId);
var responseResult = await GetJobResponse(jobId);
//handle result
}
As @Paulo Morgado said, I should not use Thread.Sleep / Task.Delay in production code. But in my opinion I have to use it in the method WaitTillJobIsReady(), don't I? Otherwise I would overwhelm the API with GET requests in the loop.
What is the best practice for this type of problem?
Long Polling
There are multiple ways you can handle this type of problem, but as others have already pointed out no library such as RestSharp currently has this built in. In my opinion, the preferred way of overcoming this would be to modify the API to support some type of long-polling like Nikita suggested. This is where:
The server holds the request open until new data is available. Once
available, the server responds and sends the new information. When the
client receives the new information, it immediately sends another
request, and the operation is repeated. This effectively emulates a
server push feature.
Using a scheduler
Unfortunately this isn't always possible. Another, more elegant solution would be to create a service that checks the status, and then use a scheduler such as Quartz.NET or HangFire to run that service at recurring intervals (anywhere from 500ms to 3s, say) until it succeeds. Once it gets back the "Finished" property, you can mark the task as complete to stop the polling. This would arguably be better than your current solution and would offer much more control and feedback over what's going on.
Using Timers
Instead of using Thread.Sleep, a better choice would be to use a Timer. This allows you to call a delegate repeatedly at specified intervals, which seems to be what you want to do here.
Below is an example usage of a timer that will run every 2 seconds until it hits 10 runs. (Taken from the Microsoft documentation)
using System;
using System.Threading;
using System.Threading.Tasks;
class Program
{
private static Timer timer;
static void Main(string[] args)
{
var timerState = new TimerState { Counter = 0 };
timer = new Timer(
callback: new TimerCallback(TimerTask),
state: timerState,
dueTime: 1000,
period: 2000);
while (timerState.Counter <= 10)
{
Task.Delay(1000).Wait();
}
timer.Dispose();
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff}: done.");
}
private static void TimerTask(object timerState)
{
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff}: starting a new callback.");
var state = timerState as TimerState;
Interlocked.Increment(ref state.Counter);
}
class TimerState
{
public int Counter;
}
}
Why you don't want to use Thread.Sleep
The reason you don't want to use Thread.Sleep for operations that should run on a recurring schedule is that Thread.Sleep relinquishes control, and when the thread regains control is ultimately not up to the thread. It is simply saying that it wants to give up the rest of its time slice for at least x milliseconds, but in reality it could take much longer to regain it.
Per the Microsoft documentation:
The system clock ticks at a specific rate called the clock resolution.
The actual timeout might not be exactly the specified timeout, because
the specified timeout will be adjusted to coincide with clock ticks.
For more information on clock resolution and the waiting time, see the
Sleep function from the Windows system APIs.
Peter Ritchie actually wrote an entire blog post on why you shouldn't use Thread.Sleep.
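In an async method, one common way to wait between polls without blocking a thread is to await Task.Delay instead of calling Thread.Sleep. The sketch below illustrates that for the status loop in the question. The names Polling, PollUntilAsync, getStatus and isDone are hypothetical, and the actual status request is left to the caller.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Polling
{
    // Hypothetical helper: calls 'getStatus' repeatedly until 'isDone' accepts the result.
    public static async Task<T> PollUntilAsync<T>(
        Func<CancellationToken, Task<T>> getStatus,
        Func<T, bool> isDone,
        TimeSpan interval,
        CancellationToken cancellationToken = default)
    {
        while (true)
        {
            T status = await getStatus(cancellationToken).ConfigureAwait(false);
            if (isDone(status))
            {
                return status;
            }
            // Waits without holding a thread; throws OperationCanceledException
            // if the token is cancelled during the wait.
            await Task.Delay(interval, cancellationToken).ConfigureAwait(false);
        }
    }
}
```

Passing a token from a CancellationTokenSource created with a timeout gives the loop an overall deadline, so a job that never reaches "finished" does not poll forever.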
EndNote
Overall, I would say your current approach has the right idea of how this should be handled. However, you may want to 'future proof' it by refactoring it to use one of the methods mentioned above.
Problem: I inherited from WebClient in ExtendedWebClient, where I override the WebRequest's Timeout property in the GetWebRequest method. Even if I set it to 100ms, or even 20ms, a failing request always takes more than 30 seconds to time out. Sometimes it seems to not get through at all.
Also, when the service (see code below) serving the images comes back online again, the code written in Rx / System.Reactive does not push images into the pictureBox anymore?
How can I get around this, what am I doing wrong? (See code below)
Test case: I have a WinForms test project set up for this, which is doing the following.
1. GetNextImageAsync
public async Task<Image> GetNextImageAsync()
{
Image image = default(Image);
try {
using (var webClient = new ExtendedWebClient()) {
var data = await webClient.DownloadDataTaskAsync(new Uri("http://SOMEIPADDRESS/x/y/GetJpegImage.cgi"));
image = ByteArrayToImage(data);
return image;
}
} catch {
return image;
}
}
2. ExtendedWebClient
private class ExtendedWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
var webRequest = base.GetWebRequest(address);
webRequest.Timeout = 100;
return webRequest;
}
}
3.1 Presentation code (using Rx)
Note: It has actually never reached the "else" statement in the pictures.Subscribe() body.
var pictures = Observable
.FromAsync<Image>(GetNextImageAsync)
.Throttle(TimeSpan.FromSeconds(.5))
.Repeat()
;
pictures.Subscribe(img => {
if (img != null) {
pictureBox1.Image = img;
} else {
if (pictureBox1.Created && this.Created) {
using (var g = pictureBox1.CreateGraphics()) {
g.DrawString("[-]", new Font("Verdana", 8), Brushes.Red, new PointF(8, 8));
}
}
}
});
3.2 Presentation code (using Task.Run)
Note 1: Here the "else" body is getting called, though WebClient still takes longer than expected to timeout....
Note 2: I don't want to use this method, because this way I can't "Throttle" the image stream, I'm not able to get them in proper order, and do other stuff with my stream of images... But this is just example code of it working...
Task.Run(() => {
while (true) {
GetNextImageAsync().ContinueWith(img => {
if(img.Result != null) {
pictureBox1.Image = img.Result;
} else {
if (pictureBox1.Created && this.Created) {
using (var g = pictureBox1.CreateGraphics()) {
g.DrawString("[-]", new Font("Verdana", 8), Brushes.Red, new PointF(8, 8));
}
}
}
});
}
});
For reference, here is the code that converts the byte[] to the Image object.
public Image ByteArrayToImage(byte[] byteArrayIn)
{
using(var memoryStream = new MemoryStream(byteArrayIn)){
Image returnImage = Image.FromStream(memoryStream);
return returnImage;
}
}
The Other Problem...
I'll address cancellation below, but there is also a misunderstanding of the behaviour of the following code, which is going to cause issues regardless of the cancellation problem:
var pictures = Observable
.FromAsync<Image>(GetNextImageAsync)
.Throttle(TimeSpan.FromSeconds(.5))
.Repeat()
You probably think that the Throttle here will limit the invocation rate of GetNextImageAsync. Sadly that is not the case. Consider the following code:
var xs = Observable.Return(1)
.Throttle(TimeSpan.FromSeconds(5));
How long do you think it will take the OnNext(1) to be invoked on a subscriber? If you thought 5 seconds, you'd be wrong. Since Observable.Return sends an OnCompleted immediately after its OnNext(1) the Throttle deduces that there are no more events that could possibly throttle the OnNext(1) and it emits it immediately.
Contrast with this code where we create a non-terminating stream:
var xs = Observable.Never<int>().StartWith(1)
.Throttle(TimeSpan.FromSeconds(5));
Now the OnNext(1) arrives after 5 seconds.
The upshot of all this is that your indefinite Repeat is going to batter your code, requesting images as fast as they arrive - how exactly this is causing the effects you are witnessing would take further analysis.
There are several constructions to limit the rate of querying, depending on your requirements. One is to simply append an empty delay to your result:
var xs = Observable.FromAsync(GetValueAsync)
.Concat(Observable.Never<int>()
.Timeout(TimeSpan.FromSeconds(5),
Observable.Empty<int>()))
.Repeat();
Here you would replace int with the type returned by GetValueAsync.
Cancellation
As @StephenCleary observed, setting the Timeout property of WebRequest only works for synchronous requests. Having looked at the changes necessary to implement cancellation cleanly with WebClient, I have to concur: it's such a faff with WebClient that you are far better off converting to HttpClient if at all possible.
Sadly, even then the "easy" methods to pull back data such as GetByteArrayAsync don't (for some bizarre reason) have an overload accepting a CancellationToken.
If you do use HttpClient then one option for timeout handling is via the Rx like this:
void Main()
{
Observable.FromAsync(GetNextImageAsync)
.Timeout(TimeSpan.FromSeconds(1), Observable.Empty<byte[]>())
.Subscribe(Console.WriteLine);
}
public async Task<byte[]> GetNextImageAsync(CancellationToken ct)
{
using(var wc = new HttpClient())
{
var response = await wc.GetAsync(new Uri("http://www.google.com"),
HttpCompletionOption.ResponseContentRead, ct);
return await response.Content.ReadAsByteArrayAsync();
}
}
Here I have used the Timeout operator to cause an empty stream to be emitted in the event of timeout - other options are available depending on what you need.
When Timeout does time out, it cancels its subscription to FromAsync, which in turn cancels the cancellation token that is passed indirectly to HttpClient.GetAsync via GetNextImageAsync.
You could use a similar construction to call Abort on a WebRequest too, but as I said, it's a lot more of a faff given there's no direct support for cancellation tokens.
To quote the MSDN docs:
The Timeout property affects only synchronous requests made with the GetResponse method. To time out asynchronous requests, use the Abort method.
You could mess around with the Abort method, but it's easier to convert from WebClient to HttpClient, which was designed with asynchronous operations in mind.
I'm creating a load-testing tool (it sends HTTP GETs). It runs fine, but eventually dies with an out-of-memory error.
ASK: How can I reset the threads so this loop can continually run and not err?
static void Main(string[] args)
{
System.Net.ServicePointManager.DefaultConnectionLimit = 200;
while (true)
{
for (int i = 0; i < 1000; i++)
{
new Thread(LoadTest).Start(); //<-- EXCEPTION!.eventually errs out of memory
}
Thread.Sleep(2);
}
}
static void LoadTest()
{
string url = "http://myserv.com/api/dev/getstuff?whatstuff=thisstuff";
// Sends an HTTP GET to the URL above ... and displays the response in the console....
}
You are instantiating Threads left, right and centre. This is likely your problem. You want to replace the
new Thread(LoadTest).Start();
with
Task.Run(LoadTest);
This will run your LoadTest on a Thread in the ThreadPool, instead of using resources to create a new Thread each time. HOWEVER. This will then expose a different issue.
Threads on the ThreadPool are a limited resource and you want to return Threads to the ThreadPool as soon as possible. I assume you are using the synchronous download methods as opposed to the APM methods. This means that whilst the request is being sent out to the server, the thread spawning the request is sleeping as opposed to going off to do some other work.
Either use (assuming .net 4.5)
var client = new WebClient();
var response = await client.DownloadStringTaskAsync(url);
Console.WriteLine(response);
Or use a callback (if not .net 4.5)
var client = new WebClient();
client.DownloadStringCompleted += (sender, e) => Console.WriteLine(e.Result);
client.DownloadStringAsync(new Uri(url));
Use the ThreadPool and QueueUserWorkItem instead of creating thousands of threads. Threads are expensive objects, so it is no surprise that you are running out of memory. Besides, your test tool won't perform well with so many threads.
Your code snippet creates lots of threads, so it's no wonder it eventually runs out of memory. It would be better to use a thread pool here.
Your code would look like this:
static void Main(string[] args)
{
System.Net.ServicePointManager.DefaultConnectionLimit = 200;
ThreadPool.SetMaxThreads(500, 300);
while (true)
{
ThreadPool.QueueUserWorkItem(LoadTest);
}
}
static void LoadTest(object state)
{
string url = "http://myserv.com/api/dev/getstuff?whatstuff=thisstuff";
// Sends an HTTP GET to the URL above ... and displays the response in the console....
}
I'm currently writing a sitemap generator that scrapes a site for URLs and builds an XML sitemap. As most of the time is spent waiting on requests to URIs, I'm using threading, specifically the built-in ThreadPool object.
In order to let the main thread wait for the unknown amount of threads to complete I have implemented the following setup. I don't feel this is a good solution though, can any threading gurus advise me of any problems this solution has, or suggest a better way to implement it?
The EventWaitHandle is set to EventResetMode.ManualReset
Here is the thread method
protected void CrawlUri(object o)
{
try
{
Interlocked.Increment(ref _threadCount);
Uri uri = (Uri)o;
foreach (Match match in _regex.Matches(GetWebResponse(uri)))
{
Uri newUri = new Uri(uri, match.Value);
if (!_uriCollection.Contains(newUri))
{
_uriCollection.Add(newUri);
ThreadPool.QueueUserWorkItem(_waitCallback, newUri);
}
}
}
catch
{
// Handle exceptions
}
finally
{
Interlocked.Decrement(ref _threadCount);
}
// If there are no more threads running then signal the waithandle
if (_threadCount == 0)
_eventWaitHandle.Set();
}
Here is the main thread method
// Request first page (based on host)
Uri root = new Uri(context.Request.Url.GetLeftPart(UriPartial.Authority));
// Begin threaded crawling of the Uri
ThreadPool.QueueUserWorkItem(_waitCallback, root);
Thread.Sleep(5000); // TEMP SOLUTION: Sleep for 5 seconds
_eventWaitHandle.WaitOne();
// Serve the Xml Sitemap
context.Response.ContentType = "text/xml";
context.Response.Write(GetXml().OuterXml);
Any ideas are much appreciated :)
Well, first off, you can create a ManualResetEvent that starts unset, so you don't have to sleep before waiting on it. Secondly, you're going to need to put thread synchronization around your Uri collection. You could get a race condition where two threads both pass the "this Uri does not exist yet" check and add duplicates. Another race condition is that two threads could both pass the if (_threadCount == 0) check and both set the event.
Last, you can make the whole thing much more efficient by using the asynchronous BeginGetResponse. Your solution right now keeps a thread around to wait for every request. If you use async methods and callbacks, your program will use less memory (1MB per thread) and won't need to do nearly as many thread context switches.
Here's an example that should illustrate what I'm talking about. Out of curiosity, I did test it out (with a depth limit) and it does work.
public class CrawlUriTool
{
private Regex regex;
private int pendingRequests;
private List<Uri> uriCollection;
private object uriCollectionSync = new object();
private ManualResetEvent crawlCompletedEvent;
public List<Uri> CrawlUri(Uri uri)
{
this.pendingRequests = 0;
this.uriCollection = new List<Uri>();
this.crawlCompletedEvent = new ManualResetEvent(false);
this.StartUriCrawl(uri);
this.crawlCompletedEvent.WaitOne();
return this.uriCollection;
}
private void StartUriCrawl(Uri uri)
{
Interlocked.Increment(ref this.pendingRequests);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.BeginGetResponse(this.UriCrawlCallback, request);
}
private void UriCrawlCallback(IAsyncResult asyncResult)
{
HttpWebRequest request = asyncResult.AsyncState as HttpWebRequest;
try
{
HttpWebResponse response = (HttpWebResponse)request.EndGetResponse(asyncResult);
string responseText = this.GetTextFromResponse(response); // not included
foreach (Match match in this.regex.Matches(responseText))
{
Uri newUri = new Uri(response.ResponseUri, match.Value);
lock (this.uriCollectionSync)
{
if (!this.uriCollection.Contains(newUri))
{
this.uriCollection.Add(newUri);
this.StartUriCrawl(newUri);
}
}
}
}
catch (WebException exception)
{
// handle exception
}
finally
{
if (Interlocked.Decrement(ref this.pendingRequests) == 0)
{
this.crawlCompletedEvent.Set();
}
}
}
}
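As an alternative to the explicit lock in the example above, the "check, then add" step could also be made atomic with a concurrent collection. The following is a minimal sketch. The class name VisitedUris and its members are hypothetical, and a ConcurrentDictionary is used as a set because its TryAdd performs the check and the insertion as one operation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class VisitedUris
{
    // The byte values are unused; only the keys matter.
    private readonly ConcurrentDictionary<Uri, byte> seen = new ConcurrentDictionary<Uri, byte>();

    // Returns true for exactly one caller per distinct Uri, even under contention.
    public bool TryMarkVisited(Uri uri)
    {
        return seen.TryAdd(uri, 0);
    }

    public int Count
    {
        get { return seen.Count; }
    }

    // A snapshot of the Uris recorded so far.
    public ICollection<Uri> Snapshot()
    {
        return seen.Keys;
    }
}
```

A crawler callback would then start a new request only when TryMarkVisited returns true, which rules out the duplicate-Uri race without holding a lock while the request is started.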
When doing this kind of logic, I generally try to make an object representing each asynchronous task and the data it needs to run. I would typically add this object to a collection of tasks to be done. The thread pool gets these tasks scheduled, and I would let each object remove itself from the "to be done" collection when its task finishes, possibly signalling on the collection itself.
So you're finished when the "to be done" collection is empty; the main thread is probably awoken once by each task that finishes.
You could look into the CTP of the Task Parallel Library which should make this simpler for you. What you're doing can be divided into "tasks", chunks or units of work, and the TPL can parallelize this for you if you supply the tasks. It uses a thread pool internally as well, but it's easier to use and comes with a lot of options like waiting for all tasks to finish. Check out this Channel9 video where the possibilities are explained and where a demo is shown of traversing a tree recursively in parallel, which seems very applicable to your problem.
However, it's still a preview and won't be released until .NET 4.0, so it comes with no warranties. You'll also have to manually include the supplied System.Threading.dll (found in the install folder) in your project, and I don't know if that's an option for you.
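On current .NET versions, where the Task Parallel Library is part of the framework, waiting for an unknown number of recursively spawned tasks could be sketched roughly as follows. The names ParallelCrawl, CrawlAsync and getLinks are hypothetical. Fetching a page and extracting its links is left to the caller, and this is an illustration of the idea rather than a complete crawler.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelCrawl
{
    // Hypothetical sketch: visits every page reachable from 'root' exactly once.
    // 'getLinks' is supplied by the caller and returns the links found on a page.
    public static async Task<ICollection<string>> CrawlAsync(
        string root,
        Func<string, Task<IEnumerable<string>>> getLinks)
    {
        var seen = new ConcurrentDictionary<string, byte>();

        async Task Visit(string page)
        {
            // TryAdd is atomic, so each page is processed by one task only.
            if (!seen.TryAdd(page, 0))
            {
                return;
            }
            IEnumerable<string> links = await getLinks(page).ConfigureAwait(false);
            // Each child visit is a task; this task completes once all of them have.
            await Task.WhenAll(links.Select(link => Visit(link))).ConfigureAwait(false);
        }

        // The root task finishes only after the whole tree of visits has finished.
        await Visit(root).ConfigureAwait(false);
        return seen.Keys;
    }
}
```

Because every visit awaits its children, no thread counter or wait handle is needed: awaiting the root task is the "wait for everything" step.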