Timeout Exception - Queuing of Requests? Not enough threads? - c#

Background:
I have a service which aggregates data from multiple other services. To make things happen in a timely manner I use async throughout the code, and then gather the various requests into a list of tasks.
Here are some excerpts from the code:
private async Task<List<Foo>> Baz(..., int timeout)
{
    var tasks = new List<Task<IEnumerable<Foo>>>();
    tasks.Add(GetFoo1(..., timeout));
    tasks.Add(GetFoo2(..., timeout));
    // Up to 6, depending on other parameters. Some tasks return multiple objects.
    return await Task.WhenAll(tasks)
        .ContinueWith((antecedent) => { return antecedent.Result.AsEnumerable().SelectMany(f => f).ToList(); })
        .ConfigureAwait(false);
}
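(For what it's worth, the ContinueWith isn't required; the same aggregation can be written as the sketch below, with the elided parameters dropped.)

private async Task<List<Foo>> BazSimplified(int timeout)
{
    var tasks = new List<Task<IEnumerable<Foo>>>
    {
        GetFoo1(timeout),
        GetFoo2(timeout)
    };
    // WhenAll yields IEnumerable<Foo>[]; flatten it into one list.
    var results = await Task.WhenAll(tasks).ConfigureAwait(false);
    return results.SelectMany(f => f).ToList();
}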
private async Task<IEnumerable<Foo>> GetFoo1(..., int timeout)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    var value = await SomeAsyncronousService.GetAsync(..., timeout).ConfigureAwait(false);
    sw.Stop();
    // Record timing...
    return new[] { new Foo(..., value) };
}
private async Task<IEnumerable<Foo>> GetFoo2(..., int timeout)
{
    return await Task.Run(() =>
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var r = new[] { new Foo(..., SomeSyncronousService.Get(..., timeout)) };
        sw.Stop();
        // Record timing...
        return r;
    }).ConfigureAwait(false);
}
// In class SomeAsyncronousService
public async Task<string> GetAsync(..., int timeout)
{
    ...
    try
    {
        using (var httpClient = HttpClientFactory.Create())
        {
            // I have tried it with both timeout and CTS. The behavior is the same.
            //httpClient.Timeout = TimeSpan.FromMilliseconds(timeout);
            var cts = new CancellationTokenSource();
            cts.CancelAfter(timeout);
            var content = ...;
            var responseMessage = await httpClient.PostAsync(Endpoint, content, cts.Token).ConfigureAwait(false);
            if (responseMessage.IsSuccessStatusCode)
            {
                var contentData = await responseMessage.Content.ReadAsStringAsync().ConfigureAwait(false);
                ...
                return ...
            }
            ...
        }
    }
    catch (OperationCanceledException ex)
    {
        // Log statement ...
    }
    catch (Exception ex)
    {
        // Log statement ...
    }
    return ...;
}
The Symptoms:
This code works great on my local machine, and it works fine on our test servers most of the time. However, occasionally we get a mass of recorded timeouts - captured by the "Record timing" comments above and by the log statements on OperationCanceledException. I do not have any way of telling if the services I call actually timed out.
Now, when I say a series of timeouts, I mean that most or all of the tasks (and the HttpClients that all but one of them use; the other uses a WCF service) time out at about the same time.
Now, I know what you are thinking: I am passing in the same timeout. That's right, but I pass in 250 ms, and the run times reported by the various stopwatches are around 800 ms or higher.
Now, I do see the OperationCanceledExceptions in the log, but the time stamp of the exception is the same as the time stamp of when the stopwatch ended (or within 2-3 ms), and my service is failing because clients expect it to respond in 500 ms or less, not 800 ms.
Now, normally the various services respond in less than 100 ms, with a wide variance among the results. When a problem occurs and most or all of them return in 800 ms or more, they vary by only ~10 ms. The dependencies I call are all on different domains. It seems highly unlikely that all of them are really taking that long to respond, all at the same time.
I suppose there could be a network issue, affecting all requests at the same time, but the other services in our network do not experience the same behavior - it is limited to the new service I am writing.
Even if that were the case, I would expect the cancellation exceptions to occur after 250 ms, then the task to end and the stopwatch to record ~250 ms (plus 5-20 ms or so for exception handling).
So I do not think that it is a network issue. Now, I am sure that at least part of the problem is related to me not cancelling / timing out correctly, but it seems to me that all of the outgoing requests from the service are being affected at the same time, independent of HttpClient.
The reason I say that is because the WCF service also shows 800+ ms (according to the stopwatch) when the rest of the requests time out. The WCF service is not asynchronous. The timeout is set like this:
var binding = new BasicHttpBinding()
{
    Security = new BasicHttpSecurity()
    {
        Mode = BasicHttpSecurityMode.TransportCredentialOnly,
        Transport = new HttpTransportSecurity()
        {
            ClientCredentialType = HttpClientCredentialType.Ntlm
        }
    },
    ReceiveTimeout = TimeSpan.FromMilliseconds(timeout)
};
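(Worth noting: in WCF it is usually SendTimeout that bounds how long a client call waits for a reply; ReceiveTimeout mostly applies to idle time on the connection. If that is the relevant setting here, a sketch would be:)

var binding = new BasicHttpBinding()
{
    // SendTimeout covers the whole client call, including waiting for the reply.
    SendTimeout = TimeSpan.FromMilliseconds(timeout),
    // ReceiveTimeout applies to idle time on the session/connection.
    ReceiveTimeout = TimeSpan.FromMilliseconds(timeout)
};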
The Problem:
So, in short, I think that something is causing all outgoing requests to any domain to pause or queue, which is causing the observed behavior.
I have spent days trying to figure out what is going on, but have had no luck. Any ideas?
EDIT
I think what is happening is that the requests are being put on hold because there isn't a thread available, and then a few hundred milliseconds later a thread becomes available and the task starts. Timing the method call shows that it is taking 800 ms, but the timeout on the HttpClient doesn't start until a thread is available to run the async call.
It would also explain why I see that the method takes 800+ ms, but sometimes it still completes without showing a timeout exception. Other times it does throw a timeout exception and does not complete.
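One way I can think of to test this theory is to measure how long the work sits in the queue before it starts, and to log the available thread pool threads at that moment. A diagnostic sketch (the helper below is hypothetical, not part of my code):

private Task<T> MeasureQueueDelay<T>(Func<Task<T>> work, string name)
{
    var queuedAt = Stopwatch.StartNew();
    return Task.Run(async () =>
    {
        // How long the delegate waited for a thread pool thread before it started running.
        var queueDelayMs = queuedAt.ElapsedMilliseconds;

        int workerThreads, completionPortThreads;
        ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
        // Log name, queueDelayMs, workerThreads and completionPortThreads here...

        return await work().ConfigureAwait(false);
    });
}

If the thread pool is the bottleneck, queueDelayMs should account for most of the extra 500+ ms before the HTTP call even starts.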
I have tried setting the ServicePointManager.DefaultConnectionLimit to 200 in Application_Start, but that did not solve the issue.
The service isn't taking that much traffic compared to our other services, and none of the others appear to have the same problem.
Any ideas?
Edit 2
I logged into the box and monitored netstat while doing (minor) load tests.
Using HttpClient, with 1-2 requests per second the ports would show ESTABLISHED, then move to TIME_WAIT for about 4 minutes. With 3+ requests per second I would end up with about a constant 100 x requests per second ESTABLISHED ports (so 300 for a 3 per second load test), and then I would start seeing them go to CLOSE_WAIT instead of TIME_WAIT - indicating an error condition on close. At the same time I would see the spike in the number of exceptions and time to execute the requests. (TcpTimedWaitDelay does not apply to CLOSE_WAIT).
So I rewrote the whole thing to use HttpWebRequests in serial, instead of HttpClient in parallel. Then I ran the same tests.
Now the ESTABLISHED ports equal 0-2 x requests per second, and the ports then move on to TIME_WAIT as expected. The performance and throughput improved, but the problem didn't clear up completely.
Then I set TcpTimedWaitDelay to 30 (default 240). The performance has increased dramatically. I have a primitive load test that hits it with 40 requests per second without any issues. I will get a more thorough test setup but I think the problem has been solved.
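For anyone looking for the setting: TcpTimedWaitDelay is a DWORD in seconds under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, and the change may need a reboot to take effect. A sketch of setting it from C# (requires admin rights):

using Microsoft.Win32;

// Sketch: set TcpTimedWaitDelay (in seconds) machine-wide.
Registry.SetValue(
    @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
    "TcpTimedWaitDelay",
    30,
    RegistryValueKind.DWord);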
I don't know what is going on, but it appears that the HttpClient was not closing the ephemeral ports correctly underneath. Many of the developers and architects at my company looked at it and couldn't see anything wrong with the code. I tried having a single HttpClient in a using statement per request, as well as having a single HttpClient per API I call on the back end. I have tried using HttpClient in parallel and in serial. I have tried it with async/await and without. No matter what I tried, the behavior was the same.
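For completeness, the usually recommended pattern to avoid ephemeral-port exhaustion is a single long-lived HttpClient shared across requests, with per-request cancellation for the timeout. A sketch of that pattern (not what my code did):

// Sketch: one shared HttpClient, reused for all requests to the back end.
private static readonly HttpClient SharedClient = new HttpClient();

public async Task<string> GetWithTimeoutAsync(string endpoint, HttpContent content, int timeoutMs)
{
    using (var cts = new CancellationTokenSource(timeoutMs))
    using (var response = await SharedClient.PostAsync(endpoint, content, cts.Token).ConfigureAwait(false))
    {
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
    }
}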
I would like to be able to use HttpClient, but I can't spend any more time on this issue as I have it working with HttpWebRequest. My next step is to make the HttpWebRequests occur in parallel.
Thank you for your input.

I've experienced similar frustrations with the HttpClient. In my scenario I found setting MaxServicePointIdleTime to a much lower value and DefaultConnectionLimit to a high value on the ServicePointManager resolved my issues. I believe in my case I was experiencing pool starvation as the connections were being held open.
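A sketch of those two settings (the values are placeholders to tune, not recommendations):

// Example values only - tune for your own workload.
ServicePointManager.DefaultConnectionLimit = 200;          // allow more concurrent connections per host
ServicePointManager.MaxServicePointIdleTime = 10 * 1000;   // close idle connections after 10 seconds (value is in ms)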
You may also want to test without the debugger attached, in release, if you are not already doing so, as the TaskScheduler behaves differently when debugging.
The following MSDN article is very helpful: http://blogs.msdn.com/b/jpsanders/archive/2009/05/20/understanding-maxservicepointidletime-and-defaultconnectionlimit.aspx

Related

Long Polling in AWS SQS causes application to hang

I have an application written in C# that long polls a SQS queue with a ReceiveWaitTime of 20 seconds, and the max number of messages read is one.
This application uses AWSSDK.SQS (3.3.3.62).
I am running into a bit of an issue where the polling seems to just hang indefinitely and does not stop polling until the application is restarted (when the application is restarted, we re-create the Message Queue Monitor and start polling from there).
Here is the bit of code that does the polling:
private async Task ReceiveInternalAsync(Func<IMessageReceipt, Task> onMessageReceived,
    bool processAsynchronously, TimeSpan? maximumWait = null, CancellationToken? cancellationToken = null, int maxNumberOfMessages = 1)
{
    var request = new ReceiveMessageRequest();
    var totalWait = maximumWait ?? TimeSpan.MaxValue;
    request.QueueUrl = Address;
    request.WaitTimeSeconds = GetSqsWaitTimeSeconds(totalWait, cancellationToken.HasValue);
    request.MaxNumberOfMessages = maxNumberOfMessages;
    var stopwatch = Stopwatch.StartNew();
    Amazon.SQS.Model.Message[] sqsMessages;
    while (true)
    {
        var stopwatch2 = Stopwatch.StartNew();
        var response = await _sqsClient.ReceiveMessageAsync(request).ConfigureAwait(false);
        stopwatch2.Stop();
        sqsMessages = response.Messages.Where(i => i != null).ToArray();
        _logger.LogDebug($"{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms");
        ...
    }
}
Where the parameters being sent to this method are:
onMessageReceived = a delegate to handle the received message
processAsynchronously = true
maximumWait = 20 seconds (new TimeSpan(0,0,20))
cancellationToken = null
maxNumberOfMessages = 1
I have omitted the rest of the while loop, as I don't believe it's looping indefinitely in there, but I am more than happy to share the rest of it if it might be the crux of the issue.
The reason I believe it's the SDK that is hanging is that I don't see the debug message:
{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms
appear, and I know it has hit this method because the caller logs that it has called this method (let me know if I should share the caller's code as well).
I looked up similar issues online and I found this:
https://github.com/aws/aws-sdk-net/issues/609
which seems similar to what I have.
The issue is that this seems to happen only in production, whereas locally I cannot replicate it to the full extent where it never polls again.
What I have done locally is:
Scenario 1: disconnect completely from the internet
Long Poll queue that has no messages in it
Disconnect from the Internet before 20 seconds are up
For about 1 minute and 40 seconds, the AWS SDK does not throw an error but continues on as if the queue were returning empty results
About 2 minutes to 2 minutes and 30 seconds in, I get a DNS name resolution error
Scenario 2: disconnect from the internet for 1 minute and 40 seconds and reconnect
Based on my analysis from the above scenario, I wondered then what would happen if I were to reconnect after step 3) in scenario 1.
I found that the AWS SDK will wait for 20 seconds to retrieve any messages from the queue.
Theory as to what's happening
I suppose we could get indefinite polling if the client's network keeps dropping and reconnecting, such that it is never disconnected for a full 1 minute and 40 seconds but keeps disconnecting and reconnecting before the 20 seconds are up?
Has anyone encountered this issue before? My temporary solution is to send a cancellationToken with a client-side timeout specified. Just wondering if there are any other reasons for the indefinite polling?
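For reference, the client-side timeout I mean looks roughly like this (30 seconds is just an example value, comfortably longer than the 20-second long poll):

// Sketch: bound each long poll with a client-side timeout so a dead connection cannot hang forever.
using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
{
    var response = await _sqsClient.ReceiveMessageAsync(request, cts.Token).ConfigureAwait(false);
    // process response.Messages ...
}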
Thank you very much for the help!

Performance testing API - WebClient.DownloadData async issue

I am busy doing performance testing on our public API by loading it with parallel, simultaneous calls. Code below.
int batchSize = 10;
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = batchSize;
Parallel.For(0, batchSize, parallelOptions, j =>
{
    Debug.WriteLine("Thread began at " + DateTime.Now.ToLongTimeString());
    using (WebClient client = new WebClient())
    {
        Stopwatch sw = Stopwatch.StartNew();
        byte[] arr = client.DownloadData("http://myapiurl/webservice.svc");
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds.ToString());
    }
});
But I am getting weird results:
From the debug output, I can see that all the threads are starting at the exact same time (as expected).
I am also recording the time taken to process the API call from within the web service (this is stored in a log table). Each call is taking around about the same time... about 2.5 seconds.
But now the console output doesn't correlate. I would expect it to be only slightly longer than what the web service records. Output:
2883
2914
5653
5822
8000
8250
10215
10539
11622
12494
I can come up with the following possible reasons for this:
It is as if WebClient.DownloadData is queuing up my requests across instances of itself.
IIS is queuing up my web requests. This can't be possible as nothing else is hitting the API.
All HTTP requests are moderated by the ServicePointManager, which manages pools of connections to various hosts. There is a limit for concurrent connections (and therefore HTTP requests) per host. This can be increased with a call to:
ServicePointManager.FindServicePoint("http://myapiurl/webservice.svc")
.ConnectionLimit = 100; //arbitrary value
It's also worth remembering that the .NET implementation of HttpWebRequest (which is what WebClient uses) can never be truly asynchronous, because the DNS lookup occurs synchronously before the request is issued asynchronously. I've always considered this a poor design decision that prevents high-performance HTTP requests (especially in spidering/crawling scenarios).

Task.Factory.StartNew - confused about the pool

Hi, I'm getting myself tied up with Task.Factory.StartNew. Just as I think I get the idea of it, someone has suggested I write the following code:
bool exitLoop = false;
while (!exitLoop)
{
    exitLoop = true;
    var messages = Queue.GetMessages(20);
    foreach (var message in messages)
    {
        exitLoop = false;
        Task.Factory.StartNew(() =>
        {
            DeliverMessage(message);
        });
    }
}
In theory this is going to drain a queue, 20 messages at a time, attempting to create a Task for every message in the queue. So if we had 1,000 messages in the queue, then in an instant we'd have 25 tasks and it would eat its way through all the messages. I previously thought I understood this: I thought StartNew would block once it ran out of entries, and in the old days that would have been ~25. But this is .NET 4.5, and I'm now under the impression that the upper limit for a pool is pretty high. What puzzles me is that I would have assumed this is going to flood the pool with new tasks and start blocking, i.e. in an instant I now have 1,000 tasks running. So if the pool limit is now hardly a limit, why am I not seeing 1,000 tasks?
[Edit]
OK, so what I'm seeing is that 1,000 tasks are queued to run, rather than running. So how do I determine the number of running/runnable tasks?
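(One diagnostic I can think of is an Interlocked counter around the delegate; a sketch with a hypothetical _running field, not something in the code above:)

private static int _running;   // hypothetical field added only for diagnostics

Task.Factory.StartNew(() =>
{
    int now = Interlocked.Increment(ref _running);
    Console.WriteLine("Delegates running concurrently: " + now);
    try
    {
        DeliverMessage(message);
    }
    finally
    {
        Interlocked.Decrement(ref _running);
    }
});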
I know this is quite a while after your post, but I hope this may help someone facing your specific challenge. Your last comment stated that the 'DeliverMessage' method was making HTTP requests.
If you are using the 'WebClient' object (for example) to make your requests, it will be bound by the ServicePointManager.DefaultConnectionLimit property. This means it will create at most two (by default) concurrent connections to the host. If you created 1,000 parallel tasks, all 1,000 of those would have to be serviced by those two connections.
You'll have to play around with different values for this setting to find the right balance between throughput in your application and load on the web server.
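For example, raising the limit before starting the tasks (100 is just a placeholder value):

// Must run before the first request creates its ServicePoint.
System.Net.ServicePointManager.DefaultConnectionLimit = 100;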

Why does this code fail when executed via TPL/Tasks?

I am using System.Net.Http to use network resources. When running on a single thread it works perfectly. When I run the code via TPL, it hangs and never completes until the timeout is hit.
What happens is that all the threads end up waiting on the sendTask.Result line. I am not sure what they are waiting on, but I assume it is something in HttpClient.
The networking code is:
using (var request = new HttpRequestMessage(HttpMethod.Get, "http://google.com/"))
{
    using (var client = new HttpClient())
    {
        var sendTask = client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        using (var response = sendTask.Result)
        {
            var streamTask = response.Content.ReadAsStreamAsync();
            using (var stream = streamTask.Result)
            {
                // problem occurs in line above
            }
        }
    }
}
The TPL code that I am using is as follows. The Do method contains exactly the code above.
var taskEnumerables = Enumerable.Range(0, 100);
var tasks = taskEnumerables.Select(x => Task.Factory.StartNew(() => _Do(ref count))).ToArray();
Task.WaitAll(tasks);
I have tried a couple of different schedulers, and the only way that I can get it to work is to write a scheduler that limits the number of running tasks to 2 or 3. However, even this fails sometimes.
I would assume that my problem is in HttpClient, but for the life of me I can't see any shared state in my code. Does anyone have any ideas?
Thanks,
Erick
I finally found the issue. The problem was that HttpClient issues its own additional tasks, so a single task that I start might actually end up spawning 5 or more tasks.
The scheduler was configured with a limit on the number of tasks. I started the task, which caused the number of running tasks to hit the max limit. The HttpClient then attempted to start its own tasks, but because the limit was reached, it blocked until the number of tasks went down, which of course never happened, as they were waiting for my tasks to finish. Hello deadlock.
The morals of the story:
Tasks might be a global resource
There are often non-obvious interdependencies between tasks
Schedulers are not easy to work with
Don't assume that you control either schedulers or number of tasks
I ended up using another method to throttle the number of connections.
Erick
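(The post doesn't say which throttling method was used in the end; one common alternative to a limiting scheduler is a SemaphoreSlim around the request itself. A sketch:)

// Sketch: cap concurrent HTTP calls without constraining the scheduler.
private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);   // placeholder limit

private static async Task DoAsync(HttpClient client)
{
    await Throttle.WaitAsync().ConfigureAwait(false);
    try
    {
        using (var response = await client.GetAsync("http://google.com/", HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
        using (var stream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
        {
            // consume the stream ...
        }
    }
    finally
    {
        Throttle.Release();
    }
}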

Multi-threaded HttpListener with await async and Tasks

Would this be a good example of a scalable HttpListener that is multi-threaded?
Is this how for example a real IIS would do it?
public class Program
{
    private static readonly HttpListener Listener = new HttpListener();

    public static void Main()
    {
        Listener.Prefixes.Add("http://+:80/");
        Listener.Start();
        Listen();
        Console.WriteLine("Listening...");
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }

    private static async void Listen()
    {
        while (true)
        {
            var context = await Listener.GetContextAsync();
            Console.WriteLine("Client connected");
            Task.Factory.StartNew(() => ProcessRequest(context));
        }
        Listener.Close();
    }

    private static void ProcessRequest(HttpListenerContext context)
    {
        System.Threading.Thread.Sleep(10 * 1000);
        Console.WriteLine("Response");
    }
}
I'm specifically looking for a scalable solution that DOES NOT rely on IIS. Instead, it should rely only on http.sys (which is what the HttpListener class uses). The reason for not relying on IIS is that the government area I work in requires an extremely reduced attack surface.
I've done something similar at https://github.com/JamesDunne/Aardwolf and have done some extensive testing on this.
See the code at https://github.com/JamesDunne/aardwolf/blob/master/Aardwolf/HttpAsyncHost.cs#L107 for the core event loop's implementation.
I find that using a Semaphore to control how many concurrent GetContextAsync requests are active is the best approach. Essentially, the main loop continues running until the semaphore blocks the thread due to the count being reached. Then there will be N concurrent "connection accepts" active. Each time a connection is accepted, the semaphore is released and a new request can take its place.
The semaphore's initial and max count values require some fine tuning, depending on the load you expect to receive. It's a delicate balancing act between the number of concurrent connections you expect and the average response times your clients desire. Higher values mean more connections can be maintained, yet at a much slower average response time; fewer connections will be rejected. Lower values mean fewer connections can be maintained, yet at a much faster average response time; more connections will be rejected.
I've found, experimentally (on my hardware), that values around 128 allow the server to handle large amounts of concurrent connections (up to 1,024) at acceptable response times. Test using your own hardware and tune your parameters accordingly.
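(A minimal sketch of the accept-loop pattern described above, using SemaphoreSlim; this is an illustration, not the actual Aardwolf code, and ProcessRequestAsync is a placeholder for your handler:)

// Sketch: keep up to N GetContextAsync accepts in flight at once.
private static async Task ListenAsync(HttpListener listener, int maxConcurrentAccepts)
{
    var accepts = new SemaphoreSlim(maxConcurrentAccepts);
    while (listener.IsListening)
    {
        await accepts.WaitAsync().ConfigureAwait(false);   // blocks the loop once N accepts are pending
        var _ = AcceptAsync(listener, accepts);            // start an accept without awaiting it here
    }
}

private static async Task AcceptAsync(HttpListener listener, SemaphoreSlim accepts)
{
    HttpListenerContext context;
    try
    {
        context = await listener.GetContextAsync().ConfigureAwait(false);
    }
    finally
    {
        accepts.Release();   // let the loop start another accept whether or not this one succeeded
    }
    await ProcessRequestAsync(context).ConfigureAwait(false);
}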
I've also found that a single instance of WCAT does not like to handle more than 1,024 connections itself. So if you're serious about load-testing, use multiple client machines with WCAT against your server and be sure to test over a fast network e.g. 10 GbE and that your OS's limits are not slowing you down. Be sure to test on Windows Server SKUs because the Desktop SKUs are limited by default.
Summary:
How you write your connection accept loop is critical to the scalability of your server.
Technically you're right. To make it scalable you probably want to have multiple GetContextAsync running at the same time (performance testing needed to know exactly how many, but "a few for each core" is probably the right answer).
Then naturally, as pointed out in the comments, not using IIS means you need to be pretty serious about security for a lot of things IIS gives you "for free".
I know I'm tremendously late to the party on this, but I published a library (source here https://github.com/jchristn/WatsonWebserver) on NuGet which encapsulates an async webserver.
Here's a pattern to use a cancellation token to shut the listener down cleanly:
try
{
    while (active)
    {
        Task<HttpListenerContext> listenTask = httpListener.GetContextAsync();
        listenTask.Wait(myCancelToken.Token);
        HttpListenerContext listenerContext = listenTask.Result;
        // Do something with listenerContext in a separate thread or task...
    }
}
catch (System.OperationCanceledException)
{
    // This is expected!
}
httpListener.Close();
Note that this should be executed in its own thread or task to prevent blocking of other code.
