I have an application written in C# that long polls a SQS queue with a ReceiveWaitTime of 20 seconds, and the max number of messages read is one.
This application uses AWSSDK.SQS (3.3.3.62).
I am running into an issue where the polling seems to hang indefinitely and does not recover until the application is restarted (when the application is restarted, we re-create the Message Queue Monitor and start polling again from there).
Here is the bit of code that does the polling:
private async Task ReceiveInternalAsync(Func<IMessageReceipt, Task> onMessageReceived,
    bool processAsynchronously, TimeSpan? maximumWait = null,
    CancellationToken? cancellationToken = null, int maxNumberOfMessages = 1)
{
    var request = new ReceiveMessageRequest();
    var totalWait = maximumWait ?? TimeSpan.MaxValue;
    request.QueueUrl = Address;
    request.WaitTimeSeconds = GetSqsWaitTimeSeconds(totalWait, cancellationToken.HasValue);
    request.MaxNumberOfMessages = maxNumberOfMessages;

    var stopwatch = Stopwatch.StartNew();
    Amazon.SQS.Model.Message[] sqsMessages;

    while (true)
    {
        var stopwatch2 = Stopwatch.StartNew();
        var response = await _sqsClient.ReceiveMessageAsync(request).ConfigureAwait(false);
        stopwatch2.Stop();

        sqsMessages = response.Messages.Where(i => i != null).ToArray();
        _logger.LogDebug($"{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms");
        ...
    }
}
Where the parameters being sent to this method are:
onMessageReceived = a delegate to handle the received message
processAsynchronously = true
maximumWait = 20 seconds (new TimeSpan(0,0,20))
cancellationToken = null
maxNumberOfMessages = 1
I have omitted the rest of the while loop, as I don't believe it's looping indefinitely in there, but I am more than happy to share the rest of it if we think it could be the crux of the issue.
The reason I believe it's the SDK that is hanging is that I never see the debug message
{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms
appear, and I know execution has reached this method because the caller logs that it has called it (let me know if I should share the caller's code as well).
I looked up similar issues online and I found this:
https://github.com/aws/aws-sdk-net/issues/609
which seems similar to what I have.
The issue is that this only seems to happen in production; locally I cannot replicate it to the full extent where it never polls again.
What I have done locally is:
Scenario 1: disconnect completely from the internet
1. Long poll a queue that has no messages in it.
2. Disconnect from the Internet before the 20 seconds are up.
3. At about 1 minute and 40 seconds, the AWS SDK does not throw an error but carries on as if the queue had returned no messages.
4. At about 2 minutes to 2 minutes and 30 seconds in, I get a DNS name resolution error.
Scenario 2: disconnect from the internet for 1 minute and 40 seconds and reconnect
Based on my analysis of the above scenario, I then wondered what would happen if I were to reconnect after step 3 of scenario 1.
I found that the AWS SDK will wait for 20 seconds to retrieve any messages from the queue.
Theory as to what's happening
I suppose indefinite polling could happen if the client's network keeps dropping and reconnecting, such that it is never disconnected for a full 1 minute and 40 seconds but keeps losing the connection before the 20 seconds are up?
Has anyone encountered this issue before? My temporary solution is to send a cancellationToken with a client-side timeout specified (a sketch of this is below). Just wondering if there are any other reasons for the indefinite polling?
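For reference, here is a minimal sketch of that workaround, passing a client-side timeout token into ReceiveMessageAsync; the 30-second value and the _sqsClient name are placeholders for illustration, not the exact production code:

// Minimal sketch of the client-side timeout workaround.
// ReceiveMessageAsync(request, token) is the overload provided by AWSSDK.SQS.
using (var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken ?? CancellationToken.None))
{
    // Allow a bit more than WaitTimeSeconds so a normal empty long poll is not cancelled.
    cts.CancelAfter(TimeSpan.FromSeconds(30));
    try
    {
        var response = await _sqsClient.ReceiveMessageAsync(request, cts.Token).ConfigureAwait(false);
        // ... process response.Messages as before ...
    }
    catch (OperationCanceledException)
    {
        // The call outlived the client-side timeout; loop around and poll again.
    }
}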
Thank you very much for the help!
Related
I have an AWS Lambda function written in C# which:
is triggered by a message on a SQS queue
makes 2 (slow/long duration) HTTP REST calls to external (non-AWS) services
sends a message to an SQS Queue
I have configured the Lambda Basic Settings Timeout to 2 minutes.
However, if the 2 HTTP REST calls take more than 30 seconds in total, the Lambda times out:
Here is the relevant code; you can see the matching log statements in the code and in the logs:
static void get1()
{
    using var client = new HttpClient();
    Console.WriteLine("Before get1");
    var task = Task.Run(() => client.GetAsync("http://slowwly.robertomurray.co.uk/delay/35000/url/http://www.google.co.uk"));
    Console.WriteLine("get1 initiated, about to wait");
    task.Wait();
    Console.WriteLine("get1 wait complete");
    var result = task.Result;
    Console.WriteLine("After get1, result: " + result.StatusCode);
}
This service http://slowwly.robertomurray.co.uk/delay/35000/url/http://www.google.co.uk, just delays for 35000 milliseconds then provides a response from "http://www.google.co.uk".
If the HTTP REST calls take less than 30 seconds, the Lambda completes and writes a message to the output SQS queue. In this example, I changed the delay/sleep durations to 5 seconds instead of 35 seconds, so the total execution time was less than 30 seconds:
In case the issue was somehow related to the usage of C# GetAsync / task.Wait(), I just tested and found the same timeout behaviour if I instead call:
static void sleepSome(int durationInSeconds)
{
    Console.WriteLine("About to sleep for " + durationInSeconds + " seconds");
    Thread.Sleep(durationInSeconds * 1000);
    Console.WriteLine("Sleep over");
}
Which gives me log output of:
I am wondering if I should use an AWS SDK API from within my Lambda to log the configured timeout to the console, just to prove that the timeout I have configured is actually active/valid/heeded.
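For what it's worth, a minimal sketch of one way to check this without an SDK call is to log ILambdaContext.RemainingTime at handler entry; the handler signature and log text here are illustrative, not the original code:

// Sketch (illustrative handler signature): ILambdaContext.RemainingTime reflects the
// configured timeout minus elapsed time, so logging it at entry shows the active limit.
public async Task FunctionHandler(SQSEvent sqsEvent, ILambdaContext context)
{
    context.Logger.LogLine($"Remaining time at handler entry: {context.RemainingTime.TotalSeconds:F0}s");
    // ... existing processing ...
}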
The full end to end orchestration here, in case it is relevant is:
Postman Test client ->
AWS API GW ->
AWS Lambda1 ->
AWS SQS ->
AWS Lambda2 ->
REST API Calls
AWS SQS
AWS Lambda2 is the one that is timing out prematurely, as shown in the logs.
I only seem to have a single version:
And a single alias:
While the Lambda itself has a 2 minute timeout, the timeout you are seeing might actually be due to the AWS API Gateway limit of 30 seconds: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
I'm using Microsoft.Azure.ServiceBus. (doc)
I was getting an exception of:
The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue.
With the help of these questions: 1, 2, 3
I am able to avoid the exception by setting AutoComplete to false and by increasing the queue's lock duration in Azure to its maximum (from 30 seconds to 5 minutes).
_queueClient.RegisterMessageHandler(ProcessMessagesAsync,
    new MessageHandlerOptions(ExceptionReceivedHandler)
    {
        MaxConcurrentCalls = 1,
        MaxAutoRenewDuration = TimeSpan.FromSeconds(10),
        AutoComplete = false
    });

private async Task ProcessMessagesAsync(Message message, CancellationToken token)
{
    await ProccesMessage(message);
}

private async Task ProccesMessage(Message message)
{
    // The message is completed before the long-running process starts.
    await _queueClient.CompleteAsync(message.SystemProperties.LockToken);
    await DoFoo(message.Body); // some long running process
}
My questions are:
This answer suggested that the exception was raised because the lock expired before the long-running process finished, but in my case I was marking the message as complete immediately (before the long-running process), so I'm not sure why changing the lock duration in Azure made any difference. When I change it back to 30 seconds I can see the exception again.
Not sure if it is related to the question, but what is the purpose of MaxAutoRenewDuration? The official docs say it is "The maximum duration during which locks are automatically renewed." In my case I have only one app receiving from this queue, so is it not needed, since I do not need to lock the message against another app picking it up? And why should this value be greater than the longest message lock duration?
There are a few things you need to consider.
Lock duration
Total time since a message was acquired from the broker
The lock duration is simple: it is how long a single competing consumer can lease a message without that message being leased to any other competing consumer.
The total time is a bit trickier. The ProcessMessagesAsync callback you registered to receive messages is not the only thing involved. In the code sample you've provided, you're setting the concurrency to 1. If prefetch is configured (the client fetches more than one message with each request to the broker), the lock duration clock on the server starts ticking for all of those messages as soon as they are fetched. So even if each message is processed in slightly under MaxLockDuration, the last prefetched message may have waited too long to get processed; even though its own processing takes less than the lock duration, it can lose its lock, and the exception will be thrown when you attempt to complete it.
This is where MaxAutoRenewDuration comes into play. What it does is extend the message lease with the broker, "re-locking" it for the competing consumer that is currently handling the message. MaxAutoRenewDuration should be set to the maximum time a lease could possibly be required for. In your sample it's set to TimeSpan.FromSeconds(10), which is extremely low. It needs to be at least longer than MaxLockDuration and adjusted to the longest period of time ProccesMessage will need to run, taking prefetching into consideration.
To help visualize it, think of the client side as having an in-memory queue where messages are stored while you process them serially, one by one, in your handler. The lease starts the moment a message arrives from the broker into that in-memory queue. If the total time in the in-memory queue plus the processing time exceeds the lock duration, the lease is lost. Your options are:
1. Enable concurrent processing by setting MaxConcurrentCalls > 1
2. Increase MaxLockDuration
3. Reduce message prefetch (if you use it)
4. Configure MaxAutoRenewDuration to renew the lock and overcome the MaxLockDuration constraint
A note about #4: lock renewal is not a guaranteed operation, so there's a chance a call to the broker will fail and the message lock will not be extended. I recommend designing your solution to work within the lock duration limit. Alternatively, persist the message information so that your processing doesn't have to be constrained by the messaging. A configuration sketch for option 4 follows.
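Purely for illustration, here is a minimal configuration sketch for option 4, assuming a 5 minute lock duration on the queue and a worst-case processing time of around 10 minutes; both values are placeholders, not taken from the question:

// Sketch: MaxAutoRenewDuration should exceed the lock duration and cover the longest
// time a message can spend waiting in the client plus being processed.
_queueClient.PrefetchCount = 0; // avoid prefetched messages burning their lock while queued client-side
_queueClient.RegisterMessageHandler(ProcessMessagesAsync,
    new MessageHandlerOptions(ExceptionReceivedHandler)
    {
        MaxConcurrentCalls = 1,
        MaxAutoRenewDuration = TimeSpan.FromMinutes(10), // placeholder: > lock duration, >= worst-case processing
        AutoComplete = false
    });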
I have a web job that consumes messages from an Azure Service Bus Topic by registering an OnMessage callback. The message lock duration was set to 30 seconds and the lock renew timeout to 60 seconds, so jobs taking more than 30 seconds to process a Service Bus message were getting a lock expired exception.
Now, I have set the message lock duration to more than the lock renew timeout. But somehow it still throws the same exception. I also restarted my web job, but still no luck.
I tried running the same web job consuming messages from a different topic with the latter settings and it works fine. Is this behaviour expected, and after how much time does a change to this setting normally take effect?
Any help will be great.
I have set the message lock duration to more than the lock renew timeout. But somehow it still throws the same exception.
The max value of the lock duration is 5 minutes. If you need less than 5 minutes to process the job, you can increase the lock duration of your message to meet your requirement.
If you need more than 5 minutes to process your job, you need to set the AutoRenewTimeout property of OnMessageOptions. It will renew the lock when it is about to expire, up until AutoRenewTimeout is reached. For example, if you set the lock duration to 1 minute and AutoRenewTimeout to 5 minutes, the message will be kept locked for up to 5 minutes if you don't release the lock.
Here is the sample code I used to test the lock duration and AutoRenewTimeout on my side. If the job takes more time than the lock duration and AutoRenewTimeout allow, an exception is thrown when we complete the message (meaning a timeout happened). I also modified the lock duration in the portal, and the configuration was applied immediately when I received a message.
SubscriptionClient Client = SubscriptionClient.CreateFromConnectionString(connectionString, "topic name", "subscription name");

// Configure the callback options.
OnMessageOptions options = new OnMessageOptions();
options.AutoComplete = false;
options.AutoRenewTimeout = TimeSpan.FromSeconds(60);

Client.OnMessage((message) =>
{
    try
    {
        // Process the message here; the loop below simulates a long-running job.
        for (int i = 0; i < 30; i++)
        {
            Thread.Sleep(3000);
        }
        // Remove message from subscription.
        message.Complete();
    }
    catch (Exception ex)
    {
        // Indicates a problem, unlock message in subscription.
        message.Abandon();
    }
}, options);
For your issue, please check how much time is spent on your job and choose appropriate values for the lock duration and AutoRenewTimeout.
The settings should be reflected almost immediately. Also, the lock renewal duration should probably be longer than the lock duration, or renewal should be disabled.
The lock renewal feature is an ASB client-side feature; it doesn't override the lock duration set on the entity. If you can reproduce this issue and share the repro, raise a support issue with Microsoft.
Background:
I have a service which aggregates data from multiple other services. To make things happen in a timely manner I use async throughout the code, and then gather the various requests into a list of tasks.
Here are some excerpts from the code:
private async Task<List<Foo>> Baz(..., int timeout)
{
    var tasks = new List<Task<IEnumerable<Foo>>>();
    tasks.Add(GetFoo1(..., timeout));
    tasks.Add(GetFoo2(..., timeout));
    // Up to 6, depending on other parameters. Some tasks return multiple objects.
    return await Task.WhenAll(tasks)
        .ContinueWith((antecedent) => { return antecedent.Result.AsEnumerable().SelectMany(f => f).ToList(); })
        .ConfigureAwait(false);
}
private async Task<IEnumerable<Foo>> GetFoo1(..., int timeout)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    var value = await SomeAsyncronousService.GetAsync(..., timeout).ConfigureAwait(false);
    sw.Stop();
    // Record timing...
    return new[] { new Foo(..., value) };
}

private async Task<IEnumerable<Foo>> GetFoo2(..., int timeout)
{
    return await Task.Run(() =>
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var r = new[] { new Foo(..., SomeSyncronousService.Get(..., timeout)) };
        sw.Stop();
        // Record timing...
        return r;
    }).ConfigureAwait(false);
}
// In class SomeAsyncronousService
public async Task<string> GetAsync(..., int timeout)
{
    ...
    try
    {
        using (var httpClient = HttpClientFactory.Create())
        {
            // I have tried it with both timeout and CTS. The behavior is the same.
            //httpClient.Timeout = TimeSpan.FromMilliseconds(timeout);
            var cts = new CancellationTokenSource();
            cts.CancelAfter(timeout);
            var content = ...;
            var responseMessage = await httpClient.PostAsync(Endpoint, content, cts.Token).ConfigureAwait(false);
            if (responseMessage.IsSuccessStatusCode)
            {
                var contentData = await responseMessage.Content.ReadAsStringAsync().ConfigureAwait(false);
                ...
                return ...
            }
            ...
        }
    }
    catch (OperationCanceledException ex)
    {
        // Log statement ...
    }
    catch (Exception ex)
    {
        // Log statement ...
    }
    return ...;
}
The Symptoms:
This code works great on my local machine, and it works fine on our test servers most of the time. However, occasionally we get a bunch of mass recorded timeouts - recorded by the "Record timing" comments above and by the log statements in the OperationCanceledException handlers. I have no way of telling whether the services I call actually timed out.
Now, when I say a series of timeouts, I mean that most or all of the tasks (and the HttpClients that all but one use; the other uses a WCF service) time out at about the same time.
Now, I know what you are thinking: I am passing in the same timeout. That's right, but I pass in 250 ms, and the run time reported by the various stopwatches is around 800 ms or higher.
Now, I do see the OperationCanceledExceptions in the log, but the timestamp of the exception is the same as the timestamp of when the stopwatch ended (or within 2-3 ms), and my service is failing because clients expect it to respond in 500 ms or less, not 800 ms.
Normally the various services respond in less than 100 ms, with a wide variance among the results. When a problem occurs and most/all return in 800 ms or more, they vary by only ~10 ms. The dependencies I call are all on different domains. It seems highly unlikely that all of them are really taking that long to respond, all at the same time.
I suppose there could be a network issue affecting all requests at the same time, but the other services in our network do not experience the same behavior - it is limited to the new service I am writing.
Even if that were the case, I would expect the cancellation exceptions to occur after 250 ms, and then for the task to end and the stopwatch to record 250 ms (plus 5-20 ms or so for exception handling).
So I do not think it is a network issue. I am sure that at least part of the problem is related to me not cancelling / timing out correctly, but it seems to me that all of the outgoing requests from the service are being affected at the same time, independent of HttpClient.
The reason I say that is that the WCF service also shows 800+ ms (according to the stopwatch) when the rest of the requests time out. The WCF service is not asynchronous. The timeout is set like this:
var binding = new BasicHttpBinding()
{
    Security = new BasicHttpSecurity()
    {
        Mode = BasicHttpSecurityMode.TransportCredentialOnly,
        Transport = new HttpTransportSecurity()
        {
            ClientCredentialType = HttpClientCredentialType.Ntlm
        }
    },
    ReceiveTimeout = TimeSpan.FromMilliseconds(timeout)
};
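As an aside, and only as a hedged observation since the rest of the binding setup isn't shown: on a client-side BasicHttpBinding it is normally SendTimeout, not ReceiveTimeout, that bounds a request/reply call. A minimal sketch of setting it:

// Sketch: on the client, SendTimeout covers sending the request and waiting for the reply;
// ReceiveTimeout mainly applies to idle sessions / the service side.
var clientBinding = new BasicHttpBinding
{
    SendTimeout = TimeSpan.FromMilliseconds(timeout),
    ReceiveTimeout = TimeSpan.FromMilliseconds(timeout)
};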
The Problem:
So, in short, I think that something is causing all outgoing requests to any domain to pause or queue, which is causing the observed behavior.
I have spent days trying to figure out what is going on, but have had no luck. Any ideas?
EDIT
I think what is happening is that the requests are being put on hold because there isn't a thread available, and then a few hundred milliseconds later a thread becomes available and the task starts. Timing the method call shows that it is taking 800 ms, but the timeout on the HttpClient doesn't start until a thread is available to run the async call.
It would also explain why I see the method take 800+ ms but sometimes still complete without a timeout exception, while other times it does throw a timeout exception and does not complete.
I have tried setting the ServicePointManager.DefaultConnectionLimit to 200 in Application_Start, but that did not solve the issue.
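As a purely diagnostic sketch (not from the original code), one way to test the "no thread available" theory is to log thread-pool availability just before the outgoing calls, and optionally raise the minimum worker-thread count; the value of 100 is illustrative:

// Diagnostic sketch: if availableWorkers is near zero when the slow calls start,
// thread-pool starvation is a plausible cause of the delayed HttpClient timeouts.
ThreadPool.GetAvailableThreads(out int availableWorkers, out int availableIo);
ThreadPool.GetMinThreads(out int minWorkers, out int minIo);
// Log these values alongside the stopwatch timings...

// Optionally raise the floor so bursts don't wait on thread-pool injection.
ThreadPool.SetMinThreads(100, minIo);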
The service isn't taking that much traffic compared to our other services, and none of the others appear to have the same problem.
Any ideas?
Edit 2
I logged into the box and monitored netstat while doing (minor) load tests.
Using HttpClient, with 1-2 requests per second, the ports would show ESTABLISHED, then move to TIME_WAIT for about 4 minutes. With 3+ requests per second I would end up with a roughly constant 100 ESTABLISHED ports per request-per-second (so about 300 for a 3-per-second load test), and then I would start seeing them go to CLOSE_WAIT instead of TIME_WAIT - indicating an error condition on close. At the same time I would see a spike in the number of exceptions and in the time to execute the requests. (TcpTimedWaitDelay does not apply to CLOSE_WAIT.)
So I rewrote the whole thing to use HttpWebRequest in serial, instead of HttpClient in parallel. Then I ran the same tests.
Now the ESTABLISHED ports equal 0-2 x requests per second, and the ports then move on to TIME_WAIT as expected. The performance and throughput improved, but the problem didn't clear up completely.
Then I set TcpTimedWaitDelay to 30 (default 240). The performance increased dramatically. I have a primitive load test that hits it with 40 requests per second without any issues. I will get a more thorough test setup, but I think the problem has been solved.
I don't know exactly what is going on, but it appears that HttpClient was not closing the ephemeral ports correctly underneath. Many of the developers and architects at my company looked at it and couldn't see anything wrong with the code. I tried having a single HttpClient in a using statement per request, as well as having a single HttpClient per API I call on the back end. I have tried using HttpClient in parallel and in serial. I have tried it with async/await and without. No matter what I tried, the behavior was the same.
I would like to be able to use HttpClient, but I can't spend any more time on this issue as I have it working with HttpWebRequest. My next step is to make the HttpWebRequests occur in parallel.
Thank you for your input.
I've experienced similar frustrations with HttpClient. In my scenario I found that setting MaxServicePointIdleTime to a much lower value and DefaultConnectionLimit to a high value on the ServicePointManager resolved my issues. I believe in my case I was experiencing pool starvation, as the connections were being held open.
You may also want to test without the debugger attached, in release, if you are not already doing so, as the TaskScheduler behaves differently when debugging.
The following MSDN article is very helpful: http://blogs.msdn.com/b/jpsanders/archive/2009/05/20/understanding-maxservicepointidletime-and-defaultconnectionlimit.aspx
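For illustration, a minimal sketch of those two ServicePointManager settings, applied once at application startup; the specific values are placeholders to tune rather than recommendations:

// Sketch: configure before the first outgoing request (e.g. in Application_Start).
// MaxServicePointIdleTime: how long an idle connection is kept alive, in milliseconds.
// DefaultConnectionLimit: maximum concurrent connections per host.
ServicePointManager.MaxServicePointIdleTime = 10000; // placeholder: 10 seconds (default is 100 seconds)
ServicePointManager.DefaultConnectionLimit = 100;    // placeholder: tune for your load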
Hi, I'm getting myself tied up with Task.Factory.StartNew. Just as I think I've got the idea of it, someone has suggested I write the following code:
bool exitLoop = false;
while (!exitLoop)
{
    exitLoop = true;
    var messages = Queue.GetMessages(20);
    foreach (var message in messages)
    {
        exitLoop = false;
        Task.Factory.StartNew(() =>
        {
            DeliverMessage(message);
        });
    }
}
In theory this is going to drain a queue, 20 messages at a time, attempting to create a Task for every message in the queue. So if we had 1000 messages in the queue, then in an instant we'd have 25 tasks and it would eat its way through all the messages. I previously thought I understood this: I thought StartNew would block once it ran out of threads - in the old days that would have been ~25. But given this is .NET 4.5, I'm now under the impression that the upper limit for the pool is pretty high. What puzzles me is that I would have assumed this would flood the pool with new tasks and start blocking, i.e. in an instant I'd have 1000 tasks running. So if the pool limit is now hardly a limit, why am I not seeing 1000 tasks?
[Edit]
OK, so what I'm seeing is that 1000 tasks are queued to run, rather than running. So how do I determine the number of running/runnable tasks?
I know this is quite a while after your post, but I hope this may help someone facing your specific challenge. Your last comment stated that the 'DeliverMessage' method was making HTTP requests.
If you are using the 'WebClient' object (for example) to make your requests, it will be bound by the ServicePointManager.DefaultConnectionLimit property. This means it will create at most two (by default) concurrent connections to the host. If you created 1,000 parallel tasks, all 1,000 of those would have to be serviced by those two connections.
You'll have to play around with different values for this setting to find the right balance between throughput in your application and load on the web server.
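As a hedged sketch of that tuning, one possible approach is to raise the per-host connection limit and cap the number of in-flight deliveries to roughly match it; the limit of 20 and the use of SemaphoreSlim here are illustrative choices, not from the original post:

// Illustrative only: allow more concurrent connections to the host, and throttle
// the task fan-out so it stays roughly in line with that connection limit.
ServicePointManager.DefaultConnectionLimit = 20;

var throttle = new SemaphoreSlim(20);
var inFlight = new List<Task>();

foreach (var message in Queue.GetMessages(20))
{
    throttle.Wait(); // blocks once 20 deliveries are already in flight
    inFlight.Add(Task.Factory.StartNew(() =>
    {
        try { DeliverMessage(message); }
        finally { throttle.Release(); }
    }));
}
Task.WaitAll(inFlight.ToArray());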