AWS Lambda times out at 30 seconds even though the timeout is configured to 2 minutes - C#

I have an AWS Lambda function written in C# which:
is triggered by a message on an SQS queue
makes 2 (slow/long-duration) HTTP REST calls to external (non-AWS) services
sends a message to an SQS queue
I have configured the Lambda Basic Settings Timeout to 2 minutes.
However, if the 2 HTTP REST calls take more than 30 seconds, the Lambda times out.
Here is the relevant code; the log statements in it line up with the log output:
static void get1()
{
    using var client = new HttpClient();
    Console.WriteLine("Before get1");
    // Kick off the slow HTTP call on the thread pool.
    var task = Task.Run(() => client.GetAsync("http://slowwly.robertomurray.co.uk/delay/35000/url/http://www.google.co.uk"));
    Console.WriteLine("get1 initiated, about to wait");
    // Block synchronously until the call completes.
    task.Wait();
    Console.WriteLine("get1 wait complete");
    var result = task.Result;
    Console.WriteLine("After get1, result: " + result.StatusCode);
}
This service, http://slowwly.robertomurray.co.uk/delay/35000/url/http://www.google.co.uk, simply delays for 35000 milliseconds and then returns the response from "http://www.google.co.uk".
If the HTTP REST calls take less than 30 seconds, the Lambda completes and writes a message to the output SQS queue. In this example, I changed the delay/sleep durations to 5 seconds instead of 35, so the total execution time was under 30 seconds.
In case the issue was somehow related to the usage of C# GetAsync / task.Wait(), I just tested and found the same timeout behaviour if I instead call:
static void sleepSome(int durationInSeconds)
{
    Console.WriteLine("About to sleep for " + durationInSeconds + " seconds");
    Thread.Sleep(durationInSeconds * 1000);
    Console.WriteLine("Sleep over");
}
This gives me the same prematurely truncated log output.
I am wondering if I should use an AWS SDK API from within my Lambda to log the configured timeout to the console, just to prove that the timeout I have configured is actually active.
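For reference, no SDK call should be needed: the ILambdaContext passed to the handler already exposes the remaining execution time, so logging it at entry would confirm the configured timeout is active. A minimal sketch, assuming an SQS-triggered handler shape:

// Sketch only: the handler signature is an assumption; RemainingTime comes from Amazon.Lambda.Core.
using System;
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;

public class Function
{
    public void Handler(SQSEvent sqsEvent, ILambdaContext context)
    {
        // At entry this should print roughly the configured timeout (e.g. ~00:02:00).
        Console.WriteLine("Remaining time at start: " + context.RemainingTime);
    }
}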
The full end-to-end orchestration, in case it is relevant, is:
Postman test client ->
AWS API GW ->
AWS Lambda1 ->
AWS SQS ->
AWS Lambda2 ->
REST API calls ->
AWS SQS
AWS Lambda2 is the one that is timing out prematurely, as shown in the logs.
I only seem to have a single version, and a single alias.

While your Lambda itself has a 2-minute timeout, the timeout you are seeing might actually be the AWS API Gateway integration limit of 30 seconds: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html

Related

How to extend the timeout of Azure Functions using Durable Functions?

Let's say I have an orchestrator function that chains activities like this:
[FunctionName("E1")] //default timeout of 5 minutes
public static async Task<List<string>> Run(
[OrchestrationTrigger] IDurableOrchestrationContext context)
{
var outputs = new List<string>();
outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Tokyo")); //takes 5 minutes to complete
outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Seattle")); //takes 5 minutes complete
outputs.Add(await context.CallActivityAsync<string>("E1_SayHello_DirectInput", "London")); //takes 5 minutes complete
// should return ["Hello Tokyo!", "Hello Seattle!", "Hello London!"]
return outputs;
}
Now we have three functions. Let's say each one needs 5 minutes to complete (on the default Azure Consumption plan). People say each function has its own timeout, so we should have around 15 minutes in total (5+5+5) to complete them all; however, the top-level function E1 only has a timeout of 5 minutes. Will it time out before completing because the total of all sub-functions exceeds its 5-minute limit?
And if the E1 orchestrator times out, do the activities or sub-functions stop as well?
The beauty of durable functions is that the orchestrator is only active while it is orchestrating. When it reaches await context.CallActivityAsync it will start E1_SayHello, but it won't stay running while waiting for completion. Instead, the orchestrator will unload and resume once E1_SayHello has completed.
What you are doing is called the function chaining pattern, and the behavior I described above is documented there like this:
Each time the code calls await, the Durable Functions framework checkpoints the progress of the current function instance. If the process or virtual machine recycles midway through the execution, the function instance resumes from the preceding await call.
So no, the durable function won't be active the whole 15 minutes.
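For reference, a minimal sketch of what the E1_SayHello activity referenced above might look like (only the name comes from the question; the body is an assumption):

[FunctionName("E1_SayHello")]
public static string SayHello([ActivityTrigger] string name, ILogger log)
{
    // The real activity is assumed to do ~5 minutes of work before returning.
    log.LogInformation($"Saying hello to {name}.");
    return $"Hello {name}!";
}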

Long Polling in AWS SQS causes application to hang

I have an application written in C# that long polls a SQS queue with a ReceiveWaitTime of 20 seconds, and the max number of messages read is one.
This application uses AWSSDK.SQS (3.3.3.62).
I am running into an issue where the polling seems to hang indefinitely and never recovers until the application is restarted (on restart, we re-create the message queue monitor and start polling again).
Here is the bit of code that does the polling:
private async Task ReceiveInternalAsync(Func<IMessageReceipt, Task> onMessageReceived,
    bool processAsynchronously, TimeSpan? maximumWait = null,
    CancellationToken? cancellationToken = null, int maxNumberOfMessages = 1)
{
    var request = new ReceiveMessageRequest();
    var totalWait = maximumWait ?? TimeSpan.MaxValue;
    request.QueueUrl = Address;
    request.WaitTimeSeconds = GetSqsWaitTimeSeconds(totalWait, cancellationToken.HasValue);
    request.MaxNumberOfMessages = maxNumberOfMessages;
    var stopwatch = Stopwatch.StartNew();
    Amazon.SQS.Model.Message[] sqsMessages;
    while (true)
    {
        var stopwatch2 = Stopwatch.StartNew();
        // This is the call that appears to hang indefinitely in production.
        var response = await _sqsClient.ReceiveMessageAsync(request).ConfigureAwait(false);
        stopwatch2.Stop();
        sqsMessages = response.Messages.Where(i => i != null).ToArray();
        _logger.LogDebug($"{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms");
        ...
    }
}
Where the parameters being sent to this method are:
onMessageReceived = a delegate to handle the received message
processAsynchronously = true
maximumWait = 20 seconds (new TimeSpan(0,0,20))
cancellationToken = null
maxNumberOfMessages = 1
I have omitted the rest of the while loop, as I don't believe it's looping indefinitely in there, but I am more than happy to share the rest of it if we think it could be the crux of the issue.
The reason I believe it's the SDK that is hanging is that I never see the debug message
{request.QueueUrl} {sqsMessages.Length} messages received after {stopwatch2.ElapsedMilliseconds} ms
appear, and I know it has reached this method because the caller logs that it has called it (let me know if I should share the caller's code as well).
I looked up similar issues online and I found this:
https://github.com/aws/aws-sdk-net/issues/609
which seems similar to what I have.
The issue is that this seems to happen only in production, whereas locally I cannot replicate it to the full extent where it never polls again.
What I have done locally is:
Scenario 1: disconnect completely from the internet
1. Long-poll a queue that has no messages in it
2. Disconnect from the Internet before the 20 seconds are up
3. For about 1 minute and 40 seconds, the AWS SDK does not throw an error but carries on as if the queue were returning empty results
4. About 2 minutes to 2 minutes 30 seconds in, I get a DNS name resolution error
Scenario 2: disconnect from the internet for 1 minute and 40 seconds and reconnect
Based on my analysis of the scenario above, I wondered what would happen if I were to reconnect after step 3 of scenario 1.
I found that the AWS SDK then goes back to waiting the full 20 seconds to retrieve messages from the queue.
Theory as to what's happening
I suppose we could end up with indefinite polling if the client's network keeps dropping and recovering, such that it is never disconnected for a full 1 minute and 40 seconds but keeps disconnecting and reconnecting before the 20 seconds are up?
Has anyone encountered this issue before? My temporary solution is to pass a cancellationToken with a client-side timeout specified. Just wondering if there are any other reasons for the indefinite polling?
Thank you very much for the help!
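For reference, a minimal sketch of the client-side timeout workaround mentioned above (the 30-second cap is an illustrative value):

// Cap each ReceiveMessageAsync call client-side so a silently dead connection
// cannot stall the polling loop forever.
using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
{
    try
    {
        var response = await _sqsClient.ReceiveMessageAsync(request, cts.Token).ConfigureAwait(false);
        // ... process response.Messages as before ...
    }
    catch (OperationCanceledException)
    {
        // The call exceeded the client-side cap; log and loop to poll again.
    }
}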

How to implement exponential backoff in Azure Functions?

I have a function that depends on an external API, and I would like to handle the unavailability of that service using a retry policy.
The function is triggered when a new message appears in the queue, and in this case such a policy is turned on by default:
For most triggers, there is no built-in retry when errors occur during function execution. The two triggers that have retry support are Azure Queue storage and Azure Blob storage. By default, these triggers are retried up to five times. After the fifth retry, both triggers write a message to a special poison queue.
Unfortunately, the retry starts immediately after the exception (TimeSpan.Zero), and this is pointless in this case, because the service is most likely still unavailable.
Is there a way to dynamically modify the time the message is again available in the queue?
I know that I can set visibilityTimeout (host.json reference), but it's set for all queues and that is not what I want to achieve here.
I found one workaround, but it is far from an ideal solution: in case of an exception, we can add the message to the queue again and set a visibilityTimeout for that message:
[FunctionName("Test")]
public static async Task Run([QueueTrigger("queue-test")]string myQueueItem, TraceWriter log,
ExecutionContext context, [Queue("queue-test")] CloudQueue outputQueue)
{
if (true)
{
log.Error("Error message");
await outputQueue.AddMessageAsync(new CloudQueueMessage(myQueueItem), TimeSpan.FromDays(7),
TimeSpan.FromMinutes(1), // <-- visibilityTimeout
null, null).ConfigureAwait(false);
return;
}
}
Unfortunately, this solution is weak because it has no context: I do not know which attempt this is, so I cannot limit the number of calls or modify the delay (exponential backoff).
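One way to give that workaround the missing context is to carry the attempt count inside the message payload itself. A sketch under that assumption (the envelope type, the limit of 5, and the base-2 backoff are all illustrative):

public class RetryEnvelope
{
    public int Attempt { get; set; }
    public string Payload { get; set; }
}

[FunctionName("TestWithBackoff")]
public static async Task Run([QueueTrigger("queue-test")] string myQueueItem, TraceWriter log,
    [Queue("queue-test")] CloudQueue outputQueue)
{
    var envelope = JsonConvert.DeserializeObject<RetryEnvelope>(myQueueItem);
    try
    {
        // ... call the external API with envelope.Payload ...
    }
    catch (Exception ex)
    {
        if (envelope.Attempt >= 5) // illustrative retry limit
        {
            log.Error($"Giving up after {envelope.Attempt} attempts: {ex.Message}");
            return; // or forward to a manual poison queue
        }
        envelope.Attempt++;
        var delay = TimeSpan.FromSeconds(Math.Pow(2, envelope.Attempt)); // 2, 4, 8, 16, 32 seconds
        await outputQueue.AddMessageAsync(new CloudQueueMessage(JsonConvert.SerializeObject(envelope)),
            TimeSpan.FromDays(7), delay, null, null).ConfigureAwait(false);
    }
}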
An internal retry policy is also not welcome, because keeping the function running while it retries can drastically increase costs (see the pricing model).
Microsoft added retry policies around November 2020 (preview), which support exponential backoff:
[FunctionName("Test")]
[ExponentialBackoffRetry(5, "00:00:04", "00:15:00")] // retries with delays increasing from 4 seconds to 15 minutes
public static async Task Run([QueueTrigger("queue-test")]string myQueueItem, TraceWriter log, ExecutionContext context)
{
// ...
}
I had a similar problem and ended up using durable functions, which have a built-in automatic retry feature. You can use it by wrapping your external API call in an activity; when calling this activity you can configure the retry behavior through an options object (see the sketch after this list). You can set the following options:
Max number of attempts: The maximum number of retry attempts.
First retry interval: The amount of time to wait before the first retry attempt.
Backoff coefficient: The coefficient used to determine rate of increase of backoff. Defaults to 1.
Max retry interval: The maximum amount of time to wait in between retry attempts.
Retry timeout: The maximum amount of time to spend doing retries. The default behavior is to retry indefinitely.
Handle: A user-defined callback can be specified to determine whether a function should be retried.
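A sketch of how those options are wired up when calling the activity (the orchestrator shape, activity name, and values are illustrative):

[FunctionName("Orchestrator")]
public static async Task<string> RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var retryOptions = new RetryOptions(
        firstRetryInterval: TimeSpan.FromSeconds(5),
        maxNumberOfAttempts: 5)
    {
        BackoffCoefficient = 2.0,                   // 5 s, 10 s, 20 s, 40 s between attempts
        MaxRetryInterval = TimeSpan.FromMinutes(15) // cap on the delay between attempts
    };

    // The framework retries the activity per the options above before failing the orchestration.
    return await context.CallActivityWithRetryAsync<string>(
        "CallExternalApi", retryOptions, context.GetInput<string>());
}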
One option to consider is to have your Function invoke a Logic App that has a delay set to your desired amount of time and then, after the delay, invokes the function again. You could also add other retry logic (like the number of attempts) to the Logic App, using some persistent storage to tally your attempts. You would only invoke the Logic App if there was a connection issue.
Alternatively, you could shift your process's starting point to Logic Apps, as they can also be triggered by (think bound to) queue messages. Either way, Logic Apps adds the ability to pause and re-invoke the Function and/or process.
If you are explicitly completing/dead-lettering messages ("autoComplete": false), here's a helper function that will exponentially delay and retry until the max delivery count is reached:
public static async Task ExceptionHandler(IMessageSession MessageSession, string LockToken, int DeliveryCount)
{
    if (DeliveryCount < Globals.MaxDeliveryCount)
    {
        // The delay grows exponentially with each delivery attempt.
        var DelaySeconds = Math.Pow(Globals.ExponentialBackoff, DeliveryCount);
        await Task.Delay(TimeSpan.FromSeconds(DelaySeconds));
        await MessageSession.AbandonAsync(LockToken);
    }
    else
    {
        await MessageSession.DeadLetterAsync(LockToken);
    }
}
Since November 2022, function-level retries are no longer supported for QueueTrigger (source).
Instead, you must use the binding extensions:
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "exponential",
        "tryTimeout": "00:01:00",
        "delay": "00:00:00.80",
        "maxDelay": "00:01:00",
        "maxRetries": 3
      }
    }
  }
}

Service Bus receive not returning immediately

I'm using Azure Service Bus topics/subscriptions and running into unexplained behavior when receiving messages.
According to MSDN:
If Zero is passed in serverWaitTime, then there will be no wait time. Instead, Service Bus will fetch whatever messages are immediately available, or return null as a result.
In my code, which is stripped down to the bare essentials, I pass in 0, but the receive takes 60 seconds to complete. When it completes, I get back a null object. Why does it take 60 seconds when it should return immediately?
var client = ServiceBusClient.GetOrCreateSubscriptionClient(topicName, subscriptionName, false);
var msg = client.Receive(TimeSpan.FromSeconds(0));

Timeout Exception - Queuing of Requests? Not enough threads?

Background:
I have a service which aggregates data from multiple other services. To make things happen in a timely manner I use async throughout the code, and then gather the various requests into a list of tasks.
Here are some excerpts from the code:
private async Task<List<Foo>> Baz(..., int timeout)
{
    var tasks = new List<Task<IEnumerable<Foo>>>();
    tasks.Add(GetFoo1(..., timeout));
    tasks.Add(GetFoo2(..., timeout));
    // Up to 6, depending on other parameters. Some tasks return multiple objects.
    return await Task.WhenAll(tasks)
        .ContinueWith((antecedent) => { return antecedent.Result.AsEnumerable().SelectMany(f => f).ToList(); })
        .ConfigureAwait(false);
}
private async Task<IEnumerable<Foo>> GetFoo1(..., int timeout)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    var value = await SomeAsyncronousService.GetAsync(..., timeout).ConfigureAwait(false);
    sw.Stop();
    // Record timing...
    return new[] { new Foo(..., value) };
}
private async Task<IEnumerable<Foo>> GetFoo2(..., int timeout)
{
    return await Task.Run(() =>
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var r = new[] { new Foo(..., SomeSyncronousService.Get(..., timeout)) };
        sw.Stop();
        // Record timing...
        return r;
    }).ConfigureAwait(false);
}
// In class SomeAsyncronousService
public async Task<string> GetAsync(..., int timeout)
{
    ...
    try
    {
        using (var httpClient = HttpClientFactory.Create())
        {
            // I have tried it with both timeout and CTS. The behavior is the same.
            //httpClient.Timeout = TimeSpan.FromMilliseconds(timeout);
            var cts = new CancellationTokenSource();
            cts.CancelAfter(timeout);
            var content = ...;
            var responseMessage = await httpClient.PostAsync(Endpoint, content, cts.Token).ConfigureAwait(false);
            if (responseMessage.IsSuccessStatusCode)
            {
                var contentData = await responseMessage.Content.ReadAsStringAsync().ConfigureAwait(false);
                ...
                return ...
            }
            ...
        }
    }
    catch (OperationCanceledException ex)
    {
        // Log statement ...
    }
    catch (Exception ex)
    {
        // Log statement ...
    }
    return ...;
}
The Symptoms:
This code works great on my local machine, and it works fine on our test servers most of the time. However, occasionally we get a batch of mass timeouts - recorded by the "Record timing" comments above and by the log statements on OperationCanceledExceptions. I do not have any way of telling whether the services I call actually timed out.
Now, when I say a series of timeouts, I mean that most or all of the tasks (and the HttpClients that all but one use; the other uses a WCF service) time out at about the same time.
Now, I know what you are thinking: I am passing in the same timeout. That's right, but I pass in 250 ms, and the run times reported by the various stopwatches are around 800 ms or higher.
Now, I do see the OperationCanceledExceptions in the log, but the timestamp of the exception is the same as the timestamp of when the stopwatch ended (or within 2-3 ms), and my service is failing because clients expect it to respond in 500 ms or less, not 800 ms.
Normally the various services respond in less than 100 ms, with a wide variance among the results. When a problem occurs and most or all of them return in 800 ms or more, they vary by only ~10 ms. The dependencies I call are all on different domains, so it seems highly unlikely that all of them really take that long to respond at the same time.
I suppose there could be a network issue, affecting all requests at the same time, but the other services in our network do not experience the same behavior - it is limited to the new service I am writing.
Even if that were the case, I would expect the cancellation exceptions to occur after 250 ms, and then for the task to end and the stopwatch to record 250 ms (plus 5-20 ms or so for exception handling).
So I do not think it is a network issue. Now, I am sure that at least part of the problem is related to me not cancelling / timing out correctly, but it seems to me that all of the outgoing requests from the service are being affected at the same time, independent of HttpClient.
The reason I say that is that the WCF service also shows 800+ ms (according to the stopwatch) when the rest of the requests time out. The WCF service is not asynchronous. Its timeout is set like this:
var binding = new BasicHttpBinding()
{
    Security = new BasicHttpSecurity()
    {
        Mode = BasicHttpSecurityMode.TransportCredentialOnly,
        Transport = new HttpTransportSecurity()
        {
            ClientCredentialType = HttpClientCredentialType.Ntlm
        }
    },
    ReceiveTimeout = TimeSpan.FromMilliseconds(timeout)
};
The Problem:
So, in short, I think that something is causing all outgoing requests to any domain to pause or queue, which is producing the observed behavior.
I have spent days trying to figure out what is going on, but have had no luck. Any ideas?
EDIT
I think what is happening is that the requests are being put on hold because there isn't a thread available, and then a few hundred milliseconds later a thread becomes available and the task starts. Timing the method call shows it taking 800 ms, but the timeout on the HttpClient doesn't start counting until a thread is available to run the async call.
That would also explain why I see the method take 800+ ms yet sometimes still complete without a timeout exception, while at other times it throws a timeout exception and does not complete.
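If thread pool starvation is the cause, a quick diagnostic and mitigation sketch (the minimum values are illustrative and should be tuned):

// Diagnostic: log pool availability right before the slow call.
ThreadPool.GetAvailableThreads(out int workerThreads, out int completionPortThreads);
Console.WriteLine($"Available worker: {workerThreads}, IOCP: {completionPortThreads}");

// Mitigation: raise the floor so bursts don't wait on the pool's slow thread injection
// (above the minimum, new threads are injected only gradually).
ThreadPool.SetMinThreads(200, 200);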
I have tried setting the ServicePointManager.DefaultConnectionLimit to 200 in Application_Start, but that did not solve the issue.
The service isn't taking that much traffic compared to our other services, and none of the others appear to have the same problem.
Any ideas?
Edit 2
I logged into the box and monitored netstat while doing (minor) load tests.
Using HttpClient, with 1-2 requests per second the ports would show ESTABLISHED, then move to TIME_WAIT for about 4 minutes. With 3+ requests per second I would end up with a roughly constant 100 ESTABLISHED ports per request-per-second (so about 300 for a 3-per-second load test), and then I would start seeing them go to CLOSE_WAIT instead of TIME_WAIT - indicating an error condition on close. At the same time I would see a spike in the number of exceptions and in the time to execute the requests. (TcpTimedWaitDelay does not apply to CLOSE_WAIT.)
So I rewrote the whole thing to use HttpWebRequests in serial, instead of HttpClient in parallel. Then I ran the same tests.
Now the ESTABLISHED ports equal 0-2 per request-per-second, and the ports then move on to TIME_WAIT as expected. The performance and throughput improved, but the problem didn't clear up completely.
Then I set TcpTimedWaitDelay to 30 (the default is 240). Performance increased dramatically. I have a primitive load test that hits it with 40 requests per second without any issues. I will put together a more thorough test setup, but I think the problem has been solved.
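For reference, TcpTimedWaitDelay is a machine-wide Windows registry value; a sketch of the change (0x1e = 30 seconds), assuming the standard location:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpTimedWaitDelay"=dword:0000001e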
I don't know exactly what is going on, but it appears that HttpClient was not closing the ephemeral ports correctly underneath. Many of the developers and architects at my company looked at it and couldn't see anything wrong with the code. I tried a single HttpClient in a using statement per request, as well as a single HttpClient per API I call on the back end. I tried using HttpClient in parallel and in serial, and with async/await and without. No matter what I tried, the behavior was the same.
I would like to be able to use HttpClient, but I can't spend any more time on this issue as I have it working with HttpWebRequest. My next step is to make the HttpWebRequests occur in parallel.
Thank you for your input.
I've experienced similar frustrations with HttpClient. In my scenario, I found that setting MaxServicePointIdleTime to a much lower value and DefaultConnectionLimit to a high value on the ServicePointManager resolved my issues. I believe in my case I was experiencing pool starvation, as the connections were being held open.
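For reference, a sketch of those two settings (the values are assumptions to tune for your workload):

// Applied once at startup, e.g. in Application_Start.
ServicePointManager.DefaultConnectionLimit = 100;    // raise the per-endpoint concurrent connection cap
ServicePointManager.MaxServicePointIdleTime = 10000; // close idle connections after 10 s instead of the 100 s default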
You may also want to test without the debugger attached, in a release build, if you are not already doing so, as the TaskScheduler behaves differently when debugging.
The following MSDN article is very helpful: http://blogs.msdn.com/b/jpsanders/archive/2009/05/20/understanding-maxservicepointidletime-and-defaultconnectionlimit.aspx
