How to implement exponential backoff in Azure Functions?
I have a function that depends on an external API. I would like to handle the unavailability of this service using a retry policy.
This function is triggered when a new message appears in the queue, and in this case a retry policy is enabled by default:
For most triggers, there is no built-in retry when errors occur during function execution. The two triggers that have retry support are Azure Queue storage and Azure Blob storage. By default, these triggers are retried up to five times. After the fifth retry, both triggers write a message to a special poison queue.
Unfortunately, the retry starts immediately after the exception (TimeSpan.Zero), which is pointless in this case because the service is most likely still unavailable.
Is there a way to dynamically modify the time after which the message becomes available in the queue again?
I know that I can set visibilityTimeout (host.json reference), but it applies to all queues, and that is not what I want to achieve here.
I found one workaround, but it is far from an ideal solution. In case of an exception, we can add the message to the queue again and set a visibility timeout for this message:
[FunctionName("Test")]
public static async Task Run([QueueTrigger("queue-test")]string myQueueItem, TraceWriter log,
ExecutionContext context, [Queue("queue-test")] CloudQueue outputQueue)
{
if (true)
{
log.Error("Error message");
await outputQueue.AddMessageAsync(new CloudQueueMessage(myQueueItem), TimeSpan.FromDays(7),
TimeSpan.FromMinutes(1), // <-- visibilityTimeout
null, null).ConfigureAwait(false);
return;
}
}
Unfortunately, this solution is weak because it has no context: I do not know which attempt it is, so I cannot limit the number of calls or adjust the delay (exponential backoff).
An in-process retry policy is not welcome either, because it can drastically increase costs (pricing models).
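One way to give the re-queue workaround the missing context is to carry the attempt number inside the message itself. The following is only a minimal sketch under that assumption: RetryEnvelope and RequeuePolicy are hypothetical names, not part of any Azure SDK, and the doubling rule is just one possible backoff formula.

```csharp
using System;

// Hypothetical envelope: the original payload plus the zero-based index
// of the attempt that just failed.
public sealed class RetryEnvelope
{
    public string Body { get; }
    public int Attempt { get; }

    public RetryEnvelope(string body, int attempt)
    {
        Body = body;
        Attempt = attempt;
    }
}

public static class RequeuePolicy
{
    // Returns the delay before the next attempt, or null once maxAttempts
    // have been used up (the caller would then park the message elsewhere).
    public static TimeSpan? NextDelay(
        RetryEnvelope failed, int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (failed.Attempt + 1 >= maxAttempts)
        {
            return null;
        }

        // baseDelay * 2^attempt, capped at maxDelay.
        double seconds = baseDelay.TotalSeconds * Math.Pow(2, failed.Attempt);
        return TimeSpan.FromSeconds(Math.Min(seconds, maxDelay.TotalSeconds));
    }
}
```

The function would serialize a new envelope with Attempt + 1 and pass the returned delay where the workaround above passes TimeSpan.FromMinutes(1).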
Microsoft added retry policies around November 2020 (preview), which support exponential backoff:
[FunctionName("Test")]
[ExponentialBackoffRetry(5, "00:00:04", "00:15:00")] // retries with delays increasing from 4 seconds to 15 minutes
public static async Task Run([QueueTrigger("queue-test")]string myQueueItem, TraceWriter log, ExecutionContext context)
{
// ...
}
I had a similar problem and ended up using Durable Functions, which have an automatic retry feature built in. To use it, wrap your external API call in an activity; when calling this activity, you can configure the retry behavior through an options object. You can set the following options:
Max number of attempts: The maximum number of retry attempts.
First retry interval: The amount of time to wait before the first retry attempt.
Backoff coefficient: The coefficient used to determine rate of increase of backoff. Defaults to 1.
Max retry interval: The maximum amount of time to wait in between retry attempts.
Retry timeout: The maximum amount of time to spend doing retries. The default behavior is to retry indefinitely.
Handle: A user-defined callback can be specified to determine whether a function should be retried.
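To see how the first retry interval, backoff coefficient and max retry interval interact, the waits can be computed by hand. This is a sketch of my reading of the option descriptions above, not the library's actual implementation; RetrySchedule is a hypothetical helper, and it assumes the attempt count includes the initial call.

```csharp
using System;
using System.Collections.Generic;

public static class RetrySchedule
{
    // Lists the wait before each retry. Assumption: maxAttempts counts the
    // initial call, so there are maxAttempts - 1 waits.
    public static IReadOnlyList<TimeSpan> Delays(
        int maxAttempts,
        TimeSpan firstRetryInterval,
        double backoffCoefficient,
        TimeSpan maxRetryInterval)
    {
        var delays = new List<TimeSpan>();
        double seconds = firstRetryInterval.TotalSeconds;

        for (int retry = 1; retry < maxAttempts; retry++)
        {
            // Each wait is the previous one times the coefficient, capped.
            delays.Add(TimeSpan.FromSeconds(Math.Min(seconds, maxRetryInterval.TotalSeconds)));
            seconds *= backoffCoefficient;
        }

        return delays;
    }
}
```

With the default coefficient of 1 every wait equals the first retry interval, so exponential backoff needs a coefficient greater than 1.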
One option to consider is to have your Function invoke a Logic App that has a delay set to your desired amount of time and, after the delay, invokes the function again. You could also add other retry logic (such as the number of attempts) to the Logic App, using some persistent storage to tally your attempts. You would only invoke the Logic App if there was a connection issue.
Alternatively, you could shift the starting point of your process to Logic Apps, since a Logic App can also be triggered by (think: bound to) queue messages. In either case, Logic Apps adds the ability to pause and re-invoke the Function and/or process.
If you are explicitly completing/dead-lettering Service Bus messages ("autoComplete": false), here's a helper function that will exponentially delay and retry until the max delivery count is reached:
public static async Task ExceptionHandler(IMessageSession MessageSession, string LockToken, int DeliveryCount)
{
if (DeliveryCount < Globals.MaxDeliveryCount)
{
var DelaySeconds = Math.Pow(Globals.ExponentialBackoff, DeliveryCount);
await Task.Delay(TimeSpan.FromSeconds(DelaySeconds));
await MessageSession.AbandonAsync(LockToken);
}
else
{
await MessageSession.DeadLetterAsync(LockToken);
}
}
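One caveat with the helper above: the message stays locked while Task.Delay runs, so a delay longer than the remaining lock time would make the later AbandonAsync call fail. A hedged sketch of capping the delay follows; LockAwareBackoff and the 0.8 safety fraction are my own assumptions, not part of the Service Bus SDK.

```csharp
using System;

public static class LockAwareBackoff
{
    // backoffBase^deliveryCount seconds, but never more than a fraction
    // of the lock time that is still left on the message.
    public static TimeSpan Delay(
        double backoffBase, int deliveryCount, TimeSpan lockRemaining, double safetyFraction = 0.8)
    {
        double wanted = Math.Pow(backoffBase, deliveryCount);
        double allowed = Math.Max(0, lockRemaining.TotalSeconds * safetyFraction);
        return TimeSpan.FromSeconds(Math.Min(wanted, allowed));
    }
}
```

The remaining lock time could be derived from the message's LockedUntilUtc system property minus DateTime.UtcNow.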
Since November 2022, function-level retries are no longer supported for QueueTrigger (source).
Instead, you must use the retry settings of the binding extension in host.json, for example for Service Bus:
{
"version": "2.0",
"extensions": {
"serviceBus": {
"clientRetryOptions":{
"mode": "exponential",
"tryTimeout": "00:01:00",
"delay": "00:00:00.80",
"maxDelay": "00:01:00",
"maxRetries": 3
}
}
}
}
Related
I am trying to implement my own version of WebHooks for my application. When a user registers their URL hook (assume it's a wrong URL or will not respond with a 2XX code), I would like to retry up to five times at custom exponential intervals, say 5 mins, 30 mins, 2 hrs, 4 hrs and 16 hrs. I have implemented this using the .NET Polly library.
My questions are:
1) In the worst case, is it safe to wait up to 16 hrs for the 5th retry?
2) Is it thread safe? (Polly says it is thread safe as long as my code is thread safe, but my concern is the 16-hour intervals.)
3) Say 10 requests have failed and each is retrying at its own interval. If more and more requests fail over time, will my server's thread pool fill up and become unable to accept new requests?
4) Given the long intervals, is it really worth using a library like Polly, or is it better to go for CRON job schedulers?
My implementation details are very similar or identical to Polly's official IHttpClientFactory example (link).
Thanks for your suggestions.
I highly encourage you to deep dive into the Polly source code, it's relatively easy to read.
For example, if you start from the WaitAndRetryAsync function, you will shortly reach the AsyncRetryEngine. This contains the retry implementation, which is why its only method is called ImplementationAsync. If we jump to the wait-related part, you will find the following piece of code:
if (waitDuration > TimeSpan.Zero)
{
await SystemClock.SleepAsync(waitDuration, cancellationToken).ConfigureAwait(continueOnCapturedContext);
}
If you look at the SystemClock then you will find the Sleep and SleepAsync fields (methods) definition:
public static Func<TimeSpan, CancellationToken, Task> SleepAsync =
new Func<TimeSpan, CancellationToken, Task>(Task.Delay);
SystemClock.Sleep = (Action<TimeSpan, CancellationToken>) ((timeSpan, cancellationToken) =>
{
if (!cancellationToken.WaitHandle.WaitOne(timeSpan))
return;
cancellationToken.ThrowIfCancellationRequested();
});
As you can see, if you call WaitAndRetryAsync, your policy will call Task.Delay, which is non-blocking. If you call WaitAndRetry, your policy will call the WaitHandle's WaitOne, which is blocking.
So if you use WaitAndRetry for such a long period, it will block the thread (unless it is terminated for some reason). With WaitAndRetryAsync, no thread is held during the wait; the ThreadPool is notified only once the delay has completed.
Still, I would suggest using cron jobs for this sort of problem, because an in-memory wait of several hours is lost if the process restarts.
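A cron-based variant would persist the next attempt time instead of keeping a delay in memory. The sketch below is only an illustration: PendingWebhook and WebhookSchedule are hypothetical names, storage is left out, and the intervals are the ones from the question.

```csharp
using System;

// Hypothetical record that would be stored in a database table.
public sealed class PendingWebhook
{
    public string Url { get; set; }
    public int FailedAttempts { get; set; }
    public DateTime NextAttemptUtc { get; set; }
}

public static class WebhookSchedule
{
    private static readonly TimeSpan[] Intervals =
    {
        TimeSpan.FromMinutes(5),
        TimeSpan.FromMinutes(30),
        TimeSpan.FromHours(2),
        TimeSpan.FromHours(4),
        TimeSpan.FromHours(16),
    };

    // Records a failed delivery and schedules the next retry.
    // Returns false once all five retries have been used up.
    public static bool RecordFailure(PendingWebhook hook, DateTime nowUtc)
    {
        if (hook.FailedAttempts >= Intervals.Length)
        {
            return false;
        }

        hook.NextAttemptUtc = nowUtc + Intervals[hook.FailedAttempts];
        hook.FailedAttempts++;
        return true;
    }

    // The scheduled job would call this for each stored hook on every run.
    public static bool IsDue(PendingWebhook hook, DateTime nowUtc)
    {
        return hook.NextAttemptUtc <= nowUtc;
    }
}
```

A job running every few minutes would load the hooks that are due, call them, and either delete them on success or call RecordFailure again.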
I'm using Microsoft.Azure.ServiceBus. (doc)
I was getting an exception of:
The lock supplied is invalid. Either the lock expired, or the message
has already been removed from the queue.
With the help of these questions:
1, 2, 3,
I was able to avoid the exception by setting AutoComplete to false and by increasing the queue's lock duration in Azure to its maximum (from 30 seconds to 5 minutes).
_queueClient.RegisterMessageHandler(ProcessMessagesAsync, new
MessageHandlerOptions(ExceptionReceivedHandler)
{
MaxConcurrentCalls = 1,
MaxAutoRenewDuration = TimeSpan.FromSeconds(10),
AutoComplete = false
}
);
private async Task ProcessMessagesAsync(Message message, CancellationToken token)
{
await ProccesMessage(message);
}
private async Task ProccesMessage(Message message)
{
//The complete should be closed before long-timed process
await _queueClient.CompleteAsync(message.SystemProperties.LockToken);
await DoFoo(message.Body); //some long running process
}
My questions are:
This answer suggested that the exception was raised because the lock expired before the long-running process finished, but in my case I was marking the message as complete immediately (before the long-running process), so I'm not sure why changing the lock duration in Azure made any difference. When I change it back to 30 seconds, I see the exception again.
Not sure if it is related to the question, but what is the purpose of MaxAutoRenewDuration? The official docs say: "The maximum duration during which locks are automatically renewed." In my case I have only one receiver app that consumes from this queue, so is it unnecessary, because I do not need to lock the message to keep another app from capturing it? And why should this value be greater than the longest message lock duration?
There are a few things you need to consider.
Lock duration
Total time since a message acquired from the broker
The lock duration is simple - for how long a single competing consumer can lease a message w/o having that message leased to any other competing consumer.
The total time is a bit trickier. Your callback ProcessMessagesAsync, registered to receive the message, is not the only thing involved. In the code sample you've provided, you're setting the concurrency to 1. If prefetch is configured (the client fetches more than one message with each request to the broker), the lock duration clock on the server starts ticking for all of those messages. So even if processing a single message takes slightly less than MaxLockDuration, the last prefetched message may, for the sake of example, have waited too long to be processed; even though its own processing takes less than the lock duration, it might lose its lock, and the exception will be thrown when attempting to complete that message.
This is where MaxAutoRenewDuration comes into play. It extends the message lease with the broker, "re-locking" it for the competing consumer that is currently handling the message. MaxAutoRenewDuration should be set to the "possibly maximum processing time a lease will be required". In your sample, it's set to TimeSpan.FromSeconds(10), which is extremely low. It needs to be at least longer than MaxLockDuration and adjusted to the longest period of time ProccesMessage will need to run, taking prefetching into consideration.
To help to visualize it, think of the client-side having an in-memory queue where the messages can be stored while you perform the serial processing of the messages one by one in your handler. Lease starts the moment a message arrives from the broker to that in-memory queue. If the total time in the in-memory queue plus the processing exceeds the lock duration, the lease is lost. Your options are:
1) Enable concurrent processing by setting MaxConcurrentCalls > 1
2) Increase MaxLockDuration
3) Reduce message prefetch (if you use it)
4) Configure MaxAutoRenewDuration to renew the lock and overcome the MaxLockDuration constraint
Note about #4 - it's not a guaranteed operation. Therefore there's a chance a call to the broker will fail and message lock will not be extended. I recommend designing your solutions to work within the lock duration limit. Alternatively, persist message information so that your processing doesn't have to be constrained by the messaging.
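To make the in-memory queue picture concrete, here is a rough back-of-the-envelope sketch. It assumes all prefetched messages are locked at the same moment, are processed serially (MaxConcurrentCalls = 1) and get no lock renewal; PrefetchLockEstimate is a hypothetical helper, not a Service Bus API.

```csharp
using System;

public static class PrefetchLockEstimate
{
    // Counts prefetched messages whose lock would already have expired by
    // the time serial processing gets around to completing them.
    public static int MessagesAtRisk(int prefetched, TimeSpan perMessage, TimeSpan lockDuration)
    {
        int atRisk = 0;

        for (int i = 1; i <= prefetched; i++)
        {
            // The i-th message is completed after i * perMessage, measured
            // from the moment the whole batch was locked.
            TimeSpan completedAfter = TimeSpan.FromTicks(perMessage.Ticks * i);
            if (completedAfter > lockDuration)
            {
                atRisk++;
            }
        }

        return atRisk;
    }
}
```

For example, with 10 prefetched messages, 5 seconds of work each and a 30-second lock, the last 4 messages would be completed too late.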
I have a long-running process that performs matches between millions of records. I call this code using a Service Bus queue. However, when my process passes the 5-minute limit, Azure starts processing the already processed records from the start again.
How can I avoid this?
Here is my code:
private static async Task ProcessMessagesAsync(Message message, CancellationToken token)
{
long receivedMessageTrasactionId = 0;
try
{
IQueueClient queueClient = new QueueClient(serviceBusConnectionString, serviceBusQueueName, ReceiveMode.PeekLock);
// Process the message
receivedMessageTrasactionId = Convert.ToInt64(Encoding.UTF8.GetString(message.Body));
// My Very Long Running Method
await DataCleanse.PerformDataCleanse(receivedMessageTrasactionId);
//Get Transaction and Metric details
await queueClient.CompleteAsync(message.SystemProperties.LockToken);
}
catch (Exception ex)
{
Log4NetErrorLogger(ex);
throw; // rethrow without resetting the stack trace
}
}
Messages are intended for notifications and not long running processing.
You've got a few options:
1) Receive the message and rely on the receiver's RenewLock() operation to extend the lock.
2) Use the user-callback API and specify the maximum processing time, if known, via the MessageHandlerOptions.MaxAutoRenewDuration setting to auto-renew the message's lock.
3) Record that processing has started but do not complete the incoming message. Instead, leverage the message deferral feature, sending yourself a new delayed message with a reference to the deferred message's SequenceNumber. This will allow you to periodically receive a "reminder" message to see if the work is finished. If it is, complete the deferred message by its SequenceNumber. Otherwise, complete the "reminder" message and send a new one. This approach would require some redesign of your architecture.
4) Similar to option 3, but offload processing to an external process that will report the status later. There are frameworks that can help you with that, such as MassTransit or NServiceBus. The latter has a sample you can download and play with.
Note that options 1 and 2 are not guaranteed, as those are client-side initiated operations.
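For the first option above (renewing the lock yourself while the work runs), the shape of the loop could look like the sketch below. It is deliberately SDK-agnostic: renewLock is a placeholder delegate standing in for whatever renew call your receiver exposes, and LockRenewal is a hypothetical helper.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LockRenewal
{
    // Runs the work while calling renewLock at a fixed interval in the background.
    public static async Task<T> RunWithRenewalAsync<T>(
        Func<Task<T>> work, Func<Task> renewLock, TimeSpan renewEvery)
    {
        using var cts = new CancellationTokenSource();
        Task renewal = RenewLoopAsync(renewLock, renewEvery, cts.Token);

        try
        {
            return await work();
        }
        finally
        {
            // Stop renewing. If a renewal call failed earlier, its
            // exception surfaces here, after the work has ended.
            cts.Cancel();
            await renewal;
        }
    }

    private static async Task RenewLoopAsync(
        Func<Task> renewLock, TimeSpan renewEvery, CancellationToken token)
    {
        try
        {
            while (true)
            {
                await Task.Delay(renewEvery, token);
                await renewLock();
            }
        }
        catch (OperationCanceledException)
        {
            // Expected when the work finishes and the loop is cancelled.
        }
    }
}
```

The renewal interval should be comfortably shorter than the lock duration, and as noted above a renewal can still fail, so the work should tolerate being redelivered.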
I have a service which creates actors of some type by name:
var storer = this.serviceClient.Create<IStorer>(new ActorId(agencyToProcess.Name));
and then I call the actor's method:
await storer.StoreStatusesAsync().ConfigureAwait(false);
On this call I receive the following error:
System.AggregateException: One or more errors occurred. --->
Microsoft.ServiceFabric.Actors.ActorConcurrencyLockTimeoutException:
Acquisition of turn based concurrency lock for actor 'actorName' timed
out after 00:01:13.6480000. at
Microsoft.ServiceFabric.Actors.Runtime.ActorConcurrencyLock.d__17.MoveNext()
I can't understand what this error means and how to fix it.
This problem doesn't happen every time (about 20 times out of 100).
[ServiceDescription("Storer", ServicePrefix = "MyApp")]
public interface IStorer : IActor
{
Task StoreStatusesAsync();
}
These actors are created in an observer. Here is the full code that creates the actors in the observer:
public async void OnNext(AgencyModel agencyToProcess)
{
try
{
var storer = this.serviceClient.Create<IStorer>(new ActorId(agencyToProcess.Name));
await storer.StoreStatusesAsync().ConfigureAwait(false);
}
catch (Exception exception)
{
this.Logger.Error(exception);
}
}
This happens because your service is trying to call an actor that is currently locked while processing another call.
By the looks of your code, you are triggering actors based on events. If two events targeting the same actor arrive consecutively, one of them will wait for the previous one to finish, and if the previous one takes too long to complete, the wait times out with an 'ActorConcurrencyLockTimeoutException'.
It does not happen often because each call might take a few seconds or less to process, but when you have many calls enqueued, the latest will wait for all previous ones to be processed in order, and sooner or later this will time out.
To reduce these exceptions you could increase the timeout threshold (the default is 60 seconds), but in my opinion this is not a good idea, as it might enqueue too many requests and possibly not be able to process all of them, while holding resources and connections. These requests may also get lost when services are re-balanced.
The best solution is to find an approach to throttle these requests.
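One possible throttle on the caller side is to skip an event when a call to the same actor is already in flight. This is only a sketch and assumes that dropping such a duplicate is acceptable for your scenario; PerKeyGate is a hypothetical helper, not a Service Fabric API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class PerKeyGate
{
    private readonly ConcurrentDictionary<string, byte> inFlight =
        new ConcurrentDictionary<string, byte>();

    // Runs the call unless another call for the same key is still running.
    // Returns false when the call was skipped.
    public async Task<bool> TryRunAsync(string key, Func<Task> call)
    {
        if (!this.inFlight.TryAdd(key, 0))
        {
            return false; // a call for this key is already in flight
        }

        try
        {
            await call().ConfigureAwait(false);
            return true;
        }
        finally
        {
            byte ignored;
            this.inFlight.TryRemove(key, out ignored);
        }
    }
}
```

In OnNext, the actor call would be wrapped as gate.TryRunAsync(agencyToProcess.Name, () => storer.StoreStatusesAsync()), with gate being a single shared instance.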
I would like to write a timeout function for the BasicPublish method of the RabbitMQ C# client. For various reasons the queue is sometimes blocked, or RabbitMQ is down, or whatever. I want to detect right away when the publish is failing. I do not want to block the site for any reason.
I'm worried that implementing a timeout with Task or threads adds overhead to a simple publish that we do millions of times in production.
Does anyone have an idea how to write a quick timeout on a fast blocking method such as BasicPublish?
Clarification: I'm working in .NET 4, so I do not have async.
Polly has a TimeoutPolicy aimed at exactly this scenario.
Polly's TimeoutStrategy.Optimistic is close to @ThiagoCustodio's answer, but it also disposes the CancellationTokenSource correctly. However, RabbitMQ's C# client doesn't (at time of writing) offer a BasicPublish() overload taking a CancellationToken, so this approach is not relevant.
Polly's TimeoutStrategy.Pessimistic is aimed at scenarios such as BasicPublish(), where you want to impose a timeout on a delegate which doesn't have CancellationToken support.
Polly's TimeoutStrategy.Pessimistic:
[1] allows the calling thread to time-out on (walk away from waiting for) the execution, even when the executed delegate doesn't support cancellation.
[2] does so at the cost of an extra task/thread (in synchronous executions), and manages this for you.
[3] also captures the timed-out Task (the task you have walked away from). This can be valuable for logging, and is essential to avoid UnobservedTaskExceptions - particularly in .NET4.0, where an UnobservedTaskException can bring down your entire process.
Simple example:
Policy.Timeout(TimeSpan.FromSeconds(10), TimeoutStrategy.Pessimistic).Execute(() => BasicPublish(...));
Full example properly avoiding UnobservedTaskExceptions:
Policy timeoutPolicy = Policy.Timeout(TimeSpan.FromSeconds(10), TimeoutStrategy.Pessimistic, (context, timespan, task) =>
{
task.ContinueWith(t => { // ContinueWith important!: the abandoned task may very well still be executing, when the caller times out on waiting for it!
if (t.IsFaulted)
{
logger.Error($"{context.PolicyKey} at {context.ExecutionKey}: execution timed out after {timespan.TotalSeconds} seconds, eventually terminated with: {t.Exception}.");
}
else
{
// extra logic (if desired) for tasks which complete, despite the caller having 'walked away' earlier due to timeout.
}
});
});
timeoutPolicy.Execute(() => BasicPublish(...));
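For intuition, the three points above roughly correspond to the hand-rolled sketch below, which sticks to Task APIs available in .NET 4. It is a simplified illustration of the idea, not Polly's actual implementation, and PessimisticTimeout is a hypothetical name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PessimisticTimeout
{
    // Returns true if the action finished in time and false if the caller
    // gave up waiting. The action itself keeps running in the background.
    // If the action faults in time, Wait rethrows it as an AggregateException.
    public static bool TryExecute(Action action, TimeSpan timeout, Action<Exception> onFault)
    {
        // Costs an extra task so the calling thread can walk away.
        Task task = Task.Factory.StartNew(action);

        // Observe any fault, including one that happens after the timeout,
        // so it never becomes an unobserved task exception.
        task.ContinueWith(
            t => onFault(t.Exception),
            TaskContinuationOptions.OnlyOnFaulted);

        return task.Wait(timeout);
    }
}
```

Unlike the Polly policy, this sketch has no bulkhead or circuit breaker, so abandoned actions can pile up if the broker stays unavailable.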
To avoid building up too many concurrent pending tasks/threads in the case where RabbitMQ becomes unavailable, you can use a Bulkhead Isolation policy to limit parallelization and/or a CircuitBreaker to prevent putting calls through for a period, once you detect a certain level of failures. These can be combined with the TimeoutPolicy using PolicyWrap.
I would say that the easiest way is using tasks / cancellation token. Do you think it's an overhead?
public static async Task WithTimeoutAfterStart(
Func<CancellationToken, Task> operation, TimeSpan timeout)
{
var source = new CancellationTokenSource();
var task = operation(source.Token);
source.CancelAfter(timeout);
await task;
}
Usage:
await WithTimeoutAfterStart(
ct => SomeOperationAsync(ct), TimeSpan.FromMilliseconds(n));