So, I'm writing some retry logic for acquiring a lock using Polly. The overall timeout value will be provided by the API caller. I know I can wrap a policy in an overall timeout. However, if the supplied timeout value is too low is there a way I can ensure that the policy is executed at least once?
Obviously I could call the delegate separately before the policy is executed, but I was just wondering if there was a way to express this requirement in Polly.
var result = Policy.Timeout(timeoutFromApiCaller)
    .Wrap(Policy.HandleResult(false)
        .WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500)))
    .Execute(() => this.TryEnterLock());
If timeoutFromApiCaller is say 1 tick and there's a good chance it takes longer than that to reach the timeout policy then the delegate wouldn't get called (the policy would timeout and throw TimeoutRejectedException).
What I'd like to happen can be expressed as:
var result = this.TryEnterLock();
if (!result)
{
    result = Policy.Timeout(timeoutFromApiCaller)
        .Wrap(Policy.HandleResult(false)
            .WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500)))
        .Execute(() => this.TryEnterLock());
}
But it'd be really nice if it could be expressed in pure-Polly...
To be honest, I don't understand what 1 tick means in your case. Is it a nanosecond or greater than that? Your global timeout should be greater than your local timeout.
But as far as I can see, you have not specified a local one. TryEnterLock should receive a TimeSpan so that it does not block the caller indefinitely. If you look at the built-in sync primitives, most of them provide such a capability: Monitor.TryEnter, SpinLock.TryEnter, WaitHandle.WaitOne, etc.
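For illustration, here is a minimal sketch of what such an overload could look like, assuming the lock is guarded by a plain Monitor over a private syncRoot field (the question does not show the actual lock implementation):
private readonly object syncRoot = new object();
// Hypothetical sketch: gives up and returns false instead of blocking the caller indefinitely.
// If it returns true, the caller must later release the lock with Monitor.Exit(syncRoot).
public bool TryEnterLock(TimeSpan timeout)
{
    return Monitor.TryEnter(this.syncRoot, timeout);
}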
So, just to wrap it up:
var timeoutPolicy = Policy.Timeout<bool>(TimeSpan.FromMilliseconds(1000));
var retryPolicy = Policy.HandleResult(false)
    .WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500));
var resilientStrategy = Policy.Wrap(timeoutPolicy, retryPolicy);
var result = resilientStrategy.Execute(() => this.TryEnterLock(TimeSpan.FromMilliseconds(100)));
The timeout and delay values should be adjusted to your business needs. I highly encourage you to log when the global timeout fires (onTimeout / onTimeoutAsync) and when the retries fire (onRetry / onRetryAsync), so that you can fine-tune / calibrate these values.
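For example, a minimal sketch of wiring those hooks in, using Console.WriteLine as a stand-in for your real logger:
// Same strategy as above, but the onTimeout and onRetry hooks now log what happened.
var timeoutPolicy = Policy.Timeout<bool>(
    TimeSpan.FromMilliseconds(1000),
    (context, timespan, abandonedTask) =>
        Console.WriteLine($"Global timeout fired after {timespan.TotalMilliseconds} ms."));
var retryPolicy = Policy.HandleResult(false)
    .WaitAndRetryForever(
        _ => TimeSpan.FromMilliseconds(500),
        (outcome, delay) =>
            Console.WriteLine($"TryEnterLock returned false, retrying in {delay.TotalMilliseconds} ms."));
var resilientStrategy = Policy.Wrap(timeoutPolicy, retryPolicy);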
EDIT: Based on the comments of this post
As it turned out, there is no control over the timeoutFromApiCaller, so it can be arbitrarily small. (In the given example it is just a single tick, i.e. 100 nanoseconds, with the intent of emphasizing the problem.) So, in order to have an at-least-one-call guarantee, we have to make use of the Fallback policy.
Instead of manually calling TryEnterLock upfront, outside the policies, we should call it as the last action to satisfy the requirement. Policies use escalation: whenever the inner one fails, it delegates the problem to the next outer policy.
So, if the provided timeout is so tiny that the action cannot finish within that period, a TimeoutRejectedException will be thrown. With the Fallback we can handle that, and the action can be performed again, but this time without any timeout constraint. This gives us the desired at-least-once guarantee.
var atLeastOnce = Policy<bool>.Handle<TimeoutRejectedException>()
    .Fallback((ct) => this.TryEnterLock());
var globalTimeout = Policy.Timeout<bool>(TimeSpan.FromMilliseconds(1000));
var foreverRetry = Policy.HandleResult(false)
    .WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500));
var resilientStrategy = Policy.Wrap(atLeastOnce, globalTimeout, foreverRetry);
var result = resilientStrategy.Execute(() => this.TryEnterLock());
Related
We're using the Dropbox API wrapped in Polly to handle retries.
We have it set up as an exponential back-off, like explained here.
The issue we have is that we make plenty of concurrent calls.
When the API starts throwing rate limit exceptions, each individual caller backs off
but new callers will still call the API and "steal" the retry of callers that are waiting.
That means that on high load we are experiencing failed API calls and errors.
What we would like to achieve is that on rate limit errors all calls (including new callers) to the API are synchronized and wait for the rate limit to expire.
Then calls can resume (ideally in sequence to make sure the calls don't return rate limit exceptions anymore).
Is there a Polly-supported way of achieving that?
According to my understanding you want to have the following:
The downstream system can throttle incoming requests
1.1 The system is smart enough to provide a RetryAfter time span
You want to avoid flooding the downstream system if you already know that you are throttled
But you don't want to lose any incoming request; rather, you prefer to process all of them eventually
Let's put together a working example
#1 - Downstream system
Here we will implement a super simple mock which can mimic throttling.
Let's start with the exception
public class DownstreamServiceException: Exception
{
public TimeSpan RetryAfter { get; set; }
}
Now, let's see the service code
public class DownstreamService
{
private readonly CancellationTokenSource initCompletionSignal;
private readonly TimeSpan initDuration;
private bool isAvailable = false;
private DateTime initEstimatedEnd;
public DownstreamService()
{
initDuration = TimeSpan.FromSeconds(10);
initCompletionSignal = new CancellationTokenSource(initDuration);
initCompletionSignal.Token.Register(() => isAvailable = true);
initEstimatedEnd = DateTime.UtcNow.Add(initDuration);
}
public Task<string> GetAsync()
{
if (!isAvailable) throw new DownstreamServiceException { RetryAfter = initEstimatedEnd - DateTime.UtcNow };
return Task.FromResult("Available");
}
}
For the sake of simplicity, I've made the service unavailable for the first 10 seconds
I've used a CancellationTokenSource as a timer to make the service available after that
If GetAsync is called while the service is not available (we are throttled), it throws an exception; otherwise it returns the "Available" string
#2 - Avoid flooding if downstream is not available
Here we will define a Circuit Breaker to short-circuit the requests if the downstream is not available (we are throttled)
var throttledPolicy = Policy<string>
.Handle<DownstreamServiceException>()
.CircuitBreakerAsync(1, TimeSpan.FromSeconds(0),
onBreak: (result, state, _, __) => {
if (state == CircuitState.Open) return;
Console.WriteLine("onBreak");
throw result.Exception;
},
onReset: (_) => Console.WriteLine("onReset"),
onHalfOpen: () => { });
The Circuit Breaker will transition from Closed to Open when we receive the first DownstreamServiceException
The duration of break (TimeSpan.FromSeconds(0)) does not matter here
We will control the Circuit Breaker's state from the Retry logic
if (state == CircuitState.Open): This will be explained under the retry section
And finally re-throw the original exception (I know, I know ... it should be avoided, but it keeps our example application simple)
#3 - Retry until eventually processed
This is the most complicated part of the solution, because this retry policy handles multiple exceptions (DownstreamServiceException, IsolatedCircuitException) in different ways
CancellationTokenSource throttlingEndSignal;
var retryPolicy = Policy<string>
.Handle<DownstreamServiceException>()
.Or<IsolatedCircuitException>()
.WaitAndRetryForeverAsync(_ => TimeSpan.FromSeconds(3),
onRetry: (dr, __) =>
{
Console.WriteLine($"onRetry caused by {dr.Exception.GetType().Name}");
if (dr.Exception is DownstreamServiceException dse)
{
throttledPolicy.Isolate();
throttlingEndSignal = new(dse.RetryAfter);
throttlingEndSignal.Token.Register(() => throttledPolicy.Reset());
}
});
Let's start with the DownstreamServiceException
We will receive this exception because we are going to chain together the two policies and Circuit Breaker's onBreak delegate re-throws the received exception
Inside the onRetry we have a guard expression for DownstreamServiceException
Here we call Isolate on the Circuit Breaker, which tries to transition from the Open state to the Isolated state, which in turn calls the onBreak delegate
That is why we had the if (state == CircuitState.Open) return; guard there: to avoid an infinite loop
We do the same timer trick here with the CancellationTokenSource; whenever the throttling ends we push the Circuit Breaker back to the Closed state (Reset)
The IsolatedCircuitException case is much simpler
We receive this exception whenever we try to perform a retry attempt but the Circuit Breaker is in the Isolated state
So the Circuit Breaker short-circuits the execution, and because of the WaitAndRetryForever call we will eventually succeed
Put things together
var combinedPolicy = Policy.WrapAsync(retryPolicy, throttledPolicy);
var result = await combinedPolicy.ExecuteAsync(async () => await service.GetAsync());
Please note the following:
This solution works well with multiple concurrent requests as well, because the Circuit Breaker instance is shared (see the sketch after this list)
This solution is a workaround, because we cannot set the duration of break dynamically
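For example, here is a rough sketch of several concurrent callers going through the combinedPolicy and service instances defined above:
// All callers share the same policy instances, so they also share the Circuit Breaker's state:
// once one of them trips it into Isolated, the others are short-circuited until Reset is called.
var requests = Enumerable.Range(0, 5)
    .Select(_ => combinedPolicy.ExecuteAsync(() => service.GetAsync()))
    .ToList();
var results = await Task.WhenAll(requests);
Console.WriteLine(string.Join(", ", results));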
I hope you found this little sample application useful :)
This question already has answers here:
RateLimiting - Incorrect limiting
I'm trying to get my head around Polly rate-limit policy.
public class RateLimiter
{
private readonly AsyncRateLimitPolicy _throttlingPolicy;
private readonly Action<string> _rateLimitedAction;
public RateLimiter(int numberOfExecutions, TimeSpan perTimeSpan, Action<string> rateLimitedAction)
{
_throttlingPolicy = Policy.RateLimitAsync(numberOfExecutions, perTimeSpan);
_rateLimitedAction = rateLimitedAction;
}
public async Task<T> Throttle<T>(Func<Task<T>> func)
{
var result = await _throttlingPolicy.ExecuteAndCaptureAsync(func);
if (result.Outcome == OutcomeType.Failure)
{
var retryAfter = (result.FinalException as RateLimitRejectedException)?.RetryAfter ?? TimeSpan.FromSeconds(1);
_rateLimitedAction($"Rate limited. Should retry in {retryAfter}.");
return default;
}
return result.Result;
}
}
In my console application, I'm instantiating a RateLimiter with up to 5 calls per 10 seconds.
var rateLimiter = new RateLimiter(5, TimeSpan.FromSeconds(10), err => Console.WriteLine(err));
var rdm = new Random();
while (true)
{
var result = await rateLimiter.Throttle(() => Task.FromResult(rdm.Next(1, 10)));
if (result != default) Console.WriteLine($"Result: {result}");
await Task.Delay(200);
}
I would expect to see 5 results, and be rate limited on the 6th one. But this is what I get
Result: 9
Rate limited. Should retry in 00:00:01.7744615.
Rate limited. Should retry in 00:00:01.5119933.
Rate limited. Should retry in 00:00:01.2313921.
Rate limited. Should retry in 00:00:00.9797322.
Rate limited. Should retry in 00:00:00.7309150.
Rate limited. Should retry in 00:00:00.4812646.
Rate limited. Should retry in 00:00:00.2313643.
Result: 7
Rate limited. Should retry in 00:00:01.7982864.
Rate limited. Should retry in 00:00:01.5327321.
Rate limited. Should retry in 00:00:01.2517093.
Rate limited. Should retry in 00:00:00.9843077.
Rate limited. Should retry in 00:00:00.7203371.
Rate limited. Should retry in 00:00:00.4700262.
Rate limited. Should retry in 00:00:00.2205184.
I've also tried to use ExecuteAsync instead of ExecuteAndCaptureAsync and it didn't change the results.
public async Task<T> Throttle<T>(Func<Task<T>> func)
{
try
{
var result = await _throttlingPolicy.ExecuteAsync(func);
return result;
}
catch (RateLimitRejectedException ex)
{
_rateLimitedAction($"Rate limited. Should retry in {ex.RetryAfter}.");
return default;
}
}
This doesn't make any sense to me. Is there something I'm missing?
The rate limiter works a bit differently than you might expect. The expected behaviour could be the following:
Let's suppose I have 500 requests and I want to throttle them to 50 per minute
In that case, after the first 50 executions the rate limiter should kick in, provided they were executed in less than a minute
This intuitive approach does not take into account the even distribution of the incoming load. It might result in the following observable behaviour:
Let's suppose the first 50 executions took 30 seconds
Then you have to wait another 30 seconds to execute the 51st request
Polly's rate limiter uses the Leaky bucket algorithm
This works in the following way:
The bucket has a fixed capacity
The bucket has a leak at the bottom
Water drops leave the bucket at a given frequency
The bucket can receive new water drops from the top
The bucket can overflow if the incoming frequency is greater than the outgoing one
So, technically speaking:
it is a fixed-size queue
the dequeue is called periodically
if the queue is full then the enqueue throws an exception
The most important information from the above description is the following: the leaky bucket algorithm uses a constant rate to empty the bucket.
UPDATE 14/11/22
Let me correct myself. Polly's rate limiter uses a token bucket, not a leaky bucket. There are also other algorithms like fixed window counter, sliding window log or sliding window counter. You can read about the alternatives here or in chapter 4 of the System Design Interview Volume 1 book.
So, let's talk about the token bucket algorithm:
The bucket has a fixed capacity
Tokens are put into the bucket at a fixed periodic rate
If the bucket is full, no more tokens are added to it (overflow)
Each request tries to consume a single token
If there is at least one, the request consumes it and the request is allowed
If there isn't a single token inside the bucket, the request is dropped
(Source)
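To make the algorithm concrete, here is a deliberately naive, illustrative sketch; this is not Polly's actual implementation (which is lock-free and tick-based), but the idea is the same:
// Illustrative only: a simple, non-thread-safe token bucket.
public class NaiveTokenBucket
{
    private readonly TimeSpan onePer;     // one new token is added per this interval
    private readonly int bucketCapacity;  // maximum number of tokens the bucket can hold
    private double tokens;
    private DateTime lastRefill = DateTime.UtcNow;

    public NaiveTokenBucket(TimeSpan onePer, int bucketCapacity)
    {
        this.onePer = onePer;
        this.bucketCapacity = bucketCapacity;
        this.tokens = bucketCapacity;     // start with a full bucket, so an initial burst is allowed
    }

    public bool TryConsume()
    {
        // Refill proportionally to the elapsed time; anything above the capacity overflows and is lost.
        var now = DateTime.UtcNow;
        tokens = Math.Min(bucketCapacity, tokens + (now - lastRefill).Ticks / (double)onePer.Ticks);
        lastRefill = now;

        if (tokens < 1) return false;     // no token available -> the request is rejected
        tokens -= 1;                      // consume a single token -> the request is allowed
        return true;
    }
}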
If we scrutinise the implementation then we can see the following things:
The RateLimiterPolicy calls the RateLimiterEngine's static method
The RateLimiterEngine calls a method on a IRateLimiter interface
There is only one class (at the time of writing) which implements this interface
The RateLimiterFactory exposes a method to create LockFreeTokenBucketRateLimiter
public static IRateLimiter Create(TimeSpan onePer, int bucketCapacity)
=> new LockFreeTokenBucketRateLimiter(onePer, bucketCapacity);
Please be aware of how the parameters are named (onePer and bucketCapacity)!
If you are interested in the actual implementation, you can find it here. (Almost every line is commented.)
I want to emphasize one more thing. The rate limiter does not perform any retry. If you want to continue the execution after the penalty time is over, you have to do it yourself, either by writing some custom code or by combining a retry policy with the rate limiter policy (sketched below).
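For example, a rough sketch of the latter, using the values from the question (the retry count of 3 and the policy names are arbitrary):
// The retry is the outer policy, so it can catch the RateLimitRejectedException thrown by the
// inner rate limiter and wait out the advertised RetryAfter penalty before trying again.
var waitForPenalty = Policy
    .Handle<RateLimitRejectedException>()
    .WaitAndRetryAsync(
        3,
        (attempt, exception, context) =>
            (exception as RateLimitRejectedException)?.RetryAfter ?? TimeSpan.FromSeconds(1),
        (exception, delay, attempt, context) => Task.CompletedTask);
var rateLimiter = Policy.RateLimitAsync(5, TimeSpan.FromSeconds(10));
var resilient = Policy.WrapAsync(waitForPenalty, rateLimiter);
var value = await resilient.ExecuteAsync(() => Task.FromResult(42));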
There is an overload accepting a third parameter, maxBurst:
The maximum number of executions that will be permitted in a single burst (for example if none have been executed for a while).
The default value is 1. If you set it to numberOfExecutions you will see the desired effect for the first execution, though after that it will deteriorate to a pattern similar to the one you observe. I would guess this is down to how the limiter "frees" the resources and to the var onePer = TimeSpan.FromTicks(perTimeSpan.Ticks / numberOfExecutions); calculation. I have not dug too deep, but based on the docs and code it seems that rate limiting happens at a rate of "1 execution per perTimeSpan/numberOfExecutions" rather than "numberOfExecutions in any selected perTimeSpan". With the question's values that is one token every 10 s / 5 = 2 s, which roughly matches the seven or so rate-limited attempts you see between successes with a ~200 ms loop delay:
_throttlingPolicy = Policy.RateLimitAsync(numberOfExecutions, perTimeSpan, numberOfExecutions);
Adding a periodic wait of several seconds brings back the "bursts" though, as sketched below.
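A rough sketch of the question's loop with such a wait, assuming the RateLimiter above was changed to pass numberOfExecutions as maxBurst:
var rateLimiter = new RateLimiter(5, TimeSpan.FromSeconds(10), err => Console.WriteLine(err));
var rdm = new Random();
var iteration = 0;
while (true)
{
    var result = await rateLimiter.Throttle(() => Task.FromResult(rdm.Next(1, 10)));
    if (result != default) Console.WriteLine($"Result: {result}");

    // Pausing for the whole window every 5 calls lets the bucket refill completely,
    // so the next few calls can go through as a burst again.
    if (++iteration % 5 == 0) await Task.Delay(TimeSpan.FromSeconds(10));
    else await Task.Delay(200);
}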
Also see:
docs
allow for bursts part of the doc
issue about rate limiter engine.
Introduction
Hello all, we're currently working on a microservice platform that uses Azure EventHubs and events to send data between the services.
Let's just name these services: CustomerService, OrderService and MobileBFF.
The CustomerService mainly sends updates (with events) which will then be stored by the OrderService and MobileBFF to be able to respond to queries without having to call the CustomerService for this data.
All these 3 services + our developers on the DEV environment make use of the same ConsumerGroup to connect to these event hubs.
We currently make use of only 1 partition but plan to expand to multiple later. (You can see our code is already made to be able to read from multiple partitions)
Exception
Every now and then we're running into an exception though (once it starts, it usually keeps throwing this error for an hour or so). For now we've only seen this error on DEV/TEST environments.
The exception:
Azure.Messaging.EventHubs.EventHubsException(ConsumerDisconnected): At least one receiver for the endpoint is created with epoch of '0', and so non-epoch receiver is not allowed. Either reconnect with a higher epoch, or make sure all epoch receivers are closed or disconnected.
All consumers of the EventHub store their SequenceNumber in their own database. This allows each consumer to consume events separately and store the last processed SequenceNumber in its own SQL database. When the service (re)starts, it loads the SequenceNumber from the db and then requests events from there onwards until no more events can be found. It then sleeps for 100 ms and retries. Here's the (somewhat simplified) code:
var consumerGroup = EventHubConsumerClient.DefaultConsumerGroupName;
string[] allPartitions = null;
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
{
allPartitions = await consumer.GetPartitionIdsAsync(stoppingToken);
}
var allTasks = new List<Task>();
foreach (var partitionId in allPartitions)
{
//This is required if you reuse variables inside a Task.Run();
var partitionIdInternal = partitionId;
allTasks.Add(Task.Run(async () =>
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
{
EventPosition startingPosition;
using (var testScope = _serviceProvider.CreateScope())
{
var messageProcessor = testScope.ServiceProvider.GetService<EventHubInboxManager<T, EH>>();
//Obtains starting position from the database or sets to "Earliest" or "Latest" based on configuration
startingPosition = await messageProcessor.GetStartingPosition(_inboxOptions.InboxIdentifier, partitionIdInternal);
}
while (!stoppingToken.IsCancellationRequested)
{
bool processedSomething = false;
await foreach (PartitionEvent partitionEvent in consumer.ReadEventsFromPartitionAsync(partitionIdInternal, startingPosition, stoppingToken))
{
processedSomething = true;
startingPosition = await messageProcessor.Handle(partitionEvent);
}
if (processedSomething == false)
{
await Task.Delay(100, stoppingToken);
}
}
}
}
catch (Exception ex)
{
//Log error / delay / retry
}
}
}
}
The exception is thrown on the following line:
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
More investigation
The code described above is running in the MicroServices (which are hosted as AppServices in Azure)
Besides that, we're also running 1 Azure Function that also reads events from the EventHub (probably using the same consumer group).
According to the documentation here: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-features#consumer-groups it should be possible to have 5 consumers per consumer group. It seems to be suggested to only have one, but it's not clear to us what could happen if we don't follow this guidance.
We did some tests with manually spawning multiple instances of our service that reads events, and when there were more than 5 this resulted in a different error which stated quite clearly that there can only be 5 consumers per partition per consumer group (or something similar).
Furthermore, it seems (we're not 100% sure) that this issue started happening when we rewrote the code (above) to be able to spawn one thread per partition (even though we only have 1 partition in the EventHub). Edit: we did some more log-digging and also found a few exceptions before merging in the code that spawns one thread per partition.
That exception indicates that there is another consumer configured to use the same consumer group and asserting exclusive access over the partition. Unless you're explicitly setting the OwnerLevel property in your client options, the likely candidate is that there is at least one EventProcessorClient running.
To remediate, you can:
Stop any event processors running against the same Event Hub and Consumer Group combination, and ensure that no other consumers are explicitly setting the OwnerLevel.
Run these consumers in a dedicated consumer group; this will allow them to co-exist with the exclusive consumer(s) and/or event processors.
Explicitly set the OwnerLevel to 1 or greater for these consumers; that will assert ownership and force any other consumers in the same consumer group to disconnect (see the sketch after this list).
(note: depending on what the other consumer is, you may need to test different values here. The event processor types use 0, so anything above that will take precedence.)
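For illustration, a sketch of the third option applied to the read loop from the question; the relevant knob is ReadEventOptions.OwnerLevel:
// Assert exclusive ownership for this reader; any consumer in the same consumer group with a
// lower (or no) owner level gets disconnected. Event processors use 0, so 1 takes precedence.
var readOptions = new ReadEventOptions
{
    OwnerLevel = 1
};

await foreach (PartitionEvent partitionEvent in consumer.ReadEventsFromPartitionAsync(
    partitionIdInternal, startingPosition, readOptions, stoppingToken))
{
    processedSomething = true;
    startingPosition = await messageProcessor.Handle(partitionEvent);
}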
To add to Jesse's answer, I think the exception message is part of the old SDK.
If you look into the docs, there are 3 types of receiving modes defined there:
Epoch
Epoch is a unique identifier (epoch value) that the service uses, to enforce partition/lease ownership.
The epoch feature provides users the ability to ensure that there is only one receiver on a consumer group at any point in time...
Non-epoch:
... There are some scenarios in stream processing where users would like to create multiple receivers on a single consumer group. To support such scenarios, we do have ability to create a receiver without epoch and in this case we allow upto 5 concurrent receivers on the consumer group.
Mixed:
... If there is a receiver already created with epoch e1 and is actively receiving events and a new receiver is created with no epoch, the creation of new receiver will fail. Epoch receivers always take precedence in the system.
So I have a lambda that makes a point-to-point call to another lambda. We have AWS X-Ray set up so we can monitor performance. However, X-Ray shows this odd result where even though the invocation itself takes only a second, the "invoke" call from the original lambda takes a minute and a half.
This makes no sense, since we are calling the lambda as an event (ACK and forget) and using an async call on which we do not await. It really causes problems because even though all lambdas successfully complete and do their work (as we can see from Cloudwatch logs and resulting data in our data store), occasionally that secondary lambda call takes so long that X-Ray times out, which bombs the whole rest of the trace.
Other notes:
We have Active tracing enabled on both lambdas
We do occasionally have cold start times, but as you can see from the screenshot, there is no "initialization" step here, so both lambdas are warm
This particular example was a singular action with no other activity in the system, so it's not like there was a bottleneck due to high load
Does anyone have an explanation for this, and hopefully what we can do to fix it?
Our invocation code (simplified):
var assetIds = new List<Guid> { Guid.NewGuid() };
var request = new AddBulkAssetHistoryRequest();
request.AssetIds = assetIds.ToList();
request.EventType = AssetHistoryEventTypeConstants.AssetDownloaded;
request.UserId = tokenUserId.Value;
var invokeRequest = new InvokeRequest
{
FunctionName = "devkarl02-BulkAddAssetHistory",
InvocationType = InvocationType.Event,
Payload = JsonConvert.SerializeObject(request)
};
var region = RegionEndpoint.GetBySystemName("us-east-1");
var lambdaClient = new AmazonLambdaClient(region);
_ = lambdaClient.InvokeAsync(invokeRequest);
This is also posted over in the AWS Forums (for whatever that is worth): https://forums.aws.amazon.com/thread.jspa?threadID=307615
So it turns out the issue was that we weren't using the await operator. For some reason, that made the calls interminably slow. Making this small change:
_ = await lambdaClient.InvokeAsync(invokeRequest);
made everything else behave properly, both in logs and in x-ray. Not sure why, but hey, it solved the issue.
As far as I understand, not adding the await causes the call to execute synchronously, while adding the await causes the call to happen asynchronously.
I have this assert in my test code
Assert.That(() => eventData.Count == 0,
Is.True.After(notificationPollingDelay),
"Received unexpected event with last event data" + eventData.Last().Description());
that asserts some condition after a period of time and produces a message on failure. It fails to run because the message string is constructed when the assert starts, not when the assert ends; therefore the eventData collection is still empty (as it is initially) and the attempt to get the Description of the last item in the collection fails. Is there a workaround or a decent alternative to this in NUnit, or do I have to revert to using Thread.Sleep in my tests?
PS: I'm using NUnit 2.5.10.
You may use this scheme:
var constraint = Is.True.After(notificationPollingDelay);
var condition = constraint.Matches(() => eventData.Count == 0);
Assert.IsTrue(condition,
"Received unexpected event with last event data" +
eventData.Last().Description());
This approach is similar to using Thread.Sleep.
In NUnit version 3.50 I had to use a different syntax.
Here is the example:
var delayedConstraint = Is.True.After( delayInMilliseconds: 100000, pollingInterval: 100);
Assert.That( () => yourCondition, delayedConstraint );
This will test whether `yourCondition` is true, waiting up to a certain maximum time, using the `DelayedConstraint` created by the `Is.True.After` method.
In this example the DelayedConstraint is configured to use maximum time of 100 seconds polling every 0.1 seconds.
See also the legacy NUnit 2.5 documentation for DelayedConstraint.
The simplest answer is "don't include that text in your failure message". I personally almost never include a failure message; if your test is atomic enough you don't need to do it. Usually if I need to figure out a cryptic failure, only a debugger helps anyway.
If you really want to do it, this code should work without managing the threads yourself.
try
{
Assert.That(() => eventData.Count == 0, Is.True.After(notificationPollingDelay));
}
catch(AssertionException)
{
throw new Exception("Received unexpected event with last event data" + eventData.Last().Description());
}