We're using the Dropbox API wrapped in Polly to handle retries.
We have it set up as an exponential back-off, as explained here.
The issue we have is that we make plenty of concurrent calls.
When the API starts throwing rate limit exceptions, each individual caller backs off
but new callers will still call the API and "steal" the retry of callers that are waiting.
That means that under high load we are experiencing failed API calls and errors.
What we would like to achieve is that on rate limit errors all calls (including new callers) to the API are synchronized and wait for the rate limit to expire.
Then calls can resume (ideally in sequence to make sure the calls don't return rate limit exceptions anymore).
Is there a Polly-supported way of achieving that?
As I understand it, you want the following:
1. The downstream system can throttle incoming requests
1.1 The system is smart enough to provide a RetryAfter time span
2. You want to avoid flooding the downstream system if you already know that you are throttled
3. But you don't want to lose any incoming request; you would rather process all of them eventually
Let's put together a working example
#1 - Downstream system
Here we will implement a super simple mock which can mimic throttling.
Let's start with the exception
public class DownstreamServiceException: Exception
{
public TimeSpan RetryAfter { get; set; }
}
Now, let's see the service code
public class DownstreamService
{
private readonly CancellationTokenSource initCompletionSignal;
private readonly TimeSpan initDuration;
private bool isAvailable = false;
private DateTime initEstimatedEnd;
public DownstreamService()
{
initDuration = TimeSpan.FromSeconds(10);
initCompletionSignal = new CancellationTokenSource(initDuration);
initCompletionSignal.Token.Register(() => isAvailable = true);
initEstimatedEnd = DateTime.UtcNow.Add(initDuration);
}
public Task<string> GetAsync()
{
if (!isAvailable) throw new DownstreamServiceException { RetryAfter = initEstimatedEnd - DateTime.UtcNow };
return Task.FromResult("Available");
}
}
For the sake of simplicity I've made the service unavailable for the first 10 seconds
I've used a CancellationTokenSource as a timer to make the service available
If GetAsync is called while the service is not available (we are throttled) it throws an exception; otherwise it returns the "Available" string
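The timer trick can be tried in isolation. The following is a minimal sketch, not part of the original answer: TimerDemo and FlagFlipsAfterAsync are hypothetical names used only for illustration. A CancellationTokenSource created with a delay cancels itself once the delay elapses, and any callback registered on its token runs at that point.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper that demonstrates using a CancellationTokenSource as a one-shot timer.
public static class TimerDemo
{
    // Returns true if the callback registered on the token ran before maxWait elapsed.
    public static async Task<bool> FlagFlipsAfterAsync(TimeSpan delay, TimeSpan maxWait)
    {
        var flipped = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        // The source cancels itself after 'delay'; the registered callback then fires.
        using var timer = new CancellationTokenSource(delay);
        using var registration = timer.Token.Register(() => flipped.TrySetResult(true));

        Task finished = await Task.WhenAny(flipped.Task, Task.Delay(maxWait));
        return finished == flipped.Task;
    }

    public static async Task Main()
    {
        bool ran = await FlagFlipsAfterAsync(TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(5));
        Console.WriteLine($"Callback ran: {ran}");
    }
}
```

In the mock service the callback simply flips isAvailable; here it completes a task so the effect can be observed.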
#2 - Avoid flooding if downstream is not available
Here we will define a Circuit Breaker to short-cut the requests if the downstream is not available (we are throttled)
var throttledPolicy = Policy<string>
.Handle<DownstreamServiceException>()
.CircuitBreakerAsync(1, TimeSpan.FromSeconds(0),
onBreak: (result, state, _, __) => {
if (state == CircuitState.Open) return;
Console.WriteLine("onBreak");
throw result.Exception;
},
onReset: (_) => Console.WriteLine("onReset"),
onHalfOpen: () => { });
The Circuit Breaker will transition from Closed to Open when we receive the first DownstreamServiceException
The duration of break (TimeSpan.FromSeconds(0)) does not matter here
We will control the Circuit Breaker's state from the Retry logic
if (state == CircuitState.Open): This will be explained under the retry section
And finally re-throw the original exception (I know, I know ... it should be avoided, but it keeps our example application simple)
#3 - Retry until eventually processed
This is the most complicated part of the solution, because this retry policy handles two exceptions (DownstreamServiceException, IsolatedCircuitException) in different ways
CancellationTokenSource throttlingEndSignal;
var retryPolicy = Policy<string>
.Handle<DownstreamServiceException>()
.Or<IsolatedCircuitException>()
.WaitAndRetryForeverAsync(_ => TimeSpan.FromSeconds(3),
onRetry: (dr, __) =>
{
Console.WriteLine($"onRetry caused by {dr.Exception.GetType().Name}");
if (dr.Exception is DownstreamServiceException dse)
{
throttledPolicy.Isolate();
throttlingEndSignal = new(dse.RetryAfter);
throttlingEndSignal.Token.Register(() => throttledPolicy.Reset());
}
});
Let's start with the DownstreamServiceException
We will receive this exception because we are going to chain together the two policies and Circuit Breaker's onBreak delegate re-throws the received exception
Inside the onRetry we have a guard expression for DownstreamServiceException
Here we call Isolate on the Circuit Breaker, which tries to transition from the Open state to the Isolated state and therefore calls the onBreak delegate again
To avoid an infinite loop we added the if (state == CircuitState.Open) return; guard there
We do the same timer trick here with the CancellationTokenSource: whenever the throttling ends we push the Circuit Breaker back to the Closed state (Reset)
The IsolatedCircuitException case is much simpler
We receive this exception whenever we try to perform a retry attempt while the Circuit Breaker is in the Isolated state
So, the CB short-circuits the execution and, because of the WaitAndRetryForever call, we will eventually succeed
Put things together
var combinedPolicy = Policy.WrapAsync(retryPolicy, throttledPolicy);
var result = await combinedPolicy.ExecuteAsync(async () => await service.GetAsync());
Please note the following:
This solution works well with multiple requests as well, because the Circuit Breaker is shared
This solution is a workaround, because we cannot set the duration of break dynamically
I hope you found this little sample application useful :)
I'm trying to get my head around Polly rate-limit policy.
public class RateLimiter
{
private readonly AsyncRateLimitPolicy _throttlingPolicy;
private readonly Action<string> _rateLimitedAction;
public RateLimiter(int numberOfExecutions, TimeSpan perTimeSpan, Action<string> rateLimitedAction)
{
_throttlingPolicy = Policy.RateLimitAsync(numberOfExecutions, perTimeSpan);
_rateLimitedAction = rateLimitedAction;
}
public async Task<T> Throttle<T>(Func<Task<T>> func)
{
var result = await _throttlingPolicy.ExecuteAndCaptureAsync(func);
if (result.Outcome == OutcomeType.Failure)
{
var retryAfter = (result.FinalException as RateLimitRejectedException)?.RetryAfter ?? TimeSpan.FromSeconds(1);
_rateLimitedAction($"Rate limited. Should retry in {retryAfter}.");
return default;
}
return result.Result;
}
}
In my console application, I'm instantiating a RateLimiter with up to 5 calls per 10 seconds.
var rateLimiter = new RateLimiter(5, TimeSpan.FromSeconds(10), err => Console.WriteLine(err));
var rdm = new Random();
while (true)
{
var result = await rateLimiter.Throttle(() => Task.FromResult(rdm.Next(1, 10)));
if (result != default) Console.WriteLine($"Result: {result}");
await Task.Delay(200);
}
I would expect to see 5 results, and be rate limited on the 6th one. But this is what I get
Result: 9
Rate limited. Should retry in 00:00:01.7744615.
Rate limited. Should retry in 00:00:01.5119933.
Rate limited. Should retry in 00:00:01.2313921.
Rate limited. Should retry in 00:00:00.9797322.
Rate limited. Should retry in 00:00:00.7309150.
Rate limited. Should retry in 00:00:00.4812646.
Rate limited. Should retry in 00:00:00.2313643.
Result: 7
Rate limited. Should retry in 00:00:01.7982864.
Rate limited. Should retry in 00:00:01.5327321.
Rate limited. Should retry in 00:00:01.2517093.
Rate limited. Should retry in 00:00:00.9843077.
Rate limited. Should retry in 00:00:00.7203371.
Rate limited. Should retry in 00:00:00.4700262.
Rate limited. Should retry in 00:00:00.2205184.
I've also tried to use ExecuteAsync instead of ExecuteAndCaptureAsync and it didn't change the results.
public async Task<T> Throttle<T>(Func<Task<T>> func)
{
try
{
var result = await _throttlingPolicy.ExecuteAsync(func);
return result;
}
catch (RateLimitRejectedException ex)
{
_rateLimitedAction($"Rate limited. Should retry in {ex.RetryAfter}.");
return default;
}
}
This doesn't make any sense to me. Is there something I'm missing?
The rate limiter works a bit differently than you might expect. The expected behaviour could be the following:
Let's suppose I have 500 requests and I want to throttle them to 50 per minute
In that case the rate limiter should kick in after the first 50 executions if they were executed in less than a minute
This intuitive approach does not take into account the equal distribution of the incoming load. This might induce the following observable behaviour:
Let's suppose the first 50 executions took 30 seconds
Then you have to wait another 30 seconds to execute the 51st request
Polly's rate limiter uses the Leaky bucket algorithm
This works in the following way:
The bucket has a fixed capacity
The bucket has a leak at the bottom
Water drops leave the bucket at a given frequency
The bucket can receive new water drops from the top
The bucket can overflow if the incoming frequency is greater than the outgoing
So, technically speaking:
it is a fixed-size queue
the dequeue is called periodically
if the queue is full then the enqueue throws an exception
The most important information from the above description is the following: the leaky bucket algorithm uses a constant rate to empty the bucket.
UPDATE 14/11/22
Let me correct myself. Polly's rate limiter is using token bucket not leaky bucket. There are also other algorithms like fixed window counter, sliding window log or sliding window counter. You can read about the alternatives here or inside the System Design Interview Volume 1 book's chapter 4
So, let's talk about the token bucket algorithm:
The bucket has a fixed capacity
Tokens are put into the bucket at a fixed periodic rate
If the bucket is full, no more tokens are added to it (overflow)
Each request tries to consume a single token
If there is at least one then the request consumes it and the request is allowed
If there isn't at least one token inside the bucket then the request is dropped
(Source)
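The token bucket steps listed above can be sketched in a few lines. This is a simplified, single-threaded illustration and not Polly's lock-free implementation: TokenBucket, TokenBucketDemo, the injected clock and the assumption that the bucket starts full are all choices made for this example. The demo replays calls spaced 250 ms apart (roughly the spacing visible in the question's output) against a bucket that gains one token every 2 seconds.

```csharp
using System;

// Simplified token bucket for illustration only (assumption: the bucket starts full).
public sealed class TokenBucket
{
    private readonly int capacity;
    private readonly TimeSpan onePer;
    private readonly Func<DateTime> clock;
    private int tokens;
    private DateTime nextRefill;

    public TokenBucket(TimeSpan onePer, int capacity, Func<DateTime> clock)
    {
        this.onePer = onePer;
        this.capacity = capacity;
        this.clock = clock;
        tokens = capacity;
        nextRefill = clock() + onePer;
    }

    // Consumes one token if available; otherwise reports how long until the next token arrives.
    public bool TryAcquire(out TimeSpan retryAfter)
    {
        DateTime now = clock();
        while (now >= nextRefill)
        {
            if (tokens < capacity) tokens++;   // a full bucket discards the new token
            nextRefill += onePer;
        }

        if (tokens > 0)
        {
            tokens--;
            retryAfter = TimeSpan.Zero;
            return true;
        }

        retryAfter = nextRefill - now;
        return false;
    }
}

public static class TokenBucketDemo
{
    // Replays evenly spaced calls on a virtual clock; 'A' = allowed, 'R' = rejected.
    public static string Simulate(int calls, TimeSpan spacing, TimeSpan onePer, int capacity)
    {
        DateTime now = new DateTime(2022, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var bucket = new TokenBucket(onePer, capacity, () => now);
        var outcome = new char[calls];
        for (int i = 0; i < calls; i++)
        {
            outcome[i] = bucket.TryAcquire(out _) ? 'A' : 'R';
            now += spacing;
        }
        return new string(outcome);
    }

    public static void Main()
    {
        // Capacity 1: one call passes, then seven are rejected until the next token arrives.
        Console.WriteLine(Simulate(10, TimeSpan.FromMilliseconds(250), TimeSpan.FromSeconds(2), 1)); // ARRRRRRRAR
        // Capacity 5: an initial burst of five passes, then the same one-per-2-seconds pattern.
        Console.WriteLine(Simulate(10, TimeSpan.FromMilliseconds(250), TimeSpan.FromSeconds(2), 5)); // AAAAARRRAR
    }
}
```

With a capacity of 1 the pattern is one allowed call followed by a run of rejections, which has the same shape as the output shown in the question.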
If we scrutinise the implementation then we can see the following things:
The RateLimiterPolicy calls the RateLimiterEngine's static method
The RateLimiterEngine calls a method on an IRateLimiter interface
There is only one class (at the time of writing) which implements this interface
The RateLimiterFactory exposes a method to create LockFreeTokenBucketRateLimiter
public static IRateLimiter Create(TimeSpan onePer, int bucketCapacity)
=> new LockFreeTokenBucketRateLimiter(onePer, bucketCapacity);
Please be aware of how the parameters are named (onePer and bucketCapacity)!
If you are interested in the actual implementation, you can find it here. (Almost every line is commented)
I want to emphasize one more thing. The rate limiter does not perform any retry. If you want to continue the execution after the penalty time is over then you have to do it yourself. Either by writing some custom code or by combining a retry policy with the rate limiter policy.
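The "custom code" option could look roughly like the sketch below. It deliberately avoids Polly so that it runs on its own: RateLimitedException is a hypothetical stand-in for a rejection that carries a RetryAfter hint, and RetryAfterHelper is an illustrative name, not a library type.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a rate-limit rejection that carries a wait hint.
public sealed class RateLimitedException : Exception
{
    public TimeSpan RetryAfter { get; }

    public RateLimitedException(TimeSpan retryAfter) : base("Rate limited")
        => RetryAfter = retryAfter;
}

public static class RetryAfterHelper
{
    // Runs the action; on a rate-limit rejection waits for the hinted time and tries again,
    // up to maxAttempts in total. The last rejection is rethrown to the caller.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> action, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (RateLimitedException ex) when (attempt < maxAttempts)
            {
                await Task.Delay(ex.RetryAfter);
            }
        }
    }

    public static async Task Main()
    {
        int calls = 0;
        int result = await ExecuteAsync(() =>
        {
            calls++;
            // Simulate two rejections before the call goes through.
            if (calls < 3) throw new RateLimitedException(TimeSpan.FromMilliseconds(50));
            return Task.FromResult(42);
        }, maxAttempts: 5);

        Console.WriteLine($"Result {result} after {calls} calls"); // Result 42 after 3 calls
    }
}
```

The same idea applies when a retry policy wraps the rate limiter: the wait before the next attempt is taken from the rejection rather than from a fixed back-off.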
There is an overload accepting a third parameter, maxBurst:
The maximum number of executions that will be permitted in a single burst (for example if none have been executed for a while).
The default value is 1. If you set it to numberOfExecutions you will see the desired effect for the first burst, though after that it will deteriorate to a pattern similar to the one you observe. I have not dug too deep, but my guess is that this comes from how the limiter "frees" the resources and from the var onePer = TimeSpan.FromTicks(perTimeSpan.Ticks / numberOfExecutions); calculation: based on the docs and code, rate limiting seems to happen at a rate of "1 execution per perTimeSpan/numberOfExecutions" rather than "numberOfExecutions in any selected perTimeSpan":
_throttlingPolicy = Policy.RateLimitAsync(numberOfExecutions, perTimeSpan, numberOfExecutions);
Adding periodic wait for several seconds brings back the "bursts" though.
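To make the arithmetic concrete, the onePer calculation quoted above can be evaluated for the numbers in the question. OnePerDemo is just an illustrative wrapper around that one expression.

```csharp
using System;

public static class OnePerDemo
{
    // Same expression as quoted above: the interval at which one execution is permitted.
    public static TimeSpan OnePer(int numberOfExecutions, TimeSpan perTimeSpan)
        => TimeSpan.FromTicks(perTimeSpan.Ticks / numberOfExecutions);

    public static void Main()
    {
        // 5 executions per 10 seconds: one execution every 2 seconds.
        Console.WriteLine(OnePer(5, TimeSpan.FromSeconds(10)));  // 00:00:02
        // 50 executions per minute: one execution every 1.2 seconds.
        Console.WriteLine(OnePer(50, TimeSpan.FromMinutes(1)));  // 00:00:01.2000000
    }
}
```

This is consistent with the output in the question, where the retry hints count down from just under 2 seconds after each successful result.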
Also see:
docs
allow for bursts part of the doc
issue about rate limiter engine.
Introduction
Hello all, we're currently working on a microservice platform that uses Azure EventHubs and events to send data between the services.
Let's just name these services: CustomerService, OrderService and MobileBFF.
The CustomerService mainly sends updates (with events) which will then be stored by the OrderService and MobileBFF to be able to respond to queries without having to call the CustomerService for this data.
All these 3 services + our developers on the DEV environment make use of the same ConsumerGroup to connect to these event hubs.
We currently make use of only 1 partition but plan to expand to multiple later. (You can see our code is already made to be able to read from multiple partitions)
Exception
Every now and then we run into an exception, though (once it starts, it usually keeps being thrown for an hour or so). So far we've only seen this error on the DEV/TEST environments.
The exception:
Azure.Messaging.EventHubs.EventHubsException(ConsumerDisconnected): At least one receiver for the endpoint is created with epoch of '0', and so non-epoch receiver is not allowed. Either reconnect with a higher epoch, or make sure all epoch receivers are closed or disconnected.
All consumers of the EventHub store their SequenceNumber in their own database. This allows each consumer to consume events separately and to store the last processed SequenceNumber in its own SQL database. When the service (re)starts, it loads the SequenceNumber from the db and then requests events from there onwards until no more events can be found. It then sleeps for 100ms and retries. Here's the (somewhat simplified) code:
var consumerGroup = EventHubConsumerClient.DefaultConsumerGroupName;
string[] allPartitions = null;
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
{
allPartitions = await consumer.GetPartitionIdsAsync(stoppingToken);
}
var allTasks = new List<Task>();
foreach (var partitionId in allPartitions)
{
//This is required if you reuse variables inside a Task.Run();
var partitionIdInternal = partitionId;
allTasks.Add(Task.Run(async () =>
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
{
EventPosition startingPosition;
using (var testScope = _serviceProvider.CreateScope())
{
var messageProcessor = testScope.ServiceProvider.GetService<EventHubInboxManager<T, EH>>();
//Obtains starting position from the database or sets to "Earliest" or "Latest" based on configuration
startingPosition = await messageProcessor.GetStartingPosition(_inboxOptions.InboxIdentifier, partitionIdInternal);
}
while (!stoppingToken.IsCancellationRequested)
{
bool processedSomething = false;
await foreach (PartitionEvent partitionEvent in consumer.ReadEventsFromPartitionAsync(partitionIdInternal, startingPosition, stoppingToken))
{
processedSomething = true;
startingPosition = await messageProcessor.Handle(partitionEvent);
}
if (processedSomething == false)
{
await Task.Delay(100, stoppingToken);
}
}
}
}
catch (Exception ex)
{
//Log error / delay / retry
}
}
}));
}
The exception is thrown on the following line:
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
More investigation
The code described above is running in the MicroServices (which are hosted as AppServices in Azure)
Next to that we're also running 1 Azure Function that also reads events from the EventHub. (Probably uses the same consumer group).
According to the documentation here: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-features#consumer-groups it should be possible to have 5 consumers per consumer group. It seems to be suggested to only have one, but it's not clear to us what could happen if we don't follow this guidance.
We did some tests, manually spawning multiple instances of our service that reads events, and when there were more than 5 this resulted in a different error, which stated quite clearly that there can only be 5 consumers per partition per consumer group (or something similar).
Furthermore, it seems (we're not 100% sure) that this issue started happening when we rewrote the code (above) to spawn one thread per partition (even though we only have 1 partition in the EventHub). Edit: we did some more log-digging and also found a few of these exceptions from before we merged the code that spawns one thread per partition.
That exception indicates that there is another consumer configured to use the same consumer group and asserting exclusive access over the partition. Unless you're explicitly setting the OwnerLevel property in your client options, the likely candidate is that there is at least one EventProcessorClient running.
To remediate, you can:
Stop any event processors running against the same Event Hub and Consumer Group combination, and ensure that no other consumers are explicitly setting the OwnerLevel.
Run these consumers in a dedicated consumer group; this will allow them to co-exist with the exclusive consumer(s) and/or event processors.
Explicitly set the OwnerLevel to 1 or greater for these consumers; that will assert ownership and force any other consumers in the same consumer group to disconnect.
(note: depending on what the other consumer is, you may need to test different values here. The event processor types use 0, so anything above that will take precedence.)
To add to Jesse's answer, I think the exception message is part of the old SDK.
If you look into the docs, there are 3 types of receiving modes defined there:
Epoch
Epoch is a unique identifier (epoch value) that the service uses, to enforce partition/lease ownership.
The epoch feature provides users the ability to ensure that there is only one receiver on a consumer group at any point in time...
Non-epoch:
... There are some scenarios in stream processing where users would like to create multiple receivers on a single consumer group. To support such scenarios, we do have ability to create a receiver without epoch and in this case we allow upto 5 concurrent receivers on the consumer group.
Mixed:
... If there is a receiver already created with epoch e1 and is actively receiving events and a new receiver is created with no epoch, the creation of new receiver will fail. Epoch receivers always take precedence in the system.
ServiceBusProcessor cannot be used without specifying ProcessMessageAsync and ProcessErrorAsync. What the first handler does is very clear, but I'm not sure what to do in ProcessErrorAsync.
Are the following methods identical?
var client = new ServiceBusClient(connectionString);
var processor = client.CreateProcessor(queueName);
processor.ProcessMessageAsync += async arg =>
{
try
{
//process message
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
};
processor.ProcessErrorAsync += arg =>
{
return Task.CompletedTask;
};
await processor.StartProcessingAsync(cancellationToken);
and
var client = new ServiceBusClient(connectionString);
var processor = client.CreateProcessor(queueName);
processor.ProcessMessageAsync += async arg =>
{
//process message
};
processor.ProcessErrorAsync += arg =>
{
//the exception is exposed on the event args
Console.WriteLine(arg.Exception.Message);
return Task.CompletedTask;
};
await processor.StartProcessingAsync(cancellationToken);
ProcessMessageAsync is a handler that will be called each time a message is read from your Service Bus instance and needs to be processed. This is where your business logic for handling messages should be.
ProcessErrorAsync is a handler that allows you to observe exceptions that occur during processor operation - both in your message processing code and within the processor infrastructure itself.
The processor is built to be resilient and do its best to recover from problems and continue processing. Because of this, it will shrug off most exceptions, surface them to the handler, and then continue moving forward. The error handler is how your application is notified of problems so that it can take the appropriate actions.
As for what you should do in the handler, much of that depends on your application and its needs. At minimum, most applications want to understand when errors occur and log them in case analysis is needed at a later point. You may also want to use this to detect poison messages or other processing problems and take the action that is appropriate for your application.
The processor has no knowledge of your application's design or needs, nor that of the environment in which it is hosted. That means that it cannot make intelligent decisions about when a normally transient issue should be fatal or when there's a bigger issue in your application ecosystem. It is important to remember that your application is responsible for understanding the pattern of errors and determining if the application or processor is not healthy in a non-obvious way.
For example, if the processor cannot reach your Service Bus instance, it will continue to retry forever. If your application sees these exceptions consistently for a period of time, it may be a sign of an unhealthy host network and your application may choose to stop processing and reset the host. Likewise, if your application is expecting a specific schema for incoming messages and those published to your Service Bus instance aren't correct, the processor will continue to try handling them, but your application should be better able to recognize the larger problem and take the appropriate action.
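One way an application might track "the pattern of errors" is a small sliding-window counter that the error handler feeds. The sketch below is only an illustration under stated assumptions: ErrorWindow, its threshold and its window length are hypothetical, and what to do once the threshold is reached (stop processing, reset the host, alert) is left to the application.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts errors inside a sliding time window.
public sealed class ErrorWindow
{
    private readonly Queue<DateTime> errors = new Queue<DateTime>();
    private readonly object gate = new object();
    private readonly int threshold;
    private readonly TimeSpan window;

    public ErrorWindow(int threshold, TimeSpan window)
    {
        this.threshold = threshold;
        this.window = window;
    }

    // Records one error and returns true once 'threshold' errors fall within 'window'.
    public bool RecordError(DateTime utcNow)
    {
        lock (gate)
        {
            errors.Enqueue(utcNow);
            // Drop errors that are older than the window.
            while (errors.Count > 0 && utcNow - errors.Peek() > window)
            {
                errors.Dequeue();
            }
            return errors.Count >= threshold;
        }
    }
}

public static class ErrorWindowDemo
{
    public static void Main()
    {
        var tracker = new ErrorWindow(threshold: 3, window: TimeSpan.FromMinutes(1));
        DateTime start = new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);

        Console.WriteLine(tracker.RecordError(start));                // False
        Console.WriteLine(tracker.RecordError(start.AddSeconds(10))); // False
        Console.WriteLine(tracker.RecordError(start.AddSeconds(20))); // True
        Console.WriteLine(tracker.RecordError(start.AddMinutes(5)));  // False
    }
}
```

Inside a ProcessErrorAsync handler, the application would call RecordError(DateTime.UtcNow) after logging the exception and react when it returns true.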
I have some pretty naive code:
public async Task Produce(string topic, object message, MessageHeader messageHeaders)
{
try
{
var producerClient = _EventHubProducerClientFactory.Get(topic);
var eventData = CreateEventData(message, messageHeaders);
messageHeaders.Times?.Add(DateTime.Now);
await producerClient.SendAsync(new EventData[] { eventData });
messageHeaders.Times?.Add(DateTime.Now);
//.....
Log.Info($"Milliseconds spent: {(messageHeaders.Times[1]- messageHeaders.Times[0]).TotalMilliseconds}");
}
}
private EventData CreateEventData(object message, MessageHeader messageHeaders)
{
var eventData = new EventData(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(message)));
eventData.Properties.Add("CorrelationId", messageHeaders.CorrelationId);
if (messageHeaders.DateTime != null)
eventData.Properties.Add("DateTime", messageHeaders.DateTime?.ToString("s"));
if (messageHeaders.Version != null)
eventData.Properties.Add("Version", messageHeaders.Version);
return eventData;
}
In the logs I saw values of almost 1 second (~800 milliseconds).
What could be the reason for such a long execution time?
The EventHubProducerClient opens connections to the Event Hubs service lazily, waiting until the first time an operation requires it. In your snippet, the call to SendAsync triggers an AMQP connection to be created, an AMQP link to be created, and authentication to be performed.
Unless the client is closed, most future calls won't incur that overhead as the connection and link are persistent. Most being an important distinction in that statement, as the client may need to reconnect in the face of a network error, when activity is low and the connection idles out, or if the Event Hubs service terminates the connection/link.
As Serkant mentions, if you're looking to understand timings, you'd probably be best served using a library like BenchmarkDotNet that works over a large number of iterations to derive statistically meaningful results.
You are measuring the first 'Send'. That will incur some overhead that other Sends won't. So, always warm up first, for example by sending a single event, and then measure the next one.
Another important thing: it is not right to measure just a single 'Send' call. Measure a bunch of calls instead and calculate latency percentiles. That should provide a better figure for your tests.
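A rough way to follow this advice without a benchmarking library is sketched below. LatencyProbe is a hypothetical helper, the send delegate stands in for the real SendAsync call, and the nearest-rank percentile is one of several common definitions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper: warm up, time many calls, then report percentiles.
public static class LatencyProbe
{
    // Nearest-rank percentile over samples sorted in ascending order.
    public static double Percentile(IReadOnlyList<double> sorted, double p)
    {
        if (sorted.Count == 0) throw new ArgumentException("No samples.", nameof(sorted));
        int rank = (int)Math.Ceiling(p * sorted.Count / 100.0);
        return sorted[Math.Clamp(rank, 1, sorted.Count) - 1];
    }

    // Runs 'warmups' untimed calls, then times 'iterations' calls; returns sorted milliseconds.
    public static async Task<List<double>> MeasureAsync(Func<Task> send, int warmups, int iterations)
    {
        for (int i = 0; i < warmups; i++)
        {
            await send(); // pays the one-off connection cost outside the measurement
        }

        var samples = new List<double>(iterations);
        for (int i = 0; i < iterations; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            await send();
            stopwatch.Stop();
            samples.Add(stopwatch.Elapsed.TotalMilliseconds);
        }

        samples.Sort();
        return samples;
    }

    public static async Task Main()
    {
        // Task.Delay is a placeholder for the real send operation.
        List<double> samples = await MeasureAsync(() => Task.Delay(5), warmups: 1, iterations: 50);
        Console.WriteLine($"p50={Percentile(samples, 50):F1} ms");
        Console.WriteLine($"p95={Percentile(samples, 95):F1} ms");
        Console.WriteLine($"p99={Percentile(samples, 99):F1} ms");
    }
}
```

Stopwatch is used instead of DateTime.Now because it is designed for measuring elapsed time.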
So, I'm writing some retry logic for acquiring a lock using Polly. The overall timeout value will be provided by the API caller. I know I can wrap a policy in an overall timeout. However, if the supplied timeout value is too low is there a way I can ensure that the policy is executed at least once?
Obviously I could call the delegate separately before the policy is executed, but I was just wondering if there was a way to express this requirement in Polly.
var result = Policy.Timeout(timeoutFromApiCaller)
.Wrap(Policy.HandleResult(false)
.WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500)))
.Execute(() => this.TryEnterLock());
If timeoutFromApiCaller is say 1 tick and there's a good chance it takes longer than that to reach the timeout policy then the delegate wouldn't get called (the policy would timeout and throw TimeoutRejectedException).
What I'd like to happen can be expressed as:
var result = this.TryEnterLock();
if (!result)
{
result = Policy.Timeout(timeoutFromApiCaller)
.Wrap(Policy.HandleResult(false)
.WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500)))
.Execute(() => this.TryEnterLock());
}
But it'd be really nice if it could be expressed in pure-Polly...
To be honest, I don't understand what 1 tick means in your case. Is it a nanosecond or greater than that? Your global timeout should be greater than your local timeout.
But as far as I can see you have not specified a local one. TryEnterLock should receive a TimeSpan so that it does not block the caller indefinitely. If you look at the built-in sync primitives, most of them provide such a capability: Monitor.TryEnter, SpinLock.TryEnter, WaitHandle.WaitOne, etc.
So, just to wrap it up:
// The generic Timeout<bool> lets both policies share one result type in Policy.Wrap
var timeoutPolicy = Policy.Timeout<bool>(TimeSpan.FromMilliseconds(1000));
var retryPolicy = Policy.HandleResult(false)
.WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500));
var resilientStrategy = Policy.Wrap(timeoutPolicy, retryPolicy);
var result = resilientStrategy.Execute(() => this.TryEnterLock(TimeSpan.FromMilliseconds(100)));
The timeout and delay values should be adjusted to your business needs. I highly encourage you to log when the global Timeout (onTimeout / onTimeoutAsync) fires and when the retries (onRetry / onRetryAsync) fire, to be able to fine-tune / calibrate these values.
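The TryEnterLock(TimeSpan) call used above is not shown in the question. A minimal sketch, assuming the lock is a plain Monitor-based one, could look like this; LockHolder and LockDemo are hypothetical names used only for this example.

```csharp
using System;
using System.Threading;

// Hypothetical lock wrapper with a local timeout, built on Monitor.TryEnter.
public sealed class LockHolder
{
    private readonly object gate = new object();

    // Waits at most 'timeout' for the lock and returns false instead of blocking forever.
    public bool TryEnterLock(TimeSpan timeout) => Monitor.TryEnter(gate, timeout);

    public void ExitLock() => Monitor.Exit(gate);
}

public static class LockDemo
{
    // The current thread takes the lock; a second thread then fails to get it within 100 ms.
    public static (bool first, bool second) Contend()
    {
        var holder = new LockHolder();
        bool first = holder.TryEnterLock(TimeSpan.FromMilliseconds(100));

        // Monitor is re-entrant per thread, so contention needs a different thread.
        bool second = true;
        var other = new Thread(() => second = holder.TryEnterLock(TimeSpan.FromMilliseconds(100)));
        other.Start();
        other.Join();

        if (first) holder.ExitLock();
        return (first, second);
    }

    public static void Main()
    {
        var (first, second) = Contend();
        Console.WriteLine($"{first} {second}"); // True False
    }
}
```

Because each attempt returns within its local timeout, the retry policy gets control back regularly and the global timeout has a chance to be observed.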
EDIT: Based on the comments of this post
As it turned out, there is no control over timeoutFromApiCaller, so it can be arbitrarily small. (In the given example it is just a few nanoseconds, with the intent to emphasize the problem.) So, in order to have an at-least-one-call guarantee, we have to make use of the Fallback policy.
Instead of manually calling TryEnterLock upfront, outside the policies, we should call it as the last action to satisfy the requirement. Policies use escalation: whenever the inner policy fails, it delegates the problem to the next outer policy.
So, if the provided timeout is so tiny that the action cannot finish within that period, a TimeoutRejectedException is thrown. With the Fallback we can handle that, and the action can be performed again, but now without any timeout constraint. This gives us the desired at-least-once guarantee.
// Policy<bool> is needed so that the fallback can return TryEnterLock's result
var atLeastOnce = Policy<bool>.Handle<TimeoutRejectedException>()
.Fallback((ct) => this.TryEnterLock());
// The generic Timeout<bool> lets all three policies share one result type in Policy.Wrap
var globalTimeout = Policy.Timeout<bool>(TimeSpan.FromMilliseconds(1000));
var foreverRetry = Policy.HandleResult(false)
.WaitAndRetryForever(_ => TimeSpan.FromMilliseconds(500));
var resilientStrategy = Policy.Wrap(atLeastOnce, globalTimeout, foreverRetry);
var result = resilientStrategy.Execute(() => this.TryEnterLock());