Orleans retry mechanism - c#

I am aiming for exactly-once delivery in Orleans. I have taken care of at-most-once by using sequence numbers and was relying on the Orleans retry mechanism for at-least-once. So I have configured something like this, increasing the timeout limit to 2 minutes and setting the resend count to 60:
.Configure<SiloMessagingOptions>(options =>
{
    options.ResendOnTimeout = true;
    options.MaxResendCount = 60;
    options.ResponseTimeout = new TimeSpan(0, 2, 0);
});
Is this enough?
Is there any way to know my SiloMessagingOptions like resendCount, after the Silo has been started?
How does Orleans determine that a message has failed? If I don't await the Task and the message fails, does Orleans still detect it and resend the message? Is there any way for the application to know that a message has failed?
What benefit do I get, in context of message reliability, by awaiting a Task (assuming I don't care about the return value of the Task)?
UPDATE: I have been told that using the at-least-once delivery of Orleans is not the best way to go and that I should use features like reminders instead. The question above is here just to get some doubts cleared.
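The at-most-once half mentioned above (sequence numbers) can be sketched roughly as follows. This is a minimal illustration, not Orleans API: `SequenceFilter` and `ShouldProcess` are hypothetical names, and the sketch assumes each sender numbers its messages in increasing order and that they arrive in order.

```csharp
using System.Collections.Generic;

// Hypothetical receiver-side filter (not part of Orleans): remembers the
// highest sequence number seen per sender and rejects anything at or below it.
public sealed class SequenceFilter
{
    private readonly Dictionary<string, long> _lastSeen = new Dictionary<string, long>();

    // Returns true if the message should be processed, false if it is a duplicate.
    public bool ShouldProcess(string senderId, long sequenceNumber)
    {
        if (_lastSeen.TryGetValue(senderId, out long last) && sequenceNumber <= last)
        {
            return false; // already handled, e.g. a resend after a timeout
        }

        _lastSeen[senderId] = sequenceNumber;
        return true;
    }
}
```

With a filter like this, a resent message carrying an already-seen number is dropped, so resends only affect the at-least-once side. The state is kept in memory here; it would have to be persisted to survive a restart.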

Related

Deferring and re-receiving a deferred message in an IHostBuilder hosted service

If the processing of an Azure Service Bus message depends on another resource, e.g. an API or a database service, and this resource is not available, not calling CompleteMessageAsync() is not an option, because the message will be immediately received again until the Max Delivery Count is reached, and then put into the DLQ. If an API is down for maintenance, we want to wait a bit before retrying.
One of the answers to this question has the general steps for deferring and receiving deferred messages. This is a little better than Microsoft's documentation, but not enough for me to understand the intent of the API, and how it is to be implemented in a hosted service that basically sits in ServiceBusProcessor.StartProcessingAsync all day long.
This is the basic structure of my service:
public class ServiceBusWatcher : IHostedService, IDisposable
{
    public Task StartAsync(CancellationToken stoppingToken)
    {
        ReceiveMessagesAsync();
        return Task.CompletedTask;
    }

    private async void ReceiveMessagesAsync()
    {
        ServiceBusClient client = new ServiceBusClient(connectionString);
        processor = client.CreateProcessor(queueName, new ServiceBusProcessorOptions());
        processor.ProcessMessageAsync += MessageHandler;
        await processor.StartProcessingAsync();
    }

    async Task MessageHandler(ProcessMessageEventArgs args)
    {
        // A dependency needed to process the message is not available, so:
        await args.DeferMessageAsync(args.Message);
    }
}
Once the message is deferred, it is my understanding that the processor will not get to it anymore (or will it?). Instead, I have to use ReceiveDeferredMessageAsync() to receive it, along with the sequence number of the originally received message.
In my case, it will make sense to wait minutes or hours before trying again.
This could be done with a separate service that uses a timer and an explicit call to ReceiveDeferredMessageAsync(), as opposed to using a ServiceBusProcessor. I also suppose that the deferred message sequence numbers will have to be persisted in non-volatile storage so that they don't get lost.
Does this sound like a viable approach? I don't like having to remember its sequence numbers so that I can get to a message later. It goes against everything that using a message queue brings to the table in the first place.
Or, instead of deferring, I could just post a new "internal" message with the sequence number and use the ScheduledEnqueueTimeUtc property to delay receiving it. Once I receive this message, I could call ReceiveDeferredMessageAsync() with that sequence number to get to the original message. This seems elegant at the surface, but messages could quickly multiply if there is a longer outage of a dependency.
Another idea that could work without another service: I could complete and repost the payload of the message and set ScheduledEnqueueTimeUtc to a time in the future, as described in another answer to the question I mentioned earlier. Assuming that this works (Microsoft's documentation does not mention what this property is for), it seems simple and clean, and I like simple.
How have you solved this? Is there a better/preferred way that balances low complexity with high robustness without requiring a large amount of code?
Deferring a message works when you know which message you want to retrieve later and your receiver has saved the message sequence number needed to retrieve the deferred message. If the receiver has no ability to save the message sequence number, then delaying the message is a better option. Delaying a message means copying the original message data into a newly scheduled message and completing the original one. That way the consumer has to neither hold on to the message sequence number nor initiate the retrieval of a specific message.
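The "copy into a newly scheduled message" idea can be modelled in memory like this. It is only a sketch: `DelayedMessage`, `RepostWithDelay` and the doubling backoff are assumptions made for illustration and are not Service Bus types or behaviour. Carrying an attempt counter in the copy is one possible way to keep a long outage from producing ever more frequent retries.

```csharp
using System;

// Hypothetical in-memory stand-in for a queue message; not a Service Bus type.
public sealed record DelayedMessage(string Body, int Attempt, DateTimeOffset EnqueueAt);

public static class RepostWithDelay
{
    // Assumed backoff policy: 1, 2, 4, ... minutes, capped at maxDelay.
    public static TimeSpan NextDelay(int attempt, TimeSpan maxDelay)
    {
        double minutes = Math.Pow(2, Math.Min(attempt, 20));
        TimeSpan delay = TimeSpan.FromMinutes(minutes);
        return delay < maxDelay ? delay : maxDelay;
    }

    // Builds the copy that would be scheduled; the caller would then
    // complete the original message so it is not redelivered.
    public static DelayedMessage Repost(DelayedMessage original, DateTimeOffset now, TimeSpan maxDelay)
    {
        return new DelayedMessage(
            original.Body,
            original.Attempt + 1,
            now + NextDelay(original.Attempt, maxDelay));
    }
}
```

In a real handler the copy would be sent with its scheduled enqueue time set to `EnqueueAt`, and the original completed only after that send has succeeded, so a crash in between produces a duplicate rather than a lost message.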

Azure Service Bus - MaxConcurrentCalls=1 - The lock supplied is invalid. Either the lock expired

I am using Azure Service Bus and I have the code below (c# .NetCore 3.1). I am constantly getting the error "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance." when I call "CompleteAsync"
As you can see in the code I have the "ReceiveMode.PeekLock", "AutoComplete = false" and MaxAutoRenewDuration to 5 min. The code that handles the message completes in less than 1 second and I still get that error every single time.
What drove me crazy is that after hours of reading posts, rewriting my code and a lot of trial and error, I decided to increase MaxConcurrentCalls from 1 to 2 and magically the error disappeared.
Does anybody know what is going on here?
public void OpenQueue(string queueName)
{
    var messageHandlerOptions = new MessageHandlerOptions(exceptionReceivedEventArgs =>
    {
        Log.Error($"Message handler encountered an exception {exceptionReceivedEventArgs.Exception}.");
        return Task.CompletedTask;
    });
    messageHandlerOptions.MaxConcurrentCalls = 1;
    messageHandlerOptions.AutoComplete = false;
    messageHandlerOptions.MaxAutoRenewDuration = TimeSpan.FromSeconds(300);
    messageReceiver = queueManagers.OpenReceiver(queueName, ReceiveMode.PeekLock);
    messageReceiver.RegisterMessageHandler(async (message, token) =>
    {
        if (await ProcessMessage(message)) // really quick operation, less than 1 second
        {
            await messageReceiver.CompleteAsync(message.SystemProperties.LockToken);
        }
        else
        {
            await messageReceiver.AbandonAsync(message.SystemProperties.LockToken);
        }
    }, messageHandlerOptions);
}
I decided to increase the MaxConcurrentCalls from 1 to 2 and magically the error disappeared.
Concurrency and lock duration are not the only variables in the equation. This sounds like a prefetch issue. If prefetch is enabled, more messages are fetched than are being processed, to save on latency and roundtrips. If the prefetch is too aggressive, messages that have been prefetched and are waiting will still be processed eventually, and while the processing itself would normally be short enough, the combined time of waiting for processing plus the actual processing can exceed the lock duration.
I would suggest that you:
Increase MaxLockDuration on the queue
Validate the prefetch count
Regarding MaxLockDuration vs MaxAutoRenewDuration these two are tricky. While the first is guaranteed, the second is not and is a best-effort by the client.
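The prefetch arithmetic described above can be made concrete with a rough estimate. This is a simplified model under stated assumptions: all prefetched messages are taken to be locked at the same moment, every message takes the same time to process, and `PrefetchLockCheck` and its methods are hypothetical helpers, not SDK calls.

```csharp
using System;

public static class PrefetchLockCheck
{
    // Estimated age of the lock on the last prefetched message when it is
    // finally completed: it waits behind the others, then is processed itself.
    public static TimeSpan WorstCaseLockAge(int prefetchCount, int maxConcurrentCalls, TimeSpan processingTime)
    {
        if (prefetchCount < 1)
            throw new ArgumentOutOfRangeException(nameof(prefetchCount));
        if (maxConcurrentCalls < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentCalls));

        // Number of processing rounds needed to drain the buffer (ceiling division).
        int rounds = (prefetchCount + maxConcurrentCalls - 1) / maxConcurrentCalls;
        return TimeSpan.FromTicks(processingTime.Ticks * rounds);
    }

    // True when the estimate exceeds the lock duration configured on the queue.
    public static bool LockMayExpire(int prefetchCount, int maxConcurrentCalls, TimeSpan processingTime, TimeSpan lockDuration)
    {
        return WorstCaseLockAge(prefetchCount, maxConcurrentCalls, processingTime) > lockDuration;
    }
}
```

For example, with 100 prefetched messages, one concurrent call, one second per message and a 30-second lock, the last message would be roughly 100 seconds old when completed, well past its lock, even though each individual message is handled in a second.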
I'm writing the solution for my problem as it may help others.
Turns out the root cause of the problem was a quite basic mistake, but the error got me really confused.
The method OpenQueue was called more than once on the same class instance (a multiple-queues scenario), which was a mistake. The behavior was quite weird: it looks like queueManagers registered all queues as expected, but the token got overwritten, causing it to always be invalid.
When I wrote:
I decided to increase the MaxConcurrentCalls from 1 to 2 and magically the error disappeared.
Later that statement proved to be incorrect. When I enabled multiple queues that failed miserably.
The block of code I posted here actually works. What was around it was broken. I was trying to save some time and ended up writing bad code. I fixed my design to manage things properly and everything is now running smoothly.

Amazon SQS "Long Polling" configuration. Server vs Client

A long time ago, Amazon introduced the long polling feature. And with that, it is possible to configure on the Queue the "Receive Message Wait Time" parameter. According to the documentation, a valid value falls in the range 0 - 20 seconds.
In the client, we can also configure this parameter on each ReceiveMessageRequest. I'm using the AWS SDK for .NET.
var receiveRequest = new ReceiveMessageRequest
{
    QueueUrl = "https://queue-url-goes-here.com",
    MaxNumberOfMessages = 10,
    VisibilityTimeout = 30,
    WaitTimeSeconds = 20 // This should tell if we want long polling or not
};
Questions:
a) What is the relationship between the Receive Message Wait Time configured on the queue and the WaitTimeSeconds attribute set in the ReceiveMessageRequest? Do they work independently, or does the value set in the client override the value set on the queue (for that single request)?
b) Under certain conditions, can the C# client time out? I am thinking about setting both values to the max (20 seconds), but I'm afraid that might cause the C# long polling operation to time out.
c) What is the best practice? WaitTimeSeconds > Receive Message Wait Time?
a) As noted in pastk's answer, the WaitTimeSeconds on the message will override the Receive Message Wait Time configured in the queue. See the long polling documentation for details.
b) The AWS SDK for .NET uses System.Net.HttpWebRequest under the hood - its default timeout is 100 seconds. If you're using the defaults, setting the WaitTimeSeconds to 20 seconds will not cause the operation to time out.
c) There is no best practice prescribed by Amazon on this point. Do whatever you think is best for your scenario.
It's just a different way to set the wait time you need.
The request-level wait time always overrides the queue's value: "A value set between 1 to 20 for the WaitTimeSeconds parameter for ReceiveMessage has priority over any value set for the queue attribute ReceiveMessageWaitTimeSeconds." (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html)
If some of the queue's consumers need long polling and others don't, it makes sense to use the per-request wait time setting; otherwise it is simpler to use the queue's setting.
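The precedence rule from both answers fits in a couple of lines. This is a sketch with hypothetical helper names (`LongPolling`, `EffectiveWaitSeconds`, `FitsClientTimeout`); a `null` request value models a request that does not set WaitTimeSeconds at all. The quoted documentation only speaks of request values between 1 and 20, so treating an explicit 0 as an override as well is an assumption here.

```csharp
public static class LongPolling
{
    // Request-level value wins when it is set; otherwise the queue's
    // Receive Message Wait Time applies.
    public static int EffectiveWaitSeconds(int? requestWaitSeconds, int queueWaitSeconds)
    {
        return requestWaitSeconds ?? queueWaitSeconds;
    }

    // The long poll is safe as long as it ends before the HTTP client gives up.
    public static bool FitsClientTimeout(int effectiveWaitSeconds, int clientTimeoutSeconds)
    {
        return effectiveWaitSeconds < clientTimeoutSeconds;
    }
}
```

With the maximum wait of 20 seconds and the 100-second default client timeout mentioned above, `FitsClientTimeout(20, 100)` holds with a wide margin.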

Serial processing of a certain message type in Rebus

We have a Rebus message handler that talks to a third party webservice. Due to reasons beyond our immediate control, this WCF service frequently throws an exception because it encountered a database deadlock in its own database. Rebus will then try to process this message five times, which in most cases means that one of those five times will be lucky and not get a deadlock. But it frequently happens that a message does get deadlock after deadlock and ends up in our error queue.
Besides fixing the source of the deadlocks, which would be a longterm goal, I can think of two options:
1. Keep trying with only this particular message type until it succeeds. Preferably I would be able to set a timeout, so "if five deadlocks then try again in 5 minutes" rather than choke the process up even more by trying continuously. I already do a Thread.Sleep(random) to spread the messages somewhat, but it will still give up after five tries.
2. Send this particular message type to a different queue that has only one worker that processes the message, so that this happens serially rather than in parallel. Our current configuration uses 8 worker threads, but this just makes the deadlock situation worse as the webservice now gets called concurrently and the messages get in each other's way.
Option #2 has my preference, but I'm not sure if this is possible. Our configuration on the receiving side currently looks like this:
var adapter = new Rebus.Ninject.NinjectContainerAdapter(this.Kernel);
var bus = Rebus.Configuration.Configure.With(adapter)
    .Logging(x => x.Log4Net())
    .Transport(t => t.UseMsmqAndGetInputQueueNameFromAppConfig())
    .MessageOwnership(d => d.FromRebusConfigurationSection())
    .CreateBus().Start();
And the .config for the receiving side:
<rebus inputQueue="app.msg.input" errorQueue="app.msg.error" workers="8">
<endpoints>
</endpoints>
</rebus>
From what I can tell from the config, it's only possible to set one input queue to 'listen' to. I can't really find a way to do this via the fluent mapping API either. That seems to take only one input- and error queue as well:
.Transport(t =>t.UseMsmq("input", "error"))
Basically, what I'm looking for is something along the lines of:
<rebus workers="8">
<input name="app.msg.input" error="app.msg.error" />
<input name="another.input.queue" error="app.msg.error" />
</rebus>
Any tips on how to handle my requirements?
I suggest you make use of a saga and Rebus' timeout service to implement a retry strategy that fits your needs. This way, in your Rebus-enabled web service facade, you could do something like this:
public void Handle(TryMakeWebServiceCall message)
{
    try
    {
        var result = client.MakeWebServiceCall(whatever);
        bus.Reply(new ResponseWithTheResult{ ... });
    }
    catch(Exception e)
    {
        Data.FailedAttempts++;
        if (Data.FailedAttempts < 10)
        {
            bus.Defer(TimeSpan.FromSeconds(1), message);
            return;
        }
        // oh no! we failed 10 times... this is probably where we'd
        // go and do something like this:
        emailService.NotifyAdministrator("Something went wrong!");
    }
}
where Data is the saga data that is made magically available to you and persisted between calls.
For inspiration on how to create a saga, check out the wiki page on coordinating stuff that happens over time where you can see an example on how a service might have some state (i.e. number of failed attempts in your case) stored locally that is made available between handling messages.
When the time comes to make bus.Defer work, you have two options: 1) use an external timeout service (I usually have one installed on each server), or 2) just use "yourself" as a timeout service.
At configuration time, you go
Configure.With(...)
    .(...)
    .Timeouts(t => /* configure it here */)
where you can either StoreInMemory, StoreInSqlServer, StoreInMongoDb, StoreInRavenDb, or UseExternalTimeoutManager.
If you choose (1), you need to check out the Rebus code and build Rebus.Timeout yourself - it's basically just a configurable, Topshelf-enabled console application that has a Rebus endpoint inside.
Please let me know if you need more help making this work - bus.Defer is where your system becomes awesome, and will be capable of overcoming all of the little glitches that make all others' go down :)

Bloomberg API request timing out

Having set up a ReferenceDataRequest I send it along to an EventQueue
Service refdata = _session.GetService("//blp/refdata");
Request request = refdata.CreateRequest("ReferenceDataRequest");
// append the appropriate symbol and field data to the request
EventQueue eventQueue = new EventQueue();
Guid guid = Guid.NewGuid();
CorrelationID id = new CorrelationID(guid);
_session.SendRequest(request, eventQueue, id);
long _eventWaitTimeout = 60000;
myEvent = eventQueue.NextEvent(_eventWaitTimeout);
Normally I can grab the message from the queue, but I'm hitting the situation now that if I'm making a number of requests in the same run of the app (normally around the tenth), I see a TIMEOUT EventType
if (myEvent.Type == Event.EventType.TIMEOUT)
    throw new Exception("Timed Out - need to rethink this strategy");
else
    msg = myEvent.GetMessages().First();
These are being made on the same thread, but I'm assuming that there's something somewhere along the line that I'm consuming and not releasing.
Anyone have any clues or advice?
There aren't many references on SO to BLP's API, but hopefully we can start to rectify that situation.
I just wanted to share something, thanks to the code you included in your initial post.
If you make a request for historical intraday data for a long duration (which results in many events generated by Bloomberg API), do not use the pattern specified in the API documentation, as it may end up making your application very slow to retrieve all events.
Basically, do not call NextEvent() on a Session object! Use a dedicated EventQueue instead.
Instead of doing this:
var cID = new CorrelationID(1);
session.SendRequest(request, cID);
do {
Event eventObj = session.NextEvent();
...
}
Do this:
var cID = new CorrelationID(1);
var eventQueue = new EventQueue();
session.SendRequest(request, eventQueue, cID);
do {
Event eventObj = eventQueue.NextEvent();
...
}
This can result in some performance improvement, though the API is known to not be particularly deterministic...
I didn't really ever get around to solving this question, but we did find a workaround.
Based on a small, apparently throwaway, comment in the Server API documentation, we opted to create a second session. One session is responsible for static requests, the other for real-time. e.g.
_marketDataSession.OpenService("//blp/mktdata");
_staticSession.OpenService("//blp/refdata");
This means one session operates in subscription mode and the other more synchronously. I think it was this duality which was at the root of our problems.
Since making that change, we've not had any problems.
My reading of the docs agrees that you need separate sessions for the "//blp/mktdata" and "//blp/refdata" services.
A client appeared to have a similar problem. I solved it by making hundreds of sessions rather than passing in hundreds of requests in one session. Bloomberg may not be too happy with this BFI (brute force and ignorance) approach, as we are sending the field requests for each session, but it works.
Nice to see another person on stackoverflow enjoying the pain of bloomberg API :-)
I'm ashamed to say I use the following pattern (I suspect copied from the example code). It seems to work reasonably robustly, but probably ignores some important messages. But I don't get your time-out problem. It's Java, but all the languages work basically the same.
cid = session.sendRequest(request, null);
while (true) {
    Event event = session.nextEvent();
    MessageIterator msgIter = event.messageIterator();
    while (msgIter.hasNext()) {
        Message msg = msgIter.next();
        // compare by value rather than by reference
        if (msg.correlationID().equals(cid)) {
            processMessage(msg, fieldStrings, result);
        }
    }
    if (event.eventType() == Event.EventType.RESPONSE) {
        break;
    }
}
This may work because it consumes all messages off each event.
It sounds like you are making too many requests at once. BB will only process a certain number of requests per connection at any given time. Note that opening more and more connections will not help, because there are limits per subscription as well. If you make a large number of time-consuming requests simultaneously, some may time out. Also, you should either process each request completely (until you receive the RESPONSE message) or cancel it. A partial request that is outstanding wastes a slot. Since splitting into two sessions seems to have helped you, it sounds like you are also making a lot of subscription requests at the same time. Are you using subscriptions as a way to take snapshots, that is, subscribing to an instrument, getting the initial values, and unsubscribing? If so, you should try to find a different design; this is not how subscriptions are intended to be used. An outstanding subscription request also uses a request slot. That is why it is best to batch as many subscriptions as possible in a single subscription list instead of making many individual requests. Hope this helps with your use of the API.
By the way, I can't tell from your sample code, but while you are blocked on messages from your separate event queue, are you also reading from the main event queue? You must process all the messages out of the queue, especially if you have outstanding subscriptions. Responses can queue up really fast. If you are not processing messages, the session may hit some queue limits, which may be why you are getting timeouts. Also, if you don't read messages, you may be marked as a slow consumer and not receive more data until you start consuming the pending messages. The API is async. Event queues are just a way to block on specific requests without having to process all messages from the main queue, in a context where blocking is OK and it would otherwise be difficult to interrupt the logic flow to process parts asynchronously.
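The advice to process a request completely, until the RESPONSE arrives, can be illustrated with a small in-memory simulation. The types below (`SimEventType`, `SimEvent`, `ResponseDrain`) are hypothetical stand-ins and not the Bloomberg API classes; the point is only the shape of the loop, which keeps reading partial events instead of taking the first message and moving on.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the API's event types; not the Bloomberg classes.
public enum SimEventType { PartialResponse, Response, Timeout }

public sealed class SimEvent
{
    public SimEvent(SimEventType type, params string[] messages)
    {
        Type = type;
        Messages = messages;
    }

    public SimEventType Type { get; }
    public IReadOnlyList<string> Messages { get; }
}

public static class ResponseDrain
{
    // Reads events until the final Response arrives, collecting every message.
    // Returns false if a Timeout is seen, or the events run out, before that.
    public static bool TryDrain(Queue<SimEvent> events, List<string> collected)
    {
        while (events.Count > 0)
        {
            SimEvent ev = events.Dequeue();
            if (ev.Type == SimEventType.Timeout)
            {
                return false;
            }

            collected.AddRange(ev.Messages);
            if (ev.Type == SimEventType.Response)
            {
                return true; // request fully consumed, slot released
            }
        }

        return false;
    }
}
```

By contrast, the code in the question reads one event and takes only its first message; if that event is a partial response, the rest of the request stays outstanding, which matches the "wasted slot" described above.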
