I'm working with Azure ServiceBus, standard tier.
I'm trying to figure out what has been happening for the past couple of weeks (it seems to have started when bus traffic increased to maybe 10-15 messages per second).
I have automatic creation of subscriptions using
subscriptionOpts.AutoDeleteOnIdle = TimeSpan.FromHours(3);
Starting a few weeks ago (when we got the traffic increase), sometimes our subscription clients stop receiving messages, and after 3 hours they get deleted.
var messageOptions = new MessageHandlerOptions(args =>
{
    // Log exceptions raised by the message pump; the handler keeps running.
    Emaillog.Warn(args.Exception, $"Client ExceptionReceived: {args.Exception}");
    return Task.CompletedTask;
})
{ AutoComplete = true };

_subscriptionClient.RegisterMessageHandler(
    async (message, token) => await OnMessageReceived(message, $"{_subscriptionClient.SubscriptionName}", token),
    messageOptions);
Is it possible that a subscription client gets disconnected and never reconnects?
I have 4-5 client processes that connect to this topic, each one with its own subscription.
When I find one of these subscriptions deleted, sometimes all of them have been deleted, sometimes only some of them.
Is it a bug? The only method I call on the subscription client is RegisterMessageHandler; I don't manually manage anything else...
Thank you in advance
The AutoDeleteOnIdle property deletes the subscription when there is no message activity on the subscription for the specified time span.
As you mentioned that the message flow increased to 15 messages per second, the subscription should never be left without message flow, so there is no obvious reason for the subscriptions to be deleted. The idleness of a subscription is decided by both incoming and outgoing messages.
One possibility is that, due to the heavy message traffic, the downstream application processing the messages went offline, leaving messages unprocessed. Then, once the incoming flow dropped off, there was no receiver activity left on the subscription, so it sat idle for 3 hours and was deleted.
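As a defensive measure, you could check for and recreate the subscription before (re)connecting, so a client that comes back after an auto-delete does not silently listen on a missing entity. A minimal sketch, assuming the Microsoft.Azure.ServiceBus.Management API that matches the client code in the question:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

// Ensure the subscription still exists before building the SubscriptionClient,
// recreating it if AutoDeleteOnIdle has removed it in the meantime.
static async Task EnsureSubscriptionAsync(string connectionString, string topicPath, string subscriptionName)
{
    var mgmt = new ManagementClient(connectionString);
    if (!await mgmt.SubscriptionExistsAsync(topicPath, subscriptionName))
    {
        await mgmt.CreateSubscriptionAsync(new SubscriptionDescription(topicPath, subscriptionName)
        {
            AutoDeleteOnIdle = TimeSpan.FromHours(3) // same setting as in the question
        });
    }
    await mgmt.CloseAsync();
}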
Related
So we've been using Pub/Sub for receiving Google Cloud Build (GCB) events for a while.
We have 4 subscribers on our subscription, so they can split the workload.
The subscribers are identical and written using the official C# client.
The subscribers use the default settings; we configure only one pulling thread.
They are running as a HostedService in ASP.NET Core inside Kubernetes.
The subscriber application has only that one responsibility.
This application is deployed a couple of times every week, since it's bundled with a more heavily used API.
The issue we are facing is this:
When looking at our Kibana logs we sometimes see what appears to be a delay of the Pub/Sub message of 1 or more minutes (notice that QUEUED has a later timestamp than WORKING).
However, looking at the publishTime it is clear that the problem is not that the event is published later, but rather that it is handled by our code later.
Now if we look at the PubSub graphs we get:
This confirms that there indeed WAS an incident where messages were not acked.
This explains why we are seeing the delayed handling of the message :).
But it does not explain WHY we appear to exceed the deadline of 60 seconds.
There are no errors / exceptions anywhere to be found
We are using the C# client in a standard way (defaults)
Now here is where it gets interesting: I discovered that if I do a PURGE of messages using the Google UI, everything seems to run smoothly for a while (1-3 days). But then it happens again.
Now if we look at the metrics across all the instances when the issue occurs (this is from another incident), we are at no point over 200 ms of computation time:
Thoughts:
We are misunderstanding something basic about the Pub/Sub ack configuration.
Maybe the deploys we do somehow lead the subscription to think that there are still active subscribers, so it waits for them to fail before trying the next subscriber? The PURGE reaction points in this direction; however, I have no way of inspecting how many subscribers are currently registered with the subscription, and I can't see a bug in the code that could imply this.
Looking at the metrics, the problem is not with our code. However, there might be something with the official client's default config, or a bug.
I'm really puzzled, and I'm missing insight into what is going on inside the Pub/Sub clusters and the official client. Some tracing from the client would be nice, or query tools for Pub/Sub like the ones we have for our Kafka clusters.
The code:
public class GoogleCloudBuildHostedService : BackgroundService
{
    ...

    private async Task<SubscriberClient> BuildSubscriberClient()
    {
        var subscriptionToUse = $"{_subscriptionName}";
        var subscriptionName = new SubscriptionName(_projectId, subscriptionToUse);
        var settings = new SubscriberServiceApiSettings();
        var client = new SubscriberClient.ClientCreationSettings(1,
            credentials: GoogleCredentials.Get().UnderlyingCredential.ToChannelCredentials(),
            subscriberServiceApiSettings: settings);
        return await SubscriberClient.CreateAsync(subscriptionName, client);
    }

    protected override async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        await Task.Yield();
        cancellationToken.Register(() => _log.Info("Consumer thread stopping."));
        while (cancellationToken.IsCancellationRequested == false)
        {
            try
            {
                _log.Info($"Consumer starting...");
                var client = await BuildSubscriberClient();
                await client.StartAsync((msg, ct) =>
                {
                    using (eventTimer.NewTimer())
                    {
                        try
                        {
                            ...
                        }
                        catch (Exception e)
                        {
                            _log.Error(e);
                        }
                    }
                    return Task.FromResult(SubscriberClient.Reply.Ack);
                });
                await client.StopAsync(cancellationToken);
                await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
            }
            catch (Exception e)
            {
                _log.Info($"Consumer failed: {e.Message}");
            }
        }
        _log.Info($"Consumer stopping...");
    }
}
Hope someone out there in the great big void can enlighten me :).
Kind regards
Christian
UPDATE
So I looked into one of the cases again, and below we see:
the same instance of the application handling messages from the same topic and subscription.
there's only 1 client thread configured.
Notice that at 15:23:04 and 15:23:10 there are 2 messages handled at their time of publication; then, 2 minutes later, a message that was published at 15:23:07 is handled, and in the meantime 2 other messages are handled.
So why is a message published at 15:23:07 not handled until 15:25:25, when other messages arrive in the meantime?
This can happen for different reasons, and it is not a trivial task to find and troubleshoot the root of the issue.
Possible latency reasons
Regarding latency, it is normal for subscriptions to have backlogged messages if they are not consuming messages fast enough or have not finished working through the backlog.
I would start by reading the following documentation, which mentions some reasons why you might be exceeding the deadline in some cases.
Another reason for message latency might be an increase in message or payload size. Check whether all of your messages are more or less the same size, or whether the ones handled with delay are bigger.
Handling message failures
I would also like to suggest taking a read here, where it talks about how to handle message failures by setting a subscription retry policy or forwarding undelivered messages to a dead-letter topic (also known as a dead-letter queue).
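For reference, a subscription can be created with a dead-letter policy attached. This is only a sketch under the assumption of the Google.Cloud.PubSub.V1 admin API; the topic and subscription names are placeholders:

var subscriberService = await SubscriberServiceApiClient.CreateAsync();
var subscription = new Subscription
{
    SubscriptionName = new SubscriptionName(projectId, "my-sub"),
    TopicAsTopicName = new TopicName(projectId, "my-topic"),
    AckDeadlineSeconds = 60,
    DeadLetterPolicy = new DeadLetterPolicy
    {
        // After 5 failed deliveries the message is forwarded to the dead-letter topic.
        DeadLetterTopic = new TopicName(projectId, "my-dead-letter-topic").ToString(),
        MaxDeliveryAttempts = 5
    }
};
await subscriberService.CreateSubscriptionAsync(subscription);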
Good practices
This article contains some good tips and tricks for understanding how latency and message stuckness can happen and some suggestions that can help to improve that.
xBurnsed offered some good advice and links, so let me just supplement them with some other things:
This looks like a classic case of a subscriber holding on to a message and then, upon its lease expiring, the message being delivered to another subscriber. The original subscriber is perhaps one that went down after it received the message, but before it could process and ack it. You could check whether the backlog correlates with restarts of instances of your subscriber; if so, this is the likely culprit. Also check that your subscribers shut down cleanly: the subscriber's StopAsync call should exit cleanly, and all messages received by your callback should be acknowledged before the subscriber application actually stops.
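A minimal sketch of such a clean shutdown inside the poster's BackgroundService, assuming the Google.Cloud.PubSub.V1 client (_client, holding the running SubscriberClient, is a hypothetical field):

// StopAsync tells the client to stop pulling; awaiting it waits for in-flight
// handler callbacks to finish (and their acks to be sent) before returning.
public override async Task StopAsync(CancellationToken cancellationToken)
{
    if (_client != null)
    {
        await _client.StopAsync(cancellationToken); // the token forces a hard stop if it fires
    }
    await base.StopAsync(cancellationToken);
}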
Make sure you are using the latest version of the client. Earlier versions were subject to issues with large backlogs of small messages, though in reality the issues were not limited to those cases. This is mostly relevant if your subscribers are running up against their flow control limits.
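If flow control is the bottleneck, the limits can also be set explicitly rather than relying on the defaults. A sketch, assuming the Google.Cloud.PubSub.V1 Settings type (the numbers are illustrative, and clientCreationSettings is the object built in the question's code):

var settings = new SubscriberClient.Settings
{
    FlowControlSettings = new Google.Api.Gax.FlowControlSettings(
        maxOutstandingElementCount: 100,            // at most 100 unacked messages held client-side
        maxOutstandingByteCount: 10 * 1024 * 1024)  // at most 10 MB of unacked payload
};
var client = await SubscriberClient.CreateAsync(subscriptionName, clientCreationSettings, settings);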
Do the machines on which the subscriber is running have any other tasks running on them that could be using up CPU or RAM? If so, it's possible that one of the subscriber applications is starved for resources and can't process messages quickly enough.
If you are still having issues, the next best step to take is to put in a request with Cloud Support, providing the name of your project, name of your subscription, and the message ID of a message that was delayed. Support will be able to track the lifetime of a message and determine if the delivery was delayed on the server or if it was delivered multiple times and not acked.
I am experiencing a race condition issue with my RabbitMQ client. My service has multiple instances listening on a single queue, storing received messages in a DB.
When they all get restarted at once, I sometimes see messages being redelivered and stored in the DB twice. This is normally handled on the client side by checking whether the correlation ID has already been stored in the DB. That works 99.9% of the time (I am processing 5 million messages a day; it happens once or twice a day).
So, as I said, I suspect a race condition is responsible for this. I think I receive the message again while the first copy is still being processed, so when I check I don't see it stored in the DB, and in the end I store it twice.
I should note that this is mostly a non-issue, but it has been bothering me because I can't really explain what happens.
I suspect that it happens when I restart the services. I think I disconnect from the queue while I am still processing a message, triggering RabbitMQ to redeliver it to another instance that has not shut down yet.
What I want to do when I am stopping the service is to:
tell RabbitMQ that I don't want to receive further messages
wait for all currently processing messages to finish
send acks/nacks
shut down
Right now I first deregister the Received event:
_consumerServer.Received -= MessageReceived;
then I dispose the channel and the connection:
if (_channel != null)
{
    _channel.Close();
    _channel.Dispose();
}

if (_connectionServer != null)
{
    _connectionServer.Close();
    _connectionServer.Dispose();
}
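The sequence described above might look like the following. This is only a sketch, assuming RabbitMQ.Client's IModel API; _consumerTag (the tag returned by BasicConsume) and _inFlight (a counter the message handler maintains) are hypothetical fields:

public void Shutdown()
{
    // 1. Tell RabbitMQ to stop delivering messages to this consumer.
    _channel.BasicCancel(_consumerTag);

    // 2. Wait for messages that are currently being processed to finish.
    while (Interlocked.Read(ref _inFlight) > 0)
        Thread.Sleep(50);

    // 3. Acks/nacks were sent by the handlers as they finished; close cleanly.
    _channel.Close();
    _connectionServer.Close();
}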
The RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on Stack Overflow.
Rather than trying to shut down a consumer so that messages won't be redelivered, you should handle redelivery correctly. Check for the case where the redelivered flag is set on a message, and act appropriately. You should also try to store your messages in such a way that the store operation is idempotent, i.e. it can happen multiple times and you will only have one record in your database.
Please see the guidelines that the team have provided here:
https://www.rabbitmq.com/reliability.html#consumer
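A sketch of both suggestions, assuming RabbitMQ.Client's EventingBasicConsumer; StoreMessage and DuplicateKeyException are hypothetical stand-ins for a DB helper backed by a unique constraint on the correlation ID:

var consumer = new EventingBasicConsumer(_channel);
consumer.Received += (sender, ea) =>
{
    // The broker sets ea.Redelivered when this delivery is a retry.
    if (ea.Redelivered)
        _log.Info($"Redelivered message {ea.BasicProperties.CorrelationId}");

    try
    {
        // Idempotent store: a unique index on the correlation ID makes a
        // duplicate insert fail instead of creating a second row.
        StoreMessage(ea.BasicProperties.CorrelationId, ea.Body);
    }
    catch (DuplicateKeyException)
    {
        // Already stored by another instance; safe to ack and move on.
    }

    _channel.BasicAck(ea.DeliveryTag, multiple: false);
};
_channel.BasicConsume(queue: "my-queue", autoAck: false, consumer: consumer);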
Do messages in dead letter queues in Azure Service Bus expire?
Some explanation
I have these queue settings:
var queueDescription = new QueueDescription("MyTestQueue")
{
    RequiresSession = false,
    DefaultMessageTimeToLive = TimeSpan.FromMinutes(1),
    EnableDeadLetteringOnMessageExpiration = true,
    MaxDeliveryCount = 10
};

namespaceManager.CreateQueue(queueDescription);
When I place some messages in an Azure Service Bus message queue (not a queue from Azure Storage) and never consume them, they'll be moved to the dead letter queue automatically.
However, if I have no consumer for the dead letter queue either, will the messages ever be deleted from the dead letter queue or will they stay there forever? (Is there some official documentation stating how this is supposed to work?)
My Trials
In my trials, I placed 3 messages in the queue. They were dead lettered after 2 minutes or so. They remained in the dead letter queue for at least a day and weren't removed.
Although calling NamespaceManager.GetQueueAsync() gave me the values above (notice how MessageCount is still 3 but DeadLetterMessageCount is strangely 0), I could still receive the messages from the dead letter queue. (So they weren't removed from the queue.)
Sebastian, your observation is correct: messages placed in the DeadLetter sub-queue never expire. They will be available there forever until explicitly removed from the DeadLetter sub-queue. As for the odd counts you saw in the tooling/API, that could be a refresh issue: the call to GetQueueAsync() needs to be made after the messages have been dead-lettered, which happens at a non-deterministic time. For example, if you had a queue with a thousand expired messages but the queue was not being used (no send/receive operations), the count might still be reported as Active until some operations are performed.
After doing some research I stumbled over a fact I missed completely:
Messages can expire even when dead lettering is disabled.
When messages expire while dead lettering is disabled (which is the default), they'll just get deleted.
So, Microsoft's reasoning for not auto-deleting messages from the dead letter queue is probably:
If you're enabling dead lettering, you explicitly want expired messages not to be thrown away but stored somewhere else (the dead letter queue) so that you can review them.
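Cleaning up the dead letter queue is therefore left to you. A minimal sketch of draining it, assuming the Microsoft.ServiceBus.Messaging API used in the question (connectionString is assumed):

// Receiving from the dead-letter sub-queue is the only way to remove its messages.
var dlqPath = QueueClient.FormatDeadLetterPath("MyTestQueue");
var dlqClient = QueueClient.CreateFromConnectionString(connectionString, dlqPath);

BrokeredMessage message;
while ((message = dlqClient.Receive(TimeSpan.FromSeconds(5))) != null)
{
    // Dead-lettered messages carry the reason they were moved here.
    Console.WriteLine(message.Properties["DeadLetterReason"]);
    message.Complete(); // removes it from the dead-letter queue
}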
We have a pub/sub application that involves an external client subscribing to a Web Role publisher via an Azure Service Bus Topic. Our current billing cycle indicates we've sent/received >25K messages, while our dashboard indicates we've sent <100. We're investigating our implementation and checking our assumptions in order to understand the disparity.
As part of our investigation we've gathered wireshark captures of client<=>service bus traffic on the client machine. We've noticed a regular pattern of communication that we haven't seen documented and would like to better understand. The following exchange occurs once every 50s when there is otherwise no activity on the bus:
The client pushes ~200B to the service bus.
10s later, the service bus pushes ~800B to the client. The client registers the receipt of an empty message (determined via a breakpoint).
The client immediately responds by pushing ~1000B to the service bus.
Some relevant information:
This occurs when our web role is not actively pushing data to the service bus.
Upon receiving a legit message from the Web Role, the pattern described above will not occur again until a full 50s has passed.
Both client and server connect to sb://namespace.servicebus.windows.net via TCP.
Our application messages are <64 KB
Questions
What is responsible for the regular, 3-packet message exchange we're seeing? Is it some sort of keep-alive?
Do each of the 3 packets count as a separately billable message?
Is this behavior configurable or otherwise documented?
EDIT:
This is the code the receives the messages:
private void Listen()
{
    _subscriptionClient.ReceiveAsync().ContinueWith(MessageReceived);
}

private void MessageReceived(Task<BrokeredMessage> task)
{
    if (task.Status != TaskStatus.Faulted && task.Result != null)
    {
        task.Result.CompleteAsync();
        // Do some things...
    }
    Listen();
}
I think what you are seeing is the Receive call in the background. Behind the scenes, the Receive calls all use long polling. That means they call out to the Service Bus endpoint and ask for a message. The Service Bus service gets that request and, if it has a message, returns it immediately. If it doesn't have a message, it holds the connection open for a time period in case a message arrives. If a message arrives within that time frame, it is returned to the client; if not, a response is sent to the client indicating that no message was there (aka your null BrokeredMessage). If you call Receive with no overloads (like you've done here), it immediately makes another request. This loop continues to happen until a message is received.
Thus, what you are seeing is the number of times the client requests a message but there isn't one there. The long polling makes this nicer than Windows Azure Storage Queues, which just immediately return a null result when there is no message. For both technologies it is common to implement an exponential back-off for requests; there are lots of examples out there of how to do this. This cuts back on how often you need to check the queue and can reduce your transaction count.
To answer your questions:
Yes, this is normal expected behaviour.
No, this is only one transaction. For Service Bus you get charged a transaction each time you put a message on a queue and each time a message is requested (which can be a little opaque given that Receive makes multiple calls in the background). Note that the docs point out that you get charged for each idle transaction (meaning a null result from a Receive call).
Again, you can implement a back-off methodology so that you aren't hitting the queue so often. Another suggestion I've recently heard: if you have a queue that isn't seeing a lot of traffic, you could check the queue depth to see if it is > 0 before entering the processing loop, and if you get no messages back from a Receive call you could go back to watching the queue depth. I've not tried that, and I'd think you could get throttled if you did the queue depth check too often.
If these are your production numbers, then your subscription isn't really processing a lot of messages. It would likely be a good idea to have a back-off policy up to a time that is acceptable to wait before a message is processed. For example, if it is okay for a message to sit for more than 10 minutes, create a back-off approach that eventually checks for a message only every 10 minutes; then, when it gets one, process it and immediately check again.
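A minimal sketch of such an exponential back-off loop, assuming the same Microsoft.ServiceBus.Messaging client as the question (the bounds are illustrative):

// Double the wait after each empty receive, cap it at 10 minutes, and reset
// to eager polling as soon as a message arrives.
var delay = TimeSpan.FromSeconds(1);
var maxDelay = TimeSpan.FromMinutes(10);
while (true)
{
    BrokeredMessage message = _subscriptionClient.Receive(TimeSpan.FromSeconds(5));
    if (message != null)
    {
        message.Complete();
        // Do some things...
        delay = TimeSpan.FromSeconds(1);
    }
    else
    {
        Thread.Sleep(delay);
        delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
    }
}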
Oh, and there is a Receive overload that takes a timeout, but I'm not 100% sure whether that is a server-side or a local timeout. If it is local, it could still be making calls to the service every X seconds. I think this is based on the OperationTimeout value set in the MessagingFactory settings when creating the SubscriptionClient. You'd have to test that.
I'm working with Azure Service Bus Queues in a request/response pattern using two queues and in general it is working well. I'm using pretty simple code from some good examples I've found. My queues are between web and worker roles, using MVC4, Visual Studio 2012 and .NET 4.5.
During some stress testing, I end up overloading my system and some responses are not delivered before the client gives up (which I will fix, not the point of this question).
When this happens, I end up with many messages left in my response queue, all well beyond their ExpiresAtUtc time. My message TimeToLive is set for 5 minutes.
When I look at the properties for a message still in the queue, it is clearly set to expire in the past, with a TimeToLive of 5 minutes.
I create the queues if they don't exist with the following code:
namespaceManager.CreateQueue(
    new QueueDescription(RequestQueueName)
    {
        RequiresSession = true,
        DefaultMessageTimeToLive = TimeSpan.FromMinutes(5) // messages expire if not handled within 5 minutes
    });
What would cause a message to remain in a queue long after it is set to expire?
As I understand it, there is no background process cleaning these up. Only the act of moving the queue cursor forward, via a call to Receive, causes the server to skip past and dispose of expired messages, returning the first message that is not expired, or none if all are expired.
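If that understanding is right, a short receive loop is enough to let the broker discard the expired backlog. A sketch, assuming the Microsoft.ServiceBus.Messaging client; note that the session-enabled queue from the question would need AcceptMessageSession first, with Receive called on the session instead:

// Each Receive moves the cursor forward; the broker disposes of the expired
// messages it skips and returns null once nothing unexpired remains.
BrokeredMessage message;
while ((message = queueClient.Receive(TimeSpan.FromSeconds(1))) != null)
{
    message.Complete(); // a live (unexpired) message: handle it normally
}
// A null here means the queue held only expired messages, which are now gone.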