Messages in Azure Service Bus Queues remain in queue after expire time - c#

I'm working with Azure Service Bus Queues in a request/response pattern using two queues and in general it is working well. I'm using pretty simple code from some good examples I've found. My queues are between web and worker roles, using MVC4, Visual Studio 2012 and .NET 4.5.
During some stress testing, I end up overloading my system and some responses are not delivered before the client gives up (which I will fix, not the point of this question).
When this happens, I end up with many messages left in my response queue, all well beyond their ExpiresAtUtc time. My message TimeToLive is set for 5 minutes.
When I look at the properties for a message still in the queue, it is clearly set to expire in the past, with a TimeToLive of 5 minutes.
I create the queues if they don't exist with the following code:
namespaceManager.CreateQueue(
    new QueueDescription( RequestQueueName )
    {
        RequiresSession = true,
        DefaultMessageTimeToLive = TimeSpan.FromMinutes( 5 ) // messages expire if not handled within 5 minutes
    } );
What would cause a message to remain in a queue long after it is set to expire?

As I understand it, there is no background process cleaning these up; only the act of moving the queue cursor forward with a call to Receive causes the server to skip past and dispose of expired messages, and then return the first message that is not expired (or none if all are expired).
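For illustration, here is a minimal sketch (not from the original post) of the kind of receive pass that moves the cursor forward and lets the broker discard expired messages as it goes. It assumes the response queue is sessionful like the request queue above, and the connection string and queue name are illustrative:

QueueClient client = QueueClient.CreateFromConnectionString(connectionString, ResponseQueueName);
try
{
    // Accepting a session and receiving advances the cursor past expired messages.
    MessageSession session = client.AcceptMessageSession(TimeSpan.FromSeconds(5));
    BrokeredMessage message = session.Receive(TimeSpan.FromSeconds(5)); // null once only expired messages remain
    if (message != null)
    {
        message.Complete();
    }
    session.Close();
}
catch (TimeoutException)
{
    // no session with live messages was available within the wait
}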

Related

Azure Service Bus Topics, Subscription Lost

I'm working with Azure ServiceBus, standard tier.
I'm trying to figure out what has been happening over the last couple of weeks (it seems to have started when bus traffic increased to maybe 10-15 messages per second).
I have automatic creation of subscription using
subscriptionOpts.AutoDeleteOnIdle = TimeSpan.FromHours(3);
Starting in the last few weeks (when we got the traffic increase), sometimes our subscription clients stop receiving messages, and after 3 hours their subscriptions get deleted.
var messageOptions = new MessageHandlerOptions(args =>
{
    Emaillog.Warn(args.Exception, $"Client ExceptionReceived: {args.Exception}");
    return Task.CompletedTask;
}) { AutoComplete = true };

_subscriptionClient.RegisterMessageHandler(
    async (message, token) => await OnMessageReceived(message, $"{_subscriptionClient.SubscriptionName}", token),
    messageOptions);
Is it possible that a subscription client gets disconnected and never reconnects?
I have 4-5 client processes that connect to this topic, each with its own subscription.
When I find one of these subscriptions deleted, sometimes all of them have been deleted, sometimes only some of them.
Is it a bug? The only method I call on the subscription client is RegisterMessageHandler; I don't manage anything else manually...
Thank you in advance
The AutoDeleteOnIdle property deletes the Subscription when there is no message activity within the Subscription for the specified time span.
As you mentioned that the message flow increased to 15 messages per second, there is no chance the Subscription is left without message flow, so there is no reason for the Subscriptions to be deleted; the idleness of a Subscription is decided by both incoming and outgoing messages.
One possibility is that, due to heavy message traffic, the downstream application processing the messages went offline, leaving the messages unprocessed. When the message flow later dropped, there was no receiver processing the Subscription, leaving it idle for 3 hours and causing it to be deleted.
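If losing the subscription is the immediate pain, one defensive pattern (my suggestion, not something the answer above prescribes) is to lengthen AutoDeleteOnIdle and recreate the subscription at client startup if it has gone missing. The topic/subscription names and connection string below are assumptions:

using Microsoft.Azure.ServiceBus.Management;

// Sketch only: recreate the subscription if it was auto-deleted, with a much longer idle window.
static async Task EnsureSubscriptionAsync(string connectionString)
{
    var management = new ManagementClient(connectionString);
    if (!await management.SubscriptionExistsAsync("my-topic", "my-subscription"))
    {
        await management.CreateSubscriptionAsync(
            new SubscriptionDescription("my-topic", "my-subscription")
            {
                AutoDeleteOnIdle = TimeSpan.FromDays(7) // or TimeSpan.MaxValue to never auto-delete
            });
    }
}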

Azure Service Bus Queues AutoRenewTimeout produces a "SessionLockLostException" on '.Complete()'

I have an Azure Service Bus Queue.
It's configured with:
Requires Duplicate Detection: true
Requires Session: true
Enable Partitions: false
Max Delivery Count: 10
Lock Duration: 1 minute
Batch Operations Enabled: true
Deadletter on Expiration Enabled: false
Enforce message ordering: true
When retrieving a message from the queue I use the following OnMessageOptions:
AutoComplete: false
AutoRenewTimeout: 12 minutes
Each message takes on average 2 minutes to complete.
Some of them succeed; others throw a "SessionLockLostException".
Why does the lock "AutoRenew" not keep the message lock renewed? It's supposed to keep doing its job for 12 minutes, yet we get that exception after 2.
How do you debug the cause of the exception? The exception tells me roughly what happened, but not why. I can't find any information about logging within the Service Bus Queue client.
Where is the documentation? The MSDN documentation in this instance is awful! It lacks even basic information about how these classes are supposed to work.
EDIT: As MaDeRkAn helpfully mentioned in a comment, the documentation for "SessionLockLostException" does mention that Azure can move messages between partitions.
When I originally created a test application to see if this approach worked, I had the queue configured to use partitions, and I read about that exception while figuring out the code needed to handle the various exceptions that occur in different situations.
I have discounted this as the problem for two reasons:
I've (literally) triple-checked that Partitions are disabled. I also checked that the Queue we're using is the same Queue I'm looking at for the properties.
If Azure were causing failures this often (every 2-5 messages), the service would be pretty much unusable, and while Azure has issues at times, it's not normally that badly broken.
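For context, here is a minimal sketch (my own illustration, using the older Microsoft.ServiceBus.Messaging SDK, not the poster's code) of receiving from a session-required queue and renewing the session lock by hand. Because the queue requires sessions, it is the session lock, not just the per-message lock, that has to stay alive while the 2-minute processing runs, which is one thing worth ruling in or out when debugging a SessionLockLostException. The connection string, queue name, and ProcessSlowly call are assumptions:

// Sketch only: manual session receive with explicit session-lock renewal.
QueueClient client = QueueClient.CreateFromConnectionString(connectionString, "my-session-queue");
MessageSession session = client.AcceptMessageSession(TimeSpan.FromSeconds(30));
BrokeredMessage message = session.Receive(TimeSpan.FromSeconds(30));
if (message != null)
{
    session.RenewLock();      // the session lock is separate from the message lock
    ProcessSlowly(message);   // assumed long-running work (~2 minutes); renew again mid-way if needed
    message.Complete();       // throws SessionLockLostException if the session lock has expired
}
session.Close();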

MongoDB connection problems on Azure

We have an ASP.NET MVC application deployed to an Azure Website that connects to MongoDB and does both read and write operations. The application does this iteratively. A few thousand times per minute.
We initialize the C# driver using Autofac and we set the MaxConnectionIdleTime to 45 seconds as suggested in https://groups.google.com/forum/#!topic/mongodb-user/_Z8YepNHnbI and a few other places.
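For reference, a minimal sketch of the MaxConnectionIdleTime setting described above, using MongoClientSettings from the C# driver; the connection string is an assumption:

using MongoDB.Driver;

// Sketch only: keep idle connections shorter than Azure's idle-connection cutoff.
var settings = MongoClientSettings.FromUrl(new MongoUrl("mongodb://myhost:27017"));
settings.MaxConnectionIdleTime = TimeSpan.FromSeconds(45);
var client = new MongoClient(settings);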
We are still getting a large number of the below error:
"Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." (surfaced as a System.IO.IOException)
We get this error while connecting to both a MongoDB instance deployed on a VM in the same datacenter/region on Azure and also while connecting to an external PaaS MongoDB provider.
I run the same code on my local computer and connect to the same DB, and I don't receive these errors; it only happens when I deploy the code to an Azure Website.
Any suggestions?
A few thousand requests per minute is a big load, and the only way to do it right is by controlling and limiting the maximum number of threads running at any one time.
As there's not much information posted as to how you've implemented this, I'm going to cover a few possible circumstances.
Time to experiment...
The constants:
Items to process:
50 per second, or in other words...
3,000 per minute, and one more way to look at it...
180,000 per hour
The variables:
Data transfer rates:
How much data you can transfer per second is going to play a role no matter what we do, and this will vary throughout the day depending on the time of day.
The only thing we can do is fire off more requests from different CPUs to distribute the weight of the traffic we're sending back and forth.
Processing power:
I'm assuming you have this in a WebJob as opposed to having it coded inside the MVC site itself; the latter is highly inefficient and not fit for the purpose you're trying to achieve. By using a WebJob we can queue work items to be processed by other WebJobs. The queue in question is Azure Queue Storage.
Azure Queue storage is a service for storing large numbers of messages that can be accessed from anywhere in the world via authenticated calls using HTTP or HTTPS. A single queue message can be up to 64 KB in size, and a queue can contain millions of messages, up to the total capacity limit of a storage account. A storage account can contain up to 200 TB of blob, queue, and table data. See Azure Storage Scalability and Performance Targets for details about storage account capacity.
Common uses of Queue storage include:
Creating a backlog of work to process asynchronously
Passing messages from an Azure Web role to an Azure Worker role
The issues:
We're attempting to complete 50 transactions per second, so each transaction should be done in under 1 second if we were utilising 50 threads. Our 45-second timeout serves no purpose at this point.
We're expecting 50 threads to run concurrently, all completing in under a second, every second, on a single CPU. (I'm exaggerating to make a point, but imagine downloading 50 text files every single second, processing each one, then trying to shoot it back over to a colleague in the hope they'll even be ready to catch it.)
We need retry logic in place: if after 3 attempts the item isn't processed, it needs to be placed back into the queue. Ideally we should give the server more time to respond than just one second with each failure; say we gave it a 2-second break on the first failure, then 4 seconds, then 10. This will greatly increase the odds of persisting / retrieving the data we needed (see the sketch after this list).
We're assuming that our MongoDB can handle this number of requests per second. If you haven't already, start looking at ways to scale it out. The issue isn't that it's MongoDB, the data layer could have been anything; it's the fact that we're making this number of requests from a single source that is the most likely cause of your issues.
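To make the back-off in the list above concrete, here is a minimal sketch. TryProcessAsync, ReEnqueueAsync, and WorkItem are hypothetical stand-ins for your MongoDB write, the WorkItemQueue, and your work-item type:

// Sketch only: 3 attempts with growing pauses (2s, 4s, 10s), then re-enqueue.
static async Task ProcessWithBackOffAsync(WorkItem workItem)
{
    TimeSpan[] delays = { TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(4), TimeSpan.FromSeconds(10) };

    for (int attempt = 0; attempt < delays.Length; attempt++)
    {
        if (await TryProcessAsync(workItem)) return;  // hypothetical call wrapping the MongoDB write
        await Task.Delay(delays[attempt]);            // give the data store room to recover
    }

    await ReEnqueueAsync(workItem);                   // three failures: back onto the WorkItemQueue
}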
The solution:
Set up a WebJob and name it EnqueueJob. This WebJob will have one sole purpose, to queue items of work to be process in the Queue Storage.
Create a Queue Storage Container named WorkItemQueue, this queue will act as a trigger to the next step and kick off our scaling out operations.
Create another WebJob named DequeueJob. This WebJob will also have one sole purpose, to dequeue the work items from the WorkItemQueue and fire out the requests to your data store.
Configure the DequeueJob to spin up once an item has been placed inside the WorkItemQueue, start 5 separate threads, and, while the queue is not empty, dequeue a work item for each thread and attempt to execute the dequeued job.
Attempt 1, if fail, wait & retry.
Attempt 2, if fail, wait & retry.
Attempt 3, if fail, enqueue item back to WorkItemQueue
Configure your website to autoscale out to x number of CPUs (note that your website and WebJobs share the same resources).
Here's a short 10 minute video that gives an overview on how to utilise queue storages and web jobs.
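As a rough sketch of the DequeueJob side, the WebJobs SDK can trigger a function whenever an item lands in the storage queue (Azure Storage queue names must be lowercase). The queue name and function body are assumptions:

using System.IO;
using Microsoft.Azure.WebJobs;

public class Functions
{
    // Sketch only: runs whenever a message appears in the "workitemqueue" storage queue
    // (the hosting WebJob runs these functions via a JobHost).
    public static void ProcessWorkItem([QueueTrigger("workitemqueue")] string workItem, TextWriter log)
    {
        log.WriteLine("Processing: " + workItem);
        // attempt the MongoDB write here, with the retry/back-off described earlier
    }
}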
Edit:
You may also be getting those errors because of two other factors, again caused by the code being in an MVC app...
If you're compiling the application with the DEBUG attribute applied but pushing the RELEASE version instead, you can run into issues due to the settings in your web.config: without the DEBUG attribute, an ASP.NET web application will only run a request for a limited time (the default executionTimeout is 110 seconds), after which it will dispose of the request.
To increase the timeout beyond the default you will need to change the httpRuntime property in your web.config...
<!-- Increase timeout to five minutes -->
<httpRuntime executionTimeout="300" />
The other thing you need to be aware of is the request timeout between your browser and the web app. If you insist on keeping the code in MVC as opposed to extracting it and putting it into a WebJob, you can use the following code to fire a request off to your web app with a longer request timeout.
string html = string.Empty;
string uri = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Timeout = (int)TimeSpan.FromMinutes(5).TotalMilliseconds; // Timeout is expressed in milliseconds

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    html = reader.ReadToEnd();
}
Are you using MongoDB in a VM? It seems to be a network problem. These kinds of transient faults are expected, so the best you can do is implement a retry pattern or use a library such as Polly to do it for you:
Policy
    .Handle<IOException>()
    .Retry(3, (exception, retryCount) =>
    {
        // do something
    });
https://github.com/michael-wolfenden/Polly
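If you also want a pause between attempts (in line with the back-off approach discussed in the earlier answer), Polly's WaitAndRetry overload handles that. The delays and the RunMongoOperation call below are illustrative assumptions:

// Sketch only: retry the transient IOException with growing delays.
Policy
    .Handle<IOException>()
    .WaitAndRetry(new[]
    {
        TimeSpan.FromSeconds(2),
        TimeSpan.FromSeconds(4),
        TimeSpan.FromSeconds(10)
    })
    .Execute(() => RunMongoOperation()); // hypothetical stand-in for the failing call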

MSMQ messages always arrive delayed by exactly 3 minutes on the same machine

I'm facing an extremely puzzling problem. I have a Windows service that monitors two MSMQ queues for input and sends messages to another MSMQ queue. Although the send operation seems instant from the service's perspective it actually takes the message exactly three (3) minutes to arrive (as shown in the properties window in the MSMQ MMC). I've been testing this problem with nothing else listening on the other side so that I can see the messages piling up. This is how the service sends messages:
var proxyFactory = new ChannelFactory<IOtherServerInterface>(new NetMsmqBinding(NetMsmqSecurityMode.None)
{
    Durable = true,
    TimeToLive = new TimeSpan(1, 0, 0),
    ReceiveTimeout = TimeSpan.MaxValue
});

IOtherServerInterface server = proxyFactory.CreateChannel(new EndpointAddress("net.msmq://localhost/private/myqueue"));

var task = new MyTask() { ... };

using (TransactionScope scope = new TransactionScope(TransactionScopeOption.Required))
{
    server.QueueFile(task);
    scope.Complete();
}
The service is running on Windows Server 2008 R2. I also tested it on R1 and noticed the same behavior. Again, everything happens on the same machine. All components are deployed there so I don't think it could be a network issue.
EDIT #1:
I turned on the WCF diagnostics and what I noticed is very strange. The MSMQ datagram does get written normally. However, after the "a message was closed" trace message there is nothing going on. It is as if the service is waiting for something to happen. Exactly 3 minutes later and exactly when the MSMQ message arrives (according to the MSMQ MMC), I see another trace message about a previous activity. I suspect there is some kind of interference.
Let me give you more details about how the services work. There is an IIS app which receives tasks from clients and drops them in an MSMQ queue. From there, the troublesome service (MainService) picks them up and starts processing them. In some cases, another service (AuxService) is required to complete the task so MainService sends a message (that always gets delayed) to AuxService. AuxService has its own inbox queue where it receives MSMQ messages and when it's done, it sends an MSMQ message to MainService. In the meanwhile, the thread that sent the message to AuxService waits until it gets a signal or until it times out. There is a special queue where MainService looks for messages from AuxServices. When a message is received the abovementioned thread is woken up and resumes its activity.
Here's a representation of the whole architecture:
IIS app -> Q1 -> MainService
MainService -> Q2 -> AuxService
AuxService -> Q3 -> MainService
Although all operations are marked OneWay, I'm wondering whether starting an MSMQ operation from within another MSMQ operation is somehow illegal. It seems to be the case given the empirical evidence. If so, is there a way to change this behavior?
EDIT #2:
Alright, after some more digging it seems WCF is the culprit. I switched both the client code in MainService and the server code in AuxService to use MSMQ SDK directly and it works as expected. The 3 minute timeout I was experiencing was actually the time after which MainService gave up and considered that AuxService failed. Therefore, it seems that for some reason WCF refuses to perform the send until the current WCF activity exits.
Is this by design or is it a bug? Can this behavior be controlled?
You have transactions set up in the queue code; do you have the MSMQ object set up for transactions? Three minutes sounds like the timeout period for a Distributed Transaction Coordinator enlistment.
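For reference, a minimal sketch of the "use the MSMQ SDK directly" approach mentioned in EDIT #2, sending through System.Messaging with MSMQ's internal (non-DTC) transaction. The queue path is an assumption, and task stands for the work item being queued:

using System.Messaging;

// Sketch only: transactional send via System.Messaging instead of the WCF channel.
// The queue path is illustrative; "task" is the work item (must be serializable).
using (var queue = new MessageQueue(@".\private$\myqueue"))
using (var tx = new MessageQueueTransaction())
{
    tx.Begin();
    queue.Send(task, tx);
    tx.Commit();
}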

Azure Service Bus Subscriber regularly phoning home?

We have pub/sub application that involves an external client subscribing to a Web Role publisher via an Azure Service Bus Topic. Our current billing cycle indicates we've sent/received >25K messages, while our dashboard indicates we've sent <100. We're investigating our implementation and checking our assumptions in order to understand the disparity.
As part of our investigation we've gathered wireshark captures of client<=>service bus traffic on the client machine. We've noticed a regular pattern of communication that we haven't seen documented and would like to better understand. The following exchange occurs once every 50s when there is otherwise no activity on the bus:
The client pushes ~200B to the service bus.
10s later, the service bus pushes ~800B to the client. The client registers the receipt of an empty message (determined via breakpoint.)
The client immediately responds by pushing ~1000B to the service bus.
Some relevant information:
This occurs when our web role is not actively pushing data to the service bus.
After a legit message is received from the Web Role, the pattern described above will not occur again until a full 50s has passed.
Both client and server connect to sb://namespace.servicebus.windows.net via TCP.
Our application messages are <64 KB
Questions
What is responsible for the regular, 3-packet message exchange we're seeing? Is it some sort of keep-alive?
Do each of the 3 packets count as a separately billable message?
Is this behavior configurable or otherwise documented?
EDIT:
This is the code the receives the messages:
private void Listen()
{
    _subscriptionClient.ReceiveAsync().ContinueWith(MessageReceived);
}

private void MessageReceived(Task<BrokeredMessage> task)
{
    if (task.Status != TaskStatus.Faulted && task.Result != null)
    {
        task.Result.CompleteAsync();
        // Do some things...
    }
    Listen();
}
I think what you are seeing is the Receive call in the background. Behind the scenes the Receive calls all use long polling: the client calls out to the Service Bus endpoint and asks for a message. The Service Bus service gets that request and, if it has a message, returns it immediately. If it doesn't, it holds the connection open for a period in case a message arrives; if one arrives within that window it is returned to the client, otherwise a response is sent indicating that no message was there (aka, your null BrokeredMessage). If you call Receive with no overloads (as you've done here) it will immediately make another request. This loop continues until a message is received.
Thus, what you are seeing is the number of times the client requests a message but there isn't one there. The long polling makes this nicer than Windows Azure Storage Queues, which just immediately return a null result if there is no message. For both technologies it is common to implement an exponential back-off for requests; there are lots of examples out there of how to do this. It cuts back on how often you need to check the queue and can reduce your transaction count.
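For example, a minimal sketch of such a back-off around the older SDK's blocking Receive; the delay ceiling and reset behaviour are assumptions:

// Sketch only: poll with a growing pause while the subscription is quiet.
TimeSpan delay = TimeSpan.Zero;
while (true)
{
    BrokeredMessage message = _subscriptionClient.Receive(TimeSpan.FromSeconds(30));
    if (message != null)
    {
        // process the message, then complete it
        message.Complete();
        delay = TimeSpan.Zero; // traffic is flowing again, poll promptly
    }
    else
    {
        delay = delay == TimeSpan.Zero
            ? TimeSpan.FromSeconds(5)
            : TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, TimeSpan.FromMinutes(10).Ticks));
        Thread.Sleep(delay); // back off before the next long poll
    }
}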
To answer your questions:
Yes, this is normal expected behaviour.
No, this is only one transaction. For Service Bus you get charged a transaction each time you put a message on a queue and each time a message is requested (which can be a little opaque given that Receive makes multiple calls in the background). Note that the docs point out that you get charged for each idle transaction (meaning a null result from a Receive call).
Again, you can implement a back-off methodology so that you aren't hitting the queue so often. Another suggestion I've recently heard: if you have a queue that isn't seeing a lot of traffic, you could check the queue depth to see if it is > 0 before entering the processing loop, and if a receive call returns no messages, go back to watching the queue depth. I've not tried that, and I'd think you could get throttled if you checked the queue depth too often.
If these are your production numbers then your subscription isn't really processing a lot of messages. It would likely be a really good idea to have a back-off policy up to whatever delay is acceptable before a message is processed. For example, if it is okay for a message to sit for more than 10 minutes, create a back-off approach that eventually just checks for a message every 10 minutes; when it gets one, process it and immediately check again.
Oh, there is a Receive overload that takes a timeout, but I'm not 100% sure whether that is a server timeout or a local timeout. If it is local, it could still be making calls to the service every X seconds. I think this is based on the OperationTimeout value set on the MessagingFactory settings when creating the SubscriptionClient. You'd have to test that.
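On that last point, a sketch of where that OperationTimeout lives when building the messaging objects with the older SDK; the namespace address, key variables, and entity names are assumptions:

// Sketch only: set OperationTimeout when building the MessagingFactory.
var settings = new MessagingFactorySettings
{
    TokenProvider = TokenProvider.CreateSharedAccessSignatureTokenProvider(keyName, key), // assumed SAS credentials
    OperationTimeout = TimeSpan.FromSeconds(30)
};
MessagingFactory factory = MessagingFactory.Create(new Uri("sb://namespace.servicebus.windows.net/"), settings);
SubscriptionClient subscriptionClient = factory.CreateSubscriptionClient("my-topic", "my-subscription");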
