We have two Windows services that live on a corporate on-premises server and continually send messages to Azure Service Bus in the cloud. Although the messages do end up on the Service Bus eventually, there are long stretches where messages simply never seem to make it through.
This is causing delay issues for us, as we depend on a message arriving on the Service Bus and being processed within a minute. However, as can be seen below, a message can be 'blocked' for stretches of up to 30-40 minutes before making its way through to Azure Service Bus. This happens every day, during almost every hour.
The errors are mainly one of the following (example logs at end of this post):
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 191.239.XX.XXX:443
Error during communication with Service Bus. Check the connection information, then retry.
No such host is known
The request operation did not complete within the allotted timeout of 00:01:10. The time allotted to this operation may have been a portion of a longer timeout. TrackingId:f2db6377-e17d-401a-b339-11fbb51c7bf7, Timestamp:19/05/2017 12:47:36 AM
The way that we send messages to the service bus is as follows, simplified below:
private TopicClient _azureTopic;
...
<Begin Loop>
if (_azureTopic == null)
{
var connectionString = "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=managerfiddev;SharedAccessKey=AABBCCDDEEFFGGHHHASDFADFAadfadfdfz=EntityPath=mytopic";
_azureTopic = TopicClient.CreateFromConnectionString(connectionString);
_azureTopic.RetryPolicy = RetryPolicy.NoRetry;
}
var brokeredMessage = new BrokeredMessage(message.Message)
{
MessageId = message.Id.ToString()
};
brokeredMessage.Properties["ReceivedTimestamp"] = DateTime.Now;
_azureTopic.Send(brokeredMessage);
<End Loop>
Note:
There is a deliberate reason we use a NoRetry policy. Without adding too much noise to the question: a message that failed will simply be tried again in the next iteration (messages are sent to subscribers in a round-robin fashion).
Example log of errors during a small window of time.
20:31:51 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1191251
Error during communication with Service Bus. Check the connection
information, then retry.
20:32:00 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1191251
No such host is known
20:32:00 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1930029
No such host is known
20:32:10 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1930029
No such host is known
20:32:10 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1191251
No such host is known
20:32:10 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1930029
No such host is known
20:34:00 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1930034
Error during communication with Service Bus. Check the connection
information, then retry.
20:38:34 Event.WindowsService Event.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1191269
Error during communication with Service Bus. Check the connection
information, then retry.
20:38:51 RFID.WindowsService RFID.WindowsService::PublishAzureServiceBusTopicMessage()
error trying to synchronise message with Azure. Message ID: 1930043
Error during communication with Service Bus. Check the connection
information, then retry.
Service Bus has native retry capabilities on the NamespaceManager, MessagingFactory, and client (see Retry guidance for specific services).
Because it only retries transient exceptions, you shouldn't end up with duplicate sent messages.
If you want to retry only once, you can configure it like this:
var connectionString = "myconnectionstring";
var client = TopicClient.CreateFromConnectionString(connectionString);
client.RetryPolicy = new RetryExponential(minBackoff: TimeSpan.FromSeconds(2),
maxBackoff: TimeSpan.FromSeconds(2),
maxRetryCount: 1);
This should do the trick.
If you want to ensure deduplication, just google Azure Service Bus duplicate detection.
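For illustration, here is a minimal sketch of enabling duplicate detection on the topic, assuming the same Microsoft.ServiceBus.Messaging SDK used in the question (the topic name, connection string variable, and time window are placeholders):
var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);
if (!namespaceManager.TopicExists("mytopic"))
{
    var topicDescription = new TopicDescription("mytopic")
    {
        // Messages resent with the same MessageId within this window are discarded by the broker.
        RequiresDuplicateDetection = true,
        DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
    };
    namespaceManager.CreateTopic(topicDescription);
}
With this in place, the sender can safely retry the same MessageId without producing duplicates for subscribers.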
I have a question about the SendEventAsync() method.
I tested by unplugging and re-plugging the LAN cable.
_sendDeviceClient.SetRetryPolicy(new NoRetry());        // do not retry
_sendDeviceClient.OperationTimeoutInMilliseconds = xxx; // wait xxx milliseconds
foreach (var message in messages) // Message1, Message2, ...
{
try
{
await _sendDeviceClient.SendEventAsync(message);
//Message send. Do success process
}
catch(Exception e)
{
//Message failed. Do failed process
}
}
My log is "Message send", but in IotHub message was not receive message.
Sometimes, "Message failed", but Iothub received message.
I don't know why this happened.
In any case, is it a problem to implement with try & catch?
In this scenario, I am assuming you don't want to break out of the loop until all your messages have been sent to their respective destinations. I would suggest using an AggregateException, which can tell you about all the messages and their statuses.
At the end of your loop, pass the list of collected exceptions to its constructor and throw it:
AggregateException aggregateEx = new AggregateException(errors);
throw aggregateEx;
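For illustration, a minimal sketch of that pattern around the question's own send loop (the messages collection name is an assumption, not from the original code):
var errors = new List<Exception>();
foreach (var message in messages)
{
    try
    {
        await _sendDeviceClient.SendEventAsync(message);
        // Message sent: do success processing.
    }
    catch (Exception e)
    {
        // Record the failure but keep sending the remaining messages.
        errors.Add(e);
    }
}
if (errors.Count > 0)
{
    throw new AggregateException(errors);
}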
An application that runs on a device has to manage the mechanisms for connection, re-connection, and the retry logic for sending and receiving messages. The retry strategy requirements also depend heavily on the device's IoT scenario, context, and capabilities.
The Azure IoT Hub device SDKs aim to simplify connecting and communicating from cloud-to-device and device-to-cloud. These SDKs provide a robust way to connect to Azure IoT Hub and a comprehensive set of options for sending and receiving messages.
Most likely, message delivery is failing due to a connection failure, which can happen at several levels:
1) Network errors: disconnected socket and name resolution errors
2) Protocol-level errors for HTTP, AMQP, and MQTT transport: detached links or expired sessions
3) Application-level errors that result from either local mistakes: invalid credentials or service behavior (for example, exceeding the quota or throttling)
The device SDKs detect errors at all three levels. OS-related errors and hardware errors are not detected and handled by the device SDKs. The SDK design is based on The Transient Fault Handling Guidance from the Azure Architecture Center.
I can see that you have opted for a no-retry policy, which suggests you have bandwidth or cost concerns.
Ideally, you should implement proper retry logic so that it ensures delivery. Here you can take a look at the complete sample for IoT Hub.
You can read more about retry guidance here.
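As a rough sketch of what that might look like with the device SDK's built-in policies (the connection string, retry count, and timings below are placeholders, not values from the question):
var deviceClient = DeviceClient.CreateFromConnectionString(connectionString, TransportType.Amqp);
deviceClient.SetRetryPolicy(new ExponentialBackoff(
    retryCount: 5,
    minBackoff: TimeSpan.FromSeconds(1),
    maxBackoff: TimeSpan.FromSeconds(30),
    deltaBackoff: TimeSpan.FromSeconds(2)));
deviceClient.OperationTimeoutInMilliseconds = 60000; // overall per-operation timeout
await deviceClient.SendEventAsync(message);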
Hope it helps.
I'm facing an extremely puzzling problem. I have a Windows service that monitors two MSMQ queues for input and sends messages to another MSMQ queue. Although the send operation seems instant from the service's perspective, it actually takes the message exactly three (3) minutes to arrive (as shown in the properties window in the MSMQ MMC). I've been testing this problem with nothing else listening on the other side so that I can see the messages piling up. This is how the service sends messages:
var proxyFactory = new ChannelFactory<IOtherServerInterface>(new NetMsmqBinding(NetMsmqSecurityMode.None)
{
Durable = true,
TimeToLive = new TimeSpan(1, 0, 0),
ReceiveTimeout = TimeSpan.MaxValue
});
IOtherServerInterface server = proxyFactory.CreateChannel(new EndpointAddress("net.msmq://localhost/private/myqueue"));
var task = new MyTask() { ... };
using (TransactionScope scope = new TransactionScope(TransactionScopeOption.Required))
{
server.QueueFile(task);
scope.Complete();
}
The service is running on Windows Server 2008 R2. I also tested it on R1 and noticed the same behavior. Again, everything happens on the same machine. All components are deployed there so I don't think it could be a network issue.
EDIT #1:
I turned on the WCF diagnostics and what I noticed is very strange. The MSMQ datagram does get written normally. However, after the "a message was closed" trace message there is nothing going on. It is as if the service is waiting for something to happen. Exactly 3 minutes later and exactly when the MSMQ message arrives (according to the MSMQ MMC), I see another trace message about a previous activity. I suspect there is some kind of interference.
Let me give you more details about how the services work. There is an IIS app which receives tasks from clients and drops them in an MSMQ queue. From there, the troublesome service (MainService) picks them up and starts processing them. In some cases, another service (AuxService) is required to complete the task so MainService sends a message (that always gets delayed) to AuxService. AuxService has its own inbox queue where it receives MSMQ messages and when it's done, it sends an MSMQ message to MainService. In the meanwhile, the thread that sent the message to AuxService waits until it gets a signal or until it times out. There is a special queue where MainService looks for messages from AuxServices. When a message is received the abovementioned thread is woken up and resumes its activity.
Here's a representation of the whole architecture:
IIS app -> Q1 -> MainService
MainService -> Q2 -> AuxService
AuxService -> Q3 -> MainService
Although all operations are marked with OneWay, I'm wondering whether starting an MSMQ operation from within another MSMQ operation is somehow illegal. It seems to be the case given the empirical evidence. If so, is there a way to change this behavior?
EDIT #2:
Alright, after some more digging it seems WCF is the culprit. I switched both the client code in MainService and the server code in AuxService to use MSMQ SDK directly and it works as expected. The 3 minute timeout I was experiencing was actually the time after which MainService gave up and considered that AuxService failed. Therefore, it seems that for some reason WCF refuses to perform the send until the current WCF activity exits.
Is this by design or is it a bug? Can this behavior be controlled?
You have transactions set up in the queue code; do you have the MSMQ object set up for transactions? Three minutes sounds like the timeout period for a Distributed Transaction Coordinator enlistment.
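For comparison, here is a minimal sketch of the kind of direct System.Messaging send the asker switched to in EDIT #2, using an explicit MSMQ transaction rather than a DTC-enlisted TransactionScope (the queue path is a placeholder; the real paths are not shown in the question):
using (var queue = new MessageQueue(@".\private$\myqueue"))
using (var tx = new MessageQueueTransaction())
{
    tx.Begin();
    // Send to the transactional queue inside the explicit MSMQ transaction.
    queue.Send(new Message("task payload"), tx);
    tx.Commit();
}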
We have a pub/sub application that involves an external client subscribing to a Web Role publisher via an Azure Service Bus topic. Our current billing cycle indicates we've sent/received >25K messages, while our dashboard indicates we've sent <100. We're investigating our implementation and checking our assumptions in order to understand the disparity.
As part of our investigation we've gathered wireshark captures of client<=>service bus traffic on the client machine. We've noticed a regular pattern of communication that we haven't seen documented and would like to better understand. The following exchange occurs once every 50s when there is otherwise no activity on the bus:
The client pushes ~200B to the service bus.
10s later, the service bus pushes ~800B to the client. The client registers the receipt of an empty message (determined via breakpoint).
The client immediately responds by pushing ~1000B to the service bus.
Some relevant information:
This occurs when our web role is not actively pushing data to the service bus.
Upon receiving a legit message from the Web Role, the pattern described above will not occur again until a full 50s has passed.
Both client and server connect to sb://namespace.servicebus.windows.net via TCP.
Our application messages are <64 KB
Questions
What is responsible for the regular, 3-packet message exchange we're seeing? Is it some sort of keep-alive?
Do each of the 3 packets count as a separately billable message?
Is this behavior configurable or otherwise documented?
EDIT:
This is the code that receives the messages:
private void Listen()
{
_subscriptionClient.ReceiveAsync().ContinueWith(MessageReceived);
}
private void MessageReceived(Task<BrokeredMessage> task)
{
if (task.Status != TaskStatus.Faulted && task.Result != null)
{
task.Result.CompleteAsync();
// Do some things...
}
Listen();
}
I think what you are seeing is the Receive call in the background. Behind the scenes, the Receive calls all use long polling, which means they call out to the Service Bus endpoint and ask for a message. The Service Bus service gets that request, and if it has a message it returns it immediately. If it doesn't have a message, it holds the connection open for a period of time in case a message arrives. If a message arrives within that time frame, it is returned to the client. If a message is not available by the end of the time frame, a response is sent to the client indicating that no message was there (aka your null BrokeredMessage). If you call Receive with no overloads (like you've done here) it will immediately make another request. This loop continues until a message is received.
Thus, what you are seeing is the number of times the client requests a message when there isn't one there. The long polling makes this nicer than Windows Azure Storage Queues, which just immediately return a null result if there is no message. For both technologies it is common to implement an exponential back-off for requests; there are lots of examples out there of how to do this. It cuts back on how often you need to go check the queue and can reduce your transaction count.
To answer your questions:
Yes, this is normal expected behaviour.
No, this is only one transaction. For Service Bus you get charged a transaction each time you put a message on a queue and each time a message is requested (which can be a little opaque given that Receive makes multiple calls in the background). Note that the docs point out that you get charged for each idle transaction (meaning a null result from a Receive call).
Again, you can implement a back-off methodology so that you aren't hitting the queue so often. Another suggestion I've recently heard: if you have a queue that isn't seeing a lot of traffic, you could check the queue depth to see if it is > 0 before entering the processing loop, and if you get no messages back from a receive call, go back to watching the queue depth. I've not tried that, and I'd think you could get throttled if you did the queue-depth check too often.
If these are your production numbers, then your subscription isn't really processing a lot of messages. It would likely be a really good idea to have a back-off policy with a wait time that is acceptable before a message is processed. For example, if it is okay for a message to sit for more than 10 minutes, create a back-off approach that eventually only checks for a message every 10 minutes; when it gets one, process it and immediately check again.
Oh, there is a Receive overload that takes a timeout, but I'm not 100% sure whether that is a server timeout or a local timeout. If it is local, it could still be making calls to the service every X seconds. I think this is based on the OperationTimeout value set on the MessagingFactory settings when creating the SubscriptionClient. You'd have to test that.
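A rough sketch of the back-off idea against the subscription client from the question (the wait times are arbitrary placeholders, not recommended values):
// Requires System.Threading for Thread.Sleep.
var delay = TimeSpan.FromSeconds(1);
var maxDelay = TimeSpan.FromMinutes(10);
while (true)
{
    // Server-side long poll: wait up to 30 seconds for a message to arrive.
    BrokeredMessage message = _subscriptionClient.Receive(TimeSpan.FromSeconds(30));
    if (message != null)
    {
        message.Complete();
        // Do some things...
        delay = TimeSpan.FromSeconds(1); // reset the back-off after a hit
    }
    else
    {
        Thread.Sleep(delay); // nothing there: wait before asking again
        delay = TimeSpan.FromTicks(Math.Min(maxDelay.Ticks, delay.Ticks * 2));
    }
}
Each empty Receive still counts as a billable request, so the longer the idle back-off, the fewer idle transactions you pay for.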
I tried creating a poison message scenario in the following manner.
1- Created a message queue on a server (transactional queue).
2- Created a receiver app that handles incoming messages on that server.
3- Created a client app located on a client machine which sends messages to that server with the specific name for the queue.
4- I used the sender client app with the following code (C# 4.0 framework):
// mq below is the System.Messaging.MessageQueue opened earlier for the remote transactional queue
System.Messaging.Message mm = new System.Messaging.Message("Some msg");
mm.TimeToBeReceived = new TimeSpan(0, 0, 50);
mm.TimeToReachQueue = new TimeSpan(0, 0, 30);
mm.UseDeadLetterQueue = true;
mq.Send(mm);
So this is setting the timeout to reach queue to 30 seconds.
First test worked fine. Message went through and was received by the server app.
My second test, I disconnected my ethernet cable, then did another send from the client machine.
I can see in the message queue on the client machine that the message is waiting to be sent ("Waiting for connection"). My problem is that once it goes beyond the 30 seconds (or the 50 seconds, for that matter), the message never goes into the dead-letter queue on the client machine.
Why is that? I was expecting it to go there once it timed out.
Tested on Windows 7 (client) / Windows server 2008 r2 (server)
Your question is a few days old already. Did you find out anything?
My interpretation of your scenario would be that the unplugged cable is the key.
In the scenario John describes, there is an existing connection and the receiver could not process the message correctly within the set time limit.
In your scenario, however, the receiving endpoint never gets the chance to process the message, so the timeout can never occur. As you said, the state of the message is "Waiting for connection". A message that was never sent cannot logically have a timeout to reach its destination.
Just ask yourself how many resources Windows/MSMQ would unnecessarily sacrifice, and how often, to check message queues for all kinds of conditions when a queue is essentially inactive. There might be a lot of queues with a lot of messages on a system.
The behavior I would expect is that once you plug the network cable back in and the connection is re-established, then, only when it is needed, your poison message will be checked for the timeout and eventually moved to the dead-letter queue.
You might want to check this scenario out, or did you already check it in the meantime?
All,
I have a WCF web service (let's call it service "B") hosted under IIS using a service account (VM, Windows 2003 SP2). The service exposes an endpoint that uses WSHttpBinding with the default values, except for maxReceivedMessageSize, maxBufferPoolSize, maxBufferSize and some of the timeouts, which have been increased.
The web service has been load tested using the Visual Studio Load Test framework with around 800 concurrent users and successfully passed all tests with no exceptions being thrown. The proxy in the unit test was created from configuration.
There is a SharePoint application that uses the Office SharePoint Server Search service to call web services "A" and "B". The application gets data from service "A" to create a request that is then sent to service "B". The response coming from service "B" is indexed for search. The proxy is created programmatically using the ChannelFactory.
When service "A" takes less than 10 minutes, the calls to service "B" are successful. But when service "A" takes more time (~20 minutes), the calls to service "B" throw the following exception:
Exception Message: An unsecured or incorrectly secured fault was received from the other party. See the inner FaultException for the fault code and detail
Inner Exception Message: The message could not be processed. This is most likely because the action 'namespace/OperationName' is incorrect or because the message contains an invalid or expired security context token or because there is a mismatch between bindings. The security context token would be invalid if the service aborted the channel due to inactivity. To prevent the service from aborting idle sessions prematurely increase the Receive timeout on the service endpoint's binding.
The binding settings are the same, the time in both client server and web service server are synchronize with the Windows Time service, same time zone.
When I look at the server where web service "B" is hosted, I can see the following security errors being logged:
Source: Security
Category: Logon/Logoff
Event ID: 537
User NT AUTHORITY\SYSTEM
Logon Failure:
Reason: An error occurred during logon
Logon Type: 3
Logon Process: Kerberos
Authentication Package: Kerberos
Status code: 0xC000006D
Substatus code: 0xC0000133
After reading some blogs online, the status code means STATUS_LOGON_FAILURE and the substatus code means STATUS_TIME_DIFFERENCE_AT_DC, but I already checked both server and client clocks and they are synchronized.
I also noticed that the security token seems to be cached somewhere on the client server, because they have another process that calls web service "B" using the same service account and successfully gets data the first time it is called. Then they start the process to update the Office SharePoint Server Search service indexes and it fails. Then if they call the first process again, it fails too.
Has anyone experienced this type of problems or have any ideas?
Regards,
--Damian
10 minutes is the default receive timeout. If a proxy sits idle for more than 10 minutes, the security session of that proxy is aborted by the server. Enable logging and you will see this in the server's diagnostics log. The error message you reported fits this behavior.
Search your system diagnostic file for "SessionIdleManager". If you find it, the above is your problem.
Give it a whirl and set the establishSecurityContext="false" for the client and the server.
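As a rough sketch of both suggestions applied in code (the values are placeholders; the same settings can equally be made in the wsHttpBinding configuration):
var binding = new WSHttpBinding(SecurityMode.Message)
{
    // Widen the idle window so the server does not abort the security session so quickly.
    ReceiveTimeout = TimeSpan.FromMinutes(30)
};
// Turn off the secure conversation session token entirely.
binding.Security.Message.EstablishSecurityContext = false;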
Don't call the service operation in a using statement. Instead use a pattern such as...
var client = new ServiceClient("Ws<binding>");
try
{
    client.Operation(x, y);
    client.Close();
}
catch
{
    client.Abort();
}
I don't understand why this works, but I would guess that when the proxy goes out of scope in the using statement, Close isn't called. The service then waits until receiveTimeout (on the binding) has expired and aborts the connection, causing subsequent calls to fail.
What I believe is happening here is that your channel is timing out (as you suspect).
If I understand correctly, it is not the calls to service A that are timing out, but rather to service B, before you call your operation.
I'm guessing that you are creating your channel before you call service A, rather than just in time (i.e. just before calling service B). You should create the channel (proxy, service client) just before you use it, like this:
AResponse aResp = null;
BResponse bResp = null;
using (ServiceAProxy proxyA = new ServiceAProxy())
{
aResp = proxyA.DoServiceAWork();
using (ServiceBProxy proxyB = new ServiceBProxy())
{
bResp = proxyB.DoOtherWork(aResp);
}
}
return bResp;
I believe, however, that once you get over that problem (service B timing out), you'll realize that the SharePoint app's proxy (the one that called service A) will time out.
To solve that, you may wish to change your service model from a request-response, to a publish-subscribe model.
With long-running services, you'll want your sharepoint app to subscribe to service A, and have service A publish its results when it is ready to do so - regardless of how long it takes.
Programming WCF Services (O'Reilly) by Juval Löwy has a great explanation, and IDesign (Juval's company) published a great set of coding standards for WCF, as well as the code for a great publish-subscribe framework.
Hope this helps,
Assaf.
I actually triggered this error just now by doing something silly. I have a unit test that modifies the system date in order to test some time-based features, and I guess the apparent time difference between when I created the context and when I called my method (because of the changes to the system date) caused something to expire.