Azure worker role recycling with unhandled Service Bus fault message - c#

I have been running an Azure worker role deployment that uses the Microsoft.ServiceBus 2.2 library to respond to jobs posted from other worker roles and web roles. Recently (suspiciously around the time of a recent Azure guest OS update), the instances in the cluster started recycling constantly: rebooting, running for a short time, and then recycling again.
The trace messages in my diagnostics confirm that the role instances make it all the way through the OnStart() method of my RoleEntryPoint. Occasionally, the Instances pane of the Azure Management Portal would report that a recycling role had experienced an "unhandled exception," but it gave no further detail. After logging in to one of the instances with Remote Desktop, I found two clues:
Performance counters show \Processor(_Total)\% Processor Time hovering at 100%, periodically dropping to the mid-80s, coinciding with drops in \TCPv4\Connections Established. Some drops in \TCPv4\Connections Established do not correlate with drops in \Processor(_Total)\% Processor Time.
In the Local Server events in Server Manager on one of the instances, I found the following message:
Application: WaWorkerHost.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: Microsoft.ServiceBus.Common.CallbackException
Stack:
at Microsoft.ServiceBus.Common.Fx+IOCompletionThunk.UnhandledExceptionFrame(UInt32, UInt32, System.Threading.NativeOverlapped*)
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
There have been no Service Bus permission or configuration changes during this time, and the message occurs even though we have not updated any of our VMs. Nonetheless, the service still appears to be functioning: jobs are being processed and removed from the Service Bus queues the roles listen to.
Most search results for this error suggest that it is somehow related to IntelliTrace; however, these VMs do not have IntelliTrace enabled.
Does anyone have any ideas on what is going on here?

As far as the crashing goes, the Service Bus exceptions turned out to be a red herring. They were caused by a namespace conflict in one of the data contracts sent between two different roles that had been published at different times. Adding extra tracing for exceptions thrown during one of the receive retries revealed it. It is still a mystery why the service works at all, and the role recycling has not stopped; only the Service Bus exception has.
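The answer above doesn't show the contracts involved, so here is a minimal sketch of that failure mode, assuming the payloads are serialized with DataContractSerializer. The types JobV1/JobV2 and the namespace URIs are hypothetical. Two contracts with the same name but different namespaces are not interchangeable: the receiving side throws a SerializationException instead of reading the message.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// Hypothetical contract as compiled into the role that sends the job.
[DataContract(Name = "Job", Namespace = "http://example.com/jobs/v1")]
public class JobV1
{
    [DataMember] public string Id { get; set; }
}

// The "same" contract as compiled into a role published later with a different namespace.
[DataContract(Name = "Job", Namespace = "http://example.com/jobs/v2")]
public class JobV2
{
    [DataMember] public string Id { get; set; }
}

public static class ContractMismatchDemo
{
    // Serializes a JobV1 the way the sending role might.
    private static MemoryStream SerializeJobV1(string id)
    {
        var stream = new MemoryStream();
        new DataContractSerializer(typeof(JobV1)).WriteObject(stream, new JobV1 { Id = id });
        stream.Position = 0;
        return stream;
    }

    // Returns true if a JobV1 payload can be deserialized as T on the receiving side.
    public static bool CanRead<T>()
    {
        using (var stream = SerializeJobV1("42"))
        {
            try
            {
                new DataContractSerializer(typeof(T)).ReadObject(stream);
                return true;
            }
            catch (SerializationException ex)
            {
                // The message names the expected and the encountered XML namespace.
                Console.WriteLine(ex.Message);
                return false;
            }
        }
    }
}
```

Here CanRead<JobV1>() returns true and CanRead<JobV2>() returns false. If no Namespace is given, DataContractSerializer derives one from the CLR namespace (http://schemas.datacontract.org/2004/07/...), so moving or renaming a contract class between deployments can produce this conflict. Pinning Namespace explicitly in a shared contract assembly avoids that.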

I had a similar issue. The cause was that the runtime could not resolve the version of the Service Bus DLL. Make sure the version you redirect to in your app.config (the assembly binding redirect) and the version you actually reference are the same.
This can happen with any DLL version mismatch, not only with the Service Bus DLL.
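Binding redirects are declared in the `<runtime>` section of the role project's app.config (or web.config), rather than in `<appSettings>`. The fragment below is a sketch, assuming the project references Microsoft.ServiceBus with assembly version 2.2.0.0; the version numbers are illustrative and must match the assembly you actually reference.

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- Identity of the assembly being redirected -->
        <assemblyIdentity name="Microsoft.ServiceBus" publicKeyToken="31bf3856ad364e35" culture="neutral" />
        <!-- newVersion must equal the version the project actually references (assumed 2.2.0.0 here) -->
        <bindingRedirect oldVersion="0.0.0.0-2.2.0.0" newVersion="2.2.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```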

Related

I am seeing my azure service bus queue getting deleted automatically

I have been creating an Azure Service Bus queue in the Azure portal, but the queue disappears from the portal automatically after a few hours of use.
I am creating the queue manually through the Azure portal, and the account has sufficient funds. Service Bus on Azure is currently in preview mode.
From code, I am only sending messages to and receiving messages from the queue.
A queue can be deleted as a result of:
Custom code that performs namespace management operations and deletes the entity.
AutoDeleteOnIdle being set to a relatively short TimeSpan, which causes the entity to be removed if it sees no activity.
I suspect AutoDeleteOnIdle is set to a relatively low value. By default, it's TimeSpan.MaxValue and should not cause this issue.
It appears that the queue was being removed overnight by some nightly process on Azure's side, apparently because Azure saw no indication that it was in use. After I configured SAS key information at the queue level and also left a message in the queue overnight, I no longer see the queue getting recycled.
Thanks to Sean Feldman for providing useful information which helped me through the process.

WCF Transaction Exception

I am currently working on a client-service architecture using WCF with C# and .NET. I was stuck on the well-known transaction exception for this architecture when running the client and server on different machines (there is no problem when both run on the same machine). After searching the web, I addressed this with MSDTC and firewall settings as described at this link: https://www.packtpub.com/books/content/how-configure-msdtc-and-firewall-distributed-wcf-service. I enabled the outbound settings on the client and the inbound settings on the server, and I also added a firewall exception for DTC on the client side. The problem is now only partially resolved, and a more serious issue has appeared: when I run the app, it sometimes throws the exception and sometimes runs successfully without my changing anything. I have no idea what is wrong in this case. Any ideas?

Topshelf service with dependency on RabbitMQ not starting on reboot

I have a Windows service that uses EasyNetQ and RabbitMQ.
The service starts normally from the service control manager.
However, I have occasionally seen that on a reboot the service does not start, with the following error in the services event log:
A timeout was reached (30000 milliseconds)
The <serviceName> service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.
I have tried setting the service to Automatic (Delayed Start), and this has not helped.
I was also thinking about configuring the recovery options so that the service restarts on the first, second, and subsequent failures, but I am not sure this will work.
So my question is: how can I determine which dependency is sometimes causing my service not to start?
To determine which dependency is causing the error, you could attach a handler to Topshelf's "OnException" (https://topshelf.readthedocs.io/en/latest/configuration/config_api.html#onexception)
and log the exception that caused the failure.

Timeout for the requested operation has expired

This is driving me crazy. We use a fairly large number of private MSMQ queues in the C#/ASP.NET web application I work on, and we have a common library to send messages to and receive messages from our queues. Yesterday, this stopped working for me altogether, but none of the other developers I work with are running into this issue, which makes me think it has something to do with my local dev environment or my Windows account settings.
I am now always getting "Timeout for the requested operation has expired" exceptions when the following line of messaging code is called:
var returnMessage = fromMessageQueue.ReceiveByCorrelationId(strCorrelationID, tsWait);
We basically have an "Inbound" and "Outbound" queue for each of our (business) clients. The Inbound queues look clean, but when I look in the Outbound queues, I can see "stuck" messages that are the responses I need.
I've even written a small test console application against a dummy queue I set up for troubleshooting, and it still throws the same timeout exceptions.
I've checked the permissions on the private queues I've been troubleshooting with: the EVERYONE and ANONYMOUS users have full control of the queues. I've even granted my own domain account access to a few queues, but that didn't work either.
I'm afraid I'm very stuck until I can get this resolved.
I usually get this when I have the software installed and running as a service while also running a debug copy through Visual Studio (two services reading from one queue).

Azure Service Bus Binding causes Worker Role to become unhealthy after single fault

I've been working on an Azure worker role that hosts a netTcpRelayBinding WCF service. Everything seems to work well until one of my connected hosts disconnects unexpectedly. Over the next couple of minutes, the role consistently loses stability and then reports itself as unhealthy.
I'm not sure where I should be looking. I have IntelliTrace enabled, and it shows some exceptions, starting with the TimeoutException you'd expect, but then more follow. I get these messages:
System.ServiceModel.CommunicationException - Inactivity timeout of (00:00:10) has been exceeded.
System.InvalidProgramException - The Common Language Runtime detected an invalid program
After this, I get a series of CommunicationExceptions and TimeoutExceptions, and eventually the whole host crashes with an OutOfMemoryException.
Things to note: I have one client connected and no other calls or activity. When that client disconnects unexpectedly, the above happens consistently.
I tried handling the ServiceHost Faulted event, but that seemed to do nothing (I can't see it being hit in the IntelliTrace logs).
Any ideas on where I should be looking? Surely I don't need to recreate the service every time something like this happens, right?
That doesn't sound familiar. Can you reproduce this? And how is your role doing overall: what does the CPU load look like, and what else is going on in parallel?
Cheers,
Clemens
