We are using a clustered setup with HA queues using MassTransit and RabbitMQ. This works mostly well, but is it possible to extend what happens when there is an exception in the transport layer?
For instance, once a node in the cluster goes down a consumer gets this error:
MassTransit.Util.TaskSupervisor Error: 0 :
Failed to close scope
MassTransit.RabbitMqTransport.Pipeline.RabbitMqBasicConsumer -
rabbitmq://x:5672/y/z?durable=false&autodelete=true&prefetch=16,
System.Threading.Tasks.TaskCanceledException: A task was canceled.
The message flow will still be working as it will switch to another node, but we would like to perform additional actions when the above happens. It is a handled exception, so it is not possible to hook into UnhandledException for the application.
Related
This error is logged occasionally in the function app logs. "An exception occurred while creating a ServiceBusSessionReceiver (Namespace '<servicebus namespace>.servicebus.windows.net', Entity path '<topic>/Subscriptions/<subscription>'). Error Message: 'Azure.Messaging.ServiceBus.ServiceBusException: Put token failed. status-code: 500, status-description: The service was unable to process the request; please retry the operation."
The function app uses managed identity to connect to the service bus.
There is no impact on regular usage, but I just want to know the reason for this exception.
I checked online to find the reason for the exception but didn't find anything, even on StackOverflow. I want to know the reason for this exception so that I understand the impact of the failure and can try to resolve it.
There is no action needed for your application and nothing that you can do to resolve it. This is something that is handled by the Service Bus infrastructure internally. Intermittent failures will not impact your application, though if you're seeing this in large clusters or seeing it frequently, I'd encourage you to open a support ticket for investigation.
To add some context, this exception indicates a service-side issue when passing the authorization token over the CBS link, which is a background activity. The Service Bus client sends refreshes periodically with a large enough window that failures can be retried before the current authorization expires. In the worst case, a specific connection would fault and the Service Bus client would create a new one. So long as the service issue was transient, as is common when a service node is rebooting or being moved, things will recover without noticeable impact to the application.
I'm using the RabbitMQ.Client NuGet package to publish messages to RabbitMQ from a .NET Core 3.1 application. We are using version 5.1.0 of the library.
We want to improve the resiliency of our application, so we are exploring the possibility of defining a retry policy to be used when we send messages via the IModel.BasicPublish method. We are going to use the Polly NuGet package to define the retry policy.
The whole point of a retry policy is to retry a failed operation when a failure deemed to be transient occurs. What I'm trying to understand is how to identify a transient error in this context.
Based on my understanding, all the exceptions thrown by RabbitMQ.Client derive from the RabbitMQClientException custom exception. The point is that the library defines several exception types that derive from RabbitMQClientException; see here for the full list.
I didn't find any specific documentation on that, but from reading the code on GitHub it seems that the only custom exception thrown by the library when a message is published is AlreadyClosedException, which happens when the connection used to publish the message is actually closed. I don't think that retrying makes sense in this case: the connection is already closed, so there is no way to overcome the error by simply retrying the operation.
So my question is: what exception types should I handle in the Polly retry policy that I want to use to execute the IModel.BasicPublish call? Put another way, which exception types thrown by IModel.BasicPublish represent transient errors?
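For reference, a minimal sketch of what such a Polly policy could look like. It handles only AlreadyClosedException, the one exception identified above on the publish path, and it rests on the assumption that the connection can come back between attempts (for example because automatic recovery is enabled on the ConnectionFactory); without that, retrying on the same channel would not help. The class name ResilientPublisher, the retry count and the delays are illustrative choices, not something the library prescribes.

```csharp
using System;
using Polly;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

public static class ResilientPublisher
{
    // Assumption: AlreadyClosedException is treated as transient because the
    // connection is expected to recover between attempts.
    // Three retries with exponentially growing pauses (200 ms, 400 ms, 800 ms).
    public static readonly Policy PublishRetry = Policy
        .Handle<AlreadyClosedException>()
        .WaitAndRetry(3, attempt => TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));

    // Publishes a message through the retry policy. If every attempt fails,
    // the last AlreadyClosedException propagates to the caller.
    public static void Publish(IModel channel, string exchange, string routingKey, byte[] body)
    {
        PublishRetry.Execute(() => channel.BasicPublish(exchange, routingKey, null, body));
    }
}
```

Any exception type not named in Handle<T>() is rethrown immediately, so adding further types with .Or<T>() is a deliberate decision about what counts as transient.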
The application uses .NET 4.6.1 and the Microsoft.Azure.ServiceBus.EventProcessorHost NuGet package v2.0.2, along with its dependency, the WindowsAzure.ServiceBus package v3.0.1, to process Azure Event Hub messages.
The application has an implementation of IEventProcessor. When an unhandled exception is thrown from the ProcessEventsAsync method the EventProcessorHost never re-sends those messages to the running instance of IEventProcessor. (Anecdotally, it will re-send if the hosting application is stopped and restarted or if the lease is lost and re-obtained.)
Is there a way to force the event message that resulted in an exception to be re-sent by EventProcessorHost to the IEventProcessor implementation?
One possible solution is presented in this comment on a nearly identical question:
Redeliver unprocessed EventHub messages in IEventProcessor.ProcessEventsAsync
The comment suggests holding a copy of the last successfully processed event message and checkpointing explicitly using that message when an exception occurs in ProcessEventsAsync. However, after implementing and testing such a solution, the EventProcessorHost still does not re-send. The implementation is pretty simple:
private EventData _lastSuccessfulEvent;

public async Task ProcessEventsAsync(
    PartitionContext context,
    IEnumerable<EventData> messages)
{
    try
    {
        await ProcessEvents(context, messages); // does actual processing, may throw exception
        _lastSuccessfulEvent = messages
            .OrderByDescending(ed => ed.SequenceNumber)
            .First();
    }
    catch (Exception ex)
    {
        await context.CheckpointAsync(_lastSuccessfulEvent);
    }
}
An analysis of things in action:
A partial log sample is available here: https://gist.github.com/ttbjj/4781aa992941e00e4e15e0bf1c45f316#file-gistfile1-txt
TLDR: The only reliable way to replay a failed batch of events to IEventProcessor.ProcessEventsAsync is to shut down the EventProcessorHost (aka EPH) immediately, either by calling eph.UnregisterEventProcessorAsync() or by terminating the process, depending on the situation. This lets other EPH instances acquire the lease for this partition and start from the previous checkpoint.
Before explaining this, I want to call out that this is a great question and was indeed one of the toughest design choices we had to make for EPH. In my view, it was a trade-off between usability/supportability of the EPH framework and technical correctness.
The ideal situation would have been: when the user code in IEventProcessorImpl.ProcessEventsAsync throws an exception, the EPH library does not catch it. It should have let the exception crash the process, so that the crash dump clearly shows the call stack responsible. I still believe this is the most technically correct solution.
Current situation: the contract between the IEventProcessorImpl.ProcessEventsAsync API and EPH is:
As long as EventData can be received from the Event Hubs service, continue invoking the user callback (IEventProcessorImplementation.ProcessEventsAsync) with the EventData, and if the user callback throws, notify EventProcessorOptions.ExceptionReceived.
User code inside IEventProcessorImpl.ProcessEventsAsync should handle all errors and incorporate retries as necessary. EPH doesn't set any timeout on this callback, to give users full control over processing time.
If a specific event is the cause of trouble, mark the EventData with a special property (for example, type=poison-event) and re-send it to the same Event Hub (include a pointer to the actual event by copying its EventData.Offset and SequenceNumber into the new EventData.ApplicationProperties), forward it to a Service Bus queue, or store it elsewhere. Basically, identify the poison event and defer processing it.
If you have handled all possible cases and are still running into exceptions, catch them and shut down EPH, or fail fast the process with the exception. When EPH comes back up, it will start from where it left off.
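The guidance above (retry inside the callback, and stop the process when retries are exhausted) can be sketched roughly as follows. This is an illustration under assumptions, not EPH's own code: the class name RetryingEventProcessor, the ProcessBatchAsync hook, the retry budget and the delay are all hypothetical, and failing fast is just one of the shutdown options mentioned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public abstract class RetryingEventProcessor : IEventProcessor
{
    private const int MaxAttempts = 3;                                     // hypothetical retry budget
    private static readonly TimeSpan RetryDelay = TimeSpan.FromSeconds(1); // hypothetical pause

    // Hypothetical hook for the actual processing; may throw.
    protected abstract Task ProcessBatchAsync(PartitionContext context, IReadOnlyList<EventData> batch);

    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        // Materialize once so every retry sees the same batch.
        var batch = messages.ToList();
        Exception lastError = await RunWithRetriesAsync(
            () => ProcessBatchAsync(context, batch), MaxAttempts, RetryDelay);

        if (lastError != null)
        {
            // Do not checkpoint. Terminating the process means the partition is
            // picked up again from the previous checkpoint after a restart or
            // by another host instance.
            Environment.FailFast("Event batch could not be processed.", lastError);
        }

        await context.CheckpointAsync();
    }

    // Runs the work up to maxAttempts times.
    // Returns null on success, or the last exception if every attempt failed.
    public static async Task<Exception> RunWithRetriesAsync(Func<Task> work, int maxAttempts, TimeSpan delay)
    {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await work();
                return null;
            }
            catch (Exception ex)
            {
                last = ex;
            }

            if (attempt < maxAttempts)
            {
                await Task.Delay(delay);
            }
        }
        return last;
    }
}
```

Checkpointing only after a successful batch keeps the stored checkpoint behind any event that has not been processed, which is what makes the restart replay the failed batch.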
Why checkpointing 'the old event' does NOT work (read this to understand EPH in general):
Behind the scenes, EPH runs one pump per receiver for each partition of the Event Hub consumer group. The pump's job is to start the receiver from a given checkpoint (if present), create a dedicated instance of the IEventProcessor implementation, receive from the designated Event Hub partition starting at the offset specified in the checkpoint (if no checkpoint is present, from EventProcessorOptions.initialOffsetProvider), and eventually invoke IEventProcessorImpl.ProcessEventsAsync. The purpose of the checkpoint is to make it possible to reliably resume processing messages when the EPH process shuts down and ownership of the partition moves to another EPH instance. So the checkpoint is consumed only while starting the pump and is NOT read again once the pump has started.
As I am writing this, EPH is at version 2.2.10.
more general reading on Event Hubs...
Simple Answer:
Have you tried EventProcessorHost.ResetConnection(string partitionId)?
Complex Answer:
It might be an architecture problem that needs to be addressed at your end: why did the processing fail? Was it a transient error? Is retrying the processing logic a possible scenario? And so on...
I create a channel with ChannelFactory.CreateChannel() every time before performing an operation on the service. At the end of the operation I close the channel, or abort it if there are any exceptions.
Since I am creating a channel each time, do I have to listen to "Faulted" events?
Btw, why does ChannelFactory have a Faulted event when all the work is done by the channel?
Or will it be raised when any of the channels created by this factory faults?
Thanks in advance,
Dreamer!
The only reason I would listen to the Faulted event is if I wanted to do something particular if the event occurs (other than aborting the channel). I can't, off the top of my head, think of a reason to use it - but that doesn't mean there isn't one.
In your case, if you're aborting the channel when an error occurs, then you're fine; you don't need to handle the Faulted event.
ChannelFactory<T> implements ICommunicationObject, which defines a Faulted event. MSDN says "Defines the contract for the basic state machine for all communication-oriented objects in the system, including channels, the channel managers, factories, listeners, and dispatchers, and service hosts."
ChannelFactory<T>.CreateChannel returns a type of IChannel, which also implements ICommunicationObject.
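The create, use, then close-or-abort pattern described in the question can be sketched as a small helper. The name ChannelInvoker is made up for illustration, and the sketch assumes the factory is already configured and that the channel it creates can be cast to ICommunicationObject, as noted above.

```csharp
using System;
using System.ServiceModel;

public static class ChannelInvoker
{
    // Creates a channel for a single operation, closes it on success
    // and aborts it on any failure.
    public static TResult Call<TChannel, TResult>(
        ChannelFactory<TChannel> factory,
        Func<TChannel, TResult> operation)
    {
        TChannel channel = factory.CreateChannel();
        var communicationObject = (ICommunicationObject)channel;
        try
        {
            TResult result = operation(channel);
            communicationObject.Close();   // graceful shutdown on the success path
            return result;
        }
        catch
        {
            communicationObject.Abort();   // tear down immediately; safe even if the channel is faulted
            throw;
        }
    }
}
```

With a hypothetical contract ICalculator, a call would look like `int sum = ChannelInvoker.Call(factory, c => c.Add(1, 2));`. Because every channel is discarded after one operation, nothing here subscribes to Faulted.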
I am coding a WCF service. Most exceptions are caught in the BL implementation and handled there. The return type of each of my APIs is a class (named "result") containing an error code, an error message and a success boolean.
When exceptions are handled, this class is updated accordingly and in the end is sent back to the client.
Some of the exceptions are, of course, unhandled. Currently, I am wrapping each of my BL calls from the service layer with a generic try-catch so I can catch every unhandled exception and create a generic "result" class with a generic failure message, error code and success=false.
Is this a good way to handle exceptions, or should I let unhandled exceptions be thrown by the service to the client?
You can assume that the client can't use the data from the exception so it won't benefit from the extra information contained in the exception.
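To make the approach concrete, here is a minimal sketch of the generic wrapper described above. The Result shape follows the description (error code, error message, success flag); the ServiceGuard helper, the error code value and the message text are hypothetical.

```csharp
using System;

public class Result
{
    public bool Success { get; set; }
    public int ErrorCode { get; set; }
    public string ErrorMessage { get; set; }
}

public static class ServiceGuard
{
    public const int GenericErrorCode = 500; // hypothetical code for "unexpected failure"

    // Runs a BL call and converts any unhandled exception into a generic failure result.
    public static Result Run(Func<Result> businessCall)
    {
        try
        {
            return businessCall();
        }
        catch (Exception)
        {
            // A real service would log the exception here; the details are
            // deliberately not copied into the result sent to the client.
            return new Result
            {
                Success = false,
                ErrorCode = GenericErrorCode,
                ErrorMessage = "An unexpected error occurred."
            };
        }
    }
}
```

Routing every service operation through one helper like this replaces the repeated try-catch blocks with a single place that decides what the client sees.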
Check out Exception Shielding.
This is a process where exceptions raised by the service are mapped to fault contracts according to rules you specify in a configuration file. This saves a lot of donkey work with try/catch blocks.
Here is one post to help you out:
In general though - faults will fall into 3 categories:
1) Client error - the client has tried to do something not permissible, so it needs to know about it. E.g. Failed to set a mandatory field. - Return specific message explaining fault.
2) Business error that doesn't affect the client. An error that is considered normal operation, e.g. Payment Authorization check failure. Either hide from client completely, or return some message: "Error performing request: Please try again later..."
3) System error - Unexpected - not normal operation: Replace with generic message: "System Error: Call Support"
In all cases though, the key thing is you remove the stack trace, especially if it's a public facing service.
With shielding you would have 3 Fault Contracts covering the above scenarios, and set the text appropriately in the Shielding configuration.
Be advised, you generally want shielding turned off during development as it makes it a right pain to debug the system!
I differ with the others. I think that, in the same way the HTTP methods GET, POST, PUT and DELETE support CRUD operations, the HTTP response codes 200, 500, etc. support success/fail, and in my opinion it is appropriate to make use of this. A 500 result still has an HTTP response body, and such a body is fully readable (so long as IIS isn't spitting out HTML; you have control over this). Meanwhile, XML protocol implementations such as Microsoft's SOAP in WCF already wrap exceptions with a faulting protocol.
If you're going to throw exceptions, throw them. Just document them while doing so, so that the consumers can plan accordingly.
I think both approaches are viable.
I personally prefer not throwing exceptions over WCF, so that the client can easily distinguish between an error in server-side processing and a connectivity/protocol issue: in the first case the response will indicate the failure, and in the second case an exception will be thrown.
Personally I wouldn't expose the unhandled exceptions and propagate them to the client. I would define those exceptions the client might be interested in and only propagate those. Exceptions not directly related to what the clients want to do (ArgumentException could set reason to "CustomerId cannot be more than 20 chars" etc.) I'd deal with in the service and only indicate that some sort of internal server error has occurred on the service side which broke the execution and meant that the operation the client tried to run failed to complete. This I would do because the client can't really take any action based on internal server errors. They can fix their inparams in the case of an ArgumentException being thrown by validating the parameters again and retry the operation.
Not sure if this is really what you're asking, but hope it gives you some ideas at least.
If you let unhandled exceptions out of your WCF service, this may have undesirable effects, such as the communication channel ending up in the faulted state; in a sessionful scenario the client can then no longer use the same client proxy instance and is forced to create a new one and start a new session. In general, I think it is good to have control over the errors that surface from your WCF service and to provide clients with helpful information. Take a look at IErrorHandler. This interface gives you control over the SOAP fault generated and over unhandled exceptions, allows you to do extra tasks like logging, and lets you decide whether to keep the session in the case of a sessionful binding. You add your custom error handler via WCF extensibility points such as service, endpoint, contract, or operation behaviors.
Note that IErrorHandler is called before sending a response message. So there is still a chance of an unhandled exception occurring down in the channel stack during serialization, encoding, etc.
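A minimal sketch of such an error handler, attached through a service behavior, might look like the following. The class names and the fault text are illustrative; logging via Trace and returning true from HandleError are assumptions about what a given service would want.

```csharp
using System;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.ServiceModel.Description;
using System.ServiceModel.Dispatcher;

public class GenericErrorHandler : IErrorHandler
{
    // Called before the response is sent: replace unexpected exceptions with a generic fault.
    public void ProvideFault(Exception error, MessageVersion version, ref Message fault)
    {
        if (error is FaultException)
        {
            return; // faults thrown deliberately by the service pass through unchanged
        }

        var generic = new FaultException("An internal error occurred.");
        fault = Message.CreateMessage(version, generic.CreateMessageFault(), generic.Action);
    }

    // Called after the response is sent: a place for logging.
    // Returning true tells WCF not to abort the session because of this error.
    public bool HandleError(Exception error)
    {
        Trace.TraceError(error.ToString());
        return true;
    }
}

// Attaches the handler to every channel dispatcher of the service host.
public class GenericErrorBehavior : IServiceBehavior
{
    public void ApplyDispatchBehavior(ServiceDescription description, ServiceHostBase host)
    {
        foreach (ChannelDispatcherBase dispatcherBase in host.ChannelDispatchers)
        {
            var dispatcher = dispatcherBase as ChannelDispatcher;
            if (dispatcher != null)
            {
                dispatcher.ErrorHandlers.Add(new GenericErrorHandler());
            }
        }
    }

    public void AddBindingParameters(ServiceDescription description, ServiceHostBase host,
        Collection<ServiceEndpoint> endpoints, BindingParameterCollection parameters)
    {
    }

    public void Validate(ServiceDescription description, ServiceHostBase host)
    {
    }
}
```

For a self-hosted service, the behavior could be added with `host.Description.Behaviors.Add(new GenericErrorBehavior())` before the host is opened.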