For some time now, we have been unable to stop an NServiceBus Windows service. When we try, we get this error:
Error 1061: The service cannot accept control messages at this time.
Unfortunately, the only thing I could find on this matter is this GitHub issue: https://github.com/Particular/NServiceBus/issues/1898
Sadly, this doesn't help, since we do in fact need the IConfigureThisEndpoint interface to configure the BusConfiguration, and that configuration isn't long-running either. We also use almost exactly the same template for other NServiceBus endpoints, which don't have any problems.
Interestingly, this endpoint also worked for quite some time, and the problem seems to occur on only one specific server.
Is there a way to find out more about the error, be it from Microsoft or NServiceBus?
Graceful shutdown
This error can have many causes. NServiceBus will only try to perform a graceful shutdown.
NServiceBus will not abort messages that are currently being processed but will stop processing new messages.
Log file
The log file should indicate that a shutdown is triggered so that is the first thing you can verify.
I would advise setting the log level to DEBUG to help diagnose the shutdown sequence, and also adding this:
var appDomainLogger = LogManager.GetLogger("AppDomain");
var appDomain = AppDomain.CurrentDomain;
appDomain.FirstChanceException += (o, ea) => {
appDomainLogger.Debug("FirstChanceException", ea.Exception);
};
appDomain.UnhandledException += (o, ea) => {
appDomainLogger.Debug("UnhandledException", ea.ExceptionObject as Exception);
};
It could be that exceptions occur that prevent shutdown and this adds additional diagnostics.
Long running messages
If, for example, a message is being processed that is waiting for a lock on a database to be released, then this message could take more time than the Windows service interface allows.
The same applies to other long-running tasks, such as converting media files.
Eventually, the Windows service should shut down once all resources are freed and all messages are done processing, unless they contain a bug that prevents shutdown.
Disposing of resources
Also, the container is disposed during shutdown. It might be that your container holds resources that have a lot of cleanup/teardown to do, for example resources that flush in-memory caches to disk or remote storage so that the service can start up faster than normal the next time it is started.
So we've been using PubSub for receiving GCB events for a while.
We have 4 subscribers to our subscription, so they can split the workload.
The subscribers are identical and written using the official C# client
The subscribers use the default settings, except that we configure only 1 thread to pull.
They are running as a HostedService in AspNetCore inside Kubernetes.
The subscriber application has only that one responsibility
This application is deployed a couple of times every week, since it's bundled with a more heavily used API.
The issue we are facing is this:
When looking at our Kibana logs, we sometimes see what appears to be a delay of the Pub/Sub message of 1 or more minutes (notice that QUEUED has a later timestamp than WORKING).
However, looking at the publishTime, it is clear that the problem is not that the event is published later, but rather that it is handled by our code later.
Now if we look at the PubSub graphs we get:
Which confirms that there indeed WAS an incident where messages were not acked.
This explains why we are seeing the delayed handling of the message :).
But it does not explain WHY we appear to exceed the deadline of 60 seconds.
There are no errors / exceptions anywhere to be found
We are using the C# client in a standard way (defaults)
Now here is where it gets interesting: I discovered that if I PURGE the messages using the Google UI, everything seems to run smoothly for a while (1-3 days). But then it happens again.
Now if we look at the metrics across all the instances when the issue occurs (this is from another incident) we are at no point in time over 200ms of computation time:
Thoughts:
We are misunderstanding something basic about the pubsub ack configuration
Maybe the deploys we do somehow lead the subscription to think that there are still active subscribers, so it waits for them to fail before trying the next subscriber? The PURGE reaction points in this direction. However, I have no way of inspecting how many subscribers are currently registered with the subscription, and I can't see a bug in the code that would cause this.
Looking at the metrics the problem is not with our code. However there might be something with the official client default config / bug.
I'm really puzzled, and I'm missing insight into what is going on inside the Pub/Sub clusters and the official client. Some tracing from the client would be nice, or query tools for Pub/Sub like the ones we have with our Kafka clusters.
The code:
public class GoogleCloudBuildHostedService : BackgroundService
{
...
private async Task<SubscriberClient> BuildSubscriberClient()
{
var subscriptionToUse = $"{_subscriptionName}";
var subscriptionName = new SubscriptionName(_projectId,subscriptionToUse);
var settings = new SubscriberServiceApiSettings();
var client = new SubscriberClient.ClientCreationSettings(1,
credentials: GoogleCredentials.Get().UnderlyingCredential.ToChannelCredentials(),
subscriberServiceApiSettings: settings);
return await SubscriberClient.CreateAsync(subscriptionName, client);
}
protected override async Task ExecuteAsync(CancellationToken cancellationToken)
{
await Task.Yield();
cancellationToken.Register(() => _log.Info("Consumer thread stopping."));
while (cancellationToken.IsCancellationRequested == false)
{
try
{
_log.Info($"Consumer starting...");
var client = await BuildSubscriberClient();
await client.StartAsync((msg, cancellationToken) =>
{
using (eventTimer.NewTimer())
{
try
{
...
}
catch (Exception e)
{
_log.Error(e);
}
}
return Task.FromResult(SubscriberClient.Reply.Ack);
});
await client.StopAsync(cancellationToken);
await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
}
catch (Exception e)
{
_log.Info($"Consumer failed: {e.Message}");
}
}
_log.Info($"Consumer stopping...");
}
}
Hope someone out there in the great big void can enlighten me :).
Kind regards
Christian
UPDATE
So I looked into one of the cases again, and here below we see:
the same instance of the application handling messages from the same topic and subscription.
only 1 client thread is configured
Notice that at 15:23:04 and 15:23:10, two messages are handled at the time of their publication. Then, 2 minutes later, a message that was published at 15:23:07 is handled. In the meantime, 2 other messages are handled.
So why is a message published at 15:23:07 not handled until 15:25:25, when other messages arrive in the meantime?
This can be happening due to different reasons and it is not a trivial task to find and troubleshoot the root of the issue.
Possible latency reasons
Regarding latency, it is normal for subscriptions to have backlogged messages if they are not consuming messages fast enough or have not finished working through the backlog.
I would start by reading the following documentation, where it mentions some reasons on why you might be exceeding the deadline in some cases.
Another reason for message latency might be an increase in message or payload size. Check whether all of your messages are more or less the same size, or whether the ones handled with a delay are bigger.
Handling message failures
I would also like to suggest taking a read here, where it talks about how to handle message failures by setting a subscription retry policy or forwarding undelivered messages to a dead-letter topic (also known as a dead-letter queue).
Good practices
This article contains some good tips and tricks for understanding how latency and stuck messages can happen, and some suggestions that can help to improve that.
xBurnsed offered some good advice links, so let me just supplement it with some other things:
This looks like a classic case of a subscriber holding on to a message and then upon its lease expiring, the message gets delivered to another subscriber. The original subscriber is perhaps one that went down after it received a message, but before it could process and ack the message. You could see if the backlog correlates with restarts of instances of your subscriber. If so, this is a likely culprit. You could check to see if your subscribers are shutting down cleanly, where the subscriber StopAsync call exits cleanly and you have acknowledged all messages received by your callback before actually stopping the subscriber application.
Make sure you are using the latest version of the client. Earlier versions were subject to issues with large backlogs of small messages, though in reality the issues were not limited to those cases. This is mostly relevant if your subscribers are running up against their flow control limits.
Do the machines on which the subscriber is running have any other tasks running on them that could be using up CPU or RAM? If so, it's possible that one of the subscriber applications is starved for resources and can't process messages quickly enough.
If you are still having issues, the next best step to take is to put in a request with Cloud Support, providing the name of your project, name of your subscription, and the message ID of a message that was delayed. Support will be able to track the lifetime of a message and determine if the delivery was delayed on the server or if it was delivered multiple times and not acked.
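To make the first point about a clean StopAsync more concrete: as far as I know, the task returned by StartAsync only completes once the subscriber has stopped, so in the question's code the StopAsync line is only reached after that has already happened. Below is a minimal sketch of wiring the shutdown signal to the stop call. It uses only the base class library; SubscriberLifetime, start and stop are hypothetical names, and the assumption is that start would wrap client.StartAsync(...) and stop would wrap client.StopAsync(...).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SubscriberLifetime
{
    // start: begins receiving; its task is assumed to complete only once
    //        the subscriber has fully stopped.
    // stop:  asks the subscriber to stop and let in-flight handlers finish.
    public static async Task RunUntilStoppedAsync(
        Func<Task> start,
        Func<Task> stop,
        CancellationToken stoppingToken)
    {
        Task running = start();

        // Becomes canceled as soon as shutdown is requested.
        Task shutdownRequested = Task.Delay(Timeout.Infinite, stoppingToken);

        await Task.WhenAny(running, shutdownRequested);

        if (!running.IsCompleted)
        {
            // Shutdown was requested while still receiving: stop explicitly.
            await stop();
        }

        // Wait for the receive loop to drain; rethrows if it failed.
        await running;
    }
}
```

Inside ExecuteAsync this would be awaited once per client, so the host does not exit before handlers that already received a message have returned their reply.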
We have an application that reads messages from an IBM MQ topic and interacts with users via SignalR WebSockets.
Case:
Open the web.config of the IIS ASP.NET application
Change and save it (this causes an AppDomain restart)
Repeat step 2 ten times
After that, we can see many Application_Start/Dispose events in the logs, but for ONE of the AppDomain restart iterations there is no Dispose call. Because of that, our IBM MQ listener keeps handling messages from the old AppDomain, so we get duplicate handling and business errors.
It seems like something prevents the AppDomain from unloading.
I know that it's very hard to say what is happening there, but maybe somebody knows how we can trace this problem.
Disable Overlapped Recycle is true
Shutdown Time Limit is 3s
What I have done in a similar situation is to use this handler in global.asax:
void Application_End(object sender, EventArgs e)
{
// here signaling the listener to close - and wait until they do
// also raise the shutdown time limit to more than 3 seconds, give them time to close
}
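A minimal sketch of what "signal the listener and wait" could look like, using only the base class library. StoppableListener and receiveOnce are hypothetical names; the real IBM MQ receive call would go inside receiveOnce and should use a short receive timeout so the loop can notice the stop signal.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper around a listener loop running on a background task.
public sealed class StoppableListener
{
    private readonly CancellationTokenSource _stop = new CancellationTokenSource();
    private readonly Task _loop;

    public StoppableListener(Action<CancellationToken> receiveOnce)
    {
        _loop = Task.Run(() =>
        {
            // Keep receiving until a stop has been signaled.
            while (!_stop.IsCancellationRequested)
            {
                receiveOnce(_stop.Token);
            }
        });
    }

    // Signals the loop to stop and blocks until it has exited or the
    // timeout elapses. Returns true when the loop ended in time.
    public bool StopAndWait(TimeSpan timeout)
    {
        _stop.Cancel();
        try
        {
            return _loop.Wait(timeout);
        }
        catch (AggregateException)
        {
            return true; // the loop ended, although with an error
        }
    }
}
```

Application_End would then call StopAndWait with a timeout somewhat below the configured Shutdown Time Limit and log the case where it returns false.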
The application uses .NET 4.6.1 and the Microsoft.Azure.ServiceBus.EventProcessorHost NuGet package v2.0.2, along with its dependency, the WindowsAzure.ServiceBus package v3.0.1, to process Azure Event Hub messages.
The application has an implementation of IEventProcessor. When an unhandled exception is thrown from the ProcessEventsAsync method the EventProcessorHost never re-sends those messages to the running instance of IEventProcessor. (Anecdotally, it will re-send if the hosting application is stopped and restarted or if the lease is lost and re-obtained.)
Is there a way to force the event message that resulted in an exception to be re-sent by EventProcessorHost to the IEventProcessor implementation?
One possible solution is presented in this comment on a nearly identical question:
Redeliver unprocessed EventHub messages in IEventProcessor.ProcessEventsAsync
The comment suggests holding a copy of the last successfully processed event message and checkpointing explicitly using that message when an exception occurs in ProcessEventsAsync. However, after implementing and testing such a solution, the EventProcessorHost still does not re-send. The implementation is pretty simple:
private EventData _lastSuccessfulEvent;
public async Task ProcessEventsAsync(
PartitionContext context,
IEnumerable<EventData> messages)
{
try
{
await ProcessEvents(context, messages); // does actual processing, may throw exception
_lastSuccessfulEvent = messages
.OrderByDescending(ed => ed.SequenceNumber)
.First();
}
catch(Exception ex)
{
await context.CheckpointAsync(_lastSuccessfulEvent);
}
}
An analysis of things in action:
A partial log sample is available here: https://gist.github.com/ttbjj/4781aa992941e00e4e15e0bf1c45f316#file-gistfile1-txt
TL;DR: The only reliable way to replay a failed batch of events to IEventProcessor.ProcessEventsAsync is to shut down the EventProcessorHost (EPH) immediately, either by calling eph.UnregisterEventProcessorAsync() or by terminating the process, depending on the situation. This lets other EPH instances acquire the lease for this partition and start from the previous checkpoint.
Before explaining this, I want to call out that this is a great question and was indeed one of the toughest design choices we had to make for EPH. In my view, it was a trade-off between usability/supportability of the EPH framework and technical correctness.
Ideal situation: when the user code in IEventProcessorImpl.ProcessEventsAsync throws an exception, the EPH library shouldn't catch it. It should let the exception crash the process, so that the crash dump clearly shows the responsible call stack. I still believe this is the most technically correct solution.
Current situation: the contract between the IEventProcessorImpl.ProcessEventsAsync API and EPH is:
As long as EventData can be received from the Event Hubs service, keep invoking the user callback (IEventProcessorImplementation.ProcessEventsAsync) with the EventData, and if the user callback throws, notify EventProcessorOptions.ExceptionReceived.
User code inside IEventProcessorImpl.ProcessEventsAsync should handle all errors and incorporate retries as necessary. EPH doesn't set any timeout on this callback, to give users full control over processing time.
If a specific event is the cause of trouble, mark the EventData with a special property (for example type=poison-event) and re-send it to the same Event Hub (include a pointer to the actual event by copying EventData.Offset and SequenceNumber into the new EventData.ApplicationProperties), or forward it to a Service Bus queue, or store it elsewhere. Basically, identify the poison event and defer processing it.
If you have handled all possible cases and are still running into exceptions, catch them and shut down EPH, or fail fast the process with this exception. When EPH comes back up, it will start from where it left off.
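The "retry, then defer the poison event" idea from the list above could be sketched roughly as follows. This uses only the base class library and is not part of EPH; EventRetry, processOne and deferPoison are hypothetical names, and what deferPoison actually does (re-send, forward to a queue, store elsewhere) is left to the caller.

```csharp
using System;
using System.Threading.Tasks;

public static class EventRetry
{
    // Runs processOne up to maxAttempts times and returns true on success.
    // After the last failed attempt it calls deferPoison and returns false,
    // so the caller can move on to the next event and checkpoint as usual.
    public static async Task<bool> ProcessWithRetryAsync(
        Func<Task> processOne,
        Func<Exception, Task> deferPoison,
        int maxAttempts,
        TimeSpan delayBetweenAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await processOne();
                return true;
            }
            catch (Exception ex)
            {
                if (attempt >= maxAttempts)
                {
                    // Out of attempts: hand the event over for later handling.
                    await deferPoison(ex);
                    return false;
                }
            }

            // Give a transient failure some time to clear before retrying.
            await Task.Delay(delayBetweenAttempts);
        }
    }
}
```

Inside ProcessEventsAsync this would be called once per event, which keeps exceptions from escaping to EPH in the first place.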
Why does check-pointing 'the old event' NOT work (read this to understand EPH in general):
Behind the scenes, EPH runs one pump per receiver of an Event Hub consumer group partition. The pump's job is to create a dedicated instance of the IEventProcessor implementation, start the receiver on the designated Event Hub partition from the offset specified in the checkpoint (or, if no checkpoint is present, from EventProcessorOptions.InitialOffsetProvider), and then invoke IEventProcessorImpl.ProcessEventsAsync. The purpose of the checkpoint is to be able to reliably resume processing messages when the EPH process shuts down and ownership of the partition moves to another EPH instance. So the checkpoint is consumed only while starting the pump and is NOT read again once the pump has started.
As I am writing this, EPH is at version 2.2.10.
more general reading on Event Hubs...
Simple Answer:
Have you tried EventProcessorHost.ResetConnection(string partitionId)?
Complex Answer:
It might be an architecture problem that needs to be addressed on your end. Why did the processing fail? Was it a transient error? Is retrying the processing logic a possible scenario? And so on...
I have a Windows service that spawns a set of child activities on separate threads and that should only terminate when all those activities have successfully completed. I do not know in advance how long it might take to terminate an activity after a stop signal is received. During OnStop(), I wait in intervals for that stop signal and keep requesting additional time for as long as the system is willing to grant it.
Here is the basic structure:
class MyService : ServiceBase
{
private CancellationTokenSource stopAllActivities;
private CountdownEvent runningActivities;
protected override void OnStart(string[] args)
{
// ... start a set of activities that signal runningActivities
// when they stop
// ... initialize runningActivities to the number of activities
}
protected override void OnStop()
{
stopAllActivities.Cancel();
while (!runningActivities.Wait(10000))
{
RequestAdditionalTime(15000); // NOTE: 5000 added for overhead
}
}
}
Just how much "overhead" should I be adding in the RequestAdditionalTime call? I'm concerned that the requests are cumulative, instead of based on the point in time when each RequestAdditionalTime call is made. If that's the case, adding overhead could result in the system eventually denying the request because it's too far out in the future. But if I don't add any overhead then my service could be terminated before it has a chance to request the next block of additional time.
This post wasn't exactly encouraging:
The MSDN documentation doesn’t mention this but it appears that the value specified in RequestAdditionalTime is not actually ‘additional’ time. Instead, it replaces the value in ServicesPipeTimeout. Worse still, any value greater than two minutes (120000 milliseconds) is ignored, i.e. capped at two minutes.
I hope that's not the case, but I'm posting this as a worst-case answer.
UPDATE: The author of that post was kind enough to post a very detailed reply to my comment, which I've copied below.
Lars, the short answer is no.
What I would say is that I now realise that Windows Services ought to be designed to start and terminate processing quickly when requested to do so.
As developers, we tend to focus on the implementation of the processing and then package it up and deliver it as a Windows Service.
However, this really isn’t the correct approach to designing Windows Services. Services must be able to respond quickly to requests to start and stop, not only when an administrator is making the request from the services console but also when the operating system is requesting a start as part of its start-up processing or a stop because it is shutting down.
Consider what happens when Windows is configured to shut down when a UPS signals that the power has failed. It’s not appropriate for the service to respond with “I need a few more minutes…”.
It’s possible to write services that react quickly to stop requests even when they implement long running processing tasks. Usually a long running process will consist of batch processing of data and the processing should check if a stop has been requested at the level of the smallest unit of work that ensures data consistency.
As an example, the first service where I found the stop timeout was a problem involved the processing of a notifications queue on a remote server. The processing retrieved a notification from the queue, called a web service to retrieve data related to the subject of the notification, and then wrote a data file for processing by another application.
I implemented the processing as a timer driven call to a single method. Once the method is called it doesn’t return until all the notifications in the queue have been processed. I realised this was a mistake for a Windows Service because occasionally there might be tens of thousands of notifications in the queue and processing might take several minutes.
The method is capable of processing 50 notifications per second. So, what I should have done was implement a check to see if a stop had been requested before processing each notification. This would have allowed the method to return when it has completed the processing of a notification but before it has started to process the next notification. This would have ensured that the service responds quickly to a stop request and any pending notifications remained queued for processing when the service is restarted.
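A minimal sketch of the per-notification stop check described above, using only the base class library. NotificationBatch, pending and processOne are hypothetical names, and an in-memory Queue<string> stands in for the remote notifications queue.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class NotificationBatch
{
    // Processes queued notifications one at a time and returns how many
    // were handled. The stop check runs between notifications, so each one
    // is either fully processed or left in the queue for the next run.
    public static int ProcessPending(
        Queue<string> pending,
        Action<string> processOne,
        CancellationToken stopRequested)
    {
        int processed = 0;
        while (pending.Count > 0 && !stopRequested.IsCancellationRequested)
        {
            string notification = pending.Peek();
            processOne(notification);   // smallest consistent unit of work
            pending.Dequeue();          // remove only after it succeeded
            processed++;
        }
        return processed;
    }
}
```

OnStop would cancel the token source behind stopRequested, after which the method returns as soon as the notification currently in progress is done.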
I am running MSMQ on Windows 2008. Messages are available in a private queue. I have one WCF subscriber (written in C#) that is installed as a Windows service. The problem is that the WCF subscriber sometimes stops picking up messages from the queue. If I restart the service, it works fine again. I have now attached an IErrorHandler to log the reason and the exception.
To handle this issue, I want to set the service's recovery property to restart the service on first failure. The problem is: how do I throw the error from the HandleError() method of the IErrorHandler class?
Please tell me the best way to throw an exception in a Windows service so that it gets restarted.
While it is probably better to address the underlying cause of your exceptions, it is certainly valid in certain scenarios to implement a fail fast methodology. Indeed, this ability to kill processes which have become "flawed" in some manner is critical to the concept of fault tolerance.
So, to make a Windows service terminate itself:
void KillSelf()
{
try
{
// Code to close open connections/dispose
// of unmanaged resources etc
...
}
finally
{
Environment.Exit(1);
}
}
Service recovery options should be set to restart automatically. This will ensure your service comes straight back up again.
As far as I know one cannot throw an exception to restart a windows service.
I usually encapsulate a try catch (with logging) to prevent any exceptions crashing the service, which is the opposite to what you are suggesting.
It may be that you can catch an error and stop the service (not sure) and configure the service to restart if it stops?