PubSub with 'cloud-builds' topic often produces unack'ed messages - c#

So we've been using Pub/Sub for receiving GCB events for a while.
We have 4 subscribers on our subscription, so they can split the workload.
The subscribers are identical and written using the official C# client.
The subscribers use the default settings; we only configure that a single thread should be pulling.
They are running as a HostedService in ASP.NET Core inside Kubernetes.
The subscriber application has only that one responsibility.
This application is deployed a couple of times every week, since it's bundled with a more heavily used API.
The issue we are facing is this:
When looking at our Kibana logs, we sometimes see what appears to be a delay of the Pub/Sub message of 1 or more minutes (notice that QUEUED has a later timestamp than WORKING).
However, looking at the publishTime, it is clear that the problem is not that the event is published later, but rather that it is handled by our code later.
Now if we look at the PubSub graphs we get:
This confirms that there indeed WAS an incident where messages were not acked.
This explains why we are seeing the delayed handling of the message :).
But it does not explain WHY we appear to exceed the deadline of 60 seconds.
There are no errors / exceptions anywhere to be found
We are using the C# client in a standard way (defaults)
Now here is where it gets interesting: I discovered that if I do a PURGE of messages using the Google UI, everything seems to run smoothly for a while (1-3 days). But then it happens again.
Now if we look at the metrics across all the instances when the issue occurs (this is from another incident), at no point in time do we exceed 200 ms of computation time:
Thoughts:
We are misunderstanding something basic about the Pub/Sub ack configuration.
Maybe the deploys we do somehow lead the subscription to think that there are still active subscribers, so it waits for them to fail before trying the next subscriber? The PURGE reaction points in this direction. However, I have no way of inspecting how many subscribers are currently registered with the subscription, and I can't see a bug in the code that could cause this.
Looking at the metrics, the problem is not with our code. However, there might be something in the official client's default config, or a bug.
I'm really puzzled, and I'm missing insight into what is going on inside the Pub/Sub clusters and the official client. Some tracing from the client would be nice, or query tools for Pub/Sub like the ones we have for our Kafka clusters.
The code:
public class GoogleCloudBuildHostedService : BackgroundService
{
    ...
    private async Task<SubscriberClient> BuildSubscriberClient()
    {
        var subscriptionName = new SubscriptionName(_projectId, _subscriptionName);
        var settings = new SubscriberServiceApiSettings();
        var clientCreationSettings = new SubscriberClient.ClientCreationSettings(1,
            credentials: GoogleCredentials.Get().UnderlyingCredential.ToChannelCredentials(),
            subscriberServiceApiSettings: settings);
        return await SubscriberClient.CreateAsync(subscriptionName, clientCreationSettings);
    }

    protected override async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        await Task.Yield();
        cancellationToken.Register(() => _log.Info("Consumer thread stopping."));
        while (cancellationToken.IsCancellationRequested == false)
        {
            try
            {
                _log.Info("Consumer starting...");
                var client = await BuildSubscriberClient();
                // The handler's token parameter is named 'ct' so it doesn't
                // shadow the outer 'cancellationToken' (which would be a compile error).
                await client.StartAsync((msg, ct) =>
                {
                    using (eventTimer.NewTimer())
                    {
                        try
                        {
                            ...
                        }
                        catch (Exception e)
                        {
                            _log.Error(e);
                        }
                    }
                    return Task.FromResult(SubscriberClient.Reply.Ack);
                });
                await client.StopAsync(cancellationToken);
                await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
            }
            catch (Exception e)
            {
                _log.Info($"Consumer failed: {e.Message}");
            }
        }
        _log.Info("Consumer stopping...");
    }
}
Hope someone out there in the great big void can enlighten me :).
Kind regards
Christian
UPDATE
So I looked into one of the cases again, and below we see:
the same instance of the application handling messages from the same topic and subscription;
only 1 client thread configured.
Notice that at 15:23:04 and 15:23:10 two messages are handled at their time of publication. Then, two minutes later, a message that was published at 15:23:07 is handled, and in the meantime two other messages have been handled.
So why is a message published at 15:23:07 not handled until 15:25:25, when other messages arrive in the meantime?

This can happen for different reasons, and finding and troubleshooting the root of the issue is not a trivial task.
Possible latency reasons
Regarding latency, it is normal for subscriptions to have backlogged messages if they are not consuming messages fast enough or have not finished working through the backlog.
I would start by reading the following documentation, which mentions some reasons why you might be exceeding the deadline in some cases.
Another reason for message latency might be an increase in message or payload size. Check whether all of your messages are more or less the same size, or whether the ones handled with delay are bigger.
Handling message failures
I would also like to suggest taking a read here, where it talks about how to handle message failures by setting a subscription retry policy or forwarding undelivered messages to a dead-letter topic (also known as a dead-letter queue).
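For illustration, here is a minimal sketch of attaching such a retry policy and dead-letter topic with the C# admin client, assuming a Google.Cloud.PubSub.V1 version recent enough to expose RetryPolicy and DeadLetterPolicy. The project, subscription, and topic names are placeholders, and the dead-letter topic must already exist:

using System;
using Google.Cloud.PubSub.V1;
using Google.Protobuf.WellKnownTypes;

// Sketch only: update an existing subscription with a retry policy and a
// dead-letter topic. All resource names below are placeholders.
var api = SubscriberServiceApiClient.Create();
var subscription = new Subscription
{
    SubscriptionName = SubscriptionName.FromProjectSubscription("my-project", "cloud-builds-sub"),
    RetryPolicy = new RetryPolicy
    {
        MinimumBackoff = Duration.FromTimeSpan(TimeSpan.FromSeconds(10)),
        MaximumBackoff = Duration.FromTimeSpan(TimeSpan.FromMinutes(5))
    },
    DeadLetterPolicy = new DeadLetterPolicy
    {
        DeadLetterTopic = TopicName.FromProjectTopic("my-project", "cloud-builds-dlq").ToString(),
        MaxDeliveryAttempts = 5 // message moves to the dead-letter topic after 5 failed deliveries
    }
};
api.UpdateSubscription(new UpdateSubscriptionRequest
{
    Subscription = subscription,
    UpdateMask = new FieldMask { Paths = { "retry_policy", "dead_letter_policy" } }
});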
Good practices
This article contains some good tips and tricks for understanding how latency and message stuckness can happen and some suggestions that can help to improve that.

xBurnsed offered some good advice and links, so let me just supplement it with a few other things:
This looks like a classic case of a subscriber holding on to a message and the message being delivered to another subscriber once its lease expires. The original subscriber is perhaps one that went down after it received a message, but before it could process and ack it. You could see if the backlog correlates with restarts of instances of your subscriber; if so, this is a likely culprit. You could also check that your subscribers shut down cleanly: the subscriber's StopAsync call should exit cleanly, and all messages received by your callback should be acknowledged before the subscriber application actually stops.
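To make the clean-shutdown point concrete, here is a minimal sketch (not the asker's exact service; the 25-second grace window is an assumption tied to Kubernetes' default 30-second termination period) of wiring StopAsync into the host's shutdown token so in-flight messages are acked before the pod exits:

using System;
using System.Threading;
using System.Threading.Tasks;
using Google.Cloud.PubSub.V1;

public static class GracefulShutdown
{
    // Sketch only: ensure in-flight callbacks finish and their acks reach
    // Pub/Sub before the process exits on SIGTERM.
    public static async Task RunAsync(SubscriptionName subscription, CancellationToken appStopping)
    {
        SubscriberClient subscriber = await SubscriberClient.CreateAsync(subscription);

        // StartAsync completes only after StopAsync has drained the handlers.
        Task pullTask = subscriber.StartAsync((msg, ct) =>
        {
            // ... handle the message ...
            return Task.FromResult(SubscriberClient.Reply.Ack);
        });

        // On shutdown, stop pulling; give handlers up to 25s before the
        // hard-stop token fires (inside Kubernetes' default 30s grace period).
        appStopping.Register(() =>
        {
            var hardStop = new CancellationTokenSource(TimeSpan.FromSeconds(25));
            _ = subscriber.StopAsync(hardStop.Token);
        });

        await pullTask;
    }
}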
Make sure you are using the latest version of the client. Earlier versions were subject to issues with large backlogs of small messages, though in reality the issues were not limited to those cases. This is mostly relevant if your subscribers are running up against their flow control limits.
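If flow control is the suspect, you can also set the limits explicitly instead of relying on the defaults. A hedged sketch, assuming the 2.x-era CreateAsync overloads; the numbers are illustrative, not recommendations:

using System.Threading.Tasks;
using Google.Api.Gax;
using Google.Cloud.PubSub.V1;

public static class BoundedSubscriber
{
    // Sketch only: cap how many messages a single client may lease at once,
    // so it never holds more than it can process and ack within the deadline.
    public static async Task<SubscriberClient> CreateAsync(SubscriptionName subscription)
    {
        var settings = new SubscriberClient.Settings
        {
            FlowControlSettings = new FlowControlSettings(
                maxOutstandingElementCount: 10,            // at most 10 leased messages
                maxOutstandingByteCount: 10 * 1024 * 1024) // or 10 MB, whichever is hit first
        };
        return await SubscriberClient.CreateAsync(subscription, settings: settings);
    }
}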
Do the machines on which the subscriber is running have any other tasks on them that could be using up CPU or RAM? If so, it's possible that one of the subscriber applications is starved for resources and can't process messages quickly enough.
If you are still having issues, the next best step to take is to put in a request with Cloud Support, providing the name of your project, name of your subscription, and the message ID of a message that was delayed. Support will be able to track the lifetime of a message and determine if the delivery was delayed on the server or if it was delivered multiple times and not acked.

Related

Log messages discarded when logging to Splunk using Serilog

We have a Windows service that creates a new thread and runs a scheduled task once per day. Logging is done with Serilog, and the sink is Splunk ("Serilog.Sinks.Splunk"). During a successful run we write eight information messages to the log (Log.Information("")). The messages are more or less identical from one run to another, apart from a timestamp and integer values. Four of the messages are logged before the actual job tasks are done and four after.
We have discovered that sometimes all eight messages turn up in Splunk, sometimes only the last four (those logged after the time-consuming processing has been done), and sometimes none of them.
When we add another sink, writing to file ("Serilog.Sinks.File"), we always get all eight messages in the file.
With Serilog debug logging enabled (Serilog.Debugging.SelfLog.Enable), we get the following debug message when log messages are discarded (once - not one per lost message):
"2019-08-30T11:28:03.9029821Z A status code of Forbidden was received when attempting to send to https://<>/services/collector. The event has been discarded and will not be placed back in the queue."
Adding a Sleep (System.Threading.Thread.Sleep()) as the first thing in the scheduled task, the logging done after the Sleep always reaches Splunk. So it seems it takes some time to set up the connection to the Splunk endpoint, and any messages sent before the connection is up are simply discarded. Since three of the messages are logged by an external NuGet package (Hangfire) before execution enters our code, we frequently lose those three messages, and having a Sleep() in our code isn't ideal.
Pseudo code (including the Sleep); as described above, log messages 1-3 (and 6-8) are written by an external NuGet package:
public Task DoJob()
{
    var currentRunInformation = new RunInformation();
    try
    {
        System.Threading.Thread.Sleep(3000);
        Log.Information("Log message 4");

        // Get data
        var jobData = GetJobData();
        // Do some calculations
        var calculated = DoCalculations(jobData);
        // Save result
        PersistResult(calculated);

        Log.Information("Log message 5");
        return Task.CompletedTask;
    }
    catch (Exception exception)
    {
        Log.Error(exception, "Error log");
        return Task.FromException(exception);
    }
}
Is there any way we can make the logging wait for an open connection before sending messages? Or any other options to avoid having our logging discarded in an unpredictable manner?
There's nothing out-of-the-box in Serilog.Sinks.Splunk to perform additional checks on Splunk before sending messages, or to retry messages that failed. You can track this issue to get notified if/when this ever gets implemented in the future.
Behind the scenes, the sink is simply sending HTTP POST requests to the Splunk Event Collector...
To have the behaviour that you want, you'd have to implement a variation of Serilog.Sinks.Splunk. You could probably borrow the implementation of durable log shipping from Serilog.Sinks.Seq, and store messages that failed to send in a file, and retry later...
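A stopgap short of a durable sink could be to wait until the HEC endpoint answers before emitting the first event, instead of sleeping for a fixed time. A rough sketch; the health endpoint path and the timeout are assumptions about your Splunk setup:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SplunkReadiness
{
    // Sketch only: poll Splunk's HEC health endpoint until it responds,
    // then let the application start logging. Path and timing are assumptions.
    public static async Task WaitForHecAsync(string hecBaseUrl, TimeSpan timeout)
    {
        using var http = new HttpClient();
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                var response = await http.GetAsync($"{hecBaseUrl}/services/collector/health");
                if (response.IsSuccessStatusCode) return; // HEC is ready
            }
            catch (HttpRequestException) { /* endpoint not reachable yet */ }
            await Task.Delay(TimeSpan.FromMilliseconds(500));
        }
        throw new TimeoutException("Splunk HEC did not become ready in time.");
    }
}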
ps: Funnily enough, even the code sample that shows how to use the sink has a Thread.Sleep before sending messages, to give Splunk a chance to warm up... 🙈

Azure Service Bus Topics, Subscription Lost

I'm working with Azure ServiceBus, standard tier.
I'm trying to figure out what has been happening for the last couple of weeks (it seems to have started when bus traffic increased to maybe 10-15 messages per second).
I have automatic creation of subscriptions using
subscriptionOpts.AutoDeleteOnIdle = TimeSpan.FromHours(3);
Starting from the last few weeks (when we got the traffic increase), our subscription clients sometimes stop receiving messages, and after 3 hours the subscriptions get deleted.
var messageOptions = new MessageHandlerOptions(args =>
{
    Emaillog.Warn(args.Exception, $"Client ExceptionReceived: {args.Exception}");
    return Task.CompletedTask;
})
{ AutoComplete = true };

_subscriptionClient.RegisterMessageHandler(
    async (message, token) => await OnMessageReceived(message, $"{_subscriptionClient.SubscriptionName}", token),
    messageOptions);
Is it possible that a subscription client gets disconnected and never reconnects?
I have 4-5 client processes that connect to this topic, each one with its own subscription.
When I find one of these subscriptions deleted, sometimes all of them have been deleted, sometimes only some of them.
Is it a bug? The only method I call on the subscriptionClient is RegisterMessageHandler; I don't manually manage anything else...
Thank you in advance
The AutoDeleteOnIdle property is used to delete the Subscription when no message processing happens within the Subscription for the specified time span.
As you mentioned that the message flow increased to 15 messages per second, there is no chance that the Subscription was left without message flow, so there is no reason for the Subscriptions to be deleted. The idleness of a Subscription is decided by both incoming and outgoing messages.
It may be that, due to heavy message traffic, the downstream application processing the messages went offline, leaving the messages unprocessed. Eventually, when the message flow dropped, there was no receiver processing the messages, leaving the Subscription idle for 3 hours, after which it was deleted.
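If you need to limp along until the root cause is found, a watchdog that recreates the subscription when it disappears could be a workaround. A hedged sketch, assuming a Microsoft.Azure.ServiceBus version that ships ManagementClient; names and the idle window are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

public static class SubscriptionWatchdog
{
    // Sketch only: recreate the subscription if AutoDeleteOnIdle removed it.
    // After recreating, the message handler must be registered again, since
    // the old client is attached to a subscription that no longer exists.
    public static async Task EnsureSubscriptionAsync(
        string connectionString, string topicPath, string subscriptionName)
    {
        var mgmt = new ManagementClient(connectionString);
        if (!await mgmt.SubscriptionExistsAsync(topicPath, subscriptionName))
        {
            await mgmt.CreateSubscriptionAsync(new SubscriptionDescription(topicPath, subscriptionName)
            {
                AutoDeleteOnIdle = TimeSpan.FromHours(3) // same window the client sets today
            });
            // TODO: re-register the RegisterMessageHandler callback here.
        }
        await mgmt.CloseAsync();
    }
}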

Forcing EventProcessorHost to re-deliver failed Azure Event Hub eventData's to IEventProcessor.ProcessEvents method

The application uses .NET 4.6.1 and the Microsoft.Azure.ServiceBus.EventProcessorHost NuGet package v2.0.2, along with its dependency, the WindowsAzure.ServiceBus package v3.0.1, to process Azure Event Hub messages.
The application has an implementation of IEventProcessor. When an unhandled exception is thrown from the ProcessEventsAsync method the EventProcessorHost never re-sends those messages to the running instance of IEventProcessor. (Anecdotally, it will re-send if the hosting application is stopped and restarted or if the lease is lost and re-obtained.)
Is there a way to force the event message that resulted in an exception to be re-sent by EventProcessorHost to the IEventProcessor implementation?
One possible solution is presented in this comment on a nearly identical question:
Redeliver unprocessed EventHub messages in IEventProcessor.ProcessEventsAsync
The comment suggests holding a copy of the last successfully processed event message and checkpointing explicitly using that message when an exception occurs in ProcessEventsAsync. However, after implementing and testing such a solution, the EventProcessorHost still does not re-send. The implementation is pretty simple:
private EventData _lastSuccessfulEvent;

public async Task ProcessEventsAsync(
    PartitionContext context,
    IEnumerable<EventData> messages)
{
    try
    {
        await ProcessEvents(context, messages); // does actual processing, may throw exception
        _lastSuccessfulEvent = messages
            .OrderByDescending(ed => ed.SequenceNumber)
            .First();
    }
    catch (Exception ex)
    {
        // Note: _lastSuccessfulEvent is still null if the very first batch fails.
        await context.CheckpointAsync(_lastSuccessfulEvent);
    }
}
An analysis of things in action:
A partial log sample is available here: https://gist.github.com/ttbjj/4781aa992941e00e4e15e0bf1c45f316#file-gistfile1-txt
TL;DR: The only reliable way to re-play a failed batch of events to IEventProcessor.ProcessEventsAsync is to shut down the EventProcessorHost (aka EPH) immediately, either by using eph.UnregisterEventProcessorAsync() or by terminating the process, based on the situation. This will let other EPH instances acquire the lease for this partition and start from the previous checkpoint.
Before explaining this, I want to call out that this is a great question and was indeed one of the toughest design choices we had to make for EPH. In my view, it was a trade-off between usability/supportability of the EPH framework and technical correctness.
The ideal situation would have been: when the user code in IEventProcessorImpl.ProcessEventsAsync throws an exception, the EPH library shouldn't catch it. It should have let this exception crash the process, with the crash dump clearly showing the call stack responsible. I still believe this is the most technically correct solution.
Current situation: the contract between the IEventProcessorImpl.ProcessEventsAsync API and EPH is:
As long as EventData can be received from the EventHubs service, continue invoking the user callback (IEventProcessorImplementation.ProcessEventsAsync) with the EventData, and if the user callback throws errors while being invoked, notify EventProcessorOptions.ExceptionReceived.
User code inside IEventProcessorImpl.ProcessEventsAsync should handle all errors and incorporate retries as necessary. EPH doesn't set any timeout on this callback, to give users full control over processing time.
If a specific event is the cause of trouble, mark the EventData with a special property, for example type=poison-event, and re-send it to the same EventHub (include a pointer to the actual event: copy the EventData.Offset and SequenceNumber into the new EventData's application properties), or forward it to a SERVICEBUS queue or store it elsewhere; basically, identify and defer processing the poison event (a sketch follows below).
If you have handled all possible cases and are still running into exceptions, catch them and shut down EPH, or failfast the process with this exception. When the EPH comes back up, it will start from where it left off.
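For illustration, a rough sketch of the poison-event deferral described above, using the older Microsoft.ServiceBus.Messaging types this question targets; the client instance and property names are assumptions:

using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class PoisonEventHandling
{
    // Sketch only: tag a copy of the troublesome event and re-send it to the
    // same Event Hub, keeping a pointer back to the original for analysis.
    public static async Task DeferAsync(EventHubClient eventHubClient, EventData poison)
    {
        var copy = new EventData(poison.GetBytes());
        copy.Properties["type"] = "poison-event";
        copy.Properties["original-offset"] = poison.Offset;
        copy.Properties["original-sequence-number"] = poison.SequenceNumber;
        await eventHubClient.SendAsync(copy);
    }
}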
Why does checkpointing 'the old event' NOT work (read this to understand EPH in general):
Behind the scenes, EPH is running a pump per EventHub consumer group partition receiver, whose job is to start the receiver from a given checkpoint (if present), create a dedicated instance of the IEventProcessor implementation, receive from the designated EventHub partition starting at the offset specified in the checkpoint (or, if not present, at EventProcessorOptions.InitialOffsetProvider), and eventually invoke IEventProcessorImpl.ProcessEventsAsync. The purpose of the checkpoint is to be able to reliably start processing messages when the EPH process shuts down and ownership of the partition moves to another EPH instance. So, the checkpoint is consumed only while starting the pump and will NOT be read once the pump has started.
As I am writing this, EPH is at version 2.2.10.
more general reading on Event Hubs...
Simple Answer:
Have you tried EventProcessorHost.ResetConnection(string partitionId)?
Complex Answer:
It might be an architecture problem that needs to be addressed on your end: why did the processing fail? Was it a transient error? Is retrying the processing logic a possible scenario? And so on...

NServiceBus Windows-Service: Service can't be stopped

For some time now, we haven't been able to stop an NServiceBus Windows service. If we try to, we get this exception:
Error 1061: The service cannot accept control messages at this time.
Unfortunately, I really didn't find anything about this matter except this GitHub issue: https://github.com/Particular/NServiceBus/issues/1898
Sadly, this doesn't help, since we do in fact need the IConfigureThisEndpoint interface to configure the BusConfiguration, which also isn't that long-running. We also use almost exactly the same template for other NServiceBus endpoints, which don't have any problems.
Interestingly enough, it also worked for this endpoint for quite some time, and it seems to be a problem only on one specific server.
Is there a way to find out more about the exception, be it from Microsoft or NServiceBus?
Graceful shutdown
This error can be caused by a lot of things. NServiceBus will only try to perform a graceful shutdown.
NServiceBus will not abort messages that are currently being processed but will stop processing new messages.
Log file
The log file should indicate that a shutdown is triggered so that is the first thing you can verify.
I would advise setting the log level to DEBUG to help diagnose the shutdown sequence, and also adding this:
var appDomainLogger = LogManager.GetLogger("AppDomain");
var appDomain = AppDomain.CurrentDomain;
appDomain.FirstChanceException += (o, ea) =>
{
    appDomainLogger.Debug("FirstChanceException", ea.Exception);
};
appDomain.UnhandledException += (o, ea) =>
{
    appDomainLogger.Debug("UnhandledException", ea.ExceptionObject as Exception);
};
It could be that exceptions occur that prevent shutdown and this adds additional diagnostics.
Long running messages
If, for example, a message is being processed that is waiting for a lock on a database to be released, then this message could take more time than the Windows service interface allows.
The same goes for other long-running tasks, like converting media files.
Eventually, the Windows service should shut down once all resources are freed and all messages are done processing, unless they contain a bug that prevents shutdown.
Disposing of resources
Also, during shutdown the container is disposed too. It might be that your container holds resources that have a lot of cleanup/teardown to do, for example resources that flush in-memory caches to disk or remote storage so that the service can start up faster than normal the next time.

Azure Service Bus Subscriber regularly phoning home?

We have a pub/sub application that involves an external client subscribing to a Web Role publisher via an Azure Service Bus Topic. Our current billing cycle indicates we've sent/received >25K messages, while our dashboard indicates we've sent <100. We're investigating our implementation and checking our assumptions in order to understand the disparity.
As part of our investigation we've gathered wireshark captures of client<=>service bus traffic on the client machine. We've noticed a regular pattern of communication that we haven't seen documented and would like to better understand. The following exchange occurs once every 50s when there is otherwise no activity on the bus:
The client pushes ~200B to the service bus.
10s later, the service bus pushes ~800B to the client. The client registers the receipt of an empty message (determined via breakpoint.)
The client immediately responds by pushing ~1000B to the service bus.
Some relevant information:
This occurs when our web role is not actively pushing data to the service bus.
Upon receiving a legit message from the Web Role, the pattern described above will not occur again until a full 50s has passed.
Both client and server connect to sb://namespace.servicebus.windows.net via TCP.
Our application messages are <64 KB
Questions
What is responsible for the regular, 3-packet message exchange we're seeing? Is it some sort of keep-alive?
Do each of the 3 packets count as a separately billable message?
Is this behavior configurable or otherwise documented?
EDIT:
This is the code the receives the messages:
private void Listen()
{
    _subscriptionClient.ReceiveAsync().ContinueWith(MessageReceived);
}

private void MessageReceived(Task<BrokeredMessage> task)
{
    if (task.Status != TaskStatus.Faulted && task.Result != null)
    {
        task.Result.CompleteAsync();
        // Do some things...
    }
    Listen();
}
I think what you are seeing is the Receive call in the background. Behind the scenes, the Receive calls all use long polling. That means they call out to the Service Bus endpoint and ask for a message. The Service Bus service gets that request, and if it has a message it will return it immediately. If it doesn't have a message, it will hold the connection open for a time period in case a message arrives. If a message arrives within that time frame, it will be returned to the client. If a message is not available by the end of the time frame, a response is sent to the client indicating that no message was there (aka your null BrokeredMessage). If you call Receive with no overloads (like you've done here), it will immediately make another request. This loop continues to happen until a message is received.
Thus, what you are seeing is the number of times the client requests a message but there isn't one there. The long polling makes it nicer than Windows Azure Storage Queues, which just immediately return a null result if there is no message. For both technologies it is common to implement an exponential back-off for requests; there are lots of examples out there of how to do this (one sketch follows below). This cuts back on how often you need to check the queue and can reduce your transaction count.
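For example, a hedged sketch of such a back-off around ReceiveAsync; the initial delay and the cap are illustrative:

using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class BackOffReceiver
{
    // Sketch only: after an empty receive, wait before polling again,
    // doubling the delay up to a cap; reset once real work arrives.
    public static async Task ReceiveLoopAsync(SubscriptionClient client)
    {
        var delay = TimeSpan.FromSeconds(1);
        var maxDelay = TimeSpan.FromMinutes(10);
        while (true)
        {
            BrokeredMessage message = await client.ReceiveAsync();
            if (message != null)
            {
                await message.CompleteAsync();
                // ... do some things ...
                delay = TimeSpan.FromSeconds(1);
            }
            else
            {
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
        }
    }
}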
To answer your questions:
Yes, this is normal expected behaviour.
No, this is only one transaction. For Service Bus you get charged a transaction each time you put a message on a queue and each time a message is requested (which can be a little opaque, given that Receive makes multiple calls in the background). Note that the docs point out that you get charged for each idle transaction (meaning a null result from a Receive call).
Again, you can implement a back-off methodology so that you aren't hitting the queue so often. Another suggestion I've recently heard: if you have a queue that isn't seeing a lot of traffic, you could check the queue depth to see if it is > 0 before entering the processing loop, and go back to watching the queue depth whenever a receive call returns no messages. I've not tried that, and I'd think you could get throttled if you did the queue depth check too often.
If these are your production numbers, then your subscription isn't really processing a lot of messages. It would likely be a really good idea to have a back-off policy up to whatever wait time is acceptable before a message is processed. For example, if it is okay for a message to sit for more than 10 minutes, create a back-off approach that eventually checks for a message only every 10 minutes; when it gets one, process it and immediately check again.
Oh, there is a Receive overload that takes a timeout, but I'm not 100% sure whether that is a server timeout or a local timeout. If it is local, it could still be making calls to the service every X seconds. I think this is based on the OperationTimeout value set on the MessagingFactory settings when creating the SubscriptionClient. You'd have to test that.
