We have a Windows service that creates a new thread and runs a scheduled task once per day. Logging is done with Serilog, and the sink is Splunk ("Serilog.Sinks.Splunk"). During a successful run we write eight information messages to the log (Log.Information("")). The messages are more or less identical from one run to another, apart from a timestamp and some integer values. Four of the messages are logged before the actual job tasks are done and four after.
We have discovered that sometimes all eight messages turn up in Splunk, sometimes only the last four (those logged after the time-consuming processing has been done), and sometimes none of them.
When we add another sink writing to file ("Serilog.Sinks.File"), we always get all eight messages in the file.
After enabling Serilog debug logging (Serilog.Debugging.SelfLog.Enable), we get the following debug message when log messages are discarded (once, not once per lost message):
"2019-08-30T11:28:03.9029821Z A status code of Forbidden was received when attempting to send to https://<>/services/collector. The event has been discarded and will not be placed back in the queue."
If we add a Sleep (System.Threading.Thread.Sleep()) as the first thing in the scheduled task, the messages logged after the Sleep always reach Splunk, so it seems it takes some time to set up the connection to the Splunk endpoint and any messages sent before the connection is up are simply discarded. Since three of the messages are logged by an external NuGet package (Hangfire) before execution enters our code, we frequently lose those three messages, and it isn't ideal to have a Sleep() in our code.
Pseudo code (including the Sleep); as described above, log messages 1-3 (and 6-8) are written by the external NuGet package:
public Task DoJob()
{
    var currentRunInformation = new RunInformation();
    try
    {
        System.Threading.Thread.Sleep(3000);

        Log.Information("Log message 4");

        // Get data
        var jobData = GetJobData();
        // Do some calculations
        var calculated = DoCalculations(jobData);
        // Save result
        PersistResult(calculated);

        Log.Information("Log message 5");
        return Task.CompletedTask;
    }
    catch (Exception exception)
    {
        Log.Error(exception, "Error log");
        return Task.FromException(exception);
    }
}
Is there any way we can make the logging wait for an open connection before sending messages? Or are there any other options to avoid having our log messages discarded in this unpredictable manner?
There's nothing out-of-the-box in Serilog.Sinks.Splunk to perform additional checks on Splunk before sending messages, or to retry messages that failed. You can track this issue to get notified if/when this ever gets implemented in the future.
Behind the scenes, the sink is simply sending HTTP POST requests to the Splunk Event Collector...
To get the behaviour you want, you'd have to implement a variation of Serilog.Sinks.Splunk. You could probably borrow the durable log-shipping implementation from Serilog.Sinks.Seq, store messages that failed to send in a file, and retry them later...
PS: Funnily enough, even the code sample that shows how to use the sink has a Thread.Sleep before sending messages, to give Splunk a chance to warm up... 🙈
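As a stopgap until something like that exists, one option is to probe the Event Collector at startup instead of using a fixed Thread.Sleep. The sketch below is only an illustration: it assumes the standard /services/collector/health endpoint is enabled on your HEC instance, the names (SplunkWarmup, WaitForHecAsync) are made up, and it only helps if the failures really are a warm-up issue rather than, say, a token problem behind the Forbidden response.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SplunkWarmup
{
    // Poll the Splunk HTTP Event Collector health endpoint until it answers,
    // or give up after the supplied timeout.
    public static async Task WaitForHecAsync(string hecBaseUrl, TimeSpan timeout)
    {
        using (var http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) })
        {
            var deadline = DateTime.UtcNow + timeout;
            while (DateTime.UtcNow < deadline)
            {
                try
                {
                    var response = await http.GetAsync(hecBaseUrl.TrimEnd('/') + "/services/collector/health");
                    if (response.IsSuccessStatusCode)
                    {
                        return; // collector reachable, safe to start logging
                    }
                }
                catch (HttpRequestException)
                {
                    // Collector not reachable yet; keep polling.
                }
                await Task.Delay(500);
            }
        }
    }
}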
Related
We have some issues with messages from Azure Service Bus being read multiple times. Previously we had the same issue, which turned out to be caused by the lock timing out: as the lock timed out, the messages were read again and their deliveryCount increased by 1 each time a message was read. After this, we set the max delivery count to 1 to avoid resending of messages, and also increased the lock timeout to 5 minutes.
The current issue is a lot more strange.
First, messages are read at 10:45:34. Message locks are set to 10:50:34, and deliveryCount is 1. The reading says it succeeds, at 10:45:35.0. All good so far.
But then, at 10:45:35.8, the same messages are read again! And the delivery count is still 1. Both the sequence number and message id are the same in the two receive logs. This happens for a very small percentage of messages, something like 0.02%.
From what I understand, reading a message should either result in a success where the message should be removed, or an increase of deliveryCount, which in my case should send the message to DLQ. In these cases, neither happens.
I'm using ServiceBusTrigger, like this:
[FunctionName(nameof(ReceiveMessages))]
public async Task Run([ServiceBusTrigger(queueName: "%QueueName%", Connection = "ServiceBusConnectionString")]
string[] messages,
This seems like a bug in either the service bus or the library. Any thoughts on what it could be?
That’s not the SDK but rather the specific entity. It sounds like the entity is corrupted. Delete and recreate it. If that doesn’t help, then open a support case.
On a different note, a max delivery count of 1 is most of the time an indicator of something being off. If you truly need an at-most-once delivery guarantee, use ReceiveAndDelete mode instead of PeekLock.
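For reference, with the newer Azure.Messaging.ServiceBus SDK the receive mode is set on the receiver options. This is just a sketch using the plain SDK rather than the ServiceBusTrigger from the question, and the queue name and connection string are placeholders:

using Azure.Messaging.ServiceBus;

// At-most-once: the message is deleted as soon as it is received,
// so it is never redelivered (but it is lost if processing fails).
await using var client = new ServiceBusClient("<connection-string>");
ServiceBusReceiver receiver = client.CreateReceiver("<queue-name>", new ServiceBusReceiverOptions
{
    ReceiveMode = ServiceBusReceiveMode.ReceiveAndDelete
});

ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();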
So we've been using Pub/Sub for receiving Google Cloud Build (GCB) events for a while.
We have 4 subscribers to our subscription, so they can split the workload.
The subscribers are identical and written using the official C# client
The subscribers use the default settings; we configure it so that only 1 thread is pulling.
They are running as a HostedService in AspNetCore inside Kubernetes.
The subscriber application has only that one responsibility
This application is deployed a couple of times every week, since it's bundled with a more heavily used API.
The issue we are facing is this:
When looking at our Kibana logs we sometimes see what appears to be a delay of one or more minutes in handling the Pub/Sub message (notice that QUEUED has a later timestamp than WORKING).
However, looking at the publishTime it is clear that the problem is not that the event is published later, but rather that it is handled by our code later.
Now if we look at the PubSub graphs we get:
This confirms that there indeed WAS an incident where messages were not acked.
This explains why we are seeing the delayed handling of the message :).
But it does not explain WHY we appear to exceed the deadline of 60 seconds.
There are no errors / exceptions anywhere to be found
We are using the C# client in a standard way (defaults)
Now here is where it gets interesting: I discovered that if I do a PURGE of messages using the Google UI, everything seems to run smoothly for a while (1-3 days). But then it happens again.
Now if we look at the metrics across all the instances when the issue occurs (this is from another incident), we are at no point over 200 ms of computation time:
Thoughts:
We are misunderstanding something basic about the pubsub ack configuration
Maybe the deploys we do somehow lead the subscription to think that there are still active subscribers, so it waits for them to fail before trying the next subscriber? The reaction to the PURGE points in this direction, but I have no way of inspecting how many subscribers are currently registered with the subscription, and I can't see a bug in the code that could cause this.
Looking at the metrics, the problem is not with our code. However, there might be something with the official client's default config, or a bug in the client.
I'm really puzzled, and I'm missing insight into what is going on inside the Pub/Sub clusters and the official client. Some tracing from the client would be nice, or query tools for Pub/Sub like the ones we have for our Kafka clusters.
The code:
public class GoogleCloudBuildHostedService : BackgroundService
{
    ...

    private async Task<SubscriberClient> BuildSubscriberClient()
    {
        var subscriptionToUse = $"{_subscriptionName}";
        var subscriptionName = new SubscriptionName(_projectId, subscriptionToUse);
        var settings = new SubscriberServiceApiSettings();
        var creationSettings = new SubscriberClient.ClientCreationSettings(1,
            credentials: GoogleCredentials.Get().UnderlyingCredential.ToChannelCredentials(),
            subscriberServiceApiSettings: settings);
        return await SubscriberClient.CreateAsync(subscriptionName, creationSettings);
    }

    protected override async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        await Task.Yield();
        cancellationToken.Register(() => _log.Info("Consumer thread stopping."));

        while (cancellationToken.IsCancellationRequested == false)
        {
            try
            {
                _log.Info("Consumer starting...");
                var client = await BuildSubscriberClient();
                await client.StartAsync((msg, ct) =>
                {
                    using (eventTimer.NewTimer())
                    {
                        try
                        {
                            ...
                        }
                        catch (Exception e)
                        {
                            _log.Error(e);
                        }
                    }
                    return Task.FromResult(SubscriberClient.Reply.Ack);
                });
                await client.StopAsync(cancellationToken);
                await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
            }
            catch (Exception e)
            {
                _log.Info($"Consumer failed: {e.Message}");
            }
        }
        _log.Info("Consumer stopping...");
    }
}
Hope someone out there in the great big void can enlighten me :).
Kind regards
Christian
UPDATE
So I looked into one of the cases again, and here below we see:
the same instance of the application handling messages from the same topic and subscription.
there's 1 client thread only configured
Notice that at 15:23:04 and 15:23:10 two messages are handled at their time of publication. Then, two minutes later, a message that was published at 15:23:07 is handled, and in the meantime two other messages are being handled.
So why is a message published at 15:23:07 not handled until 15:25:25, when other messages arrive in the meantime?
This can happen for different reasons, and finding and troubleshooting the root cause is not a trivial task.
Possible latency reasons
Regarding latency, it is normal for subscriptions to have backlogged messages if they are not consuming messages fast enough or have not finished working through the backlog.
I would start by reading the following documentation, which mentions some reasons why you might be exceeding the deadline in some cases.
Another reason for message latency might be an increase in message or payload size. Check whether all of your messages are more or less the same size, or whether the ones handled with delay are bigger.
Handling message failures
I would also suggest taking a read here, where it talks about how to handle message failures by setting a subscription retry policy or forwarding undelivered messages to a dead-letter topic (also known as a dead-letter queue).
Good practices
This article contains some good tips and tricks for understanding how latency and stuck messages can happen, and some suggestions that can help improve that.
xBurnsed offered some good advice and links, so let me just supplement it with some other things:
This looks like a classic case of a subscriber holding on to a message and then, upon its lease expiring, the message being delivered to another subscriber. The original subscriber is perhaps one that went down after it received a message but before it could process and ack it. You could see if the backlog correlates with restarts of instances of your subscriber; if so, this is a likely culprit. You could also check that your subscribers are shutting down cleanly, i.e. the subscriber's StopAsync call exits cleanly and you have acknowledged all messages received by your callback before actually stopping the subscriber application (a minimal sketch of a clean shutdown follows after this list).
Make sure you are using the latest version of the client. Earlier versions were subject to issues with large backlogs of small messages, though in reality the issues were not limited to those cases. This is mostly relevant if your subscribers are running up against their flow control limits.
Do the machines on which the subscriber is running have any other tasks running on them that could be using up CPU or RAM? If so, it's possible that one of the subscriber applications is starved for resources and can't process messages quickly enough.
If you are still having issues, the next best step to take is to put in a request with Cloud Support, providing the name of your project, name of your subscription, and the message ID of a message that was delayed. Support will be able to track the lifetime of a message and determine if the delivery was delayed on the server or if it was delivered multiple times and not acked.
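To illustrate the clean-shutdown point above, here is a minimal sketch (names and the timeout value are illustrative) of wiring the host's stopping token to SubscriberClient.StopAsync, so the client stops pulling, finishes in-flight callbacks and flushes their acks before the process exits:

protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    var client = await BuildSubscriberClient();

    // When the host shuts down, ask the client to stop. StartAsync then completes
    // once in-flight callbacks have finished and their acks have been sent.
    using (stoppingToken.Register(() => { _ = client.StopAsync(TimeSpan.FromSeconds(30)); }))
    {
        await client.StartAsync((msg, ct) =>
        {
            // ... handle the message ...
            return Task.FromResult(SubscriberClient.Reply.Ack);
        });
    }
}

Note that in the code from the question, the StopAsync call placed after await client.StartAsync(...) is only reached once the client has already stopped, so on shutdown the client may never actually be told to stop and the pod can be killed mid-processing.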
I have a .NET console application that uses the official Rollbar .NET API to post log messages to Rollbar asynchronously. The C# code used to send each message looks like:
RollbarLocator.RollbarInstance.Log(ErrorLevel.Error, "test message");
I've noticed that if my application terminates shortly after an async message is sent, that message often won't make it to Rollbar -- evidently because the message is still in a pending state at the time of termination.
All pending messages generally will get sent successfully if I have my application sleep for several seconds just before exiting:
// Give outgoing async Rollbar messages time to send before we exit.
System.Threading.Thread.Sleep(6000);
However, obviously, that approach is not terribly elegant.
The docs do briefly touch on this situation:
However, in some specific situations (such as while logging right before exiting an application), you may want to use a logger fully synchronously so that the application does not quit before the logging completes (including subsequent delivery of the corresponding payload to the Rollbar API).
In my scenario, however, I don't necessarily know whether the program is about to exit at the time that I am logging a given message.
I could log a debug-level "Exiting now!" message synchronously just before the program terminates; however, it's not clear from the documentation whether doing so also causes any pending async messages to be sent.
Is there an elegant way to guarantee that all pending async messages sent to Rollbar actually have been sent (or have timed out) before my program terminates?
You need to use a blocking logger in this case. For example:
RollbarLocator.RollbarInstance
.AsBlockingLogger(TimeSpan.FromSeconds(6))
.Log(ErrorLevel.Error, "test message");
For more details, please, look here:
https://docs.rollbar.com/docs/basic-usage#blocking-vs-non-blocking-logging
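Applied to the scenario in the question, one option is to route that final message through the blocking logger so the process waits (up to the timeout) before terminating. Whether this also flushes earlier fire-and-forget payloads is an assumption you would need to verify against your SDK version:

// Sketch only: final synchronous log before exit, waiting up to 6 seconds.
RollbarLocator.RollbarInstance
    .AsBlockingLogger(TimeSpan.FromSeconds(6))
    .Log(ErrorLevel.Debug, "Exiting now!");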
I'm new to Service Bus and not able to figure this out.
Basically, I'm using an Azure Function app which is hooked onto a Service Bus queue. Let's say a trigger is fired from the service bus and I receive a message from the queue, and in the processing of that message something goes wrong in my code. In such cases, how do I make sure to put that message back in the queue? Currently it just disappears into thin air, and when I restart my function app in VS, the next message from the queue is taken.
Ideally, only when all my data processing is done and I hit myMsg.Success() do I want it to be removed from the queue.
public static async Task RunAsync([ServiceBusTrigger("xx", "yy", AccessRights.Manage)] BrokeredMessage mySbMsg, TraceWriter log)
{
    try
    {
        // do something with mySbMsg
    }
    catch
    {
        // put that mySbMsg back in the queue so it doesn't disappear, and throw exception
    }
}
I was reading up on mySbMsg.Abandon(), but it looks like that puts the message in the dead-letter queue and I am not sure how to access it, or whether there is a better way to handle the error.
Cloud queues are a bit different than in-memory queues because they need to be robust to the possibility of the client crashing after it received the queue message but before it finished processing the message.
When a queue message is received, the message becomes "invisible" so that other clients can't pick it up. This gives the client a chance to process it, and the client must mark it as completed when it is done (Azure Functions will do this automatically when you return from the function). That way, if the client were to crash in the middle of processing the message (we're on the cloud, so be robust to random machine crashes due to power loss, etc.), the server will see the absence of the completion, assume the client crashed, and eventually resend the message.
Practically, this means that if you receive a queue message and throw an exception (and thus we don't mark the message as completed), it will be invisible for a few minutes, but then it will show up again and another client can attempt to handle it. Put another way: in Azure Functions, queue messages are automatically retried after exceptions, but the message will be invisible for a few minutes in between retries.
If you want the message to remain on the queue to be retried, the function should not swallow the exception, but rather throw it. That way the Function will not auto-complete the message, and it will be retried.
Keep in mind that this will cause the message to be retried and eventually, if the exception persists, to be moved to the dead-letter queue.
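A minimal sketch of that rethrow approach (the queue name, connection setting and logger type are illustrative and may differ from your Functions version):

[FunctionName("ProcessQueueMessage")]
public static void Run(
    [ServiceBusTrigger("myqueue", Connection = "ServiceBusConnectionString")] string myQueueItem,
    ILogger log)
{
    try
    {
        // ... do something with myQueueItem ...
    }
    catch (Exception ex)
    {
        log.LogError(ex, "Processing failed; rethrowing so the message is retried.");
        throw; // do not swallow: the runtime will not complete the message, so it becomes visible again
    }
}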
As per my understanding, I think what you are looking for is that if there is an error in processing the message, it should retry the execution instead of swallowing the error. If you are using Azure Functions v2.0, you define the message handler options in host.json:
"extensions": {
"serviceBus": {
"prefetchCount": 100,
"messageHandlerOptions": {
"autoComplete": false,
"maxConcurrentCalls": 1
}
}
}
prefetchCount - Gets or sets the number of messages that the message receiver can simultaneously request.
autoComplete - Whether the trigger should automatically call complete after processing, or if the function code will manually call complete.
After retrying the message n times (the max delivery count defaults to 10), it will be moved to the DLQ.
We have a pub/sub application that involves an external client subscribing to a Web Role publisher via an Azure Service Bus topic. Our current billing cycle indicates we've sent/received >25K messages, while our dashboard indicates we've sent <100. We're investigating our implementation and checking our assumptions in order to understand the disparity.
As part of our investigation we've gathered wireshark captures of client<=>service bus traffic on the client machine. We've noticed a regular pattern of communication that we haven't seen documented and would like to better understand. The following exchange occurs once every 50s when there is otherwise no activity on the bus:
The client pushes ~200B to the service bus.
10s later, the service bus pushes ~800B to the client. The client registers the receipt of an empty message (determined via breakpoint.)
The client immediately responds by pushing ~1000B to the service bus.
Some relevant information:
This occurs when our web role is not actively pushing data to the service bus.
Upon receiving a legit message from the Web Role, the pattern described above will not occur again until a full 50s has passed.
Both client and server connect to sb://namespace.servicebus.windows.net via TCP.
Our application messages are <64 KB
Questions
What is responsible for the regular, 3-packet message exchange we're seeing? Is it some sort of keep-alive?
Do each of the 3 packets count as a separately billable message?
Is this behavior configurable or otherwise documented?
EDIT:
This is the code that receives the messages:
private void Listen()
{
    _subscriptionClient.ReceiveAsync().ContinueWith(MessageReceived);
}

private void MessageReceived(Task<BrokeredMessage> task)
{
    if (task.Status != TaskStatus.Faulted && task.Result != null)
    {
        task.Result.CompleteAsync();
        // Do some things...
    }
    Listen();
}
I think what you are seeing is the Receive call in the background. Behind the scenes the Receive calls all use long polling, which means they call out to the Service Bus endpoint and ask for a message. The Service Bus service gets that request and, if it has a message, returns it immediately. If it doesn't have a message, it holds the connection open for a time period in case a message arrives; if a message arrives within that time frame, it is returned to the client. If no message is available by the end of the time frame, a response is sent to the client indicating that no message was there (aka your null BrokeredMessage). If you call Receive with no overloads (like you've done here), it will immediately make another request. This loop continues to happen until a message is received.
Thus, what you are seeing is the number of times the client requests a message but there isn't one there. The long polling makes this nicer than Windows Azure Storage Queues, which immediately return a null result when there is no message. For both technologies it is common to implement an exponential back-off for requests; there are lots of examples out there of how to do this (one sketch is below). This cuts back on how often you need to go check the queue and can reduce your transaction count.
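A rough sketch of such a back-off loop, using the same SubscriptionClient/BrokeredMessage API as in the question (the delay values are illustrative):

private void ListenWithBackOff(CancellationToken cancellation)
{
    var delay = TimeSpan.FromSeconds(1);
    var maxDelay = TimeSpan.FromMinutes(10);

    while (!cancellation.IsCancellationRequested)
    {
        // Long poll: the service holds the request open for up to 30 seconds.
        BrokeredMessage message = _subscriptionClient.Receive(TimeSpan.FromSeconds(30));
        if (message != null)
        {
            message.Complete();
            // Do some things...
            delay = TimeSpan.FromSeconds(1); // reset the back-off after real work
        }
        else
        {
            // Nothing there: wait before asking again, doubling the delay up to the cap.
            Thread.Sleep(delay);
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
        }
    }
}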
To answer your questions:
Yes, this is normal expected behaviour.
No, this counts as only one transaction. For Service Bus you get charged a transaction each time you put a message on a queue and each time a message is requested (which can be a little opaque given that Receive makes multiple calls in the background). Note that the docs point out that you get charged for each idle transaction (meaning a null result from a Receive call).
Again, you can implement a back-off methodology so that you aren't hitting the queue so often. Another suggestion I've recently heard: if you have a queue that isn't seeing a lot of traffic, you could also check the queue depth to see if it is > 0 before entering the processing loop, and if you get no messages back from a receive call you could go back to watching the queue depth (a sketch of this check follows below). I've not tried that, and I'd think it is possible that you could get throttled if you did the queue depth check too often.
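For the queue-depth idea, a sketch using NamespaceManager from the same SDK (the topic, subscription and connection string names are placeholders), with the polling interval left as something to tune so you don't get throttled:

var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

// Only start receiving when the subscription actually has messages waiting.
var subscription = namespaceManager.GetSubscription("<topic-path>", "<subscription-name>");
if (subscription.MessageCount > 0)
{
    // enter the receive loop shown above
}
else
{
    // sleep for a while and check the depth again
}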
If these are your production numbers, then your subscription isn't really processing a lot of messages. It would likely be a really good idea to have a back-off policy up to whatever wait time is acceptable before a message gets processed. For example, if it is okay for a message to sit for more than 10 minutes, create a back-off approach that eventually just checks for a message every 10 minutes; then, when it gets one, process it and immediately check again.
Oh, there is a Receive overload that takes a timeout, but I'm not 100% sure whether that is a server-side or a local timeout. If it is local, then it could still be making calls to the service every X seconds. I think this is based on the OperationTimeout value set on the MessagingFactory settings when creating the SubscriptionClient. You'd have to test that.