Elasticsearch timeout - C#

In which scenario can I get an error like this?
The Elasticsearch service is on the same computer as the client calling it, so there is no network issue. The server has free memory, free disk space, and there is always at least 15-20% CPU free.
I'm inserting a lot of data into Elasticsearch, and it has never timed out before; yet today there are hundreds of similar errors in our logs.
Could it be because there are a lot of requests in parallel? The insertion code is heavily multi-threaded.
InternalServerError - Invalid NEST response built from a unsuccessful low level call on POST: /albums/albummetadata/f3c20bb7-8f60-5d80-fe87-449bdf3d828a/_update
# Audit trail of this API call:
- [1] BadResponse: Node: http://localhost:9200/ Took: 00:01:00.3240283
- [2] MaxTimeoutReached: Took: -736395.18:15:40.4144464
# OriginalException: System.Net.WebException: The operation has timed out
at System.Net.HttpWebRequest.GetResponse()
at Elasticsearch.Net.HttpConnection.Request[TReturn](RequestData requestData) in C:\code\elasticsearch-net\src\Elasticsearch.Net\Connection\HttpConnection.cs:line 145
# Request:
# Response:
+The operation has timed out

Are you doing bulk inserts? Are your documents large?
It sounds plausible that under heavy load inserts may time out. You could try to detect increased error rates and scale back/slow down your insertions. You could also increase the timeout threshold, but that will likely only go so far: you'll still end up with an ever-growing backlog of requests that will eventually start failing again.
Another option is to scale your ES cluster, either by increasing the specs of your current nodes or by adding more nodes.
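If you do go the route of raising the timeout, a minimal sketch of where that lives in NEST (the URI and value are illustrative, assuming a recent NEST/Elasticsearch.Net client):

using System;
using Nest;

// Raise the per-request timeout (the default is 60 seconds, which matches the
// "Took: 00:01:00" in the audit trail above).
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .RequestTimeout(TimeSpan.FromMinutes(2));

var client = new ElasticClient(settings);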

Related

Maximum number of retries (6) exceeded while executing database operations with 'CosmosExecutionStrategy'

I am working on an API development project using ASP.NET Core 2.2, GraphQL.NET, CosmosDB, and Entity Framework Core (Microsoft.EntityFrameworkCore.Cosmos v2.2.4).
While testing the API method which pulls data from Azure CosmosDB, sometimes I get this error:
Microsoft.EntityFrameworkCore.Storage.RetryLimitExceededException: 'Maximum number of retries (6) exceeded while executing database operations with 'CosmosExecutionStrategy'. See inner exception for the most recent failure.'
I am not sure why this error is popping up intermittently.
Can anyone help me here by providing some guidance to fix this issue?
I would like to know more about your context configuration, since the error says 'Maximum number of retries (6) exceeded'. This can happen if you are trying to recreate the database on every request. Assuming you have already deployed the database in CosmosDB, it is recommended to remove Database.EnsureCreated(), as it will cause performance issues.
Refer to this documentation for more information: https://learn.microsoft.com/en-us/ef/core/providers/cosmos/?tabs=dotnet-core-cli
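For illustration, this is the kind of startup code to look for (a sketch; the context type name is hypothetical):

using Microsoft.Extensions.DependencyInjection;

// In Startup.Configure (or wherever the context is first created):
using (var scope = app.ApplicationServices.CreateScope())
{
    var context = scope.ServiceProvider.GetRequiredService<AppDbContext>(); // hypothetical context type
    context.Database.EnsureCreated(); // remove this once the database/containers already exist in CosmosDB
}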
First of all, have you checked the inner exception as stated in the error?
Microsoft.EntityFrameworkCore.Storage.RetryLimitExceededException: 'Maximum number of retries (6) exceeded while executing database operations with 'CosmosExecutionStrategy'. See inner exception for the most recent failure.'
It might give a clue as to why it is failing.
Now, this error is caused by the Cosmos retry strategy. If an operation fails, it will be retried up to six times.
You can modify this strategy, but the default can be found here.
The fact that it is retried indicates it is an error that might be gone when retried. A good example is a glitch in the network connection (like when the Wi-Fi signal is bad). Another one could be that the requests are exceeding the provisioned Request Unit limits.
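To illustrate the suggestion to check the inner exception, a minimal sketch (the DbContext and query here are placeholders):

using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

try
{
    var items = await context.Items.ToListAsync(); // placeholder query
}
catch (RetryLimitExceededException ex)
{
    // The inner exception is the most recent underlying failure,
    // e.g. throttling (429) or a transient network error from CosmosDB.
    Console.WriteLine(ex.InnerException?.Message);
    throw;
}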

How to fix inconsistent and slow Google Cloud Storage response times?

I'm using Google Cloud Storage to store and retrieve some files, and my problem is that the response times I'm getting are inconsistent, and sometimes very slow.
My application is an ASP.NET Core app running in the Google Container Engine. The Container Engine cluster is in europe-west1-c. The Cloud Storage bucket is Multi-Regional, in the location EU, and it's a secure bucket (not publicly accessible). I'm using the latest version of the official Google.Cloud.Storage.V1 SDK package to access the Cloud Storage. (I tried both 1.0.0 and the new 2.0.0-beta01.) I'm using a singleton instance of the StorageClient object, which should do connection pooling under the hood.
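Roughly, the registration looks like this (a simplified sketch assuming standard ASP.NET Core DI, not the exact production code):

using Google.Cloud.Storage.V1;

// In Startup.ConfigureServices: a single StorageClient for the whole app,
// so the underlying HttpClient and its connections are reused.
services.AddSingleton(StorageClient.Create());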
I'm measuring and logging the time it takes to download a file from the Cloud Storage, this is the measurement I do.
var sw = Stopwatch.StartNew();
// ms is a MemoryStream the object is downloaded into
await client.DownloadObjectAsync(googleCloudOptions.StorageBucketName, filepath, ms);
sw.Stop(); // sw.ElapsedMilliseconds is what gets logged
So I'm directly measuring the SDK call without any of my own application logic.
The numbers I'm getting for this measurement look like this in an average period.
44ms
56ms
501ms
274ms
90ms
237ms
145ms
979ms
446ms
148ms
You can see that the variance is already pretty large to begin with (and the response time is often really sluggish).
But occasionally I even get response times like this (the slowest I've seen was over 10 seconds).
172ms
4,348ms
72ms
51ms
179ms
2,508ms
2,592ms
100ms
This is really bad considering that the file I'm downloading is ~2 KB in size, my application is doing less than 1 request per second, and it is running inside Google Cloud. I don't think a cold bucket can be the problem, since I'm mainly downloading the same handful of files and making at least a couple of requests every minute.
Does anyone know what can be the reason for this slowness, or how I could investigate what's going wrong?
Update: Following #jterrace's suggestion, I've run gsutil perfdiag on the production environment, and uploaded both the terminal output and the generated json report here.
I also collected some more measurements; here you can see the statistics for the last 7 days.
So you can see that slow requests don't happen super-often, but a response time of over half a second is not rare, and we even have a handful of requests over 5 seconds every day.
What I'd like to figure out is whether we're doing something wrong, or this is expected with Cloud Storage and we have to be prepared to be able to handle these slow responses on our side.
We have the same issue with GCS. The only answer we got (from GCS support) is to use exponential backoff.
The first request should use a 200 ms timeout, the next try 400 ms, and so on.
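A minimal sketch of that kind of backoff around the download call (the attempt limit and starting delay are illustrative, not values from GCS support):

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Google.Cloud.Storage.V1;

// Retries the download with a doubling per-attempt timeout (200 ms, 400 ms, 800 ms, ...).
static async Task DownloadWithBackoffAsync(StorageClient client, string bucket, string objectName, Stream destination)
{
    var timeout = TimeSpan.FromMilliseconds(200);
    for (var attempt = 1; ; attempt++, timeout += timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                await client.DownloadObjectAsync(bucket, objectName, destination, cancellationToken: cts.Token);
                return;
            }
            catch (OperationCanceledException) when (attempt < 5)
            {
                // Timed out; in practice you'd also reset 'destination' before retrying.
            }
        }
    }
}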
A common problem I've seen in GCE is that, because gcloud clients have a heavy DNS dependency, bursts of traffic end up being throttled by DNS queries rather than by the actual clients (storage or otherwise). I highly recommend adding etcd or some other DNS cache to your container. Any real amount of traffic in GCE will choke otherwise.

When executionTimeout is reached, what happens?

If the default is 110 seconds why do I see requests going beyond that (up to 177 seconds)?
I'd expect and hope that once time is reached the request is cancelled and resources reallocated.
I'm seeing these response times in my APM tool (Dynatrace), which instruments the code and most likely doesn't get the time from the server logs.
(Referring to: In our IIS logs, why do requests last 5 minutes and longer when executionTimeout is 110 seconds?)
Thank you
Have you considered that the requests may be getting queued on the server? If you look at the perfmon counter Requests Queued, you might see some queuing going on.
Also look at Request Wait Time to get an indication of how long the last request waited.
Can you send a screenshot of the PurePath showing the Exec Time and also the Elapsed Time column in the tree? Maybe the PurePath itself actually gets aborted by IIS after 110 s, but some asynchronous activity in your ASP.NET app is still working and was not interrupted by the IIS timeout. The PurePath tree should show that, as it shows asynchronous subpaths.
andi

MongoDB connection problems on Azure

We have an ASP.NET MVC application deployed to an Azure Website that connects to MongoDB and does both read and write operations. The application does this iteratively, a few thousand times per minute.
We initialize the C# driver using Autofac and we set the MaxConnectionIdleTime to 45 seconds as suggested in https://groups.google.com/forum/#!topic/mongodb-user/_Z8YepNHnbI and a few other places.
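Roughly, the relevant part of the setup looks like this (a simplified sketch, not the exact Autofac registration; the host name is a placeholder):

using System;
using MongoDB.Driver;

// Registered as a single instance via Autofac; MongoClient is thread-safe and pools connections.
var settings = new MongoClientSettings
{
    Server = new MongoServerAddress("<mongo-host>", 27017),   // placeholder address
    MaxConnectionIdleTime = TimeSpan.FromSeconds(45)          // as suggested in the linked thread
};
var client = new MongoClient(settings);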
We are still getting a large number of the below error:
Unable to read data from the transport connection: A connection
attempt failed because the connected party did not properly respond
after a period of time, or established connection failed because
connected host has failed to respond. Method
Message:":{"ClassName":"System.IO.IOException","Message":"Unable to
read data from the transport connection: A connection attempt failed
because the connected party did not properly respond after a period of
time, or established connection failed because connected host has
failed to respond.
We get this error while connecting to both a MongoDB instance deployed on a VM in the same datacenter/region on Azure and also while connecting to an external PaaS MongoDB provider.
I run the same code on my local computer, connect to the same DB, and I don't receive these errors. It happens only when I deploy the code to an Azure Website.
Any suggestions?
A few thousand requests per minute is a big load, and the only way to do it right is by controlling and limiting the maximum number of threads that can be running at any one time.
As there's not much information posted as to how you've implemented this, I'm going to cover a few possible circumstances.
Time to experiment...
The constants:
Items to process:
50 per second, or in other words...
3,000 per minute, and one more way to look at it...
180,000 per hour
The variables:
Data transfer rates:
How much data you can transfer per second is going to play a role no matter what we do, and this will vary throughout the day depending on the time of day.
The only thing we can do is fire off requests from more CPUs to distribute the weight of traffic we're sending back and forth.
Processing power:
I'm assuming you have this in a WebJob as opposed to having it coded inside the MVC site itself. The latter is highly inefficient and not fit for the purpose you're trying to achieve. By using a WebJob we can queue work items to be processed by other WebJobs. The queue in question is Azure Queue Storage.
Azure Queue storage is a service for storing large numbers of messages
that can be accessed from anywhere in the world via authenticated
calls using HTTP or HTTPS. A single queue message can be up to 64 KB
in size, and a queue can contain millions of messages, up to the total
capacity limit of a storage account. A storage account can contain up
to 200 TB of blob, queue, and table data. See Azure Storage
Scalability and Performance Targets for details about storage account
capacity.
Common uses of Queue storage include:
Creating a backlog of work to process asynchronously
Passing messages from an Azure Web role to an Azure Worker role
The issues:
We're attempting to complete 50 transactions per second, so each transaction should be done in under 1 second if we were utilising 50 threads. Our 45-second timeout serves no purpose at this point.
We're expecting 50 threads to run concurrently, all completing in under a second, every second, on a single CPU. (I'm exaggerating here just to make a point, but imagine downloading 50 text files every single second, processing them, then trying to shoot them back over to a colleague in the hope they'll even be ready to catch them.)
We need to have retry logic in place: if after 3 attempts an item isn't processed, it needs to be placed back into the queue. Ideally we should give the server more time to respond than just one second with each failure; let's say we gave it a 2-second break on the first failure, then 4 seconds, then 10. This will greatly increase the odds of persisting/retrieving the data we needed.
We're assuming that our MongoDB can handle this number of requests per second. If you haven't already, start looking at ways to scale it out. The issue isn't that it's MongoDB (the data layer could have been anything); it's the fact that we're making this number of requests from a single source that is the most likely cause of your issues.
The solution:
Set up a WebJob and name it EnqueueJob. This WebJob will have one sole purpose: to queue items of work to be processed in Queue Storage.
Create a Queue Storage container named WorkItemQueue; this queue will act as a trigger for the next step and kick off our scaling-out operations.
Create another WebJob named DequeueJob. This WebJob will also have one sole purpose: to dequeue the work items from the WorkItemQueue and fire off the requests to your data store.
Configure the DequeueJob to spin up once an item has been placed inside the WorkItemQueue, start 5 separate threads on each instance and, while the queue is not empty, dequeue work items for each thread and attempt to execute the dequeued job (see the sketch after these steps).
Attempt 1, if it fails, wait & retry.
Attempt 2, if it fails, wait & retry.
Attempt 3, if it fails, enqueue the item back to the WorkItemQueue.
Configure your website to autoscale out to x number of CPUs (note that your website and WebJobs share the same resources).
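A rough sketch of the enqueue and dequeue sides using the classic WindowsAzure.Storage SDK (the queue name is lowercased to satisfy Azure's naming rules; the connection string, message contents and the TryWriteToMongoAsync helper are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

// EnqueueJob: push work items into the queue.
var account = CloudStorageAccount.Parse("<storage-connection-string>"); // placeholder
var queue = account.CreateCloudQueueClient().GetQueueReference("workitemqueue");
await queue.CreateIfNotExistsAsync();
await queue.AddMessageAsync(new CloudQueueMessage("<serialized work item>"));

// DequeueJob: pull items and retry with increasing waits (2 s, 4 s, 10 s as above).
var delays = new[] { TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(4), TimeSpan.FromSeconds(10) };
CloudQueueMessage message;
while ((message = await queue.GetMessageAsync()) != null)
{
    var succeeded = false;
    for (var attempt = 0; attempt < delays.Length && !succeeded; attempt++)
    {
        succeeded = await TryWriteToMongoAsync(message.AsString); // hypothetical helper around your data layer
        if (!succeeded) await Task.Delay(delays[attempt]);
    }

    if (succeeded)
    {
        await queue.DeleteMessageAsync(message);  // done, remove it
    }
    else
    {
        // Put it back per step 3 above, then remove the original copy.
        await queue.AddMessageAsync(new CloudQueueMessage(message.AsString));
        await queue.DeleteMessageAsync(message);
    }
}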
Here's a short 10 minute video that gives an overview on how to utilise queue storages and web jobs.
Edit:
You may also be getting those errors because of two other factors, again caused by the code running inside an MVC app...
If you're compiling the application with the DEBUG attribute applied but pushing the RELEASE version instead, you could be running into issues due to the settings in your web.config. Without the DEBUG attribute, an ASP.NET web application will run a request for a maximum of 110 seconds (the default executionTimeout); if the request takes longer than this, it will abort the request.
To increase the timeout beyond the default you will need to change the httpRuntime element in your web.config...
<!-- Increase timeout to five minutes -->
<httpRuntime executionTimeout="300" />
The other thing you need to be aware of is the request timeout between the caller and your web app. If you insist on keeping the code in MVC, as opposed to extracting it and putting it into a WebJob, then you can use the following code to fire a request off to your web app and extend the timeout of the request.
string html = string.Empty;
string uri = "http://google.com"; // replace with your web app's URL
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Timeout = (int)TimeSpan.FromMinutes(5).TotalMilliseconds; // Timeout is in milliseconds
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    html = reader.ReadToEnd();
}
Are you using MongoDB in a VM? It seems to be a network problem. These kinds of transient faults are expected to occur, so the best you can do is implement a retry pattern or use a library such as Polly to do that:
var retryPolicy = Policy
    .Handle<IOException>()
    .Retry(3, (exception, retryCount) =>
    {
        // log the failure; the call is retried up to 3 times
    });

// retryPolicy.Execute(() => ...); // wrap the MongoDB call here
https://github.com/michael-wolfenden/Polly

Could awaiting network cause client timeouts?

I have a server that does work driven by an Azure queue. It is almost always at very high CPU, doing multiple tasks in parallel, and some of the tasks use Parallel.ForEach.
While the tasks run, I write analytics events to another Azure queue by calling CloudQueue.AddMessageAsync with await.
I noticed thousands of these analytics writes failing with the following error:
WebException: The remote server returned an error: (500) Internal Server Error.
I checked Azure's storage logs, and I have a fair number of PutMessage commands that take 80,000 ms end to end but only 1 ms on Azure's side. The HTTP status code I get is 500, and Azure describes the reason as a client timeout.
What I think is happening is that my code calls AddMessageAsync, and from that point my thread is released while the network stack sends the request and waits for a response. When the response arrives, a task is scheduled on the thread pool to read it and invoke my continuation. Because my server is constantly under high load, that task takes a long time to get a thread, and by then the Azure server decides this is a client timeout.
The code calling azure:
await cloudQueue.AddMessageAsync(new CloudQueueMessage(aMessageContent));
The exception:
StorageException: The remote server returned an error: (500) Internal Server Error.
Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result):11
Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions+<>c__DisplayClass4.<CreateCallbackVoid>b__3(IAsyncResult ar):45
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task):82
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task):41
AzureCommon.Data.AsyncQueueDataContext+<AddMessage>d__d.MoveNext() in c:\BuildAgent\work\14078ab89161833\Azure\AzureCommon\Data\Async\AsyncQueueDataContext.cs:60
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task):82
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task):41
AzureCommon.Storage.AzureEvent+<DispatchAsync>d__1.MoveNext() in c:\BuildAgent\work\14078ab89161833\Azure\AzureCommon\Events\AzureEvent.cs:354
WebException: The remote server returned an error: (500) Internal Server Error.
System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult):41
Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndGetResponse[T](IAsyncResult getResponseResult):44
Am I right about why this is happening? If so, would using a single-threaded synchronization context for this call be better for me?
A row from Azure storage log. You can find details about what each property means here.
<request-start-time> <operation-type> <request-status> <http-status-code> <end-to-end-latency-in-ms> <server-latency-in-ms>
2014-07-29T14:55:20.0794198Z PutMessage ClientTimeoutError 500 86929 1
Thanks.
The error 500 means that the server has received a bad request or has crashed for various other reasons. I don't believe it has to do with the high load on your threads. Please consider taking the following actions:
Check the name of the queue you are using. The name needs to be lowercase and start with a letter or number. This is a common issue that causes error 500 with no enlightening error message from the server.
Set up the retry policy of the Azure Storage SDK client, preferably an exponential retry policy (see the sketch after this list).
Make sure you are using the latest Azure Storage SDK, as the underlying protocol has recently changed to a more efficient one.
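A minimal sketch of that retry configuration, assuming the Microsoft.WindowsAzure.Storage client (the backoff delta and attempt count are illustrative):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.RetryPolicies;

var account = CloudStorageAccount.Parse("<storage-connection-string>"); // placeholder
CloudQueueClient queueClient = account.CreateCloudQueueClient();

// Retry transient failures with exponential backoff (roughly 2 s, 4 s, 8 s between attempts).
queueClient.DefaultRequestOptions.RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 3);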
'Bad Request' is a 400 error, not a 500 error. A 500 Error indicates any kind of server error, so it's perfectly reasonable to get that response, and many client-side libraries will use a 500 error code for similar types of unexpected issues.
Normally a 'client timeout' response would never make it to the client (because it timed out!). The only situation I can think of where a client timeout response could make it to the client would be if the request was more than a single network packet and the client was too slow in sending packets after the first one. This could easily be caused by CPU contention on the client device. I would recommend using a higher priority thread for listening to network responses but then immediately pass off the processing of the response to a normal priority thread. Overloaded CPU will cause all sorts of timeout issues because the code can't tell the difference between a network response not coming in soon enough and the CPU not scheduling the listener in time to receive the response (or even to send the request). Even local disk I/O and locking can timeout in these situations, depending on the underlying implementation.
