StackExchange.Redis - Unexplained time-out exception issue - C#

We are experiencing issues in our integration within .NET Core 3.1 with the Azure Redis Cache.
The exception thrown is
An unhandled exception has occurred while executing the request.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response
(outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms),
command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync,
ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380,
mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx,
IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767),
v: 2.1.58.34321 (Please take a look at this article for some common client-side
issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Yes, I have already read that article, and we are using the latest available version of the StackExchange.Redis NuGet package. Steps we have already taken:
Set the minimum thread pool count with several values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
Increased the Redis timeout value from the default 5 seconds to 15 seconds (going any higher will not solve it, I think, as you will read a bit further down :)). A sketch of both settings is shown below.
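For reference, a minimal sketch of how these two settings can be applied at startup; the syncTimeout/asyncTimeout connection string options are our assumption of how the client timeout is usually raised, and the concrete values are illustrative rather than our exact configuration:

    // Raise the worker and IOCP thread minimums before any Redis traffic starts (sketch).
    ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);

    // Raise the StackExchange.Redis timeouts via the connection string (milliseconds).
    var connectionString =
        "redis-scr-mns-dev.redis.cache.windows.net:6380," +
        "password=...,ssl=True,abortConnect=False," +
        "syncTimeout=15000,asyncTimeout=15000";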
What is the setup you ask?
.NET Core 3.1 REST API running on the latest IIS with a 3 worker threads setting, on a 4-core Windows server with 16GB of RAM (we don't see any extremes in the monitoring regarding CPU or memory)
Connected to Azure Redis Cache, currently running a Basic C5 with high network bandwidth and 23GB of memory (it was a lower tier before, so we tried scaling it up)
Pushing requests to an Azure Service Bus at the end (no problems there)
A batch process is running and processing a couple of tens of thousands of API calls (across several APIs), of which the one mentioned above is crashing against the Redis Cache with the timeout exception. The other APIs are running correctly and not timing out, but they are currently connected to a different Redis cache (just to isolate this API's behavior)
All APIs and batch programs use a custom NuGet package that contains the cache implementation, so we are sure it can't be an implementation issue in that one API; it is all shared code.
How do we use the cache? Via dependency injection we inject ISharedCacheStore, which is just our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, which is the implementation using Redis (ISharedCacheStore is there for future use of other caching mechanisms). A sketch of this wrapper is shown below.
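For context, a minimal sketch of what such an async-only wrapper over IDistributedCache could look like; the member names are illustrative assumptions, not our exact code:

    // Async-only cache abstraction layered on top of IDistributedCache (illustrative).
    public interface ISharedCacheStore
    {
        Task<string> GetStringAsync(string key, CancellationToken token = default);
        Task SetStringAsync(string key, string value,
            DistributedCacheEntryOptions options, CancellationToken token = default);
    }

    // Redis-backed implementation that simply delegates to the injected IDistributedCache.
    public class SharedCacheStore : ISharedCacheStore
    {
        private readonly IDistributedCache _cache;

        public SharedCacheStore(IDistributedCache cache) => _cache = cache;

        public Task<string> GetStringAsync(string key, CancellationToken token = default)
            => _cache.GetStringAsync(key, token);

        public Task SetStringAsync(string key, string value,
            DistributedCacheEntryOptions options, CancellationToken token = default)
            => _cache.SetStringAsync(key, value, options, token);
    }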
We use Microsoft.Extensions.Caching.StackExchangeRedis version 3.1.5, and the registration in startup is:
services.Configure<CacheConfiguration>(options =>
        configuration?.GetSection("CacheConfiguration").Bind(options))
    .AddStackExchangeRedisCache(s =>
    {
        s.Configuration = connectionString;
    })
    .AddTransient<IRedisCache, RedisCache>()
    .AddTransient<ISharedCacheStore, SharedCacheStore>();
We are out of ideas, to be honest. We don't see an issue with the Redis Cache instance in Azure, as it is not even near its limits when we get the time-outs. Server load hit about 80% on the lower pricing tier, and on the current, higher tier it doesn't even reach 10%.
According to Insights, we had about 4,000 cache hits per minute on the run we did, causing roughly that 10% server load.
UPDATE: It is worth mentioning that the batch and the API are currently running in an on-premise environment, not in the cloud. The move to the cloud is planned for the upcoming months.
This also applies to the other APIs that connect to a Redis Cache and are NOT giving an issue.
Comparison
Another Azure Redis cache is getting 45K hits a minute without giving any issue whatsoever (from on-premise)
This one is hitting the timeout mark without even reaching 10K hits per minute

There are a couple of possible things here:
I don't know what that EVAL is doing; it could be that the Lua being executed is causing a blockage; the only way to know for sure would be to look at SLOWLOG (see the sketch at the end of this answer), but I don't know whether this is exposed on Azure Redis
It could be that your payloads are saturating the available bandwidth - I don't know what you are transferring
It could simply be a network/socket stall/break; they happen, especially with cloud - and the (relatively) high latency makes this especially painful
We want to enable a new optional pooled (rather than multiplexed) model; this would in theory (the proof-of-concept worked great) avoid large backlogs, which means even if a socket fails: only that one call is impacted, rather than causing a cascade of failure; the limiting factor on this one is our time (and also, this needs to be balanced with any licensing implications from the redis provider; is there an upper bound on concurrent connections, for example)
It could simply be a bug in the library code; if so, we're not seeing it here, but we don't use the same setup as you; we do what we can, but it is very hard to diagnose problems that we don't see, that only arise in someone else's at-cost setup that we can't readily replicate; plus ultimately: this isn't our day job :(
I don't think there's a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios, that take a lot of investigation. And simply: the Azure folks don't pay for our time.
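Following up on the SLOWLOG point above, a minimal sketch of how it could be queried from StackExchange.Redis, assuming the command is enabled and exposed on the server (which may not be the case on Azure Cache for Redis); the connection details are illustrative:

    // Inspect the slowest recent commands recorded by the Redis server (sketch).
    var muxer = await ConnectionMultiplexer.ConnectAsync(
        "redis-scr-mns-dev.redis.cache.windows.net:6380,password=...,ssl=True");
    var server = muxer.GetServer("redis-scr-mns-dev.redis.cache.windows.net", 6380);

    // Equivalent of SLOWLOG GET 10: the ten slowest recent commands, if allowed.
    foreach (var entry in await server.SlowlogGetAsync(10))
    {
        Console.WriteLine(
            $"{entry.Time:u} {entry.Duration.TotalMilliseconds}ms {string.Join(" ", entry.Arguments)}");
    }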

So, we found the cause of this.
The issue sits in the registration of our classes, which was AddTransient, as shown in the original code above.
When we changed this to AddScoped, performance was a lot faster. We are even wondering whether it could be a singleton.
The weird thing is that AddTransient should mainly increase 'connected clients', which it does as a matter of fact, but it apparently has a bigger impact on the number of requests that can be handled as well, since we never reached the max connections limit during processing.
    .AddScoped<IRedisCache, RedisCache>()
    .AddScoped<ISharedCacheStore, SharedCacheStore>();
With this code instead of AddTransient, we did 220,000 operations in a 4-5 minute period without any issues, whereas with the old code we didn't even reach 40,000 operations because of timeout exceptions.
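On the singleton question above: a minimal sketch of what that registration could look like, assuming SharedCacheStore and RedisCache hold no per-request state (the IDistributedCache added by AddStackExchangeRedisCache is itself typically a singleton sharing one connection):

    services.AddStackExchangeRedisCache(s => s.Configuration = connectionString);
    // Singleton variant (sketch); only valid if these implementations are thread-safe.
    services.AddSingleton<IRedisCache, RedisCache>();
    services.AddSingleton<ISharedCacheStore, SharedCacheStore>();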

Related

Empty ASP.NET Core (.NET 5.0) project keeps using more RAM for each request done against it

I've come across a memory issue I cannot seem to find the source of.
Even an empty .NET 5 ASP.NET Core project will balloon by multiple MB per second if you run a continuous stream of HTTP requests against it.
It will also not settle once you stop sending requests.
I've used my rudimentary understanding of the memory snapshot tool to find the cause, but it all seems to be internal objects like Action, AsyncMethodBuilderCore+ContinuationWrapper, Task, IPAddress, IPEndPoint, CancellationTokenSource and so on and so forth.
Searching online for the same issue hasn't brought me anywhere besides general programming recommendations against memory leaks.
Steps to reproduce:
Create ASP.NET Core empty .NET 5.0 project in Visual Studio 2019
Compile the default solution
Use another program/script to run continuous HTTP requests against it
Observe the memory usage growing steadily
It turns out that patience is key here.
I still don't know exactly what it is, but after almost 10 minutes of runtime, and depending on unknown circumstances, at anywhere between ~370MB and ~800MB, any project without other memory allocations will stop ballooning upwards on every request.
Interestingly enough, around that 800MB limit I started to encounter random connection-closed errors, both in a pure .NET loop of System.Net.WebRequest.Create("http://127.0.0.1:5000/").GetResponse(); as well as in while ($True){([System.Net.WebRequest]::Create("http://127.0.0.1:5000/")).GetResponse() | Out-Null} in PowerShell.
While I've been writing this answer I've also repeatedly subjected the running server to the treatment of continuous connections and intermittent downtime, and so far it has stayed stable at 800MB.
Whatever it is doing it seems to be pooling connections/sockets for reuse, as the usage neither drops nor rises now.
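For anyone who wants to reproduce the load described above, a minimal sketch of a request loop using HttpClient against the default Kestrel endpoint; everything apart from the URL is an illustrative assumption:

    // Hammer the local server with sequential requests to observe memory growth (sketch).
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class LoadLoop
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            while (true)
            {
                using var response = await client.GetAsync("http://127.0.0.1:5000/");
                Console.WriteLine($"{DateTime.Now:HH:mm:ss} {(int)response.StatusCode}");
            }
        }
    }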

Azure Table/Blob/Queue random Timeout on linux system (k8s .net core 3 app)

This is my scenario:
Microsoft.Azure.Storage.Blob 11.2.0
Microsoft.Azure.Storage.Queue 11.2.0
Microsoft.Azure.Cosmos.Table 1.0.7
I've moved a lot of my code from Azure Functions to Google k8s and Google Cloud, running the .NET Core app, basically with the same library built in .NET Standard 2.0, without any problems.
After a few days, I noticed a different behavior on the Linux system.
A few calls interacting with Azure services (blob, table, queue) get timeouts (the subsystem appears to fail; I tried different retry policies with the same result).
In 10,000 calls I get 10 to 50 errors (or very long calls of 180 seconds, before I changed the timeouts). This happens in all Azure services: table, blob and queue.
I tried different solutions to find out why:
I instantiate the client (blobClient, TableClient, etc.) on every call, or recycle the same client, but there is no difference
I changed all the timeouts to handle this behavior. I worked on ServerTimeout and MaximumExecutionTime and put a layer on top with my own retry mechanism, so I can minimize the errors. Now I have "only" a few calls of 20 seconds (instead of 2-3 seconds, for example).
I tried all the solutions for similar problems found on Stack Overflow :D ... but nothing works (for now)
The same DLL code runs on Azure Functions without any problems.
So I came to the conclusion that there is something in the HTTP client, used internally by the Azure SDK, that depends on the operating system you are running your code on.
After a few articles I think it may be the Keep-Alive header, so I tried, in my composition root:
ServicePointManager.SetTcpKeepAlive(true, 120000, 10000);
but nothing changed.
Any ideas or suggestions? ... Maybe I'm on the wrong path, or I've missed something.
UPDATE
After reading the last article linked by @KrishnenduGhosh-MSFT in the last comment, I tried changing this setting:
ServicePointManager.DefaultConnectionLimit = 100;
This was the turning point.
Since it used to happen randomly, I'm still not 100% sure the problem is solved.
But after 50k calls, I'm pretty optimistic. Obviously it will behave differently in production, but I already expect that :)
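For completeness, a sketch of the two ServicePointManager knobs mentioned in this question, applied once in the composition root before any storage calls (the values are the ones quoted above):

    // Applied once at startup, before any HTTP calls to Azure storage (sketch).
    ServicePointManager.DefaultConnectionLimit = 100;          // raise the per-endpoint connection cap
    ServicePointManager.SetTcpKeepAlive(true, 120000, 10000);  // keep idle sockets alive (values in ms)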
UPDATE 2 - AFTER PUBLISH IN PROD
In the end, it didn't work :(
I had written this in the comments, but it seems fair to update it here (more readable).
I still have long calls (cut short by MaximumExecutionTime), but I don't see the light at the end of the tunnel.
Now I'm thinking about moving some Azure storage to Google storage, but haven't completely given up.

Performance of Consumption-hosted Functions on Azure free subscriptions

I am evaluating Azure Functions using an Azure Free Trial subscription.
Everything is OK except for performance/scalability.
I developed a trivial HTTP-triggered function (C# class library) that does nothing but sleep for 5 seconds (sketched below).
When executed once, directly, it takes about 5s, exactly as expected.
But when called 500 times in parallel, execution time grows to 20-30 seconds.
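A sketch of the kind of delay-only function described, assuming the class-library programming model with the ASP.NET Core HTTP types; the names are illustrative:

    using System;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;

    // Minimal HTTP-triggered function that just sleeps for five seconds (illustrative).
    public static class SleepFunction
    {
        [FunctionName("Sleep5")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get")] HttpRequest req)
        {
            await Task.Delay(TimeSpan.FromSeconds(5));
            return new OkResult();
        }
    }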
The function is "hosted" on the Consumption plan, so I expected that, once required, it would automatically be executed on separate VMs.
I checked ARR cookies (which might have stuck my requests to one VM): no, no cookies at all.
Everything looks fine, at least for such a simple case (no obvious bottlenecks to check: no DB, no communications, etc.).
So, the question is: is it because of the free trial subscription, or am I missing something?
There is no difference for Azure Functions on Free Trial Subscriptions. You aren't being slowed down by that.
As @mathewc pointed out, this is due to HTTP scale-out having some lag, which we're working to improve. You can see some knobs you can control here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/Http-Functions#throttling
If you enable throttling, it will result in 429s, but it will help prevent increasing execution times.

SQL Server log file grew 40GB with Hangfire

I have developed a Hangfire application using MVC, running in IIS, and it was working absolutely fine until I saw the size of my SQL Server log file, which grew a whopping 40 GB overnight!!
As per information from our DBA, there was a long-running transaction with the following SQL statement (I have 2 Hangfire queues in place):
(@queues1 nvarchar(4000), @queues2 nvarchar(4000), @timeout float)
delete top (1) from [HangFire].JobQueue with (readpast, updlock, rowlock)
output DELETED.Id, DELETED.JobId, DELETED.Queue
where (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))
and Queue in (@queues1, @queues2)
On exploring the Hangfire library, I found that this statement is used for dequeuing jobs and does a very simple task that should not take any significant time.
I couldn't find anything that would have caused this behavior. Transactions are used correctly with using statements, and objects are disposed in the event of an exception.
As suggested in some posts, I have checked the recovery model of my database and verified that it is Simple.
I have manually killed the hung transaction to reclaim the log file space, but it comes back again after a few hours. I am observing it continuously.
What could be the reason for such behavior, and how can it be prevented?
The issue seems to be intermittent, and it could be extremely high risk to deploy this to production :(
Starting from Hangfire 1.5.0, the Hangfire.SqlServer implementation wraps the whole processing of a background job in a transaction. The previous implementation used an invisibility timeout to provide an at-least-once processing guarantee without requiring a transaction, in case of an unexpected process shutdown.
I implemented a new model for queue processing because there was a lot of confusion for new users, especially ones who had just installed Hangfire and played with it under a debugging session. There were a lot of questions like "Why is my job still in the Processing state?". I considered that there might be problems with transaction log growth, but I didn't know this could happen even with the Simple recovery model (please see this answer to learn why).
It looks like there should be a switch for which queue model to use, based on transactions (the default) or based on an invisibility timeout. But this feature will only be available in 1.6, and I don't have any ETA yet.
Currently, you can use Hangfire.SqlServer.MSMQ or any other non-RDBMS queue implementation (please see the Extensions page); a sketch is shown below. A separate database for Hangfire may also help, especially if your application changes a lot of data.
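A minimal sketch of moving the queues to MSMQ with the Hangfire.SqlServer.Msmq package while keeping job storage in SQL Server; the connection string, queue path pattern and queue names are illustrative assumptions:

    // Keep job storage in SQL Server, but serve the queues from MSMQ (sketch).
    GlobalConfiguration.Configuration
        .UseSqlServerStorage(@"Server=.;Database=Hangfire;Integrated Security=True;")
        .UseMsmqQueues(@".\Private$\hangfire-{0}", "default", "critical");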

Multi-server n-tier synchronized timing and performance metrics?

[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows: Load balanced front end web servers (IIS) running an MVC application (C#). A home-grown service bus, implemented with MSMQ running in domain-integration mode. Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus. Back end SQL Server 2012 database, mirrored and replicated.
All servers have high spec hardware, running Windows Server 2012, latest releases, latest windows update. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and round-trip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message (a sketch of this metadata is shown below). Then, when the MVC app receives the reply, we can screen-dump the timestamps and metrics and try to determine which part of the process is causing the issue.
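A minimal sketch of the kind of per-hop timing metadata described above, carried inside the bus message; the type and property names are illustrative assumptions, not our actual contract:

    // Per-hop timing record appended to the message at each stop (illustrative).
    public class HopTiming
    {
        public string ServerName { get; set; }       // which machine handled this hop
        public string Stage { get; set; }            // e.g. "MVC -> bus", "worker dequeue"
        public DateTime TimestampUtc { get; set; }   // wall-clock time on that server
        public long ElapsedMsOnServer { get; set; }  // Stopwatch-based time spent locally
    }

    public class TimedRequestMessage
    {
        public Guid CorrelationId { get; set; }
        public List<HopTiming> Hops { get; set; } = new List<HopTiming>();
    }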
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurred when the software tested whether a queue had been created before opening it, so it was essentially looking up the queue twice, which is fairly expensive. The issue has now gone away.
What you should try is the Performance Monitor that is part of Windows itself. You can create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to monitor.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.
