I am evaluating Azure Functions using an Azure Free Trial subscription.
Everything is OK except for performance/scalability.
I developed a trivial HTTP-triggered function (C# class library) that does nothing but sleep for 5 seconds (a minimal sketch is below).
When executed once, directly, it takes about 5 seconds, exactly as expected.
But when called 500 times in parallel, execution time grows to 20-30 seconds.
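For reference, a minimal sketch of such a function, assuming the in-process C# model; names and bindings are illustrative, not the exact code under test:

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class SleepFunction
{
    [FunctionName("Sleep")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req)
    {
        // Simulate 5 seconds of work; no I/O, no external dependencies.
        await Task.Delay(TimeSpan.FromSeconds(5));
        return new OkResult();
    }
}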
Function is "hosted" on Consumption plan, so I expected that once required, it is executed on separate VM "automatically".
I checked ARR Cookies (that might have stuck my requests to one VM) - no, no
cookies at all.
Everything looks fine, at least for such simple case (no obvious bottlenecks to check - no DB, no communications, etc.).
So, the question is - is it because of free trial subscription, or I am missing something?
There is no difference for Azure Functions on a Free Trial subscription; you aren't being slowed down by that.
As @mathewc pointed out, this is due to HTTP scale-out having some lag, which we're working to improve. You can see some knobs you can control here: https://github.com/Azure/azure-webjobs-sdk-script/wiki/Http-Functions#throttling
If you enable throttling, requests over the limit will get 429 responses, but it will help prevent execution times from growing.
We are experiencing issues in our .NET Core 3.1 integration with Azure Redis Cache.
The exception thrown is:
An unhandled exception has occurred while executing the request.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync, ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Yes, I have already read that article, and we are using the latest available version of the StackExchange.Redis NuGet package. Steps we have already taken (a sketch of both follows below):
Set the minimum thread pool count with several values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
Increased the Redis timeout from the default 5 seconds to 15 seconds (going any higher will not solve it, I think, as you will read a bit further on)
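As mentioned above, a sketch of both mitigations; the endpoint and password are placeholders, and the same timeouts can also be supplied as syncTimeout/asyncTimeout options in the connection string used in the registration further down:

using System.Threading;
using StackExchange.Redis;

// 1) Raise the minimum worker and IOCP threads before any Redis traffic starts.
ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);

// 2) Raise the Redis operation timeouts from the default 5s to 15s.
//    These options can then be assigned to s.ConfigurationOptions in the
//    AddStackExchangeRedisCache registration shown further down.
var redisOptions = ConfigurationOptions.Parse(
    "redis-scr-mns-dev.redis.cache.windows.net:6380,ssl=True,password=placeholder");
redisOptions.SyncTimeout = 15000;   // milliseconds
redisOptions.AsyncTimeout = 15000;  // milliseconds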
What is the setup, you ask?
A .NET Core 3.1 REST API running on the latest IIS, with a 3-worker-thread setting, on a 4-core Windows server with 16 GB of RAM (monitoring shows nothing extreme for CPU or memory)
Connected to an Azure Redis Cache, currently a Basic C5 with high network bandwidth and 23 GB of memory (it was a lower tier before, so we tried scaling up)
Pushes requests to an Azure Service Bus at the end (no problems there)
A batch process runs and makes a couple of tens of thousands of API calls (across several APIs), of which the one mentioned above crashes against the Redis Cache with the timeout exception. The other APIs run correctly and do not time out, but they currently connect to a different Redis cache (just to isolate this API's behavior)
All APIs and batch programs use a custom NuGet package that contains the cache implementation, so we are sure it can't be an implementation issue in that one API; it is all shared code.
How do we use the cache? Via dependency injection we inject ISharedCacheStore, which is just our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, which is the implementation using Redis (ISharedCacheStore is there for future use of other caching mechanisms). A simplified sketch is below.
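A hypothetical sketch of that wrapper; member names are assumptions and the IRedisCache layer is omitted for brevity:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Interface exposing only asynchronous operations, as described above.
public interface ISharedCacheStore
{
    Task<string> GetStringAsync(string key, CancellationToken token = default);
    Task SetStringAsync(string key, string value, DistributedCacheEntryOptions options, CancellationToken token = default);
}

// Implementation that simply delegates to the injected IDistributedCache (Redis here).
public class SharedCacheStore : ISharedCacheStore
{
    private readonly IDistributedCache _cache;

    public SharedCacheStore(IDistributedCache cache) => _cache = cache;

    public Task<string> GetStringAsync(string key, CancellationToken token = default)
        => _cache.GetStringAsync(key, token);

    public Task SetStringAsync(string key, string value, DistributedCacheEntryOptions options, CancellationToken token = default)
        => _cache.SetStringAsync(key, value, options, token);
}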
We use Microsoft.Extensions.Caching.StackExchangeRedis version 3.1.5, and the registration in Startup is:
services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
.AddStackExchangeRedisCache(s =>
{
s.Configuration = connectionString;
})
.AddTransient<IRedisCache, RedisCache>()
.AddTransient<ISharedCacheStore, SharedCacheStore>();
We are out of ideas, to be honest. We don't see an issue with the Redis Cache instance in Azure, as it is not even near its limits when we get the timeouts. Server load hit about 80% on the lower pricing tier and doesn't even reach 10% on the current, higher one.
According to Insights, we had about 4,000 cache hits per minute on the run we did, causing that roughly 10% server load.
UPDATE: It is worth mentioning that the batch and the API currently run in an on-premise environment, not in the cloud. The move to the cloud is planned in the upcoming months.
This also applies to the other APIs connecting to a Redis Cache that are NOT having issues.
Comparison
Another Azure Redis cache gets 45K hits a minute without any issue whatsoever (from on-premise)
This one hits the timeout mark without even reaching 10K hits per minute
There are a couple of possible things here:
I don't know what that EVAL is doing; it could be that the Lua being executed is causing a blockage; the only way to know for sure would be to look at SLOWLOG, but I don't know whether this is exposed on Azure Redis
It could be that your payloads are saturating the available bandwidth - I don't know what you are transferring
It could simply be a network/socket stall/break; they happen, especially with cloud - and the (relatively) high latency makes this especially painful
We want to enable a new optional pooled (rather than multiplexed) model; this would in theory (the proof-of-concept worked great) avoid large backlogs, which means even if a socket fails: only that one call is impacted, rather than causing a cascade of failure; the limiting factor on this one is our time (and also, this needs to be balanced with any licensing implications from the redis provider; is there an upper bound on concurrent connections, for example)
It could simply be a bug in the library code; if so, we're not seeing it here, but we don't use the same setup as you; we do what we can, but it is very hard to diagnose problems that we don't see, that only arise in someone else's at-cost setup that we can't readily replicate; plus ultimately: this isn't our day job :(
I don't think there's a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios, that take a lot of investigation. And simply: the Azure folks don't pay for our time.
So, we found the issue.
The issue sits in the registration of our classes, which is AddTransient, as shown in the original code above.
When we change this to AddScoped, performance is a lot better. We are even wondering whether it could be a singleton.
The weird thing is that AddTransient should increase 'connected clients', which it does as a matter of fact, but it apparently also has a bigger impact on the number of requests that can be handled, since we never reached the max connections limit during processing.
.AddScoped<IRedisCache, RedisCache>()
.AddScoped<ISharedCacheStore, SharedCacheStore>();
With this code instead of AddTransient, we did 220,000 operations in a 4-5 minute period without an issue, whereas with the old code we didn't even reach 40,000 operations because of timeout exceptions.
This is my scenario:
Microsoft.Azure.Storage.Blob 11.2.0
Microsoft.Azure.Storage.Queue 11.2.0
Microsoft.Azure.Cosmos.Table 1.0.7
I've moved a lot of my code from Azure Functions to Google Kubernetes Engine and Google Cloud, running the .NET Core app with basically the same library, built for .NET Standard 2.0, without any problems.
After a few days, I noticed different behavior on the Linux system.
A few calls interacting with Azure services (blob, table, queue) get timeouts (the subsystem appears to fail; I tried different retry policies with the same result).
In 10,000 calls I get 10 to 50 errors (or very long calls of 180 seconds, before I changed the timeouts). This happens with all the Azure services: table, blob and queue.
I tried different solutions to find out why:
I instantiated the client (blobClient, TableClient, etc.) on every call, or reused the same client, with no difference either way
I changed all the timeouts to handle this behavior. I worked with ServerTimeout and MaximumExecutionTime and put a layer on top with my own retry mechanism, so I can minimize errors (a sketch of these options follows after this list). Now I have "only" a few calls of 20 seconds (instead of 2-3 seconds, for example).
I tried all the solutions from similar problems found on Stack Overflow :D ... but nothing works (so far)
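As referenced in the list above, a sketch of those per-request options using the 11.x SDK named in the question; the connection string, container and blob names are placeholders:

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.RetryPolicies;

public static class BlobTimeoutExample
{
    public static async Task<string> DownloadWithTimeoutsAsync(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var blobClient = account.CreateCloudBlobClient();

        var requestOptions = new BlobRequestOptions
        {
            ServerTimeout = TimeSpan.FromSeconds(10),         // per-attempt timeout enforced by the service
            MaximumExecutionTime = TimeSpan.FromSeconds(20),  // overall client-side cap across all retries
            RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 3)
        };

        var container = blobClient.GetContainerReference("my-container");
        var blob = container.GetBlockBlobReference("my-blob.json");

        // Options are passed per call; an OperationContext could be added for extra telemetry.
        return await blob.DownloadTextAsync(Encoding.UTF8, null, requestOptions, null);
    }
}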
The same DLL code runs on Azure Functions without any problems.
So I came to the conclusion that there is something in the HTTP client used internally by the Azure SDK that depends on the operating system you run your code on.
After a few articles, I thought it might be the Keep-Alive behavior, so I tried this in my composition root:
ServicePointManager.SetTcpKeepAlive(true, 120000, 10000);
but nothing changed.
Any ideas or suggestions? Maybe I'm on the wrong path, or I've missed something.
UPDATE
After reading the last article linked by @KrishnenduGhosh-MSFT in the comments, I tried changing this setting:
ServicePointManager.DefaultConnectionLimit = 100;
This was the turning point.
Since it used to happen randomly, I'm still not 100% sure the problem is solved.
But after 50k calls, I'm pretty optimistic. Obviously production will behave differently, but I'm already expecting that :)
UPDATE 2 - AFTER PUBLISH IN PROD
In the end, it doesn't work :(
I had written this in the comments, but it seems fair to update here (more readable).
I still have long calls (capped by MaximumExecutionTime), but I don't see the light at the end of the tunnel.
Now I'm thinking about moving some Azure storage to Google storage, but haven't completely given up.
I have a C# MVC application with a WCF service running on Azure. At first it was of course hosted on the Free tier, but as I had it running smoothly I wanted to see how it ran on either Basic or Standard, which as far as I know should be dedicated servers.
To my surprise, the code ran significantly slower once it was changed from Free to either Standard or Basic. I chose the smallest instance, but I still expected them to perform better than the Free option.
From my performance logging I can see that the code that runs especially slowly is something that is started asynchronously from Task.Run. Initially it was old-school Thread.Start(), but I considered whether that might spawn it on some lower-priority thread and therefore changed it to Task.Run, without this changing anything; so perhaps it has nothing to do with it, but it might, so now you know.
The code that runs really slowly basically works on an XML document through XDocument, XElement, etc. It loops through the elements, uses some LINQ and so on, but nothing too fancy (a hypothetical sketch of that kind of work is below). Still, it is 5-10 times slower on Basic and Standard than on the Free tier: for the exact same request the Free tier takes around 1000 ms, whereas Basic and Standard take 8000-10000 ms.
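To give an idea of the shape of that work, a hypothetical sketch; the file, element and attribute names are made up for illustration:

using System.Linq;
using System.Xml.Linq;

// Purely CPU-bound parsing and LINQ-to-XML grouping, no I/O beyond the initial load.
var doc = XDocument.Load("orders.xml");
var summary = doc.Descendants("Order")
    .Where(o => (string)o.Attribute("status") == "open")
    .GroupBy(o => (string)o.Element("Customer"))
    .Select(g => new { Customer = g.Key, Count = g.Count() })
    .ToList();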
In each test I tried 5-10 times without any decrease in response times. I considered whether I needed to wait some hours before Basic/Standard became fully functional or something like that, but each time I switch back, the Free tier just outperforms them from the get-go.
Any suggestions? Is the Free version for some strange reason more powerful than Basic or Standard or do I need to configure something differently once I get up and running on Basic or Standard?
The notable difference between the Free and Basic/Standard tiers is that Free uses an undisclosed number of shared cores, whereas Basic/Standard has a defined number of CPU cores (1-4 based on how much you pay). Related to this is the fact that Free is a shared instance while Basic/Standard is a private instance.
My best guess based on this is that since the Free servers you would be on host multiple different users and applications, they probably have pretty beefy specs. Their CPUs are probably 8-core Xeons, and there might even be multiple CPUs. Most likely, Azure isn't enforcing any caps but rather relying on quotas (60 CPU minutes per day for the Free tier) and overall demand on the server to restrict CPU use. In other words, if your site is the only one that happens to be doing anything at the moment (unlikely of course, but for the sake of example), you could potentially be utilizing all 8+ cores on the box, whereas when you move over to Basic/Standard you are hard-limited to 1-4. Processing XML is actually very CPU heavy, so this seems to line up with my assumptions.
More than likely, this is a fluke. Perhaps your residency is currently on a relatively newly provisioned server that hasn't been filled up with tenants yet. Maybe you just happen to be sharing with tenants that aren't doing much. Who knows? But if the server is ever actually under real load, I'd imagine you'd see a much worse response time on the Free tier than even on Basic/Standard.
[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows: Load balanced front end web servers (IIS) running an MVC application (C#). A home-grown service bus, implemented with MSMQ running in domain-integration mode. Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus. Back end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware and run Windows Server 2012 with the latest releases and latest Windows updates. Everything is bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (send an MSMQ message) and await the reply.
One of the servers in the worker pool picks up the message, works out what to do, performs queries against the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick up using the correlation ID (a hypothetical sketch of this request/reply pattern follows).
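A hypothetical sketch of that request/reply flow with System.Messaging; the queue paths and message body are placeholders, not our actual bus code:

using System;
using System.Messaging;

// Web tier: send a request and wait for the matching reply.
var requestQueue = new MessageQueue(@"FormatName:DIRECT=OS:busserver\private$\requests");
var replyQueue = new MessageQueue(@".\private$\replies");
replyQueue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

var request = new Message("do-some-work") { ResponseQueue = replyQueue };
requestQueue.Send(request);

// The worker sets reply.CorrelationId = request.Id before sending its reply,
// so the web tier can wait for exactly its own answer.
Message reply = replyQueue.ReceiveByCorrelationId(request.Id, TimeSpan.FromSeconds(30));
var result = (string)reply.Body;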
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and round-trip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message (sketched below). Then, when the MVC app receives the reply, we can screen-dump the timestamps and metrics and try to determine which part of the process is causing the issue.
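A hypothetical sketch of that instrumentation (type and member names are assumptions). The ProcessingMs value is measured locally with Stopwatch, so it does not depend on the server clocks being in sync, whereas the ArrivedUtc stamp does:

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Timing metadata carried inside the bus message.
public class HopTiming
{
    public string Server { get; set; }
    public DateTime ArrivedUtc { get; set; }  // wall-clock stamp, only comparable across servers if clocks are synced
    public long ProcessingMs { get; set; }    // local duration, immune to clock skew
}

public class TracedRequest
{
    public Guid CorrelationId { get; set; }
    public List<HopTiming> Hops { get; } = new List<HopTiming>();
}

public static class HopTimer
{
    // Wraps the work done at one stop and appends its timing to the message.
    public static void Record(TracedRequest message, Action handle)
    {
        var hop = new HopTiming { Server = Environment.MachineName, ArrivedUtc = DateTime.UtcNow };
        var stopwatch = Stopwatch.StartNew();
        handle();
        hop.ProcessingMs = stopwatch.ElapsedMilliseconds;
        message.Hops.Add(hop);
    }
}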
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurred because the software tested whether a queue had been created before opening it, so it was essentially looking up the queue twice, which is fairly expensive. The issue has now gone away.
What you should try is the Performance Monitor that is part of Windows itself. You can create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to monitor.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.
I've been using SignalR on a project for the last couple of weeks and it has been performing great; I even did a stress test with Crank yesterday and got 1000 users with no real delay.
I need to move on to the next stage of testing today, so I decided to move it to IIS 7.5.
After moving it over and doing a quick touch test I decided to do another stress test. This time I only got to 10 users and the website was pretty much dead.
Does anyone know why this would happen? I've followed all the information on SignalR performance tuning and it has made zero difference.
Can anyone help?
In some cases the maximum concurrent requests setting can be capped at ~10 (the old default). This was changed in later .NET releases to default to 5000. Judging by what's happening on your machine, I'd assume your default is still (somehow) ~10.
I know you said you looked over the SignalR performance tuning piece, but make sure your configuration is properly set up per the Maximum Concurrent Requests Per CPU section at https://github.com/SignalR/SignalR/wiki/Performance. It is tempting to skip that section, thinking that 5k concurrent requests is enough, but in earlier releases the value defaulted to something very low.
You can also check out http://blogs.msdn.com/b/tmarq/archive/2007/07/21/asp-net-thread-usage-on-iis-7-0-and-6-0.aspx for more info regarding IIS concurrent request usage, particularly the 7th paragraph.