I'm trying to understand how MS Enterprise Library's data access block manages its connections to SQL. The issue I have is that under a steady load (from a load test), at 10-minute intervals the number of connections to SQL increases quickly - which causes a noticeable jump in page response times from the website.
This is the scenario I'm running:
Visual Studio load test tools, running against 3 web servers behind a load balancer
The tools give full visibility over the performance counters to all web + DB boxes
The tests take ~10 seconds each, and perform 4 inserts (form data), and some trivial selects
There are 60 tests running concurrently. There is no increase or decrease in load during the entire test.
The test is run for between 20 minutes and 3 hours, with consistent results.
And this is the issue we see:
Exactly every 10 minutes, the performance counter from SQL for SQL General: User Connections increases - by ~20 connections total
The pages performing the HTTP post / DB insert are the ones most significantly affected. The other pages show moderate, but noticeable rises.
The CPU/memory load on the web servers is unaffected
This increase corresponds with a notable bump in page response times - e.g. from a 0.3 second average up to 5 seconds
After ~5 minutes it releases many of the connections, with no effect on web performance
The following 5 minutes of testing gives the same (normal) web performance
Ultimately, the graph looks like a square wave
Happens all over again, 10 minutes after the first rise
What I've looked at:
Database calls:
All calls in the database start with:
SqlDatabase database = new SqlDatabase([...]);
And either execute a proc with no required output:
return database.ExecuteScalar([...], [...]);
Or perform a read wrapped in a using statement:
using (SqlDataReader reader = (SqlDataReader)database.ExecuteReader([...], [...]))
{
[...]
}
There are no direct uses of SqlConnection, no .Open() or .Close() methods, and no exceptions being thrown
Database verification:
We've run SQL Profiler over the login/logout events, and taken snapshots with the sp_who2 command, showing who owns the connections. The latter shows that the web servers (identified by machine + credentials) are indeed holding the connections.
There are no scheduled jobs (DB or web server), and the user connection load is stable when there is no load from the web servers.
Connection pool config
I know the min & max size of the connection pool can be altered with the connection string.
E.g.:
"Data Source=[server];Initial Catalog=[x];Integrated Security=SSPI;Max
Pool Size=75;Min Pool Size=5;"
A fallback measure may be to set the minimum pool size to ~10
I understand the default max is 100, and the default min is 0 (from here)
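If we do fall back to pinning the minimum pool size, here is a minimal sketch of expressing that from code rather than a raw string (assuming the standard ADO.NET SqlClient pool keywords, the same ones used in the connection string above):

using System.Data.SqlClient;

// Sketch only: the same Min/Max Pool Size keywords, made explicit via the builder.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "[server]",
    InitialCatalog = "[x]",
    IntegratedSecurity = true,
    MinPoolSize = 10,   // the fallback measure mentioned above
    MaxPoolSize = 75
};

using (var connection = new SqlConnection(builder.ConnectionString))
{
    connection.Open();   // the pool is keyed on this exact connection string
}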
I'm a little hesitant to equate connection pooling (specific to this setting) with the User Connections performance counter from SQL. This article introduces these connection pools as being used to manage connection strings, which seems different from what I assume pooling does (hold a pool of connections generally available, to avoid the cost of re-opening them against SQL)
I still haven't seen any configuration parameters that are handily defaulting to 5 or 10 minutes, to zero in on...
So, any help is appreciated.
I know that 10-minute spikes sound like a change in load or new activity happening - but we've worked quite hard to isolate those and any other factors - and for this question, I am hoping to understand how Enterprise Library scales its connections up and down.
Thanks.
So, it turns out that new SQL user connections are created and added to the pool whenever all existing connections are busy. So when long-running queries occur, or the DB is otherwise unresponsive, the pool chooses to expand to manage the load.
The cause in our case happened to be a SQL replication job (unfortunate, but found...) - and the change in the number of User Connections was just a symptom, not a possible cause.
Although the cause turned out to be elsewhere, I now feel I understand the connection pool management in this (and presumably other) SQL libraries.
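As a minimal illustration of that behaviour (plain ADO.NET pooling, which Enterprise Library uses underneath; the connection string is a placeholder), holding several connections open concurrently forces the pool to create new physical connections, which is what the User Connections counter reports:

using System.Collections.Generic;
using System.Data.SqlClient;

const string cs = "Data Source=[server];Initial Catalog=[x];Integrated Security=SSPI;";

var busy = new List<SqlConnection>();
for (int i = 0; i < 20; i++)
{
    var conn = new SqlConnection(cs);
    conn.Open();     // while all existing pooled connections are busy,
    busy.Add(conn);  // each Open() creates a new physical connection
}

// Dispose returns them to the pool; the physical connections stay open
// (and visible in SQL's User Connections) until the pool ages them out.
busy.ForEach(c => c.Dispose());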
Related
PROBLEM:
Since a DB migration, I have had many intermittent SQL timeout exceptions. All I know after my 2 months of investigation:
All my .NET C# apps using EF 6 are impacted
It happens at any time of day (day or night), BUT the more traffic we have, the more timeouts happen
No matter which method is used (.ToList, .Single, .First, .SaveChanges) to trigger the call to the DB
Whatever DB table is impacted (small or big)
Whatever the query complexity (a large query with 20 joins or just a small query with a select ... where Id = 1)
Whatever the calling mode (write/read)
Whether calling the dbContext directly or through a navigation property (with lazy loading activated)
ENVIRONMENT:
> INFRASTRUCTURE:
1 AWS RDS (db.m4.2xlarge), 32 GB of RAM, 8 vCPU, 100 GB SSD with SQL SERVER 11.0.xxx in Read Committed Snapshot Isolation level
CPU around 5%, DB connections around 45, very little I/O (read or write), 20 GB of RAM available (out of 32 GB), 90 GB of free space (out of the 100 GB SSD)
Index optimization (rebuild, drop/create), query plans, stats
Increased parallelism
NO blocking, NO deadlocks, NO suspended sessions (sp_who2), NO CXPACKET problem, plenty of free memory and disk space
DB usage: around 25 requests per second, nothing exceptional...
> APPLICATIONS:
20 C# applications (batch, ASP.NET MVC UI, ...) mainly on EF 6 (lazy loading activated) or EF Core, with CommandTimeout set to 45s instead of the default 30s (see the sketch after this list)
LINQ to Entities optimization (Include, raw SQL replacing the very complex queries, removal of WHERE 1=0 queries, ...)
DbContext management optimized (open/close just when needed, no connection leaks)
Many memory caches added to avoid DB calls
All application pools are recycled every day
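For reference, a minimal sketch of where that 45 s command timeout is set in each stack (the context class names and connection strings are placeholders, not our real ones):

using Microsoft.EntityFrameworkCore;

// EF 6: the command timeout lives on the Database facade of the context.
public class LegacyContext : System.Data.Entity.DbContext
{
    public LegacyContext()
    {
        Database.CommandTimeout = 45;   // seconds, instead of the default 30
    }
}

// EF Core: the timeout is passed to the SQL Server provider when configuring the context.
public class CoreContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder options)
    {
        options.UseSqlServer(
            "Server=[server];Database=[x];Trusted_Connection=True;",   // placeholder
            sql => sql.CommandTimeout(45));                            // seconds
    }
}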
EXCEPTION:
When a .ToList, .First, .Single or any other method that triggers a call to the DB is executed, EF throws a message like these:
System.ComponentModel.Win32Exception: The wait operation timed out.
or
System.Data.SqlClient.SqlException A transport-level error has occurred when receiving results from the server. (provider: Session Provider, error: 19 - Physical connection is not usable)
RESULT:
When I run SQL Profiler on my PROD database (it is impossible to reproduce the problem on other dev environments), I see a query that runs in just 1 second (not 45 seconds...) while my code only returns the timeout after 45s... As if the problem were on the network/transport layer... But how do I monitor this? How can I be sure the problem is not in C# or the DB server but between the two? Do you have any advice to make these timeouts disappear before I kill myself?
EDIT 22/04/2019:
I confirm that when EF 6 executes the query (triggered by a .ToList for example), the query on my database takes only 78 microseconds (yes, microseconds...) to execute and... EF then waits 45 seconds before giving me an error like "A transport-level error has occurred...": WHY? Why doesn't EF 6 get the result from the database? Why is the answer lost in the wild? As if a connection were randomly dropped...
EDIT 25/04/2019:
Finally we decided to develop a DbExecutionStrategy with a retry system. I'm very sad to end it like this, but we were running out of time and that solved our problems.
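For anyone who ends up doing the same, a minimal sketch of such a retry strategy in EF 6 (the retry limits and the exception filter are assumptions to adapt, not the exact code we shipped):

using System;
using System.Data.Entity;
using System.Data.Entity.Infrastructure;
using System.Data.SqlClient;

// Retries transient transport/timeout failures with exponential back-off.
public class TransientRetryStrategy : DbExecutionStrategy
{
    public TransientRetryStrategy()
        : base(maxRetryCount: 3, maxDelay: TimeSpan.FromSeconds(10)) { }

    protected override bool ShouldRetryOn(Exception ex)
    {
        // Assumption: only retry what we actually observed, i.e. timeouts and
        // "transport-level error", which surface as TimeoutException/SqlException.
        return ex is TimeoutException || ex is SqlException;
    }
}

// EF 6 discovers the strategy through a DbConfiguration in the context's assembly.
public class AppDbConfiguration : DbConfiguration
{
    public AppDbConfiguration()
    {
        SetExecutionStrategy("System.Data.SqlClient", () => new TransientRetryStrategy());
    }
}

Be aware that a retrying strategy does not combine with user-initiated transactions, so it may need to be suspended around those calls.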
I have a WebAPI service which connects to an Oracle database using Oracle.ManagedDataAccess.dll. Each time after a reset of the application pool (or a deployment) there is a long delay on the first OracleConnection.Open() statement. It's typically around 8 seconds. Subsequent calls are around ~0.5 seconds each.
After reading lots of suggestions regarding server OS and networking issues, I have narrowed it down to the Oracle client itself. If I remote debug my code, set a breakpoint on the open statement, and then run Sysinternals Process Monitor, I can confirm that the first open statement produces 544 entries; the second and subsequent tests produce 2 entries.
The entries are quite random, but mostly relate to Cryptography. A quick overview of the logs:
RegOpenKey, HKLM\SOFTWARE\Microsoft\Cryptography\Defaults\Provider Types\Type 001
RegOpenKey, HKLM\SOFTWARE\Microsoft\Cryptography\Defaults\Provider\Microsoft Strong Cryptographic Provider
RegSetInfoKey, HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid
These are repeated several times, then there are sections like below:
RegQueryValue, HKLM\System\CurrentControlSet\WinSock2\Parameters\Protocol_Catalog9
RegCreateKey, HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
RegCreateKey, HKLM\System\CurrentControlSet\Services\DnsCache\Parameters
RegOpenKey, HKLM\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient
Then there are several reads of the machine.config file, followed by multiple TCP connects and receives to the Oracle port 1521. Following this is a section reading the time zone from the registry.
My question is: why is the Oracle client doing all of this on first open? Is there any way I can predetermine the answer to some of these questions (like configuring the time zone so it doesn't have to 'ask' Oracle for it)?
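One partial workaround (a sketch only; it does not remove the lookups, it just moves them off the first user request) is to open and dispose one connection during application start-up so that the pool creation and one-time setup have already happened by the time a request arrives:

using Oracle.ManagedDataAccess.Client;

public static class OraclePoolWarmup
{
    // Call once at startup, e.g. from Global.asax Application_Start.
    public static void Warm(string connectionString)
    {
        using (var connection = new OracleConnection(connectionString))
        {
            connection.Open();   // pays the expensive first-open cost up front
        }                        // Dispose returns the connection to the pool
    }
}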
The only time I have seen something like this was when the address in the TNS connect descriptor was not fully qualified, i.e. host=computername instead of host=computername.domain.com.
The issue is likely DNS resolution as it works through the suffixes. I imagine you could put in an IP and eliminate DNS altogether as a test. Consider posting your TNS entry and connection string as well.
FYI, a lot of things happen when that first connection is created, i.e. the pool is established and connections are actually opened rather than just fetched from the pool, initial parameters for self-tuning are initialized, etc., so I think the number of registry reads is probably a red herring.
We are running a .NET 4.5 console application that performs USNChanged polling on a remote LDAP server and then synchronizes the records into a local AD LDS on Windows Server 2008R2. The DirSync control was not an option on the remote server but getting the records isn't the problem.
The directory is quite large, containing millions of user records. The console app successfully pulls down the records and builds a local cache. It then streams through the cache and does lookup/update/insert as required for each record on the local directory. The various network constraints in the environment had performance running between 8 and 80 records per second. As a result, we used the Task Parallel Library to improve performance:
var totalThreads = Environment.ProcessorCount * 2;
var options = new ParallelOptions { MaxDegreeOfParallelism = totalThreads };

Parallel.ForEach(Data.ActiveUsersForSync.Batch(250), options, (batch, loopstate) =>
{
    if (!loopstate.IsExceptional
        && !loopstate.IsStopped
        && !loopstate.ShouldExitCurrentIteration)
    {
        ProcessBatchSync(batch);
    }
});
After introducing this block, performance increased to between 1000 and 1500 records per second. Some important notes:
This is running on an eight-core machine, so Environment.ProcessorCount * 2 allows up to 16 operations simultaneously
The MoreLinq library batching mechanism is used, so each task in the parallel set processes 250 records on a given connection (from the pool) before returning
Each batch is processed synchronously (no additional parallelism); see the sketch after this list
The implementation relies on System.DirectoryServices.Protocols (Win32), NOT System.DirectoryServices (ADSI)
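For context, a minimal sketch of the per-batch work inside ProcessBatchSync (the host/port, DNs, filter and attribute names are placeholders rather than our real schema; it only illustrates the synchronous lookup / insert / update flow over System.DirectoryServices.Protocols):

using System.Collections.Generic;
using System.DirectoryServices.Protocols;

static void ProcessBatchSync(IEnumerable<KeyValuePair<string, string>> batch)
{
    // One connection per batch, taken from the LDAP connection pool.
    using (var connection = new LdapConnection("localhost:50000"))   // placeholder AD LDS host:port
    {
        connection.SessionOptions.ProtocolVersion = 3;
        connection.Bind();

        foreach (var record in batch)   // Key = DN, Value = display name (illustrative only)
        {
            // Lookup: does the entry already exist in the local directory?
            SearchResponse found = null;
            try
            {
                var search = new SearchRequest(record.Key, "(objectClass=user)", SearchScope.Base, "displayName");
                found = (SearchResponse)connection.SendRequest(search);
            }
            catch (DirectoryOperationException)
            {
                // Treat "no such object" as not found.
            }

            // Insert or update; both are plain synchronous requests, no notifications.
            DirectoryRequest change = (found == null || found.Entries.Count == 0)
                ? (DirectoryRequest)new AddRequest(record.Key,
                      new DirectoryAttribute("objectClass", "user"),
                      new DirectoryAttribute("displayName", record.Value))
                : new ModifyRequest(record.Key, DirectoryAttributeOperation.Replace, "displayName", record.Value);

            connection.SendRequest(change);
        }
    }
}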
Whenever a periodic full synchronization is executed, the system will get through about 1.1 million records and then AD LDS returns "The Server Is Busy" and the system throws a DirectoryOperationException. The number it completes before erroring is not constant but it is always near 1.1 million.
According to Microsoft (http://support.microsoft.com/kb/315071) the MaxActiveQueries value in AD LDS is no longer enforced in Windows Server 2008+. I can't change the value anyway; it doesn't show up. They also show the "Server is Busy" error coming back only from a violation of that value or from having too many open notification requests per connection. This code only sends simple lookup/update/insert LDAP commands and requests no notifications from the server when something is changed.
As I understand it, I've got at most 16 threads working in tandem to query the LDS. While they are doing it very quickly, that's the maximum number of queries coming in at any given tick, since each of these is processed single-threaded.
Is the Microsoft document incorrect? Am I misunderstanding another component here? Any assistance is appreciated.
We are having some connection pool issues with NHibernate on an MVC3 web application which is running against SQL Express and dealing with multiple concurrent AJAX-based requests.
Every so often (hours in between) we see errors starting which show:
NHibernate.Util.ADOExceptionReporter
Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
then a load of
While preparing select TOP (#p0)
....
an error occurred
We have to recycle the IIS app pool to stop 500 errors from being thrown after that.
Looking at the SQL Server we see:
select * from sys.dm_exec_sessions
... gives about 30 sessions with IDs above 51 (i.e. user sessions)
select * from sys.dm_exec_connections
... gives around the same amount
BUT
select @@connections
... gives results with 79022
Is this indicating that the connections are never released?
The NHibernate sessions are scoped to the lifetime of the request.
Does anyone have any experience of anything like this or can point us in the right direction?
Many thanks
Richard
You can't have more than 32,767 connections to SQL Server.
@@CONNECTIONS also gives (my bold):
Returns the number of attempted connections, either successful or unsuccessful since SQL Server was last started.
That is attempted connections, not the current number of connections.
I suspect that your pool is not set up correctly so it's exhausted too quickly.
Or you are not releasing connections correctly and you're checking SQL Server after you recycle IIS.
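If it is the latter, a minimal sketch of a session-per-request pattern that always returns the connection to the pool (this assumes the session factory is exposed as a static property and that current_session_context_class is set to "web"; adjust to your own bootstrap code):

using System.Web;
using NHibernate;
using NHibernate.Context;

public class MvcApplication : HttpApplication
{
    // Assumed to be built once at application start.
    public static ISessionFactory SessionFactory;

    protected void Application_BeginRequest()
    {
        // Bind one session to the current request.
        CurrentSessionContext.Bind(SessionFactory.OpenSession());
    }

    protected void Application_EndRequest()
    {
        // Always unbind and dispose, even on error, so the underlying
        // ADO.NET connection goes back to the pool.
        var session = CurrentSessionContext.Unbind(SessionFactory);
        if (session != null)
        {
            session.Dispose();
        }
    }
}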
I am using JMeter to test our application's performance. But I found that when I send 20 requests from JMeter, which should add 20 new records into SQL Server, I only find 5 new records, which means that SQL Server discarded the other requests (I took a log and made sure that the inserts for the new records were sent out to SQL Server).
Does anyone have any ideas? What is the threshold number of requests SQL Server can handle per second? Or do I need to do some configuration?
Yeah, in my application I tried, but it seems that only 5 requests are accepted; I don't know what to configure so it can accept more.
I'm not convinced the number of requests per second is directly related to SQL Server throwing away your inserts. Perhaps there's an application logic error that rolls back or fails to commit the inserts. Or the application fails to handle concurrency and inserts data that violates the constraints. I'd check the server logs for deadlocks as well.
Use either SQL Profiler or the LINQ data context for logging to see what has actually been sent to the server and then determine what the problem is.
Enable the data context log like this:
datacontext.Log = Console.Out;
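In a web application there is no console to write to, so here is a small sketch of capturing that log instead (DataContext.Log accepts any TextWriter; the StringWriter here is just one option):

using System.IO;

var sqlLog = new StringWriter();
datacontext.Log = sqlLog;          // LINQ to SQL writes the generated SQL here

// ... run the queries under test ...

System.Diagnostics.Debug.WriteLine(sqlLog.ToString());   // or send it to your own logging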
As a side note, I've been processing 10 000 transactions per second in SQL Server, so I don't think that is the problem.
This is very dependent on what type of queries you are doing. You can have many queries requesting data which is already in a buffer, so that no disk read access is required, or you can have reads which actually require disk access. If your database is small and you have enough memory, you might have all the data in memory at all times - access would be very fast then, and you might get 100+ queries/second. If you need to read from disk, you are dependent on your hardware. I have opted for an UltraSCSI-160 controller with UltraSCSI-160 drives, the fastest option you can get on a PC-type platform. I process about 75,000 records every night (they get downloaded from another server). For each record I process, the program makes about 4 - 10 queries to put the new record into the correct 'slot'. The entire process takes about 3 minutes. I'm running this on an 850 MHz AMD Athlon machine with 768 MB of RAM.
Hope this gives you a little indication about the speed.
This is an old case study; now there are 2017 and 2019, and I am waiting to see what happens
https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/
SQL Server 2016: 1,200,000 batch requests/sec, with Memory-Optimized Tables with LOB support and Natively Compiled stored procedures
To get benchmark tests for SQL Server and other RDBMSs, visit the Transaction Processing Performance Council (TPC) website
You can also use SQL Server Profiler to check how your queries are executed