We have an ASP.NET MVC application deployed to an Azure Website that connects to MongoDB and does both read and write operations. The application does this iteratively, a few thousand times per minute.
We initialize the C# driver using Autofac and we set the MaxConnectionIdleTime to 45 seconds as suggested in https://groups.google.com/forum/#!topic/mongodb-user/_Z8YepNHnbI and a few other places.
We are still getting a large number of the below error:
System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
We get this error both while connecting to a MongoDB instance deployed on a VM in the same datacenter/region on Azure, and while connecting to an external PaaS MongoDB provider.
I run the same code on my local computer and connect to the same DB, and I don't receive these errors. It's only when I deploy the code to an Azure Website.
Any suggestions?
A few thousand requests per minute is a big load, and the only way to handle it reliably is by controlling and limiting the maximum number of threads that can be running at any one time.
As there's not much information posted about how you've implemented this, I'm going to cover a few possible circumstances.
Time to experiment...
The constants:
Items to process:
50 per second, or in other words...
3,000 per minute, and one more way to look at it...
180,000 per hour
The variables:
Data transfer rates:
How much data you can transfer per second is going to play a role no matter what we do, and this will vary throughout the day depending on the time of day.
The only thing we can do is fire off more requests from different CPUs to distribute the weight of the traffic we're sending back and forth.
Processing power:
I'm assuming you have this in a WebJob as opposed to having it coded inside the MVC site itself, which is highly inefficient and not fit for the purpose you're trying to achieve. By using a WebJob we can queue work items to be processed by other WebJobs. The queue in question is Azure Queue Storage.
Azure Queue storage is a service for storing large numbers of messages
that can be accessed from anywhere in the world via authenticated
calls using HTTP or HTTPS. A single queue message can be up to 64 KB
in size, and a queue can contain millions of messages, up to the total
capacity limit of a storage account. A storage account can contain up
to 200 TB of blob, queue, and table data. See Azure Storage
Scalability and Performance Targets for details about storage account
capacity.
Common uses of Queue storage include:
Creating a backlog of work to process asynchronously
Passing messages from an Azure Web role to an Azure Worker role
The issues:
We're attempting to complete 50 transactions per second, so each transaction should complete in under 1 second if we were utilising 50 threads. Our 45-second timeout serves no purpose at this point.
We're expecting 50 threads to run concurrently, and all to complete in under a second, every second, on a single CPU. (I'm exaggerating here to make a point, but imagine downloading 50 text files every single second, processing each one, then trying to shoot it back over to a colleague in the hope that they'll even be ready to catch it.)
We need retry logic in place: if after 3 attempts the item isn't processed, it needs to be placed back into the queue. Ideally we should give the server more time to respond than just one second with each failure; let's say we gave it a 2-second break on the first failure, then 4 seconds, then 10. This will greatly increase the odds of us persisting / retrieving the data that we needed (see the sketch after this list).
We're assuming that our MongoDB can handle this number of requests per second. If you haven't already, start looking at ways to scale it out. The issue isn't the fact that it's MongoDB, the data layer could have been anything; it's the fact that we're making this number of requests from a single source that is the most likely cause of your issues.
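To make the retry idea concrete, here's a minimal sketch. WorkItem, ProcessWorkItemAsync and RequeueAsync are hypothetical placeholders for your own work-item type, data-store call and re-queue logic.

using System;
using System.IO;
using System.Threading.Tasks;

static async Task ProcessWithRetriesAsync(WorkItem item)
{
    const int maxAttempts = 3;

    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            await ProcessWorkItemAsync(item); // your lookup/update/insert
            return;                           // success, nothing more to do
        }
        catch (IOException) when (attempt < maxAttempts)
        {
            // Give the server progressively more breathing room
            // (2s, then 4s) before the next attempt.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }

    await RequeueAsync(item); // still failing after 3 attempts: back to the queue
}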
The solution:
Set up a WebJob and name it EnqueueJob. This WebJob will have one sole purpose: to queue items of work to be processed in the Queue Storage.
Create a queue in Queue Storage named WorkItemQueue. This queue will act as a trigger to the next step and kick off our scaling-out operations.
Create another WebJob named DequeueJob. This WebJob will also have one sole purpose, to dequeue the work items from the WorkItemQueue and fire out the requests to your data store.
Configure the DequeueJob to spin up once an item has been placed inside the WorkItemQueue, start 5 separate threads on each instance and, while the queue is not empty, dequeue work items for each thread and attempt to execute the dequeued job.
Attempt 1, if fail, wait & retry.
Attempt 2, if fail, wait & retry.
Attempt 3, if fail, enqueue item back to WorkItemQueue
Configure your website to autoscale out to x number of CPUs (note that your website and WebJobs share the same resources).
Here's a short 10-minute video that gives an overview of how to utilise queue storage and WebJobs.
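To illustrate the two jobs, here's a rough sketch using the WebJobs SDK and the Azure storage client. The lowercase queue name is deliberate (storage queue names must be lowercase), and the connection string handling and Deserialize are placeholders.

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class Functions
{
    // EnqueueJob side: push each serialised work item into the queue.
    public static async Task EnqueueAsync(string storageConnectionString, string workItem)
    {
        var account = CloudStorageAccount.Parse(storageConnectionString);
        var queue = account.CreateCloudQueueClient().GetQueueReference("workitemqueue");
        await queue.CreateIfNotExistsAsync();
        await queue.AddMessageAsync(new CloudQueueMessage(workItem));
    }

    // DequeueJob side: the WebJobs SDK invokes this once per message and
    // pulls messages in parallel batches as the queue fills up.
    public static async Task DequeueAsync(
        [QueueTrigger("workitemqueue")] string workItem)
    {
        // Hand the message to the retry wrapper sketched under "The issues".
        await ProcessWithRetriesAsync(Deserialize(workItem)); // Deserialize is a placeholder
    }
}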
Edit:
You may also be getting those errors because of two other factors, again caused by the code running inside an MVC app...
If you're compiling the application with the DEBUG attribute applied but pushing the RELEASE version instead, you could be running into issues due to the settings in your web.config. Without the DEBUG attribute, an ASP.NET web application will run a request for a maximum of 110 seconds (the default executionTimeout); if the request takes longer than this, it will dispose of the request.
To increase the timeout beyond the default you will need to change the httpRuntime property in your web.config...
<!-- Increase timeout to five minutes -->
<httpRuntime executionTimeout="300" />
The other thing you need to be aware of is the request timeout setting between your browser and the web app. I'd say that if you insist on keeping the code in MVC, as opposed to extracting it and putting it into a WebJob, then you can use the following code to fire a request off to your web app and extend the timeout of the request.
using System;
using System.IO;
using System.Net;

string html = string.Empty;
string uri = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Timeout = (int)TimeSpan.FromMinutes(5).TotalMilliseconds; // Timeout is in milliseconds

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    html = reader.ReadToEnd();
}
Are you using MongoDB in a VM? It seems to be a network problem. These kinds of transient faults are to be expected, so the best you can do is implement a retry pattern or use a lib such as Polly to do that:
Policy
    .Handle<IOException>()
    .Retry(3, (exception, retryCount) =>
    {
        // log the failure, wait, etc.
    })
    .Execute(() => DoMongoOperation()); // DoMongoOperation stands in for your own call
https://github.com/michael-wolfenden/Polly
Related
I read this post: C# (429) Too Many Requests
and I understood the response code, but... why is this status code only returned when the call is made from the server side (backend) in production (hosted)? The service never returns this code when I call the same service from Chrome's address bar, or when I make the call server side (backend) from my localhost.
CASE 1 (works fine from localhost - the service URL is not localhost, it's hosted)
App A (localhost) calls App B (hosted) --> works fine
for (int i = 0; i < 1000; i++)
{
HttpClient client = new HttpClient();
client.BaseAddress = new Uri(url);
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
String response = client.GetStringAsync(urlParameters).Result;
client.Dispose();
}
CASE 2 (works fine)
Chrome browser calls App B (hosted) --> works fine
CASE 3 (similar to case 1 but with far fewer requests - DOES NOT WORK)
App A (hosted) calls App B (hosted) --> 429
Why? What is the problem? How can I solve it?
What's Happening
The HTTP 429 response code indicates you have been rate limited. The idea is to prevent one caller from overwhelming a service, making it less available to other callers.
Most Common
That limiting can be based on many things. Most common are
Number of calls per unit time (usually per second)
Number of concurrent calls
The General Case
A rate limiter may also forgive a short burst of calls that happens occasionally, may allow more calls before hitting the brakes based on who you are (using your IP or an API key for example), dynamically adjust its limits based on total system load, or do other things.
Probably Happening Here
Based on your description, I would guess the number of concurrent calls is causing the production rate limiting. Rather than hitting the external API hard trying to guess what the rules are, try reaching out to them to ask. If that is not an option, running multiple requests in parallel could validate this theory.
Handling
A great way to deal with this is to back off your requests when you receive an HTTP 429.
The service should return a Retry-After header indicating how many seconds you should wait before trying again. If it does, wait that long before resubmitting your request.
If the service does not provide that header (I work with a major one that does not), use exponential backoff instead.
Depending on your needs, you may want to tell your own caller to try again later (return an HTTP 429 yourself) or you may want to queue up pending requests and work off the queue to submit them all.
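As an illustration, here's a minimal sketch of honouring Retry-After with HttpClient, falling back to exponential backoff when the header is absent; the attempt limit is an assumption.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<HttpResponseMessage> SendWithBackoffAsync(
    HttpClient client, Func<HttpRequestMessage> createRequest, int maxAttempts = 5)
{
    for (var attempt = 1; ; attempt++)
    {
        var response = await client.SendAsync(createRequest());

        if (response.StatusCode != (HttpStatusCode)429 || attempt == maxAttempts)
            return response;

        // Prefer the server's own Retry-After hint; otherwise back off
        // exponentially (2s, 4s, 8s, ...).
        var delay = response.Headers.RetryAfter?.Delta
                    ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
        await Task.Delay(delay);
    }
}

A request factory is passed in because an HttpRequestMessage instance cannot be sent twice.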
Preventing
If you know the rate limits, you can pre-emptively limit your outbound call rate so you get into this situation less often.
For call-per-second limits, you can use a counter variable that you reset (in a thread-safe way) every second. If the known call limit would be exceeded, calculate when the counter will reset (store a timestamp when it does) and delay processing that long.
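A rough sketch of such a gate; the limit of 50 calls per second is purely illustrative.

using System;
using System.Threading.Tasks;

class PerSecondGate
{
    private readonly int _limit = 50;       // assumed limit, tune to the real one
    private int _count;
    private DateTime _windowStart = DateTime.UtcNow;
    private readonly object _lock = new object();

    public async Task WaitAsync()
    {
        while (true)
        {
            TimeSpan wait;
            lock (_lock)
            {
                var now = DateTime.UtcNow;
                if (now - _windowStart >= TimeSpan.FromSeconds(1))
                {
                    _windowStart = now;     // new one-second window
                    _count = 0;
                }
                if (_count < _limit)
                {
                    _count++;               // slot acquired, caller may proceed
                    return;
                }
                wait = _windowStart.AddSeconds(1) - now; // time until the window resets
            }
            await Task.Delay(wait);
        }
    }
}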
For a concurrent-call limit, a SemaphoreSlim works nicely. Set the maximum count to whatever your concurrent rate limit is. Acquire the semaphore before making a request and release it (in a finally block) after your call completes.
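And a minimal sketch of the semaphore approach; the limit of 10 concurrent calls is an assumption.

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static readonly SemaphoreSlim ConcurrencyGate = new SemaphoreSlim(10, 10);

static async Task<string> CallServiceAsync(HttpClient client, string url)
{
    await ConcurrencyGate.WaitAsync();   // acquire a slot before calling
    try
    {
        return await client.GetStringAsync(url);
    }
    finally
    {
        ConcurrencyGate.Release();       // always release, even on failure
    }
}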
If you have multiple servers subject to the same rate limit (e.g. if rate limiting is based on an API key rather than IP address), it gets harder to self-limit, but you can set self-limiting parameters (calls per second and concurrent calls) in a configuration file, and tune them over time to maximize your throughput without hitting excessive HTTP 429's.
I'm trying to get a fairly simple test scenario to work - I'd like to create a long-lived bidirectional streaming RPC that may sit idle for long periods of time (an Electron app with a local server).
A Node gRPC client starts a C# gRPC server locally and initiates a bidirectional stream. The streaming service receives each message, waits 50 ms, and sends it back.
The Node client test code is set up to send 5 messages, wait 30 seconds, and then send 5 more messages. The first 5 messages successfully roundtrip. The second 5 messages eventually roundtrip, but not until 5 minutes later. The server side code is not hit during this time.
I'm sure I'm being a baboon here, but I don't understand why the connection seems to be dying so fast. I'm also not sure what options could help here, if any. It seems like keepalive is intended for tracking whether the TCP connection is still alive, but doesn't actually help keep it alive. idleTimeout doesn't seem relevant either, because we're going to TRANSIENT_FAILURE status according to the enum documentation here.
This discussion from 2016 is close to what I'm trying to do, but the solution was a RYO heartbeat. This grpc-dotnet issue seems to rely on a heartbeat-type solution specific to ASP.NET, which is not currently used.
gRPC server logs:
After the first 5 messages are sent:
transport 000001A7B5A63090 set connectivity_state=4
Start BDP ping err..."Endpoint read failed" (paraphrasing)
5 minutes later right before the second set of 5 messages comes through:
W:000001A7B5AC8A10 SERVER [ipv6:[::1]:57416] state IDLE -> WRITING [RETRY_SEND_PING]
Node library is @grpc/grpc-js
tl;dr How can I keep the connection healthy & working in the case of downtime?
I'm using Google Cloud Storage to store and retrieve some files, and my problem is that the response times I'm getting are inconsistent, and sometimes very slow.
My application is an ASP.NET Core app running in the Google Container Engine. The Container Engine cluster is in europe-west1-c. The Cloud Storage bucket is Multi-Regional, in the location EU, and it's a secure bucket (not publicly accessible). I'm using the latest version of the official Google.Cloud.Storage.V1 SDK package to access the Cloud Storage. (I tried both 1.0.0 and the new 2.0.0-beta01.) I'm using a singleton instance of the StorageClient object, which should do connection pooling under the hood.
I'm measuring and logging the time it takes to download a file from Cloud Storage; this is the measurement I do:
var ms = new MemoryStream(); // destination for the download
var sw = Stopwatch.StartNew();
await client.DownloadObjectAsync(googleCloudOptions.StorageBucketName, filepath, ms);
sw.Stop();
So I'm directly measuring the SDK call without any of my own application logic.
The numbers I'm getting for this measurement look like this in an average period.
44ms
56ms
501ms
274ms
90ms
237ms
145ms
979ms
446ms
148ms
You can see that the variance is already pretty large to begin with (and the response time is often really sluggish).
But occasionally I even get response times like this (the slowest I've seen was over 10 seconds).
172ms
4,348ms
72ms
51ms
179ms
2,508ms
2,592ms
100ms
Which is really bad considering that the file I'm downloading is ~2 KB in size, my application is doing fewer than 1 request per second, and I'm running my application inside the Google Cloud. I don't think the bucket being cold can be the problem, since I'm mainly downloading the same handful of files, and I'm doing at least a couple of requests every minute.
Does anyone know what can be the reason for this slowness, or how I could investigate what's going wrong?
Update: Following @jterrace's suggestion, I've run gsutil perfdiag on the production environment, and uploaded both the terminal output and the generated JSON report here.
I also collected some more measurements, here you can see the statistics for the last 7 days.
So you can see that slow requests don't happen super-often, but a response time over half a second is not rare, and we even have a handful of requests over 5 seconds every day.
What I'd like to figure out is whether we're doing something wrong, or this is expected with Cloud Storage and we have to be prepared to be able to handle these slow responses on our side.
We have the same issue with GCS. The only answer we got (from GCS support) is to use exponential backoff.
The first request should use a 200ms timeout, the next try 400ms, and so on.
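A hedged sketch of that advice with the Google.Cloud.Storage.V1 client: retry with a doubling timeout (200ms, 400ms, 800ms, ...). The attempt cap is an assumption, and the destination is assumed to be seekable (e.g. a MemoryStream) so a partial download can be discarded.

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Google.Cloud.Storage.V1;

static async Task DownloadWithBackoffAsync(
    StorageClient client, string bucket, string objectName, Stream destination)
{
    var timeout = TimeSpan.FromMilliseconds(200);
    for (var attempt = 0; attempt < 5; attempt++, timeout += timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                await client.DownloadObjectAsync(
                    bucket, objectName, destination, cancellationToken: cts.Token);
                return; // downloaded within the current timeout
            }
            catch (OperationCanceledException)
            {
                destination.SetLength(0); // discard any partial download, then retry
            }
        }
    }
    throw new TimeoutException("Download kept timing out after 5 attempts.");
}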
A common problem I've seen in GCE is that, because gcloud clients have a heavy DNS dependency, bursts of traffic get throttled by DNS queries rather than by the actual clients (storage or otherwise). I highly recommend adding etcd or some other DNS cache to your container. Any real amount of traffic in GCE will choke otherwise.
We are running a .NET 4.5 console application that performs USNChanged polling on a remote LDAP server and then synchronizes the records into a local AD LDS on Windows Server 2008R2. The DirSync control was not an option on the remote server but getting the records isn't the problem.
The directory is quite large, containing millions of user records. The console app successfully pulls down the records and builds a local cache. It then streams through the cache and does lookup/update/insert as required for each record on the local directory. The various network constraints in the environment had performance running between 8 and 80 records per second. As a result, we used the Task Parallel Library to improve performance:
var totalThreads = Environment.ProcessorCount *2;
var options = new ParallelOptions { MaxDegreeOfParallelism = totalThreads };
Parallel.ForEach(Data.ActiveUsersForSync.Batch(250), options, (batch, loopstate) =>
{
if (!loopstate.IsExceptional
&& !loopstate.IsStopped
&& !loopstate.ShouldExitCurrentIteration)
{
ProcessBatchSync(batch);
}
});
After introducing this block, performance increased to between 1000 and 1500 records per second. Some important notes:
This is running on an eight-core machine, so it allows up to 16 simultaneous operations (Environment.ProcessorCount * 2)
The MoreLinq library's batching mechanism is used, so each task in the parallel set processes 250 records on a given connection (from the pool) before returning
Each batch is processed synchronously (no additional parallelism)
The implementation relies on System.DirectoryServices.Protocols (Win32), NOT System.DirectoryServices (ADSI)
Whenever a periodic full synchronization is executed, the system will get through about 1.1 million records and then AD LDS returns "The Server Is Busy" and the system throws a DirectoryOperationException. The number it completes before erroring is not constant but it is always near 1.1 million.
According to Microsoft (http://support.microsoft.com/kb/315071), the MaxActiveQueries value in AD LDS is no longer enforced in Windows Server 2008+. I can't change the value anyway; it doesn't show up. They also show the "Server is Busy" error coming back only from a violation of that value or from having too many open notification requests per connection. This code only sends simple lookup/update/insert LDAP commands and requests no notifications from the server when something is changed.
As I understand it, I've got at most 16 threads working in tandem to query the LDS. While they are doing it very quickly, that's the maximum number of queries coming in at any given moment, since each batch is processed single-threaded.
Is the Microsoft document incorrect? Am I misunderstanding another component here? Any assistance is appreciated.
Imagine this scenario: you have a WCF web service that gets hit up to a million times a day. Each hit contains an "Account ID" identifier. The WCF service is hosted in a distributed ASP.NET cluster and you don't have Remote Desktop access to the server.
Your goal is to save "number of hits per hour" for each Account ID into a SQL database. The results should look like this:
[Time], [AccountID], [NumberOfHits]
1 PM, Account ID (Bob), 10 hits
2 PM, Account ID (Bob), 10 hits
1 PM, Account ID (Jane), 5 hits
The question is: How can you do this without connecting to a SQL server database on every hit?
Here's one solution I thought of: store the temporary results in a System.Web.Caching.Cache object, listen for its expiration, and on expiration write all of the accumulated data to the database.
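Roughly, what I have in mind (WriteToDatabase is a placeholder for the actual persistence call, and the five-minute window is just an example):

using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

void StartBatch(Dictionary<string, int> hits)
{
    HttpRuntime.Cache.Insert(
        "hit-batch",
        hits,
        null,                            // no dependencies
        DateTime.Now.AddMinutes(5),      // flush window
        Cache.NoSlidingExpiration,
        CacheItemPriority.NotRemovable,  // don't let memory pressure evict it early
        OnBatchExpired);
}

void OnBatchExpired(string key, object value, CacheItemRemovedReason reason)
{
    WriteToDatabase((Dictionary<string, int>)value); // persist the accumulated counts
    StartBatch(new Dictionary<string, int>());       // start the next window
}

(The dictionary would of course still need its own locking for the per-hit increments.)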
Any thoughts on a better approach?
Deferred update is the key indeed, and you are on the right path with your local cache approach. As long as you don't have a requirement to display the last-updated count on each visit, the solution is simple: update a local cache of account_id->count and periodically sweep through this cache, replacing each count with 0 and adding the count to the total in the database. You may lose some visit counts if your ASP.NET process is lost, and your displayed hit count is not accurate (Node 1 in the ASP.NET farm returns its last count, Node 2 returns its own local one, different from Node 1).
If you must have an accurate display of counts with each returned result (whether it's a page return or a service return matters little), then it gets hairy quite fast. A centralized cache like Memcache can help to create a solution, but it is not trivial.
Here is how I would keep the local cache:
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

class HitCountCache
{
    class Counter
    {
        public uint count { get; set; }
        public string accountId { get; set; }
    }

    private Dictionary<string, Counter> _counts = new Dictionary<string, Counter>();
    private Object _lock = new Object();

    // invoke this on every call
    //
    public void IncrementAccountId(string accountId)
    {
        Counter counter;
        lock (_lock)
        {
            if (_counts.TryGetValue(accountId, out counter))
            {
                ++counter.count;
            }
            else
            {
                _counts.Add(accountId,
                    new Counter { accountId = accountId, count = 1 });
            }
        }
    }

    // Schedule this to be invoked every X minutes
    //
    public void Save(SqlConnection conn)
    {
        Counter[] counts;

        // Snap the counts, under lock
        //
        lock (_lock)
        {
            counts = _counts.Values.ToArray();
            _counts.Clear();
        }

        // Lock is released, can do DB work
        //
        foreach (Counter c in counts)
        {
            SqlCommand cmd = new SqlCommand(
                @"UPDATE table SET count += @count WHERE accountId = @accountId",
                conn);
            cmd.Parameters.AddWithValue("@count", (int)c.count);
            cmd.Parameters.AddWithValue("@accountId", c.accountId);
            cmd.ExecuteNonQuery();
        }
    }
}
This is a skeleton; it can be improved, and it can also be made to return the current total count if needed, or at least the total count as known by the local node.
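For example, the cache could be wired up like this (the connection string and the five-minute interval are placeholders):

using System;
using System.Data.SqlClient;
using System.Threading;

static readonly HitCountCache Cache = new HitCountCache();

// Flush the accumulated counts to the database every five minutes.
static readonly Timer FlushTimer = new Timer(_ =>
{
    using (var conn = new SqlConnection("...your connection string..."))
    {
        conn.Open();
        Cache.Save(conn);
    }
}, null, TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(5));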
One option is to dump the relevant information into your server logs (logging APIs are already optimised to deal with high transaction volumes) and reap them with a separate process.
You asked: "How can you do this without connecting to a SQL server database on every hit?"
Use connection pooling. With connection pooling, several connections to SQL Server are opened ONCE and then reused for subsequent calls. So on each database hit you do not need to connect to SQL Server, because you will already be connected and can reuse an existing connection for your database access.
Note that connection pooling is used by default with the SQL ADO.NET provider, so you might already be using it without even knowing.
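For example, both of the connections below come from the same pool because the connection strings match; the pool sizes shown are illustrative, not required.

using System.Data.SqlClient;

// Pooling is on by default; Min/Max Pool Size can be tuned in the connection string.
var connectionString =
    "Server=myServer;Database=myDb;Integrated Security=true;" +
    "Min Pool Size=5;Max Pool Size=100";

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();   // first call: a physical connection is created and pooled
    // ... run commands ...
}                  // Dispose returns the connection to the pool rather than closing it

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();   // same connection string: the pooled connection is reused
}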
An in-memory object as proposed is fastest but risks data loss in the event of an app or server crash. To reduce the risk of data loss, you can lazy-write the cached data to disk, then periodically read back from the cache file and write the aggregated information to your SQL server.
Any reason why they aren't using AppFabric or the like?
Can you get into the service implementation? If so, the way to handle this is to have the service implementation fire a "fire and forget" style logging call to whatever other service you've set up to log this puppy. It shouldn't hold up execution, should survive app crashes and the like, and won't require digging into the SQL angle.
I honestly wouldn't take the job if I couldn't get into the front end of things; most other approaches are doomed to fail here.
If your goal is performance on the website then, like another poster said, just use fire and forget. This could be a web service that you post the data to, or you can create a service running in the background listening on an MSMQ queue. I can give you more examples of this if you're interested. If you need to keep the website or admin tool in sync with the database, you can store the values in a high-performance cache like memcache at the same time you update the database.
If you want to run a batch of 100 queries on the DB in one query, then make a separate service, again with MSMQ, which polls the queue and waits for more than 100 messages in the queue. Once it detects there are 100 messages, it opens a transaction with MSDTC, reads all the messages into memory and batches them up to run in one query. MSMQ is durable, meaning that if the server shuts off or the service is down when the message is sent, it will still get delivered when the service comes online. Messages are only removed from the queue once the query has completed. If the query errors out or something happens to the service, the messages will still be in the queue for processing; you don't lose anything. MSDTC just helps you keep everything in one transaction, so if one part of the process fails, everything gets rolled back.
If you can't make a Windows service to do this, then just make a web service that you call. You still send the MSMQ message each time a page loads, and, say, once every 10 page loads you fire the web service to process all the messages in the queue. The only problem you might have is getting the MSMQ service installed; however, many hosting providers will install something like this for you if you request it.
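A rough sketch of the fire-and-forget send; the queue path is an assumption.

using System.Messaging;

const string QueuePath = @".\private$\hitqueue"; // assumed local queue

void EnqueueHit(string accountId)
{
    if (!MessageQueue.Exists(QueuePath))
        MessageQueue.Create(QueuePath, true); // true = transactional queue

    using (var queue = new MessageQueue(QueuePath))
    {
        // Recoverable = persisted to disk, so it survives a restart.
        queue.Send(new Message(accountId) { Recoverable = true },
                   MessageQueueTransactionType.Single);
    }
}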