We have an Azure-based ASP.NET Web Service that accesses an Azure KeyVault. We are seeing two instances in which a method "hangs" on a first try, and then works a minute or so later.
In both instances, a KeyVault access occurs. In both instances the problem started when we started using the KeyVault in these methods.
We have done very careful logging in the first instance, and cannot see anything else in our code that could cause the hang. The KeyVault access is the primary suspect.
In addition, if we run the app from our local servers (from Visual Studio), the KeyVault access works fine on the "first try". It only produces the "hang" error when it runs in production on Azure, and only on that "first try".
By "hang" I mean that in one instance, which is triggered by an external API, it takes at least 60 seconds (we can tell that because the external API times out.) In the other instance, which is triggered by a page request, several minutes can pass and the page just spins, at which point we assume the DB request or something else has timed out.
When I say "a minute or so later", that's as fast as we have timed the retry.
Is there some kind of issue or function where the KeyVault needs to be "warmed up" before it works on the first try?
Update: I'm looking at the code more carefully, and I see at least a couple of places where we can insert still more logging to get a more exact picture of where the failure occurs. I'm going to do that, and then I'll report back here.
Update: See answer below - major newbie error, has been corrected.
Found the problem, and the solution.
Key Vault access needs to be called from an async task, because there is a multi-second delay.
private async Task<string> GetKeyVaultSecretValue(varSecretParms) {
I don't understand the underlying technology, but apparently, if the call is made synchronously from within a standard code sequence, the server doesn't like to wait, and so the thread is abandoned or halts.
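A minimal sketch of the "async all the way" shape described above. It does not call Key Vault: GetKeyVaultSecretValueAsync is a hypothetical stand-in that uses Task.Delay to imitate the delay, and the names SecretReader, BuildConnectionStringAsync and "DbPassword" are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class SecretReader
{
    // Hypothetical stand-in for the real Key Vault call.
    // Task.Delay imitates the latency of the remote request.
    private static async Task<string> GetKeyVaultSecretValueAsync(string secretName)
    {
        await Task.Delay(TimeSpan.FromMilliseconds(200)).ConfigureAwait(false);
        return "value-of-" + secretName;
    }

    // The caller is async too and awaits the result,
    // instead of blocking the request thread with .Result or .Wait().
    public static async Task<string> BuildConnectionStringAsync()
    {
        string password = await GetKeyVaultSecretValueAsync("DbPassword").ConfigureAwait(false);
        return "Server=example;Password=" + password;
    }
}
```

Blocking on such a method with .Result or .Wait() from a classic ASP.NET request thread is a well-known way to deadlock. Whether that is exactly what happened in this case is an assumption; the sketch only shows the calling pattern that avoids it.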
According to your description, it seems that this is because Always On is not enabled for the Web App.
By default, web apps are unloaded if they are idle for some period of time. This lets the system conserve resources. In Basic or Standard mode, you can enable Always On to keep the app loaded all the time.
If possible, please enable Always On and try again.
When we moved from NSB5 to NSB6, we also looked into removing NServiceBus.Host and using Topshelf instead. When we did, our service no longer shows that it has stopped when we receive a critical failure.
As an example, when we have trouble reaching the database for any reason, I want the service to end, and Services Manager should show it as not running. Instead, it still says running even though the service has actually stopped. Therefore no recovery is run either.
This was working as we were using NServiceBus.Host.
I was looking in the wrong direction, towards Topshelf. The answer lies in how to configure NServiceBus to take care of critical errors.
endpointConfiguration.DefineCriticalErrorAction(OnCriticalError);
and
private async Task OnCriticalError(ICriticalErrorContext context)
{
    // Stop the endpoint when NServiceBus reports a critical error.
    await context.Stop().ConfigureAwait(false);
}
This worked for me.
I am posting this partly out of interest in how the Task Parallel Library works, and to spread knowledge. I also want to investigate whether my "Cancellation" updates are the reason for a new issue where the user is suddenly logged out.
The project I am working on has these components:
Web forms site. A website that acts as a portal for administrating company vehicles. Further referred to as "Web"
WCF web service. A backend service on a separate machine. Further referred to as "Service"
Third party service. Further referred to as "3rd"
Note: I am using .NET 4.0. Therefore the newer updates to the Task Parallel Library are not available.
The issue that I was assigned to fix was that the login function was very slow and CPU intensive. This was later admitted to be a problem in the Third party service. However, I tried to optimize the login behavior as well as I could.
The login request and response don't contain particularly much data. But to gather the response data, several API calls are made to the Third party service.
1. Pre changes
The Web invokes a WCF method on the Service for gathering "session data".
This method would sometimes take so long that it would timeout (I think the timeout was set to 1 minute).
A pseudo representation of the "GetSessionData" method:
var agreements = getAgreements(request);
foreach (var agreement in agreements)
{
    getAgreementDetails(agreement);
    var customers = getCustomersWithAgreement(agreement);
    foreach (var customer in customers)
    {
        getCustomerInfo(customer);
        getCustomerAddress(customer);
        getCustomerBranches(customer);
    }
}
var person = getPerson(request);
var accounts = getAccount(person.Id);
foreach (var account in accounts)
{
    var accountDetail = getAccountDetail(account.Id);
    foreach (var vehicle in accountDetail.Vehicles)
    {
        getCurrentMilageReport(vehicle.Id);
    }
}
return sessionData;
See gist for code snippet.
This method quickly becomes heavy the more agreements and accounts the user has.
2. Parallel.ForEach
I figured that I could replace foreach loops with a Parallel.ForEach(). This greatly improved the speed of the method for larger users.
See gist for code snippet.
3. Cancel
Another problem we had was that when the web services server is maxed out on CPU usage, all method calls become much slower and could result in a timeout for the user. A popular response to a timeout is to try again, so the user triggers another login attempt, which is "queued"(?) due to the high CPU usage. All this happens while the first request has not returned yet.
We discovered that the request is still alive on the Service after the web site times out. So we decided to implement a similar timeout on the Service side.
See gist for code snippet.
The idea is that GetSessionData(..) is invoked with a CancellationToken that triggers Cancel after about the same time as the Web timeout, so that no work is done if no one is there to show or use the results.
I also implemented the cancellation for the method calls to the Third party service.
Is it correct to share the same CancellationToken for all of the loops and service calls? Could there be an issue when all threads are "aborted" by throwing the cancel exception?
See gist for code snippet.
Is it correct to share the same CancellationToken for all of the loops and service calls? Could there be an issue when all threads are "aborted" by throwing the cancel exception?
Yes, it is correct. And yes, there could be an issue with throwing a lot of exceptions at the same time, but only in specific situations with a huge amount of parallel work.
Several hints:
Use one CancellationTokenSource per complete action, for example, per request. Pass the same CancellationToken from this source to every asynchronous method
You can avoid throwing an exception and just return from a method. Later, to check that the work was done and nothing was cancelled, you check IsCancellationRequested on the cts
Check the token for cancellation inside loops on each iteration and just return if cancelled
Use threads only when there is IO work, for example, when you query something from a database or send requests to other services; don't use them for CPU-bound work (retracted; see the correction below)
Correction: I was tired at the end of the working day and suggested a bad thing in the last hint. You don't need threads for IO-bound work, for example, waiting for a response from a database or a third-party service. Use threads only for CPU-bound computations.
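The first three hints can be sketched as follows. This is an illustration under assumptions, not the code from the gist: SessionLoader, LoadAll and FetchDetails are hypothetical names, and Thread.Sleep stands in for a call to the Third party service. Only APIs available in .NET 4.0 are used.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class SessionLoader
{
    // Hypothetical stand-in for one call to the Third party service.
    private static void FetchDetails(int id, CancellationToken token)
    {
        // Return quietly instead of throwing when cancellation was requested.
        if (token.IsCancellationRequested) return;
        Thread.Sleep(50); // simulated remote call
    }

    // Returns true if all work ran to completion, false if the shared token was cancelled.
    public static bool LoadAll(IEnumerable<int> agreementIds,
                               IEnumerable<int> accountIds,
                               CancellationToken token)
    {
        // The same token is shared by every loop and every service call.
        var options = new ParallelOptions { CancellationToken = token };
        try
        {
            Parallel.ForEach(agreementIds, options, id => FetchDetails(id, token));
            Parallel.ForEach(accountIds, options, id => FetchDetails(id, token));
        }
        catch (OperationCanceledException)
        {
            // Parallel.ForEach itself throws this once it observes the cancelled token.
            return false;
        }
        // Covers the case where cancellation arrived just as the last loop finished.
        return !token.IsCancellationRequested;
    }
}
```

A caller would create one CancellationTokenSource per request and pass cts.Token to LoadAll. CancelAfter only arrived in .NET 4.5, so on .NET 4.0 something like a System.Threading.Timer that calls cts.Cancel() after the timeout has to take its place.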
Also, I reviewed your code again and found several bottlenecks:
You can call GetAgreementDetail, GetFuelCards, GetServiceLevels and GetCustomers asynchronously; start all four requests at once instead of waiting for each one before starting the next
You can call GetAddressByCustomer and GetBranches in parallel as well
I noticed that you use a mutex. I guess it is for protecting agreementDto.Customers and response.Customers when adding to them. If so, you can reduce the scope of the lock
You can start working with Vehicles earlier, as you know the UserId at the beginning of the method; do it in parallel too
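A rough sketch of starting the four independent requests together on .NET 4.0, where async/await is not available. The Fetch method is a hypothetical stand-in for GetAgreementDetail, GetFuelCards, GetServiceLevels and GetCustomers, whose real signatures are not shown in the question, and AgreementLoader is an invented name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AgreementLoader
{
    // Hypothetical stand-in for one remote call; the sleep simulates network latency.
    private static string Fetch(string what, int agreementId)
    {
        Thread.Sleep(100);
        return what + ":" + agreementId;
    }

    public static string[] LoadAgreement(int agreementId)
    {
        // Start all four calls before waiting on any of them.
        Task<string> details = Task.Factory.StartNew(() => Fetch("details", agreementId));
        Task<string> fuelCards = Task.Factory.StartNew(() => Fetch("fuelCards", agreementId));
        Task<string> serviceLevels = Task.Factory.StartNew(() => Fetch("serviceLevels", agreementId));
        Task<string> customers = Task.Factory.StartNew(() => Fetch("customers", agreementId));

        // Block once, until every call has finished.
        Task.WaitAll(details, fuelCards, serviceLevels, customers);

        return new[] { details.Result, fuelCards.Result, serviceLevels.Result, customers.Result };
    }
}
```

This version still occupies a thread pool thread per call while it waits. If the service proxy exposes Begin/End method pairs, wrapping them with Task.Factory.FromAsync would avoid that, in line with the correction about IO-bound work above.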
I have a REST service in a self hosted ASP.Net WebApi application (Console).
Some clients poll the server in specific intervals to fetch new data. In general all is working fine.
The problem is, that the server stops responding to requests after some random duration (~30mins - 2.5 hours). All client requests start to time out.
The weird thing is, the server doesn't seem to receive the requests anymore (no controller method is invoked anymore). The server didn't throw any exceptions and the console app is still responsive. So I can only suppose there is a problem before the request reaches the API controller.
In the debugger everything seems fine.
How can I diagnose such an issue?
What else can I try to fix the described behavior?
Notes:
Tested on multiple systems
.Net 4.5.1
Asp.Net WebApi 5.1.2
I have found the issue: this is happening because of connection leaks. If you open connections and don't close them correctly, either after the request is finished or when an exception occurs, the number of open connections will eventually reach its maximum. Either raise the maximum number of open connections in the connection string or (the preferred way) make sure your code always closes the connection:
SqlConnection myConnection = new SqlConnection(ConnectionString);
try
{
    myConnection.Open();
    someCall(myConnection);
}
finally
{
    // Runs even if someCall throws, so the connection always goes back to the pool.
    myConnection.Close();
}
Credit goes to "How can I solve a connection pool problem between ASP.NET and SQL Server?", where you can read more about this.
In my case, the issue was caused by never-ending tasks. Due to a misuse of the Reactive Extensions API, I randomly created never-ending tasks. It seems that at some point the task scheduler simply couldn't handle them anymore, although I'm not completely sure about that.
Lesson learned: It seems that by doing bad things in your app code (too many tasks, SQL connections ...) you can kill the WebApi infrastructure, so that it doesn't handle requests at any level anymore.
I have created a Windows service using C#/.NET. The service updates Oracle tables whenever it receives new files. I use a timer control with an interval of 30 seconds, and ODP.NET as the data access layer.
The very first time I get an error, but subsequently the service works fine. If the service is idle for a long time and then receives a file, I get a "connection lost" error, but the next file it receives is loaded successfully.
Kindly suggest: do I need to add any properties to the connection string to fix this error?
Hello Karthik. Two issues here, it seems.
It is best to open and close a new connection each time your service is called.
Windows services quickly go to a latent state if not called, and they will respond more slowly on the next call. If the caller does not have a sufficient timeout value to accommodate this lag, it will return a timeout error. If you address these two points you should be fine.
I have a Lync 2013-based application which:
connects to a UserEndpoint (hereinafter CallCenter)
redirects calls made to CallCenter according to bla bla bla business logic.
At times, a user will see CallCenter in their standard Lync 2013 Client as Online, but if that user attempts to start an IM call with CallCenter, the user receives the message "We couldn't send this message because CallCenter is unavailable or offline."
I haven't been able to identify the process that leads up to this, but if it's happened to one user, then all of the other users experience the same problem when attempting to call CallCenter. The only way I have been able to recover CallCenter has been to restart my application. Regular interaction with CallCenter then resumes without a problem.
If CallCenter is indeed "unavailable or offline", then why does its Presence appear as "Online"? Is there a need to renew / keep CallCenter's connection alive every so often?
For reference, I connect CallCenter like so:
UserEndpointSettings settings = new UserEndpointSettings(userURI, _ProxyHost, _ProxyPort);
settings.AutomaticPresencePublicationEnabled = true;
settings.Presence.UserPresenceState = PresenceState.UserAvailable;
_userEndpoint = new UserEndpoint(_Platform.CollabPlatform, settings);
_userEndpoint.BeginEstablish(res =>
{
    try
    {
        _userEndpoint.EndEstablish(res);
        _userEndpoint.StateChanged += new EventHandler<LocalEndpointStateChangedEventArgs>(_userEndpoint_StateChanged);
    }
    catch (Exception ex)
    {
        LogError(ex, ErrorReference.EndpointEstablishFailed);
    }
}, null);
In the client, when you go offline or experience an error, your presence reflects that (most of the time, that is). This can lead you to believe that the status portion of presence [1] is somehow tied to actual availability.
When you're working with UCMA, you are given ultimate control over everything related to your endpoint. As you've seen, you can make your UCMA application do things that would otherwise be impossible in the regular client. You don't have to publish any presence status (leaving you "offline" to your users), yet the service can still send/receive IMs. And, as you've seen, your service can be "Available" and yet ... have no capability to do anything but publish its status [2].
If you fail to wire up the appropriate modality (in your case IM), or your application encounters an exception which results in a particular modality no longer working (I suspect this may be your actual problem), the status of your service will still be available.
Begin/EndTerminate on the UserEndpoint should publish Offline for you automatically and publishing a presence other than Available is the only way to guarantee the presence won't be "Available" for the lifetime of your application (and even after the application ends/dies prematurely, though this is sometimes rectified by the server -- sometimes).
Here's how I'd attack resolving this issue. Ignore the presence problem and ignore the error. They're red herrings. Many problems result in the "unavailable or offline" message that have nothing to do with the service actually being stopped.
Instead, figure out why your calls aren't connecting.
If the call takes a while before you receive the error, check for deadlocks or circumstances where the Thread Pool has no room for another thread. Troubleshooting involves reviewing your code for race conditions and the myriad of other things that multi-threaded applications throw your way. If the IMCall fails instantly, check around the parts that handle incoming calls. In the latter case, your subscription may be gone (too many causes to list here, most of which are .Net related, not UCMA related), or your service may be dead.
If the importance of presence to your application is only to show it as "available" or "offline" when it is actually able to send/receive an IM, you're going to want to ensure your application terminates the endpoint properly during tear-down (including in the case of a critical failure: catch-terminate-rethrow or whatever is appropriate in your case).
[1] Be careful when thinking about the term "presence" as it relates to Lync. Presence contains availability status, modality specific states, capabilities (IM/Voice, etc), the "note" and contact information.
[2] This seems like a bizarre thing to do, however, it gave me the ability to use an ApplicationEndpoint to report on the availability of a web service (unrelated to Lync) that I wanted to be able to view in the Mobile client without connecting via VPN. When doing something like this, it's really important to publish the capabilities of your endpoint -- this will explicitly signal to your connected clients what your service can and cannot do.
[Final Footnote] There are a few ways to publish presence. The mechanism you're using to publish is the simplest and most logical to use if you're just interested in telling your users that the "service is here"/"service is not here" which is documented rather well here: Simplified Presence Publication for Endpoints