.NET remoting stops every 100 seconds - C#

We have a very strange problem: one of our applications continually queries a server using .NET remoting, and every 100 seconds the application stops querying for a short duration, then resumes. The problem is on the client and not on the server, because the application actually queries several servers at the same time and stops receiving data from all of them simultaneously.

100 seconds is a giveaway number, as it's the default timeout for a WebRequest in .NET.
I've seen in the past that the PSI (Project Server Interface within Microsoft Project) didn't override the timeout, so the default of 100 seconds was applied and would terminate anything talking to it for longer than that.
Do you have access to all of the code, and are you sure you have set timeouts where applicable, so that no defaults are being applied unbeknownst to you?
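For reference, here's a minimal sketch of overriding that default on a plain HttpWebRequest (the URL is a placeholder); the 100-second figure comes from HttpWebRequest.Timeout, which defaults to 100,000 ms:

```csharp
using System;
using System.Net;

class TimeoutDemo
{
    static void Main()
    {
        // Placeholder URL - substitute your own endpoint.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");

        // Timeout applies to GetResponse()/GetRequestStream();
        // the default is 100,000 ms, i.e. exactly 100 seconds.
        request.Timeout = 5 * 60 * 1000; // 5 minutes

        // ReadWriteTimeout governs reads/writes on the response
        // stream once the connection is established (default 5 minutes).
        request.ReadWriteTimeout = 5 * 60 * 1000;

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine(response.StatusCode);
        }
    }
}
```

If the client talks over the remoting HttpChannel rather than raw web requests, it's also worth checking whether the channel configuration sets its own timeout property.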

I've never seen that behavior before, and unfortunately it's a vague enough scenario that I think you're going to have a hard time finding someone on this board who's encountered the problem. It's likely specific to your application.
I think there are a few investigations you can do to help you narrow down the problem.
Determine whether it's the client or the server that is actually stalling. If you have trouble determining this, try installing a packet sniffer and monitor the traffic to see who sent the last data. You likely won't be able to read the binary data, but at least you will get a sense of who is lagging behind.
Once you figure out whether it's the client or server causing the lag, attempt to debug into the application and get a breakpoint where the hang occurs. This should give you enough details to help track down the problem. Or at least ask a more defined question on SO.

How is the application coded to implement the continuous querying? Is it a tight loop, a loop with a Thread.Sleep, or a timer?
It would first be useful to determine whether your system is executing this "trigger" in your code when you expect it to, or whether it is and the remoting server is simply not responding.
If you cannot reproduce this issue in a development environment where you can debug it, then I suggest you add code to this loop to write to a log file (or some other persistence mechanism) each time it "should" be examining whatever conditions it uses to decide whether to query the remoting server, and then review those logs when the problem recurs.
If you can do the same in your remoting server, recording when the server receives a remoting request, that would help as well.
One more thought (I don't know how you have coded this): if you are using a separate thread in the client to issue the remoting request, and the channel is being registered and unregistered on that separate thread, make sure you are deconflicting the requests, because you can't register the same port twice on the same machine at the same time
(although this should probably have raised an exception in your client if this were the issue).
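As a sketch of the logging suggested above, assuming the client polls on a timer (the interval and log path are placeholders):

```csharp
using System;
using System.IO;
using System.Threading;

class PollingLogger
{
    static Timer _timer;

    static void Main()
    {
        // Poll once a second; both values are illustrative.
        _timer = new Timer(Poll, null, TimeSpan.Zero, TimeSpan.FromSeconds(1));
        Console.ReadLine();
    }

    static void Poll(object state)
    {
        // Log *before* the remoting call so a hang is visible in the log.
        Log("poll-start");
        try
        {
            // ... issue the remoting call here ...
            Log("poll-ok");
        }
        catch (Exception ex)
        {
            Log("poll-error " + ex.GetType().Name);
        }
    }

    static void Log(string message)
    {
        // Timestamps let you spot the 100-second gaps after the fact.
        File.AppendAllText(@"C:\temp\poll.log",
            DateTime.Now.ToString("o") + " " + message + Environment.NewLine);
    }
}
```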

Http connections slow down or deadlock with .NET HttpClient

We have an asp.net webapi application that needs to issue a lot of calls to other web applications (it's basically a reverse proxy). To do this we use the async methods of the HttpClient.
Yes, we have seen the hints about using only one HttpClient instance and not to dispose of it.
Yes, we have seen the hints about setting configuration values, especially the problem with the lease timeout. Currently we set ConnectionLimit = CPU*12, ConnectionLeaseTimeout = 5min and MaxIdleTime = 30s.
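For context, the settings listed above translate to ServicePoint configuration along these lines (a sketch; Environment.ProcessorCount stands in for "CPU", and the backend URL is a placeholder):

```csharp
using System;
using System.Net;

class ServicePointConfig
{
    static void Main()
    {
        // ConnectionLimit = CPU * 12
        ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 12;

        // ConnectionLeaseTimeout and MaxIdleTime are per-ServicePoint,
        // so they have to be applied for each backend host the proxy talks to.
        var sp = ServicePointManager.FindServicePoint(new Uri("http://backend.example.com/"));
        sp.ConnectionLeaseTimeout = (int)TimeSpan.FromMinutes(5).TotalMilliseconds;
        sp.MaxIdleTime = (int)TimeSpan.FromSeconds(30).TotalMilliseconds;
    }
}
```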
We can see that the connections behave as desired. The throughput in a load test was also very good. However we are facing issues where occasionally the connections stop working. It seems to happen when a lot of requests are coming in (and, being a reverse proxy, cause new requests to be issued) and it happens mostly (but not only) with the slowest of all backend applications. The behaviour is then that it takes forever to finish the requests to this endpoint or they simply end in a timeout.
An IISReset on the server hosting our reverse proxy application makes the problems go away (for a while).
We have investigated in several areas already:
Performance issues in the remote web application: although it behaves exactly as if this were the case, performance is good when the same requests are issued locally on the remote server, and the values for CPU / network etc. are low.
Network issues (bandwidth, router, firewall, load balancers): Possible but rather unlikely since everything else runs stable and our hoster is involved in the analysis too.
Threadpool starvation: not impossible, but rather theoretical - sure, we have a lot of async calls, but shouldn't async help with exactly this issue?
HttpCompletionOption.ResponseHeadersRead: Not a problem by itself but maybe one piece of the puzzle?
The best explanation so far focuses on the ConnectionLimit: we started setting the values mentioned above only recently, and this seems to have triggered the problems. But why would it? Shouldn't reusing connections, instead of opening a new one for every request, be an improvement? And the values we set seem rather conservative.
We have started to experiment with these values lately to see their impact in production, yet it is still unclear to us whether this is the only cause, and we'd appreciate a more straightforward approach for analysis. Unfortunately a memory dump and netstat printouts did not help any further.
Some suggestions about how to analyze or hints about possible causes would be highly appreciated.
***** EDIT *****
Setting the connection limit to 1000 solves the issue! So the question remains: why is that the case? From what we know, the default connection limit is 2 in a non-web application and 1000 in a web application. MS suggests a default value of CPU*12 (but apparently didn't implement it like that?!), so our change was basically to go from 1000 down to 48. Still, we can see that only a handful of connections are open. Can anyone shed some light on this? What is the exact behaviour around opening new connections, reusing existing ones, pipelining etc.? Is there any source of information on this?
Does ConnectionLimit mean ServicePointManager.DefaultConnectionLimit? Yes, it matters. When the value is X and there are already X requests awaiting responses, a new request will not be sent until one of the previous requests has finished.
I posted a follow up question here: How to disable pipelining for the .NET HttpClient
Unfortunately there were no real answers to any of my questions. We ended up leaving the ConnectionLimit at 1000 (a workaround only, but the only solution we were able to find).

Slow initial connection to API

I've got an API written in C# (Web Forms) backed by a SQL Server 2008 database, accepting JSON POST data, running on an AWS EC2 VM. My problem is that the "first" use of this API is rather slow to respond.
What I mean by "first" is that if I were to wait for an hour or so, then post some data, that would be the first. Subsequent posts would process rather quickly in comparison, and I would need to wait another hour or so before experiencing the slow "first" transaction again.
Since only the initial post is slow, it makes me wonder if something is "spinning down" after being idle for some time, and then spinning up again upon first use, adding the extra time.
Things I have tried -
Run program through a performance profiler - This didn't really help. As far as I can see, the program itself doesn't have any obvious parts that run very slowly or inefficiently.
Change configuration to persist at least 1 connection to the database at all times. Again, no real change. I did this by adding "Min Pool Size=1;Max Pool Size=100" to my connection string.
Change configuration to use named pipes instead of TCP. Once again, no real change. I did this by adding "np:" before the server specified in my connection string, e.g. server=np:MyServer;database=MyDatabase;
Is there anything else I can do to diagnose the problem? What else should I be looking for in this scenario?
Chances are your app pool is shutting down after a designated period of non-use. The first call after the shutdown forces everything to be loaded back into memory, which explains the lag.
You could play with these settings: http://technet.microsoft.com/en-us/library/cc771956%28v=ws.10%29.aspx to see if you get the desired effect, or set up a Task Scheduler job that makes at least one call every 10 minutes or so with a simulated post - a simple PowerShell script could handle that for you and will keep everything 'primed' for the next use.
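If you go the scheduled-task route, the keep-alive itself can be tiny; a C# equivalent of the suggested PowerShell one-liner might look like this (the URL is a placeholder):

```csharp
using System.Net;

// Minimal keep-alive pinger, intended to be run from Task Scheduler
// every ~10 minutes so the app pool never hits its idle timeout.
class KeepAlive
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Any request that reaches the application will do;
            // a lightweight health-check endpoint is ideal.
            client.DownloadString("http://myapi.example.com/ping");
        }
    }
}
```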

Multi-server n-tier synchronized timing and performance metrics?

[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows: Load balanced front end web servers (IIS) running an MVC application (C#). A home-grown service bus, implemented with MSMQ running in domain-integration mode. Five 'worker pool' servers, running our Windows Service, which responds to requests placed on the bus. Back end SQL Server 2012 database, mirrored and replicated.
All servers have high spec hardware, running Windows Server 2012, latest releases, latest windows update. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
However, we soon realised that we cannot rely on the Windows time as an accurate measure, due to the fact that many of our processes are down to the 5-100ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get an accurate, coordinated and synchronized time that is accurate to the 5ms level? If we have to call out to an external (web)service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
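One partial mitigation worth noting: the time spent *on* each server can be measured without any cross-machine synchronization, because Stopwatch is a relative, high-resolution timer; only the network gaps between servers then depend on synchronized wall-clock time. A sketch of such a metrics record (the names are illustrative, not from the original system):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative metrics record carried inside the bus message.
[Serializable]
class HopMetrics
{
    public string ServerName;
    public DateTime ArrivedUtc;  // wall clock: only trustworthy to ~1-2s across machines
    public long ProcessingMs;    // Stopwatch-based: reliable to sub-ms on one machine
}

class Worker
{
    static void Handle(List<HopMetrics> trail)
    {
        var hop = new HopMetrics
        {
            ServerName = Environment.MachineName,
            ArrivedUtc = DateTime.UtcNow
        };
        var sw = Stopwatch.StartNew();

        // ... do the actual work for this hop ...

        hop.ProcessingMs = sw.ElapsedMilliseconds;
        trail.Add(hop); // travels back with the reply for the MVC app to dump
    }
}
```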
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurs when the software tests whether a queue has been created before it opens it, so it was essentially looking up the queue twice, which is fairly expensive. The issue has now gone away.
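For anyone who hits the same thing: MessageQueue.Exists is documented as an expensive call, so the usual fix is to take it off the per-message path, for example by checking once and caching (a sketch; note that Exists only works with path names, not format names):

```csharp
using System.Messaging;

class QueueOpener
{
    static MessageQueue _queue;

    // Do the (expensive) existence check once, at startup, rather
    // than on every open. MessageQueue.Exists is a costly call,
    // so it should never sit on a per-message code path.
    static MessageQueue GetQueue(string path)
    {
        if (_queue == null)
        {
            if (!MessageQueue.Exists(path))
                MessageQueue.Create(path);
            _queue = new MessageQueue(path);
        }
        return _queue;
    }
}
```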
What you should try is using the Performance Monitor that's part of Windows itself. What you can do is create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to monitor for.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.

How to prevent NHibernate long-running process from locking up web site?

I have an NHibernate MVC application that is using ReadCommitted Isolation.
On the site, there is a certain process that the user could initiate, and depending on the input, may take several minutes. This is because the session is per request and is open that entire time.
But while that runs, no other user can access the site (they can try, but their requests won't go through until the long-running operation is finished).
What's more, I also have a need to have a console app that also performs this long running function while connecting to the same database. It is causing the same issue.
I'm not sure what part of my setup is wrong, any feedback would be appreciated.
NHibernate is set up with fluent configuration and StructureMap.
Isolation level is set as ReadCommitted.
The session factory lifecycle is HybridLifeCycle (which on the web should be Session per request, but on the win console app would be ThreadLocal)
It sounds like your requests are waiting on database locks. Your options are really:
Break the long running process into a series of smaller transactions.
Use ReadUncommitted isolation level most of the time (this is appropriate in a lot of use cases).
Judicious use of the Snapshot isolation level (assuming you're using MS SQL Server 2005 or later).
(N.B. I'm assuming the long-running function does a lot of reads/writes and the requests being blocked are primarily doing reads.)
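For the Snapshot option, the isolation level can be requested when the NHibernate transaction is opened (a sketch; snapshot isolation also has to be enabled on the database itself via ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON):

```csharp
using System.Data;
using NHibernate;

class SnapshotExample
{
    // 'sessionFactory' is assumed to come from your existing
    // Fluent/StructureMap wiring.
    static void RunLongProcess(ISessionFactory sessionFactory)
    {
        using (ISession session = sessionFactory.OpenSession())
        // Readers under snapshot isolation see a consistent point-in-time
        // view and do not block on writers' locks.
        using (ITransaction tx = session.BeginTransaction(IsolationLevel.Snapshot))
        {
            // ... long-running reads/writes ...
            tx.Commit();
        }
    }
}
```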
As has been suggested, breaking your process down into multiple smaller transactions will probably be the solution.
I would suggest looking at something like Rhino Service Bus or NServiceBus (my preference is Rhino Service Bus - I find it much simpler to work with personally). What that allows you to do is separate the functionality down into small chunks, but maintain the transactional nature. Essentially with a service bus, you send a message to initiate a piece of work, the piece of work will be enlisted in a distributed transaction along with receiving the message, so if something goes wrong, the message will not just disappear, leaving your system in a potentially inconsistent state.
Depending on what you need to do, you could send an initial message to start the processing, and then after each step, send a new message to initiate the next step. This can really help to break down the transactions into much smaller pieces of work (and simplify the code). The two service buses I mentioned (there is also Mass Transit), also have things like retries built in, and error handling, so that if something goes wrong, the message ends up in an error queue and you can investigate what went wrong, hopefully fix it, and reprocess the message, thus ensuring your system remains consistent.
Of course whether this is necessary depends on the requirements of your system :)
Another, but more complex, solution would be:
You build a background robot application which runs on one of the machines.
This background worker robot receives "worker jobs" (the ones initiated by the user).
The robot then processes the jobs step by step in the background.
Pitfalls are:
- you have to program this robot to be very stable
- you need to monitor the robot somehow
Sure, this involves more work - on the flip side, you will have the option to integrate more job types, enabling your system to process different things in the background.
I think the design of your application / SQL statements has a problem. Unless you are Facebook, I don't think any process should take this long; it is better to review your design and find where the bottleneck is, instead of trying to keep this long-running process going.
Also, sometimes an ORM is not a good fit for every scenario - did you try using a stored procedure?

How do I handle WCF Call lifecycles under load when timeouts are expected?

I have a nice fast task scheduling component (windows service as it happens but this is irrelevant), it subscribes to an in memory queue of things to do.
The queue is populated really fast ... and when I say fast I mean fast ... so fast that I'm experiencing problems with one particular part.
Each item in the queue gets a "category" attached to it and is then passed to a WCF endpoint to be processed and saved in a remote db.
This is presenting a bit of a problem.
The "queue" can be processed at millions of items per minute, whereas the WCF endpoint will only realistically handle about 1000 to 1200 items per second, and many of those are "stacked" while waiting for a slot to dump them to the db.
My WCF client is configured so that the call is fire-and-forget (deliberately); my problem is that when the call is made, occasionally a timeout occurs, and that's when the headaches begin.
The thread just seems to stop after the timeout, never dropping into my catch block ... it just sits there. What's even more confusing is that this is intermittent: it only happens when the queue is dealing with extreme loads and the WCF endpoint is overtaxed, and even in that scenario it's only about once a fortnight.
This code is constantly running on the server, round the clock 24/7.
So ... my question ...
How can I identify the edge case that is causing my problem so that I can resolve it?
Some extra info:
The client calling the WCF endpoint seems to throttle itself automatically, because I'm limiting the number of threads making calls, and the code hangs about until a call is considered complete (I'm thinking this is an HTTP-level thing, as I'm not asking the service for the result of my method call).
The db is talked to with EF, which seems never to open more than a fixed number of connections (quite a low number too, which is cool), and the WCF endpoint, from call reception onwards, seems super reliable.
The problem seems to lie between the queue processor and the WCF endpoint.
The queue processor has a single instance of my WCF endpoint client which it reuses for all calls ... (is it good practice to rebuild this endpoint per call? - bear in mind the number of calls here).
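On the fire-and-forget point: in WCF that is normally expressed as a one-way operation, and a one-way call still blocks until the transport accepts the message, just not until the service processes it - which matches the "hangs about until a call is considered complete" behaviour described above. A sketch with illustrative names:

```csharp
using System.ServiceModel;

[ServiceContract]
interface IItemSink
{
    // IsOneWay = true means the client does not wait for a reply
    // message, but the call can still block (and time out) while
    // the HTTP request itself is being accepted by the server.
    [OperationContract(IsOneWay = true)]
    void Submit(string category, string payload);
}
```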
Final note:
It's a peculiar "module" of functionality: under heavy load for hours at a time it's stable, but for some reason this odd thing happens, resulting in the whole lot just stopping and not recovering. The call is wrapped in a try/catch, but seemingly even if the catch is hit (which isn't guaranteed), the code doesn't recover / drop out as expected ... it just hangs.
Any ideas?
Please let me know if there's anything else I can add to help resolve this.
Edit 1:
binding - basicHttpBinding
error handling - no code written other than wrapping the WCF call in a try catch.
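For reference, "increase the timeout settings on the client config" amounts to something like the following binding configuration (the values are illustrative; the default sendTimeout, which covers the whole send-and-wait window, is one minute):

```xml
<system.serviceModel>
  <bindings>
    <basicHttpBinding>
      <!-- sendTimeout is the one that matters when a saturated
           service is slow to accept the call. -->
      <binding name="LongTimeoutBinding"
               openTimeout="00:01:00"
               sendTimeout="00:05:00"
               receiveTimeout="00:10:00"
               closeTimeout="00:01:00" />
    </basicHttpBinding>
  </bindings>
</system.serviceModel>
```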
My solution appears to be to increase the timeout settings in the client config to allow the server more time to respond.
The net result is that while the database is busy saving data (effectively the slowest part of this process) the calling client sits and waits (on all threads, though seemingly not as long as I would have liked).
This issue seems to be the net result of a lot of multithreaded calls to the WCF service without giving it enough time to respond.
The high load is not continuous; the service usage seems to spike and then tail off, and adding to the expected response time allows spikes to be filtered through as they happen.
A key note:
Way too many calls may result in the server / service treating them as a DoS-type attack and simply terminating the connection.
That isn't what I'm getting here, but some fine tuning and time may lead to it ...
Time for some bigger servers!
