Calling a WCF-published orchestration from a C# program usually gives sub-second response time. However, on some occasions, it can take 20-50 seconds between the call in the C# program and the first trace message from the orchestration. The C# program that calls the WCF service runs under HIS/HIP (Host Integration Server/CICS Host-Initiated Processing).
Almost every time I restart the HIS/HIP service, we get a very slow response time, and thus a timeout in CICS. I'm also afraid it might happen during the day if things "go cold" - in other words, maybe things are being cached. Even first-time JIT compiles shouldn't take 20-50 seconds, should they? The other thing that seems strange is that the slow response appears to be the loading of the orchestration, which runs under the BizTalk service, not the HIP service that I cycled.
The fear is that when we go live, the first user in the morning (or after a "cold spell") will get the timeout. The second time they try it after the timeout, it is always fast.
I've done a few tests by restarting each of the following:
1) BizTalk services
2) IIS
3) HIS/HIP Transaction Integrator (HIP Service)
Restarting any one of them tends to cause about a 20-second delay.
Restarting all 3 is like the kiss of death - about a 60-second delay before the first trace appears from the orchestration.
The HIP program always gives its first trace quickly, even when the HIP service is restarted. I'm not sure why restarting HIP slows down the startup of the orchestration.
Thanks,
Neal Walters
I have seen this kind of behavior with the MQSeries adapter as well. After a period of inactivity, the COM+ components which enable communication with MQSeries shut down.
What we had was a 10-minute timer which would force some sort of keep-alive message. I don't know if you have a non-destructive call which can be sent, or if you can build one into the system just for this purpose.
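To illustrate, here's a minimal sketch of that kind of keep-alive in C#. The URL and the 10-minute interval are placeholders; any cheap, non-destructive call that exercises the path you care about would do:

    using System;
    using System.Net;
    using System.Timers;

    class KeepAlive
    {
        static void Main()
        {
            // Fire every 10 minutes - comfortably inside the idle-shutdown window.
            var timer = new Timer(TimeSpan.FromMinutes(10).TotalMilliseconds);
            timer.Elapsed += (s, e) =>
            {
                try
                {
                    using (var client = new WebClient())
                    {
                        // Placeholder URL - a real no-op operation on the service
                        // is better than just fetching metadata, since it warms
                        // the whole call path.
                        client.DownloadString("http://myserver/MyService.svc?op=ping");
                    }
                }
                catch (WebException ex)
                {
                    Console.WriteLine("Keep-alive failed: " + ex.Message);
                }
            };
            timer.Start();
            Console.WriteLine("Keep-alive running; press Enter to quit.");
            Console.ReadLine();
        }
    }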
I have the same problem with a BizTalk flow that needs to respond within 2 seconds: after it sat unused for some time, reloading the DLL into the cache caused a timeout.
We found a solution in MS's Orchestration Engine Configuration documentation, where they explain how to avoid unloading of the dlls:
Using the options SecondsIdleBeforeShutdown and SecondsEmptyBeforeShutdown from AppDomainSpecs, and assigning them to the desired DLLs in the ExactAssignmentRules or PatternAssignmentRules sections, you can keep your DLLs permanently loaded - and perhaps avoid needing the keep-alive caller application at all.
Take into account that if you restart the BizTalk host, the DLL will still have to be loaded again on the first call.
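For reference, the relevant piece of BTSNTSvc.exe.config looks roughly like this. The element and section-handler names are taken from that documentation (verify them against your BizTalk version); the assembly name below is a placeholder, and -1 means never shut down:

    <configuration>
      <configSections>
        <section name="xlangs"
                 type="Microsoft.XLANGs.BizTalk.CrossProcess.XmlSerializationConfigurationSectionHandler, Microsoft.XLANGs.BizTalk.CrossProcess" />
      </configSections>
      <xlangs>
        <Configuration>
          <AppDomains>
            <AppDomainSpecs>
              <!-- -1 = never unload, whether idle or empty -->
              <AppDomainSpec Name="AlwaysLoadedDomain"
                             SecondsIdleBeforeShutdown="-1"
                             SecondsEmptyBeforeShutdown="-1" />
            </AppDomainSpecs>
            <ExactAssignmentRules>
              <!-- Placeholder: use the full name of your orchestration assembly -->
              <ExactAssignmentRule AssemblyName="MyOrchestrations, Version=1.0.0.0, Culture=neutral, PublicKeyToken=0123456789abcdef"
                                   AppDomainName="AlwaysLoadedDomain" />
            </ExactAssignmentRules>
          </AppDomains>
        </Configuration>
      </xlangs>
    </configuration>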
I've got an API written in C# (WebForms) that accepts JSON POST data, backed by a SQL Server 2008 database, running on an AWS EC2 VM. My problem is that the "first" use of this API is rather slow to respond.
What I mean by "first" is that if I were to wait for an hour or so, then post some data, that would be the first. Subsequent posts would process rather quickly in comparison, and I would need to wait another hour or so before experiencing the slow "first" transaction again.
Since only the initial post is slow, it makes me wonder if something is "spinning down" after being idle for some time, and then spinning up again upon first use, adding the extra time.
Things I have tried -
Run program through a performance profiler - This didn't really help. As far as I can see, the program itself doesn't have any obvious parts that run very slowly or inefficiently.
Change configuration to persist at least 1 connection to the database at all times. Again, no real change. I did this by adding "Min Pool Size=1;Max Pool Size=100" to my connection string.
Change configuration to use named pipes instead of TCP. Once again, no real change. I did this by adding "np:" before the server specified in my connection string, e.g. server=np:MyServer;database=MyDatabase;
Is there anything else I can do to diagnose the problem? What else should I be looking for in this scenario?
Chances are your app pool is shutting down after a designated period of non-use. The first call after the shutdown forces everything to get loaded back into memory which explains the lag.
You could play with these settings: http://technet.microsoft.com/en-us/library/cc771956%28v=ws.10%29.aspx to see if you get the desired effect, or set up a Task Scheduler job that makes at least one call every 10 minutes or so by doing a simulated post - a simple PowerShell script could handle that for you and will keep everything 'primed' for the next use.
[I'm not sure whether to post this in stackoverflow or serverfault, but since this is a C# development project, I'll stick with stackoverflow...]
We've got a multi-tiered application that is exhibiting poor performance at unpredictable times of the day, and we're trying to track down the cause(s). It's particularly difficult to fix because we can't reproduce it on our development environment - it's a sporadic problem on our production servers only.
The architecture is as follows:
1) Load-balanced front-end web servers (IIS) running an MVC application (C#).
2) A home-grown service bus, implemented with MSMQ running in domain-integration mode.
3) Five 'worker pool' servers, running our Windows service, which responds to requests placed on the bus.
4) A back-end SQL Server 2012 database, mirrored and replicated.
All servers have high-spec hardware, running Windows Server 2012, latest releases, latest Windows updates. Everything bang up to date.
When a user hits an action in the MVC app, the controller itself is very thin. Pretty much all it does is put a request message on the bus (sends an MSMQ message) and awaits the reply.
One of the servers in the worker pool picks up the message, works out what to do and then performs queries on the SQL Server back end and does other grunt work. The result is then placed back on the bus for the MVC app to pick back up using the Correlation ID.
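In code terms, the MVC side of the round trip looks roughly like this (heavily simplified, with placeholder queue paths):

    using System;
    using System.Messaging;

    class BusRoundTrip
    {
        static string SendAndAwaitReply(string body)
        {
            using (var requestQueue = new MessageQueue(@".\private$\bus.requests")) // placeholder
            using (var replyQueue = new MessageQueue(@".\private$\bus.replies"))    // placeholder
            {
                var request = new Message(body);
                requestQueue.Send(request);

                // The worker sets reply.CorrelationId = request.Id, so this
                // call blocks until exactly our reply arrives (or times out).
                using (var reply = replyQueue.ReceiveByCorrelationId(request.Id, TimeSpan.FromSeconds(30)))
                {
                    reply.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
                    return (string)reply.Body;
                }
            }
        }
    }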
It's a nice architecture to work with in respect to the simplicity of each individual component. As demand increases, we can simply add more servers to the worker pool and all is normally well. It also allows us to hot-swap code in the middle tier. Most of the time, the solution performs extremely well.
However, as stated we do have these moments where performance is a problem. It's proving difficult to track down at which point(s) in the architecture the bottleneck is.
What we have attempted to do is send a request down the bus and roundtrip it back to the MVC app with a whole suite of timings and metrics embedded in the message. At each stop on the route, a timestamp and other metrics are added to the message. Then when the MVC app receives the reply, we can screen dump the timestamps and metrics and try to determine which part of the process is causing the issue.
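The instrumentation itself is nothing fancy; each stop appends an entry along these lines (names simplified):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // One timing entry per stop on the route, serialized inside the bus message.
    [Serializable]
    public class HopMetric
    {
        public string ServerName;   // which machine stamped this entry
        public string Stage;        // e.g. "DequeuedRequest", "SqlComplete"
        public DateTime Utc;        // wall-clock stamp - the part we can't trust across machines
        public long StopwatchTicks; // monotonic, but only comparable within one machine
    }

    [Serializable]
    public class InstrumentedRequest
    {
        public Guid CorrelationId = Guid.NewGuid();
        public List<HopMetric> Hops = new List<HopMetric>();

        public void Stamp(string stage)
        {
            Hops.Add(new HopMetric
            {
                ServerName = Environment.MachineName,
                Stage = stage,
                Utc = DateTime.UtcNow,
                StopwatchTicks = Stopwatch.GetTimestamp()
            });
        }
    }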
However, we soon realised that we cannot rely on Windows time as an accurate measure, because many of our processes complete at the 5-100 ms level and a message can go through 5 servers (and back again). We cannot synchronize the time across the servers to that resolution. MS article: http://support.microsoft.com/kb/939322/en-us
To compound the problem, each time we send a request, we can't predict which particular worker pool server will handle the message.
What is the best way to get a coordinated and synchronized time source that is accurate to the 5 ms level? If we have to call out to an external (web) service at each step, this would add extra time to the process, and how can we guarantee that each call takes the same amount of time on each server? Even a small amount of latency in an external call on one server would skew the results and give us a false positive.
Hope I have explained our predicament and look forward to your help.
Update
I've just found this: http://www.pool.ntp.org/en/use.html, which might be promising. Perhaps a scheduled job every x hours to keep the time synchronised could get me to the sub 5 ms resolution I need. Comments or experience?
Update 2
FWIW, we've found the cause of the performance issue. It occurs when the software tests whether a queue has been created before it opens it - so it was essentially looking up the queue twice, which is fairly expensive. With that redundant check removed, the issue has gone away.
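The shape of the fix was to stop probing for the queue before opening it. For anyone with the same pattern: MessageQueue.Exists() is documented as expensive, and you can create-or-open in one step instead (a simplified sketch, not our exact code):

    using System.Messaging;

    class QueueOpener
    {
        static MessageQueue EnsureQueue(string path)
        {
            try
            {
                // One round trip: create the queue, and treat "already exists"
                // as success instead of calling the costly Exists() first.
                return MessageQueue.Create(path);
            }
            catch (MessageQueueException ex)
                when (ex.MessageQueueErrorCode == MessageQueueErrorCode.QueueExists)
            {
                return new MessageQueue(path);
            }
        }
    }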
What you should try is using the Performance Monitor that's part of Windows itself. What you can do is create a Data Collector Set on each of the servers and select the metrics you want to monitor. Something like Request Execution Time would be a good one to monitor for.
Here's a tutorial for Data Collector Sets: https://www.youtube.com/watch?v=591kfPROYbs
Hopefully this will give you a start on troubleshooting the problem.
I have a nice, fast task-scheduling component (a Windows service as it happens, but this is irrelevant); it subscribes to an in-memory queue of things to do.
The queue is populated really fast ... and when I say fast I mean fast ... so fast that I'm experiencing problems with some particular part.
Each item in the queue gets a "category" attached to it and is then passed to a WCF endpoint to be processed and saved in a remote DB.
This is presenting a bit of a problem.
The "queue" can be processed in the millions of items per minute whereas the WCF endpoint will only realistically handle about 1000 to 1200 items per second and many of those are "stacked" in order to wait for a slot to dump them to the db.
My WCF client is configured so that the call is fire and forget (deliberately); my problem is that when the call is made, occasionally a timeout occurs, and that's when the headaches begin.
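(For concreteness, by "fire and forget" I mean the operations are one-way, along these lines - the contract name is made up:)

    using System.ServiceModel;

    [ServiceContract]
    public interface IItemSink
    {
        // IsOneWay means no result comes back, but over basicHttpBinding the
        // caller still blocks until the transport accepts the message - so a
        // saturated service can still surface as a client-side timeout.
        [OperationContract(IsOneWay = true)]
        void Submit(string category, string payload);
    }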
The thread just seems to stop after the timeout - no dropping into my catch block, nothing ... it just sits there. What's even more confusing is that this is intermittent: it only happens when the queue is dealing with extreme loads and the WCF endpoint is overtaxed, and even in that scenario it only happens about once a fortnight.
This code is constantly running on the server, round the clock 24/7.
So ... my question ...
How can I identify the edge case that is causing my problem so that I can resolve it?
Some extra info:
The client calling the WCF endpoint seems to "throttle itself" automatically, because I'm limiting the number of threads making calls and the code hangs about until a call is considered complete (I'm thinking this is an HTTP-level thing, as I'm not asking the service for a result of my method call).
The DB is talked to with EF, which seems never to open more than a fixed number of connections (quite a low number too, which is cool), and the WCF endpoint from call reception back seems super reliable.
The problem seems to be between the queue processor and the WCF endpoint.
The queue processor has a single instance of my WCF endpoint client which it reuses for all calls ... (is it good practice to rebuild this endpoint per call? - bear in mind the number of calls here).
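For what it's worth, a common defensive pattern for a long-lived client is to rebuild the channel only when it has faulted, rather than per call: after a timeout the channel can sit in the Faulted state, and every later call on it fails or hangs. A sketch, reusing the made-up contract above (not thread-safe as written; add locking for concurrent callers):

    using System.ServiceModel;

    class ResilientSender
    {
        private readonly ChannelFactory<IItemSink> _factory =
            new ChannelFactory<IItemSink>("ItemSinkEndpoint"); // endpoint name from config (placeholder)
        private IItemSink _channel;

        public void Send(string category, string payload)
        {
            var comm = _channel as ICommunicationObject;
            if (comm == null || comm.State == CommunicationState.Faulted)
            {
                // A faulted channel can never be reused - abort it and build a new one.
                if (comm != null) comm.Abort();
                _channel = _factory.CreateChannel();
            }
            _channel.Submit(category, payload);
        }
    }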
Final note:
It's a peculiar "module" of functionality, under heavy load for hours at a time it's stable, but for some reason this odd thing happens resulting in the whole lot just stopping and not recovering. The call is wrapped in a try catch, but seemingly even if the catch is hit (which isn't guaranteed) the code doesn't recover / drop out as expected ... it just hangs.
Any ideas?
Please let me know if there's anything else I can add to help resolve this.
Edit 1:
binding - basicHttpBinding
error handling - no code written other than wrapping the WCF call in a try catch.
My solution appears to be to increase the timeout settings in the client config to allow the server more time to respond.
The net result is that while the database is busy saving data (effectively the slowest part of this process) the calling client sits and waits (on all threads, but seemingly not as long as I would have liked).
This issue seems to be the net result of a lot of multithreaded calls to the WCF endpoint without giving it enough time to respond.
The high load is not continuous; the service usage seems to spike then tail off, and adding to the expected response time allows spikes to be filtered through as they happen.
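For the curious, the change itself is just the binding timeouts, settable in the client config or in code. In code it looks roughly like this (values illustrative, contract as above; SendTimeout is the one a slow-to-respond service trips, and it defaults to 1 minute):

    using System;
    using System.ServiceModel;

    class ClientSetup
    {
        static ChannelFactory<IItemSink> CreateFactory(string address)
        {
            var binding = new BasicHttpBinding
            {
                SendTimeout = TimeSpan.FromMinutes(5),    // time allowed for a call to complete
                ReceiveTimeout = TimeSpan.FromMinutes(5),
                OpenTimeout = TimeSpan.FromMinutes(1),
                CloseTimeout = TimeSpan.FromMinutes(1)
            };
            return new ChannelFactory<IItemSink>(binding, new EndpointAddress(address));
        }
    }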
A key note:
Way too many calls will result in the server / service treating them as a DoS-type attack, and as such it may simply terminate the connection.
This isn't what I'm getting, but some fine tuning and time may result in this ...
Time for some bigger servers !!!
I hate asking questions like this - they're so undefined... and undefinable, but here goes.
Background:
I've got a DLL that is the guts of an application that is a timed process. My timer receives a configuration for the interval at which it runs and a delegate that should be run when the interval elapses. I've got another DLL that contains the process that I inject.
I created two applications, one a Windows service and one a console application. Each application reads its own configuration file and loads the same libraries, pushing the configured timer interval and delegate into my timed process class.
Problem:
Yesterday, and for the last n weeks, everything was working fine in our production environment using the Windows Service. Today, the Windows Service runs for around 20-30 minutes and then hangs (with a timer interval of 30 seconds), but the console application runs without issue and has for the past 4 hours. Detailed logging doesn't indicate any failure. It's as if the Windows Service just... dies quietly - without stopping.
Given that my Windows Service and Console Applications are doing the exact same thing, I can only think that there is something that is causing the Windows Service process to hang - but I have no idea what could be causing that. I've checked the configuration files, and they're both identical - I even copied and pasted the contents of one into the other just to be sure. No dice.
Can anyone make suggestions as to what might cause a Windows Service to hang, when a counterpart Console Application using the same base libraries doesn't; or can anyone point me in the direction of tools that would allow me to diagnose what could be causing this issue?
Thanks for everyone's help - still digging.
You need to figure out what changed on the production server. At first, the IT guys responsible will swear that nothing changed, but you have to be persistent. I've seen this happen so often I've lost count. Software doesn't spoil. Period. The change must have been to the environment.
Difference in execution: You have two apps running the same code. The most likely difference (and culprit) is that the service is running with a different set of security credentials than your console app and might fall victim to security vagaries. Check on that first. Which Windows account is running the service? What is its role and scope? Is there any 3rd-party security software running on the server, perhaps killing errant apps? Do you have to register your service with a 3rd-party security service? Is your .NET assembly properly signed? Are your .NET assemblies properly registered and configured on the server? Last but not least, don't forget that a debugger user, which you most likely are, gets away with a lot more stuff than many other account types.
Another thought: Since timing seems to be part of the issues, check the scheduled tasks on the machine. Perhaps there's a process that is set to go off every 30 minutes that is interfering with your own.
You can debug a Windows service by running it interactively within Visual Studio. This may help you to isolate the problem by setting (perhaps conditional) breakpoints.
Alternatively, you can use the Visual Studio "Attach to process" dialog window to find the service process and attach to it with the "Debug CLR" option enabled. Again this allows you to set breakpoints as needed.
Are you using any assertions? If an assertion fires without being re-directed to write to a log file, your service will hang. If the code throws an unhandled exception, perhaps because of a memory leak, then your service process will crash. If you set the Service Control Manager (SCM) to restart your process in the event of a crash, you should be able to see that the service has been restarted. As you have identical code running in both environments, these two situations don't seem likely. But remember that your service is being hosted by the SCM, which means a very different environment to the one in which your console app is running.
I often use a "heartbeat", where each active thread in the service sends a regular (say every 30 seconds) message to a local MSMQ. This enables manual or automated monitoring, and should give you some clues when these heartbeat messages stop arriving.
Another possibility is some sort of permissions problem, because the service is probably running as a different local/domain user to the console.
After the hang, can you use the SCM to stop the service? If you can't, then there is probably some sort of thread deadlock problem. After the service appears to hang, you can go to a command-line and type sc queryex servicename. This should give you the current STATE of the service.
I would probably put in some file logging just to see how far the program is getting. It may give you a better idea of what is looping/hanging/deadlocked/crashing.
You can try these techniques
Logging - start logging the flow of the code in the service. Make this parameter-driven so you don't have a deluge after you are done. You should log all function names, parameters, and timestamps.
Attach Debugger - locally or remotely attach a debugger to the running service and set appropriate breakpoints (these can be based on the data gathered from logging).
PerfMon - run this utility and gather information about the machine the service is running on for any additional clues (high CPU spikes, IO spikes, excessive paging, etc.).
Microsoft provides a good resource on debugging a Windows Service. That essentially sounds like what you'd have to do, given that your question is so generic. With that said, have any changes been made to the system over the last few days that could adversely affect the service? Have you made any updates to the code that change the way the service might work?
Again, I think you're going to have to do some serious debugging to find your problem.
What type of timer are you using in the Windows service? I've seen numerous people on SO have problems with timers and Windows services. Here is a good tutorial just to make sure you are setting it up correctly and using the right type of timer. Hope that helps.
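The classic trap is System.Windows.Forms.Timer, which needs a UI message pump and simply never fires inside a service; System.Timers.Timer (or System.Threading.Timer) is the safe choice. A skeleton for illustration:

    using System.ServiceProcess;
    using System.Timers;

    public class TimedService : ServiceBase // name is illustrative
    {
        private Timer _timer;

        protected override void OnStart(string[] args)
        {
            _timer = new Timer(30000); // 30-second interval, as in the question
            _timer.Elapsed += (s, e) =>
            {
                // System.Timers.Timer silently swallows exceptions thrown here,
                // which is one way a service can look "hung" - log everything.
                DoWork();
            };
            _timer.Start();
        }

        protected override void OnStop()
        {
            _timer.Stop();
            _timer.Dispose();
        }

        private void DoWork()
        {
            // the injected delegate runs here
        }
    }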
Another potential problem in reference to psasik's answer is if your application is relying on something only available when being run in User Mode.
Running in service mode runs in a different desktop/session (is it desktop 0?), which can cause some issues in my experience if you are trying to determine the state of something that can only be seen in user mode.
Smells like a threading issue to me. Is there any threading or async work being done at all? One crucial question is "does the service hang on the same line of code or same method every time?" Use your logging to find out the last thing that happens before a hang, and if so, post the problem code.
One other tool you may consider is a good profiler. If it is .NET code, I believe Red Gate ANTS can monitor it and give you a good picture of any thread-deadlock scenarios.
We have a very strange problem: one of our applications continually queries a server using .NET Remoting, and every 100 seconds the application stops querying for a short duration and then resumes the operation. The problem is on the client and not on the server, because the application actually queries several servers at the same time and stops receiving data from all of them at the same time.
100 seconds is a giveaway number, as it's the default timeout for a WebRequest in .NET.
I've seen in the past that the PSI (Project Server Interface within Microsoft Project) didn't override the timeout and so the default of 100 seconds was applied and would terminate anything talking to it for longer than that time.
Do you have access to all of the code and are you sure you have set timeouts where applicable so that any defaults are not being applied unbeknownst to you?
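For reference, the default is visible (and overridable) directly on the request object - a quick illustration:

    using System;
    using System.Net;

    class TimeoutDemo
    {
        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/"); // placeholder URL
            Console.WriteLine(request.Timeout);       // 100000 ms - the 100-second default
            request.Timeout = 5 * 60 * 1000;          // raise it explicitly (5 minutes)
            request.ReadWriteTimeout = 5 * 60 * 1000; // applies to stream reads/writes
        }
    }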
I've never seen that behavior before and unfortunately it's a vague enough scenario I think you're going to have a hard time finding someone on this board who's encountered the problem. It's likely specific to your application.
I think there are a few investigations you can do to help you narrow down the problem.
Determine whether it's the client or server that is actually stalling. If you have problems determining this, try installing a packet filter and monitor the traffic to see who sent the last data. You likely won't be able to read the binary data but at least you will get a sense of who is lagging behind.
Once you figure out whether it's the client or server causing the lag, attempt to debug into the application and get a breakpoint where the hang occurs. This should give you enough details to help track down the problem. Or at least ask a more defined question on SO.
How is the application coded to implement the continuous querying? Is it a continuous loop, a loop with a Thread.Sleep, or is it on a timer?
It would first be useful to determine whether your system is executing this "trigger" in your code when you expect it to, or whether it is and the remoting server is simply not responding. So...
If you cannot reproduce this issue in a development environment where you can debug it, then, if you can, I suggest you add code to this loop to write out to a log file (or some other persistence mechanism) each time it "should" be examining whatever conditions it uses to decide whether to query the remoting server, and then review those logs when the problem reoccurs.
If you can do the same in your remoting server, to record when the server receives a remoting request, this would help as well...
... and oh yes, just a thought (I don't know how you have coded this), but if you are using a separate thread in the client to issue the remoting request, and the channel is being registered and unregistered on that separate thread, make sure you are deconflicting the requests, because you can't register the same port twice on the same machine at the same time...
(although this should probably have raised an exception in your client if this was the issue)