Azure Portal: How to See Callstacks - c#

Apologies, this is not a short question:
Background
I have a B1 Azure Website, and for the life of me, cannot get exceptions with callstacks.
The WebAPI is hosted side-by-side with the website in the same solution, which I hear is unusual. Almost all configuration has been done through the solution, I believe. Most everything in the portal is probably default settings from a brand new site.
I will be the first to admit, I am a novice at Azure. I have previously hosted some exceedingly simple ASP websites (mostly pre-.NET) in the past. I have found the Azure Portal to be overwhelming, to say the least. Hence why I am here!
The main place I look for exceptions is Application Insights, under Failures, on the Exceptions tab. However, while it usually (not always...) shows that there were 500s, the vast majority of the time it shows no callstack.
Situations
The few times it does catch a callstack, it's your normal bots poking at random directories... not the crippling exception I need to debug immediately. I recall hearing that Azure will use "AI to determine which callstacks to keep" or something market-y like that, but I can't find any settings regarding it. Even if that market-speak is true, why is it recording callstacks for daily bot attempts, but not for the rare application-crippling exception?
A month or so ago, I attempted to debug the live website via Visual Studio, but I get an error saying that Internet Explorer could not be found. Given that it's the year 2018 and Microsoft has moved onto Edge, I don't know why it wants Internet Explorer at all. I did find a response to this, saying to hack the registry and reinstall Internet Explorer, but that seemed overkill at the time.
Viewing Azure errors through Visual Studio's embedded Azure portal thing seems to show very similar data as the Azure portal does. No callstacks to be found.
Many years ago, a classic alert was set up for Http Server Errors, which still triggers to this day. It does not trigger on HttpExceptions from bots poking at the site, but it does for important 500s, and that's good. What is interesting is that it is the most reliable way to hear about errors, besides user reports. Too bad they don't have callstacks...
Last night, we encountered an exception, presumably in the view, of a page. We got e-mails from the classic alert, as expected, but the Failures section does not show any failures at all. In the past, we'd see 500s, but no callstack. It would seem that last night's errors were not detected by anything but the classic alert and the user. I don't know if it is because last night's error was unique, or if we now mysteriously get even less information out of Azure.
Attempted Solutions
Over the years, I have followed a myriad of guides, ranging from flipping switches in the portal itself, to FTPing and looking at the raw logs (which apparently are not really about your application, as much as Microsoft hosting it). If I got a penny for every time I read a guide that said, "Simply click on the Exceptions tab to see your callstacks" I'd be rich :-P.
A month ago, I got so desperate I implemented Application_Error in the HttpApplication class for the application, and implemented ExceptionLogger for WebAPI, to manually log all exceptions to text files. Unfortunately, while this helped me fix one error, subsequent exceptions have not appeared there either. Just like Application Insights, mostly bots poking at non-existent directories show up in these logs.
A week ago, I got desperate enough that I wrote a janky "unit test" (ha!), that'd pull a copy of production data down and test it locally, which is absolutely bonkers.
I have spoken to other architect-level ASP.NET engineers that use Azure portal to varying frequencies, and they could not come up with any suggestions. We looked at the web.configs; there is one in the root and in the Views folder. We played with turning on customerrors, but obviously we can't have that running in production because it'd display the errors to the user. That being said, I wouldn't mind having real error messages appear to certain users. How would one accomplish that? If I were to guess, the issue is hidden in those web.configs, simply because they're ancient and so many hands have touched them.
Conclusion
I need a 100% bullet-proof way to get exceptions and their callstacks from ASP.NET hosted on Azure. Otherwise, it's nearly impossible to solve edge cases that appear unexpectedly in production. I don't recall this being a problem in my days before Azure.
I am certain an expert out there will have this solved in mere minutes, but, for now, I'm completely stumped. Thank you for your time!

A couple of things to try and check for:
Make sure that your Application Insights NuGet packages are up to date. I've had metrics quit working over the last couple of years, or new metrics show up on the AppInsights blade that I wasn't collecting. Upgrading to the latest NuGet packages did the trick.
Are you catching exceptions within your web app and then returning an HTTP 500 response explicitly? If so, you won't see a stack trace. Stack traces are captured only when the exception bubbles all the way up through your controller method unhandled.
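To illustrate the second point, here is a minimal sketch using only the base class library. The names ExceptionReporting, DoWork, HandleAndHide and HandleAndReport are hypothetical and not from the question's code. The idea is that a catch block which only returns a status code throws the stack trace away, while one that hands the exception to a reporter keeps it. With the Application Insights SDK the reporter would typically wrap TelemetryClient.TrackException, but that wiring is an assumption about your setup.

```csharp
using System;

public static class ExceptionReporting
{
    // Hypothetical stand-in for the work a controller action does.
    public static void DoWork()
    {
        throw new InvalidOperationException("Something broke in DoWork");
    }

    // Swallows the exception: the caller sees a 500, the stack trace is lost.
    public static int HandleAndHide()
    {
        try
        {
            DoWork();
            return 200;
        }
        catch (Exception)
        {
            return 500;
        }
    }

    // Same status code, but the exception is handed to a reporter first.
    public static int HandleAndReport(Action<Exception> report)
    {
        try
        {
            DoWork();
            return 200;
        }
        catch (Exception ex)
        {
            report(ex); // assumption: e.g. a wrapper around your telemetry client
            return 500;
        }
    }
}
```

Calling ExceptionReporting.HandleAndReport(ex => File.AppendAllText(path, ex.ToString())) would write the message and stack trace to a file of your choosing, whereas HandleAndHide leaves nothing behind to inspect.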

Related

Azure Table/Blob/Queue random Timeout on linux system (k8s .net core 3 app)

This is my scenario:
Microsoft.Azure.Storage.Blob 11.2.0
Microsoft.Azure.Storage.Queue 11.2.0
Microsoft.Azure.Cosmos.Table 1.0.7
I've moved a lot of my code from Azure Functions to Google Cloud (k8s), running as a .NET Core app that uses the same libraries, built against .NET Standard 2.0. The move itself went without any problems.
After a few days, I noticed different behavior on the Linux system.
A few calls to Azure services (blob, table, queue) time out (a subsystem appears to fail; I tried different retry policies with the same result).
In 10,000 calls I get 10 to 50 errors (or very long calls 180 seconds, before I changed the timeouts). This happens in all Azure services: table, blob and queue.
I tried different solutions to find out why:
I tried instantiating the client (blobClient, TableClient, etc.) on every call and reusing the same client; neither made a difference.
I change all timeouts to handle this behavior. I work on ServerTimeout and MaximumExecutionTime and put a layer on top, with my retry mechanism, so I can minimize errors. Now I have "only" a few calls of 20 seconds (instead of 2/3 sec for example).
I tried all solutions with similar problems found on Stackoverflow :D ... but nothing works (for now)
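For readers wondering what "a layer on top, with my retry mechanism" might look like, here is a minimal sketch. It is not the code from the question. RetryWithTimeout and its parameters are hypothetical names, and it assumes the wrapped operation honors the CancellationToken it is given; an operation that ignores the token will not be cut short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryWithTimeout
{
    // Runs the operation up to maxAttempts times. Each attempt gets its own
    // cancellation token that fires after perAttemptTimeout.
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan perAttemptTimeout,
        TimeSpan backoff,
        int maxAttempts)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            using (var cts = new CancellationTokenSource(perAttemptTimeout))
            {
                try
                {
                    return await operation(cts.Token).ConfigureAwait(false);
                }
                catch (Exception) when (attempt < maxAttempts)
                {
                    // Swallow and retry; the final attempt's exception propagates.
                }
            }

            // Linear backoff between attempts.
            await Task.Delay(TimeSpan.FromMilliseconds(backoff.TotalMilliseconds * attempt))
                .ConfigureAwait(false);
        }

        // Not reachable: the last attempt either returns or throws.
        throw new InvalidOperationException("Retry loop ended unexpectedly.");
    }
}
```

A wrapper like this only bounds the damage from a slow call. It does not explain why the call is slow, which is the open question here.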
Same dll code run on azure function without any problems.
So I came to the conclusion that there is something in the HTTP client, used internally by the Azure SDK, that depends on the operating system you are running your code on.
I think after a few articles it may be the Keep-Alive header, so I try on my composition root:
ServicePointManager.SetTcpKeepAlive (true, 120000, 10000);
but nothing changes.
Any ideas or suggestions? ... maybe I'm on the wrong path, or I've missed something.
UPDATE
After reading the last article linked by #KrishnenduGhosh-MSFT in the last comment, I tried changing this setting:
ServicePointManager.DefaultConnectionLimit = 100;
This was the turning point.
Since it used to happen randomly, I'm still not 100% sure if the problem is solved.
But after 50k calls, I'm pretty optimistic. Obviously production will behave differently, but I already expect that :)
UPDATE 2 - AFTER PUBLISH IN PROD
In the end, it doesn't work :(
I had written in the comments, but it seems fair to update here (more readable).
I still have long calls (cut short by MaximumExecutionTime), and I don't see the light at the end of the tunnel.
Now I'm thinking about moving some Azure storage to Google storage, but haven't completely given up.

Application Insights Delay?

I've looked in many places for details around the delay of time it takes for Application Insights data to appear in my dashboard, but can't find it documented anywhere.
I spent some time yesterday trying to debug an issue around my code seemingly unable to send data to application insights, only for the data to appear sometime later (~40 mins).
Does anybody have any details regarding time I should expect to have to wait prior to seeing data on my dashboard?
I've read a few FAQs and articles such as: https://azure.microsoft.com/en-gb/documentation/articles/app-insights-troubleshoot-faq/ but am none the wiser.
More specifically, these were attempts to track exceptions and custom events.
Generally, raw examples of your data should be available within a couple of minutes of the time you send it, and aggregated data takes about 5-10 minutes to appear. Also, when we are experiencing a processing delay, we display a banner on the Overview page in Application Insights in the portal.
If you saw a 40-minute delay before your data appeared, this was either a case of an ongoing issue with the processing pipeline, in which case a message should have been shown (and if not, it is a detection problem on our side), or, as we often see, there could have been a configuration problem with your application that was later addressed.
Agree with the comments in the accepted answer that real-time logging is an absolute requirement of an enterprise system. Even the Portal says the following on the Monitor section of the Azure Functions blade:
This appears to be due to metric aggregation. However, I've just been shown Application Insights' Live Metrics Stream by a colleague. It has 1-second latency, which is probably what most readers of this question are after, so I thought it worth sharing.

Intermittent application hang on startup in Windows Store App

I am developing a Windows Store application. Currently, I am getting intermittent hangs as described in this blog post. The issue appears to be that not enough space is given to remainder-defined column widths and TextBlocks attempting to format themselves (possibly due to the ellipsis processing). My app tends to hang indefinitely when this happens.
The question I have is less about how to solve the issue (as it seems to be described fairly well in the blog post) and more about how to find the issues. I hit one fairly regularly (approximately one in five or ten start-ups) on a Hub Page, so I've been looking through there (as it's the most notable instance of the issue), but it's a true Heisenbug in that it never seems to happen when debugging (or when you look for it).
So, how do I find the offending code? Is there just a pattern I need to look for (ColumnWidth="*"?). Is there a simpler way to solve this, such as changing the base style to remove one of the possibly offending properties listed in the blog post?
It seems possible that this is being caused by another issue, but this seems to be the most likely/plausible as of right now (as with the hubs I have a similar situation to what is being described there).
Also, is there a way to track when this happens in the wild? MSFT provides crash dumps on hangs, but they seem to give little to no information in them at all (and on top of that they only appear 5 days after they happen, which is less than ideal).
Thanks!
This is a complicated question to answer.
First, I think you have identified a real problem with WinRT. You theorize that the layout subsystem seems busy calculating your layout, and based on some condition that occurs around 20% of the time it does not finish in any reasonable time. Reasonable guess.
The problem, then, is that the event does not occur while debugging. In my personal development experience, errors that do not occur in debug are 99.99% timing related. Something is not finishing before a second process begins. Debugging lets that first, long process finish.
This is a real computer science question, and not so much a WinRT or Windows 8 question. To that end, the best answer I can give you without any code samples (why no code samples?) is the typical approach I employ when I reach the same dilemma. I hope it helps, at least a little.
Start with your brain.
I have always joked with developers just how much debugging can be done outside the debugger - and in your mind. Mentally walking the pipeline of your app and looking for race-condition dependencies that might cause deadlocks. Believe it or not, this solves a lot of problems a debugger could never catch - because debuggers unwind timing dependencies.
Next is simplicity.
The more complex the problem the less likely you will find the culprit. In the case of a XAML application, I tend to remove or disable value converters first. Then, I look to remove data templates. If you have element bindings, those go next. If simplifying the XAML does help - that's just the beginning to figuring it out. If it doesn't, things just got easier.
Your code behind can be disabled with just a few keystrokes and found guilty or innocent. It's the most likely place for your problem, I find, and the reason we work so hard to keep it simple, clean, and minimal. After that, there's the view model. Though it's not impossible for your view model to be the one, and indeed you still have to check, it's probably not the root of your evil.
Lastly, there's the app pipeline that loads your page, loads your data, or does anything else. Step by step your only real option is to slowly remove things from your app until you don't see the problem. Removing the problem, though is not solving it. That's a case by case thing based on your app and the logic in it. Reality is, you might see the problem leave when removing XAML, while the real problem is in the view model or elsewhere.
What am I really saying? The silver bullet you are asking for really isn't there. There are several Microsoft tools and even more third party tools to look for bottlenecks, latency problems, slow code, and stuff - but in all reality, the scenario you describe is plain ole programming. I am not saying you aren't the victim of a bug. I'm saying, with the information we have, this is all I can do for you.
You'll get it.
Third thing to do is to add logging, and instrumentation to your app.
Best of luck.
Given that Jerry has answered this at a higher level, I figured I would add the lower-level answers that, from the way your question is phrased, I think you are interested in. First, I would like to address the last item: the dump files. Microsoft provides a mechanism for getting dump files of a process 'in the wild', namely Windows Error Reporting. If you want to collect dump files from failed client processes, you could sign up for Windows Error Reporting (I must admit I have never actually done it; I looked into it and tried to get my current employer to allow it, but that didn't end successfully). To sign up, go to the Establish a Hardware/Desktop Account Page.
As far as what to do with dump files once you get them, you would be wanting to download the debugging tools for windows (part of the Windows SDK download) and/or the Debug Diag Tool (I must confess I am more of a debugging tools for windows user than a Debug Diag user). These will provide you with the tools to look into what is going on at a lower level. Obviously you can only go so far as you won't have access to private Microsoft symbols, but you do have access to public symbols and usually those are enough to give you a pretty good idea of the problem area.
Your primary tools will depend on how reproducible the issue is. If it is only reproducible on some client machines, then you will have to rely on looking at a single dump file that you probably got hold of from Windows Error Reporting. In this case, what I would do is open it up using the appropriate version of Windbg (either x86 or x64) and look at what was going on at the time the dump was taken. How far you can go depends on how savvy you are. Probably a simple starter would be to run
.symfix
.reload
.loadby sos clr
!EEStack
This will load Microsoft public symbols and the sos extension dll for managed code inspection, and then dump the contents of the stack for each thread in the process. From the names of the methods that appear on the call stacks, you might be able to get a pretty good idea of at least the area of the code where the lock is occurring.
You can go much farther than this as Windbg provides the ability to go pretty deep into deadlock analysis (for instance there is an extension available for Windbg called sosex that provides a command !dlk which can sometimes automate the detection of a deadlock for you from a single dump file. To load an extension dll into Windbg you just have to download it and then call .load fullpathtodll). If the problem is reproducible locally you might even be more successful with WPA/WPR or if you are really fortunate a simple procmon trace. These tools do have a pretty decent barrier to entry as they take some time to learn. But if you are really interested in the topic your best resources would be the Defrag Tools series on Channel9 and anything by Mario Hewardt (especially his book "Advanced .Net Debugging"). Again, getting familiar with these tools can take a bunch of time, but at the very least if you just know how to dump the contents of the stacks from a dump file you can sometimes get what you need just from that so a basic understanding of these tools can be beneficial as well.

High number of Request Timeouts on IIS

I have a fairly busy site which does around 10m views a month.
One of my app pools seemed to jam up for a few hours and I'm looking for some ideas on how to troubleshoot it. I suspect that it somehow ran out of threads, but I'm not sure how to determine this retroactively. Here's what I know:
The site never went 'down', but around 90% of requests started timing out.
I can see a high number of "HttpException - Request timed out." in the log during the outage
I can't find any SQL errors or code errors that would have caused the timeouts.
The timeouts seem to have been site wide on all pages.
There was one page with a bug on it which would have caused errors on that specific page.
The site had to be restarted.
The site is ASP.NET C# 3.5 WebForms.
Possibilities:
Thread depletion: My thought is that the page causing the error may have somehow started jamming up the available threads?
Global code error: Another possibility is that one of my static classes has an undiscovered bug in it somewhere. This is unlikely, as this has never happened before and I can't find any log errors for these classes, but it is a possibility.
UPDATE
I've managed to trace the issue now while it's occurring. The pages are being loaded normally but for some reason WebResource.axd and ScriptResource.axd are both taking a minute to load. In the performance counters I can see ASP.NET Requests Queued spikes at this point.
The first thing I'd try is Sam Saffron's CPU analyzer tool, which should give an indication if there is something common that is happening too much / too long. In part because it doesn't involve any changes; just run it at the server.
After that, there are various other debugging tools available; we've found that some very ghetto approaches can be insanely effective at seeing where time is spent (of course, it'll only work on the 10% of successful results).
You can of course just open the server profiling tools and drag in various .NET / IIS counters, which may help you spot some things.
Between these three options, you should be covered for:
code dropping into a black hole and never coming out (typically threading related)
code running, but too slowly (typically data access related)
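On the thread-depletion suspicion in the question: one way to have evidence next time is to log a snapshot of the CLR thread pool at regular intervals (or from a diagnostics page). This is a minimal sketch rather than a tool mentioned in the answer; ThreadPoolSnapshot and Describe are hypothetical names, and it assumes your requests run on CLR thread pool threads. It avoids newer C# syntax so it should also compile against older compilers.

```csharp
using System;
using System.Threading;

public static class ThreadPoolSnapshot
{
    // Returns one log line with busy/max counts for worker and I/O threads.
    public static string Describe()
    {
        int maxWorker, maxIo, freeWorker, freeIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

        return string.Format(
            "{0:o} worker busy={1}/{2} io busy={3}/{4}",
            DateTime.UtcNow,
            maxWorker - freeWorker, maxWorker,
            maxIo - freeIo, maxIo);
    }
}
```

If the busy worker count sits near the maximum while ASP.NET Requests Queued climbs, that points toward blocked threads rather than slow data access.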

What could cause a Windows Service to hang when a Console App doing the exact same thing using the exact same base libraries doesn't?

I hate asking questions like this - they're so undefined... and undefinable, but here goes.
Background:
I've got a DLL that is the guts of an application that is a timed process. My timer receives a configuration for the interval at which it runs and a delegate that should be run when the interval elapses. I've got another DLL that contains the process that I inject.
I created two applications, one Windows Service and one Console Application. Each of the applications read their own configuration file and load the same libraries pushing the configured timer interval and delegate into my timed process class.
Problem:
Yesterday and for the last n weeks, everything was working fine in our production environment using the Windows Service. Today, the Windows Service will run for a period of around 20-30 minutes and then hang (with a timer interval of 30 seconds), but the console application runs without issue and has for the past 4 hours. Detailed logging doesn't indicate any failure. It's as if the Windows Service just...dies quietly - without stopping.
Given that my Windows Service and Console Applications are doing the exact same thing, I can only think that there is something that is causing the Windows Service process to hang - but I have no idea what could be causing that. I've checked the configuration files, and they're both identical - I even copied and pasted the contents of one into the other just to be sure. No dice.
Can anyone make suggestions as to what might cause a Windows Service to hang, when a counterpart Console Application using the same base libraries doesn't; or can anyone point me in the direction of tools that would allow me to diagnose what could be causing this issue?
Thanks for everyone's help - still digging.
You need to figure out what changed on the production server. At first, the IT guys responsible will swear that nothing changed, but you have to be persistent. I've seen this happen so often I've lost count. Software doesn't spoil. Period. The change must have been to the environment.
Difference in execution: You have two apps running the same code. The most likely difference (and culprit) is that the service is running with a different set of security credentials than your console app and might fall victim to security vagaries. Check on that first. Which Windows account is running the service? What is its role and scope? Is there any 3rd party security software running on the server and perhaps killing errant apps? Do you have to register your service with a 3rd party security service? Is your .NET assembly properly signed? Are your .NET assemblies properly registered and configured on the server? Last but not least, don't forget that a debugger user, which you most likely are, gets away with a lot more stuff than many other account types.
Another thought: Since timing seems to be part of the issues, check the scheduled tasks on the machine. Perhaps there's a process that is set to go off every 30 minutes that is interfering with your own.
You can debug a Windows service by running it interactively within Visual Studio. This may help you to isolate the problem by setting (perhaps conditional) breakpoints.
Alternatively, you can use the Visual Studio "Attach to process" dialog window to find the service process and attach to it with the "Debug CLR" option enabled. Again this allows you to set breakpoints as needed.
Are you using any assertions? If an assertion fires without being re-directed to write to a log file, your service will hang. If the code throws an unhandled exception, perhaps because of a memory leak, then your service process will crash. If you set the Service Control Manager (SCM) to restart your process in the event of a crash, you should be able to see that the service has been restarted. As you have identical code running in both environments, these two situations don't seem likely. But remember that your service is being hosted by the SCM, which means a very different environment to the one in which your console app is running.
I often use a "heartbeat", where each active thread in the service sends a regular (say every 30 seconds) message to a local MSMQ. This enables manual or automated monitoring, and should give you some clues when these heartbeat messages stop arriving.
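A minimal sketch of the heartbeat idea, under stated assumptions: it swaps the MSMQ queue for a plain Action<string> sink (a file, a queue, whatever you have), and HeartbeatBoard, Beat and StaleWorkers are hypothetical names. The important design point is that each worker calls Beat from inside its own loop, so a hung worker goes quiet instead of being covered up by an independent timer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class HeartbeatBoard
{
    private readonly ConcurrentDictionary<string, DateTime> _lastBeat =
        new ConcurrentDictionary<string, DateTime>();
    private readonly Action<string> _sink;

    // sink receives one text line per heartbeat (assumption: replaces MSMQ here).
    public HeartbeatBoard(Action<string> sink)
    {
        _sink = sink;
    }

    // Call from inside each worker's own loop.
    public void Beat(string worker)
    {
        DateTime now = DateTime.UtcNow;
        _lastBeat[worker] = now;
        _sink(string.Format("{0:o} {1} alive", now, worker));
    }

    // Names of workers that have not called Beat within maxSilence.
    public List<string> StaleWorkers(TimeSpan maxSilence)
    {
        DateTime cutoff = DateTime.UtcNow - maxSilence;
        var stale = new List<string>();
        foreach (KeyValuePair<string, DateTime> entry in _lastBeat)
        {
            if (entry.Value < cutoff)
            {
                stale.Add(entry.Key);
            }
        }
        return stale;
    }
}
```

In the timed-process scenario from the question, the delegate that runs every 30 seconds could call board.Beat("timer-loop") first thing, and a monitor could alert when StaleWorkers(TimeSpan.FromMinutes(2)) is non-empty.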
Another possibility is some sort of permissions problem, because the service is probably running under a different local/domain user than the console app.
After the hang, can you use the SCM to stop the service? If you can't, then there is probably some sort of thread deadlock problem. After the service appears to hang, you can go to a command-line and type sc queryex servicename. This should give you the current STATE of the service.
I would probably put in some file logging just to see how far the program is getting. It may give you a better idea of what is looping/hanging/deadlocked/crashing.
You can try these techniques
Logging: Start logging the flow of the code in the service. Make this parameter-based so you don't have a deluge after you are done. You should log all function names, parameters, and timestamps.
Attach Debugger: Locally or remotely attach a debugger with the code to the running service, and set appropriate breakpoints (these can be based on the data gathered from logging).
PerfMon: Run this utility and gather information about the machine that the service is running on for any additional clues (high CPU spikes, IO spikes, excessive paging, etc.).
Microsoft provides a good resource on debugging a Windows Service. That essentially sounds like what you'd have to do, given that your question is so generic. With that said, have any changes been made to the system over the last few days that could adversely affect the service? Have you made any updates to the code that change the way the service might possibly work?
Again, I think you're going to have to do some serious debugging to find your problem.
What type of timer are you using in the Windows service? I've seen numerous people on SO have problems with timers and Windows services. Here is a good tutorial just to make sure you are setting it up correctly and using the right type of timer. Hope that helps.
Another potential problem in reference to psasik's answer is if your application is relying on something only available when being run in User Mode.
Running in service mode runs in (is it desktop0?) which can cause some issues in my experience if you are trying to determine states of something that can only be seen in user mode.
Smells like a threading issue to me. Is there any threading or async work being done at all? One crucial question is "does the service hang on the same line of code or same method every time?" Use your logging to find out the last thing that happens before a hang, and if so, post the problem code.
One other tool you may consider is a good profiler. If it is .NET code, I believe RedGate ANTS can monitor it and give you a good picture of any threadlock scenarios.
