Diagnose/Debug potential stack corruption in a .NET application - c#

I think I have a curly one here... I have a WinForms application that crashes fairly regularly, every hour or so, when running as an x64 process. I suspect this is due to stack corruption and would like to know if anyone has seen a similar issue or has some advice for diagnosing and detecting the issue.
The program in question has no visible UI. It's just a message window that sits in the background and acts as a sort of 'middleware' between our other client programs and a server.
It dies in different ways on different machines. Sometimes it's an 'APPCRASH' dialog that reports a fault in ntdll.dll. Sometimes it's an 'APPCRASH' that reports our own dll as the culprit. Sometimes it's just a silent death. Sometimes our unhandled exception hook logs the error, sometimes it doesn't.
In the cases where Windows Error Reporting kicks in, I've examined memory dumps from several different crash scenarios and found the same managed exception in memory each time. This is the same exception I see reported as an unhandled exception in the cases where it logs before it dies.
I've also been lucky (?) enough to have the application crash while I was actively debugging with Visual Studio - and saw that same exception take down the program.
Now here's the kicker. This particular exception was thrown, caught and swallowed in the first few seconds of the program's life. I have verified this with additional trace logging and I have taken memory dumps of the application a couple of minutes after application startup and verified that exception is still sitting there in the heap somewhere. I've also run a memory profiler over the application and used that to verify that no other .NET object had a reference to it.
The code in question looks a bit like this (vastly simplified, but it maintains the key points of flow control):
public class AClass
{
    public object FindAThing(string key)
    {
        object retVal = null;
        Collection<Place> places = GetPlaces();
        foreach (Place place in places)
        {
            try
            {
                retVal = place.FindThing(key);
                break;
            }
            catch {} // Guaranteed to only be a 'NotFound' exception
        }
        return retVal;
    }
}
public class Place
{
    public object FindThing(string key)
    {
        bool found = InternalContains(key); // <snip> some complex if/else logic
        if (code == success)
            return InternalFetch(key);
        throw new NotFoundException(/*UsefulInfo*/);
    }
}
The stack trace I see, both in the event log and when looking at the heap with windbg looks a bit like this.
Company.NotFoundException:
Place.FindThing()
AClass.FindAThing()
Now... to me that reeks of something like stack corruption. The exception is thrown and caught while the application is starting up. But the pointer to it survives on the stack for an hour or more, like a bullet in the brain, and then suddenly breaches a crucial artery, and the application dies in a puddle.
Extra clues:
The code within 'InternalFetch' uses some Marshal.[Alloc/Free]CoTask and pinvoke code. I have run FxCop over it looking for portability issues, and found nothing.
This particular manifestation of the issue is only affecting x64 code built in release mode (with code optimization on). The code I listed for the 'Place.Find' method reflects the optimized .NET code. The unoptimized code returns the found object as the last statement, not 'throw exception'.
We make some COM calls during startup before the above code is run... and in a scenario where the above problem is going to manifest, the very first COM call fails. (Exception is caught and swallowed). I have commented out that particular COM call, and it does not stop the exception sticking around on the heap.
The problem might also affect 32 bit systems, but if it does - then the problem does not manifest in the same spot. I was only sent (typical users!) a few pixels worth of a screen shot of an 'APP CRASH' dialog, but the one thing I could make out was 'StackHash_2264' in the faulting module field.
EDIT:
Breakthrough!
I have narrowed down the problem to a particular call to SetTimer.
The pInvoke looks like this:
[DllImport("user32")]
internal static extern IntPtr SetTimer(IntPtr hwnd, IntPtr nIDEvent, int uElapse, TimerProc CB);
internal delegate void TimerProc(IntPtr hWnd, uint nMsg, IntPtr nIDEvent, int dwTime);
There is a particular class that starts a timer in its constructor. Any timers set before that object is constructed work. Any timers set after that object is constructed work. Any timer set during that constructor causes the application to crash, more often than not. (I have a laptop that crashes maybe 95% of the time, but my desktop only crashes 10% of the time).
Whether the interval is set to 1 hour or 1 second seems to make no difference. The application dies when the timer is due - usually by throwing some previously handled exception as described above. The callback does not actually get executed. If I set the same timer on the very next line of managed code after the constructor returns - all is fine and happy.
I have had a debugger attached when the bad timer was about to fire, and it caused an access violation in 'DispatchMessage'. The timer callback was never called. I have enabled the MDAs that relate to managed callbacks being garbage collected, and it isn't triggering. I have examined the objects with sos and verified that the callback still existed in memory, and that the address it pointed to was the correct callback function.
If I run '!analyze -v' at this point, it usually (but not always) reports something along the lines of 'ERROR_SXS_CORRUPT_ACTIVATION_STACK'
Replacing the call to SetTimer with Microsoft's 'System.Windows.Forms.Timer' class also stops the crash. I've used Reflector on the class and can see that internally it still calls SetTimer - but does not register a procedure. Instead it has a native window that receives the callback. Its P/Invoke definition actually looks wrong... it uses 'ints' for the eventId, where the MSDN documentation says it should be a UIntPtr.
Our own code originally also used 'int' for nIDEvent rather than IntPtr - I changed it during the course of this investigation - but the crash continued both before and after this declaration change. So the only real difference that I can see is that we are registering a callback, and the Windows class is not.
So... at this stage I can 'fix' the problem by shuffling one particular call to SetTimer to a slightly different spot. But I am still no closer to actually understanding what is so special about starting the timer within that constructor that causes this error. And I dearly would like to understand the root cause of this issue.

Just briefly thinking about it, it sounds like an x64 interop issue (i.e., calling 32-bit native functions from 64-bit managed code is fraught with danger). Does the problem go away if you force your application to compile for the x86 platform from within project properties?
You can read suggestions on forcing an x86 compile during x86/x64 development on Dotnetrocks. Richard Campbell's suggestion is that Visual Studio should default to the x86 platform and not AnyCPU.
http://www.dotnetrocks.com/default.aspx?showNum=341 (transcript).
With regard to advanced debugging, I have not had a chance to debug x64 interop code, but I hear that this book is a great resource: Advanced .NET Debugging.
Finally, one thing you might try is force Visual Studio to break when an exception is thrown.
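If breaking in the debugger on every throw is too intrusive for a process that only dies after an hour, a rough alternative is to log first-chance exceptions from inside the process. The sketch below assumes .NET 4.0 or later (where AppDomain.FirstChanceException exists); the class name FirstChanceLog and the in-memory queue are made up for illustration and would be replaced by whatever trace logging the application already has.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.ExceptionServices;

// Hypothetical helper: records every exception at the moment it is thrown,
// including exceptions that are later caught and swallowed.
public static class FirstChanceLog
{
    public static readonly ConcurrentQueue<string> Entries = new ConcurrentQueue<string>();

    public static void Install()
    {
        AppDomain.CurrentDomain.FirstChanceException += OnFirstChance;
    }

    private static void OnFirstChance(object sender, FirstChanceExceptionEventArgs e)
    {
        // Keep the handler trivial: an exception thrown from here would raise
        // the event again.
        Entries.Enqueue(DateTime.UtcNow.ToString("o") + " "
            + e.Exception.GetType().FullName + ": " + e.Exception.Message);
    }
}
```

Calling FirstChanceLog.Install() once at startup and dumping Entries from the unhandled exception hook would show whether the exception that kills the process was really thrown again, or is the same old instance resurfacing.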

Use something like DebugDiag for x64 or Windbg to write a dump on Kernel32!TerminateProcess and on second-chance .NET exceptions, which should give you the actual .ecxr context frame of the exception that occurred.
This should help you identify the call stack for the process termination.
IMO it is most likely caused by P/Invoke calls. You could use Managed Debugging Assistants to debug these issues.
If MDAs are used along with Windbg, they give out messages that are helpful in debugging.
Also, I have found tools from the http://clrinterop.codeplex.com/ team extremely handy when dealing with interop.
EDIT
This should give an answer as to why it is not working in 64 bit: "Issue with callback method in SetTimer Windows API called from C# code".

This does sound like a corruption issue. I would go through all of your interop calls and ensure that all of the parameters to the DllImport'ed functions are the correct types. For example, using an int in place of an IntPtr will work in 32-bit code but can crash in 64-bit code.
I would use a site like PInvoke.net to verify all of the signatures.
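As an illustration of that kind of check, here is a sketch of how the SetTimer declaration could look when each parameter is matched against the documented native prototype (quoted in the comments). The class name TimerInterop and the Start helper are hypothetical, and this is not claimed to be the root cause of the crash described in the question; it only shows pointer-sized arguments declared as pointer-sized types and the callback delegate kept reachable.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper; only SetTimer, KillTimer and TIMERPROC are real Win32 names.
public static class TimerInterop
{
    // void CALLBACK TimerProc(HWND hwnd, UINT uMsg, UINT_PTR idEvent, DWORD dwTime)
    [UnmanagedFunctionPointer(CallingConvention.Winapi)]
    public delegate void TimerProc(IntPtr hWnd, uint uMsg, UIntPtr idEvent, uint dwTime);

    // UINT_PTR SetTimer(HWND hWnd, UINT_PTR nIDEvent, UINT uElapse, TIMERPROC lpTimerFunc)
    [DllImport("user32.dll", SetLastError = true)]
    public static extern UIntPtr SetTimer(IntPtr hWnd, UIntPtr nIDEvent, uint uElapse, TimerProc lpTimerFunc);

    // BOOL KillTimer(HWND hWnd, UINT_PTR uIDEvent)
    [DllImport("user32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool KillTimer(IntPtr hWnd, UIntPtr uIDEvent);

    // A static field keeps the delegate reachable for as long as the native
    // side may still invoke it.
    private static TimerProc s_callback;

    public static UIntPtr Start(uint intervalMs, TimerProc callback)
    {
        s_callback = callback;
        return SetTimer(IntPtr.Zero, UIntPtr.Zero, intervalMs, s_callback);
    }
}
```

UINT_PTR is 4 bytes in a 32-bit process and 8 bytes in a 64-bit one, which is why int happens to work on x86 and is wrong on x64.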

Related

How to debug "Not enough storage is available to process this command"

We've started to experience Not enough storage available to process this command. The application is WPF, the exception starts to pop up after some hours of working normally.
System.ComponentModel.Win32Exception (0x80004005): Not enough storage is available to process this command
at MS.Win32.UnsafeNativeMethods.RegisterClassEx(WNDCLASSEX_D wc_d)
at MS.Win32.HwndWrapper..ctor(Int32 classStyle, Int32 style, Int32 exStyle, Int32 x, Int32 y, Int32 width, Int32 height, String name, IntPtr parent, HwndWrapperHook[] hooks)
at System.Windows.Interop.HwndSource.Initialize(HwndSourceParameters parameters)
at System.Windows.Window.CreateSourceWindow(Boolean duringShow)
at System.Windows.Window.CreateSourceWindowDuringShow()
at System.Windows.Window.SafeCreateWindowDuringShow()
at System.Windows.Window.ShowHelper(Object booleanBox)
at System.Windows.Window.Show()
at System.Windows.Window.ShowDialog()
My understanding is this is some kind of out of memory exception, specific to allocation of windows resources. What is the possible reason of this and how can I debug it?
Update
I have reviewed the topic suggested by @Thili77 (this one). I used GDIView and Task Manager to look at the handles consumed while our app is running (Handles, USER Objects and GDI Objects in taskmgr), and they don't appear to be growing. My next test is to run it for a day without VS (previously it was running under the VS host process) and check whether this still happens. I'm still looking for any advice or tips if anybody has any.
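For a run that lasts a day, the same three numbers can also be written to the application's own log instead of being watched in Task Manager. This is only a sketch: the class name ResourceCounters is made up, GetGuiResources is the user32 function that reports GDI and USER object counts for a process, and the code is Windows-only.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Hypothetical helper for periodic logging of handle usage.
public static class ResourceCounters
{
    public const uint GR_GDIOBJECTS = 0;   // count of GDI objects
    public const uint GR_USEROBJECTS = 1;  // count of USER objects

    // DWORD GetGuiResources(HANDLE hProcess, DWORD uiFlags)
    [DllImport("user32.dll")]
    public static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    public static string Snapshot()
    {
        using (Process p = Process.GetCurrentProcess())
        {
            uint gdi = GetGuiResources(p.Handle, GR_GDIOBJECTS);
            uint user = GetGuiResources(p.Handle, GR_USEROBJECTS);
            return string.Format("{0:o} handles={1} gdi={2} user={3}",
                DateTime.UtcNow, p.HandleCount, gdi, user);
        }
    }
}
```

Logging ResourceCounters.Snapshot() every few minutes from a timer gives a trend line that can be compared against the time of the failure.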
Update #2
It happens on a new clean PC without hosting VS. The handles, USER Objects and GDI Objects are OK during the crash. When the PC is in a crashed state, nothing works properly - it looks like handles really are leaked, but ProcMon doesn't show big numbers for these values. Also, weirdly, this always happens around 7-8 pm, when there is nobody in the office, and it doesn't matter when I started the app. It is already the third crash like that. Coincidence? The only weird thing I've noticed is a big number of page faults for the app, which grows constantly. Could this be related? Does not appear anymore, see Update #3
Update #3
Next are the details of a crash I experience. The system is x86, app is x86, W7 SP1.
The state shown in the screenshots is from right after the crash, with windbg pausing the process.
For some reason the exception now has a different message: The operation completed successfully. But it is still the same Win32Exception coming from the same piece of code.
I should also point out that I'm running with a reduced desktop heap and with the AppAnalyzer Basic options on - in order to make the fault more frequent (which seems to work). The time assumption was indeed a coincidence; I no longer see any time-related pattern.
One possibility is that the global atom table has run out of available space. There is a limit of 0x4000 string atoms in the table, and there is also a limit on the total amount of space allocated to the table. Window classes are one of the things that go into this table.
I have never attempted to debug such an issue myself, but I did find an article about checking for this problem using WinDbg: Identifying Global Atom Table Leaks. You might want to look into that as a possible cause.
If this turns out to be the culprit, one possible cause is that the application is not closing Window instances. HwndWrapper cleans up its global atom in its Dispose, which happens in response to WM_DESTROY, which happens in response to calling Close on the Window (or setting DialogResult, which ends up closing the window if the value changes and the window was shown by calling ShowDialog rather than Show). There may be other possible causes for an atom leak as well.
P.S. The reason I suspect this is because "Not enough storage is available to process this command" is the error that is returned when RegisterClassEx is unable to add to the global atom table.
Looks like an issue which was not resolved on purpose by Microsoft, check this Connect link, in which it was stated:
We appreciate the feedback. However, this issue will not be addressed in the next version of WPF. Thank you.
–WPF Team.
A workaround is provided, it might help:
You can work around this bug by adding the following code to your thread proc:
Dispatcher dispatcher = Dispatcher.CurrentDispatcher;
dispatcher.BeginInvokeShutdown(DispatcherPriority.Normal);
Dispatcher.Run();
This asks the dispatcher associated with the thread to shut down right away.
In my experience, this type of exception occurs when your UI thread hangs while other threads continue to post messages to the main application UI dispatcher. After some period of time the message queue fills up, and then you receive this exception.
To debug that, you may need to find your thread 1 (which is the UI thread) in VS during a debug session and monitor its activity. Maybe there is an infinite wait on some external event, or something similar.

How to interpret this stack trace

I recently released a Windows phone 8 app.
The app sometimes seems to crash randomly, but the problem is that it crashes without breaking into the debugger, and the only info I get is a message in the output window that tells me there was an access violation, without any details.
After releasing, I was able to obtain some more information from the crash reports, but it is rather cryptic to me.
The info are:
Problem function: unknown //not very useful
Exception type: c0000005 //this is the code for Access violation exception
Stack trace:
Frame Image Function Offset
0 qcdx9um8960 0x00035426
1 qcdx9um8960 0x000227e2
I'm not used to working with memory pointers and the like, and I'm not used to seeing a stack trace like that.
So I have these questions:
How should I interpret/read this information, and what is the meaning of each piece?
Is there a way to use this information to narrow down my search for the problem?
Is there a way to get this information while debugging in VS2012?
Notes:
I'm not asking what an Access Violation is
I tagged this as c# and c++ because my code is in c# but the exception is generated (I'm semi-guessing) by c++ implementation for the WebBrowser component
edit:
I tried setting the Debug type to Native only, which let me obtain the same info I had in the crash report on the dev center. This way the debugger breaks when the exception is thrown and lets me see the disassembled code. Unfortunately there's no qcdx9um8960 .pdb file (even on the Microsoft Symbol Server), so I don't know the name of the function that caused the error.
Curiously, a search on the web for the image name "qcdx9um8960" returns several results referencing Windows Phone 8 and the WebBrowser control. Gathering the answers and replies (some even by MSFT), here is what you should possibly look into:
If you upgraded your application from Windows Phone 6/7 to 8, make sure you are not still referencing any 6/7 DLLs. [1]
Make sure you aren't testing or publishing your software in Debug mode. There is a "qcdx9um8960.pdb" file that might be missing, causing the access violation. [1]
"...there is a possible race condition known issue if the app has multiple copies of WebBrowser open. See if your code perhaps inadvertently makes more than one instance." [1]
That image, "qcdx9um8960", is referencing a Qualcomm DirectX driver DLL. Perhaps it's not the WebBrowser component's fault, but the DirectX driver it might be using to render the web pages. [2]
The name of the image suggests that the crash is happening on devices powered by a Qualcomm Snapdragon S4 Plus with model number MSM8960. [3]
Assuming the processor above, and cross referencing Windows phones that use that chip, you're likely looking at the issue occurring on the Nokia Lumia 920T. [3] That's not to say that the driver doesn't work on several processor architectures or phones.
There are several other hits regarding crashes and issues debugging in the presence of that DLL, so unfortunately for you, I think you might be at the mercy of some third party software that has a few unresolved issues.
References
[1] Access Violation since updated to WP8
[2] [Toolkit][WP8] Performance issues with DepthStencilBuffer
[3] Snapdragon (system on chip)
This kind of crash "should" never be caused by managed code, so you could go looking for a case where your app invokes some system or library API incorrectly. That's tedious. And the problem might have nothing to do with your app, it might be entirely internal to someone else's code. E.g, maybe WebBrowser crashes when user browses to some evil page. Or the failing code could be running on a thread that never even runs your code. From your observation that the debugger doesn't show any message before the access violation, and the fact that there are only 2 frames on the call stack, I suspect that's most likely.
So you should focus first on getting a (fairly) reliable repro scenario: the (minimal) set of steps that will (often or usually) produce the crash. This may involve interviewing the users who experienced the crash, or maybe some test automation on your part to try to accelerate the failure rate.
Once you have that, Microsoft (or another 3rd party) will accept responsibility -- managed code is never supposed to be able to cause an unhandled exception like access violation. And the scenario might give you a hint about how you can change your app's behavior to avoid the problem, because a real fix might take a long time to be released and distributed.

I've found a bug in the JIT/CLR - now how do I debug or reproduce it?

I have a computationally-expensive multi-threaded C# app that seems to crash consistently after 30-90 minutes of running. The error it gives is
The runtime has encountered a fatal error. The address of the error was at 0xec37ebae, on thread 0xbcc. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.
(0xc0000005 is the error-code for Access Violation)
My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint. In fact, the line of code that the debugger says caused the crash is
overallLength += distanceTravelled;
Where both values are of type double
Given all this, I believe the crash must be due to a bug in the compiler or CLR or JIT. I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin. I've never had to view the CIL-binary, or the compiled JIT output, or the native stacktrace (there is no managed stacktrace at the time of the crash), so I'm not sure how. I can't even figure out how to view the state of all the variables at the time of the crash (VS unfortunately won't tell me like it does after managed-exceptions, and outputting them to console/a file would slow down the app 1000-fold, which is obviously not an option).
So, how do I go about debugging this?
[Edit] Compiled under VS 2010 SP1, running latest version of .Net 4.0 Client Profile. Apparently it's ".Net 4.0C/.Net 4.0E, .Net CLR 1.1.4322"
I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin.
"Smaller reproduction" definitely sounds like a great idea here... even if "smaller" won't mean "quicker to reproduce".
Before you even start, try to reproduce the error on another machine. If you can't reproduce it on another machine, that suggests a whole different set of tests to do - hardware, installation etc.
Also, check you're on the latest version of everything. It would be annoying to spend days debugging this (which is likely, I'm afraid) and then end up with a response of "Yes, we know about this - it was a bug in .NET 4 which was fixed in .NET 4.5" for example. If you can reproduce it on a variety of framework versions, that would be even better :)
Next, cut out everything you can in the program:
Does it have a user interface at all? If possible, remove that.
Does it use a database? See if you can remove all database access: definitely any output which isn't used later, and ideally input too. If you can hard code the input within the app, that would be ideal - but if not, files are simpler for reproductions than database access.
Is it data-sensitive? Again, without knowing much about the app it's hard to know whether this is useful, but assuming it's processing a lot of data, can you use a binary search to find a relatively small amount of data which causes the problem?
Does it have to be multi-threaded? If you can remove all the threading, obviously that may well then take much longer to reproduce the problem - but does it still happen at all?
Try removing bits of business logic: if your app is componentized appropriately, you can probably fake out whole significant components by first creating a stub implementation, and then simply removing the calls.
All of this will gradually reduce the size of the app until it's more manageable. At each step, you'll need to run the app again until it either crashes or you're convinced it won't crash. If you have a lot of machines available to you, that should help...
tl;dr Make sure you're compiling to .Net 4.5
This sounds suspiciously like the same error found here. From the MSDN page:
This bug can be encountered when the Garbage Collector is freeing and compacting memory. The error can happen when the Concurrent Garbage Collection is enabled and a certain combination of foreground Garbage Collection and background Garbage Collection occurs. When this situation happens you will see the same call stack over and over. On the heap you will see one free object and before it ends you will see another free object corrupting the heap.
The fix is to compile to .Net 4.5. If for some reason you can't do this, you can also disable concurrent garbage collection by disabling gcConcurrent in the app.config file:
<configuration>
<runtime>
<gcConcurrent enabled="false"/>
</runtime>
</configuration>
Or just compile to x86.
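To confirm which of these settings a given process is actually running with, a small report can be logged at startup. This is a sketch with a made-up class name (RuntimeReport). It relies on the documented meaning of GCLatencyMode, where Batch corresponds to concurrent garbage collection being disabled and Interactive to it being enabled; treat that mapping as something to verify against the documentation for your runtime version.

```csharp
using System;
using System.Runtime;

// Hypothetical helper: describes the runtime the process is really using.
public static class RuntimeReport
{
    public static string Describe()
    {
        return string.Format(
            "CLR={0} 64bit={1} serverGC={2} latencyMode={3}",
            Environment.Version,          // version of the loaded CLR
            Environment.Is64BitProcess,   // false when running as x86
            GCSettings.IsServerGC,
            GCSettings.LatencyMode);      // Batch when gcConcurrent is disabled
    }
}
```

Writing RuntimeReport.Describe() to the log before the long-running work starts makes it easy to tell, after a crash, whether the app.config change was actually picked up.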
WinDbg is your friend:
http://blogs.msdn.com/b/tess/archive/2006/02/09/net-crash-managed-heap-corruption-calling-unmanaged-code.aspx
http://www.codeproject.com/Articles/23589/Get-Started-Debugging-Memory-Related-Issues-in-Net
http://www.codeproject.com/Articles/22245/Quick-start-to-using-WinDbg
Download Debug Diagnostic Tool v1.2
Run program
Add Rule "Crash"
Select "Specific Process"
On the Advanced Configuration page, set your exception if you know which exception it fails on, or just leave this page as is
Set userdump location
Now wait for the process to crash; DebugDiag creates a log file. Then open the Advanced Analysis tab, select Crash/Hang Analyzers in the top list and the dump file in the lower list, and hit Start Analysis. This generates an HTML report for you. Hopefully you will find useful info in that report. If you have problems with the analysis, upload the HTML report somewhere and post the URL here so we can look at it.
My app does not invoke any native code, or use any unsafe blocks, or
even any non-CLS compliant types like uint
You may think so, but threading and synchronization via semaphores, mutexes or any other handles are all native. .NET is a layer over the operating system; it does not implement multithreading in pure CLR code, because the OS already does it.
Most likely this is a thread synchronization error. Probably multiple threads are trying to access a shared resource, such as a file, that is outside the CLR boundary.
You may think you aren't using COM, but when you call certain APIs, such as getting the desktop folder path, the call goes through the shell COM API.
You have the following two options:
Publish your code so that we can review the bottleneck
Redesign your app using the .NET parallel threading framework, which includes a variety of algorithms for CPU-intensive operations.
Most likely the program fails after a certain period of time because collections grow and operations fail to complete before another thread interferes. Take the producer-consumer problem, for example: you will not notice any problem until the producer becomes slower or fails to finish its operation before the consumer kicks in.
A bug in the CLR is rare, because the CLR is very stable. But poorly written code may cause an error that looks like a bug in the CLR. The CLR cannot and never will detect whether the bug is in your code or in the CLR itself.
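One cheap way to rule ordinary races in or out is to put the shared state behind a lock and see whether the behaviour changes. The sketch below assumes the shared state is a running total like the overallLength in the question; the class DistanceAccumulator and its RunDemo method are made up for illustration. A race on a double in verifiable managed code would normally show up as a wrong total and not as an access violation, so this is a sanity check and not a diagnosis.

```csharp
using System;
using System.Threading;

// Hypothetical example: a running total that is safe to update from many threads.
public sealed class DistanceAccumulator
{
    private readonly object _gate = new object();
    private double _overallLength;

    public void Add(double distanceTravelled)
    {
        // "+=" on a shared double is a read-modify-write sequence, so it
        // needs the lock to avoid lost updates.
        lock (_gate) { _overallLength += distanceTravelled; }
    }

    public double Total
    {
        get { lock (_gate) { return _overallLength; } }
    }

    // Runs threadCount threads that each add 1.0 addsPerThread times.
    public static double RunDemo(int threadCount, int addsPerThread)
    {
        var acc = new DistanceAccumulator();
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < addsPerThread; j++) acc.Add(1.0);
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();
        return acc.Total;
    }
}
```

With the lock in place, RunDemo(4, 100000) returns exactly 400000; removing the lock typically produces a smaller number on a multi-core machine.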
Did you run a memory test on your machine? The one time I had comparable symptoms, one of my DIMMs turned out to be faulty (a very good memory tester is included in Win7; http://www.tomstricks.com/how-to-test-your-ram-or-memory-with-windows-memory-diagnostic-tool-in-windows-7/)
It might also be a heating/throttling issue if your CPU gets too hot after this period of time, although that would happen sooner, imho.
There should be a dump file that you can analyze. If you have never done this, find someone who has, or send it to Microsoft.
I will suggest you open a support case via http://support.microsoft.com immediately, as the support guys can show you how to collect the necessary information.
Generally speaking, as @paulsm4 and @psulek said, you can use WinDbg or Debug Diag to capture crash dumps of the process, and all the necessary information is embedded in them. However, if this is the very first time you use those tools, you might be puzzled. The Microsoft support team can give you step-by-step guidance on them, or they can even set up a Live Meeting session with you to capture the data, as the program crashes so often.
Once you are familiar with the tools, in the future you can perform similar troubleshooting more easily,
http://blogs.msdn.com/b/lexli/archive/2009/08/23/when-the-application-program-crashes-on-windows.aspx
BTW, it is too early to say "I've found a bug". Though you cannot obviously find in your program a dependency on native code, it might still have a dependency on native code. We should not draw a conclusion before debugging further into the issue.

What exactly happens during a "managed-to-native transition"?

I understand that the CLR needs to do marshaling in some cases, but let's say I have:
using System.Runtime.InteropServices;
using System.Security;
[SuppressUnmanagedCodeSecurity]
static class Program
{
[DllImport("kernel32.dll", SetLastError = false)]
static extern int GetVersion();
static void Main()
{
for (; ; )
GetVersion();
}
}
When I break into this program with a debugger, I always see:
Given that there is no marshaling that needs to be done (right?), could someone please explain what's actually happening in this "managed-to-native transition", and why it is necessary?
First the call stack needs to be set up so that a STDCALL can happen. This is the calling convention for Win32.
Next the runtime will push a so called execution frame. There are many different types of frames: security asserts, GC protected regions, native code calls, ...
The runtime uses such a frame to track that currently native code is running. This has implications for a potentially concurrent garbage collection and probably other stuff. It also helps the debugger.
So not a lot is happening here actually. It is a pretty slim code path.
Besides the marshaling layer, which is responsible for converting parameters for you and figuring out calling conventions, the runtime needs to do a few other things to keep internal state consistent.
The security context needs to be checked, to make sure the calling code is allowed to access native methods. The current managed stack frame needs to be saved, so that the runtime can do a stack walk back for things like debugging and exception handling (not to mention native code that calls into a managed callback). Internal bits of state need to be set to indicate that we're currently running native code.
Additionally, registers may need to be saved, depending on what needs to be tracked and which are guaranteed to be restored by the calling convention. GC roots that are in registers (locals) might need to be marked in some way so that they don't get garbage collected during the native method.
So mainly it's stack handling and type marshaling, with some security stuff thrown in. Though it's not a huge amount of stuff, it will represent a significant barrier against calling smaller native methods. For example, trying to P/Invoke into an optimized math library rarely results in a performance win, since the overhead is enough to negate any of the potential benefits. Some performance profiling results are discussed here.
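To get a rough feel for that overhead on your own machine, you can time a trivial P/Invoke against a trivial managed call. This is a sketch only: the class TransitionCost and its helpers are made up, GetVersion is the same kernel32 function used in the question (declared here as uint to match the native DWORD return type), and going through an Action delegate adds its own cost to both measurements, so only the difference between the two numbers is meaningful.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Security;

// Hypothetical micro-benchmark helper (Windows-only because of kernel32).
[SuppressUnmanagedCodeSecurity]
public static class TransitionCost
{
    [DllImport("kernel32.dll")]
    public static extern uint GetVersion();

    private static int s_counter;

    // A managed call that does almost nothing, used as the baseline.
    public static void ManagedNoOp() { s_counter++; }

    // Average cost of one call to 'action', in nanoseconds.
    public static double NanosecondsPerCall(Action action, int iterations)
    {
        action(); // warm up: JIT the target and bind the P/Invoke stub
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1e6 / iterations;
    }
}
```

For example, compare TransitionCost.NanosecondsPerCall(() => TransitionCost.GetVersion(), 10000000) with TransitionCost.NanosecondsPerCall(TransitionCost.ManagedNoOp, 10000000) in a release build run outside the debugger.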
I realise that this has been answered, but I'm surprised that no one has suggested that you show the external code in the debug window. If you right click on the [Native to Managed Transition] line and tick the Show External Code option, you will see exactly which .NET methods are being called in the transition. This may give you a better idea. Here is an example:
I can't really see much that'd be necessary to do. I suspect that it is mainly informative, to indicate to you that part of your call stack shows native functions, and also to indicate that the IDE and debugger may behave differently across that transition (since managed code is handled very differently in the debugger, and some features you expect may not work)
But I guess you should be able to find out simply by inspecting the disassembly around the transition. See if it does anything unusual.
Since you are calling into a DLL, execution needs to leave the managed environment and go into the Windows core. You are crossing the .NET barrier and entering Windows code that doesn't run the same way as .NET code.

Finally Block Not Running?

Ok this is kind of a weird issue and I am hoping someone can shed some light. I have the following code:
static void Main(string[] args)
{
try
{
Console.WriteLine("in try");
throw new EncoderFallbackException();
}
catch (Exception)
{
Console.WriteLine("in Catch");
throw new AbandonedMutexException();
}
finally
{
Console.WriteLine("in Finally");
Console.ReadLine();
}
}
Now, when I compile this to target 3.5 (2.0 CLR), it will pop up a window saying "XXX has stopped working". If I click on the Cancel button, it will run the finally, and if I wait until it is done looking and click on the Close Program button, it will also run the finally.
What is interesting and confusing is that if I do the same thing compiled against 4.0, clicking on the Cancel button will run the finally block and clicking on the Close Program button will not.
My question is: Why does the finally run on 2.0 and not on 4.0 when hitting the Close Program button? What are the repercussions of this?
EDIT: I am running this from a command prompt in release mode (built in release mode) on Windows 7 32 bit. Error Message: First Result below is running on 3.5 hitting close after Windows looks for the issue, second is when I run it on 4.0 and do the same thing.
I am able to reproduce the behavior now (I didn't get the exact steps from your question when I was reading it the first time).
One difference I can observe is in the way that the .NET runtime handles the unhandled exception. The CLR 2.0 runs a helper called Microsoft .NET Error Reporting Shim (dw20.exe) whereas the CLR 4.0 starts Windows Error Reporting (WerFault.exe).
I assume that the two have different behavior with respect to terminating the crashing process. WerFault.exe obviously kills the .NET process immediately whereas the .NET Error Reporting Shim somehow closes the application so that the finally block still is executed.
Also have a look at the Event Viewer: WerFault logs an application error notifying that the crashed process was terminated:
Application: ConsoleApplication1.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Threading.AbandonedMutexException
Stack:
at Program.Main(System.String[])
dw20.exe however only logs an information item with event id 1001 to the Event Log and does not terminate the process.
Think about how awful that situation is: something unexpected has happened that no one ever wrote code to handle. Is the right thing to do in that situation to run even more code, that was probably also not built to handle this situation? Possibly not. Often the right thing to do here is to not attempt to run the finally blocks because doing so will make a bad situation even worse. You already know the process is going down; put it out of its misery immediately.
In a scenario where an unhandled exception is going to take down the process, anything can happen. It is implementation-defined what happens in this case: whether the error is reported to Windows error reporting, whether a debugger starts up, and so on. The CLR is perfectly within its rights to attempt to run finally blocks, and is also perfectly within its rights to fail fast. In this scenario all bets are off; different implementations can choose to do different things.
All my knowledge on this subject is taken from this article here: http://msdn.microsoft.com/en-us/magazine/cc793966.aspx - please note it is written for .NET 2.0 but I have a feeling it makes sense for what we were experiencing in this case (more than "because it decided to" anyways)
Quick "I don't have time to read that article" answer (although you should, it's a really good one):
The solution to the problem (if you absolutely HAVE to have your finally blocks run) would be to a) put in a global error handler or b) force .NET to always run finally blocks and do things the way it did (arguably the wrong way) in .NET 1.1. Place the following in your app.config:
<legacyUnhandledExceptionPolicy enabled="1" />
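For placement, the element belongs under the runtime section of the configuration file. A minimal sketch of a complete app.config, assuming the application has no other settings:

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Opt back into the .NET 1.0/1.1 unhandled-exception behavior -->
    <legacyUnhandledExceptionPolicy enabled="1" />
  </runtime>
</configuration>
```

Whether this changes the finally behavior in the exact scenario from the question is the assumption made in this answer, so test it on the target runtime before relying on it.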
The reason for it:
When an exception is thrown in .NET, the runtime starts walking back through the stack looking for exception handlers. When it finds one, it does a second walk back through the stack, running finally blocks, before running the content of the catch. If it does not find a catch, this second walk never happens, so the finally blocks are never run. This is why a global exception handler will always cause finally clauses to run: the CLR runs them when it finds the catch, NOT when it executes it (which I believe means that even if you do a catch/throw, your finally blocks will still get run).
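The two walks can be made visible with an exception filter, because a filter is evaluated during the first walk, before any finally block runs. The following is only a sketch: `TwoPassDemo`, `Inner` and `Note` are hypothetical names, and the `when` syntax requires C# 6 or later.

```csharp
using System;
using System.Collections.Generic;

public static class TwoPassDemo
{
    // Returns the order in which the pieces of exception handling ran.
    public static List<string> Run()
    {
        var log = new List<string>();
        try
        {
            Inner(log);
        }
        // The filter is evaluated during the first walk (handler search).
        catch (InvalidOperationException) when (Note(log, "filter (pass 1)"))
        {
            // The catch body runs only after the second walk has finished.
            log.Add("catch body");
        }
        return log;
    }

    private static void Inner(List<string> log)
    {
        try
        {
            log.Add("in try");
            throw new InvalidOperationException("boom");
        }
        finally
        {
            // Runs during the second walk, once a handler has been found.
            log.Add("in finally (pass 2)");
        }
    }

    // Hypothetical helper: records a message and accepts the exception.
    private static bool Note(List<string> log, string message)
    {
        log.Add(message);
        return true;
    }

    public static void Main()
    {
        foreach (var line in Run())
        {
            Console.WriteLine(line);
        }
    }
}
```

The recorded order is "in try", "filter (pass 1)", "in finally (pass 2)", "catch body". If the outer catch is removed, the first walk finds no handler, and whether the finally block runs at all is up to the runtime, which is the situation in the question.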
The reason the app.config fix works is that in .NET 1.0 and 1.1 the CLR had a global catch in it, which would swallow exceptions before they went unmanaged and, being a catch, would trigger the finally blocks to run. Of course, there is no way the framework can know enough about the exception to handle it (take a stack overflow, for example), so this is probably the wrong way of doing it.
The next bit is where it gets a bit sticky, and I am making assumptions based off of what the article says here.
If you are on .NET 2.0+ without the legacy exception handling turned on, your exception falls out into the Windows exception handling system (SEH). This seems pretty darn similar to the CLR one, in that it walks back through the frames and, if it fails to find a catch, calls the Unhandled Exception Filter (UEF).
The UEF is something you can subscribe to, but it can only have ONE subscriber at a time. When something does subscribe, Windows hands it the address of the callback that was there before, which allows a chain of UEF handlers to be set up. BUT THEY DON'T HAVE TO HONOR that address. Each handler should call the previous address itself, but if one breaks the chain, bap, you get no more error handling.
I assume that this is what is happening when you cancel Windows Error Reporting: it breaks the UEF chain, which means that the application is shut down immediately and the finally blocks are not run. If you let it run to the end and close it, however, it will call the next UEF in the chain. .NET will have registered one, which is where AppDomain.UnhandledException is called from (so even this event is not guaranteed). I assume this is also where your finally blocks get called from, as I can't see how a managed finally block can run if you never transition back into the CLR (the article does not go into this bit).
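For reference, subscribing to the managed event mentioned above looks roughly like this. It is a minimal sketch: `LastChanceLogger`, `Describe` and `Install` are hypothetical names, and the handler only observes the exception, it does not stop the process from terminating.

```csharp
using System;

public static class LastChanceLogger
{
    // Hypothetical helper: formats the event data for a log line.
    public static string Describe(UnhandledExceptionEventArgs e)
    {
        // ExceptionObject is typed as object, so cast defensively.
        var ex = e.ExceptionObject as Exception;
        string name = ex != null ? ex.GetType().FullName : "non-Exception object";
        return "Unhandled: " + name + " (terminating: " + e.IsTerminating + ")";
    }

    // Registers the last-chance handler on the current AppDomain.
    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException +=
            (sender, e) => Console.Error.WriteLine(Describe(e));
    }

    public static void Main()
    {
        Install();
        // Nothing catches this, so the handler above is the last managed
        // code that gets a chance to run, if the runtime raises the event.
        throw new InvalidOperationException("nobody catches this");
    }
}
```

As noted above, the event is not guaranteed to fire, so treat it as best-effort logging and not as a place for cleanup that must happen.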
I believe this has something to do with changes to how the debugger is attached.
From the .NET Framework 4 Migration Issues document:
You are no longer notified when the debugger fails to start, or when there is no registered debugger that should be started.
What happens is that you choose to start the debugger, but you cancel it. I believe this falls under this category and the application just stops because of this.
I ran this in both release and debug, on both framework 3.5 and 4.0, and I see "in Finally" in all instances. Yes, I ran it from the command line, and even went as far as closing my VS sessions. Maybe it's something on your machine or, as Kobi pointed out, maybe it's platform related (I'm on Win7 x64).
