Corrupted state exceptions (CSE) across AppDomain - c#

For some background info, .NET 4.0 no longer catches CSEs by default: http://msdn.microsoft.com/en-us/magazine/dd419661.aspx
I'm working on an app that executes code in a new AppDomain. If that code throws a CSE, the exception bubbles up to the main code if it's not handled. My question is, can I safely assume that a CSE on the second AppDomain won't corrupt the state in the main AppDomain, and thus exit the second AppDomain and continue running the main AppDomain?

In the context of a corrupted state exception, in general, you cannot assume anything to be true anymore. The point of these exceptions is that something has happened, usually due to buggy unmanaged code, that has violated some core assumption that Windows or the CLR makes about the structure of memory. That means that, in theory, the very structures that the CLR uses to track which app domains exist in memory could be corrupted. The kinds of things that cause CSEs are generally indicative that things have gone catastrophically wrong.
Having said all that, off-the-record, in some cases, you may be able to make a determination that it is safe to continue from a particular exception. An EXCEPTION_STACK_OVERFLOW, for example, is probably recoverable, and an EXCEPTION_ACCESS_VIOLATION usually indicates that Windows caught a potential bug before it had a chance to screw anything up. It's up to you if you're willing to risk it, depending on how much you know about the code that is throwing the CSEs in the first place.
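For readers on .NET Framework 4.x, here is a minimal sketch of how code can opt back in to seeing CSEs at one well-defined boundary. The CseBoundary class and TryRun method are hypothetical names used for illustration. The attribute is only honored by .NET Framework 4.x; on .NET Core and .NET 5+ it is obsolete and has no effect, so treat this as an illustration of the mechanism rather than a recommendation.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Security;

// Hypothetical helper, not part of the original question or answer.
public static class CseBoundary
{
    // On .NET Framework 4.x this attribute lets the catch block below see
    // corrupted state exceptions such as AccessViolationException.
    // On .NET Core / .NET 5+ the attribute is ignored.
    [HandleProcessCorruptedStateExceptions]
    [SecurityCritical]
    public static bool TryRun(Action risky, out Exception failure)
    {
        try
        {
            risky();
            failure = null;
            return true;
        }
        catch (Exception ex)
        {
            // Record the failure and do as little as possible here; the
            // caller decides whether to unload the domain or exit.
            failure = ex;
            return false;
        }
    }
}
```

The caller would wrap only the call into the second AppDomain, and on failure would still have to decide whether continuing is worth the risk described above.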

Related

Are CER needed to merely protect shared managed states within an AppDomain?

I have an operation that must be performed reliably as a whole or not at all.
The goal is only to preserve the consistency of some in-memory managed shared state.
That state is contained within an application domain and is not visible outside it.
I therefore do not have to react when the domain or the process is torn down.
I am writing a class library and the user may call my code from anywhere. However my code does not call any user code, not even virtual methods.
The CLR may be hosted.
To my understanding, I do not need constrained execution regions (CERs) because:
CERs are only needed to guard against the infamous OutOfMemoryException, ThreadAbortException and StackOverflowException.
My code does not allocate, so I do not care about OutOfMemoryException (allocations must not be done within a CER anyway).
If a stack overflow occurs, the process will be torn down anyway (or the domain, in some hosted scenarios).
Thread aborts are already delayed until the end of a finally block and my code is already within one.
Am I correct on those points? Do you see other reasons why I should need CER?
I finally found at least one reason why a CER is still needed: even if my code does not do any allocation, the JIT compiler may have to allocate memory on the first execution.
A CER is therefore required to force the runtime to JIT everything beforehand and prevent a possible OOM.
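A minimal sketch of the pattern discussed here, assuming .NET Framework: an empty try block with all the work in the finally block, preceded by RuntimeHelpers.PrepareConstrainedRegions so the finally block is prepared ahead of time. The PairedCounters class is a hypothetical stand-in for the shared state. On .NET 5+ the call is obsolete and does nothing.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical shared state used only for illustration.
public sealed class PairedCounters
{
    private int _left;
    private int _right;

    public int Left { get { return _left; } }
    public int Right { get { return _right; } }

    public void IncrementBoth()
    {
        // Asks the runtime to prepare (JIT) the finally block up front.
        // Honored by .NET Framework; obsolete and a no-op on .NET 5+.
        RuntimeHelpers.PrepareConstrainedRegions();
        try
        {
            // Intentionally empty: all work happens in the finally block.
        }
        finally
        {
            // Thread aborts are delayed until this block completes,
            // and nothing in here allocates.
            _left++;
            _right++;
        }
    }
}
```

Both fields are updated inside the finally block, so an asynchronous thread abort cannot land between the two increments.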

Why does .NET behave so poorly when StackOverflowException is thrown?

I'm aware that StackOverflowExceptions in .NET can't be caught, take down their process, and have no stack trace. This is officially documented on MSDN. However, I'm wondering what the technical (or other) reasons are behind the behavior. All MSDN says is:
In prior versions of the .NET Framework, your application could catch
a StackOverflowException object (for example, to recover from
unbounded recursion). However, that practice is currently discouraged
because significant additional code is required to reliably catch a
stack overflow exception and continue program execution.
What is this "significant additional code"? Are there other documented reasons for this behavior? Even if we can't catch SOE, why can't we at least get a stack trace? Several co-workers and I just sank several hours into debugging a production StackOverflowException that would have taken minutes with a stack trace, so I'm wondering if there is a good reason for my suffering.
The stack of a thread is created by Windows. It uses so-called guard pages to detect a stack overflow, a feature that is generally available to user-mode code, as described in the MSDN Library. The basic idea is that the last two pages of the stack (2 x 4096 = 8192 bytes) are reserved, and any processor access to them triggers a page fault that is turned into an SEH exception, STATUS_GUARD_PAGE_VIOLATION.
This is intercepted by the kernel in the case of those pages belonging to a thread stack. It changes the protection attributes of the first of those 2 pages, thus giving the thread some emergency stack space to deal with the mishap, then re-raises a STATUS_STACK_OVERFLOW exception.
This exception is in turn intercepted by the CLR. At that point there are about 3 kilobytes of stack space left. That is, for one, not enough to run the just-in-time compiler (JITter) to compile the code that could deal with the exception in your program; the JITter needs much more space than that. The CLR therefore cannot do anything but rudely abort the thread, and by .NET 2.0 policy that also terminates the process.
Note how this is less of a problem in Java: it has a bytecode interpreter, so there is a guarantee that executable user code can run. The same goes for an unmanaged program written in a language like C, C++ or Delphi, where code is generated at build time. It is still a very difficult mishap to deal with, however. The emergency space in the stack is blown, so there is no scenario where continuing to run code on the thread is safe. It is quite unlikely that a program can continue operating correctly with a thread aborted at a completely random location and with state that is probably corrupted.
If there was any effort at all in considering raising an event on another thread or in removing the restriction in the winapi (the number of guard pages is not configurable) then that's either a very well-kept secret or just wasn't considered useful. I suspect the latter, don't know it for a fact.
The stack is where virtually everything about the state of a program is stored. The address of each return site when methods are called, local variables, method parameters, etc. If a method overflows the stack, its execution must, by necessity, stop immediately (since there is no more stack space left for it to continue running). Then, to gracefully recover, somebody needs to clean up whatever that method did to the stack before it died. This means knowing what the stack looked like before the method was called. This incurs some overhead.
And if you can't clean up the stack, then you can't get a stack trace either, because the information required to generate the trace comes from "unrolling" the stack to discover which methods were called.
To handle stack overflow or out-of-memory conditions gracefully, it is necessary to trigger an exception somewhat before the stack has actually overflowed or heap memory is totally exhausted, at a time when the available stack and heap resources will be adequate to execute any cleanup code that will need to run before the exceptions are caught. In the case of stack-overflow exceptions, handling them cleanly would basically require checking the stack pointer on entry to each method (which shouldn't really be all that expensive). Normally, they're handled by setting an access-violation trap just beyond the end of the stack, but the problem with doing that is that the trap won't fire until it's already too late to handle things cleanly. One could set the trap to fire on the last memory block of the stack, rather than the one past, and have the system change the trap to the block past the stack once it fires and triggers a StackOverflowException, but the problem is there would be no nice way to ensure that the "almost out of stack" trap got re-enabled once the stack had unwound that far.
That having been said, an alternative approach would be to allow threads to set a delegate for what should happen if the thread blows its stack, and then say that in case of StackOverflowException the thread's stack will be cleared and it will run the supplied delegate. The trap could be re-instated before running the delegate (the stack would be empty at that point), and code could maintain a thread-status object that the delegate could use to know whether any important finally blocks got skipped.
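The idea of raising an exception somewhat before the stack is exhausted can be sketched with RuntimeHelpers.EnsureSufficientExecutionStack, which exists in .NET 4.0 and later and throws a catchable InsufficientExecutionStackException while stack space is still available. The Node and DepthCounter types below are hypothetical and exist only to give the recursion something to walk.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical linked structure used only for illustration.
public sealed class Node
{
    public Node Child;
}

public static class DepthCounter
{
    // Returns the chain length, or -1 if the stack was nearly exhausted.
    public static int TryCountDepth(Node root)
    {
        try
        {
            return CountDepth(root);
        }
        catch (InsufficientExecutionStackException)
        {
            // Unlike StackOverflowException, this one can be caught,
            // because it is thrown while stack space is still left.
            return -1;
        }
    }

    private static int CountDepth(Node node)
    {
        if (node == null) return 0;

        // Throws if too little stack remains for an average framework call.
        RuntimeHelpers.EnsureSufficientExecutionStack();
        return 1 + CountDepth(node.Child);
    }
}
```

The check costs one call per recursive step, which is the "checking the stack pointer on entry to each method" trade-off described above, applied only where the code opts in.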

Can there be a scenario when garbage collector fails to run due to an exception?

Just out of curiosity, I was wondering whether there is a scenario in which the garbage collector fails to run or doesn't run at all (possibly due to an exception)?
If so, most probably there would be an OutOfMemoryException or StackOverflowException. In that case, can we identify the core issue of the GC failing to run just by looking at the exception message, stack trace, etc.?
As others have mentioned, numerous things can prevent the GC from running. FailFast fails fast; it doesn't stop to take out the trash before the building is demolished. But you asked specifically about exceptions.
An uncaught exception produces implementation-defined behaviour, so it is implementation-defined whether finally blocks run, whether garbage collection runs, and whether the finalizer queue objects are finalized when there is an uncaught exception. An implementation of the CLR is permitted to do anything when that happens, and "anything" includes both "run the GC" and "do not run the GC". And in fact implementations of the CLR have changed their behaviour over time; in v1.x of the CLR an uncaught exception on the finalizer thread was swallowed and finalizers kept on running, whereas since v2.0 an uncaught exception on the finalizer thread takes out the process.
There are four questions of interest:
Can something cause the program to die entirely, without the garbage collector getting a chance to run?
Can something prevent the garbage collector from running without causing the system to die entirely?
Can something prevent objects' finalizers from running without causing the system to die entirely?
Can an exception make an object uncollectable for an arbitrary period of time?
With regard to the first one, the answer is "definitely". There are so many ways that could potentially happen, that there's no need to list them here.
With regard to the second question, the answer is "generally no", since failure of the garbage collector would cripple a program; there may be some cases, however, in which portions of a program which do not use GC-managed memory may be able to keep running even though the portions that use managed objects could be blocked indefinitely.
With regard to the third question, it used to be the case in .NET that an exception in a finalizer could interfere with the action of other finalizers without killing the entire application; that behavior changed in .NET 2.0, so that uncaught exceptions thrown from finalizers will usually kill the whole program. It is possible, however, that an exception which is thrown and caught within a poorly-written finalizer might result in its failing to clean up everything it was supposed to, leading to question #4.
With regard to the fourth question, it is quite common for objects to establish long-lived (possibly static) references to themselves when they are created, and for them to destroy such references as part of clean-up code. If an exception prevents that clean-up code from running as expected, it may cause the objects to become uncollectable even if they are no longer useful.
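The fourth case can be sketched as follows. TrackedResource and LeakDemo are hypothetical names: the object roots itself in a static list on construction and removes itself in Dispose, so an exception that skips Dispose leaves it reachable indefinitely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type that registers itself in a long-lived static list.
public sealed class TrackedResource : IDisposable
{
    private static readonly List<TrackedResource> Live = new List<TrackedResource>();

    public static int LiveCount
    {
        get { lock (Live) { return Live.Count; } }
    }

    public TrackedResource()
    {
        lock (Live) { Live.Add(this); }   // the static list now roots this object
    }

    public void Dispose()
    {
        lock (Live) { Live.Remove(this); }
    }
}

public static class LeakDemo
{
    private static void Fail()
    {
        throw new InvalidOperationException("simulated failure");
    }

    // The exception skips Dispose, so the object stays in the static list.
    public static void CleanupSkipped()
    {
        var resource = new TrackedResource();
        Fail();
        resource.Dispose();   // never reached
    }

    // The using statement runs Dispose even though Fail throws.
    public static void CleanupGuaranteed()
    {
        using (var resource = new TrackedResource())
        {
            Fail();
        }
    }
}
```

After CleanupSkipped throws, the instance is still referenced from the static list, so the garbage collector cannot reclaim it even though no other code can use it.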
Yes. In Java there used to be a situation where the program could stop without the GC running one last time. In most cases this is fine, since all the memory is reclaimed when the program's heap is destroyed, but objects may not have their finalisers run, which may or may not be a problem for you depending on what those finalisers would do.
I doubt you'll be able to determine the GC failure, as the program will be as dead as a parrot, in a non-clean manner, so you probably won't even get a stacktrace. You might be able to post-mortem debug it (if you've turned on the right dbg settings, .NET is sh*te when it comes to working nicely with the excellent Windows debugging tools).
There are certain edge cases where a finally block will not execute - calling FailFast is one case, and see the question here for others.
Given this, I would imagine there are cases (especially with using statements / IDisposable objects) where the resource cleanup or garbage collection that should occur in a finally block is not executed.
More explicitly, something like this:
try
{
    // new up an expensive object, maybe one that uses native resources
    Environment.FailFast(string.Empty);
}
finally
{
    Console.WriteLine("never executed");
}

Getting ReportAvOnComRelease Exception when using 3rd party COM

I am a new C# programmer and have created an application which uses a 3rd party COM object to track telephone call recordings from a call recording server. The creator of the COM software is also the vendor who makes the call recording software, so you would think it should work. I have been on many phone calls and code reviews with their staff and they have come up with very little to help.
The application responds to events from the COM object like OnCallStart and OnCallEnd, AgentLogon, AgentLogoff, ServerDown, etc. I do nothing more than monitor what the events return and write it to a file. The application compiles without a problem and runs for a few minutes and then it gives me the following error (I had to open up the Exception in the Debug>Exceptions menu to finally see it):
ReportAvOnComRelease was detected
Message: An exception was caught but handled while releasing a COM interface pointer through Marshal.Release or Marshal.ReleaseComObject or implicitly after the corresponding RuntimeCallableWrapper was garbage collected. This is the result of a user refcount error or other problem with a COM object's Release. Make sure refcounts are managed properly. The COM interface pointer's original vtable pointer was 0x45ecbac. While these types of exceptions are caught by the CLR, they can still lead to corruption and data loss so if possible the issue causing the exception should be addressed.
It gives me no more than that. No vtable details, refcounts or anything else. I coded a GC.Collect() and let the app run for a minute and then fired the GC.Collect and got the error. I can do that with some consistency. I have read article after article about this error and the need to Marshal correctly, but I am not the one marshaling. VS creates a RCW for the COM object and I use that so I have no control there, or do I? None of the articles gave me any code examples or anything other than theoretical chit chat.
Is there a better way to do this? How can I find exactly what is causing the error? There seems to be no way to isolate this thing. I found one article by someone from Microsoft who called this the "Silent Assassin", but he gave no solutions and essentially admitted that MS didn't have any either.
I am at my wits' end. Any help is appreciated.
Well, it is a serious defect in the COM server you are using. Yes, it is indirectly triggered by the CLR: the last reference count of a COM object is released when the finalizer thread runs the RCW finalizer. Marshal.ReleaseComObject counts the reference count down to 0, and the COM server's IUnknown::Release() implementation will clean up the object.
That's always a vulnerable time for a COM server. If it corrupted the heap earlier, a common time for this to trigger a CPU hardware fault (AV = Access Violation) is when it releases memory. Microsoft put a catcher for this hardware exception in place to help diagnose the problem. Without it, you'd have very little chance to figure out what happened because the finalizer runs at an unpredictable time without any of your own code actively running.
The fault is quite serious: you're left with a corrupted heap that's only partly cleaned up. If you keep going, you'll typically just get more AVs and/or leak memory. The worst possible outcome, quite likely by the way, is that the process doesn't die afterwards but just starts generating bad data or misbehaving unpredictably, causing you to think that it is your code that is buggy.
There's only one party that can fix this problem, the supplier of this COM server. Carefully specify the machine you are running on, especially the operating system version is important, and give them a small piece of code (source included) that reproduces the exception. Keeping it small and highly visible is important or they'll claim it was your code that corrupted the heap. They are likely to do so anyway, heap corruption is very difficult to debug. If you cannot get them responsive, you'd be wise to shop for another vendor.
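When building the small repro for the vendor, it may help to release the wrapper at a point you choose instead of waiting for the finalizer thread. This is only a sketch: ComCleanup and ReleaseNow are hypothetical names, and the approach does not fix the server's defect. It just makes the moment of release predictable. Marshal.FinalReleaseComObject is only supported on Windows.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper for producing a deterministic repro.
public static class ComCleanup
{
    // Returns true if a COM wrapper was released, false otherwise.
    public static bool ReleaseNow(object candidate)
    {
        if (candidate == null || !Marshal.IsComObject(candidate))
        {
            return false;   // not an RCW, nothing to release
        }

        // Drops the wrapper's reference count to zero right here, so the
        // server's Release runs now rather than on the finalizer thread.
        Marshal.FinalReleaseComObject(candidate);
        return true;
    }
}
```

Calling this right after the last event you care about means any problem in the server's Release shows up at a known line of a short program, which is easier to hand to the supplier than a crash at an arbitrary garbage collection.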

When is it OK to catch an OutOfMemoryException and how to handle it?

Yesterday I took part in a discussion on SO devoted to OutOfMemoryException and the pros and cons of handling it (C# try {} catch {}).
My pros for handling it were:
The fact that OutOfMemoryException was thrown doesn't generally mean that the state of a program was corrupted;
According to the documentation, "the following Microsoft intermediate (MSIL) instructions throw OutOfMemoryException: box, newarr, newobj", which (usually) just means that the CLR attempted to find a block of memory of a given size and was unable to do so; it does not mean that not a single byte is left at our disposal;
But not everyone agreed with that; some speculated about unknown program state after this exception and an inability to do anything useful, since that would require even more memory.
Therefore my question is: what are the serious reasons not to handle OutOfMemoryException and immediately give up when it occurs?
Edited: Do you think that OOME is as fatal as ExecutionEngineException?
IMO, since you can't predict what you can/can't do after an OOM (so you can't reliably process the error), or what else did/didn't happen when unrolling the stack to where you are (so the BCL hasn't reliably processed the error), your app must now be assumed to be in a corrupt state. If you "fix" your code by handling this exception you are burying your head in the sand.
I could be wrong here, but to me this message says BIG TROUBLE. The correct fix is to figure out why you have chomped through memory, and address that (for example, have you got a leak? could you switch to a streaming API?). Even switching to x64 isn't a magic bullet here; arrays (and hence lists) are still size limited; and the increased reference size means you can fit numerically fewer references in the 2GB object cap.
If you need to chance processing some data, and are happy for it to fail: launch a second process (an AppDomain isn't good enough). If it blows up, tear down the process. Problem solved, and your original process/AppDomain is safe.
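A minimal sketch of that approach, assuming the risky work has been packaged as a separate executable. IsolatedRunner and its parameters are hypothetical names; the worker program itself is not shown.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

// Hypothetical helper that runs risky work in a separate process.
public static class IsolatedRunner
{
    // Returns the child's exit code, or null if it could not be started
    // or had to be killed after the timeout.
    public static int? Run(string fileName, string arguments, int timeoutMs)
    {
        var info = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        try
        {
            using (var child = Process.Start(info))
            {
                if (child == null) return null;

                if (!child.WaitForExit(timeoutMs))
                {
                    try { child.Kill(); }
                    catch (InvalidOperationException) { /* exited in the meantime */ }
                    child.WaitForExit();
                    return null;
                }

                return child.ExitCode;
            }
        }
        catch (Win32Exception)
        {
            // The executable is missing or could not be started.
            return null;
        }
    }
}
```

If the child runs out of memory or corrupts its own state, only the child dies; the parent sees a non-zero exit code or null and carries on.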
We all write different applications. In a WinForms or ASP.Net app I would probably just log the exception, notify the user, try to save state, and shutdown/restart. But as Igor mentioned in the comments this could very well be from building some form of image editing application and the process of loading the 100th 20MB RAW image could push the app over the edge. Do you really want the user to lose all of their work when the alternative is as simple as saying "Sorry, unable to load more images at this time"?
Another common instance that it could be useful to catch out of memory exceptions is in back end batch processing. You could have a standard model of loading multi-mega-byte files into memory for processing, but then one day out of the blue a multi-giga-byte file is loaded. When the out-of-memory occurs you could log the message to a user notification queue and then move on to the next file.
Yes it is possible that something else could blow at the same time, but those too would be logged and notified if possible. If finally the GC is unable to process any more memory the application is going to go down hard anyway. (The GC runs in an unprotected thread.)
Don't forget we all develop different types of applications. And unless you are on older, constrained machines you will probably never get an OutOfMemoryException for typical business apps... but then again not all of us are business tool developers.
To your edit...
Out-of-memory may be caused by unmanaged memory fragmentation and pinning. It can also be caused by large allocation requests. If we were to put up a white flag and draw a line in the sand over such simple issues, nothing would ever get done in large data processing projects. Now comparing that to a fatal Engine exception, well there is nothing you can do at the point the runtime falls over dead under your code. Hopefully you are able to log (but probably not) why your code fell on its face so you can prevent it in the future. But, more importantly, hopefully your code is written in a manner that could allow for safe recovery of as much data as you can. Maybe even recover the last known good state in your application and possibly skip the offending corrupt data and allow it to be manually processed and recovered.
Yet at the same time it is just as possible to have data corruption caused by SQL injection, out-of-sync versions of software, pointer manipulation, buffer overruns, and many other problems. Avoiding an issue just because you think you may not recover from it is a great way to give users error messages as constructive as Please contact your system administrator.
Some commenters have noted that there are situations where an OOM could be the immediate result of attempting to allocate a large number of bytes (graphics application, allocating a large array, etc.). Note that for that purpose you could use the MemoryFailPoint class, which raises an InsufficientMemoryException (itself derived from OutOfMemoryException). That can be caught safely, as it is raised before the actual attempt to allocate the memory has been made. However, this can only really reduce the likelihood of an OOM, never fully prevent it.
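A minimal sketch of the MemoryFailPoint pattern. GuardedWork and TryRun are hypothetical names, and the megabyte estimate is something the caller has to supply.

```csharp
using System;
using System.Runtime;

// Hypothetical wrapper around MemoryFailPoint.
public static class GuardedWork
{
    // Runs 'work' only if the runtime thinks roughly 'expectedMegabytes'
    // of memory is available. Returns false if the gate refused.
    public static bool TryRun(int expectedMegabytes, Action work)
    {
        try
        {
            // The constructor throws InsufficientMemoryException before
            // any of the real allocations have been attempted.
            using (new MemoryFailPoint(expectedMegabytes))
            {
                work();
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            return false;
        }
    }
}
```

An OutOfMemoryException thrown by the work itself is deliberately not caught here; the gate only covers the check made up front, which matches the caveat that it reduces the likelihood of an OOM without preventing it.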
It all depends on the situation.
Quite a few years ago now I was working on a real-time 3D rendering engine. At the time we loaded all the geometry for the model into memory on start up, but only loaded the texture images when we needed to display them. This meant when the day came our customers were loading huge (2GB) models we were able to cope. The geometry occupied less than 2GB, but when all the textures were added it would be > 2GB. By trapping the out of memory error that was raised when we tried to load the texture we were able to carry on displaying the model, but just as the plain geometry.
We still had a problem if the geometry was > 2GB, but that was a different story.
Obviously, if you get an out of memory error with something fundamental to your application then you've got no choice but to shut down - but do that as gracefully as you can.
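The texture case can be sketched as a catch around one specific, optional allocation. TextureLoader and TryAllocate are hypothetical names, and a real loader would read image data rather than return an empty buffer.

```csharp
using System;

// Hypothetical loader for optional, large buffers.
public static class TextureLoader
{
    // Returns the buffer, or null if this one allocation could not be
    // satisfied. Nothing else has been touched when the allocation fails.
    public static byte[] TryAllocate(int byteCount)
    {
        try
        {
            return new byte[byteCount];
        }
        catch (OutOfMemoryException)
        {
            // The caller falls back to rendering without the texture.
            return null;
        }
    }
}
```

The try block contains nothing but the allocation, so a failure cannot leave partially updated state behind, which is what makes this narrow use defensible.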
I suggest reading Christopher Brumme's comment in "Framework Design Guidelines", p. 238 (7.3.7 OutOfMemoryException):
At one end of the spectrum, an OutOfMemoryException could be the result of a failure to obtain 12 bytes for implicitly autoboxing, or a failure to JIT some code that is required for critical backout. These cases are catastrophic failures and ideally would result in termination of the process. At the other end of the spectrum, an OutOfMemoryException could be the result of a thread asking for a 1 GB byte array. The fact that we failed this allocation attempt has no impact on the consistency and viability of the rest of the process.
The sad fact is that CLR 2.0 cannot distinguish among any points on this spectrum. In most managed processes, all OutOfMemoryExceptions are considered equivalent and they all result in a managed exception being propagated up the thread. However, you cannot depend on your backout code being executed, because we might fail to JIT some of your backout methods, or we might fail to execute static constructors required for backout.
Also, keep in mind that all other exceptions can get folded into an OutOfMemoryException if there isn't enough memory to instantiate those other exception objects. Also, we will give you a unique OutOfMemoryException with its own stack trace if we can. But if we are tight enough on memory, you will share an uninteresting global instance with everyone else in the process.
My best recommendation is that you treat OutOfMemoryException like any other application exception. You make your best attempts to handle it and remain consistent. In the future, I hope the CLR can do a better job of distinguishing catastrophic OOM from the 1 GB byte array case. If so, we might provoke termination of the process for the catastrophic cases, leaving the application to deal with the less risky ones. By treating all OOM cases as the less risky ones, you are preparing for that day.
Marc Gravell has already provided an excellent answer; seeing as how I partly "inspired" this question, I would like to add one thing:
One of the core principles of exception handling is never to throw an exception inside an exception handler. (Note - re-throwing a domain-specific and/or wrapped exception is OK; I am talking about an unexpected exception here.)
There are all sorts of reasons why you need to prevent this from happening:
At best, you mask the original exception; it becomes impossible to know for sure where the program originally failed.
In some cases, the runtime may simply be unable to handle an unhandled exception in an exception handler (say that 5 times fast). In ASP.NET, for example, installing an exception handler at certain stages of the pipeline and failing in that handler will simply kill the request - or crash the worker process, I forget which.
In other cases, you may open yourself up to the possibility of an infinite loop in the exception handler. This may sound like a silly thing to do, but I have seen cases where somebody tries to handle an exception by logging it, and when the logging fails... they try to log the failure. Most of us probably wouldn't deliberately write code like this, but depending on how you structure your program's exception handling, you can end up doing it by accident.
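One way to avoid the log-the-logging-failure loop is a reporting helper that can never throw and never retries. This is a sketch with hypothetical names (SafeReporter, TryReport); the log delegate stands in for whatever logging the application uses.

```csharp
using System;

// Hypothetical helper for use inside exception handlers.
public static class SafeReporter
{
    // Returns true if the report was written, false if reporting failed.
    public static bool TryReport(Exception original, Action<string> log)
    {
        try
        {
            log(original.ToString());
            return true;
        }
        catch (Exception)
        {
            // Deliberately no retry and no nested logging: a failure here
            // must not mask the original exception or start a loop.
            return false;
        }
    }
}
```

The handler that calls TryReport can then rethrow the original exception unchanged, whether or not the report succeeded.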
So what does this have to do with OutOfMemoryException specifically?
An OutOfMemoryException doesn't tell you anything about why the memory allocation failed. You might assume that it was because you tried to allocate a huge buffer, but maybe it wasn't. Maybe some other rogue process on the system has literally consumed all of the available address space and you don't have a single byte left. Maybe some other thread in your own program went awry and went into an infinite loop, allocating new memory on each iteration, and that thread has long since failed by the time the OutOfMemoryException ends up on your current stack frame. The point is that you don't actually know just how bad the memory situation is, even if you think you do.
So start thinking about this situation now. Some operation just failed at an unspecified point deep in the bowels of the .NET framework and propagated up an OutOfMemoryException. What meaningful work can you perform in your exception handler that does not involve allocating more memory? Write to a log file? That takes memory. Display an error message? That takes even more memory. Send an alert e-mail? Don't even think about it.
If you try to do these things - and fail - then you'll end up with non-deterministic behaviour. You'll possibly mask the out-of-memory error and get mysterious bug reports with mysterious error messages bubbling up from all kinds of low-level components you wrote that aren't supposed to be able to fail. Fundamentally, you've violated your own program's invariants, and this is going to be a nightmare to debug if your program ever does end up running under low-memory conditions.
One of the arguments presented to me before was that you might catch an OutOfMemoryException and then switch to lower-memory code, like a smaller buffer or a streaming model. However, this "Expection Handling" is a well-known anti-pattern. If you know you're about to chew up a huge amount of memory and aren't sure whether or not the system can handle it, then check the available memory, or better yet, just refactor your code so that it doesn't need so much memory all at once. Don't rely on the OutOfMemoryException to do it for you, because - who knows - maybe the allocation will just barely succeed and trigger a bunch of out-of-memory errors immediately after your exception handler (possibly in some completely different component).
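The refactoring suggested here can be sketched as processing a stream through a small fixed-size buffer instead of loading everything at once. StreamingSum and SumBytes are hypothetical names, and the byte sum is only a stand-in for the real per-chunk work.

```csharp
using System;
using System.IO;

// Hypothetical example of bounded-memory processing.
public static class StreamingSum
{
    // Memory use is bounded by bufferSize regardless of the input length.
    public static long SumBytes(Stream input, int bufferSize)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        int read;

        // Read returns 0 at end of stream.
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                total += buffer[i];
            }
        }

        return total;
    }
}
```

With this shape a multi-gigabyte input needs the same memory as a small one, so there is no large allocation whose failure would have to be handled.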
So my simple answer to this question is: Never.
My weasel-answer to this question is: It's OK in a global exception handler, if you're really really careful. Not in a try-catch block.
One practical reason for catching this exception is to attempt a graceful shutdown, with a friendly error message instead of an exception trace.
The problem is larger than .NET. Almost any application written from the fifties to now has big problems if no memory is available.
With virtual address spaces the problem has been sort-of salvaged but NOT solved because even address spaces of 2GB or 4GB may become too small. There are no commonly available patterns to handle out-of-memory. There could be an out-of-memory warning method, a panic method etc. that is guaranteed to still have memory available.
If you receive an OutOfMemoryException from .NET, almost anything may be the case: 2 MB still available, just 100 bytes, whatever. I wouldn't want to catch this exception (except to shut down without a failure dialog). We need better concepts. Then you may get a MemoryLowException where you CAN react to all sorts of situations.
The problem is that - in contrast to other Exceptions - you usually have a low memory situation when the exception occurs (except when the memory to be allocated was huge, but you don't really know when you catch the exception).
Therefore, you must be very careful not to allocate memory when handling this exception. And while this sounds easy it's not, actually it's very hard to avoid any memory allocation and do something useful. Therefore, catching it is usually not a good idea IMHO.
Write code, don't hijack the JVM. When VM is humbly telling you that a memory allocation request failed your best bet is to discard the state of application to avert corrupting application data. Even if you decide to catch OOM you should only try to gather diagnostic information like dumping log, stacktrace etc. Please do not try to initiate a backout procedure as you are not sure whether it will get a chance to execute or not.
Real-world analogy: you are traveling in a plane and all engines fail. What would you do after catching an AllEngineFailureException? Best bet is to grab the mask and prepare for a crash.
When in OOM, dump!!
