There is a similar question from a decade ago, but there was no good answer - hopefully things have changed since then.
I have a fairly multithreaded WinForms app based on .NET 4.7.2. I am looking at it with the Process Explorer Threads view and it has a lot of clr.dll!LogHelp_TerminateOnAssert+0x6835 type calls. I've set up the symbols path but it didn't really clear anything up for me.
I took a dump of the application and ran it through DebugDiag and WinDbg and didn't see anything suspicious that stood out.
So my questions:
Should I be concerned with the large number of LogHelp_TerminateOnAssert calls?
Is the application leaking memory?
Does it have an excessive number of exceptions that don't filter down when I am running the app in Visual Studio?
The only entry from my code here is !get_FrameReceived and the stack for that thread is as follows:
The stack for the thread with the most cycles is like this:
Large offsets
clr.dll!LogHelp_TerminateOnAssert+0x6835
means that the actual execution in that method is 0x6835 = 26677 bytes away from its beginning. It's unlikely that a method is that big. (As @blabb points out, it's a 1-byte method.)
Usually you see that when you have not set up the symbols correctly (like in the linked original question), but you have that fixed.
Chances are that Microsoft has only released the public symbols of clr.dll and not the private ones. In that case, you'll only see the last known public method before that address.
Start address
Please note that the column is named "Start address". Process Explorer will show the first entry on the stack.
So this is where everything starts. You seem to be concerned that this is where everything ends.
Note: some known internal methods like RtlUserThreadStart and BaseThreadInitThunk will be skipped when displaying the start address. Otherwise they'd probably all look the same.
What the thread is really doing is on the top of the list, i.e. ZwRemoveIoCompletion, so it seems to do some IO operation.
Your questions
Should I be concerned with the large number of LogHelp_TerminateOnAssert calls?
No. These are just the starting point for something good. The GetQueuedCompletionStatus() looks like there's some IO going on and .NET uses IO Completion Ports (IOCP) for you.
Is the application leaking memory?
You can't tell that from a look at call stacks. You tell that by looking at the memory usage over time.
If you have too much network IO going on and the network can't keep up with it, .NET may have more and more items in the queue, so it may look like a memory leak.
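To watch memory over time from inside the app, something like the following could be logged on a timer. This is only a minimal sketch: MemorySampler and its output format are made-up names for illustration, and a real leak hunt would still compare dumps or use a profiler.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: one line per sample, so the trend can be graphed later.
public static class MemorySampler
{
    public static string Sample()
    {
        using (Process self = Process.GetCurrentProcess())
        {
            // Managed heap size as seen by the GC (no forced collection).
            long managed = GC.GetTotalMemory(false);
            // Private bytes of the whole process, which also covers native allocations.
            long privateBytes = self.PrivateMemorySize64;
            return string.Format("{0:o} managed={1} private={2}",
                DateTime.UtcNow, managed, privateBytes);
        }
    }
}
```

If the private bytes keep growing while the managed number stays flat, the growth is probably on the native side; if both grow, queued work items (as described above) are one candidate.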
Does it have an excessive number of exceptions that don't filter down when I am running the app in Visual Studio?
You would also not tell that from the call stack. You would attach a debugger (e.g. WinDbg) and check for exceptions (like sxe clr), if you don't trust Visual Studio.
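If attaching a debugger is not convenient, the exceptions can also be counted from inside the process. A rough sketch, assuming .NET 4.0 or later (where AppDomain.FirstChanceException exists); FirstChanceCounter is a hypothetical name:

```csharp
using System;
using System.Threading;

// Hypothetical helper: counts every managed exception at the moment it is thrown,
// including the ones that are later caught and swallowed.
public static class FirstChanceCounter
{
    private static int _count;

    public static int Count
    {
        // CompareExchange with identical values is used here as an atomic read.
        get { return Interlocked.CompareExchange(ref _count, 0, 0); }
    }

    public static void Install()
    {
        // Keep the handler trivial: an exception thrown in here would re-enter the handler.
        AppDomain.CurrentDomain.FirstChanceException +=
            (sender, e) => Interlocked.Increment(ref _count);
    }
}
```

Call FirstChanceCounter.Install() once at startup and log Count periodically; a number that climbs quickly suggests exceptions are being used for control flow somewhere.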
In a release build all of these asserts are compiled down to a simple ret, something similar to:
#ifdef DEBUG
    // function body here
#else
    // ret
#endif
So symbols with such a large offset are bogus, and you may need to load the actual symbols for that address to get a sensible call stack.
You can see that the size of the function in clr.dll 4.0.30319 is just 1 byte:
0:000> x /v /t clr!LogHelp_TerminateOnAssert
pub func 100115a0 0 <NoType> clr!LogHelp_TerminateOnAssert (<no parameter info>)
0:000> .fnent clr!LogHelp_TerminateOnAssert
Debugger function entry 01bad5e0 for:
(100115a0) clr!RtlUnwindCallback | (100115a1) clr!memset
Exact matches:
clr!RtlUnwindCallback (void)
clr!_TlgDefineProvider_annotation__Tlgg_hClrProviderProv (void)
OffStart: 000115a0
ProcSize: 0x1
Prologue: 0x0
Params: 0n0 (0x0 bytes)
Locals: 0n0 (0x0 bytes)
Registers: 0n0
0:000> u clr!LogHelp_TerminateOnAssert l1
clr!RtlUnwindCallback:
100115a0 c3 ret
Related
I have a service that is reporting a large amount of logical threads. From PerfMon:
.NET CLR LocksAndThreads -> # of current logical threads: 663
.NET CLR LocksAndThreads -> # of current physical threads: 659
Process -> Thread Count: 15
This is too high, so I captured a memory dump (via Sysinternals procdump.exe) and opened it in Visual Studio (Debug with Mixed). Once everything was loaded, I looked in the Threads window, and it only shows the 15 OS threads, not the .NET physical or .NET logical threads. The service itself is a Windows service that hosts 4 WCF services (System.ServiceModel.ServiceHost).
How do I find out what these threads are, so that I can fix the code and get rid of them?
How do I get the logical threads to be recognized and displayed by Visual Studio?
Is it a problem with Visual Studio, or a problem with the dump itself?
First, you need to obtain a memory dump. There are a variety of methods to do this. The easiest one that I've found is procdump.exe, part of SysInternals, available with documentation here.
Next, you need to download and install WinDbg and get SOS working (SOS is the module that lets you view .NET managed processes). More information on setting that up is available here. If you got your dump from a server (as was my case), then the .NET versions might be slightly off, and SOS won't be able to work correctly with the dump file. If that happens, you'll need to copy a few relevant files from the source .NET version to your WinDbg installation as per this comment.
Once everything is set up and you have the dump loaded, run the command:
!threads
This should give you a list of .NET logical threads. Also of note is that before the list of threads, it will give you a summary. Of importance for this question, DeadThread was extremely high (600+). This indicates that the threads have completed, but the memory they are holding (the stack) can't be released.
In my specific case, I found that the stuck threads were threads from the WCF thread pool that couldn't be garbage collected because other threads were holding references to them. I changed the other threads to null out the reference to the parent thread when they no longer needed it, and that allowed it to be collected correctly.
I have a computationally-expensive multi-threaded C# app that seems to crash consistently after 30-90 minutes of running. The error it gives is
The runtime has encountered a fatal error. The address of the error was at 0xec37ebae, on thread 0xbcc. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.
(0xc0000005 is the error-code for Access Violation)
My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint. In fact, the line of code that the debugger says caused the crash is
overallLength += distanceTravelled;
Where both values are of type double
Given all this, I believe the crash must be due to a bug in the compiler or CLR or JIT. I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin. I've never had to view the CIL-binary, or the compiled JIT output, or the native stacktrace (there is no managed stacktrace at the time of the crash), so I'm not sure how. I can't even figure out how to view the state of all the variables at the time of the crash (VS unfortunately won't tell me like it does after managed-exceptions, and outputting them to console/a file would slow down the app 1000-fold, which is obviously not an option).
So, how do I go about debugging this?
[Edit] Compiled under VS 2010 SP1, running latest version of .Net 4.0 Client Profile. Apparently it's ".Net 4.0C/.Net 4.0E, .Net CLR 1.1.4322"
I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin.
"Smaller reproduction" definitely sounds like a great idea here... even if "smaller" won't mean "quicker to reproduce".
Before you even start, try to reproduce the error on another machine. If you can't reproduce it on another machine, that suggests a whole different set of tests to do - hardware, installation etc.
Also, check you're on the latest version of everything. It would be annoying to spend days debugging this (which is likely, I'm afraid) and then end up with a response of "Yes, we know about this - it was a bug in .NET 4 which was fixed in .NET 4.5" for example. If you can reproduce it on a variety of framework versions, that would be even better :)
Next, cut out everything you can in the program:
Does it have a user interface at all? If possible, remove that.
Does it use a database? See if you can remove all database access: definitely any output which isn't used later, and ideally input too. If you can hard code the input within the app, that would be ideal - but if not, files are simpler for reproductions than database access.
Is it data-sensitive? Again, without knowing much about the app it's hard to know whether this is useful, but assuming it's processing a lot of data, can you use a binary search to find a relatively small amount of data which causes the problem?
Does it have to be multi-threaded? If you can remove all the threading, obviously that may well then take much longer to reproduce the problem - but does it still happen at all?
Try removing bits of business logic: if your app is componentized appropriately, you can probably fake out whole significant components by first creating a stub implementation, and then simply removing the calls.
All of this will gradually reduce the size of the app until it's more manageable. At each step, you'll need to run the app again until it either crashes or you're convinced it won't crash. If you have a lot of machines available to you, that should help...
tl;dr Make sure you're compiling to .Net 4.5
This sounds suspiciously like the same error found here. From the MSDN page:
This bug can be encountered when the Garbage Collector is freeing and compacting memory. The error can happen when the Concurrent Garbage Collection is enabled and a certain combination of foreground Garbage Collection and background Garbage Collection occurs. When this situation happens you will see the same call stack over and over. On the heap you will see one free object and before it ends you will see another free object corrupting the heap.
The fix is to compile to .Net 4.5. If for some reason you can't do this, you can also disable concurrent garbage collection by disabling gcConcurrent in the app.config file:
<configuration>
  <runtime>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
Or just compile to x86.
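To check at runtime which GC settings actually took effect, a small report like this may help. It is a sketch under one assumption worth verifying on your framework version: with gcConcurrent disabled, GCSettings.LatencyMode is expected to read Batch, since that is the mode documented as disabling concurrent collection. GcModeReport is a hypothetical name.

```csharp
using System;
using System.Runtime;

// Hypothetical helper: reports the GC flavor and latency mode of the running process.
public static class GcModeReport
{
    public static string Describe()
    {
        string flavor = GCSettings.IsServerGC ? "server" : "workstation";
        // Batch is expected when concurrent GC is off; Interactive when it is on.
        return flavor + " GC, latency mode " + GCSettings.LatencyMode;
    }
}
```

Logging GcModeReport.Describe() once at startup makes it easy to confirm that the app.config change was picked up by the process that crashes.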
WinDbg is your friend:
http://blogs.msdn.com/b/tess/archive/2006/02/09/net-crash-managed-heap-corruption-calling-unmanaged-code.aspx
http://www.codeproject.com/Articles/23589/Get-Started-Debugging-Memory-Related-Issues-in-Net
http://www.codeproject.com/Articles/22245/Quick-start-to-using-WinDbg
Download Debug Diagnostic Tool v1.2
Run program
Add Rule "Crash"
Select "Specific Process"
On the Advanced Configuration page, set the exception if you know which one it fails on, or just leave this page as is
Set userdump location
Now wait for the process to crash; DebugDiag creates a log file and a dump. Then open the Advanced Analysis tab, select Crash/Hang Analyzers in the top list and the dump file in the lower list, and hit Start Analysis. This will generate an HTML report for you. Hopefully you will find useful info in that report. If you have problems with the analysis, upload the HTML report somewhere and post the URL here so we can look at it.
My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint
You may think this, but threading and synchronization via semaphores, mutexes, and any other handles are all native. .NET is a layer over the operating system; it does not implement multithreading in pure CLR code, because the OS already does it.
Most likely this is a thread synchronization error. Probably multiple threads are trying to access a shared resource, such as a file, that is outside the CLR boundary.
You may think you aren't accessing COM, but when you call certain APIs, such as getting the desktop folder path, the call goes through the shell COM API.
You have the following two options:
Publish your code so that we can review the bottleneck
Redesign your app using the .NET parallel threading framework, which includes a variety of algorithms for CPU-intensive operations.
Most likely the program fails after a certain period of time because collections grow and operations no longer finish before another thread interferes. Take the producer-consumer problem, for example: you will not notice any problem until the producer becomes slower or fails to finish its operation before the consumer kicks in.
A bug in the CLR is rare, because the CLR is very stable. But poorly written code may cause an error that appears to be a bug in the CLR. The CLR cannot and will never detect whether the bug is in your code or in the CLR itself.
Did you run a memory test on your machine? The one time I had comparable symptoms, one of my DIMMs turned out to be faulty (a very good memory tester is included in Win7; http://www.tomstricks.com/how-to-test-your-ram-or-memory-with-windows-memory-diagnostic-tool-in-windows-7/)
It might also be a heating/throttling issue if your CPU gets too hot after this period of time, although that would happen sooner imho.
There should be a dump file that you can analyze. If you have never done this, find someone who has, or send it to Microsoft.
I suggest you open a support case via http://support.microsoft.com immediately, as the support guys can show you how to collect the necessary information.
Generally speaking, like @paulsm4 and @psulek said, you can use WinDbg or DebugDiag to capture crash dumps of the process, and all the necessary information is embedded in them. However, if this is the very first time you use those tools, you might be puzzled. The Microsoft support team can provide you with step-by-step guidance on them, or they can even set up a Live Meeting session with you to capture the data, as the program crashes so often.
Once you are familiar with the tools, in the future you can perform similar troubleshooting more easily:
http://blogs.msdn.com/b/lexli/archive/2009/08/23/when-the-application-program-crashes-on-windows.aspx
BTW, it is too early to say "I've found a bug". Though you cannot obviously find in your program a dependency on native code, it might still have a dependency on native code. We should not draw a conclusion before debugging further into the issue.
I've run into a situation where, according to a minidump, certain files are causing a stack overflow in a recursive-descent parser. Unfortunately I can't get my hands on an example of a file that does this in order to reproduce the issue (the client has confidentiality concerns), which leaves me a bit hamstrung on diagnosing the real problem for the moment.
Clearly the parser needs some attention, but right now my top priority is to just keep the program running. As a stopgap measure, what can I do to keep this from bringing down the whole program?
My first choice would be to find some way to anticipate that I'm running out of room on the stack so that I can gracefully abort the parser before the overflow happens. Failing to parse the file is an acceptable option. The second choice would be to let it happen, catch the error and log it, then continue with the rest of the data.
The parsing is happening in a Parallel.ForEach() loop. I'm willing to swap that out for some other approach if that will help.
EDIT: What would be really killer is if I could just get the size of the current thread's stack, and the position of the stack pointer. Is this possible?
EDIT 2: I finally managed to wring a sample file out of someone and trap the error in a debugger. It turns out it's not code that belongs to us at all - the exception's happening somewhere in HtmlAgilityPack. So it looks like I'm going to have to try and find a completely different tack.
The stack has a 1 MB limit by default on the desktop CLR, but you can increase it.
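One way to get a bigger stack is to run the parse on a dedicated thread created with an explicit maxStackSize, instead of a thread-pool thread from Parallel.ForEach. A minimal sketch; BigStackRunner is a hypothetical helper and the size you pass is a guess to tune, not a recommendation:

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a piece of work on a thread with a caller-chosen stack size.
public static class BigStackRunner
{
    public static T Run<T>(Func<T> work, int stackBytes)
    {
        T result = default(T);
        Exception error = null;

        // Thread(ThreadStart, int maxStackSize) lets the caller override the default size.
        Thread worker = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        }, stackBytes);

        worker.Start();
        worker.Join();

        // Surface ordinary failures on the calling thread.
        if (error != null)
            throw new InvalidOperationException("Worker thread failed", error);
        return result;
    }
}
```

Usage would look like BigStackRunner.Run(() => ParseFile(path), 16 * 1024 * 1024), where ParseFile stands in for your own parse call. This only postpones the overflow for deeper inputs; it does not make a StackOverflowException catchable.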
You can use a continuation passing style to use heap instead of stack.
In C# 5.0, there's an async mechanism provided by the compiler that automates this process. I haven't tried this with the latest build. As mentioned by Alex, there is no support for tail-call optimization in C#, and this might be a big enough reason to adopt F# for parsing problems. Here's some material on lexing and parsing with F#. YMMV, as demonstrated in this article.
You'd also need graph cycle detection to make your program solid in the presence of bad inputs.
As a way to collect more info, you can thread an accumulator integer through the calls to track how deep your call stack is. This will not directly translate into memory consumed by said call stack, but it gives you a general idea. For example, you could throw and catch your own exception when that number is greater than some user-configurable or predefined threshold.
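// myLimit and MyOverflowException are application-defined.
public void Recursive(int acc)
{
    // Bail out with a catchable exception well before the real stack limit is hit.
    if (acc > myLimit)
        throw new MyOverflowException(acc);
    Recursive(acc + 1);
}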
and then at the call-site:
try { Recursive(0); } catch (MyOverflowException) { /* handle it*/ }
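For code you own, the framework can also do the "am I about to run out of stack" check for you: RuntimeHelpers.EnsureSufficientExecutionStack (available since .NET 4.0) throws a catchable InsufficientExecutionStackException when little stack is left. A sketch, with Descend standing in for one recursive parser rule (a made-up name); note it does not help when the recursion is inside a third-party library that doesn't make this call.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class SafeRecursion
{
    // Hypothetical recursive routine standing in for one recursive-descent rule.
    public static int Descend(int remaining)
    {
        // Throws InsufficientExecutionStackException instead of overflowing later.
        RuntimeHelpers.EnsureSufficientExecutionStack();
        return remaining == 0 ? 0 : 1 + Descend(remaining - 1);
    }

    // Returns false (and depth = -1) when the input nests too deeply to parse safely.
    public static bool TryDescend(int levels, out int depth)
    {
        try
        {
            depth = Descend(levels);
            return true;
        }
        catch (InsufficientExecutionStackException)
        {
            depth = -1;
            return false;
        }
    }
}
```

Compared with the hand-rolled counter, this needs no tuning of a depth limit, at the cost of one extra call per recursion step.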
As requested, I'll link you up to the fabulous blog by Eric Lippert on this very topic.
A thread crashing due to a StackOverflowException will bring down the whole process and there's not much you can do about it.
As a recovery measure you could instead launch the parser as a separate process and set up an IPC mechanism to communicate with the child. That way, the child process is free to die without impacting the main process.
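A bare-bones version of that idea, using only the exit code as the "IPC" channel. Everything named here is hypothetical: the worker executable, its command line, and the convention that exit code 0 means success. The stack-overflow value is the NTSTATUS code 0xC00000FD, which is what Windows commonly reports as the exit code in that case; treat that as an assumption to confirm on your system.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: parses one file in a child process so a crash stays contained.
public static class IsolatedParser
{
    // STATUS_STACK_OVERFLOW as it shows up in Process.ExitCode (a signed int).
    public const int StackOverflowExitCode = unchecked((int)0xC00000FD);

    public static string Classify(int exitCode)
    {
        if (exitCode == 0) return "ok";
        if (exitCode == StackOverflowExitCode) return "stack overflow";
        return "failed";
    }

    // workerExe is assumed to be your own console app that parses the given file.
    public static string ParseInChild(string workerExe, string inputFile)
    {
        ProcessStartInfo info = new ProcessStartInfo(workerExe, "\"" + inputFile + "\"");
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process child = Process.Start(info))
        {
            child.WaitForExit();
            return Classify(child.ExitCode);
        }
    }
}
```

The parent can log the "stack overflow" files and carry on with the rest; actual parse results would need a real channel such as redirected stdout or a temp file.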
When diagnosing a high CPU issue, the first question that comes to mind is which thread(s) are using all the CPU and what they are doing (in managed code terms). To figure this out, one needs to install something like Process Explorer to find the offending thread. Then one needs to capture a dump of the process, load it in something like WinDbg and find out what the thread(s) are doing, i.e. get the managed stack trace of each thread.
This process is somewhat time consuming. Is there a tool (free or paid), or reliable code that could be written, that could do all this in a matter of seconds (at the click of a button)? The end result I'd like to see is a list of threads ordered by CPU utilization, each with the method it is currently in and the option to drill down to see the whole stack trace. Basically the same thing you'd see in Process Explorer, except for managed code.
This would need to work for .NET 4.0.
You can build your own mini profiler, like http://samsaffron.com/archive/2009/11/11/Diagnosing+runaway+CPU+in+a+Net+production+application
Check out Diagnosing runaway CPU in a .Net production application
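The first half of that job (which OS threads burn the CPU) can be approximated with System.Diagnostics alone. A sketch, with ThreadCpu as a hypothetical name; it reports OS thread ids and accumulated CPU time only, so mapping an id to a managed stack still needs a debugger-style tool such as the one linked above.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper: lists the OS threads of a process, busiest first.
public static class ThreadCpu
{
    public static List<KeyValuePair<int, TimeSpan>> Snapshot(Process process)
    {
        var rows = new List<KeyValuePair<int, TimeSpan>>();
        foreach (ProcessThread thread in process.Threads)
        {
            try
            {
                rows.Add(new KeyValuePair<int, TimeSpan>(thread.Id, thread.TotalProcessorTime));
            }
            catch (InvalidOperationException) { /* thread exited while we were reading it */ }
            catch (Win32Exception) { /* access denied for this thread */ }
            catch (NotSupportedException) { /* counter not available on this platform */ }
        }
        // Highest accumulated CPU time first.
        return rows.OrderByDescending(row => row.Value).ToList();
    }
}
```

Taking two snapshots a second apart and diffing the times per thread id gives current utilization rather than lifetime totals.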
I think I have a curly one here... I have a WinForms application that crashes fairly regularly every hour or so when running as an x64 process. I suspect this is due to stack corruption and would like to know if anyone has seen a similar issue or has some advice for diagnosing and detecting the issue.
The program in question has no visible UI. It's just a message window that sits in the background and acts as a sort of 'middleware' between our other client programs and a server.
It dies in different ways on different machines. Sometimes it's an 'APPCRASH' dialog that reports a fault in ntdll.dll. Sometimes it's an 'APPCRASH' that reports our own dll as the culprit. Sometimes it's just a silent death. Sometimes our unhandled exception hook logs the error, sometimes it doesn't.
In the cases where Windows Error Reporting kicks in, I've examined memory dumps from several different crash scenarios and found the same managed exception in memory each time. This is the same exception I see reported as an unhandled exception in the cases where it is logged before the application dies.
I've also been lucky (?) enough to have the application crash while I was actively debugging with Visual Studio - and saw that same exception take down the program.
Now here's the kicker. This particular exception was thrown, caught and swallowed in the first few seconds of the program's life. I have verified this with additional trace logging and I have taken memory dumps of the application a couple of minutes after application startup and verified that exception is still sitting there in the heap somewhere. I've also run a memory profiler over the application and used that to verify that no other .NET object had a reference to it.
The code in question looks a bit like this (vastly simplified, but maintains the key points of flow control)
public class AClass
{
    public object FindAThing(string key)
    {
        object retVal = null;
        Collection<Place> places = GetPlaces();
        foreach (Place place in places)
        {
            try
            {
                retVal = place.FindThing(key);
                break;
            }
            catch {} // Guaranteed to only be a 'NotFound' exception
        }
        return retVal;
    }
}
public class Place
{
    public object FindThing(string key)
    {
        bool found = InternalContains(key); // <snip> some complex if/else logic
        if (code == success) // 'code' and 'success' come from the snipped logic
            return InternalFetch(key);
        throw new NotFoundException(/*UsefulInfo*/);
    }
}
The stack trace I see, both in the event log and when looking at the heap with windbg looks a bit like this.
Company.NotFoundException:
Place.FindThing()
AClass.FindAThing()
Now... to me that reeks of something like stack corruption. The exception is thrown and caught while the application is starting up. But the pointer to it survives on the stack for an hour or more, like a bullet in the brain, and then suddenly breaches a crucial artery, and the application dies in a puddle.
Extra clues:
The code within 'InternalFetch' uses some Marshal.[Alloc/Free]CoTask and pinvoke code. I have run FxCop over it looking for portability issues, and found nothing.
This particular manifestation of the issue is only affecting x64 code built in release mode (with code optimization on). The code I listed for the 'Place.FindThing' method reflects the optimized .NET code. The unoptimized code returns the found object as the last statement, not 'throw exception'.
We make some COM calls during startup before the above code is run... and in a scenario where the above problem is going to manifest, the very first COM call fails. (Exception is caught and swallowed). I have commented out that particular COM call, and it does not stop the exception sticking around on the heap.
The problem might also affect 32 bit systems, but if it does - then the problem does not manifest in the same spot. I was only sent (typical users!) a few pixels worth of a screen shot of an 'APP CRASH' dialog, but the one thing I could make out was 'StackHash_2264' in the faulting module field.
EDIT:
Breakthrough!
I have narrowed down the problem to a particular call to SetTimer.
The pInvoke looks like this:
[DllImport("user32")]
internal static extern IntPtr SetTimer(IntPtr hwnd, IntPtr nIDEvent, int uElapse, TimerProc CB);
internal delegate void TimerProc(IntPtr hWnd, uint nMsg, IntPtr nIDEvent, int dwTime);
There is a particular class that starts a timer in its constructor. Any timers set before that object is constructed work. Any timers set after that object is constructed work. Any timer set during that constructor causes the application to crash, more often than not. (I have a laptop that crashes maybe 95% of the time, but my desktop only crashes 10% of the time).
Whether the interval is set to 1 hour, or 1 second, seems to make no difference. The application dies when the timer is due - usually by throwing some previously handled exception as described above. The callback does not actually get executed. If I set the same timer on the very next line of managed code after the constructor returns - all is fine and happy.
I have had a debugger attached when the bad timer was about to fire, and it caused an access violation in 'DispatchMessage'. The timer callback was never called. I have enabled the MDAs that relate to managed callbacks being garbage collected, and it isn't triggering. I have examined the objects with sos and verified that the callback still existed in memory, and that the address it pointed to was the correct callback function.
If I run '!analyze -v' at this point, it usually (but not always) reports something along the lines of 'ERROR_SXS_CORRUPT_ACTIVATION_STACK'
Replacing the call to SetTimer with Microsoft's 'System.Windows.Forms.Timer' class also stops the crash. I've used Reflector on the class and can see that internally it still calls SetTimer - but does not register a procedure. Instead it has a native window that receives the callback. Its P/Invoke definition actually looks wrong... it uses 'ints' for the eventId, where the MSDN documentation says it should be a UIntPtr.
Our own code originally also used 'int' for nIDEvent rather than IntPtr - I changed it during the course of this investigation - but the crash continued both before and after this declaration change. So the only real difference that I can see is that we are registering a callback, and the Windows class is not.
So... at this stage I can 'fix' the problem by shuffling one particular call to SetTimer to a slightly different spot. But I am still no closer to actually understanding what is so special about starting the timer within that constructor that causes this error. And I dearly would like to understand the root cause of this issue.
Just briefly thinking about it, it sounds like an x64 interop issue (i.e., calling native functions with 32-bit assumptions from x64 managed code is fraught with danger). Does the problem go away if you force your application to compile for the x86 platform from within project properties?
You can read suggestions on forcing an x86 compile during x86/x64 development on Dotnetrocks. Richard Campbell's suggestion is that Visual Studio should default to the x86 platform and not AnyCPU.
http://www.dotnetrocks.com/default.aspx?showNum=341 (transcript).
With regard to advanced debugging, I have not had a chance to debug x64 interop code, but I hear that this book is a great resource: Advanced .NET Debugging.
Finally, one thing you might try is force Visual Studio to break when an exception is thrown.
Use something like DebugDiag for x64 or WinDbg to write a dump on Kernel32!TerminateProcess and on second-chance .NET exceptions, which should give you the actual exception context record (.ecxr) of the exception that occurred.
This should help you in identifying the call-stack for the process terminate.
IMO it could be mostly because of PInvoke calls. You could use Managed Debugging Assistants to debug these issues.
If MDA is used along with Windbg it would give out messages that would be helpful in debugging
Also I have found tools from the http://clrinterop.codeplex.com/ team are extremely handy when dealing with interop
EDIT
This should explain why it is not working in 64-bit: "Issue with callback method in SetTimer Windows API called from C# code".
This does sound like a corruption issue. I would go through all of your interop calls and ensure that all of the parameters to the DllImport'ed functions are the correct types. For example, using an int in place of an IntPtr will work in 32-bit code but can crash in 64-bit code.
I would use a site like PInvoke.net to verify all of the signatures.
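As an illustration of that kind of signature check, here is how the SetTimer declaration from the question might look with pointer-sized types where the Win32 prototype uses UINT_PTR, and with the callback delegate held in a static field so it cannot be collected while the timer is live. This is a sketch of the checks suggested here, not a confirmed root cause for the crash in the question; NativeTimer, Callback, OnTimer and Start are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeTimer
{
    // void CALLBACK TimerProc(HWND, UINT, UINT_PTR, DWORD)
    [UnmanagedFunctionPointer(CallingConvention.Winapi)]
    public delegate void TimerProc(IntPtr hWnd, uint uMsg, UIntPtr idEvent, uint dwTime);

    // UINT_PTR SetTimer(HWND, UINT_PTR, UINT, TIMERPROC): UINT_PTR is 8 bytes on x64.
    [DllImport("user32.dll", SetLastError = true)]
    public static extern UIntPtr SetTimer(IntPtr hWnd, UIntPtr nIDEvent, uint uElapse, TimerProc lpTimerFunc);

    // Static field keeps the delegate reachable for as long as native code may call it.
    public static readonly TimerProc Callback = OnTimer;

    private static void OnTimer(IntPtr hWnd, uint uMsg, UIntPtr idEvent, uint dwTime)
    {
        // Hypothetical handler body.
    }

    // Windows only. The calling thread needs a message loop for the callback to run.
    public static UIntPtr Start(uint intervalMs)
    {
        return SetTimer(IntPtr.Zero, UIntPtr.Zero, intervalMs, Callback);
    }
}
```

A zero return value from Start means the timer was not created; Marshal.GetLastWin32Error() would then give the reason.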