my .net program is causing a BSOD - c#

I am getting a blue screen when my windows winform application is run.
It seems that only one user is getting this. I am not sure where at this time to look for the problem. I am however using some code that I found on CodeProject to trap mouse events and keyboard events http://www.codeproject.com/KB/cs/globalhook.aspx could this possibly be it?
I am looking for suggestions on how I might trap this error. It's only happening on one users computer out of 40, so I am a little perplexed - especially since this user is the primary stakeholder.
Update:
We have one more incident - the common denominator is the docking port. The user was using the same docking port.

It is impossible for your code to be causing a BSOD. If you're not running in kernel mode, then a BSOD isn't your fault (if you'll excuse the pun).
OTOH, I have seen managed code trigger a bug in a piece of kernel-mode code. This bug then caused a BSOD. In my case, the kernel-mode code was part of a piece of VPN software that wanted to understand what code you were running so it could decide whether or not to allow you access to the VPN. The code was using kernel-mode hooks to do this, and they had a bug that was triggered by the loading of large numbers of assemblies.
Apparently, they had never tested their code while Visual Studio was running. It loads add-ins and such at runtime, which triggered their bug. A piece of C# code that simply loaded large numbers of assemblies into an AppDomain (then unloaded the AppDomain and started over) also triggered their bug, so it wasn't a Visual Studio problem.
The moral of the story is that someone needs to look at the crash dump and figure out which piece of kernel-mode software caused the crash, then maybe you can figure out what was going on in the system to trigger the kernel-mode software to crash.

The problem isn't going to be your app directly, but some issue with his system, so as roufamatic says check is event logs. However there are two common 'hardware' errors that tend to cause this sort of issue and you could profitably check these to see if they[ll give you a lead.
Bad memory. If a memory error has developed it's not unusual to see a particular program that can 'cause' the bad memory to be accessed and so lead to a BSOD. If for example he's generally running fairly lightweight applications then it's possible for the memory error to be in a location that is not normally used. When you load up your app - particularly if it's got a large memory footprint and calling in lots of dependencies, you may well trigger the crash indirectly. So run a full check on the machine RAM.
Printer drivers. This used to be more of a problem than it is now, but if you're running on XP then it still does pop up occasionally. Printer drivers are notoriously badly written and quite often buggy. It's not unheard of for an app to call a printer driver which in turns crashes the system. To check this just remove the printer drivers then reload them again afterwards.
EDIT: As pointed out in comments, any bad hardware or bad drivers can cause this sort of behaviour. I simply highlight memory and printer drivers because experience shows that these two are by far and away the most common causes so worth considering first.

I've had to solve this problem before. I was writing user-mode C# code to talk to a HID device on the USB bus. This problem occurred on my laptop but on nobody else's machine. It turned out that it was causing problems because I was running a 64-bit OS and hence had 64-bit drivers. All other users had 32-bit drivers, which didn't have the problem. This was a mildly controlled set of users. I knew each user and they were competent users who could give me information about their hardware, OS and drivers. We were also all using the same device.
I don't remember how I DETERMINED the problem. But I solved it quite simply by setting the application project to only target 32-bit Windows. Without a 64-bit app, the faulty driver is never used.
Have the users update their drivers, their hardware, or change the code to avoid using the driver altogether.

Related

How to unload unsued COM objects/libraries after a complete restart?

Here is the thing. I'm connecting via COM to some devices at KNX/EIB. But sometimes - and I want to be ready for worst-case anyways - my application crashes leaving all objects and libraries exposed somewhere, somehow. I noticed when I restart the app I have trouble to get a connection again. I get an error for a connection procedure that is actually working well normally. Sometimes this connect procedure is working sometimes it is not, randomly. That is bad! After some time (several minutes) it seems to work again after a series of complete fails. But I think I see a pattern now. It doesn't work after a crash with no clean disconnect. My guess is there are objects that hold a connection to the device that us why I can't get a new connection. This is why I ask this question.
Question:
How do I unload those unused objects to kill undead connections?
How do I make Windows to check for unused libraries to be unloaded?
I just want to tell Windows, "I messed up badly and I need to continue my work. Please clean up my mess for me, so I can start fresh! Do I deserve a 2nd chance?"
Edit:
The scenario is the app has crashed and closed. I have no references to anything anymore. No finally clause or anything. The app can only be started again. What can I do to clean up the mess that has been made before, programmatically?
Edit 2:
Hans gave me the hint of killing the responsible server. So for now I solve that with calling taskkill on startup (at least as long I'm in dev). And it works!
C:\Windows\System32\taskkill.exe /F /IM Falcon.exe
This is the failure mode of an out-of-process COM server. If the client program crashes to the desktop without releasing the interface pointers then the server is completely unaware that the client isn't around anymore. And tends to get balky when you try to reconnect, many servers just permit one client.
By far the most common way that programmers induce this failure mode is by using a debugger. They'll click the Red Button or use the Stop Debugging command. Bam, no cleanup of course.
COM garbage-collects unused servers automatically. But that isn't particularly fast, takes an easy 10 minutes before it decides it needs to step in. And doesn't always work for every server, Office programs notoriously don't get cleaned-up for example.
Not much you can do about this when your app keels over in regular usage. Otherwise the kind of problem that killed middle-ware. Still, having such a mishap in a C# program is pretty unusual, the CLR releases interface pointers at program termination even when the app crashed with an exception. You'd have to have the very nasty kind of mishaps to bypass this, critical exceptions like ExecutionEngineException or the one this site is named after.
Don't focus too much on the Stop Debugging induced failures, it is normal and using Task Manager to kill the server is expected and required. Otherwise just be sure to get the nasty bugs out of your code and you won't have a problem. If you need more help then be sure to contact the owner of the server, be sure to have a small repro project available that demonstrates the issue.

How to interpret this stack trace

I recently released a Windows phone 8 app.
The app sometimes seem to crash randomly but the problem is it crash without breaking and the only info I get is a message on output that tells me there were an Access violation without giving any details.
So after releasing, from the crash reports I was able to obtain some more information, but they're kinda cryptical to me.
The info are:
Problem function: unknown //not very useful
Exception type: c0000005 //this is the code for Access violation exception
Stack trace:
Frame Image Function Offset
0 qcdx9um8960 0x00035426
1 qcdx9um8960 0x000227e2
I'm not used to work with memory pointer et similia and I'm not used to see a stack trace like that.
So I have those question:
How should I interpret/read those information, what's the meaning of every piece of information?
Is there a way to leverage those information to target my search for the problem?
Is there a way to get those information while debugging in VS2012
Notes:
I'm not asking what an Access Violation is
I tagged this as c# and c++ because my code is in c# but the exception is generated (I'm semi-guessing) by c++ implementation for the WebBrowser component
edit:
I tried setting the Debug type to Native only, this let me obtain the same info I had in the crash report on the dev center. This way the debugger break when the exception is thrown and let me see the disassebled code, unfortunately there's no qcdx9um8960 .pdb file (even on Microsoft Symbol Server), so I don't know the function name that caused the error.
Curiously, a search on the web for the image name "qcdx9um8960" returns several results referencing Windows Phone 8 and the WebBrowser control. Gathering the answers and replies (some even by MSFT), here is what you should possibly look into:
If you upgraded your application from Windows Phone 6/7 to 8, make sure you are not still referencing any 6/7 DLLs. 1
Make sure you aren't testing or publishing your software in Debug mode. There is a "qcdx9um8960.pdb" file that might be missing, causing the access violation. 1
"...there is a possible race condition known issue if the app has multiple copies of WebBrowser open. See if your code perhaps inadvertently makes more than one instance." 1
That image, "qcdx9um8960" is referencing a Qualcomm DirectX driver DLL. Perhaps it's not the WebBrowser component's fault, but the DirectX driver it might be using to render the web pages. 2
The name of the image suggests that the crash is happening on devices powered by a Qualcomm Snapdragon S4 Plus with model number MSM8960. 3
Assuming the processor above, and cross referencing Windows phones that use that chip, you're likely looking at the issue occurring on the Nokia Lumia 920T. 3 That's not to say that the driver doesn't work on several processor architectures or phones.
There are several other hits regarding crashes and issues debugging in the presence of that DLL, so unfortunately for you, I think you might be at the mercy of some third party software that has a few unresolved issues.
References
1 Access Violation since updated to WP8
2 [Toolkit][WP8] Performance issues with DepthStencilBuffer
3 Snapdragon (system on chip)
This kind of crash "should" never be caused by managed code, so you could go looking for a case where your app invokes some system or library API incorrectly. That's tedious. And the problem might have nothing to do with your app, it might be entirely internal to someone else's code. E.g, maybe WebBrowser crashes when user browses to some evil page. Or the failing code could be running on a thread that never even runs your code. From your observation that the debugger doesn't show any message before the access violation, and the fact that there are only 2 frames on the call stack, I suspect that's most likely.
So you should focus first on getting a (fairly) reliable repro scenario: the (minimal) set of steps that will (often or usually) produce the crash. This may involve interviewing the users who experienced the crash, or maybe some test automation on your part to try to accelerate the failure rate.
Once you have that, Microsoft (or another 3rd party) will accept responsibility -- managed code is never supposed to be able to cause an unhandled exception like access violation. And the scenario might give you a hint about how you can change your app's behavior to avoid the problem, because a real fix might take a long time to be released and distributed.

I've found a bug in the JIT/CLR - now how do I debug or reproduce it?

I have a computationally-expensive multi-threaded C# app that seems to crash consistently after 30-90 minutes of running. The error it gives is
The runtime has encountered a fatal error. The address of the error was at 0xec37ebae, on thread 0xbcc. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.
(0xc0000005 is the error-code for Access Violation)
My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint. In fact, the line of code that the debugger says caused the crash is
overallLength += distanceTravelled;
Where both values are of type double
Given all this, I believe the crash must be due to a bug in the compiler or CLR or JIT. I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin. I've never had to view the CIL-binary, or the compiled JIT output, or the native stacktrace (there is no managed stacktrace at the time of the crash), so I'm not sure how. I can't even figure out how to view the state of all the variables at the time of the crash (VS unfortunately won't tell me like it does after managed-exceptions, and outputting them to console/a file would slow down the app 1000-fold, which is obviously not an option).
So, how do I go about debugging this?
[Edit] Compiled under VS 2010 SP1, running latest version of .Net 4.0 Client Profile. Apparently it's ".Net 4.0C/.Net 4.0E, .Net CLR 1.1.4322"
I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin.
"Smaller reproduction" definitely sounds like a great idea here... even if "smaller" won't mean "quicker to reproduce".
Before you even start, try to reproduce the error on another machine. If you can't reproduce it on another machine, that suggests a whole different set of tests to do - hardware, installation etc.
Also, check you're on the latest version of everything. It would be annoying to spend days debugging this (which is likely, I'm afraid) and then end up with a response of "Yes, we know about this - it was a bug in .NET 4 which was fixed in .NET 4.5" for example. If you can reproduce it on a variety of framework versions, that would be even better :)
Next, cut out everything you can in the program:
Does it have a user interface at all? If possible, remove that.
Does it use a database? See if you can remove all database access: definitely any output which isn't used later, and ideally input too. If you can hard code the input within the app, that would be ideal - but if not, files are simpler for reproductions than database access.
Is it data-sensitive? Again, without knowing much about the app it's hard to know whether this is useful, but assuming it's processing a lot of data, can you use a binary search to find a relatively small amount of data which causes the problem?
Does it have to be multi-threaded? If you can remove all the threading, obviously that may well then take much longer to reproduce the problem - but does it still happen at all?
Try removing bits of business logic: if your app is componentized appropriately, you can probably fake out whole significant components by first creating a stub implementation, and then simply removing the calls.
All of this will gradually reduce the size of the app until it's more manageable. At each step, you'll need to run the app again until it either crashes or you're convinced it won't crash. If you have a lot of machines available to you, that should help...
tl;dr Make sure you're compiling to .Net 4.5
This sounds suspiciously like the same error found here. From the MSDN page:
This bug can be encountered when the Garbage Collector is freeing and compacting memory. The error can happen when the Concurrent Garbage Collection is enabled and a certain combination of foreground Garbage Collection and background Garbage Collection occurs. When this situation happens you will see the same call stack over and over. On the heap you will see one free object and before it ends you will see another free object corrupting the heap.
The fix is to compile to .Net 4.5. If for some reason you can't do this, you can also disable concurrent garbage collection by disabling gcConcurrent in the app.config file:
<configuration>
<runtime>
<gcConcurrent enabled="false"/>
</runtime>
</configuration>
Or just compile to x86.
WinDbg is your friend:
http://blogs.msdn.com/b/tess/archive/2006/02/09/net-crash-managed-heap-corruption-calling-unmanaged-code.aspx
http://www.codeproject.com/Articles/23589/Get-Started-Debugging-Memory-Related-Issues-in-Net
http://www.codeproject.com/Articles/22245/Quick-start-to-using-WinDbg
Download Debug Diagnostic Tool v1.2
Run program
Add Rule "Crash"
Select "Specific Process"
on page Advanced Configuration set your exception if you know on which exception it fails or just leave this page as is
Set userdump location
Now wait for process to crash, log file is created by DebugDiag. Now activate tab Advanced Analysis, select Crash/Hang Analyzers in top list and dump file in lower list and hit Start Analysis. This will generate html report for you. Hopes you found usefull info in that report. If you have problem with analyze, upload html report somewhere and place url here so we can focus on it.
My app does not invoke any native code, or use any unsafe blocks, or
even any non-CLS compliant types like uint
You may think this, but threading, synchronization via semaphore, mutex it any handles all are native. .net is a layer over operating system, .net itself does not support pure clr code for multithreading apps, this is because OS already does it.
Most likely this is thread synchronization error. Probably multiple threads are trying to access shared resource like file etc that is outside clr boundary.
You may think you aren't accessing com etc, but when you call certain API like get desktop folder path etc it is called through shell com API.
You have following two options,
Publish your code so that we can review the bottleneck
Redesign your app using .net parallel threading framework, which includes variety of algorithms requiring CPU intensive operations.
Most likely programs fail after certain period of time as collections grow up and operations fail to execute before other thread interfere. For example, producer consumer problem, you will not notice any problem till producer will become slower or fail to finish its operation before consumer kicks in.
Bug in clr is rare, because clr is very stable. But poorly written code may lead error to appear as bug in clr. Clr can not and will never detect whether the bug is in your code or in clr itself.
Did you run a memory test for your machine as the one time I had comparable symptoms one of my dimms turned out to be faulty (a very good memorytester is included in Win7; http://www.tomstricks.com/how-to-test-your-ram-or-memory-with-windows-memory-diagnostic-tool-in-windows-7/)
It might also be a heating/throttling issue if your CPU gets too hot after this period of time. Although that would happen sooner imho.
There should be a dumpfile that you can analyze. If you never did this find someone who did, or send that to microsoft
I will suggest you open a support case via http://support.microsoft.com immediately, as the support guys can show you how to collect the necessary information.
Generally speaking, like #paulsm4 and #psulek said, you can utilize WinDbg or Debug Diag to capture crash dumps of the process, and within it, all necessary information is embedded. However, if this is the very first time you use those tools, you might be puzzled. Microsoft support team can provide you step by step guidance on them, or they can even set up a Live Meeting session with you to capture the data, as the program crashes so often.
Once you are familiar with the tools, in the future you can perform similar troubleshooting more easily,
http://blogs.msdn.com/b/lexli/archive/2009/08/23/when-the-application-program-crashes-on-windows.aspx
BTW, it is too early to say "I've found a bug". Though you cannot obviously find in your program a dependency on native code, it might still have a dependency on native code. We should not draw a conclusion before debugging further into the issue.

How does running C# app inside a code profiler differ from running one outside code profiler?

I'm trying to understand how a code profiler (in this case the Drone Profiler) runs a .NET app differently from just running it directly. The reason I need to know this is because I have a very strange problem/corruption with my dev computer's .NET install which manifests itself outside of the profiler but very strangely not inside and if I can understand why I can probably fix my computer's issue.
The issue ONLY seems to affect calls to System.Net.NetworkInformation's methods (and only within .NET 3.5 to 2.0, if I build something against 4 all is well). I built a little test app which only does one thing, it calls System.Net.NetworkInformation.IsNetworkAvailable(). Outside of the profiler I get "Fatal Execution Engine Error" occurred in System.dll, and that's all the info it gives. From what I understand that error usually results from native method calls, which presumably occur when the System.dll lets some native DLL perform the IsNetworkAvailable() logic.
I tried to figure out the inside and outside the profiler difference using Process Monitor, recording events from both situations and comparing them. Both logs were the same until just a moment after iphlpapi.dll and winnsi.dll were called and just before the profiler-run code called dnsapi.dll and the non-profiler code began loading crash reporting related stuff. At that moment when it seemed to go wrong the profiler-run code created 4-6 new threads and the non-profiler (crashing) code only created 1 or 2. I don't know what that means, if anything.
Arguably unnecessary background
My Windows 7 included .NET installation (3.5 to 2.0) was working fine until my hard drive suffered some corruption and checkdisk began finding bad clusters. I imaged the drive to a new one and everything works fine except this one issue with .NET.
I need to resolve this problem reinstalling Windows or reverting to image backups.
Here are some of the things I've looked into:
I have diffed the files/directories which seemed most relevant (the .NET stuff under Windows and Program Files) pre- and post- disk trouble and seen no changes where I didn't expect any (no obvious file corruption).
I have diffed the software and system registry hives pre- and post- disk trouble and seen no changes which seemed relevant.
I have created a new user account and cleaned up any environment variables in case environment was related. No change.
I did "sfc /scannow" and it found no integrity problems.
I tried "ngen update" to regenerate pre-compiled code in case I missed something that might be damaged and nothing changed.
I removed my virus scanner to see if it was interfering, no difference.
I tried running the test code in Safe Mode, same crash issue.
I assume I need to repair my .NET installation but because Windows 7 included .NET 3.5 - 2.0 you can't just re-run a .NET installer to redo it. I do not have access to the Windows disks to try to re-install Windows over itself (the computer has a recovery partition but it is unusable); also the drive uses a whole-disk encryption solution and re-installing would be difficult.
I absolutely do not want to start from scratch here and install a fresh Windows, reinstall dozens of software packages, try and remember dozens of development-related customizations/etc.
Given all that... does anyone have any helpful advice? I need .NET 3.5 - 2.0 working as I am a developer and need to build and test against it.
Thanks!
Quinxy
The short answer is that my System.ni.dll file was damaged, I replaced it and all is well.
The long answer might help someone else by way of its approach to the solution...
My problem related to .Net being damaged in such a way that apps wouldn't run except through a profiler. I downloaded the source for the SlimTune open source profiler, built it locally, and set a break point right before the call to Process.Start(). I then compared all the parameters involved in starting the app successfully through the profiler versus manually. The only meaningful difference I found was the addition of the .NET profile parameters added to the environment variables:
cor_enable_profiling=1
cor_profiler={38A7EA35-B221-425a-AD07-D058C581611D}
I then tried setting these in my own user's environment, and voila! Now any app I ran manually would work. (I had actually tried doing the same thing a few hours earlier but I used a GUID that was included in an example and which didn't point to a real profiler and apparently .NET knew I had given it a bogus GUID and didn't run in profiling mode.)
I now went back and began reading about just how a PE file is executed by CLR hoping to figure out why it mattered that my app was run with profiling enabled. I learned a lot but nothing which seemed to apply.
I did however remember that I should recheck the chkdsk log I kept listing the files that were damaged by the drive failure. After the failure I had turned all the listed file ids into file paths/names and I had replaced all the 100+ files I could from backup but sure enough when I went back now and looked I found a note that while I had replaced 4 or 5 .NET related files successfully there was one such file I wasn't able to replace because it was "in use". That file? System.ni.dll!!! I was now able to replace this file from backup and voila my .NET install is back to normal, apps work whether profiled or not.
The frustrating thing is that when this incident first occurred I fully expected the problem to relate to a damaged file, and specifically to a file called System.dll which housed the methods that failed. And so I diffed and rediffed all files named System.dll. But I did not realize at that time that System.ni.dll was a native compiled manifestation of System.dll (or somesuch). And because I had diffed and rediffed the .NET related directories and not noticed this (no idea how I missed it) I'd given up on that approach.
Anyway... long story short, it was a damaged System.ni.dll that caused my problems, one or more clusters within it had their content replaced with 0x0 and it just so happened to manifest as the odd problem I observed.
This sounds like a timing issue, which the profiler "fixed" by making it just a little slower.
Many profilers use instrumentation (more info here), which slightly slows down the application. Apparently it slows one thread down enough that another thread can do a little bit more work, preventing that crash. Errors like these often do not manifest themselves directly on the developers machine, but surface as soon as they run on a processors with more cores or hyper-threading. Sometimes they only occur in release builds (or vice-versa in debug builds). Timing issues can be difficult to track down since the same code may give different results in different conditions (in profiler or debugger).
From your description I'll attempt to do a wild guess on how it may be fixed:
Try to find in the source where the new threads are started. Then after they are spawned add a System.Threading.Thread.Sleep(500); line there to pause the main thread and give the new threads some time to start.
Without the source code and a few stack traces of the crashes, this is quite a bit of guesswork.

App That Uses SDK BSODs in Delphi 2007 But Works in C# [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I'm coding an application that uses a third party SDK (OCXs). I use the SDK in C# and it works just fine. However, I can create the simplest test application with the same objects from the SDK in Delphi 2007 and it compiles ok, but BSODs on the same machine when it gets to a certain point. I've run some other test applications that use the SDK and they operate correctly, so I know the SDK is installed ok and functioning properly.
Other Delphi projects I work on that don't use this specific SDK operate properly.
Any ideas on how to try to fix this problem? Might I need to remove the OCXs that I've installed in Delphi and add them back? How do you do that?
Very unusual problem. Windows NT based OS's are normally very good at containing faults that occur at Ring 3 inside of user applications. Does the SDK in question interact with hardware through kernel level drivers? Also, what information appears on screen when the system BSOD's. Sometimes this is a clue as to the nature of the problem (e.g. Interrupt handler reentrancy problems, faulty hardware, interrupt conflicts). I have seen this type of problem with Dialogic boards on occasion. The problem may also be a kernel level fault that is not driver related.
Without additional details it has hard to do more than speculate as to the cause. I am inclined to think this is not a FPU control word problem, but I could be wrong. FPU control word problems can result in exceptions, but the exception would need to occur in a critical execution context like a driver where it would not be caught and handled by the OS.
FWIW I have seen FPU control word problems in Delphi JNI libraries that crashed the JVM and host application. The fix was to explicitly wrap the call with code to change and restore the control word as suggested by NineBerry. This works provided the application is not multi-threaded.
It turns out this was caused by Kaspersky antivirus. I tracked it down by running Windbg against the crash dump and it pointed to kl1.dll. After searching around I tracked down this article which identified it as Kaspersky. I disabled Kaspersky and the BSOD no longer occurs.
There is a possibility that floating point exceptions cause the problem. Most runtime environments disable floating point exceptions, but with Delphi, they are enabled.
The disable floating point exceptions, use the following code:
Set8087CW($133f);
Note that this will change the behaviour of your own code as well. You will not get any exceptions any more, if there are errors in a floating point calculation, instead the result variable will be set to either NaN, Inf+ or Inf-.
Just because you can use the SDK fine from other applications doesn't inherently mean that it doesn't have a problem. It may be reacting badly to certain data items or even to items that shouldn't exist--assuming the contents of memory that doesn't formally contain anything.
I'm sure you've run into bugs caused by uninitialized data items that simply get whatever value was in memory, something like this could be going on with the data area in question being passed in. (Say, a padding byte in a record structure...)

Categories

Resources