Question about garbage collection in C# .NET

I am experiencing a problem in my application with OutOfMemoryException. My application can search for words within texts. When I start a long-running search of about 2000 different texts for about 2175 different words, the application terminates roughly 50% of the way through with an OutOfMemoryException (after about 6 hours of processing).
I have been trying to find the memory leak. I have an object graph like this (--> denotes a reference):
a static global application object (controller) --> an algorithm starter object --> text mining starter object --> text mining algorithm object (this object performs the searching).
The text mining starter object will start the text mining algorithm object's run()-method in a separate thread.
To try to fix the issue, I edited the code so that the text mining starter object splits the texts to search into several groups and initializes one text mining algorithm object per group, sequentially (when one text mining algorithm object is finished, a new one is created to search the next group of texts). At that point I set the previous text mining algorithm object to null. But this does not solve the issue.
When I create a new text mining algorithm object I have to give it some parameters. These are taken from properties of the previous text mining algorithm object before I set that object to null. Will this prevent garbage collection of the text mining algorithm object?
Here is the code for the creation of new text mining algorithm objects by the text mining algorithm starter:
private void RunSeveralAlgorithmObjects()
{
    IEnumerable<ILexiconEntry> currentEntries =
        allLexiconEntries.GetGroup(intCurrentAlgorithmObject, intNumberOfAlgorithmObjectsToUse);
    algorithm.LexiconEntries = currentEntries;
    algorithm.Run();
    intCurrentAlgorithmObject++;

    for (int i = 0; i < intNumberOfAlgorithmObjectsToUse - 1; i++)
    {
        algorithm = CreateNewAlgorithmObject();
        AddAlgorithmListeners();
        algorithm.Run();
        intCurrentAlgorithmObject++;
    }
}
private TextMiningAlgorithm CreateNewAlgorithmObject()
{
    TextMiningAlgorithm newAlg = new TextMiningAlgorithm();
    newAlg.SortedTermStruct = algorithm.SortedTermStruct;
    newAlg.PreprocessedSynonyms = algorithm.PreprocessedSynonyms;
    newAlg.DistanceMeasure = algorithm.DistanceMeasure;
    newAlg.HitComparerMethod = algorithm.HitComparerMethod;
    newAlg.LexiconEntries = allLexiconEntries.GetGroup(intCurrentAlgorithmObject, intNumberOfAlgorithmObjectsToUse);
    newAlg.MaxTermPercentageDeviation = algorithm.MaxTermPercentageDeviation;
    newAlg.MaxWordPercentageDeviation = algorithm.MaxWordPercentageDeviation;
    newAlg.MinWordsPercentageHit = algorithm.MinWordsPercentageHit;
    newAlg.NumberOfThreads = algorithm.NumberOfThreads;
    newAlg.PermutationType = algorithm.PermutationType;
    newAlg.RemoveStopWords = algorithm.RemoveStopWords;
    newAlg.RestrictPartialTextMatches = algorithm.RestrictPartialTextMatches;
    newAlg.Soundex = algorithm.Soundex;
    newAlg.Stemming = algorithm.Stemming;
    newAlg.StopWords = algorithm.StopWords;
    newAlg.Synonyms = algorithm.Synonyms;
    newAlg.Terms = algorithm.Terms;
    newAlg.UseSynonyms = algorithm.UseSynonyms;
    algorithm = null;
    return newAlg;
}
Here is the start of the thread that is running the whole search process:
// Run the algorithm in its own thread
Thread algorithmThread = new Thread(new ThreadStart(RunSeveralAlgorithmObjects));
algorithmThread.Start();
Can something here prevent the previous text mining algorithm object from being garbage collected?

I recommend first identifying what exactly is leaking. Then postulate a cause (such as references in event handlers).
To identify what is leaking:
Enable native debugging for the project. Properties -> Debug -> check Enable unmanaged code debugging.
Run the program. Since the memory leak is probably gradual, you don't need to let it run the whole 6 hours; just let it run for a while and then Debug -> Break All.
Bring up the Immediate window. Debug -> Windows -> Immediate
Type one of the following into the immediate window, depending on whether you're running 32 or 64 bit, .NET 2.0/3.0/3.5 or .NET 4.0:
.load C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll for 32-bit .NET 2.0-3.5
.load C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319\sos.dll for 32-bit .NET 4.0
.load C:\WINDOWS\Microsoft.NET\Framework64\v2.0.50727\sos.dll for 64-bit .NET 2.0-3.5
.load C:\WINDOWS\Microsoft.NET\Framework64\v4.0.30319\sos.dll for 64-bit .NET 4.0
You can now run SoS commands in the Immediate window. I recommend checking the output of !dumpheap -stat, and if that doesn't pinpoint the problem, check !finalizequeue.
Notes:
Running the program the first time after enabling native debugging may take a long time (minutes) if you have VS set up to load symbols.
The debugger commands that I recommended both start with ! (exclamation point).
These instructions are courtesy of the incredible Tess from Microsoft, and Mario Hewardt, author of Advanced .NET Debugging.
Once you've identified the leak in terms of which object is leaking, then postulate a cause and implement a fix. Then you can do these steps again to determine for sure whether or not the fix worked.

1) As I said in a comment, if you use events in your code (AddAlgorithmListeners makes me suspect you do), subscribing to an event can create a "hidden" dependency between objects which is easily forgotten. This dependency can mean that an object is not freed, because someone is still listening to one of its events. Make sure you unsubscribe from all events when you no longer need to listen to them.
2) Also, I'd like to point you to one (probably not-so-off-topic) issue with your code:
private void RunSeveralAlgorithmObjects()
{
    ...
    algorithm.LexiconEntries = currentEntries;
    // ^ when/where is algorithm initialized?
    for (...)
    {
        algorithm = CreateNewAlgorithmObject();
        ...
    }
}
Is algorithm already initialized when this method is invoked? Otherwise, setting algorithm.LexiconEntries wouldn't seem like a valid thing to do. It also means your method depends on some external state, which to me looks like a place for bugs to creep into your program logic.
If I understand it correctly, this object contains some state describing the algorithm, and CreateNewAlgorithmObject derives a new state for algorithm from the current state. If this were my code, I would make algorithm an explicit parameter to all of these functions, as a signal that the method depends on this object. It would then no longer be hidden "external" state upon which your functions depend.
P.S.: If you don't want to go down that route, another thing you could consider to make your code more consistent is to turn CreateNewAlgorithmObject into a void method and reassign algorithm directly inside that method.
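For illustration, here is a sketch of the explicit-parameter version; the property names are copied from your snippet, everything else is just my assumption about how the surrounding code looks:

private TextMiningAlgorithm CreateNewAlgorithmObject(TextMiningAlgorithm previous)
{
    TextMiningAlgorithm newAlg = new TextMiningAlgorithm();
    // Copy the configuration from the previous instance...
    newAlg.SortedTermStruct = previous.SortedTermStruct;
    newAlg.Terms = previous.Terms;
    // ...and so on for the remaining settings...
    newAlg.LexiconEntries = allLexiconEntries.GetGroup(intCurrentAlgorithmObject, intNumberOfAlgorithmObjectsToUse);
    return newAlg;
}

// The caller now sees the dependency and decides what happens to the old instance:
algorithm = CreateNewAlgorithmObject(algorithm);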

Is AddAlgorithmListeners attaching event handlers to events exposed by the algorithm object? Are the listening objects living longer than the algorithm object? In that case they can keep the algorithm object from being collected.
If yes, try unsubscribing events before you let the object go out of scope.
for (int i = 0; i < intNumberOfAlgorithmObjectsToUse - 1; i++)
{
    algorithm = CreateNewAlgorithmObject();
    AddAlgorithmListeners();
    algorithm.Run();
    RemoveAlgorithmListeners(); // See if this fixes your issue.
    intCurrentAlgorithmObject++;
}
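The posted code doesn't show what AddAlgorithmListeners subscribes, and RemoveAlgorithmListeners doesn't exist yet, so this is only a sketch of the pattern; the ProgressChanged event name is made up:

private void AddAlgorithmListeners()
{
    // The algorithm's event now stores a delegate that references this (long-lived) object.
    algorithm.ProgressChanged += OnAlgorithmProgressChanged;
}

private void RemoveAlgorithmListeners()
{
    // Detach so that no delegate ties the two objects together once the run is finished.
    algorithm.ProgressChanged -= OnAlgorithmProgressChanged;
}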

My suspect is AddAlgorithmListeners(); are you sure you remove the listeners after execution has completed?

Is the IEnumerable returned by GetGroup() throw-away or cached? That is, does it hold onto the objects it has emitted? If it does, memory use would obviously grow linearly with each iteration.
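To illustrate the difference (the real GetGroup() isn't shown, so both versions below are hypothetical, including the assumed backing list 'entries'): a streaming implementation retains nothing extra, whereas one that appends to a member list keeps every emitted entry alive for the lifetime of its owner.

// Streaming: nothing extra is retained once the caller has finished enumerating.
public IEnumerable<ILexiconEntry> GetGroup(int groupIndex, int groupCount)
{
    for (int i = groupIndex; i < entries.Count; i += groupCount)
        yield return entries[i];
}

// Caching: the 'emitted' list grows with every call and is never released.
private readonly List<ILexiconEntry> emitted = new List<ILexiconEntry>();

public IEnumerable<ILexiconEntry> GetGroupCached(int groupIndex, int groupCount)
{
    for (int i = groupIndex; i < entries.Count; i += groupCount)
    {
        emitted.Add(entries[i]);
        yield return entries[i];
    }
}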
Memory profiling is useful here; have you tried examining the application with a profiler? I found Red Gate's useful in the past (it's not free, but it does have an evaluation version, IIRC).

Related

Memory Allocation Time (The Fast Way)

For a really simple code snippet, I'm trying to see how much of the time is spent actually allocating objects on the small object heap (SOH).
static void Main(string[] args)
{
    const int noNumbers = 10000000; // 10 mil

    ArrayList numbers = new ArrayList();
    Random random = new Random(1);  // use the same seed to make benchmarking consistent

    for (int i = 0; i < noNumbers; i++)
    {
        int currentNumber = random.Next(10); // generate a non-negative random number less than 10
        object o = currentNumber;            // BOXING occurs here
        numbers.Add(o);
    }
}
In particular, I want to know how much time is spent allocating space for all the boxed int instances on the heap (I know, this is an ArrayList and there's horrible boxing going on as well - but it's just for educational purposes).
The CLR has 2 ways of performing memory allocations on the SOH: either calling the JIT_TrialAllocSFastMP (for multi-processor systems, ...SFastSP for single processor ones) allocation helper - which is really fast since it consists of a few assembly instructions - or failing back to the slower JIT_New allocation helper.
PerfView sees the JIT_New invocations just fine:
However, I can't figure out which native function - if any - is involved in the "quick way" of allocating. I certainly don't see any JIT_TrialAllocSFastMP. I've already tried raising the loop count (from 10 to 500 million), in the hope of increasing my chances of getting a glimpse of a few stacks containing the elusive function, but to no avail.
Another approach was to use the JetBrains dotTrace (line-by-line) performance viewer, but it falls short of what I want: I do get to see the approximate time the boxing operation takes for each int, but (1) it's just a bar and (2) it covers both the allocation itself and the copying of the value (and the latter is not what I'm after).
Using the JetBrains dotTrace Timeline viewer won't work either, since they currently don't (quite) support native callstacks.
At this point it's unclear to me whether a method is being dynamically generated and called when JIT_TrialAllocSFastMP is invoked - and by some miracle none of the PerfView-collected stack samples (one every 1 ms) ever captures it - or whether Main's method body somehow gets patched, with those few assembly instructions mentioned above injected directly into the code. It's also hard to believe that the fast way of allocating memory is never called.
You could ask "But you already have the .NET Core CLR code, why can't you figure out yourself ?". Since the .NET Framework CLR code is not publicly available, I've looked into its sibling, the .NET Core version of the CLR (as Matt Warren recommends in his step 6 here). The \src\vm\amd64\JitHelpers_InlineGetThread.asm file contains a JIT_TrialAllocSFastMP_InlineGetThread function. The issue is that parsing/understanding the C++ code there is above my grade, and also I can't really think of a way to "Step Into" and see how the JIT-ed code is generated, since this is way lower-level that your usual Press-F11-in-Visual-Studio.
Update 1: Let's simplify the code, and only consider individual boxed int values:
const int noNumbers = 10000000; // 10 mil
object o = null;
for (int i = 0; i < noNumbers; i++)
{
    o = i; // boxing still occurs here
}
Since this is a Release build, and dead code elimination could kick in, WinDbg is used to check the final machine code.
The resulting JITed code, whose main loop (which simply does the repeated boxing) is highlighted in blue below, shows that the method handling the memory allocation is not inlined (note the call to hex address 00af30f4):
That method in turn tries to allocate via the "fast" way and, if that fails, falls back to the "slow" way of calling JIT_New itself:
It's interesting how the call stack in PerfView obtained from the code above doesn't show any intermediary method between the level of Main and the JIT_New entry itself (given that Main doesn't directly call JIT_New):

How to make the JIT extend stack variables to the end of scope (GC is too quick)

We're dealing with the GC being too quick in a .NET program.
Because we use a class with native resources and do not call GC.KeepAlive(), the GC collects the object before the native access ends. As a result, the program crashes.
We have exactly the problem as described here:
Does the .NET garbage collector perform predictive analysis of code?
Like so:
{
    var img = new ImageWithNativePtr();
    IntPtr p = img.GetData();
    // DANGER!
    ProcessData(p);
}
The point is: the JIT generates information showing the GC that img is no longer used once GetData() has run. If a GC thread comes along at just the right time, it collects img and the program crashes. One can solve this by appending GC.KeepAlive(img);
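Applied to the snippet above, the fix is simply (a sketch of the pattern, not code from the question):

{
    var img = new ImageWithNativePtr();
    IntPtr p = img.GetData();
    ProcessData(p);
    GC.KeepAlive(img);   // img is now provably reachable until ProcessData has returned
}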
Unfortunately there is already too much code written (at too many places) to rectify the issue easily.
Therefore: is there, for example, an attribute (e.g. for ImageWithNativePtr) to make the JIT behave like it does in a Debug build? In a Debug build, the variable img remains valid until the end of the scope ( } ), while in Release it loses validity at the DANGER comment.
To the best of my knowledge there is no way to control jitter's behavior based on what types a method references. You might be able to attribute the method itself, but that won't fill your order. This being so, you should bite the bullet and rewrite the code. GC.KeepAlive is one option. Another is to make GetData return a safe handle which will contain a reference to the object, and have ProcessData accept the handle instead of IntPtr — that's good practice anyway. GC will then keep the safe handle around until the method returns. If most of your code has var instead of IntPtr as in your code fragment, you might even get away without modifying each method.
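A minimal sketch of the safe-handle idea, assuming ImageWithNativePtr itself owns and frees the native memory (all names other than SafeHandle are placeholders):

using System;
using System.Runtime.InteropServices;

class ImageDataHandle : SafeHandle
{
    private ImageWithNativePtr owner;   // keeps the image alive for as long as the handle itself is alive

    public ImageDataHandle(ImageWithNativePtr image, IntPtr data)
        : base(IntPtr.Zero, false)      // ownsHandle: false - the image owns the native memory
    {
        owner = image;
        SetHandle(data);
    }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        // Never called, because the handle does not own the memory; the image frees it itself.
        return true;
    }
}

// GetData() would return an ImageDataHandle instead of a raw IntPtr, and
// ProcessData(ImageDataHandle h) would call h.DangerousGetHandle() internally.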
You have a few options.
(Requires work, more correct) - Implement IDisposable on your ImageWithNativePtr class and wrap its use in using blocks; a using block compiles down to try { ... } finally { obj.Dispose(); }, which keeps the object alive until the end of the block. You can ease the pain of doing this by installing something like CodeRush (even the free Xpress edition supports this), which can generate using blocks for you.
(Easier, not correct, more complicated build) - Use PostSharp or Mono.Cecil and create your own attribute. Typically this attribute would cause GC.KeepAlive() to be inserted into these methods.
The CLR has nothing built in for this functionality.
I believe you can emulate what you want with a container that implements IDisposable and a using statement. The using statement defines the scope, and you can place anything in the container that needs to stay alive over that scope. This can be a helpful mechanism if you have no control over the implementation of ImageWithNativePtr.
The container of things to dispose is a useful idiom. Particularly when you really should be disposing of something ... which is probably the case with an image.
using (var keepAliveContainer = new KeepAliveContainer())
{
    var img = new ImageWithNativePtr();
    keepAliveContainer.Add(img);
    IntPtr p = img.GetData();
    ProcessData(p);
    // anything added to the container is still referenced here.
}
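KeepAliveContainer is not a framework type; a minimal sketch of what it could look like (disposing the collected items is an optional extra):

using System;
using System.Collections.Generic;

public sealed class KeepAliveContainer : IDisposable
{
    private readonly List<object> items = new List<object>();

    // The using statement keeps the container reachable until its closing brace,
    // and the container in turn keeps everything added to it reachable.
    public void Add(object item)
    {
        items.Add(item);
    }

    public void Dispose()
    {
        // Dispose anything disposable that was handed to us, then let go of the references.
        foreach (var item in items)
        {
            var disposable = item as IDisposable;
            if (disposable != null)
                disposable.Dispose();
        }
        items.Clear();
    }
}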

Uniquely Identifying Reference Types in the Debugger

I come from a C++ background, so apologies if this is a non-C# way of thinking, but I just need to know. :)
In C++ if I have two pointers, and I want to know if they point to the same thing, I can look in the memory/watch window and see their value - to see if they are pointing to the same memory space.
In C#, I haven't been able to find something along those lines. Two reference-type variables showing exactly the same member values could in fact refer to the very same object, or to two entirely different ones.
Is there a way for me to see this kind of information in C#? Perhaps some kind of equivalent to the & operator for the watch window or some such?
What you're looking for are object IDs. For any reference type in the debugger, you can right-click and select "Make Object ID". This adds a # suffix to the Value column whenever that instance is displayed in the debugger. You can also add 1#, 2#, etc. to the watch window to see them again at any time later.
Step 0 - Run this code
static void Main(string[] args)
{
    var x = "a string";
    var y = x;
    System.Diagnostics.Debugger.Break();
}
Step 1 - Right Click and select "Make Object Id"
Step 2 - Instances now display with the 1# suffix. Note: I did nothing special in this step. Immediately after clicking "Make Object Id" both rows updated to display the 1# suffix since they refer to the same instance.
Step 3 - See them at any time by adding 1# to the watch window
In C# projects, if you right-click on a variable's name in one of the variable windows and select "create object ID", Visual Studio will assign a unique ID to that instance and display it in the Value column. The IDs look like {1#}, {2#}, etc. If two objects have the same ID then they're referentially identical.
In code or in the Immediate window, you can also check to see if two objects are identical by using Object.ReferenceEquals().
I don't believe there's a good way to get an actual memory address for an object in the debugger. I'm guessing that's by design, since an object's location in memory is likely to change during garbage collection in a managed application. Of course you could declare an unsafe block, pin the object, and grab a pointer to it using all the usual C/C++ operators. Then you'd be able to see the pointer's value in the debugger. I wouldn't recommend that as a good habit, though - pinning objects tends to muck with the garbage collector's ability to maintain an orderly heap, which can in turn lead to worse performance and memory consumption.
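For completeness, a sketch of what pinning looks like; note that GCHandleType.Pinned only works for blittable objects such as primitive arrays, so this is mainly useful for buffers rather than arbitrary objects:

using System;
using System.Runtime.InteropServices;

byte[] buffer = new byte[16];
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
try
{
    IntPtr address = handle.AddrOfPinnedObject();   // stable for as long as the handle stays pinned
    Console.WriteLine("0x{0:X}", address.ToInt64());
}
finally
{
    handle.Free();   // unpin as soon as possible
}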
You could use the Immediate Window and use Object.ReferenceEquals(obj1, obj2) to test this out!
I think you want the System.Object.ReferenceEquals function.
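For instance, with the x and y from the Step 0 sample above:

bool same = Object.ReferenceEquals(x, y);                                      // true: both variables refer to the same string instance
bool copy = Object.ReferenceEquals(x, new string("a string".ToCharArray()));   // false: equal value, but a different instance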

Find references to the object in runtime

I have an object which lives forever. I am deleting all the references to it that I can see after using it, but it is still not collected. Its life cycle is pretty sophisticated, so I can't be sure that all references have been cleared.
if (container.Controls.Count > 0)
{
    var controls = new Control[container.Controls.Count];
    container.Controls.CopyTo(controls, 0);
    foreach (var control in controls)
    {
        container.Controls.Remove(control);
        control.Dispose();
    }
    controls = null;
}
GC.Collect();
GC.Collect(1);
GC.Collect(2);
GC.Collect(3);
How can I find out what references it still has? Why is it not collected?
Try using a memory profiler (e.g. ANTS); it will tell you what is keeping the object alive. Trying to second-guess this type of problem is very hard.
Red Gate offers a 14-day trial, which should be more than enough time to track down this problem and decide whether a memory profiler provides you with long-term value.
There are lots of other memory profilers on the market (e.g. .NET Memory Profiler), and most of them have free trials; however, I have found that the Red Gate tools are easy to use, so I tend to try them first.
You'll have to use WinDbg and the SOSEX extension.
The !DumpHeap and !GCRoot commands can help you identify the instance and all remaining references that keep it alive.
I solved a similar issue with the SOS extension (which apparently no longer works with Visual Studio 2013, but works fine with older versions of Visual Studio).
I used the following code to get the address of the object for which I wanted to track references:
public static string GetAddress(object o)
{
    if (o == null)
    {
        return "00000000";
    }
    else
    {
        unsafe
        {
            System.TypedReference tr = __makeref(o);
            System.IntPtr ptr = **(System.IntPtr**)(&tr);
            return ptr.ToString("X");
        }
    }
}
and then, in Visual Studio 2012 immediate window, while running in the debugger, type:
.load C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos.dll
which will load the SOS.dll extension.
You can then use GetAddress(x) to get the hexadecimal address of the object (for instance 8AB0CD40), and then use:
!do 8AB0CD40
!GCRoot -all 8AB0CD40
to dump the object and find all references to the object.
Just keep in mind that if the GC runs, it might change the address of the object.
I've been using .NET Memory Profiler to do some serious memory profiling on one of our projects. It's a great tool for looking into the memory management of your app. I don't get paid for this info :) but it just helped me a lot.
The garbage collection in .NET is not a reference-counting scheme (such as COM uses), but a mark-and-sweep implementation. Basically, the GC runs at "random" times when it feels the need to do so, and the collection of objects is therefore not deterministic.
You can, however, manually trigger a collection (GC.Collect()), but you may then have to wait for finalizers to run (GC.WaitForPendingFinalizers()). Doing this in a production app is discouraged, because it may hurt the efficiency of the memory management (the GC runs too often, or waits for finalizers). If the object still exists, it really does still have a live reference somewhere.
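For completeness, the usual sequence when you really do want to force a full collection while diagnosing (not something to leave in production code):

GC.Collect();                    // collect all generations
GC.WaitForPendingFinalizers();   // let any pending finalizers run
GC.Collect();                    // collect objects that were kept alive only for finalization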
It is not collected because you haven't removed all references to it. The GC will only mark objects for collection if they have no roots in the application.
What means are you using to check up on the GC to see if it has collected your object?

using ThreadPools to search through object lists

I have these container objects (let's call them Container) in a list. Each of these Container objects in turn has a DataItem (or a derivate) in a list. In a typical scenario a user will have 15-20 Container objects with 1000-5000 DataItems each. Then there are some DataMatcher objects that can be used for different types of searches. These work mostly fine (since I have several hundred unit tests on them), but in order to make my WPF application feel snappy and responsive, I decided that I should use the ThreadPool for this task. Thus I have a DataItemCommandRunner which runs on a Container object, and basically performs each delegate in a list it takes as a parameter on each DataItem in turn; I use the ThreadPool to queue up one thread for each Container, so that the search in theory should be as efficient as possible on multi-core computers etc.
This is basically done in a DataItemUpdater class that looks something like this:
public class DataItemUpdater
{
    private Container ch;
    private IEnumerable<DataItemCommand> cmds;

    public DataItemUpdater(Container container, IEnumerable<DataItemCommand> commandList)
    {
        ch = container;
        cmds = commandList;
    }

    public void RunCommandsOnContainer(object useless)
    {
        Thread.CurrentThread.Priority = ThreadPriority.AboveNormal;
        foreach (DataItem di in ch.ItemList)
        {
            foreach (var cmd in cmds)
            {
                cmd(di);
            }
        }
        //Console.WriteLine("Done running for {0}", ch.DisplayName);
    }
}
(The useless object parameter for RunCommandsOnContainer is because I am experimenting with this with and without using threads, and one of them requires some parameter. Also, setting the priority to AboveNormal is just an experiment as well.)
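The call site isn't shown in the question, but the containers are presumably queued along these lines (a sketch; only RunCommandsOnContainer's signature comes from the code above, and the containers and commands variables are assumed):

foreach (Container container in containers)
{
    var updater = new DataItemUpdater(container, commands);
    // RunCommandsOnContainer(object) matches the WaitCallback delegate, so the
    // otherwise-unused parameter receives the (null) state argument from the pool.
    ThreadPool.QueueUserWorkItem(updater.RunCommandsOnContainer);
}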
This works fine for all but one scenario - when I use the AllWordsMatcher object type that will look for DataItem objects containing all words being searched for (as opposed to any words, exact phrase or regular expression for instance).
This is a pretty simple somestring.Contains(eachWord) based object, backed by unit tests. But herein lies some hairy strangeness.
When the RunCommandsOnContainer runs using ThreadPool threads, it will return insane results. Say I have a string like this:
var someString = "123123123 - just some numbers";
And I run this:
var res = someString.Contains("data");
When it runs, this will actually return true quite a lot - I have debugging information showing it returning true for empty strings and other strings that simply do not contain the data. It will also sometimes return false even when the string actually does contain the data being looked for.
The kicker in all this? Why do I suspect the ThreadPool and not my own code?
When I run the RunCommandsOnContainer() command for each Container in my main thread (i.e. locking the UI and everything), it works 100% correctly - every time! It never finds anything it shouldn't, and it never skips anything it should have found.
However, as soon as I use the ThreadPool, it starts finding a lot of items it shouldn't, while sometimes not finding items it should.
I realize this is a complex problem (it is painful trying to debug, that's for sure!), but any insight into why and how to fix this would be greatly appreciated!
Thanks!
Rune
It's a bit hard to tell from the fragment you're posting, but judging by the symptoms I would look at the AllWordsMatcher (look for static state). If AllWordsMatcher is stateful, you should also check that you're creating a new instance for each thread.
More generally, I'd look at all the instances involved in the matching/searching process, specifically at the working objects being used when multithreaded. From past experience, the problem usually lies there. (It's easy to spend too much time looking at the object graph representing your business data - Container/DataItem in this case.)
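A contrived sketch of the kind of shared state that produces exactly these symptoms (AllWordsMatcher's real code isn't shown, so the static field here is hypothetical):

public class AllWordsMatcher
{
    // BUG when used from multiple threads: every search writes its words into the
    // same static field, so a match in one thread can be evaluated against the
    // words of a completely different search.
    private static string[] currentWords;

    public void SetWords(string[] words)
    {
        currentWords = words;   // last writer wins, regardless of which thread is matching
    }

    public bool Matches(string text)
    {
        foreach (string word in currentWords)
        {
            if (!text.Contains(word))
                return false;
        }
        return true;
    }
}

// Fix: make the words an instance field set in the constructor, and give each
// DataItemUpdater (i.e. each thread) its own AllWordsMatcher instance.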
