I wanted to detect whether my code generates garbage, so I created the following unit test.
[TestClass]
public class AllocationTest
{
    int[] generationCollections = new int[3];

    [TestMethod]
    public void TestGarbageGeneration()
    {
        generationCollections[0] = GC.CollectionCount(0);
        generationCollections[1] = GC.CollectionCount(1);
        generationCollections[2] = GC.CollectionCount(2);

        // Test for garbage here

        for (int generation = 0; generation < generationCollections.Length; generation++)
        {
            Assert.AreEqual(GC.CollectionCount(generation), generationCollections[generation]);
        }
    }
}
I put the code in question where the "Test for garbage here" comment is and the results are unpredictable. My understanding is that this is due to the fact that GC runs on a separate thread and can be triggered by code other than my test at any time.
I tried calling GC.Collect to force collections before and after the test code, but then realized that GC.Collect itself always increments the collection count, so the test always fails.
Is there a meaningful way to test for garbage in a unit test?
You can use WMemoryProfiler to find out how many additional types were created. If you profile your own process you will get all additionally created types plus some instances used by WMemoryProfiler itself to generate the report.
You can work around this by using a separate process to monitor your managed heap, or by limiting yourself to only your own types. If you leak memory you will normally see it as additional instances created by you.
using (var dumper = new InProcessMemoryDumper(false, false))
{
    var statOld = dumper.GetMemoryStatistics();
    // allocation code here
    var diff = dumper.GetMemoryStatisticsDiff(statOld);
    foreach (var diffinst in diff.Where(d => d.InstanceCountDiff > 1))
    {
        Console.WriteLine("Added {0} {1}", diffinst.TypeName, diffinst.InstanceCountDiff);
    }
}
If you are after how much memory temporary objects used, you will need to use some profiling API, or tools like PerfView which use the ETL traces generated by the CLR. For GC you would need to programmatically enable specific events like this; I think the GCAllocationTick_V1 event would be interesting in your case as well.
If you do keep a reference to your object before you take the diff, you will get a pretty good understanding of how much memory your object graph consumes.
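For example, here is a minimal sketch reusing the InProcessMemoryDumper API shown above (MyBigObject is a hypothetical stand-in for your own root type):

using (var dumper = new InProcessMemoryDumper(false, false))
{
    var statOld = dumper.GetMemoryStatistics();

    // MyBigObject is hypothetical; build the object graph you want to measure here.
    var root = new MyBigObject();

    var diff = dumper.GetMemoryStatisticsDiff(statOld);
    foreach (var d in diff.Where(x => x.InstanceCountDiff > 0))
        Console.WriteLine("Added {0} {1}", d.TypeName, d.InstanceCountDiff);

    // Keep the root alive so its whole graph is still on the heap when the diff is taken.
    GC.KeepAlive(root);
}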
What you can try is to use exactly the same logic to dump the GC state before actually asserting, like:
// do some logic
// GC.Collect, Thread.Sleep, ...
currentCollections[0] = GC.CollectionCount(0);
currentCollections[1] = GC.CollectionCount(1);
currentCollections[2] = GC.CollectionCount(2);
and after that do the asserts with these dumped values (by the way, in Assert.AreEqual the first parameter is the expected value and the second is the actual one):
for (int generation = 0; generation < generationCollections.Length; generation++)
{
    Assert.AreEqual(generationCollections[generation], currentCollections[generation]);
}
So this may work for most cases, but there is no way to make the GC do something - you can only ask it to do something, and then wait and hope...
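Putting both snippets together, the whole test might look like this (a sketch; the GC.Collect/WaitForPendingFinalizers warm-up is the best-effort settling discussed above, not a guarantee):

[TestMethod]
public void TestGarbageGeneration()
{
    // Best effort: settle the heap before taking the baseline.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    var before = new int[3];
    for (int g = 0; g < before.Length; g++)
        before[g] = GC.CollectionCount(g);

    // Code under test goes here.

    var after = new int[3];
    for (int g = 0; g < after.Length; g++)
        after[g] = GC.CollectionCount(g);

    for (int g = 0; g < before.Length; g++)
        Assert.AreEqual(before[g], after[g], "Collections occurred in generation " + g);
}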
Following the 8th step of this post, I wrote the following simple unit test to make sure my TestClass doesn't cause a memory leak:
private class TestClass
{
}

[TestMethod]
public void MemoryLeakTest()
{
    var testObj = new TestClass();
    var weakRef = new WeakReference(testObj);
    testObj = null;
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    Assert.IsFalse(weakRef.IsAlive);
}
The test passes in the master branch of my code repository, but when I run it in another feature branch, it fails. So I have two questions:
Is this method a reliable way to detect memory leaks in my classes?
What conditions could cause this test to pass in one branch and fail in another?
There are very few cases where this test would do something useful. All of them involve the constructor of TestClass doing some kind of registration on a global variable (either registering itself with a global event or storing the this pointer in a static member). Since that is something that's very rarely done, running the above test for all classes is overkill. Also, the test does not cover the far more typical cause of a memory leak in C#: building up a data structure (e.g. a List) and never cleaning it up.
This test may fail for a number of reasons: GC.Collect() does not necessarily force all garbage to be cleaned up. It should, but there's no guarantee that it always happens. Also, since testObj is a local variable, it is not yet out of scope when you call GC.Collect(), so depending on how the code is optimized, the variable may not be considered garbage yet, which makes the test fail.
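One common way around the local-variable problem (my suggestion, not from the original post) is to create the object inside a separate method that the jitter is told not to inline, so that no live local can keep it reachable when the collection runs:

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.NoInlining)]
private static WeakReference CreateWeakRef()
{
    // The only strong reference to the TestClass instance dies when this method returns.
    return new WeakReference(new TestClass());
}

[TestMethod]
public void MemoryLeakTest()
{
    WeakReference weakRef = CreateWeakRef();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    Assert.IsFalse(weakRef.IsAlive);
}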
After reading so much about how to do it, I'm quite confused.
So here is what I want to do:
I have a data structure/object that holds all kinds of information. I treat the data structure as if it were immutable. Whenever I need to update information, I make a deep copy and apply the changes to it. Then I swap the old and the newly created object.
Now I don't know how to do everything right.
Let's look at it from the side of the reader/consumer threads.
MyObj temp = dataSource;
var a = temp.a;
... // many instructions
var b = temp.b;
....
As I understand it, reading references is atomic. So I don't need any volatile or locking to assign the current reference of dataSource to temp. But what about garbage collection? As I understand it, the GC has some kind of reference counter to know when to free memory. So when another thread updates dataSource at exactly the moment dataSource is being assigned to temp, does the GC increase the counter on the right memory block?
The other thing is compiler/CLR optimization. I assign dataSource to temp and use temp to access data members. What does the CLR do? Does it really make a copy of the reference, or does the optimizer just use dataSource to access .a and .b? Let's assume that between temp.a and temp.b there are lots of instructions, so that the reference in temp/dataSource cannot be held in a CPU register. Is temp.b really temp.b, or is it optimized to dataSource.b because the copy to temp can be optimized away? This is especially important if another thread updates dataSource to point to a new object.
Do I really need volatile, lock, ReadWriterLockSlim, Thread.MemoryBarrier or something else?
The important thing to me is that I want to make sure that temp.a and temp.b access the old datastructure even when another thread updates dataSource to another newly created data structure. I never change data inside an existing structure. Updates are always done by creating a copy, updating data and then updating the reference to the new copy of the datastructre.
Maybe one more question. If I don't use volatile, how long does it take until all cores on all CPUs see the updated reference?
When it comes to volatile please have a look here: When should the volatile keyword be used in C#?
I have done a little test program:
namespace test1 {
    public partial class Form1 : Form {
        public Form1() { InitializeComponent(); }

        Something sharedObj = new Something();

        private void button1_Click(object sender, EventArgs e) {
            Thread t = new Thread(Do); // Kick off a new thread
            t.Start();                 // running Do()
            for (int i = 0; i < 1000; i++) {
                Something reference = sharedObj;
                int x = reference.x; // sharedObj.x;
                System.Threading.Thread.Sleep(1);
                int y = reference.y; // sharedObj.y;
                if (x != y) {
                    button1.Text = x.ToString() + "/" + y.ToString();
                    Update();
                }
            }
        }

        private void Do() {
            for (int i = 0; i < 1000000; i++) {
                Something someNewObject = sharedObj.Clone(); // clone from immutable
                someNewObject.Do();
                sharedObj = someNewObject; // atomic
            }
        }
    }

    class Something {
        public Something Clone() { return (Something)MemberwiseClone(); }
        public void Do() { x++; System.Threading.Thread.Sleep(0); y++; }
        public int x = 0;
        public int y = 0;
    }
}
In button1_Click there is a for loop, and inside the loop I access the data structure/object once using sharedObj directly and once using a temporarily created reference. Using the reference is enough to make sure that x and y are initialized with values from the same object.
The only thing I don't understand is: why is "Something reference = sharedObj;" not optimized away, and "int x = reference.x;" not replaced by "int x = sharedObj.x;"?
How do the compiler and jitter know not to optimize this? Or are temporary references never optimized away in C#?
But most important: Is my example running as intended because it is correct or is it running as intended only by accident?
As I understand it, reading references is atomic.
Correct. This is a very limited property though. It means reading a reference will work; you'll never get the bits of half an old reference mixed with the bits of half a new reference resulting in a reference that doesn't work. If there's a concurrent change it promises nothing about whether you get the old or the new reference (what would such a promise even mean?)
So I don't need any volatile or locking to assign the current reference of dataSource to temp.
Maybe, though there are cases where this can have problems.
But what about garbage collection? As I understand it, the GC has some kind of reference counter to know when to free memory.
Incorrect. There is no reference counting in .NET garbage collection.
If there is a static reference to an object, then it is not eligible for reclamation.
If there is an active local reference to an object, then it is not eligible for reclamation.
If there is a reference to an object in a field of an object that is not eligible for reclamation, then it too is not eligible for reclamation, recursively.
There's no counting here. Either there is an active strong reference prohibiting reclamation, or there isn't.
This has a great many very important implications. Of relevance here is that there can never be any incorrect reference counting, since there is no reference counting. Strong references will not disappear under your feet.
The other thing is compiler/CLR optimization. I assign dataSource to temp and use temp to access data members. What does the CLR do? Does it really make a copy of dataSource or does the optimizer just use dataSource to access .a and .b?
That depends on what dataSource and temp are as far as whether they are local or not, and how they are used.
If dataSource and temp are both local, then it is perfectly possible that either the compiler or the jitter would optimise the assignment away. If they are both local though, they are both local to the same thread, so this isn't going to impact multi-threaded use.
If dataSource is a field (static or instance), then since temp is definitely a local in the code shown (because it's initialised in the code fragment shown) the assignment cannot be optimised away. For one thing, grabbing a local copy of a field is itself a possible optimisation, since it's faster to do several operations on a local reference than to continually access a field or static. There's not much point having a compiler or jitter "optimisation" that just makes things slower.
Consider what actually happens if you were to not use temp:
var a = dataSource.a;
... // many instructions
var b = dataSource.b;
To access dataSource.a the code must first obtain a reference to dataSource and then access a. Afterwards it obtains a reference to dataSource and then accesses b.
Optimising by not using a local makes no sense, since there's going to be an implicit local anyway.
And there is the simple fact that the fear you have is something that has been considered: after temp = dataSource there's no assumption that temp == dataSource, because there could be other threads changing dataSource, so it's not valid to make optimisations predicated on temp == dataSource.*
Really the optimisations you are concerned about are either not relevant or not valid and hence not going to happen.
There is a case that could cause you problems though. It is just about possible for a thread running on one core not to see a change to dataSource made by a thread running on another core. As such if you have:
/* Thread A */
dataSource = null;
/* Some time has passed */
/* Thread B */
var isNull = dataSource == null;
Then there's no guarantee that just because Thread A had finished setting dataSource to null, that Thread B would see this. Ever.
The memory models in use in .NET itself and in the processors .NET generally runs on (x86 and x86-64) would prevent that happening, but in terms of possible future optimisations, this is something that's possible. You need memory barriers to ensure that Thread A's publishing definitely affects Thread B's reading. lock and volatile are both ways to ensure that.
*One doesn't even need to be multi-threaded for this to not follow, though it is possible to prove in particular cases that there are no single-thread changes that would break that assumption. That doesn't really matter though, because the multi-threaded case still applies.
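If you do want that guarantee, the lightest-weight fix is to mark the shared field volatile. A minimal sketch, reusing the questioner's MyObj/dataSource names (the field types and the Holder class are assumptions for illustration):

class MyObj
{
    public int a;
    public int b;
}

class Holder
{
    // volatile guarantees that a write by one thread becomes visible to
    // subsequent reads on other threads, and prevents the read below from
    // being reordered or silently re-fetched from the field.
    private volatile MyObj dataSource = new MyObj();

    public void Update(MyObj newData)
    {
        dataSource = newData;      // a single volatile write publishes the new object
    }

    public void Read()
    {
        MyObj temp = dataSource;   // one volatile read; temp is then stable
        var a = temp.a;            // both member reads are guaranteed to come
        var b = temp.b;            // from the same object, old or new
    }
}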
Let's think of it as a family tree: a father has kids, those kids have kids, those kids have kids, etc...
So I have a recursive function that gets the father, uses recursion to get the children, and for now just prints them to the debug output window... But at some point (after one hour of letting it run and printing about 26,000 rows) it gives me a StackOverflowException.
So am I really running out of memory? If so, shouldn't I get an OutOfMemoryException instead? In other posts I found people saying that if the number of recursive calls is too large, you can still get a StackOverflowException...
Anyway, my first thought was to break the tree into smaller sub-trees. I know for a fact that my root father always has these five kids, so instead of calling my method once with the root passed to it, I call it five times with the kids of the root passed to it. I think that helped... but one of the sub-trees is still so big (26,000 rows when it crashes) that I still have this issue.
How about application domains and creating new processes at run time at a certain level of depth? Does that help?
How about creating my own Stack and using that instead of recursive methods? Does that help?
Here is also a high-level version of my code. Please take a look; maybe there is actually something silly in it that causes the StackOverflowException:
private void MyLoadMethod(string conceptCKI)
{
    // make some script calls to DB, so that moTargetConceptList2 will have Concept-Relations for the current node.
    // when this is zero, it means it's a leaf.
    int numberofKids = moTargetConceptList2.ConceptReltns.Count();
    if (numberofKids == 0)
        return;
    for (int i = 1; i <= numberofKids; i++)
    {
        oUCMRConceptReltn = moTargetConceptList2.ConceptReltns.get_ItemByIndex(i, false);
        // Get the concept linked to the relation concept
        if (oUCMRConceptReltn.SourceCKI == conceptCKI)
        {
            oConcept = moTargetConceptList2.ItemByKeyConceptCKI(oUCMRConceptReltn.TargetCKI, false);
        }
        else
        {
            oConcept = moTargetConceptList2.ItemByKeyConceptCKI(oUCMRConceptReltn.SourceCKI, false);
        }
        //builder.AppendLine("\t" + oConcept.PrimaryCTerm.SourceString);
        Debug.WriteLine(oConcept.PrimaryCTerm.SourceString);
        MyLoadMethod(oConcept.ConceptCKI);
    }
}
How about creating my own Stack and using that instead of recursive methods? Does that help?
Yes!
When you instantiate a Stack<T> this will live on the heap and can grow arbitrarily large (until you run out of addressable memory).
If you use recursion you use the call stack. The call stack is much smaller than the heap. The default is 1 MB of call stack space per thread. Note this can be changed, but it's not advisable.
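As a sketch of the general shape (TreeNode, Name and Children are hypothetical stand-ins for whatever your node type actually exposes):

void Traverse(TreeNode root)
{
    // The explicit stack lives on the heap, so its depth is limited by
    // available memory rather than by the 1 MB call stack.
    var pending = new Stack<TreeNode>();
    pending.Push(root);
    while (pending.Count != 0)
    {
        TreeNode node = pending.Pop();
        Debug.WriteLine(node.Name);
        foreach (TreeNode child in node.Children)
            pending.Push(child);
    }
}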
StackOverflowException is quite different to OutOfMemoryException.
OOME means that there is no memory available to the process at all. This could be upon trying to create a new thread with a new stack, or in trying to create a new object on the heap (and a few other cases).
SOE means that the thread's stack - by default 1M, though it can be set differently in thread creation or if the executable has a different default; hence ASP.NET threads have 256k as a default rather than 1M - was exhausted. This could be upon calling a method, or allocating a local.
When you call a function (method or property), the arguments of the call are placed on the stack, the address the function should return to when it finishes is put on the stack, and then execution jumps to the function called. Then some locals will be placed on the stack. More may be placed on it as the function continues to execute. stackalloc will also explicitly use some stack space where otherwise heap allocation would be used.
Then it calls another function, and the same happens again. Then that function returns, and execution jumps back to the stored return address, and the pointer within the stack moves back up (no need to clean up the values placed on the stack, they're just ignored now) and that space is available again.
If you use up that 1M of space, you get a StackOverflowException. Because 1M (or even 256k) is a large amount of memory for this sort of use (we don't put really large objects on the stack), the four things that are likely to cause an SOE are:
1. Someone thought it would be a good idea to optimise by using stackalloc when it wasn't, and they used up that 1M fast.
2. Someone thought it would be a good idea to optimise by creating a thread with a smaller than usual stack when it wasn't, and they used up that tiny stack.
3. A recursive call (whether directly or through several steps) falls into an infinite loop.
4. The recursion wasn't quite infinite, but it was deep enough.
You've got case 4. 1 and 2 are quite rare (and you need to be quite deliberate to risk them). Case 3 is by far the most common, and indicates a bug in that the recursion shouldn't be infinite, but a mistake means it is.
Ironically, in this case you should be glad you took the recursive approach rather than iterative - the SOE reveals the bug and where it is, while with an iterative approach you'd probably have an infinite loop bringing everything to a halt, and that can be harder to find.
Now for case 4, we've got two options. In the very very rare cases where we've got just slightly too many calls, we can run it on a thread with a larger stack. This doesn't apply to you.
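For completeness, the larger-stack option just means passing a maximum stack size to the Thread constructor; a sketch (the 16 MB figure and rootConceptCKI are illustrative):

// Run the recursion on a thread with a 16 MB stack instead of the default 1 MB.
Thread worker = new Thread(() => MyLoadMethod(rootConceptCKI), 16 * 1024 * 1024);
worker.Start();
worker.Join();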
Instead, you need to change from a recursive approach to an iterative one. Most of the time this isn't very hard, though it can be fiddly. Instead of calling itself again, the method uses a loop. For example, consider the classic teaching example of a factorial method:
private static int Fac(int n)
{
    return n <= 1 ? 1 : n * Fac(n - 1);
}
Instead of using recursion we loop in the same method:
private static int Fac(int n)
{
    int ret = 1;
    for (int i = 1; i <= n; ++i)
        ret *= i;
    return ret;
}
You can see why less stack space is used here. The iterative version will also be faster 99% of the time. Now, imagine we accidentally call Fac(n) instead of Fac(n - 1) in the first, and leave out the ++i in the second - the equivalent bug in each - and it causes an SOE in the first and a program that never stops in the second.
For the sort of code you're talking about, where you keep producing more and more results as you go based on previous results, you can place the results you've got in a data structure (Queue<T> and Stack<T> both serve well for a lot of cases), so the code becomes something like:
private void MyLoadMethod(string firstConceptCKI)
{
    Queue<string> pendingItems = new Queue<string>();
    pendingItems.Enqueue(firstConceptCKI);
    while (pendingItems.Count != 0)
    {
        string conceptCKI = pendingItems.Dequeue();
        // make some script calls to DB, so that moTargetConceptList2 will have Concept-Relations for the current node.
        // when this is zero, it means it's a leaf.
        int numberofKids = moTargetConceptList2.ConceptReltns.Count();
        for (int i = 1; i <= numberofKids; i++)
        {
            oUCMRConceptReltn = moTargetConceptList2.ConceptReltns.get_ItemByIndex(i, false);
            // Get the concept linked to the relation concept
            if (oUCMRConceptReltn.SourceCKI == conceptCKI)
            {
                oConcept = moTargetConceptList2.ItemByKeyConceptCKI(oUCMRConceptReltn.TargetCKI, false);
            }
            else
            {
                oConcept = moTargetConceptList2.ItemByKeyConceptCKI(oUCMRConceptReltn.SourceCKI, false);
            }
            //builder.AppendLine("\t" + oConcept.PrimaryCTerm.SourceString);
            Debug.WriteLine(oConcept.PrimaryCTerm.SourceString);
            pendingItems.Enqueue(oConcept.ConceptCKI);
        }
    }
}
(I haven't completely checked this, just added the queuing instead of recursing to the code in your question).
This should then do more or less the same as your code, but iteratively. Hopefully that means it'll work. Note that there is a possible infinite loop in this code if the data you are retrieving has a cycle; in that case this code will throw an exception when it fills the queue with far too much stuff to cope with. You can either debug the source data, or use a HashSet to avoid enqueuing items that have already been processed.
Edit: Better add how to use a HashSet to catch duplicates. First set up a HashSet, this could just be:
HashSet<string> seen = new HashSet<string>();
Or if the strings are used case-insensitively, you'd be better with:
HashSet<string> seen = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase); // or StringComparer.CurrentCultureIgnoreCase if that's closer to how the string is used in the rest of the code.
Then, before you go to use the string (or perhaps before you go to add it to the queue), you have one of the following:
If duplicate strings shouldn't happen:
if (!seen.Add(conceptCKI))
    throw new InvalidOperationException("Attempt to use \"" + conceptCKI + "\" which was already seen.");
Or if duplicate strings are valid, and we just want to skip performing the second call:
if (!seen.Add(conceptCKI))
    continue; // skip rest of loop, and move on to the next one.
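In the queued version above, the skip-duplicates form would sit just before the Enqueue call:

// Only queue a concept the first time we see it; HashSet<T>.Add returns
// false if the set already contained the string.
if (seen.Add(oConcept.ConceptCKI))
    pendingItems.Enqueue(oConcept.ConceptCKI);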
I think you have a cycle in your recursion (infinite recursion), not a genuinely too-deep tree. Even if you gave the stack more memory, you would still get the overflow.
To test for it:
Declare a field for storing the objects currently being processed:
private Dictionary<int, object> _operableIds = new Dictionary<int, object>();
...
private void Start()
{
    _operableIds.Clear();
    Recursion(start_id);
}
...
private void Recursion(int object_id)
{
    if (_operableIds.ContainsKey(object_id))
        throw new Exception("Have a ring!");
    else
        _operableIds.Add(object_id, null /* or the object */);
    ...
    Recursion(other_id);
    ...
    _operableIds.Remove(object_id);
}
Should if statements be used to assist in the stack's memory de-allocation?
Example A:
var objectHolder = new ObjectHolder();
if (true)
{
    List<DefinedObject> objectList;
    using (var sr = new GenericStreamReader<DefinedObject>())
    {
        objectList = sr.Get().ToList();
    }
    if (true)
    {
        var DOF = new DefinedObjectFactory();
        objectHolder.DefinedObjects = DOF.DefineObjects(objectList);
    }
}
//example endpoint
//example endpoint
Example B:
var objectHolder = new ObjectHolder();
List<DefinedObject> objectList;
using (var sr = new GenericStreamReader<DefinedObject>())
{
    objectList = sr.Get().ToList();
}
var DOF = new DefinedObjectFactory();
objectHolder.DefinedObjects = DOF.DefineObjects(objectList);
//example endpoint
Will Example A have a lighter footprint on the stack when the example endpoint is reached than Example B does?
First off, the whole point of a stack-based allocation system is precisely that you do not need to optimize it in any way. Don't worry about it. The jitter is perfectly capable of realizing that a local will never be read or written again, and re-using its storage if it feels that's the best thing to do. Let the jitter do its job; it doesn't need your help. (*)
Rather, write your program so that local variables make sense to the reader. That's what you should be optimizing for.
Finally, there is never a need to say "if (true) { }" to introduce a new scope. Just introduce a new scope. It is perfectly legal to say:
void M()
{
    { // new scope
    }
    { // another one
    }
}
(*) There is a situation where the jitter needs your help, and that is the situation where a local refers to an object on the heap that contains a resource that is going to be used by unmanaged code. The jitter does not know that unmanaged code is going to use the object's resources, and might decide that no one is using this object any more and clean it up early. The finalizer of the object might then release the resource on the finalizer thread while the unmanaged code is using the resource! An object is not guaranteed to stay alive just because a local variable is holding onto it. If the local variable is never read from again then the jitter can re-use its storage and tell the garbage collector that it is safe to collect the referred-to object, which will then potentially crash the unmanaged code. You can use KeepAlive to hint to the jitter that a particular object needs to remain alive and not be optimized away.
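A minimal sketch of that hint (NativeResourceHolder, RawHandle and CallIntoUnmanagedCode are hypothetical names for illustration):

// Hypothetical wrapper whose finalizer frees a native resource.
var holder = new NativeResourceHolder();

// Unmanaged code uses the raw handle; the jitter cannot see this dependency.
CallIntoUnmanagedCode(holder.RawHandle);

// Without this, the jitter may treat holder as dead before the call finishes,
// allowing the finalizer to free the resource while it is still in use.
GC.KeepAlive(holder);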
if (true) will be compiled out in an optimized build (the only kind of build where a variable's lifespan is shorter than the whole method), so there is absolutely no difference between the two versions you've suggested.
Based on the usage, I'm assuming DefinedObjectFactory is a class, not a struct. Therefore, the only thing that's on the stack is a reference to DefinedObjectFactory. The actual object is on the heap, and is controlled by the garbage collector.
The only stack space you're potentially saving is the space for a single pointer, so it's not worth it.
Even if this does make some difference, I think it's very likely that you're worrying about the wrong things. Is stack space allocation really an issue for your app?
In general, doing "clever" things in code for the sake of micro-optimizations is usually not worth it. It's usually a much better idea to write your code in the most clean and straightforward way possible. After doing that, if you find you actually have some perf/scalability problems (based on doing actual measurements), you can choose to rewrite/optimize the parts that are bottlenecks.
Most times you'll find that the clean/straightforward/readable version of the code performs just fine. And if it doesn't, the problems are probably not in the places you thought they were.
There's only one thing you can guarantee about de-allocation and garbage collection - it will happen at some stage.
As other people have said, the only thing your if(true) will achieve is being optimised out by the Jitter.
You're already using the using (...) { } pattern, so instead of the if (true) block I'd refactor your code from this:
if (true)
{
    var DOF = new DefinedObjectFactory();
    objectHolder.DefinedObjects = DOF.DefineObjects(objectList);
}
To:
using (var DOF = new DefinedObjectFactory())
{
    objectHolder.DefinedObjects = DOF.DefineObjects(objectList);
}
And see if that helps.
There is one other thing that you can try, though it's absolutely not recommended for production code as you shouldn't pre-empt the memory manager, and that is to simply add a call to
GC.Collect();
When you exit your blocks.
I don't think it'll help but it might demonstrate to you why it's generally not worth worrying about scope and de-allocation.
I am experiencing a problem in my application with an OutOfMemoryException. My application searches for words within texts. When I start a long-running search of about 2000 different texts for about 2175 different words, the application terminates about 50% of the way through with an OutOfMemoryException (after about 6 hours of processing).
I have been trying to find the memory leak. I have an object graph like: (--> are references)
a static global application object (controller) --> an algorithm starter object --> text mining starter object --> text mining algorithm object (this object performs the searching).
The text mining starter object will start the text mining algorithm object's run()-method in a separate thread.
To try to fix the issue I have edited the code so that the text mining starter object will split the texts to search into several groups and initialize one text mining algorithm object for each group of texts sequentially (so when one text mining algorithm object is finished a new will be created to search the next group of texts). Here I set the previous text mining algorithm object to null. But this does not solve the issue.
When I create a new text mining algorithm object I have to give it some parameters. These are taken from properties of the previous text mining algorithm object before I set that object to null. Will this prevent garbage collection of the text mining algorithm object?
Here is the code for the creation of new text mining algorithm objects by the text mining algorithm starter:
private void RunSeveralAlgorithmObjects()
{
    IEnumerable<ILexiconEntry> currentEntries = allLexiconEntries.GetGroup(intCurrentAlgorithmObject, intNumberOfAlgorithmObjectsToUse);
    algorithm.LexiconEntries = currentEntries;
    algorithm.Run();
    intCurrentAlgorithmObject++;
    for (int i = 0; i < intNumberOfAlgorithmObjectsToUse - 1; i++)
    {
        algorithm = CreateNewAlgorithmObject();
        AddAlgorithmListeners();
        algorithm.Run();
        intCurrentAlgorithmObject++;
    }
}
private TextMiningAlgorithm CreateNewAlgorithmObject()
{
    TextMiningAlgorithm newAlg = new TextMiningAlgorithm();
    newAlg.SortedTermStruct = algorithm.SortedTermStruct;
    newAlg.PreprocessedSynonyms = algorithm.PreprocessedSynonyms;
    newAlg.DistanceMeasure = algorithm.DistanceMeasure;
    newAlg.HitComparerMethod = algorithm.HitComparerMethod;
    newAlg.LexiconEntries = allLexiconEntries.GetGroup(intCurrentAlgorithmObject, intNumberOfAlgorithmObjectsToUse);
    newAlg.MaxTermPercentageDeviation = algorithm.MaxTermPercentageDeviation;
    newAlg.MaxWordPercentageDeviation = algorithm.MaxWordPercentageDeviation;
    newAlg.MinWordsPercentageHit = algorithm.MinWordsPercentageHit;
    newAlg.NumberOfThreads = algorithm.NumberOfThreads;
    newAlg.PermutationType = algorithm.PermutationType;
    newAlg.RemoveStopWords = algorithm.RemoveStopWords;
    newAlg.RestrictPartialTextMatches = algorithm.RestrictPartialTextMatches;
    newAlg.Soundex = algorithm.Soundex;
    newAlg.Stemming = algorithm.Stemming;
    newAlg.StopWords = algorithm.StopWords;
    newAlg.Synonyms = algorithm.Synonyms;
    newAlg.Terms = algorithm.Terms;
    newAlg.UseSynonyms = algorithm.UseSynonyms;
    algorithm = null;
    return newAlg;
}
Here is the start of the thread that is running the whole search process:
// Run the algorithm in its own thread
Thread algorithmThread = new Thread(new ThreadStart(RunSeveralAlgorithmObjects));
algorithmThread.Start();
Can something here prevent the previous text mining algorithm object from being garbage collected?
I recommend first identifying what exactly is leaking. Then postulate a cause (such as references in event handlers).
To identify what is leaking:
Enable native debugging for the project. Properties -> Debug -> check Enable unmanaged code debugging.
Run the program. Since the memory leak is probably gradual, you probably don't need to let it run the whole 6 hours; just let it run for a while and then Debug -> Break All.
Bring up the Immediate window. Debug -> Windows -> Immediate
Type one of the following into the immediate window, depending on whether you're running 32 or 64 bit, .NET 2.0/3.0/3.5 or .NET 4.0:
.load C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll for 32-bit .NET 2.0-3.5
.load C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319\sos.dll for 32-bit .NET 4.0
.load C:\WINDOWS\Microsoft.NET\Framework64\v2.0.50727\sos.dll for 64-bit .NET 2.0-3.5
.load C:\WINDOWS\Microsoft.NET\Framework64\v4.0.30319\sos.dll for 64-bit .NET 4.0
You can now run SoS commands in the Immediate window. I recommend checking the output of !dumpheap -stat, and if that doesn't pinpoint the problem, check !finalizequeue.
Notes:
Running the program the first time after enabling native debugging may take a long time (minutes) if you have VS set up to load symbols.
The debugger commands that I recommended both start with ! (exclamation point).
These instructions are courtesy of the incredible Tess from Microsoft, and Mario Hewardt, author of Advanced .NET Debugging.
Once you've identified the leak in terms of which object is leaking, then postulate a cause and implement a fix. Then you can do these steps again to determine for sure whether or not the fix worked.
1) As I said in a comment, if you use events in your code (AddAlgorithmListeners makes me suspect this), subscribing to an event can create a "hidden" dependency between objects which is easily forgotten. This dependency can mean that an object is not freed, because someone is still listening to one of its events. Make sure you unsubscribe from all events when you no longer need to listen to them.
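A sketch of what that looks like (the Finished event and handler name are made up for illustration):

algorithm = CreateNewAlgorithmObject();
algorithm.Finished += OnAlgorithmFinished;   // the delegate now links the two objects
algorithm.Run();
algorithm.Finished -= OnAlgorithmFinished;   // break the link before dropping the reference
algorithm = null;                            // now nothing keeps the old object reachable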
2) Also, I'd like to point you to one (probably not-so-off-topic) issue with your code:
private void RunSeveralAlgorithmObjects()
{
    ...
    algorithm.LexiconEntries = currentEntries;
    // ^ when/where is algorithm initialized?
    for (...)
    {
        algorithm = CreateNewAlgorithmObject();
        ....
    }
}
Is algorithm already initialized when this method is invoked? Otherwise, setting algorithm.LexiconEntries wouldn't seem like a valid thing to do. It also means your method depends on some external state, which to me looks like a potential place for bugs to creep into your program logic.
If I understand it correctly, this object contains some state describing the algorithm, and CreateNewAlgorithmObject derives a new state for algorithm from the current state. If this were my code, I would make algorithm an explicit parameter to all of these functions, as a signal that the method depends on this object. It would then no longer be hidden "external" state upon which your functions depend.
P.S.: If you don't want to go down that route, the other thing you could consider to make your code more consistent is to turn CreateNewAlgorithmObject into a void method and re-assign algorithm directly inside that method.
Is AddAlgorithmListeners attaching event handlers to events exposed by the algorithm object? Are the listening objects living longer than the algorithm object? In that case they can continue to keep the algorithm object from being collected.
If yes, try unsubscribing events before you let the object go out of scope.
for (int i = 0; i < intNumberOfAlgorithmObjectsToUse - 1; i++)
{
    algorithm = CreateNewAlgorithmObject();
    AddAlgorithmListeners();
    algorithm.Run();
    RemoveAlgorithmListeners(); // See if this fixes your issue.
    intCurrentAlgorithmObject++;
}
My suspicion is AddAlgorithmListeners(); are you sure you remove the listeners once execution has completed?
Is the IEnumerable returned by GetGroup() throw-away or cached? That is, does it hold onto the objects it has emitted? If it does, it would obviously grow linearly with each iteration.
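The two shapes look like this (a sketch; I don't know GetGroup's real implementation, so ReadEntries is a made-up helper):

// Throw-away: each entry becomes unreachable as soon as the consumer moves on.
IEnumerable<ILexiconEntry> GetGroupLazy(int group, int size)
{
    foreach (ILexiconEntry entry in ReadEntries(group, size))
        yield return entry;
}

// Cached: the backing list keeps every emitted entry alive, so memory
// grows linearly with each group that is processed.
private readonly List<ILexiconEntry> _cache = new List<ILexiconEntry>();

IEnumerable<ILexiconEntry> GetGroupCached(int group, int size)
{
    _cache.AddRange(ReadEntries(group, size));
    return _cache;
}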
Memory profiling is useful here - have you tried examining the application with a profiler? I found Red Gate's useful in the past (it's not free, but it has an evaluation version, IIRC).