C# garbage collection - c#

I have a business app that I have written, that effectively recurses through a directory structure looking for specific Excel files, and stores their addresses. It then loops through these files and parses them by creating a DocumentParser object for each file, this is done one at a time, and not async. The software seems to be very stable, so much so that the business would like to run it to recurse through a massive directory containing upwards of 10000 relevant Excel files.
My question is, as I am creating a new DocumentParser object each time, will the GC be effective enough to discard each of the objects when they go out of scope, ie when that Excel sheet has been parsed, or is there a way I can monitor this and where necessary manually do a GC? I've never had to deal with such large amounts of data before, generally only testing it on a maximum of 40-50 Excel files at a time.
Thanks.

The GC is a very complex piece of software. And the GC is at least the only one that knows when garbage collection is necessary. So my advice is to leave the GC on it's own.
Additionally: The GC will handle these masses objects. Perhaps you will recognize a decrease of performance. If this is a problem you can try to optimize your code. But not premature.

I would leave the GC to its business. 10,000 objects is not really much work for the GC. And it's likely the cost of the GC work will be much lower than the cost of the Excel work. So it's not worth complicating your design to tweak things for the GC. If you end up with so many files to process that your application can't finish in time, it's most likely going to be the speed of the Excel processing holding you up.
However one note which may be relevant: if the DocumentParser is using unmanaged memory in its work with the Excel file, you can use GC.Add/RemoveMemoryPressure to indicate to the GC the real added cost when opening the file. If you didn't write the DocumentParser yourself, the author may already be doing this.
The issue here is that you may have a managed object that costs something in the order of 100 bytes, which allocates a large amount of unmanaged memory when it does Excel work. The GC will have no way of knowing this, so these methods help notify the GC that there is more memory pressure than it was aware of. This may change its behaviour in how/when it decides to collect, which may lead to the application maintaining a lower memory footprint. If the application's memory usage balloons out over time, then you may start seeing some slow downs from length garbage collection and possibly paging on the machine (depending on how much memory you have). You'll want to keep an eye on its memory usage to make sure it's not leaking memory as it processes - a memory profiler may be helpful there.

You don't need to manually call the GC unless you are holding some very large resource which is not the case in your situation. The GC will tweak itself with every call and if you call it manually you will just disrupt its internal profiling data.
BTW GC can collect stuff not only when it goes out of scope but also after its last usage (i.e. while it is still in scope but the variable is not used anymore).

Yes and no - The GC is effective enough to release when it needs to, but you can't generally be sure when that is.
There is a way to force a GC collection but it's generally considered to be bad practise in production code because of the effects of forcing a stack walk when it's not required is worse then using a bit of extra memory until the GC decides it needs to free resources to allocate more objects.

Related

When GC.Collect() is called and frees up more than 3GB of space, is this necessarily a good thing?

I've read many questions on this network in regards to when one should use GC.Collect(), but none address this concern.
If we somehow manage to use it, as such as it frees a large number of memory, is this necessarily a good thing, rather than simply relying on the Garbage Collector to manage the memory?
Regardless of how idle my ASP.NET application stays, or how much time passes, or operations are done, the memory that gets freed when explicitly calling GC.Collect() is never addressed (only memory on top of it is, like it's some kind of a cache.)
One theory I have in mind, is that perhaps the reason those 3GB+ are not freed (but rather the application frees memory above that threshold) might be due to reusing parts of it, or maybe some other helpful optimization, at the cost of, of course, using that memory throughout the run-time of the application.
Could anyone share some insight on these concerns? I'm following best practices and disposing objects whenever given the chance. No unmanaged resource is left to be finalized - it's always disposed by me.
EDIT: The Garbage Collector used is the workstation one, not the server.
The garbage collector's job is to simulate a machine with infinite memory. If your system has enough unallocated memory to continue functioning, why should the collector spend CPU cycles performing the hard work of identifying unreferenced objects? That's a performance cost with zero benefit.
If left alone, the collector will run and free those 3GB when the system starts to run low on memory. Until that point it is dormant.

Simple algorithm to determine when to free some memory .Net

Our system keeps hold of lots of large objects for performance. However, when running low on memory, we want to drop some of the objects. The objects are prioritized, so I know which ones to drop. Is there a simple way of determining when to free memory? Also, dropping 1 object may not be enough, so I guess I need a loop to drop, check, drop again if necessary, etc. But in c#, I won't necessarily see the effect immediately of dropping an object, so how do I avoid kicking too much stuff out?
I guess it's just a simple function of used vs total physical & virtual memory. But what function?
Edit: Some clarifications
"Large objects" was misleading. I meant logical "package" of objects (the objects should be small enough individually to avoid the LOB - that's the intention certainly) that together are large (~ 100MB?)
A request can come in which requires the use of one such package. If it is in memory, the response is rapid. If not, it needs to be reconstructed, which is very slow. So I want to keep stuff in memory as long as possible, but can ditch the least requested ones when necessary.
We have no sensible way to serialize these packages. We should probably do that, but it's a lot of work and there's a lot of resistance to doing so.
Our original simple approach is to periodically compare the following to a configurable threshold.
var c = new ComputerInfo();
return c.AvailablePhysicalMemory / c.TotalPhysicalMemory;
There're a lot of different topics on this questions and I think is best to clarify them before actually answering.
First of, you say your app does get a hold of a lot of "large objects". Define large object. Anything larger than about 85K goes into the LOH which only gets collected as part of a generation 2 collection (the most expensive of them all), anything smaller than that, even if you think is a "big" object, is not and it's treated as any other kind of object.
Secondly there're two problems in terms of "managing memory"
One is managing the amount of space you're using inside your virtual memory space. That is, in 32 bit systems making sure you can address all the memory you're asking for, which in Windows 32 bit uses to be around 1,5 GB.
Secondly is managing disposing of that memory when it's needed, which is a part of the garbage collector work so that it triggers when there's a shortage on memory (although that doesn't mean you can't get an OutOfMemoryException if you don't give the GC time enough to do its job).
With that said, I think you should forget about taking the place of the GC... just let it do its job and, if you're worried then find the critical paths that may fail (on memory request) and protect yourself against OutOfMemoryExceptions.
There're a lot of different patterns for handling the case you're posting and most of them really depend on your business scenario. One example is having a state machine that can actually go to an "OutOfMemory" state, in which case the system switches to freeing memory before doing anything else (that includes disposing old objects and invoking the GC to clean everything up, all while you patiently wait for it to happen).
Other techniques involve saving the data to the disk and then manually swapping in and out objects based on some algorithm when you reach certain levels. That means stopping all your threads (or some, depending on business) and moving the data back and forth.
If your large objects are all controlled in terms of location you can also declare a facade over their creation, so that the facade can check whether it needs to free objects or not based on the amount of memory (virtual memory) your process is using. BTW, use the PerformanceInfo API call as quoted in the other answer as this will include the amount of memory used by unmanaged code, which is, nonetheless, located inside the virtual memory space of your process.
Don't worry too much about "real" memory, as the operating system will make sure the most appropriate pages are located in memory.
Then there're hundreds of other optimizations that completely depend on your business scenario. For example databases "know" to bring data to memory depending on the query and predicting the data you're going to use in advance so the data is ready and they do remove objects that are not used... but that's another topic.
Edit: Based on your edits to the question.
Checking memory in the facade will not add a significant overhead in terms of performance.
If you start getting low on memory you should take a decision of how many objects / how much space are you going to free. Don't do it one at a time, take a bunch of them and free enough memory so that you don't have to collect again.
If you go with the previous approach you can service the request after you've freed enough space and continue cleaning in background.
One of the fastest ways of handling memory / disk swapping is by using memory mapped files.
Use GC.GetTotalMemory and if this exceeds your expectation then you can nullify the objects that you want to release and call GC.Collect.
Have a look at the accepted answer to this question. It uses the GetPerformanceInfo Windows API to determine memory consumption of all sorts. Task Manager is using the same information. This should help you writing a class that observes memory consumption periodically.
Once memory runs low you can fill a FIFO queue with soon-to-be deleted tasks.
The observer will delete the first object in the queue and maybe call GCCollect manually, I'm not too sure about this.
Give the collection some time before you recheck the mem consumption for your application. If there is still not enough free mem, delete the next object from the queue and so on...

Undesirable Garbage Collection

In a title "Forcing a Garbage Colection" from book "C# 2010 and the .NET 4 Platform" by Andrew Troelsen written:
"Again, the whole purpose of the .NET garbage collector is to manage memory on our behalf. However, in some very rare circumstances, it may be beneficial to programmatically force a garbage collection using GC.Collect(). Specifically:
• Your application is about to enter into a block of code that you don’t want interrupted by a possible garbage collection.
...
"
But stop! Is there a such case when Garbage Collection is undesirable? I never saw/read something like that (because of my little development experience of course). If while your practice you have done something like that, please share. For me it's very interesting point.
Thank you!
Yes, there's absolutely a case when garbage collection is undesirable: when a user is waiting for something to happen, and they have to wait longer because the code can't proceed until garbage collection has completed.
That's Troelsen's point: if you have a specific point where you know a GC isn't problematic and is likely to be able to collect significant amounts of garbage then it may be a good idea to provoke it then, to avoid it triggering at a less opportune moment.
I run a recipe related website, and I store a massive graph of recipes and their ingredient usage in memory. Due to the way I pivot this information for quick access, I have to load several gigs of data into memory when the application loads before I can organize the data into a very optimized graph. I create a huge amount of tiny objects on the heap that, once the graph is built, become unreachable.
This is all done when the web application loads, and probably takes 4-5 seconds to do. After I do so, I call GC.Collect(); because I'd rather re-claim all that memory now rather than potentially block all threads during an incoming HTTP request while the garbage collector is freaking out cleaning up all these short lived objects. I also figure it's better to clean up now since the heap is probably less fragmented at this time, since my app hasn't really done anything else so far. Delaying this might result in many more objects being created, and the heap needing to be compressed more when GC runs automatically.
Other than that, in my 12 years of .NET programming, I've never come across a situation where I wanted to force the garbage collector to run.
The recommendation is that you should not explicitly call Collect in your code. Can you find circumstances where it's useful?
Others have detailed some, and there are no doubt more. The first thing to understand though, is don't do it. It's a last resort, investigate other options, learn how GC works look at how your code is impacted, follow best practices for your designs.
Calling Collect at the wrong point will make your performance worse. Worse still, to rely on it makes your code very fragile. The rare conditions required to make a call to Collect beneficial, or at last not harmful, can be utterly undone with a simple change to the code, which will result unexpected OOMs, sluggish performamnce and such.
I call it before performance measurements so that the GC doesn't falsify the results.
Another situation are unit-tests testing for memory leaks:
object doesItLeak = /*...*/; //The object you want to have tested
WeakReference reference = new WeakRefrence(doesItLeak);
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
Assert.That(!reference.IsAlive);
Besides those, I did not encounter a situation in which it would actually be helpful.
Especially in production code, GC.Collect should never be found IMHO.
It would be very rare, but GC can be a moderately expensive process so if there's a particular section that's timing sensitive, you don't want that section interupted by GC.
Your application is about to enter into a block of code that you don’t
want interrupted by a possible garbage collection. ...
A very suspect argument (that is nevertheless used a lot).
Windows is not a Real Time OS. Your code (Thread/Process) can always be pre-empted by the OS scheduler. You do not have a guaranteed access to the CPU.
So it boils down to: how does the time for a GC-run compare to a time-slot (~ 20 ms) ?
There is very little hard data available about that, I searched a few times.
From my own observation (very informal), a gen-0 collection is < 40 ms, usually a lot less. A full gen-2 can run into ~100 ms, probably more.
So the 'risk' of being interrupted by the GC is of the same order of magnitude as being swapped out for another process. And you can't control the latter.

Deallocate memory from large data structures in C#

I have a fewSortedList<>andSortedDictionary<>structures in my simulation code and I add millions of items in them over time. The problem is that the garbage collector does not deallocate quickly enough memory so there is a huge hit on the application's performance. My last option was to engage theGC.Collect()method so that I can reclaim that memory back. Has anyone got a different idea? I am aware of theFlyweightpattern which is another option but I would appreciate other suggestions that would not require huge refactoring of my code.
You are fighting the "There's no free lunch" principle. You cannot assume that stuffing millions of items in a list isn't going to affect perf. Only the SortedList<> should be a problem, it is going to start allocating memory in the Large Object Heap. That allocation isn't going to be freed soon, it takes a gen #2 collection to chuck stuff out of the LOH again. This delay should not otherwise affect the perf of your program.
One thing you can do is avoiding the multiple of copies of the internal array that SortedList<> will jam into the LOH when it keeps growing. Try to guess a good value for Capacity so it pre-allocates the large array up front.
Next, use Perfmon.exe or TaskMgr.exe and looks at the page fault delta of your program. It should be quite busy while you're allocating. If you see low values (100 or less) then you might have a problem with the paging file being fragmented. A common scourge on older machines that run XP. Defragging the disk and using SysInternals' PageDefrag utility can do extraordinary wonders.
I think the SortedList uses a array as backing field, which means that large SortedList get allocated on the Large object heap. The large object heap can get defragmentated, which can cause an out of memory exception while in principle there is still enough memory available.
See this link.
This might be your problem, as intermediate calls to GC.collect prevent the LOH from getting badly defragmented in some scenarios, which explains why calling it helps you reduce the problem.
The problem can be mitigated by splitting large objects into smaller fragments.
I'd start with doing some memory profiling on your application to make sure that the items you remove from those lists (which I assume is happening from the way your post is written) are actually properly released and not hanging around places.
What sort of performance hit are we talking and on what operating system? If I recall, GC will run when it's needed, not immediately or even "soon". So task manager showing high memory allocated to your application is not necessarily a problem. What happens if you put the machine under higher load (e.g. run several copies of your application)? Does memory get reclaimed faster in that scenario or are you starting to run out of memory?
I hope answers to these questions will help point you in a right direction.
Well, if you keep all of the items in those structures, the GC will never collect the resources because they still have references to them.
If you need the items in the structures to be collected, you must remove them from the data structure.
To clear the entire data structure try using Clear() and setting the data structure reference to null. If the data is still not getting collected fast enough, call CC.Collect().

C# .NET Linq Memory Cleanup or Leak?

I have a large 2GB file with 1.5 million listings to process. I am running a console app that performs some string manipulation then uploads each listing to the database.
I created a LINQ object and clear the object by assigning it to a new LinqObject() for each listing (loop).
When the object is complete, I add it to a list.
When the list reaches 100 objects, I submitAll on the entire list, clear the list, then repeat.
My memory usage continues to grow as the program runs. Is there anything I should be doing to keep memory usage down? I tried GC.collect. I think I want to use dispose..
Thanks in advance for looking.
It's normal for the memory usage of a program to increase when it's working. You should not try to force the garbage collector to reduce the memory usage to try to save resources, this will most likely waste resources instead.
Contrary to one's first reaction, high memory usage is not a performance problem as long as there are any free memory left at all. Having a lot of unused memory doesn't increase the performance a bit. If you try to reduce the memory usage only to keep it down, you are just wasting CPU time doing cleanup that is not needed.
If you are running out of free memory or if some other application needs it, the garbage collector will do the appropriate cleanup. In almost every situation the garbage collector will know much more about the current memory situatiuon than you can possibly anticipate when writing the code.
If you are using objects that implement the IDisposable interface, you should call the Dispose method to free unmanaged resources, but all other objects are handled by the garbage collector. Managed objects normally don't leak memory at all.
Do you need your memory usage to stay low? Absent an actual functional problem, high memory usage in and of itself is not an issue.
How large is the memory usage growing? It may be that .NET is just "settling" effectively.
It's not really clear exactly how you're doing this, but the general principle sounds okay. I suggest you take the database work out of the equation - just comment out whichever line would actually submit to the database. See how much memory that uses. Other than the StreamReader (or whatever) you shouldn't have anything else that needs disposing if you're not touching the database - just building batches of transformed objects and throwing them away.

Categories

Resources