Why are gen1/gen2 collections slower than gen0? - c#

From my anecdotal knowledge, short-lived object creation isn't too troublesome in terms of GC, implying that gen0 collections are extremely fast. Gen1/gen2 collections, however, appear to be a little more "dreaded", i.e. they are said to usually be a whole lot slower than gen0.
Why is that? What makes, say, a gen2 collection on average significantly slower than gen0?
I'm not aware of any structural differences between the collection approaches themselves (i.e., things done in the mark/sweep/compaction phases); am I missing something? Or is it just that, e.g., gen2 tends to be larger than gen0, and hence there are more objects to check?

To amplify on canton7's answer, it's worthwhile to note a couple of additional things, one of which increases the cost of all collections (but especially gen1 and gen2) but reduces the cost of allocations between them, and one of which reduces the cost of gen0 and gen1 collections:
Many garbage collectors behave in a fashion somewhat analogous to cleaning out a building by moving everything of value to another building, dynamiting the original, and rebuilding the empty shell. A gen0 collection, which moves things from the gen0 building to the gen1 building, will be fairly fast because the gen0 "building" won't have much stuff in it. A gen2 collection would have to move everything that was in the much larger gen2 building. Garbage-collection systems may use separate "buildings" for smaller and larger gen2 objects, and manage the buildings for large objects by tracking individual regions of free space, but moving smaller objects and reclaiming storage wholesale is less work than trying to manage all the individual regions of storage that would become eligible for reuse. A key point to observe about generations here, however, is that even when it's necessary to scan a gen1 or gen2 object, it won't be necessary to move it, since the "building" it's in isn't targeted for immediate demolition.
Many systems use a "card table" which records whether each 4K chunk of memory has been written to, or contains a reference that was stored since the last gen0 or gen1 collection. This significantly slows down the first write to any such region of storage, but during gen0 and gen1 collections it makes it possible to skip the examination of a lot of objects. The details of how the card table is used vary, but the basic concept is that if code has a large array of references, most of which falls within 4K blocks that aren't tagged, the GC can know, without even looking in those blocks, that any newer objects which would be accessible through them will also be accessible in other ways, and thus it can find all gen0 objects without bothering to look in those blocks at all.
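To make the card-table idea concrete, here is a minimal sketch of my own (not part of the original answer): the only writes the collector has to track for a gen0 scan are stores of references into objects that live in older generations, which the CLR records with a write barrier that marks the containing card.
using System;

class Node { public Node Next; }

class CardTableIllustration
{
    static void Main()
    {
        // Allocate an object and let a couple of forced collections promote it.
        var old = new Node();
        GC.Collect();
        GC.Collect();
        Console.WriteLine("old is now in gen " + GC.GetGeneration(old)); // typically 2

        // This store puts a reference to a brand-new (gen0) object into an
        // old-generation object. It is exactly the kind of write the card
        // table records, so the next gen0 collection knows to treat old.Next
        // as a root into gen0 without scanning the rest of gen2.
        old.Next = new Node();
        GC.KeepAlive(old);
    }
}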
Note that even simplistic garbage-collection systems without card tables can simply and easily be made to benefit from the principle of generational GC. For example, in Commodore 64 BASIC, whose garbage collector is horrendously slow, a program that has created lots of long-lived strings can avoid lengthy garbage-collection cycles by using a couple of PEEK and POKE statements to adjust the top-of-string-heap pointer to just below the bottom of the long-lived strings, so they won't be considered for relocation/reclamation. If a program uses hundreds of strings that will last throughout program execution (e.g. a table of two-digit hex strings from 00 to FF), and just a handful of other strings, this may slash garbage-collection times by more than an order of magnitude.

A couple of reasons which come to mind:
They're bigger. Collecting gen1 means also collecting gen0, and doing a gen2 collection means collecting all three. The lower generations are sized smaller as well, as gen0 is collected most frequently and so needs to be cheap.
The main cost of a collection is a function of the number of objects which survive, not the number which die. Generational garbage collectors are built around the generational hypothesis, which says that objects tend to live for either a short time or a long time, but not often somewhere in the middle. Gen0 collections by their very definition consist mainly of objects which die in that generation, and so collections are cheap; gen1 and gen2 collections have a higher proportion of objects which survive (gen2 should ideally consist only of objects which survive), and so are more expensive (see the sketch below).
If an object is in gen0, then it can only be referenced by other gen0 objects, or by objects in higher generations which were updated to refer to it. Therefore, to see whether an object in gen0 is referenced, the GC only needs to check other gen0 objects, plus those objects in higher generations which have been updated to point to lower-generation objects (which the GC tracks; see "card tables"). To see whether a gen1 object is referenced it needs to check all of gen0 and gen1, plus updated objects in gen2.
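As a rough illustration of the generational hypothesis (my sketch, not part of the answer above), GC.CollectionCount shows how much more often gen0 runs than gen2 under an allocation-heavy loop:
using System;

class CollectionCounts
{
    static void Main()
    {
        int g0 = GC.CollectionCount(0);
        int g1 = GC.CollectionCount(1);
        int g2 = GC.CollectionCount(2);

        // Churn through lots of short-lived objects; almost all of them
        // die in gen0, so gen0 collections dominate.
        for (int i = 0; i < 10000000; i++)
        {
            var tmp = new byte[64];
            tmp[0] = 1;
        }

        Console.WriteLine("gen0 collections: " + (GC.CollectionCount(0) - g0));
        Console.WriteLine("gen1 collections: " + (GC.CollectionCount(1) - g1));
        Console.WriteLine("gen2 collections: " + (GC.CollectionCount(2) - g2));
    }
}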

Related

What are the different heaps in .net?

I was profiling the memory usage of a Windows Forms application in dotMemory and I noticed that for my application there were 0-4 heaps, all of varying sizes, as well as the large object heap.
I was just wondering if anyone had a good explanation of what each heap is for and what is typically stored in each heap?
The other answers seem to be missing the fact that there is a difference between heaps and generations. I don't see why a commercial profiler would confuse the two concepts, so I strongly suspect it's heaps and not generations after all.
When the CLR GC is using the server flavor, it creates a separate heap for each logical processor in the process' affinity mask. The reason for this breakdown is mostly to improve the scalability of allocations, and to perform GC in parallel. These are separate memory regions, but you can of course have object references between the heaps and can consider them a single logical heap.
So, assuming that you have four logical processors (e.g. an i5 CPU with HyperThreading enabled), you'll have four heaps under server GC.
The Large Object Heap has an unfortunate, confusing name. It's not a heap in the same sense as the per-processor heaps. It's a logical abstraction on top of multiple memory regions that contain large objects.
You have different heaps because of how the .NET garbage collector works. It uses a generational GC, which separates data based on object age (how long it has survived collections). The use of different heaps allows the garbage collector to clean up memory more efficiently.
According to MSDN:
The heap is organized into generations so it can handle long-lived and short-lived objects. Garbage collection primarily occurs with the reclamation of short-lived objects that typically occupy only a small part of the heap.
Generation 0. This is the youngest generation and contains short-lived objects. An example of a short-lived object is a temporary variable. Garbage collection occurs most frequently in this generation.
Newly allocated objects form a new generation of objects and are implicitly generation 0 collections, unless they are large objects, in which case they go on the large object heap in a generation 2 collection.
Most objects are reclaimed for garbage collection in generation 0 and do not survive to the next generation.
Generation 1. This generation contains short-lived objects and serves as a buffer between short-lived objects and long-lived objects.
Generation 2. This generation contains long-lived objects. An example of a long-lived object is an object in a server application that contains static data that is live for the duration of the process.
Objects that are not reclaimed in a garbage collection are known as survivors, and are promoted to the next generation.
Important data quickly gets put on the garbage collector's back burner (higher generations) and is checked for deletion less often. This lowers the amount of time wasted checking memory that truly needs to persist, which lets you see performance gains from an efficient garbage collector.
When it comes to managed objects, there is the Small Object Heap (SOH), which is divided into three generations, and the Large Object Heap (LOH).
Large Object Heap (LOH)
Objects that are larger than 85,000 bytes (roughly 85 KB) go straight to the LOH. There are some risks if you have too many large objects; that's a different discussion - for more details, have a look at The Dangers of the Large Object Heap.
Small Object Heap (SOH) : Gen0, Gen1, Gen2
The garbage collector uses a clever algorithm to run a collection only when it is required. A full garbage collection is an expensive operation which shouldn't happen too often. So the SOH is broken into three parts, and as you have noticed, each generation has a specified amount of memory (its budget).
Every small object (under the 85,000-byte threshold) initially goes to Gen0. When Gen0 is full, a garbage collection runs for Gen0 only. It checks all instances in Gen0 and releases the memory used by any unreachable (no longer referenced) objects, and then it copies all the surviving (still in use) instances to Gen1.
The above process occurs even when you execute the following (you are not required to call it manually):
// Perform a collection of generation 0 only.
GC.Collect(0);
In this way, the garbage collector first clears the memory allocated for short-lived instances (immutable strings, variables in methods or smaller scopes).
As the GC keeps doing this, at some point Gen1 fills up. Then the GC does the same operation on Gen1: it clears all the unreachable memory in Gen1 and copies the surviving instances to Gen2.
The above process occurs when you execute the following manually (again, not required to call manually):
// Perform a collection of all generations up to and including 1.
GC.Collect(1);
As this continues, at some point Gen2 fills up, and the GC then tries to clean Gen2 as well.
The above process occurs even when you execute the following manually (not required to call manually):
// Perform a collection of all generations up to and including 2.
GC.Collect(2);
If there isn't enough memory available to grow Gen2 to hold the objects being promoted from Gen1, the GC throws an OutOfMemoryException.
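The promotion described above can be observed directly with GC.GetGeneration (a small sketch of my own; the exact generation numbers can vary with GC settings):
using System;

class PromotionDemo
{
    static void Main()
    {
        var survivor = new object();
        Console.WriteLine(GC.GetGeneration(survivor)); // 0: freshly allocated

        GC.Collect(0);                                 // gen0-only collection; the survivor is promoted
        Console.WriteLine(GC.GetGeneration(survivor)); // typically 1

        GC.Collect(1);                                 // gen0 + gen1 collection
        Console.WriteLine(GC.GetGeneration(survivor)); // typically 2

        GC.KeepAlive(survivor);                        // keep it reachable so it really survives
    }
}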

Manual GC Gen2 data allocation

I'm prototyping a managed DirectX game engine before moving to the C++ syntax horror.
So let's say I've got some data (e.g. an array or a HashSet of references) that I'm sure will stay alive throughout the whole application's life. Since performance is crucial here and I'm trying to avoid any lag spikes from generation promotion, I'd like to ask: is there any way to initialize an object (allocate its memory) straight away in the GC's generation 2? I couldn't find an answer for that, but I'm pretty sure I've seen someone doing that before.
Alternatively, since there would be no real need to "manage" that piece of memory, would it be possible to allocate it with unmanaged code but expose it to the rest of the code as a .NET type?
You can't allocate directly in Gen 2. All allocations happen in either Gen 0 or on the large object heap (if they are 85000 bytes or larger). However, pushing something to Gen 2 is easy: Just allocate everything you want to go to Gen 2 and force GCs at that point. You can call GC.GetGeneration to inspect the generation of a given object.
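As a sketch of that approach (hypothetical startup code of my own, not from the original answer): load the long-lived data, force a couple of full collections, and verify with GC.GetGeneration.
using System;
using System.Collections.Generic;

static class StartupWarmup
{
    // Hypothetical long-lived table kept for the whole run of the game.
    static List<string> staticData;

    public static void LoadStaticData()
    {
        staticData = new List<string>(100000);
        for (int i = 0; i < 100000; i++)
            staticData.Add(i.ToString("X4"));

        // Two forced full collections push the survivors into gen2
        // before the latency-sensitive game loop starts.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Console.WriteLine(GC.GetGeneration(staticData)); // typically 2
    }
}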
Another thing to do is keep a pool of objects. I.e. instead of releasing objects and thus making them eligible for GC, you return them to a pool. This reduces allocations and thus the number of GCs as well.
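A minimal pool along those lines might look like this (my sketch, with made-up names; real pools usually add size limits, reset logic, and thread safety):
using System.Collections.Generic;

// Instead of letting instances become garbage, callers return them
// to the pool so they can be handed out again.
public sealed class ObjectPool<T> where T : class, new()
{
    private readonly Stack<T> items = new Stack<T>();

    public T Rent()
    {
        return items.Count > 0 ? items.Pop() : new T();
    }

    public void Return(T item)
    {
        items.Push(item);
    }
}
Usage is simply pool.Rent() when you need an instance and pool.Return(obj) when you are done with it; note that this simple version is not thread-safe.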

Can I "prime" the CLR GC to expect profligate memory use?

We have a server app that does a lot of memory allocations (both short lived and long lived). We are seeing an awful lot of GC2 collections shortly after startup, but these collections calm down after a period of time (even though the memory allocation pattern is constant).
These collections are hitting performance early on.
I'm guessing that this could be caused by GC budgets (for Gen2?). Is there some way I can set this budget (directly or indirectly) to make my server perform better at the beginning?
One counter-intuitive set of results I've seen: We made a big reduction to the amount of memory (and Large Object Heap) allocations, which saw performance over the long term improve, but early performance gets worse, and the "settling down" period gets longer.
The GC apparently needs a certain period of time to realise our app is a memory hog and adapt accordingly. I already know this fact, how do I convince the GC?
Edit
OS: 64-bit Windows Server 2008 R2
We're using .NET 4.0 with server GC in Batch latency mode. We tried 4.5 and the three different latency modes, and while average performance improved slightly, worst-case performance actually deteriorated.
Edit2
A GC spike can double the time taken (we're talking seconds), going from acceptable to unacceptable.
Almost all spikes correlate with gen 2 collections
My test run ends with a final heap size of 32 GB. The initial frothiness lasts for the first fifth of the run time, and performance after that is actually better (less frequent spikes), even though the heap is growing. The last spike near the end of the test (with the largest heap size) is the same height as (i.e. as bad as) two of the spikes in the initial "training" period (with much smaller heaps).
Allocation of an extremely large heap in .NET can be insanely fast, and the number of blocking collections will not prevent it from being that fast. The problems you observe are caused by the fact that you don't just allocate; you also have code that causes dependency reorganization and actual garbage collection, all at the same time as the allocation is going on.
There are a few techniques to consider:
try using GCSettings.LatencyMode (http://msdn.microsoft.com/en-us/library/system.runtime.gcsettings.latencymode(v=vs.110).aspx); set it to LowLatency while you are actively loading the data - see the comments on this answer as well
use multiple threads
do not populate cross-references to newly allocated objects while actively loading; first go through the active allocation phase, using only integer indexes to cross-reference items rather than managed references; then force a full GC a couple of times to get everything into Gen2, and only then populate your more advanced data structures; you may need to rethink your deserialization logic to make this happen
try forcing your biggest root collections (arrays of objects, strings) into the second generation as early as possible; do this by preallocating them and forcing a full GC two times, before you start populating data (loading millions of small objects); if you are using some flavor of generic Dictionary, make sure to preallocate its capacity early on to avoid reorganizations
any big array of references is a big source of GC overhead until both the array and the referenced objects are in Gen2; the bigger the array, the bigger the overhead; prefer arrays of indexes to arrays of references, especially for temporary processing needs (see the sketch after this list)
avoid having many utility or temporary objects deallocated or promoted while in the active loading phase on any thread; carefully look through your code for string concatenation, boxing and 'foreach' iterators that can't be auto-optimized into 'for' loops
if you have an array of references and a hierarchy of function calls with some long-running tight loops, avoid introducing local variables that cache a reference value from some position in the array; instead, cache the offset value and keep using a construct like "myArrayOfObjects[offset]" across all levels of your function calls; it helped me a lot when processing pre-populated large Gen2 data structures; my personal theory is that this helps the GC manage temporary dependencies on your local thread's data structures, thus improving concurrency
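To illustrate the arrays-of-indexes point from the list above (my sketch, with made-up types): the second layout gives the GC nothing to trace or fix up inside the bulk data.
// Reference-heavy layout: every edge holds two object references, which the GC
// must trace on every collection and fix up whenever the nodes are moved.
class NodeRef { public string Name; }
class EdgeRef { public NodeRef From; public NodeRef To; }

// Index-based layout: edges hold plain ints into one big node array. The only
// reference-bearing structure is the Names array itself; the Edge array
// contains no pointers, so the GC never needs to scan or update it.
struct Edge { public int From; public int To; }

class Graph
{
    public string[] Names;   // one big root, cheap to keep in gen2
    public Edge[] Edges;     // pure value types: no tracing, no card-table traffic
}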
Here are the reasons for this behavior, as far as I learned from populating up to ~100 GB of RAM during app startup, with multiple threads:
when the GC moves data from one generation to another, it actually copies it and thus has to modify all references to it; therefore, the fewer cross-references you have during the active load phase, the better
the GC maintains a lot of internal data structures that manage references; if you make massive modifications to references themselves, or if you have a lot of references that have to be modified during GC, it causes significant CPU and memory-bandwidth overhead during both blocking and concurrent GC; sometimes I observed the GC constantly consuming 30-80% of CPU without any collections going on, simply from doing some processing, which looks weird until you realize that any time you put a reference into some array or some temporary variable in a tight loop, the GC has to modify, and sometimes reorganize, its dependency-tracking data structures
server GC uses thread-specific Gen0 segments and is capable of promoting an entire segment to the next generation (without actually copying the data - not sure about this one, though); keep this in mind when designing a multi-threaded data-load process
ConcurrentDictionary, while being a great API, does not scale well in extreme scenarios with multiple cores, when the number of objects goes above a few million (consider using an unmanaged hashtable optimized for concurrent insertion, such as the one that comes with Intel's TBB)
if possible or applicable, consider using a native pooled allocator (Intel TBB, again)
BTW, the latest update to .NET 4.5 (4.5.1) adds compaction support for the large object heap. One more great reason to upgrade to it.
.NET 4.6 also has an API to ask for no GC whatsoever (GC.TryStartNoGCRegion), if certain conditions are met: https://msdn.microsoft.com/en-us/library/dn906202(v=vs.110).aspx
Also see a related post by Maoni Stephens: https://blogs.msdn.microsoft.com/maoni/2017/04/02/no-gcs-for-your-allocations/
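A minimal sketch of the no-GC-region API mentioned above (my example; the 64 MB budget is made up, and the call can fail or throw if the budget cannot be guaranteed):
using System;
using System.Runtime;

class NoGcRegionDemo
{
    static void Main()
    {
        // Ask the runtime to avoid collections while we allocate up to ~64 MB.
        if (GC.TryStartNoGCRegion(64 * 1024 * 1024))
        {
            try
            {
                LoadLatencySensitiveData();
            }
            finally
            {
                // Only end the region if it is still in effect (overshooting
                // the budget ends it implicitly).
                if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                    GC.EndNoGCRegion();
            }
        }
    }

    // Hypothetical placeholder for the allocation-heavy startup work.
    static void LoadLatencySensitiveData() { }
}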

.NET: What is typical garbage collector overhead?

5% of execution time spent on GC? 10%? 25%?
Thanks.
This blog post has an interesting investigation into this area.
The poster's conclusion? That the overhead was negligible for his example.
So the GC heap is so fast that in a real program, even in tight loops, you can use closures and delegates without even giving it a second’s thought (or even a few nanosecond’s thought). As always, work on a clean, safe design, then profile to find out where the overhead is.
It depends entirely on the application. The garbage collection is done as required, so the more often you allocate large amounts of memory which later becomes garbage, the more often it must run.
It could even go as low as 0% if you allocate everything up front and then never allocate any new objects.
In typical applications I would think the answer is very close to 0% of the time is spent in the garbage collector.
The overhead varies widely. It's not really practical to reduce the problem domain into "typical scenarios" because the overhead of GC (and related functions, like finalization) depend on several factors:
The GC flavor your application uses (impacts how your threads may be blocked during a GC).
Your allocation profile, including how often you allocate (GC triggers automatically when an allocation request needs more memory) and the lifetime profile of objects (gen 0 collections are fastest, gen 2 collections are slower; if you induce a lot of gen 2 collections your overhead will increase).
The lifetime profile of finalizable objects, because they must have their finalizers complete before they will be eligible for collection.
The impact of various points on each of those axes of relevancy can be analyzed (and there are probably more relevant areas I'm not recalling off the top of my head) -- so the problem is really "how can you reduce those axes of relevancy to a 'common scenario?'"
Basically, as others said, it depends. Or, "low enough that you shouldn't worry about it until it shows up on a profiler report."
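If you want a number for your own application rather than a "typical" one, the ".NET CLR Memory \ % Time in GC" performance counter is the usual place to look. A rough sketch of my own (Windows / .NET Framework; the instance name can carry a #1 suffix when several copies of the process are running):
using System;
using System.Diagnostics;

class GcTimeCounter
{
    static void Main()
    {
        string instance = Process.GetCurrentProcess().ProcessName;
        using (var counter = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance, true))
        {
            counter.NextValue();                // prime the counter; the first sample can read as 0

            // Allocation-heavy work: lots of short-lived strings.
            for (int i = 0; i < 5000000; i++)
            {
                var s = i.ToString() + "-suffix";
            }

            Console.WriteLine("% time in GC: " + counter.NextValue().ToString("F1"));
        }
    }
}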
In native C/C++ there is sometimes a large cost to allocating memory, due to having to find a free block of the right size; there is also a non-zero cost to freeing memory, due to having to link the freed memory into the correct list of blocks and combine small blocks into larger blocks.
In .NET it is very quick to allocate a new object, but you pay the cost when the garbage collector runs. However, the cost of collecting short-lived objects is as close to free as you can get.
I have always found that if the cost of garbage collection is a problem for you, then you are likely to have bigger problems with the design of your software. Paging can be a big issue with any GC if you don't have enough physical RAM, so you may not be able to just put all your data in RAM and depend on the OS to provide virtual memory as needed.
It really can vary. Look at this short-but-complete demonstration program that I wrote:
http://nomorehacks.wordpress.com/2008/11/27/forcing-the-garbage-collector/
that shows the effect of large gen2 garbage collections.
Yes, the garbage collector will spend some X% of time collecting when averaged over all applications everywhere. But that doesn't necessarily mean that time is overhead. For overhead, you can really only count the time that would be left over after releasing an equivalent amount of memory on an unmanaged platform.
With that in mind, the actual overhead can even be negative: the garbage collector saves time by releasing several chunks of memory in batches, which means fewer context switches and an overall improvement in efficiency.
Additionally, starting with .NET 4 the garbage collector does a lot of its work on a different thread that doesn't interrupt your currently running code as much. As we work more and more with multi-core machines, where a core might even be sitting idle now and then, this is a big deal.

Short-lived objects

What is the overhead of generating a lot of temporary objects (i.e. for interim results) that "die young" (never promoted to the next generation during a garbage collection interval)? I'm assuming that the "new" operation is very cheap, as it is really just a pointer increment. However, what are the hidden costs of dealing with this temporary "litter"?
Not a lot - the garbage collector is very fast for gen0. It also tunes itself, adjusting the size of gen0 depending on how much it manages to collect each time it goes. (If it's managed to collect a lot, it will reduce the size of gen0 to collect earlier next time, and vice versa.)
The ultimate test is how your application performs though. Perfmon is very handy here, showing how much time has been spent in GC, how many collections there have been of each generation etc.
As you say, the allocation itself is very inexpensive. The cost of generating lots of short-lived objects is more frequent garbage collections, as they are triggered when generation 0's budget is exhausted. However, a generation 0 collection is fairly cheap, so as long as your objects really are short-lived, the overhead is most likely not significant.
On the other hand, the common example of concatenating lots of strings in a loop pushes the garbage collector significantly, so it all depends on the number of objects you create. It doesn't hurt to think about allocation.
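For reference, the classic fix for that particular case (a small sketch of my own):
using System;
using System.Text;

class ConcatVsBuilder
{
    static void Main()
    {
        // Naive concatenation: every += allocates a brand-new string, so this
        // loop creates roughly 10,000 short-lived strings for the GC to reclaim.
        string s = "";
        for (int i = 0; i < 10000; i++)
            s += i;

        // StringBuilder reuses one growing buffer, so the same loop allocates
        // only a handful of objects regardless of the iteration count.
        var sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.Append(i);
        string t = sb.ToString();

        Console.WriteLine(s.Length == t.Length); // True
    }
}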
The cost of garbage collection is that managed threads are suspended during compaction.
In general, this isn't something you should probably be worrying about, and it sounds like it starts to fall very close to "micro-optimization". The GC was designed with the assumption that a "well tuned application" will have all of its allocations in Gen0 - meaning that they all "die young". Any time you allocate a new object, it is always in Gen0. A collection won't occur until the Gen0 threshold is passed and there isn't enough available space in Gen0 to hold the next allocation.
The "new" operation is actually a bunch of things:
allocating memory
running the type's constructor
returning a pointer to the memory
incrementing the next object pointer
Although the new operation is designed and written efficiently, it is not free and does take time to allocate new memory. The allocator needs to track which chunks are available for allocation, and the newly allocated memory is zeroed.
Creating a lot of objects that die young will also trigger garbage collection more often, and that operation can be expensive, especially with "stop the world" garbage collectors.
Here's an article from the MSDN on how it works:
http://msdn.microsoft.com/en-us/magazine/bb985011.aspx
Note that it describes why invoking garbage collection is expensive: the GC needs to build the object graph before it can start collecting.
If these objects are never promoted out of Generation 0 then you will see pretty good performance. The only hidden cost I can see is that if you exceed your Generation 0 budget you will force the GC to compact the heap but the GC will self-tune so this isn't much of a concern.
Garbage collection is generational in .NET. Short-lived objects are collected first and frequently. Gen 0 collection is cheap, but depending on the number of objects you're creating, it could still be quite costly. I'd run a profiler to find out if it is affecting performance. If it is, consider switching them to structs; as long as they aren't boxed, these do not need to be collected.
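As an illustration of the struct suggestion (my sketch; the struct version stays off the GC heap only as long as it isn't boxed or captured):
// Class version: each point is a separate heap object the GC must eventually collect.
class PointClass { public double X; public double Y; }

// Struct version: locals and array elements of this type are stored inline
// (on the stack or inside the containing array), so there is no separate
// object for the GC to trace or collect.
struct PointStruct { public double X; public double Y; }

class StructDemo
{
    static double SumClass(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            var p = new PointClass { X = i, Y = i };   // one heap allocation per iteration
            sum += p.X + p.Y;
        }
        return sum;
    }

    static double SumStruct(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            var p = new PointStruct { X = i, Y = i };  // no heap allocation at all
            sum += p.X + p.Y;
        }
        return sum;
    }
}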
