I have been investigating some garbage collection issues in a C# server app, currently using PerfView. After collecting some data and getting a load of GC Stats, I'm a little confused about one of the columns, 'Trigger Reason'. I'm getting two values, 'AllocLarge' and 'AllocSmall'. I have searched through the help and Google and can't find what exactly these two terms mean.
The .NET GC treats objects larger than 85,000 bytes (large objects) very differently from other objects (small objects). In particular, large objects are only collected in 'Generation 2' (the most expensive kind of GC). 'AllocLarge' means a GC was triggered while allocating a large object (and thus must have provoked a Gen 2 GC). 'AllocSmall' means a GC happened in response to an allocation of an 'ordinary' object.
Note that in general it is bad to have short-lived large objects (since these force expensive GCs). You can see everywhere you allocated a large object by looking at the 'GC Alloc Stats' view and looking for the pseudo-frame 'LargeObject'. Double-click on that (which brings you to the 'Callers' view), and you will see where you are allocating large objects.
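For a concrete feel for the 85,000-byte threshold, here is a minimal sketch (array sizes are illustrative; GC.GetGeneration reports LOH objects as generation 2):

using System;

class LargeAllocDemo
{
    static void Main()
    {
        // 80,000 bytes of payload: stays below the threshold, allocated in Gen 0.
        var small = new byte[80000];

        // 90,000 bytes: crosses the ~85,000-byte threshold and lands on the LOH.
        var large = new byte[90000];

        Console.WriteLine(GC.GetGeneration(small)); // typically 0
        Console.WriteLine(GC.GetGeneration(large)); // 2 (the LOH is collected with Gen 2)
    }
}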
The runtime marks gen 2 memory regions (via card tables, using write barriers) to detect when a reference to a younger-generation object is written into a gen 2 object. When a gen 0 or gen 1 collection occurs, those marked regions are checked. The marking affects assignment performance, and the GC obviously also spends more time collecting gen 0/1.
I have pooled objects into which references to short-lived objects are written very frequently.
What I want to achieve is to have those pooled objects always stay in gen 0, but I can't think of any technique for that. I also want to discuss whether it would be beneficial. Maybe the GC team should include it as a feature?
In short, I have long-lived objects that hold references to both persistent and volatile objects. The volatile object references are written to them very frequently, which means they get scanned on every gen 0 collection, plus the write barrier and card table management overhead. What do you think is the best way to squeeze out the best performance?
Edit: this is about a zero-allocation AsyncTaskMethodBuilder. I uploaded a working sample to GitHub:
https://github.com/crowo/LightTask/tree/master/LightTask
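For reference, a stripped-down illustration of the pattern in question (the names are made up, not taken from the linked repository):

using System;

// A long-lived, pooled holder whose field is constantly re-pointed at fresh
// short-lived objects. Once the holder is promoted to gen 2, every assignment
// to Current goes through the write barrier and dirties a card, so the holder
// gets rescanned during gen 0/1 collections.
sealed class PooledHolder
{
    public object Current;
}

class Demo
{
    static readonly PooledHolder Holder = new PooledHolder(); // lives for the whole process

    static void Main()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Holder.Current = new object(); // young object referenced from an old one
        }
        Console.WriteLine(GC.GetGeneration(Holder)); // usually 2 once a few GCs have run
    }
}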
The issue you are describing does indeed exist. Card table scanning, and the promotions that result from it, may indeed add some overhead. I've explained this mechanism in my video here. Unfortunately, it is simply a feature and/or caveat of most generational GCs. Moreover, it may lead to nepotism.
BUT, the main question is whether it is a real problem, or whether you just don't feel comfortable knowing that such overhead happens. In other words, you should follow a "measure first" approach.
Unfortunately, there are no super easy metrics for measuring the overhead of this particular problem. But typically we do the following things:
since .NET 5 we have "generational aware analysis" available in the runtime itself, which we can see in PerfView's Generational Aware view. It allows you to see which gen 1/gen 2 objects are holding references to younger generations. This may not be super useful in your case, as it sounds like you already know where it happens
record a .NET trace and look for MarkWithType events. If there are many Type="3" events (they represent the GC_ROOT_OLDER promotion reason) with a large amount of promoted bytes, it may indeed suggest you have a problem:
Event Name Time MSec Process Name Rest
MarkWithType 186.370 Process(28820) HeapNum="5" Type="3" Promoted="62,104,694"
MarkWithType 186.421 Process(28820) HeapNum="0" Type="3" Promoted="52,687,589"
MarkWithType 186.633 Process(28820) HeapNum="3" Type="3" Promoted="49,932,696"
But only if you also correlate such GCs with long pause times or % Time in GC.
So, only if you measure that it somehow is a problem (and this typically happens at volumes of dozens of GBs) should you try to solve it. As it is an inherent consequence of how the .NET GC is built, there is no easy workaround though. Here are some typical ideas:
try to follow #david-l's advice to split the object structures. Maybe the old objects can reference those young, temporary objects only by some index, not by a reference (a rough sketch of this idea follows after this list)
or just try to redesign it all, as #holger suggests, to avoid those references entirely
or make the young, temporary objects pooled as well, using ObjectPool, so that eventually they all live in gen 2
or make those temporary objects structs that are inlined into the long-lived pooled objects
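A minimal sketch of the index-based idea, with made-up types: the long-lived, pooled object stores an int slot number instead of an object reference, and the payload itself is a struct kept in a value-type array, so no reference is ever written into an old object (no write barrier, no card to scan).

// Value type: holds no object references, so the GC never scans it.
struct Payload
{
    public long Ticket;
    public int State;
}

// Long-lived, pooled, eventually in gen 2. Writing an int does not involve the GC.
sealed class PooledWorkItem
{
    public int PayloadSlot = -1; // index into PayloadTable.Slots, not a reference
}

static class PayloadTable
{
    public static readonly Payload[] Slots = new Payload[1024];
}

// usage (illustrative): item.PayloadSlot = 42; PayloadTable.Slots[42] = new Payload { Ticket = 1 };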
I'm prototyping a managed DirectX game engine before moving on to the C++ syntax horror.
So let's say I've got some data (e.g. an array or a HashSet of references) that I'm sure will stay alive throughout the application's whole lifetime. Since performance is crucial here and I'm trying to avoid any lag spikes from generation promotion, I'd like to ask if there's any way to initialize an object (allocate its memory) directly in the GC's generation 2. I couldn't find an answer to that, but I'm pretty sure I've seen someone do it before.
Alternatively, since there would be no real need to "manage" that piece of memory, would it be possible to allocate it with unmanaged code but expose it to the rest of the code as a .NET type?
You can't allocate directly in Gen 2. All allocations happen either in Gen 0 or on the large object heap (if the object is 85,000 bytes or larger). However, pushing something to Gen 2 is easy: just allocate everything you want to end up in Gen 2 and force GCs at that point. You can call GC.GetGeneration to inspect the generation of a given object.
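A minimal sketch of that warm-up trick (the field and sizes are illustrative):

using System;

class Warmup
{
    static long[] _lookupTable; // stands in for your long-lived data

    static void Main()
    {
        // Allocate everything that should live for the whole run...
        _lookupTable = new long[5000];

        // ...then force a couple of collections so it is promoted gen 0 -> gen 1 -> gen 2.
        GC.Collect();
        GC.Collect();

        Console.WriteLine(GC.GetGeneration(_lookupTable)); // expected: 2
    }
}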
Another thing to do is keep a pool of objects. I.e. instead of releasing objects and thus making them eligible for GC, you return them to a pool. This reduces allocations and thus the number of GCs as well.
We have a server app that does a lot of memory allocations (both short lived and long lived). We are seeing an awful lot of GC2 collections shortly after startup, but these collections calm down after a period of time (even though the memory allocation pattern is constant).
These collections are hitting performance early on.
I'm guessing that this could be caused by GC budgets (for Gen2?). Is there some way I can set this budget (directly or indirectly) to make my server perform better at the beginning?
One counter-intuitive set of results I've seen: we made a big reduction in the amount of memory (and Large Object Heap) allocations, which improved performance over the long term, but early performance got worse and the "settling down" period got longer.
The GC apparently needs a certain period of time to realise our app is a memory hog and adapt accordingly. I already know this fact; how do I convince the GC?
Edit
OS: 64-bit Windows Server 2008 R2
We're using .NET 4.0 Server GC with Batch latency. We tried 4.5 and the three different latency modes, and while average performance improved slightly, worst-case performance actually deteriorated.
Edit2
A GC spike can double the time taken (we're talking seconds), going from acceptable to unacceptable.
Almost all spikes correlate with Gen 2 collections.
My test run ends with a final 32 GB heap size. The initial frothiness lasts for the first fifth of the run time, and performance after that is actually better (less frequent spikes), even though the heap is growing. The last spike near the end of the test (with the largest heap size) is the same height as (i.e. as bad as) two of the spikes in the initial "training" period (with much smaller heaps).
Allocation of an extremely large heap in .NET can be insanely fast, and the number of blocking collections will not prevent it from being that fast. The problems you observe are caused by the fact that you don't just allocate, but also have code that causes dependency reorganizations and actual garbage collection, all at the same time that allocation is going on.
There are a few techniques to consider:
try using GCSettings.LatencyMode (http://msdn.microsoft.com/en-us/library/system.runtime.gcsettings.latencymode(v=vs.110).aspx): set it to LowLatency while you are actively loading the data (see the comments on this answer as well, and the sketch after this list)
use multiple threads
do not populate cross-references to newly allocated objects while actively loading; first go through the active allocation phase, using only integer indexes to cross-reference items, not managed references; then force a full GC a couple of times to get everything into Gen 2, and only then populate your advanced data structures; you may need to re-think your deserialization logic to make this happen
try forcing your biggest root collections (arrays of objects, strings) into the second generation as early as possible; do this by preallocating them and forcing a full GC twice, before you start populating the data (loading millions of small objects); if you are using some flavor of generic Dictionary, make sure to preallocate its capacity early on, to avoid reorganizations
any big array of references is a big source of GC overhead until both the array and the referenced objects are in Gen 2; the bigger the array, the bigger the overhead; prefer arrays of indexes to arrays of references, especially for temporary processing needs
avoid having many utility or temporary objects deallocated or promoted while in the active loading phase on any thread; carefully look through your code for string concatenation, boxing and 'foreach' iterators that can't be auto-optimized into 'for' loops
if you have an array of references and a hierarchy of function calls with some long-running tight loops, avoid introducing local variables that cache a reference value from some position in the array; instead, cache the offset value and keep using a construct like "myArrayOfObjects[offset]" across all levels of your function calls; it helped me a lot with processing pre-populated, Gen 2 large data structures; my personal theory here is that this helps the GC manage temporary dependencies on your local thread's data structures, thus improving concurrency
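Here is a minimal sketch of the LatencyMode idea from the first bullet (LoadData is a hypothetical bulk-load routine; note that LowLatency is intended for workstation GC):

using System;
using System.Runtime;

class Loader
{
    static void LoadAll()
    {
        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            // Discourage blocking Gen 2 collections while the bulk load is running.
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            LoadData();
        }
        finally
        {
            GCSettings.LatencyMode = previous;
            GC.Collect(); // settle everything into Gen 2 once loading is done
        }
    }

    static void LoadData() { /* hypothetical deserialization / population phase */ }
}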
Here are the reasons for this behavior, as far as I learned while populating up to ~100 GB of RAM during app startup, with multiple threads:
when the GC moves data from one generation to another, it actually copies it and thus modifies all references; therefore, the fewer cross-references you have during the active load phase, the better
the GC maintains a lot of internal data structures that manage references; if you make massive modifications to references themselves, or if you have a lot of references that have to be modified during a GC, it causes significant CPU and memory bandwidth overhead during both blocking and concurrent GC; sometimes I observed the GC constantly consuming 30-80% of the CPU without any collections going on, simply by doing some processing, which looks weird until you realize that any time you put a reference into some array or some temporary variable in a tight loop, the GC has to modify and sometimes reorganize its dependency-tracking data structures
the server GC uses thread-specific Gen 0 segments and is capable of pushing an entire segment to the next generation (without actually copying the data; I'm not sure about this one though); keep this in mind when designing a multi-threaded data load process
ConcurrentDictionary, while being a great API, does not scale well in extreme scenarios with multiple cores once the number of objects goes above a few million (consider using an unmanaged hashtable optimized for concurrent insertion, such as the one that comes with Intel's TBB)
if possible or applicable, consider using a native pooled allocator (Intel TBB, again)
BTW, the latest update to .NET 4.5 (4.5.1) added support for compacting the large object heap. One more good reason to upgrade to it.
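A minimal sketch of how that opt-in compaction is requested (available since .NET Framework 4.5.1):

using System;
using System.Runtime;

static class LohCompaction
{
    public static void CompactOnce()
    {
        // The next blocking Gen 2 collection also compacts the large object heap,
        // after which the setting resets itself to Default.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
    }
}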
.NET 4.6 also has an API for requesting that no GC happen at all (GC.TryStartNoGCRegion), if certain conditions are met: https://msdn.microsoft.com/en-us/library/dn906202(v=vs.110).aspx
Also see a related post by Maoni Stephens: https://blogs.msdn.microsoft.com/maoni/2017/04/02/no-gcs-for-your-allocations/
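A minimal sketch of how that no-GC window is typically used (the 64 MB budget and DoLatencyCriticalWork are illustrative):

using System;
using System.Runtime;

static class CriticalSection
{
    public static void RunWithoutGC()
    {
        // Ask the runtime to pre-commit enough memory so no collection happens in this window.
        if (GC.TryStartNoGCRegion(64 * 1024 * 1024))
        {
            try
            {
                DoLatencyCriticalWork();
            }
            finally
            {
                if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                    GC.EndNoGCRegion();
            }
        }
    }

    static void DoLatencyCriticalWork() { /* allocation-heavy, latency-sensitive code */ }
}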
We have a C# application that controls one of our devices and reacts to signals this device gives us.
Basically, the application creates threads, handles operations (access to a database, etc.) and communicates with this device.
Over the life of the application, it creates objects and releases them, and so far we've been letting the garbage collector take care of our memory. I've read that it is highly recommended to let the GC do its job without interfering.
Now the problem we're facing is that our application's process grows indefinitely, step by step. Example:
There seem to be "waves" as the application grows: all of a sudden the application releases some memory, but it seems to leave memory leaks behind at the same time.
We're trying to investigate the application with a memory profiler, but we would like to understand deeply how the garbage collector works.
I've found an excellent article here: The Danger of Large Objects
I've also found the official documentation here: MSDN
Do you guys know of any other really deep documentation on the GC?
Edit:
Here is a screenshot that illustrates the behavior of the application :
You can clearly see the "wave" effect we're having on a very regular pattern.
Subsidiary question:
I've seen that my GC Gen 2 heap is quite big and follows the same pattern as the total bytes used by my application. I guess that's perfectly normal, because most of our objects will survive at least two garbage collections (for example singleton classes, etc.)... What do you think?
The behavior you describe is typical of problems with objects created on the Large Object Heap (LOH). However, your memory consumption seems to return to some lower value later on, so check twice whether it is really a LOH issue.
You are obviously aware of that, but what is not quite obvious is that there is an exception to the size rule for objects on the LOH.
As described in the documentation, objects above 85,000 bytes in size end up on the LOH. However, for some reason (probably an 'optimization'), arrays of doubles with 1000 or more elements also end up there (at least on the 32-bit runtime):
double[] smallArray = new double[999];  // ends up on the 'normal' heap, starting in Gen 0
double[] bigArray   = new double[1001]; // ends up on the LOH
Such arrays can fragment the LOH, which then requires more and more memory, until you get an OutOfMemoryException.
I was bitten by this: we had an app that received sensor readings as arrays of doubles, which caused LOH fragmentation since every array differed slightly in length (these were readings of real-time data at various frequencies, sampled by a non-real-time process). We solved the issue by implementing our own buffer pool (roughly sketched below).
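A minimal sketch of the kind of buffer pool meant here: fixed-size buffers are rented and returned instead of allocating a slightly different length every time (on newer runtimes, System.Buffers.ArrayPool<double> does the same job out of the box).

using System.Collections.Concurrent;

sealed class DoubleBufferPool
{
    private readonly ConcurrentBag<double[]> _buffers = new ConcurrentBag<double[]>();
    private readonly int _size;

    public DoubleBufferPool(int size) { _size = size; }

    // Reuse a buffer if one is available, otherwise allocate a new fixed-size one.
    public double[] Rent()
        => _buffers.TryTake(out var buffer) ? buffer : new double[_size];

    // Only take back buffers of the expected size so the pool stays uniform.
    public void Return(double[] buffer)
    {
        if (buffer.Length == _size) _buffers.Add(buffer);
    }
}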
I did some research for a class I was teaching a couple of years back. I don't think the references contain any information regarding the LOH, but I thought they were worthwhile to share either way (see below). Further, I suggest performing a second search for unreleased object references before blaming the garbage collector: simply implement a counter in the class finalizer to check that these large objects are being dropped as you believe.
A different solution to this problem is simply to never deallocate your large objects, but instead to reuse them with a pooling strategy. In my hubris I have many times ended up blaming the GC prematurely for my application's memory requirements growing over time; more often than not this is a symptom of a faulty implementation.
GC References:
http://blogs.msdn.com/b/clyon/archive/2007/03/12/new-in-orcas-part-3-gc-latency-modes.aspx
http://msdn.microsoft.com/en-us/library/ee851764.aspx
http://blogs.msdn.com/b/ericlippert/archive/2010/09/30/the-truth-about-value-types.aspx
http://blogs.msdn.com/b/ericlippert/archive/2009/04/27/the-stack-is-an-implementation-detail.aspx
Eric Lippert's blog is especially interesting when it comes to understanding anything C# in detail!
Here is an update with some of my investigations:
In our application, we're using a lot of threads to perform different tasks. Some of these threads have higher priority.
1) We're using a concurrent GC, and we tried switching it to non-concurrent.
We've seen a dramatic improvement:
The garbage collector is called much more often, and it seems that, when called more often, it releases our memory much better.
I'll post a screenshot as soon as I have a good one to illustrate this.
We've found a really good article on MSDN. We also found an interesting question on SO.
With the upcoming Framework 4.5, four GC configurations will be available:
Workstation - non-concurrent
Workstation - concurrent
Server - non-concurrent
Server - concurrent
We'll try switching to "server - non-concurrent" and "server - concurrent" to check whether either gives us better performance (a quick way to verify at runtime which mode the process actually got is sketched below).
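A small sketch for that runtime check, assuming the flavor itself is selected via the <gcServer> and <gcConcurrent> elements in app.config:

using System;
using System.Runtime;

static class GcModeCheck
{
    public static void Print()
    {
        Console.WriteLine("Server GC:    " + GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode); // e.g. Batch (non-concurrent) vs Interactive (concurrent)
    }
}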
I'll keep this thread updated with our findings.
I have a few SortedList<> and SortedDictionary<> structures in my simulation code, and I add millions of items to them over time. The problem is that the garbage collector does not deallocate memory quickly enough, so there is a huge hit on the application's performance. My last option was to call the GC.Collect() method so that I can reclaim that memory. Has anyone got a different idea? I am aware of the Flyweight pattern, which is another option, but I would appreciate other suggestions that would not require a huge refactoring of my code.
You are fighting the "there's no free lunch" principle. You cannot assume that stuffing millions of items into a list isn't going to affect perf. Only the SortedList<> should be a problem; it is going to start allocating memory in the Large Object Heap. That allocation isn't going to be freed soon, since it takes a gen #2 collection to chuck stuff out of the LOH again. This delay should not otherwise affect the perf of your program.
One thing you can do is avoid the multiple copies of the internal array that SortedList<> will jam into the LOH as it keeps growing. Try to guess a good value for Capacity so it pre-allocates the large array up front (see the sketch below).
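A minimal sketch of that pre-allocation (the key/value types and the capacity figure are illustrative):

using System.Collections.Generic;

class Simulation
{
    // Give SortedList<,> its expected final size up front so its internal arrays
    // are allocated once (straight into the LOH for large capacities) instead of
    // being re-allocated and copied on every growth.
    private readonly SortedList<long, double> _samples =
        new SortedList<long, double>(capacity: 2000000);
}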
Next, use Perfmon.exe or TaskMgr.exe and look at the page fault delta of your program. It should be quite busy while you're allocating. If you see low values (100 or less) then you might have a problem with the paging file being fragmented, a common scourge on older machines running XP. Defragging the disk and using SysInternals' PageDefrag utility can do extraordinary wonders.
I think SortedList uses an array as its backing store, which means that a large SortedList gets allocated on the Large Object Heap. The Large Object Heap can become fragmented, which can cause an out-of-memory exception even though in principle there is still enough memory available.
See this link.
This might be your problem, as intermediate calls to GC.Collect() prevent the LOH from getting badly fragmented in some scenarios, which explains why calling it helps you reduce the problem.
The problem can be mitigated by splitting large objects into smaller fragments.
I'd start by doing some memory profiling of your application to make sure that the items you remove from those lists (which I assume is happening, from the way your post is written) are actually properly released and not hanging around somewhere.
What sort of performance hit are we talking about, and on what operating system? If I recall correctly, the GC runs when it's needed, not immediately or even "soon". So Task Manager showing a lot of memory allocated to your application is not necessarily a problem. What happens if you put the machine under higher load (e.g. run several copies of your application)? Does memory get reclaimed faster in that scenario, or do you start running out of memory?
I hope the answers to these questions will help point you in the right direction.
Well, if you keep all of the items in those structures, the GC will never collect them, because they are still referenced.
If you need the items in the structures to be collected, you must remove them from the data structure.
To clear an entire data structure, try calling Clear() and setting the data structure reference to null. If the data is still not getting collected fast enough, call GC.Collect() (a tiny sketch follows).
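A tiny sketch of that teardown, with illustrative names:

using System;
using System.Collections.Generic;

class Teardown
{
    static SortedDictionary<int, string> _data = new SortedDictionary<int, string>();

    static void DropEverything()
    {
        _data.Clear(); // drop references to the contained items
        _data = null;  // drop the structure itself
        GC.Collect();  // only if waiting for the next natural collection is really too slow
    }
}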