We have a server app that does a lot of memory allocations (both short lived and long lived). We are seeing an awful lot of GC2 collections shortly after startup, but these collections calm down after a period of time (even though the memory allocation pattern is constant).
These collections are hitting performance early on.
I'm guessing that this could be caused by GC budgets (for Gen2?). Is there some way I can set this budget (directly or indirectly) to make my server perform better at the beginning?
One counter-intuitive set of results I've seen: We made a big reduction to the amount of memory (and Large Object Heap) allocations, which saw performance over the long term improve, but early performance gets worse, and the "settling down" period gets longer.
The GC apparently needs a certain period of time to realise our app is a memory hog and adapt accordingly. I already know this fact, how do I convince the GC?
Edit
OS:64 bit Windows Server 2008 R2
We're using .Net 4.0 ServerGC Batch Latency. Tried 4.5 and the 3 different latency modes, and while average performance was improved slightly, worst case performance actually deteriorated
Edit2
A GC spike can double time taken (we're talking seconds) going from acceptable to unacceptable
Almost all spikes correlate with gen 2 collections
My test run causes a final 32GB heap size. The initial frothiness lasts for the 1st 1/5th of the run time, and performance after that is actually better (less frequent spikes), even though the heap is growing. The last spike near the end of the test (with largest heap size) is the same height as (i.e. as bad as) 2 of the spikes in the initial "training" period (with much smaller heaps)
Allocation of extremely large heap in .NET can be insanely fast, and number of blocking collections will not prevent it from being that fast. Problems that you observe are caused by the fact that you don't just allocate, but also have code that causes dependency reorganizations and actual garbage collection, all at the same time when allocation is going on.
There are a few techniques to consider:
try using LatencyMode (http://msdn.microsoft.com/en-us/library/system.runtime.gcsettings.latencymode(v=vs.110).aspx), set it to LowLatency while you are actively loading the data - see comments to this answer as well
use multiple threads
do not populate cross-references to newly allocated objects while actively loading; first go through active allocation phase, use only integer indexes to cross-reference items, but not managed references; then force full GC couple times to have everything in Gen2, and only then populate your advanced data structures; you may need to re-think your deserialization logic to make this happen
try forcing your biggest root collections (arrays of objects, strings) to second generation as early as possible; do this by preallocating them and forcing full GC two times, before you start populating data (loading millions of small objects); if you are using some flavor of generic Dictionary, make sure to preallocate its capacity early on, to avoid reorganizations
any big array of references is a big source of GC overhead - until both array and referenced objects are in Gen2; the bigger the array - the bigger the overhead; prefer arrays of indexes to arrays of references, especially for temporary processing needs
avoid having many utility or temporary objects deallocated or promoted while in active loading phase on any thread, carefully look through your code for string concatenation, boxing and 'foreach' iterators that can't be auto-optimized into 'for' loops
if you have an array of references and a hierarchy of function calls that have some long-running tight loops, avoid introducing local variables that cache the reference value from some position in the array; instead, cache the offset value and keep using something like "myArrayOfObjects[offset]" construct across all levels of your function calls; it helped me a lot with processing pre-populated, Gen2 large data structures, my personal theory here is that this helps GC manage temporary dependencies on your local thread's data structures, thus improving concurrency
Here are the reasons for this behavior, as far as I learned from populating up to ~100 Gb RAM during app startup, with multiple threads:
when GC moves data from one generation to another, it actually copies it and thus modifies all references; therefore, the fewer cross-references you have during active load phase - the better
GC maintains a lot of internal data structures that manage references; if you do massive modifications to references themselves - or if you have a lot of references that have to be modified during GC - it causes significant CPU and memory bandwidth overhead during both blocking and concurrent GC; sometimes I observed GC constantly consuming 30-80% of CPU without any collections going on - simply by doing some processing, which looks weird until you realize that any time you put a reference to some array or some temporary variable in a tight loop, GC has to modify and sometimes reorganize dependency tracking data structures
server GC uses thread-specific Gen0 segments and is capable of pushing entire segment to next Gen (without actually copying data - not sure about this one though), keep this in mind when designing multi-threaded data load process
ConcurrentDictionary, while being a great API, does not scale well in extreme scenarios with multiple cores, when number of objects goes above a few millions (consider using unmanaged hashtable optimized for concurrent insertion, such as one coming with Intel's TBB)
if possible or applicable, consider using native pooled allocator (Intel TBB, again)
BTW, latest update to .NET 4.5 has defragmentation support for large object heap. One more great reason to upgrade to it.
.NET 4.6 also has an API to ask for no GC whatsoever (GC.TryStartNoGCRegion), if certain conditions are met: https://msdn.microsoft.com/en-us/library/dn906202(v=vs.110).aspx
Also see a related post by Maoni Stephens: https://blogs.msdn.microsoft.com/maoni/2017/04/02/no-gcs-for-your-allocations/
Related
Using the Visual Studio Concurrency Visualizer I now see why I don't get any benefit switching to Parallel.For: only the 9% of the time the machine is busy executing the code, the rest is 71% synchronization and 17% memory management (1).
Checking all the orange stripes on the diagram below I discovered that GC is always involved (2).
After reading all these interesting topics...
Why do I have a lock here?
https://blog.marcgravell.com/2011/10/assault-by-gc.html
Prevent .NET Garbage collection for short period of time
https://devblogs.microsoft.com/premier-developer/understanding-different-gc-modes-with-concurrency-visualizer/
.. am I right assuming that all these threads need to play with a single memory management object and therefore removing the need to allocate objects on the heap my scenario will improve considerably? Like using structs instead of classes, array instead of dynamic lists, etc.?
I have a lot of work to do to bend my code in this direction. Just wanted to be sure before starting.
From your screenshot it seems like memory allocation is blocked while waiting for GC to complete. There are server and workstation GC modes, and it may be concurrent or not, but all options need to block threads at least a little while. I would check in more detail how often, and how much time you are spending in GC, and how often gen 0/1 and 2 is running.
I believe that each thread has a separate ephemeral segment it uses for allocations, so that it would not need to synchronize allocations, unless it needs a new segment, or the allocation is on the large object heap. But I'm unable to find a reference for this.
In any case, you will likely benefit from reducing the amount and size of allocations. If possible, use a object pool or memory pool to reuse memory. You might also benefit from increasing the amount of memory and checking the application for memory leaks. A general recommendation for memory is that there should be two types of allocations:
Small temporary allocations that only live for a short duration, like a temporary object that live for the duration of a method call.
Long lived allocations of any size that live for the duration of the "application".
If this pattern is followed almost all garbage should be collected in Gen 0/1, and gen 2 collections should be fairly rare.
It also depends a bit if you are allocating many small objects, or large chunks of memory. If the former you may consider using structs since these are stack allocated. If the later you also need to consider memory fragmentation, and this should also improve by using a memory pool that only allocates fixed sized chunks of memory.
Edit:
At the very simplest a object pool could be something like this:
public class ObjectPool<T>
{
private ConcurrentBag<T> pool = new ConcurrentBag<T>();
public T Get(Func<T> constructor) => pool.TryTake(out var result) ? result : constructor();
public void Return(T obj) => pool.Add(obj);
}
This assumes that the objects represent identical resources, like byte arrays of some fixed size. But there are also existing implementations:
.Net core MemoryPool
asp.Net core object pool
stack overflow question regarding object pools
Memory Management The Memory Management report shows the calls where memory management blocks occurred, along with the total blocking
times of each call stack. Use this information to identify areas that
have excessive paging or garbage collection issues.
Further more
Memory management time
These segments in the timeline are associated with blocking times that
are categorized as Memory Management. This scenario implies that a
thread is blocked by an event that is associated with a memory
management operation such as Paging. During this time, a thread has
been blocked in an API or kernel state that the Concurrency Visualizer
is counting as memory management. These include events such as paging
and memory allocation. Examine the associated call stacks and profile
reports to better understand the underlying reasons for blocks that
are categorized as Memory Management.
Yes, allocating less will likely have a large benefit on your resources and efficiency, but that is almost always the case on hot paths and thrashed applications
Heap allocations and particular Large Object Heap (LOB) allocations are costly, it also creates extra work for your The Garbage Collector and can fragment your memory causing even more inefficiency. The less you allocate, or reuse memory, or use the stack the better you are (in general).
This is also where you would learn to use a good memory profiler and get to know your garbage collector.
On saying that this would not be the only tool you would use to make your application less allocatey. A good memory profiler will go a long way, combined with learning how to read the results and affect changes based on the results.
Creating minimal allocation code is an artform, and one worth your learning
Also as #mjwills pointed out in the comments, you would run any change through your benchmark software as well, removing allocations at the cost of CPU time won't make sense. There are a lot of ways to speed up code, and low allocation is just one of a lot of approaches that may help.
Lastly, I would suggest following Marc Gravell and his blogs as a start (Mr DeAllocation), get to know your Garbage Collector and how the generations wortk, and tools like memory profilers and benchmarkers for performant silky smooth production code
I have a fewSortedList<>andSortedDictionary<>structures in my simulation code and I add millions of items in them over time. The problem is that the garbage collector does not deallocate quickly enough memory so there is a huge hit on the application's performance. My last option was to engage theGC.Collect()method so that I can reclaim that memory back. Has anyone got a different idea? I am aware of theFlyweightpattern which is another option but I would appreciate other suggestions that would not require huge refactoring of my code.
You are fighting the "There's no free lunch" principle. You cannot assume that stuffing millions of items in a list isn't going to affect perf. Only the SortedList<> should be a problem, it is going to start allocating memory in the Large Object Heap. That allocation isn't going to be freed soon, it takes a gen #2 collection to chuck stuff out of the LOH again. This delay should not otherwise affect the perf of your program.
One thing you can do is avoiding the multiple of copies of the internal array that SortedList<> will jam into the LOH when it keeps growing. Try to guess a good value for Capacity so it pre-allocates the large array up front.
Next, use Perfmon.exe or TaskMgr.exe and looks at the page fault delta of your program. It should be quite busy while you're allocating. If you see low values (100 or less) then you might have a problem with the paging file being fragmented. A common scourge on older machines that run XP. Defragging the disk and using SysInternals' PageDefrag utility can do extraordinary wonders.
I think the SortedList uses a array as backing field, which means that large SortedList get allocated on the Large object heap. The large object heap can get defragmentated, which can cause an out of memory exception while in principle there is still enough memory available.
See this link.
This might be your problem, as intermediate calls to GC.collect prevent the LOH from getting badly defragmented in some scenarios, which explains why calling it helps you reduce the problem.
The problem can be mitigated by splitting large objects into smaller fragments.
I'd start with doing some memory profiling on your application to make sure that the items you remove from those lists (which I assume is happening from the way your post is written) are actually properly released and not hanging around places.
What sort of performance hit are we talking and on what operating system? If I recall, GC will run when it's needed, not immediately or even "soon". So task manager showing high memory allocated to your application is not necessarily a problem. What happens if you put the machine under higher load (e.g. run several copies of your application)? Does memory get reclaimed faster in that scenario or are you starting to run out of memory?
I hope answers to these questions will help point you in a right direction.
Well, if you keep all of the items in those structures, the GC will never collect the resources because they still have references to them.
If you need the items in the structures to be collected, you must remove them from the data structure.
To clear the entire data structure try using Clear() and setting the data structure reference to null. If the data is still not getting collected fast enough, call CC.Collect().
I found this article here:
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
http://www.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf
In the conclusion section, it reads:
Comparing runtime, space consumption,
and virtual memory footprints over a
range of benchmarks, we show that the
runtime performance of the
best-performing garbage collector is
competitive with explicit memory
management when given enough memory.
In particular, when garbage collection
has five times as much memory as
required, its runtime performance
matches or slightly exceeds that of
explicit memory management. However,
garbage collection’s performance
degrades substantially when it must
use smaller heaps. With three times as
much memory, it runs 17% slower on
average, and with twice as much
memory, it runs 70% slower. Garbage
collection also is more susceptible to
paging when physical memory is scarce.
In such conditions, all of the garbage
collectors we examine here suffer
order-of-magnitude performance
penalties relative to explicit memory
management.
So, if my understanding is correct: if I have an app written in native C++ requiring 100 MB of memory, to achieve the same performance with a "managed" (i.e. garbage collector based) language (e.g. Java, C#), the app should require 5*100 MB = 500 MB?
(And with 2*100 MB = 200 MB, the managed app would run 70% slower than the native app?)
Do you know if current (i.e. latest Java VM's and .NET 4.0's) garbage collectors suffer the same problems described in the aforementioned article? Has the performance of modern garbage collectors improved?
Thanks.
if I have an app written in native C++
requiring 100 MB of memory, to achieve
the same performance with a "managed"
(i.e. garbage collector based)
language (e.g. Java, C#), the app
should require 5*100 MB = 500 MB? (And
with 2*100 MB = 200 MB, the managed
app would run 70% slower than the
native app?)
Only if the app is bottlenecked on allocating and deallocating memory. Note that the paper talks exclusively about the performance of the garbage collector itself.
You seem to be asking two things:
have GC's improved since that research was performed, and
can I use the conclusions of the paper as a formula to predict required memory.
The answer to the first is that there have been no major breakthroughs in GC algorithms that would invalidate the general conclusions:
GC'ed memory management still requires significantly more virtual memory.
If you try to constrain the heap size the GC performance drops significantly.
If real memory is restricted, the GC'ed memory management approach results in substantially worse performance due to paging overheads.
However, the conclusions cannot really be used as a formula:
The original study was done with JikesRVM rather than a Sun JVM.
The Sun JVM's garbage collectors have improved in the ~5 years since the study.
The study does not seem to take into account that Java data structures take more space than equivalent C++ data structures for reasons that are not GC related.
On the last point, I have seen a presentation by someone that talks about Java memory overheads. For instance, it found that the minimum representation size of a Java String is something like 48 bytes. (A String consists of two primitive objects; one an Object with 4 word-sized fields and the other an array with a minimum of 1 word of content. Each primitive object also has 3 or 4 words of overhead.) Java collection data structures similarly use far more memory than people realize.
These overheads are not GC-related per se. Rather they are direct and indirect consequences of design decisions in the Java language, JVM and class libraries. For example:
Each Java primitive object header1 reserves one word for the object's "identity hashcode" value, and one or more words for representing the object lock.
The representation of a String has to use a separate "array of characters" because of JVM limitations. Two of the three other fields are an attempt to make the substring operation less memory intensive.
The Java collection types use a lot of memory because collection elements cannot be directly chained. So for example, the overheads of a (hypothetical) singly linked list collection class in Java would be 6 words per list element. By contrast an optimal C/C++ linked list (i.e. with each element having a "next" pointer) has an overhead of one word per list element.
1 - In fact, the overheads are less than this on average. The JVM only "inflates" a lock following use & contention, and similar tricks are used for the identity hashcode. The fixed overhead is only a few bits. However, these bits add up to a measurably larger object header ... which is the real point here.
Michael Borgwardt is kind of right about if the application is bottlenecked on allocating memory. This is according to Amdahl's law.
However, I have used C++, Java, and VB .NET. In C++ there are powerful techniques available that allocate memory on the stack instead of the heap. Stack allocation is easily a hundreds of times faster than heap allocation. I would say that use of these techniques could remove maybe one allocation in eight, and use of writable strings one allocation in four.
It's no joke when people claim highly optimized C++ code can trounce the best possible Java code. It's the flat out truth.
Microsoft claims the overhead in using any of the .NET family of languages over C++ is about two to one. I believe that number is just about right for most things.
HOWEVER, managed environments carry a particular benefit in that when dealing with inferior programmers you don't have to worry about one module trashing another module's memory and the resulting crash being blamed on the wrong developer and the bug difficult to find.
At least as I read it, your real question is whether there have been significant developments in garbage collection or manual memory management since that paper was published that would invalidate its results. The answer to that is somewhat mixed. On one hand, the vendors who provide garbage collectors do tune them so their performance tends to improve over time. On the other hand, there hasn't been anything like a major breakthroughs such as major new garbage collection algorithms.
Manual heap managers generally improve over time as well. I doubt most are tuned with quite the regularity of garbage collectors, but in the course of 5 years, probably most have had at least a bit of work done.
In short, both have undoubtedly improved at least a little, but in neither case have there been major new algorithms that change the fundamental landscape. It's doubtful that current implementations will give a difference of exactly 17% as quoted in the article, but there's a pretty good chance that if you repeated the tests today, you'd still get a difference somewhere around 15-20% or so. The differences between then and now are probably smaller than the differences between some of the different algorithms they tested at that time.
I am not sure how relivent your question still is today. A performance critical application shouldn't spend a sigificant portion of its time doing object creation (as the micro-benchmark is very likely to do) and the performance on modern systems is more likely to be determined by how well the application fits into the CPUs cache, rather than how much main memory it uses.
BTW: There are lots of ticks you can do in C++ which support this which are not available in Java.
If you are worried about the cost of GC or object creation, you can take steps to minimise how many objects you create. This is generally a good idea where performance is critical in any language.
The cost of main memory isn't as much of an issue as it used to me. A machine with 48 GB is relatively cheap these days. An 8 core server with 48 GB of main memory can be leased for £9/day. Try hiring a developer for £9/d. ;) However, what is still relatively expensive is CPU cache memory. It is fairly hard to find a system with more than 16 MB of CPU cache. c.f. 48,000 MB of main memory. A system performs much better when an application is using its CPU cache and this is the amount of memory to consider if performance is critical.
First note that its now 2019 and a lot of things has improved.
As long as you dont trigger GC, allocation would be like as simple as incrementing a pointer. In C++ its much more if you dont implement your own mechanism to allocate in chunks.
And if you use smart shared pointers each change to refercence count will required locked increment (xaddl instruction) is slow itself and requires processors communicate to invalidate and resynch their cacheline.
What is more, with GC you get more locality with at least three ways. First when it allocates a new segment, it zero's memory and warms cachelines. Second it compacts heap and cause data to stay closer togeter and lastly all threads use its own heap.
In conclusion, although its hard to test and compare with every scenario and GC implementation ive read somewhere on SO that its proven GC performs better than manual memory management.
5% of execution time spent on GC? 10%? 25%?
Thanks.
This blog post has an interesting investigation into this area.
The posters conclusion? That the overhead was negligible for his example.
So the GC heap is so fast that in a real program, even in tight loops, you can use closures and delegates without even giving it a second’s thought (or even a few nanosecond’s thought). As always, work on a clean, safe design, then profile to find out where the overhead is.
It depends entirely on the application. The garbage collection is done as required, so the more often you allocate large amounts of memory which later becomes garbage, the more often it must run.
It could even go as low as 0% if you allocate everything up front and the never allocate any new objects.
In typical applications I would think the answer is very close to 0% of the time is spent in the garbage collector.
The overhead varies widely. It's not really practical to reduce the problem domain into "typical scenarios" because the overhead of GC (and related functions, like finalization) depend on several factors:
The GC flavor your application uses (impacts how your threads may be blocked during a GC).
Your allocation profile, including how often you allocate (GC triggers automatically when an allocation request needs more memory) and the lifetime profile of objects (gen 0 collections are fastest, gen 2 collections are slower, if you induce a lot of gen 2 collections your overhead will increase).
The lifetime profile of finalizable objects, because they must have their finalizers complete before they will be eligible for collection.
The impact of various points on each of those axes of relevancy can be analyzed (and there are probably more relevant areas I'm not recalling off the top of my head) -- so the problem is really "how can you reduce those axes of relevancy to a 'common scenario?'"
Basically, as others said, it depends. Or, "low enough that you shouldn't worry about it until it shows up on a profiler report."
In native C/C++ there is sometimes a large cost of allocating memory due to finding a block of free memory that is of the right size, there is also a none 0 cost of freeing memory due to having to linked the freed memory into the correct list of blocks, and combine small blocks into large blocks.
In .NET it is very quick to allocate a new object, but you pay the cost when the garbage collector runs. However to cost of garbage collection short lived object is as close to free as you can get.
I have always found that if the cost of garbage collection is a problem to you, then you are likely to have over bigger problems with the design of your software. Paging can be a big issue with any GC if you don’t have enough physical RAM, so you may not be able to just put all your data in RAM and depend on the OS to provide virtual memory as needed.
It really can vary. Look at this demonstration short-but-complete program that I wrote:
http://nomorehacks.wordpress.com/2008/11/27/forcing-the-garbage-collector/
that shows the effect of large gen2 garbage collections.
Yes, the Garbage Collector will spend some X% of time collecting when averaged over all applications everywhere. But that doesn't necessarily means that time is overhead. For overhead, you can really only count the time that would be left after releasing an equivalent amount of memory on an unmanaged platform.
With that in mind, the actual overhead is negative, but the Garbage collector will save time by release several chunks of memory in batches. That means fewer context switches and an overall improvement in efficiency.
Additionally, starting with .Net 4 the garbage collector does a lot of it's work on a different thread that doesn't interrupt your currently running code as much. As we work more and more with mutli-core machines where a core might even be sitting idle now and then, this is a big deal.
While monitoring our application in Perf Mon I noticed that the % of Time In GC is anywhere from 20 - 60% while our application is performing a long running process (varies between 30 seconds to 1.5 minutes). This seems a bit excessive to me. This raises two important questions.
Am I correct that this excessive?
How can I figure out why route causes GC spikes?
Yes, this does sound excessive. Reducing the amount of GC would probably be the single best step you could take to reducing the runtime of your application (if that is your goal).
A high "% time in GC" is typically caused by allocating and then throwing away thousands or millions of objects. A good way to find out what's going on is to use a memory profiler tool.
Microsoft provides the free CLR Profiler. This will show you every allocation, but will make your app run 10-60 times slower. You may need to run it on less input data so that it can finish analyzing in a reasonable amount of time.
A great commercial tool is SciTech's .NET Memory Profiler. This imposes much less runtime overhead, and there is a free trial available. By taking multiple snapshots while your process is running, you can find out what type of objects are being frequently allocated (and then destroyed).
Once you've identified the source of the allocations, you then need to examine the code and figure out how those allocations can be reduced. While there are no one-size-fits-all answers, some things I've encountered in the past include:
String.Split can create hundreds of small short-lived strings. If you're doing a lot of string manipulation, it can help to process the string by walking it character-by-character.
Creating arrays or lists of thousands of small classes (say, under 24 bytes in size) can be expensive; if those classes can be treated as value types, it can (sometimes) greatly improve things to change them to structs.
Creating thousands of small arrays can increase memory usage a lot (because each array has a small amount of overhead); sometimes these can be replaced with one large array and indexes into a sub-section of it.
Having a lot of finalizable objects (particularly if they're not being disposed) can put a lot of pressure on the garbage collector; ensure that you're correctly disposing all IDisposable objects, and note that your own types should (almost) never have finalizers.
Microsoft has an article with Garbage Collection Guidelines for improving performance.
Am I correct that this excessive?
Yes, you are correct
How can I figure out why route causes GC spikes?
1.- Do take a look at PerfView
PerfView is a performance-analysis tool that helps isolate CPU- and
memory-related performance issues.
See Also: Improving Managed Code Performance
2.- See if GC.Collect or GC.WaitForPendingFinalizers is being called anywhere in your code or third party library. The latter can cause high CPU utilization.
Another reason could be lots of gen-1 or gen-2 collections, each of which takes MUCH more time and is caused by hanging on to objects a longer time.
I've seen this happen in web apps when buggy objects hang onto actual page objects - forcing the page to live as long as the other objects referring to them.
Breaking the link between objects and pages (in this case) caused GC to drop to very low values. Our site now has 100+ hits/second and GC time is typically 1% or less.