I have a large 2GB file with 1.5 million listings to process. I am running a console app that performs some string manipulation then uploads each listing to the database.
I create a LINQ object and clear it by assigning it a new LinqObject() for each listing (each pass through the loop).
When the object is complete, I add it to a list.
When the list reaches 100 objects, I call SubmitAll on the entire list, clear the list, then repeat.
My memory usage continues to grow as the program runs. Is there anything I should be doing to keep memory usage down? I tried GC.Collect(). I think I want to use Dispose...
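Roughly, the loop is shaped like this (Listing, MyDataContext, and ParseListing are placeholder names for illustration, not the real code):

// using System.Collections.Generic; using System.IO; plus the LINQ to SQL DataContext namespace
var batch = new List<Listing>();
foreach (string line in File.ReadLines("listings.txt"))   // streams the 2 GB file one line at a time
{
    batch.Add(ParseListing(line));                         // string manipulation -> one Listing per row
    if (batch.Count == 100)
    {
        using (var db = new MyDataContext())               // short-lived context per batch
        {
            db.Listings.InsertAllOnSubmit(batch);
            db.SubmitChanges();                            // the "submit all" step
        }
        batch.Clear();                                     // old objects become eligible for collection
    }
}
// (a final partial batch would be flushed the same way after the loop)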
Thanks in advance for looking.
It's normal for the memory usage of a program to increase while it's working. You should not try to force the garbage collector to reduce memory usage in order to save resources; this will most likely waste resources instead.
Contrary to one's first reaction, high memory usage is not a performance problem as long as there is any free memory left at all. Having a lot of unused memory doesn't improve performance one bit. If you try to reduce memory usage only to keep it down, you are just wasting CPU time doing cleanup that is not needed.
If you are running out of free memory, or if some other application needs it, the garbage collector will do the appropriate cleanup. In almost every situation the garbage collector will know much more about the current memory situation than you can possibly anticipate when writing the code.
If you are using objects that implement the IDisposable interface, you should call the Dispose method to free unmanaged resources, but all other objects are handled by the garbage collector. Managed objects normally don't leak memory at all.
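For example (the file name is just a placeholder), a StreamReader owns an unmanaged file handle and should be disposed, while the strings it produces are left to the GC:

// using System.IO;
using (var reader = new StreamReader("listings.txt")) // Dispose runs at the end of the block
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the line; the managed string objects need no explicit cleanup
    }
}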
Do you need your memory usage to stay low? Absent an actual functional problem, high memory usage in and of itself is not an issue.
How large is the memory usage growing? It may be that .NET is just "settling" effectively.
It's not really clear exactly how you're doing this, but the general principle sounds okay. I suggest you take the database work out of the equation - just comment out whichever line would actually submit to the database. See how much memory that uses. Other than the StreamReader (or whatever) you shouldn't have anything else that needs disposing if you're not touching the database - just building batches of transformed objects and throwing them away.
Related
Using the Visual Studio Concurrency Visualizer I now see why I don't get any benefit from switching to Parallel.For: only 9% of the time is the machine busy executing the code; the rest is 71% synchronization and 17% memory management (1).
Checking all the orange stripes on the diagram below I discovered that GC is always involved (2).
After reading all these interesting topics...
Why do I have a lock here?
https://blog.marcgravell.com/2011/10/assault-by-gc.html
Prevent .NET Garbage collection for short period of time
https://devblogs.microsoft.com/premier-developer/understanding-different-gc-modes-with-concurrency-visualizer/
.. am I right in assuming that all these threads have to contend for a single memory-management resource, and that by removing the need to allocate objects on the heap my scenario will improve considerably? For example, using structs instead of classes, arrays instead of dynamic lists, and so on?
I have a lot of work to do to bend my code in this direction. Just wanted to be sure before starting.
From your screenshot it seems like memory allocation is blocked while waiting for GC to complete. There are server and workstation GC modes, and GC may be concurrent or not, but all options need to block threads for at least a little while. I would check in more detail how often and how much time you are spending in GC, and how often gen 0/1 and gen 2 collections run.
I believe that each thread has a separate ephemeral segment it uses for allocations, so that it would not need to synchronize allocations, unless it needs a new segment, or the allocation is on the large object heap. But I'm unable to find a reference for this.
In any case, you will likely benefit from reducing the amount and size of allocations. If possible, use an object pool or memory pool to reuse memory. You might also benefit from increasing the amount of memory and checking the application for memory leaks. A general recommendation for memory is that there should be two types of allocations:
Small temporary allocations that only live for a short duration, like a temporary object that lives for the duration of a method call.
Long lived allocations of any size that live for the duration of the "application".
If this pattern is followed, almost all garbage should be collected in gen 0/1, and gen 2 collections should be fairly rare.
It also depends a bit on whether you are allocating many small objects or large chunks of memory. If the former, you may consider using structs, since locals of struct type are allocated on the stack. If the latter, you also need to consider memory fragmentation; this should also improve with a memory pool that only allocates fixed-size chunks of memory.
Edit:
At its very simplest, an object pool could be something like this:
using System;
using System.Collections.Concurrent;

public class ObjectPool<T>
{
    private readonly ConcurrentBag<T> pool = new ConcurrentBag<T>(); // thread-safe store of reusable instances
    public T Get(Func<T> constructor) => pool.TryTake(out var result) ? result : constructor(); // reuse if available, else create
    public void Return(T obj) => pool.Add(obj); // hand the instance back for later reuse
}
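For illustration, here is one way the pool above might be used, assuming it holds fixed-size byte buffers (the 64 KB size is arbitrary):

var pool = new ObjectPool<byte[]>();
byte[] buffer = pool.Get(() => new byte[64 * 1024]); // reuse an old buffer if one is available
try
{
    // ... fill and process the buffer ...
}
finally
{
    pool.Return(buffer); // hand the buffer back instead of letting it become garbage
}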
This assumes that the objects represent identical resources, like byte arrays of some fixed size. But there are also existing implementations:
.NET Core MemoryPool
ASP.NET Core object pool
Stack Overflow question regarding object pools
Memory Management

The Memory Management report shows the calls where memory management blocks occurred, along with the total blocking times of each call stack. Use this information to identify areas that have excessive paging or garbage collection issues.
Furthermore
Memory management time

These segments in the timeline are associated with blocking times that are categorized as Memory Management. This scenario implies that a thread is blocked by an event that is associated with a memory management operation such as Paging. During this time, a thread has been blocked in an API or kernel state that the Concurrency Visualizer is counting as memory management. These include events such as paging and memory allocation. Examine the associated call stacks and profile reports to better understand the underlying reasons for blocks that are categorized as Memory Management.
Yes, allocating less will likely have a large benefit for your resources and efficiency, but that is almost always the case on hot paths and in thrashed applications.
Heap allocations, and in particular Large Object Heap (LOH) allocations, are costly; they also create extra work for the garbage collector and can fragment your memory, causing even more inefficiency. The less you allocate, the more memory you reuse, and the more you can keep on the stack, the better off you are (in general).
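As one hedged illustration of reusing memory instead of allocating it, here is a sketch using ArrayPool<T> from the System.Buffers package (the 1 MB size is arbitrary):

using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(1024 * 1024); // rent at least 1 MB from the shared pool
try
{
    // ... use the buffer; note it may be longer than requested ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer); // return it so the next caller reuses it instead of allocating
}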
This is also where you would learn to use a good memory profiler and get to know your garbage collector.
That said, this would not be the only tool you would use to make your application less allocatey. A good memory profiler will go a long way, combined with learning how to read its results and make changes based on them.
Writing minimal-allocation code is an art form, and one worth learning.
Also, as @mjwills pointed out in the comments, you should run any change through your benchmark software as well; removing allocations at the cost of CPU time won't make sense. There are many ways to speed up code, and low allocation is just one of many approaches that may help.
Lastly, I would suggest following Marc Gravell and his blogs as a start (Mr DeAllocation), getting to know your garbage collector and how the generations work, and using tools like memory profilers and benchmarking frameworks for performant, silky-smooth production code.
I've read many questions on this network regarding when one should use GC.Collect(), but none address this concern.
If we do manage to use it, and it frees a large amount of memory, is this necessarily a good thing, rather than simply relying on the garbage collector to manage the memory?
Regardless of how idle my ASP.NET application stays, how much time passes, or what operations are done, the memory that gets freed by explicitly calling GC.Collect() is never released on its own (only memory above that level is, like it's some kind of a cache).
One theory I have in mind is that perhaps the reason those 3GB+ are not freed (but rather the application frees memory above that threshold) might be reuse of parts of it, or maybe some other helpful optimization, at the cost of, of course, using that memory throughout the run-time of the application.
Could anyone share some insight on these concerns? I'm following best practices and disposing objects whenever given the chance. No unmanaged resource is left to be finalized - it's always disposed by me.
EDIT: The Garbage Collector used is the workstation one, not the server.
The garbage collector's job is to simulate a machine with infinite memory. If your system has enough unallocated memory to continue functioning, why should the collector spend CPU cycles performing the hard work of identifying unreferenced objects? That's a performance cost with zero benefit.
If left alone, the collector will run and free those 3GB when the system starts to run low on memory. Until that point it is dormant.
There is a possible memory leak in this statement for a very large string (tempText can grow as big as ~10 MB).
string strXML = new string(tempText.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
Memory allocated for strXML doesn't get released even after exiting the function, and I have to call this function multiple times. Any possible solution, without having this string as a class variable?
I'm not very familiar with C# memory management, can someone shed some light on this issue?
The garbage collector doesn't collect objects the instant their lifetime ends. It executes periodically, based on its perceived need, to free up memory. The string will eventually be collected at some indeterminate point in time after it is no longer referenced by any rooted object.
When you build up a large object, it will remain in memory for much longer than other small objects.
Read up on the large object heap and generation 2 garbage collection... it gets technical but those two terms should suffice to point out what's going on here.
That's why you are not seeing the garbage collector reclaim that memory as quickly as you'd like.
In order to overcome this, either allocate work buffers once and reuse them, or work on the data in smaller chunks.
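As a sketch of the "reuse a work buffer" idea applied to the filtering in the question (the class and field names are illustrative, not the asker's code):

using System.Text;
using System.Xml;

static class XmlCleaner
{
    // One reusable buffer instead of a fresh char[] + string per call (not thread-safe).
    private static readonly StringBuilder _buffer = new StringBuilder();

    public static string StripInvalidXmlChars(string tempText)
    {
        _buffer.Clear();                    // keep the buffer's capacity between calls
        foreach (char ch in tempText)
            if (XmlConvert.IsXmlChar(ch))   // keep only characters that are legal in XML
                _buffer.Append(ch);
        return _buffer.ToString();          // the result string is still allocated, but the intermediate char[] from ToArray() is avoided
    }
}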
I have a business app that I have written, that effectively recurses through a directory structure looking for specific Excel files, and stores their addresses. It then loops through these files and parses them by creating a DocumentParser object for each file, this is done one at a time, and not async. The software seems to be very stable, so much so that the business would like to run it to recurse through a massive directory containing upwards of 10000 relevant Excel files.
My question is, as I am creating a new DocumentParser object each time, will the GC be effective enough to discard each of the objects when they go out of scope, ie when that Excel sheet has been parsed, or is there a way I can monitor this and where necessary manually do a GC? I've never had to deal with such large amounts of data before, generally only testing it on a maximum of 40-50 Excel files at a time.
Thanks.
The GC is a very complex piece of software, and the GC alone knows when garbage collection is necessary. So my advice is to leave the GC on its own.
Additionally: the GC will handle these masses of objects. Perhaps you will notice a decrease in performance. If this is a problem, you can try to optimize your code. But not prematurely.
I would leave the GC to its business. 10,000 objects is not really much work for the GC. And it's likely the cost of the GC work will be much lower than the cost of the Excel work. So it's not worth complicating your design to tweak things for the GC. If you end up with so many files to process that your application can't finish in time, it's most likely going to be the speed of the Excel processing holding you up.
However one note which may be relevant: if the DocumentParser is using unmanaged memory in its work with the Excel file, you can use GC.Add/RemoveMemoryPressure to indicate to the GC the real added cost when opening the file. If you didn't write the DocumentParser yourself, the author may already be doing this.
The issue here is that you may have a managed object that costs something in the order of 100 bytes, which allocates a large amount of unmanaged memory when it does Excel work. The GC will have no way of knowing this, so these methods help notify the GC that there is more memory pressure than it was aware of. This may change its behaviour in how/when it decides to collect, which may lead to the application maintaining a lower memory footprint. If the application's memory usage balloons out over time, then you may start seeing some slowdowns from lengthy garbage collections and possibly paging on the machine (depending on how much memory you have). You'll want to keep an eye on its memory usage to make sure it's not leaking memory as it processes - a memory profiler may be helpful there.
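A hedged sketch of that idea; the 50 MB figure is purely illustrative, and you would substitute a real estimate of the workbook's unmanaged footprint:

long unmanagedBytes = 50L * 1024 * 1024;     // illustrative estimate only
GC.AddMemoryPressure(unmanagedBytes);        // hint that this object is heavier than its managed size suggests
try
{
    // ... create the DocumentParser, parse the Excel file, release it ...
}
finally
{
    GC.RemoveMemoryPressure(unmanagedBytes); // remove the hint once the unmanaged memory is freed
}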
You don't need to manually call the GC unless you are holding some very large resource which is not the case in your situation. The GC will tweak itself with every call and if you call it manually you will just disrupt its internal profiling data.
BTW GC can collect stuff not only when it goes out of scope but also after its last usage (i.e. while it is still in scope but the variable is not used anymore).
Yes and no - The GC is effective enough to release when it needs to, but you can't generally be sure when that is.
There is a way to force a GC collection, but it's generally considered bad practice in production code because the effect of forcing a stack walk when it's not required is worse than using a bit of extra memory until the GC decides it needs to free resources in order to allocate more objects.
I have a few SortedList<> and SortedDictionary<> structures in my simulation code and I add millions of items to them over time. The problem is that the garbage collector does not deallocate memory quickly enough, so there is a huge hit on the application's performance. My last option was to engage the GC.Collect() method so that I can reclaim that memory back. Has anyone got a different idea? I am aware of the Flyweight pattern, which is another option, but I would appreciate other suggestions that would not require huge refactoring of my code.
You are fighting the "There's no free lunch" principle. You cannot assume that stuffing millions of items into a list isn't going to affect perf. Only the SortedList<> should be a problem; it is going to start allocating memory in the Large Object Heap. That allocation isn't going to be freed soon; it takes a gen #2 collection to chuck stuff out of the LOH again. This delay should not otherwise affect the perf of your program.
One thing you can do is avoid the multiple copies of the internal array that SortedList<> will jam into the LOH as it keeps growing. Try to guess a good value for Capacity so it pre-allocates the large array up front.
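For example (the value type and the item-count estimate below are purely illustrative):

// System.Collections.Generic; pre-sizing allocates the backing arrays once
// instead of repeatedly growing and copying them into the LOH.
// "Reading" and the 2,000,000 estimate are hypothetical.
var list = new SortedList<long, Reading>(2000000);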
Next, use Perfmon.exe or TaskMgr.exe and look at the page fault delta of your program. It should be quite busy while you're allocating. If you see low values (100 or less) then you might have a problem with the paging file being fragmented. A common scourge on older machines that run XP. Defragging the disk and using SysInternals' PageDefrag utility can do extraordinary wonders.
I think the SortedList uses an array as a backing field, which means that a large SortedList gets allocated on the Large Object Heap. The Large Object Heap can become fragmented, which can cause an out-of-memory exception even though in principle there is still enough memory available.
See this link.
This might be your problem, as intermediate calls to GC.Collect() prevent the LOH from getting badly fragmented in some scenarios, which explains why calling it helps you reduce the problem.
The problem can be mitigated by splitting large objects into smaller fragments.
I'd start with doing some memory profiling of your application to make sure that the items you remove from those lists (which I assume is happening, from the way your post is written) are actually properly released and not hanging around somewhere.
What sort of performance hit are we talking and on what operating system? If I recall, GC will run when it's needed, not immediately or even "soon". So task manager showing high memory allocated to your application is not necessarily a problem. What happens if you put the machine under higher load (e.g. run several copies of your application)? Does memory get reclaimed faster in that scenario or are you starting to run out of memory?
I hope the answers to these questions will help point you in the right direction.
Well, if you keep all of the items in those structures, the GC will never collect them, because the structures still hold references to them.
If you need the items in the structures to be collected, you must remove them from the data structure.
To clear the entire data structure, try using Clear() and setting the data structure reference to null. If the data is still not getting collected fast enough, call GC.Collect().
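A minimal sketch of that suggestion ("readings" is a hypothetical SortedList field):

readings.Clear();   // drop the references the structure holds
readings = null;    // make the structure itself unreachable
GC.Collect();       // last resort, only if the memory really must be reclaimed immediately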