I'm trying to insert data with big column values (1-25 MB), and after a couple of seconds one of my nodes dies, either by throwing an OOM or by getting stuck in an endless GC loop.
It usually tries to flush CFs, but then it logs "Unable to reduce heap usage since there are no dirty column families".
Since the log advised me to reduce memtable/cache sizes, I tried to figure out what was using up all this memory in order to adapt my settings, so I ran nodetool flush / invalidaterowcache / invalidatekeycache and then triggered a GC through jconsole.
Unfortunately, my memory usage stayed high (>60%) even though the server is idling.
So, my questions are: why is the server running out of memory when inserting big values, and why isn't it giving some of that memory back?
Edit
I did a heap dump and the heap is full of byte[], mainly referenced by org.apache.cassandra.io.sstable.IndexSummary$KeyPosition.
I don't understand how this is possible since everything is supposed to have been flushed.
It seems to me that you hit the infamous memory fragmentation issue. I'm not sure whether Cassandra avoids some of the fragmentation issues, but in general any .NET program, and potentially any Windows program, can run into this.
When you allocate anything above 85,000 bytes (yes, an odd number, but that's what it is), the object is stored on the Large Object Heap. The LOH gets GC'ed only as part of generation 2, and worse, it never gets compacted. The reason is partly down to the way the OS is implemented.
Result: when you store objects of, say, 2 MB, 5 MB, 3 MB, 2 MB, 3 MB and the 2 MB objects get GC'ed, you potentially have 4 MB free. But if you then try to create a new 3 MB object, it cannot be placed in those gaps because of the fragmentation (two holes of 2 MB each), so it goes to the top of the heap. Eventually, that runs out of room. So: there can be enough total memory available, but you will get an OOM regardless, due to this fragmentation.
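To illustrate the threshold, here is a minimal C# sketch (the exact cutoff and the fact that LOH objects are reported as generation 2 are .NET Framework behaviour, so treat the printed numbers as typical, not guaranteed):

```csharp
using System;

class LohDemo
{
    static void Main()
    {
        // Arrays of roughly 85,000 bytes or more go straight to the Large Object Heap.
        byte[] small = new byte[80 * 1000];   // regular (small object) heap
        byte[] large = new byte[90 * 1000];   // Large Object Heap

        // On .NET Framework, LOH objects are reported as generation 2
        // immediately after allocation.
        Console.WriteLine(GC.GetGeneration(small)); // typically 0
        Console.WriteLine(GC.GetGeneration(large)); // typically 2
    }
}
```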
This issue is mostly seen with 32-bit x86 applications, whether on 64-bit Windows (under WOW64) or on 32-bit Windows. 64-bit applications also suffer from fragmentation, but since the virtual address space is much larger, you first hit paging (things getting really slow) before you hit actual fragmentation issues.
If this is indeed the issue (you can inspect the fragmentation visually with VMMap or WinDbg), you can solve it by allocating a large pool of byte arrays up front and reusing buffers from that pool, thus preventing fragmentation.
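Something along these lines (just a sketch; the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;

// Illustrative fixed-size buffer pool: allocate a few large buffers up front
// and reuse them, instead of repeatedly allocating multi-MB arrays that
// churn and fragment the Large Object Heap.
class BufferPool
{
    private readonly Queue<byte[]> _free = new Queue<byte[]>();
    private readonly object _lock = new object();
    private readonly int _bufferSize;

    public BufferPool(int bufferCount, int bufferSize)
    {
        _bufferSize = bufferSize;
        for (int i = 0; i < bufferCount; i++)
            _free.Enqueue(new byte[bufferSize]);
    }

    public byte[] Rent()
    {
        lock (_lock)
        {
            // Fall back to a fresh allocation only if the pool is exhausted.
            return _free.Count > 0 ? _free.Dequeue() : new byte[_bufferSize];
        }
    }

    public void Return(byte[] buffer)
    {
        lock (_lock) { _free.Enqueue(buffer); }
    }
}
```

On more recent .NET versions, System.Buffers.ArrayPool&lt;byte&gt;.Shared gives you roughly this behaviour out of the box.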
I investigated the heap dump with MAT and it turns out that the OutOfMemory happened because a lot of memory was used by Thrift.
Since I had to transfer big chunks of data for my column values, I had changed the following settings to 128, to "be safe":
thrift_framed_transport_size_in_mb
thrift_max_message_length_in_mb
But it turns out that Thrift allocates one byte[2 * thrift_max_message_length_in_mb] per receiving thread, and I had three of those threads. So I was using 3 × 2 × 128 MB = 768 MB just for receive buffers...
Changing the settings to 32 fixed my issue.
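For reference, the relevant cassandra.yaml entries ended up looking roughly like this:

```yaml
# cassandra.yaml (roughly) - each receiving thread allocates a buffer of
# 2 * thrift_max_message_length_in_mb, so keep these as small as your
# largest column values allow.
thrift_framed_transport_size_in_mb: 32
thrift_max_message_length_in_mb: 32
```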
Related
Rewording my question in an attempt to make it on-topic:
We have a client (only one client out of many) that is consistently getting an Out of Memory exception with our software. I feel like we've eliminated the usual suspects and am looking for ideas about other, less standard causes of an OOM. Specifically, since this seems to be specific to a single customer, could it be caused by something wrong in the hardware, the OS, or the .NET install?
Here are the things I am aware of that cause an OOM and why I believe we've eliminated them as suspects:
1 - OOM caused by the system running out of memory.
Why Not? Because the system has several GB available when these exceptions occur.
2 - OOM caused by the process running out of memory due to over-allocation or memory leaks.
Why Not? Because the process is using only about 100MB of memory at the time of the exceptions. We have monitored the memory usage for days (on the system in question) and have not noticed any significant increase in memory usage.
3 - OOM caused by running out of other system resources such as file handles, etc.
Why Not? The exceptions are happening, exclusively, during run-of-the-mill memory allocations, not while opening a file or connecting to a socket.
4 - OOM caused by attempting to allocate a large array with excessive memory fragmentation.
Why Not? The memory blocks that we are allocating are fairly small (640x480x2, for the most part). With so much memory available, I have trouble believing that the address space could be so fragmented that an allocation of that size would fail.
So, just to be clear, I am not asking "Why doesn't my code run?" My code does run, on all machines but one. I'm not asking anyone to debug my code. My question is: "What other possible causes, besides those we've eliminated, could be resulting in an Out of Memory exception?" Or, "Am I missing something that could have caused me to eliminate one of the known causes prematurely?"
As an FYI for anyone struggling with similar issues: I think we've finally hunted down the cause of this bug. It turns out the OpenGL drivers on certain cheaper on-board Intel graphics chips had a problem with the way we were writing bitmap data to the same texture ID over and over. I changed the code to delete the texture and allocate a new ID each time, and the problem seems to have gone away.
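Roughly, the change was along these lines (a sketch rather than our actual code; I'm assuming the OpenTK GL bindings here, and the method name is made up):

```csharp
using System;
using OpenTK.Graphics.OpenGL;

static class TextureUpload
{
    // Instead of re-uploading bitmap data into the same texture ID forever,
    // delete the old texture and create a fresh ID for each new bitmap.
    public static int UploadBitmap(int oldTextureId, int width, int height, IntPtr pixels)
    {
        if (oldTextureId != 0)
            GL.DeleteTexture(oldTextureId);   // release the previous texture object

        int id = GL.GenTexture();             // brand-new texture ID
        GL.BindTexture(TextureTarget.Texture2D, id);
        GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba,
                      width, height, 0, PixelFormat.Bgra, PixelType.UnsignedByte, pixels);
        return id;
    }
}
```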
I have been reading about out-of-memory errors for some time now, and I figured out that in most cases an out-of-memory exception (at least in .NET) isn't really caused by the system actually running out of memory, but rather by the system being unable to allocate a contiguous block of the requested size due to fragmentation.
What I don't really understand is that I've been in a situation where I still get an out-of-memory exception even when I try to allocate a large chunk of contiguous memory at application startup (e.g. loading 100 images). Since the application has just started up, presumably not many allocations/de-allocations have happened before that point, so there should be plenty of free contiguous blocks available. In that case, why would the application still be hit by a memory fragmentation issue?
Note that I'm also fairly certain that the issue was not caused by the application actually exhausting its memory quota, because loading 100 images in my specific case only takes ~200 MB or so.
In my experience, Out of Memory mostly means poor object management. It's symptomatic of creating too many objects too fast while the GC has a hard time keeping up. Setting aside the few products that take memory and never give it back (like SQL Server), out of memory can usually be prevented with caching and a well-defined object life cycle.
I have a simple C# console app which imports text data into SQL.
It takes around 300K of memory and 80% of the CPU. There are 2 GB of RAM available at any time, and yet the page fault counter shows 500K.
The app is 32-bit, the OS is either Windows 2000 or XP (32-bit), and the runtime is .NET 3.5.
Can anyone explain what the problem could be and how I can investigate this further?
EDIT: I am now certain that the page faults are related to the disk I/O (reads). I commented out the SQL part, and the pure disk read alone generates that high number.
EDIT2: There are 200 hard faults/sec and 4000 soft faults/sec on average.
I wonder if the same would happen on Windows Server 2008.
First, how do you measure the memory the app is using? If you're looking at the "working set", that's only the part that resides in physical memory. You should also take a look at the "VM Size" (or "Commit Size"), which is the actual virtual memory your process takes up.
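For a quick check, you can read both kinds of numbers from System.Diagnostics (a small sketch):

```csharp
using System;
using System.Diagnostics;

class MemoryReport
{
    static void Main()
    {
        Process p = Process.GetCurrentProcess();

        // Pages currently resident in physical RAM (the "Working Set").
        Console.WriteLine("Working set:   {0:N0} bytes", p.WorkingSet64);

        // Private committed memory (roughly the "VM Size" / "Commit Size" columns).
        Console.WriteLine("Private bytes: {0:N0} bytes", p.PrivateMemorySize64);

        // Total virtual address space reserved by the process.
        Console.WriteLine("Virtual size:  {0:N0} bytes", p.VirtualMemorySize64);
    }
}
```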
If the Windows kernel's Balance Set Manager thinks that your app is inactive, or should be left behind to give other processes more resources, it can decide to reduce the working set size. If the working set is smaller than what your application actually needs to work on, you could easily see a lot of page faults, because it simply becomes a race between the Balance Set Manager and the application. Usually the Balance Set Manager monitors memory usage and can also decide to increase the working set size accordingly. However, this can be prevented in certain circumstances, such as low free physical memory, high I/O (cache pressure on physical memory), low process priority, the background/foreground status of the application, etc.
It can also simply be the behavior of the .NET garbage collector when a vast number of small memory blocks get allocated and released in a very short time, stressing both allocation and freeing. The "VM Size" could stay around the same, but behind the scenes the process could be continuously allocating/freeing memory, causing continuous page faults.
Also be aware that the DLLs the process loads are counted in the process statistics. It might not be your app but one of the COM or .NET DLLs you are using that is causing this behavior. You can deduce the actual culprit by changing your application's behavior (e.g. removing the DB access code and leaving only the object allocation code) to see which component is actually causing the thrashing.
EDIT: About your question on the GC's impact on memory thrashing: the CLR actually grows the heap dynamically and gives memory back to the OS as needed. That does not occur synchronously. The GC runs behind the scenes and frees memory in large chunks to avoid hindering application performance. Say you are allocating many small objects and freeing them almost immediately. Those references stay in memory for a moment before being freed. It is easy to imagine this becoming a head-to-head race between the garbage collector and the memory-allocating code. While the GC eventually catches up, the new allocations must be satisfied from freshly committed memory, not the old memory, because the old memory has not been freed yet. Since the memory we are actually working on stays about the same size, the Balance Set Manager may not see a reason to give our process a bigger working set: we stay around the same physical memory size but constantly need newly allocated memory rather than more memory overall, hence the page faults.
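A contrived sketch of that allocate-and-drop pattern (illustrative only; the actual page-fault behaviour depends on the runtime and OS):

```csharp
using System;

class ChurnDemo
{
    static void Main()
    {
        for (int i = 0; i < 10000000; i++)
        {
            // Allocate a small, short-lived buffer and immediately drop it.
            // The live set stays tiny, but the process keeps demanding freshly
            // committed pages until the GC catches up.
            byte[] temp = new byte[256];
            temp[0] = 1;

            if (i % 1000000 == 0)
                Console.WriteLine("Gen 0 collections so far: " + GC.CollectionCount(0));
        }
    }
}
```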
Page faults are normal. Memory gets swapped out and when you next access it that's a page fault and the system brings it back. This is by design.
I've got an app running on my machine right now with 500 million page faults. There's nothing to worry about!
Page faults mean memory issues.
Consider increasing memory if you have excessive page faults.
You may have a large working set size.
The working set is the set of memory pages currently loaded in RAM. This is measured by Process\Working Set. A high value might indicate that you have loaded a number of assemblies.
Process\Working Set has no specific threshold value to watch, although a high or fluctuating value can indicate a memory shortage. A high or fluctuating value accompanied by a high rate of page faults clearly indicates that your server does not have enough memory.
Further reading:
Check memory under System Resources in the following MSDN article:
http://msdn.microsoft.com/en-us/library/ff647791.aspx#scalenetchapt15_topic9
Please provide some code to investigate.
A possible answer to this, which I am currently testing in my application: break up your working set into smaller chunks and work with those chunks.
For instance, I have a large list of objects (9,000-30,000). If I break that list up into chunks of 500 or so at a time, only those 500 objects need to stay hot in memory while I work on them.
You will want to increase or decrease the chunk size until you can work through a chunk fast enough that the OS keeps it in memory. This is theory; I haven't fully tested it yet, but it should work.
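Something like this is what I mean (a sketch; ProcessInChunks is a made-up helper):

```csharp
using System;
using System.Collections.Generic;

class ChunkedProcessing
{
    // Process a large list in fixed-size chunks so that only ~chunkSize items
    // need to be "hot" in memory (and in the working set) at any one time.
    static void ProcessInChunks<T>(IList<T> items, int chunkSize, Action<IList<T>> processChunk)
    {
        for (int start = 0; start < items.Count; start += chunkSize)
        {
            int count = Math.Min(chunkSize, items.Count - start);
            var chunk = new List<T>(count);
            for (int i = 0; i < count; i++)
                chunk.Add(items[start + i]);

            processChunk(chunk);   // work on ~500 objects, then let them go
        }
    }

    // Example usage: ProcessInChunks(allObjects, 500, chunk => Handle(chunk));
}
```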
After reading a few enlightening articles about memory in .NET, I learned that Out of Memory does not refer to physical memory (see 597499).
I thought I understood why a C# app would throw an out of memory exception -- until I started experimenting with two servers: both have 2.5 GB of RAM, run Windows Server 2003, and run identical programs.
The only significant difference between the two is that one has 7% of its hard drive storage left and the other more than 50%.
The server with 7% storage space left is consistently throwing an out of memory exception while the other is performing consistently well.
My app is a C# web application that processes hundreds of MB of String objects.
Why would this difference occur, seeing that the most likely reason for the out of memory issue is running out of contiguous virtual address space?
All I can think of is that you're exhausting the virtual memory. Sounds like you need to run a memory profiler on the app.
I've used the Red Gate profiler in similar situations in the past. You may be surprised how much memory your strings are actually using.
Is the paging file fragmentation different on each machine? High fragmentation could slow down paging operations and thus exacerbate memory issues. If the paging file is massively fragmented, sort it out, e.g. bring the server offline, set the paging file size to zero, defrag the drive, and re-create the paging file.
It's hard to give any specific advice on how to deal with perf problems with your string handling without more detail of what you are doing.
Why would this difference happen seeing that the most likely reason for the out of memory issue is out of contiguous virtual address space?
With only 7% of the hard disk free, your server is probably running out of space to page out memory from your process or from other processes, so it has to keep everything in RAM, and therefore you fail to allocate additional memory more often than on the server with 50% free space.
What solutions do you guys propose?
Since you've already run a profiler and seen 600 MB+ of usage from all the string data, you need to start tackling this problem.
The obvious answer would be not to hold all that data in memory. If you are processing a large data set, then load a bit, process it, throw that bit away, and load the next bit, instead of loading it all up front.
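For example, if the strings come from a large text file, something like this keeps only one line in memory at a time (a sketch; the file name and ProcessLine are placeholders):

```csharp
using System.IO;

class StreamingExample
{
    static void Main()
    {
        // Instead of File.ReadAllText(...), which materialises the whole file
        // as one giant string, read and process one line at a time so only a
        // small piece of the data is ever held in memory.
        using (var reader = new StreamReader("huge-input.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                ProcessLine(line);   // hypothetical per-record processing
            }
        }
    }

    static void ProcessLine(string line) { /* ... */ }
}
```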
If it's data you need to serve, look at a caching strategy like LRU (least recently used) and keep only the hottest data in memory but leave the rest on disk.
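If you want to roll your own, a minimal LRU cache can be as simple as a dictionary plus a linked list (illustrative sketch, not production code):

```csharp
using System.Collections.Generic;

// Minimal LRU cache: keeps at most 'capacity' entries, evicting the least
// recently used one when a new entry would exceed the limit.
class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map
        = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order
        = new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);      // mark as most recently used
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (_map.TryGetValue(key, out existing))
        {
            _order.Remove(existing);
            _map.Remove(key);
        }
        else if (_map.Count >= _capacity)
        {
            // Evict the least recently used entry (tail of the list).
            var lru = _order.Last;
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);
        }
        var node = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        _map[key] = node;
    }
}
```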
You could even offload the strings into a database (in-memory or disk-based) and let that handle the cache management for you.
A slightly left-of-field solution I've had to use in the past was simply compressing the string data in memory as it arrived and decompressing it again when needed, using SharpZipLib. Surprisingly, it wasn't that slow.
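The same idea can be sketched with the framework's built-in GZipStream instead of SharpZipLib (a rough illustration, not the code I used):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class StringCompressor
{
    // Compress a string to a byte[] in memory; large text often shrinks a lot.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }

    // Decompress back to the original string only when it's actually needed.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
                result.Write(buffer, 0, read);
            return Encoding.UTF8.GetString(result.ToArray());
        }
    }
}
```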
I would agree that your best bet is to use a memory profiler. I've used .NET Memory Profiler 3.5 and was able to diagnose the issue, which in my case was undisposed Regex objects. They have demo tutorials which will walk you through the process if you're not familiar.
As to your question, any single reference to the strings, the jagged array for instance, would still prevent the strings from being collected. Without knowing more about your architecture, it's tough to make a specific recommendation. I would suggest trying to optimize your app before adding memory, though; it will come back to bite you later.
An OutOfMemoryException is more likely to indicate fragmentation in your page file - not that you are out of RAM or disk space.
It is generally (wrongly) assumed that the page file is used as a swap disk - that RAM overflow is written to the page file. All allocated memory is stored in the page file and only data that is under heavy usage is copied to RAM.
There's no simple code fix to this problem other than trying to reduce the memory footprint of your application. But if you really get desperate you can always try PageDefrag, which is a free application originally developed by SysInternals.
There are a few tricks to increase available memory (I don't know if they work with a web app, but it looks like they do):
"Out of memory? Easy ways to increase the memory available to your program"
http://blogs.msdn.com/b/calvin_hsia/archive/2010/09/27/10068359.aspx
What is the largest heap you have personally used in a managed environment such as Java or .NET? What were some of the performance issues you ran into, and did you end up getting a diminishing returns the larger the heap was?
I work on a 64-bit .NET system that typically uses 9-12 GB, and sometimes as much as 20 GB. I have not seen any performance problems even while garbage collecting, and I have been looking hard, as I was not expecting it to work so well.
An earlier version hung on to some objects for too long resulting in occasional GCs that freed up 3GB+. Even then, there was no noticeable impact on performance. The system is running on a 16-core server with 32GB RAM, which probably helps...
In .NET on 32-bit Windows, you can only really get to about 1.4 GB of memory usage before things start getting really screwy (out of memory exceptions). This is due to a limitation in 32-bit Windows that prevents a single process from using more than 2 GB of address space. There is a /3GB switch you can put in your boot.ini, but that will only get you a little bit further. If you want to use lots of memory, you should seriously consider running on a 64-bit version of Windows.
I currently have a production application with 6 GB of memory. You'll need a 64-bit box as well for the JVM to be able to address that much.
The garbage collector is really the only thing (that I've found so far) where performance degrades with size, and then only if you manually kick off a System.gc(), which forces the JVM to bring everything to a screeching halt as it traverses 6 GB worth of objects. It takes a good 20 seconds, too. The default GC behavior does not do this, BTW; you have to be dumb enough to make it do that. JVM tuning at this size is also worth researching.
You can also find things like distributed and clustered JVMs; sorry, I don't have any good references, as I didn't look into this option too closely, although I did find mentions of larger installations.
I am unsure what you mean by heap, but if you mean memory used, I have used quite a bit, 2 GB+. I have a web app that does image processing, and it requires loading two large scan files into memory for analysis.
There were performance issues. Windows would swap out lots of RAM, and that would then create a lot of page faults. There was never any need for more than two images at a time, as all requests were against those images (I only allowed one session per image set at a time).
For instance, setting up the files for initial viewing would take about 5 seconds. Simple analysis and zooming would be fairly fast once the data was in memory, on the order of 0.1 to 0.5 seconds.
I still had to optimize, so I ended up pre-parsing the files, chopping them into smaller pieces, and working only with the pieces the user needed at the time.
I have used from 2 GB to 5 GB of memory in Java, but usually when I get to more than 2 GB I really start thinking about memory optimization. Diminishing returns can range from not optimizing when it's necessary (because you have a lot of memory) to not having memory left over for the OS/disk caches (which can help your application overall).
For Java, I recommend watching your memory usage per generation over time. Do you create a lot of temporary objects, or have long-lasting objects that consume a lot of memory? A lot of memory optimization can be done once you know those things.