I want to create a function that makes my application allocate X bytes of RAM for Y seconds
(I know there's a 1.2 GB limit on objects).
Is there a better way than this?
[MethodImpl(MethodImplOptions.NoOptimization)]
public void TakeRam(int X, int Y)
{
    Byte[] doubleArray = new Byte[X];
    System.Threading.Sleep(Y*60)
    return
}
I would prefer to use unmanaged memory, like this:
IntPtr p = Marshal.AllocCoTaskMem(X);
Thread.Sleep(Y);
Marshal.FreeCoTaskMem(p);
Otherwise the CLR garbage collector may play tricks on you.
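For completeness, here is a minimal end-to-end sketch of that approach (class and parameter names are illustrative), assuming X is a byte count and Y is in milliseconds; the fill loop is there because unmanaged pages are only committed once they are touched:
using System;
using System.Runtime.InteropServices;
using System.Threading;

public static class RamHog
{
    public static void TakeRam(int x, int y)
    {
        IntPtr p = Marshal.AllocCoTaskMem(x);
        try
        {
            // Touch every byte so the pages are actually committed, not just reserved.
            for (int i = 0; i < x; i++)
                Marshal.WriteByte(p, i, 0xFF);
            Thread.Sleep(y);
        }
        finally
        {
            Marshal.FreeCoTaskMem(p); // always give the memory back
        }
    }
}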
You have to KeepAlive your block of memory; otherwise the GC can deallocate it. I would even fill the memory, so that you are sure the memory was really allocated and not merely reserved:
[MethodImpl(MethodImplOptions.NoOptimization)]
public void TakeRam(int X, int Y)
{
    Byte[] doubleArray = new Byte[X];
    for (int i = 0; i < X; i++)
    {
        doubleArray[i] = 0xFF;
    }
    System.Threading.Thread.Sleep(Y);
    GC.KeepAlive(doubleArray);
}
I'll add that, on 64-bit, the maximum size of an array is something less than 2 GB, not 1.2 GB.
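For reference, a quick probe of that ceiling; the figure below is the documented maximum length for a single-byte array on the 64-bit CLR, and even this can fail with OutOfMemoryException if the process cannot find the memory:
// Just under 2^31 elements (2,147,483,591 = 0x7FFFFFC7) is the largest byte[]
// the CLR will grant; arrays whose total size exceeds 2 GB additionally
// require the <gcAllowVeryLargeObjects> configuration flag (.NET 4.5+).
byte[] almostTwoGig = new byte[2147483591];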
Unless you write to the memory, it will not actually get allocated. Doing repeated garbage collections does not lower the memory usage significantly: the program grows to 1,176,272 K with garbage collection and 1,176,312 K without it (that is, with the line GC.GetTotalMemory(true); commented out). When the line byteArray[i] = 99; is commented out instead, the program only grows to 4,464 K.
Changed the name of the array; it doesn't hold doubles.
Changed it to write to the memory.
Changed the name of the sleep function.
You can see in Task Manager that the memory gets allocated to the running process:
This works:
using System;
using System.Runtime.CompilerServices;

namespace NewTest
{
    class ProgramA
    {
        static void Main(string[] args)
        {
            TakeRam(1200000000, 3000000);
        }

        [MethodImpl(MethodImplOptions.NoOptimization)]
        static public void TakeRam(long X, int Y)
        {
            Byte[] byteArray = new Byte[X];
            for (int i = 0; i < X; i += 4096)
            {
                GC.GetTotalMemory(true);
                byteArray[i] = 99;
            }
            System.Threading.Thread.Sleep(Y);
        }
    }
}
I am trying to improve the usability of an open source C# API that wraps a C library. The underlying library pulls multiplexed 2D data from a server over a network connection. In C, the samples come out as a pointer to the data (many types are supported), e.g. float*. The pull function returns the number of data points (frames * channels, but channels is known and never changes) so that the client knows how much new data is being passed. It is up to the client to allocate enough memory behind these pointers. For example, if one wants to pull floats the function signature is something like:
long pull_floats(float *floatbuf);
and floatbuf better have sizeof(float)*nChannels*nMoreFramesThanIWillEverGet bytes behind it.
In order to accommodate this, the C# wrapper currently uses 2D arrays, e.g. float[,]. The way it is meant to be used is a literal mirror of the C method: allocate more memory than one ever expects to need for these arrays, and return the number of data points so that the client knows how many frames of data have just come in. The underlying dll handler has a signature like:
[DllImport(libname, CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi, ExactSpelling = true)]
public static extern uint pull_floats(IntPtr obj, float[,] data_buffer);
And the C# wrapper itself has a definition like:
int PullFloats(float[,] floatbuf)
{
    // DllHandler has the DllImport code
    // Obj is the class with the handle to the C library
    uint res = DllHandler.pull_floats(Obj, floatbuf);
    return (int)res / floatbuf.GetLength(1);
}
The C++ wrapper for this library is idiomatic. There, the client supplies a vector<vector<T>>& to the call and in a loop, each frame gets pushed into the multiplexed data container. Something like:
void pull_floats_cpp(std::vector<std::vector<float>>& floatbuf)
{
    std::vector<float> frame;
    floatbuf.clear();
    while (pull_float_cpp(frame)) // C++ function to pull only one frame at a time
    {
        floatbuf.push_back(frame); // (memory may be allocated here)
    }
}
This works because in C++ you can pun a std::vector's contiguous storage to a primitive pointer like float*. That is, the vector frame from above goes into a wrapper like:
void pull_float_cpp(std::vector<float>& frame)
{
    frame.resize(channel_count); // memory may be allocated here as well...
    pull_float_c(&frame[0]);
}
where pull_float_c has a signature like:
void pull_float_c(float* frame);
I would like to do something similar in the C# API. Ideally the wrapper method would have a signature like:
void PullFloats(List<List<float>> floatbuf);
instead of
int PullFloats(float[,] floatbuf);
so that clients don't have to work with 2D arrays and (more importantly) don't have to keep track of the number of frames they get. That should be inherent in the dimensions of the containing object, so that clients can use enumeration patterns and foreach. But, unlike C++'s std::vector, you can't pun a List to an array. As far as I know, ToArray allocates memory and does a copy, so that not only is memory being allocated, but the new data doesn't go into the List of Lists that the array was built from.
I hope the pseudocode + explanation of this problem is clear. Any suggestions for how to tackle it in an elegant C# way are much appreciated. Or, if someone can assure me that this is simply a rift between C and C# that cannot be bridged without imitating C-style memory management, at least I would know not to think about this any more.
Could a MemoryStream or a Span help here?
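Possibly. As an illustration (the flat-buffer overload and field names here are hypothetical, not part of the actual API), a Span<float> can expose exactly the samples that arrived in a pre-allocated flat buffer, with no copy and no per-call allocation:
// Sketch only: assumes a flat float[] buffer and an import that accepts it.
public int PullFloats(float[] flatBuffer, out ReadOnlySpan<float> samples)
{
    uint res = DllHandler.pull_floats(Obj, flatBuffer); // hypothetical flat overload
    samples = new ReadOnlySpan<float>(flatBuffer, 0, (int)res);
    return (int)res / channelCount; // frames received; channelCount assumed known
}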
I came up with a pretty satisfactory way to wrap pre-allocated arrays in Lists. Please, anyone, let me know if there is a better way to do this, but according to this I think it is about as good as it gets, if the answer is to make a List out of an array, anyway. According to my debugger, 100,000 iterations of 5000 or so floats at a time takes less than 12 seconds (which is far better than the underlying library demands in practice, but worse than I would like to see), the memory use stays flat at around 12 MB (no copies), and the GC isn't called until the program exits:
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

namespace ListArrayTest
{
    [StructLayout(LayoutKind.Explicit, Pack = 2)]
    public class GenericDataBuffer
    {
        [FieldOffset(0)]
        public int _numberOfBytes;
        [FieldOffset(8)]
        private readonly byte[] _byteBuffer;
        [FieldOffset(8)]
        private readonly float[] _floatBuffer;
        [FieldOffset(8)]
        private readonly int[] _intBuffer;

        public byte[] ByteBuffer => _byteBuffer;
        public float[] FloatBuffer => _floatBuffer;
        public int[] IntBuffer => _intBuffer;

        public GenericDataBuffer(int sizeToAllocateInBytes)
        {
            int aligned4Bytes = sizeToAllocateInBytes % 4;
            sizeToAllocateInBytes = (aligned4Bytes == 0) ? sizeToAllocateInBytes : sizeToAllocateInBytes + 4 - aligned4Bytes;
            // Allocating the byteBuffer is co-allocating the floatBuffer and the intBuffer
            _byteBuffer = new byte[sizeToAllocateInBytes];
            _numberOfBytes = _byteBuffer.Length;
        }

        public static implicit operator byte[](GenericDataBuffer genericDataBuffer)
        {
            return genericDataBuffer._byteBuffer;
        }

        public static implicit operator float[](GenericDataBuffer genericDataBuffer)
        {
            return genericDataBuffer._floatBuffer;
        }

        public static implicit operator int[](GenericDataBuffer genericDataBuffer)
        {
            return genericDataBuffer._intBuffer;
        }
    }
    public class ListArrayTest<T>
    {
        private readonly Random _random = new();
        const int _channels = 10;
        const int _maxFrames = 500;
        private readonly T[,] _array = new T[_maxFrames, _channels];
        private readonly GenericDataBuffer _genericDataBuffer;
        int _currentFrameCount;
        public int CurrentFrameCount => _currentFrameCount;

        // generate 'data' to pull
        public void PushValues()
        {
            int frames = _random.Next(_maxFrames);
            if (frames == 0) frames++;
            for (int ch = 0; ch < _array.GetLength(1); ch++)
            {
                for (int i = 0; i < frames; i++)
                {
                    switch (_array[0, 0]) // in real life this is done with type enumerators
                    {
                        case float: // only implementing float to be concise
                            _array[i, ch] = (T)(object)(float)i;
                            break;
                    }
                }
            }
            _currentFrameCount = frames;
        }

        private void CopyFrame(int frameIndex)
        {
            for (int ch = 0; ch < _channels; ch++)
                switch (_array[0, 0]) // in real life this is done with type enumerators
                {
                    case float: // only implementing float to be concise
                        _genericDataBuffer.FloatBuffer[ch] = (float)(object)_array[frameIndex, ch];
                        break;
                }
        }

        private void PullFrame(List<T> frame, int frameIndex)
        {
            frame.Clear();
            CopyFrame(frameIndex);
            for (int ch = 0; ch < _channels; ch++)
            {
                switch (frame)
                {
                    case List<float>: // only implementing float to be concise
                        frame.Add((T)(object)BitConverter.ToSingle(_genericDataBuffer, ch * 4));
                        break;
                }
            }
        }
        public void PullChunk(List<List<T>> list)
        {
            list.Clear();
            int frameIndex = 0;
            while (frameIndex != _currentFrameCount)
            {
                // A fresh list per frame: reusing one instance would leave
                // every entry of 'list' aliasing the same final frame.
                List<T> frame = new();
                PullFrame(frame, frameIndex);
                list.Add(frame);
                frameIndex++;
            }
        }
        public ListArrayTest()
        {
            switch (_array[0, 0])
            {
                case float:
                    _genericDataBuffer = new(_channels * 4);
                    break;
            }
        }
    }

    internal class Program
    {
        static void Main(string[] args)
        {
            ListArrayTest<float> listArrayTest = new();
            List<List<float>> chunk = new();
            for (int i = 0; i < 100; i++)
            {
                listArrayTest.PushValues();
                listArrayTest.PullChunk(chunk);
                Console.WriteLine($"{i}: first value: {chunk[0][0]}");
            }
        }
    }
}
Update
...and, using a nifty trick I found from Mark Heath (https://github.com/markheath), I can effectively type-pun a List<List<T>> back to a T* the same way the C++ API does with std::vector<std::vector<T>> (see the GenericDataBuffer class). It is a lot more complicated under the hood, since one must be so verbose with type casting in C#, but it compiles without complaint and works like a charm. Here is the blog post I took the idea from: https://www.markheath.net/post/wavebuffer-casting-byte-arrays-to-float.
This also lets me ditch the need for clients to be responsible for pre-allocating, at the cost (as in the C++ wrapper) of having to do a bit of dynamic allocation internally. According to the debugger the GC doesn't get called and the memory stays flat, so I take it the List allocations are not repeatedly digging into the heap.
I've noticed in my test code that using ConcurrentQueue<> somehow does not release resources after dequeueing, and eventually I run out of memory. Or perhaps garbage collection is not happening frequently enough.
Here is a snippet of the code. I know that ConcurrentQueue<> stores references, and yes, I do want to create a new object each time, so if the enqueueing is faster than the dequeueing, memory will continue to rise. A screenshot of the memory usage is below. For testing, I sent through 5000 byte arrays with 500,000 elements each.
There is a similar question asked:
ConcurrentQueue holds object's reference or value? "out of memory" exception
and everything mentioned in that post is what I experienced ... except that the memory won't release after dequeueing, even when the Queue is emptied.
I would appreciate any thoughts/insights to this.
ConcurrentQueue<byte[]> TestQueue = new ConcurrentQueue<byte[]>();

Task EnqTask = Task.Factory.StartNew(() =>
{
    for (int i = 0; i < ObjCount; i++)
    {
        byte[] InData = new byte[ObjSize];
        InData[0] = (byte)i; // used to show different array object
        TestQueue.Enqueue(InData);
        System.Threading.Thread.Sleep(20);
    }
});

Task DeqTask = Task.Factory.StartNew(() =>
{
    int Count = 0;
    while (Count < ObjCount)
    {
        byte[] OutData;
        if (TestQueue.TryDequeue(out OutData))
        {
            OutData[1] = 0xFF; // just do something with the data
            Count++;
        }
        System.Threading.Thread.Sleep(40);
    }
});
[Picture of memory usage]
When going through the CLR/CLI specs and memory models etc., I noticed the wording around atomic reads/writes in the ECMA CLI spec:
A conforming CLI shall guarantee that read and write access to
properly aligned memory locations no larger than the native word size
(the size of type native int) is atomic when all the write accesses to
a location are the same size.
Specifically the phrase 'properly aligned memory' caught my eye. I wondered if I could somehow get torn reads with a long type on a 64-bit system with some trickery. So I wrote the following test-case:
unsafe class Program {
    const int NUM_ITERATIONS = 200000000;
    const long STARTING_VALUE = 0x100000000L + 123L;
    const int NUM_LONGS = 200;

    private static int prevLongWriteIndex = 0;
    private static long* misalignedLongPtr = (long*) GetMisalignedHeapLongs(NUM_LONGS);

    public static long SharedState {
        get {
            Thread.MemoryBarrier();
            return misalignedLongPtr[prevLongWriteIndex % NUM_LONGS];
        }
        set {
            var myIndex = Interlocked.Increment(ref prevLongWriteIndex) % NUM_LONGS;
            misalignedLongPtr[myIndex] = value;
        }
    }

    static unsafe void Main(string[] args) {
        Thread writerThread = new Thread(WriterThreadEntry);
        Thread readerThread = new Thread(ReaderThreadEntry);

        writerThread.Start();
        readerThread.Start();

        writerThread.Join();
        readerThread.Join();

        Console.WriteLine("Done");
        Console.ReadKey();
    }

    private static IntPtr GetMisalignedHeapLongs(int count) {
        const int ALIGNMENT = 7;
        IntPtr reservedMemory = Marshal.AllocHGlobal(new IntPtr(sizeof(long) * count + ALIGNMENT - 1));
        long allocationOffset = (long) reservedMemory % ALIGNMENT;
        if (allocationOffset == 0L) return reservedMemory;
        return reservedMemory + (int) (ALIGNMENT - allocationOffset);
    }

    private static void WriterThreadEntry() {
        for (int i = 0; i < NUM_ITERATIONS; ++i) {
            SharedState = STARTING_VALUE + i;
        }
    }

    private static void ReaderThreadEntry() {
        for (int i = 0; i < NUM_ITERATIONS; ++i) {
            var sharedStateLocal = SharedState;
            if (sharedStateLocal < STARTING_VALUE) Console.WriteLine("Torn read detected: " + sharedStateLocal);
        }
    }
}
However, no matter how many times I run the program, I never legitimately see the line "Torn read detected". So why not?
I allocated multiple longs in a single block in the hopes that at least one of them would spill between two cache lines; and the 'start point' for the first long should be misaligned (unless I'm misunderstanding something).
Also, I know that the nature of multithreading errors means they can be hard to force, and that my 'test program' isn't as rigorous as it could be, but I've run the program almost 30 times now with no results, each run with 200,000,000 iterations.
There are a number of flaws in this program that hide torn reads. Reasoning about the behavior of unsynchronized threads is never simple and is hard to explain; the odds of accidental synchronization are always high.
var myIndex = Interlocked.Increment(ref prevLongWriteIndex) % NUM_LONGS;
Nothing very subtle about Interlocked; unfortunately, it affects the reader thread a great deal as well. Pretty hard to see, but you can use Stopwatch to time the execution of the threads. You'll see that Interlocked on the writer slows down the reader by a factor of ~2: enough to affect the timing of the reader and not repro the problem. Accidental synchronization.
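A minimal sketch of that timing check (illustrative only; the loop body is the reader from the question):
private static void ReaderThreadEntry() {
    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        var sharedStateLocal = SharedState;
        if (sharedStateLocal < STARTING_VALUE) Console.WriteLine("Torn read detected: " + sharedStateLocal);
    }
    sw.Stop();
    Console.WriteLine("Reader finished in " + sw.ElapsedMilliseconds + " ms");
}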
Simplest way to eliminate the hazard and maximize the odds of detecting a torn read is to just always read and write from the same memory location. Fix:
var myIndex = 0;
if (sharedStateLocal < STARTING_VALUE)
This test doesn't help much to detect torn reads; there are many torn values that simply don't trigger it. Having too many binary zeros in STARTING_VALUE makes detection extra unlikely. A good alternative that maximizes the odds of detection is to alternate between 1 and -1, ensuring the byte values are always different and making the test very simple. Thus:
private static void WriterThreadEntry() {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        SharedState = 1;
        SharedState = -1;
    }
}

private static void ReaderThreadEntry() {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        var sharedStateLocal = SharedState;
        if (Math.Abs(sharedStateLocal) != 1) {
            Console.WriteLine("Torn read detected: " + sharedStateLocal);
        }
    }
}
That quickly gets you several pages of torn reads in the console in 32-bit mode. To get them in 64-bit as well you need to do extra work to get the variable mis-aligned. It needs to straddle the L1 cache-line boundary so the processor has to perform two reads and writes, like it does in 32-bit mode. Fix:
private static IntPtr GetMisalignedHeapLongs(int count) {
    const int ALIGNMENT = -1;
    IntPtr reservedMemory = Marshal.AllocHGlobal(new IntPtr(sizeof(long) * count + 64 + 15));
    long cachelineStart = 64 * (((long)reservedMemory + 63) / 64);
    long misalignedAddr = cachelineStart + ALIGNMENT;
    if (misalignedAddr < (long)reservedMemory) misalignedAddr += 64;
    return new IntPtr(misalignedAddr);
}
Any ALIGNMENT value between -1 and -7 will now produce torn reads in 64-bit mode as well.
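As a quick sanity check, a sketch like the following prints the pointer's offset within its 64-byte cache line; an 8-byte long starting at offsets 57 through 63 straddles the boundary:
// Offsets 57..63 mean the 8-byte long crosses into the next cache line.
long addr = (long)misalignedLongPtr;
Console.WriteLine("Offset within cache line: " + (addr % 64));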
I have a function similar to the following:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void SetVariable<T>(T newValue) where T : struct {
    // I know by this point that T is blittable (i.e. only unmanaged value types)
    // varPtr is a void*, and is where I want to copy newValue to
    *varPtr = newValue; // This won't work, but is basically what I want to do
}
I saw Marshal.StructureToPtr(), but it seems quite slow, and this is performance-sensitive code. If I knew the type T I could just declare varPtr as a T*, but... Well, I don't.
Either way, I'm after the fastest possible way to do this. 'Safety' is not a concern: By this point in the code, I know that the size of the struct T will fit exactly in to the memory pointed to by varPtr.
One answer is to reimplement native memcpy in C#, making use of the same optimizing tricks that native memcpy attempts. You can see Microsoft doing this in their own source; see the Buffer.cs file in the Microsoft Reference Source:
// This is tricky to get right AND fast, so lets make it useful for the whole Fx.
// E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
internal unsafe static void Memcpy(byte* dest, byte* src, int len) {
    // This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
    // Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
    // possible yet and so we have this implementation here for now.
    switch (len)
    {
        case 0:
            return;
        case 1:
            *dest = *src;
            return;
        case 2:
            *(short *)dest = *(short *)src;
            return;
        case 3:
            *(short *)dest = *(short *)src;
            *(dest + 2) = *(src + 2);
            return;
        case 4:
            *(int *)dest = *(int *)src;
            return;
        ...
It's interesting to note that they implement the copy in managed code for all sizes up to 512; most of the sizes use pointer-aliasing tricks to get the VM to emit instructions that operate on differing sizes. Only at 512 do they finally drop into invoking the native memcpy:
// P/Invoke into the native version for large lengths
if (len >= 512)
{
    _Memcpy(dest, src, len);
    return;
}
Presumably, native memcpy is even faster since it can be hand optimized to use SSE/MMX instructions to perform the copy.
As per BenVoigt's suggestion, I tried a few options. For all these tests I compiled for the Any CPU architecture on a standard VS2013 Release build, and ran the tests outside of the IDE. Before each test was measured, the methods DoTestA() and DoTestB() were run multiple times to allow JIT warm-up.
First, I compared Marshal.StructureToPtr to a byte-by-byte loop with various struct sizes. I've shown the code below using a SixtyFourByteStruct:
private unsafe static void DoTestA() {
    fixed (SixtyFourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(SixtyFourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private static void DoTestB() {
    Marshal.StructureToPtr(structToCopy, unmanagedTarget, false);
}
And the results:
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method    Avg.     Min.    Max.         Jitter          Total
A         82ns     0ns     22,000ns     21,917ns !      41.017ms
B         137ns    0ns     38,700ns     38,562ns !      68.834ms
As you can see, the manual loop is faster (as I suspected). The results are similar for a sixteen-byte and four-byte struct, with the difference being more pronounced the smaller the struct goes.
So now, to try the manual copy vs using P/Invoke and memcpy:
private unsafe static void DoTestA() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(FourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private unsafe static void DoTestB() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        memcpy(unmanagedTarget, (IntPtr) fixedStruct, new UIntPtr((uint) sizeof(FourByteStruct)));
    }
}
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method    Avg.    Min.    Max.         Jitter          Total
A         61ns    0ns     28,000ns     27,938ns !      30.736ms
B         84ns    0ns     45,900ns     45,815ns !      42.216ms
So, it seems that the manual copy is still better in my case. Like before, the results were pretty similar for 4/16/64 byte structs (though the gap was <10ns for 64-byte size).
It occurred to me that I was only testing structures that fit on a cache line (I have a standard x86_64 CPU). So I tried a 128-byte structure, and it swung the balance in the favour of memcpy:
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method    Avg.     Min.    Max.         Jitter          Total
A         104ns    0ns     48,300ns     48,195ns !      52.150ms
B         84ns     0ns     38,400ns     38,315ns !      42.284ms
Anyway, the conclusion to all that is that the byte-by-byte copy seems the fastest for any struct of size <=64 bytes on an x86_64 CPU on my machine. Take it as you will (and maybe someone will spot an inefficiency in my code anyway).
FYI: I'm posting how I leveraged the accepted answer, for others' benefit, as there's a twist when accessing the method via reflection because it's overloaded.
public static class Buffer
{
    public unsafe delegate void MemcpyDelegate(byte* dest, byte* src, int len);
    public static readonly MemcpyDelegate Memcpy;

    static Buffer()
    {
        // Requires using System.Linq; and using System.Reflection;
        var methods = typeof(System.Buffer).GetMethods(BindingFlags.Static | BindingFlags.NonPublic).Where(m => m.Name == "Memcpy");
        var memcpy = methods.First(mi => mi.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] { typeof(byte*), typeof(byte*), typeof(int) }));
        Memcpy = (MemcpyDelegate)memcpy.CreateDelegate(typeof(MemcpyDelegate));
    }
}
Usage:
public static unsafe void MemcpyExample()
{
    int src = 12345;
    int dst = 0;
    Buffer.Memcpy((byte*)&dst, (byte*)&src, sizeof(int));
    System.Diagnostics.Debug.Assert(dst == 12345);
}
public void SetVariable<T>(T newValue) where T : struct
You cannot use generics to accomplish this the fast way. The compiler doesn't take your pretty blue eyes as a guarantee that T is actually blittable; the constraint isn't good enough. You should use overloads:
public unsafe void SetVariable(int newValue) {
    *(int*)varPtr = newValue;
}
public unsafe void SetVariable(double newValue) {
    *(double*)varPtr = newValue;
}
public unsafe void SetVariable(Point newValue) {
    *(Point*)varPtr = newValue;
}
// etc...
Which might be inconvenient, but it is blindingly fast. It compiles to a single MOV instruction with no method call overhead in Release mode. The fastest it could be.
And for the back-up case, the profiler will tell you when you need to add an overload:
public unsafe void SetVariable<T>(T newValue) {
    Marshal.StructureToPtr(newValue, (IntPtr)varPtr, false);
}
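For what it's worth, newer runtimes offer a generic path via System.Runtime.CompilerServices.Unsafe that avoids both the overload explosion and the Marshal fallback; a minimal sketch, assuming T is unmanaged and varPtr points to enough space:
using System.Runtime.CompilerServices;

public unsafe void SetVariable<T>(T newValue) where T : unmanaged
{
    // The JIT specializes this per T; for small structs it compiles to a plain store.
    Unsafe.Write(varPtr, newValue);
}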
I've read many articles about the GC and about the "don't care about objects" paradigm, but I did a test to prove it.
So the idea is: I'm creating a lot of large objects stored in local variables, and I expected that after all the tasks were done the GC would clean the memory up by itself. But it didn't. The test code:
class Program
{
    static void Main()
    {
        var allDone = new ManualResetEvent(false);
        int completed = 0;
        long sum = 0; // just to keep the optimizer from removing the loop
        const int count = int.MaxValue / 10000000;
        for (int i = 0; i < count; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                unchecked
                {
                    var dumb = new Dumb();
                    var localSum = 0;
                    foreach (int x in dumb.Arr)
                    {
                        localSum += x;
                    }
                    sum += localSum;
                }
                if (Interlocked.Increment(ref completed) == count)
                    allDone.Set();
                if (completed % (count / 100) == 0)
                    Console.WriteLine("Progress = {0:N2}%", 100.0 * completed / count);
            });
        }
        allDone.WaitOne();
        Console.WriteLine("Done. Result : {0}", sum);
        Console.ReadKey();
        GC.Collect();
        Console.WriteLine("GC Collected!");
        Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
        Console.ReadKey();
    }
}

class Dumb
{
    public int[] Arr = Enumerable.Range(1, 10 * 1024 * 1024).ToArray(); // ~40 MB
}
So in my case the app eats ~2 GB of RAM, but when I press a key and launch GC.Collect it frees the occupied memory back down to a normal size of ~20 MB.
I've read that manually calling the GC is bad practice, but I cannot avoid it in this case.
In your example there is no need to explicitly call GC.Collect().
If you bring it up in Task Manager or Performance Monitor you will see the GC working as the program runs. The GC is called when needed by the runtime: when it is trying to allocate and doesn't have memory available, it will run a collection to free some up.
That being said, since your objects (greater than 85,000 bytes) are going onto the large object heap (LOH), you need to watch out for large object heap fragmentation. I've modified your code to show how you can fragment the LOH, which will give an out-of-memory exception even though the memory is available, just not contiguous. As of .NET 4.5.1 you can set a flag to request that the LOH be compacted.
I modified your code to show an example of this here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace GCTesting
{
    class Program
    {
        static int fragLOHbyIncrementing = 1000;

        static void Main()
        {
            var allDone = new ManualResetEvent(false);
            int completed = 0;
            long sum = 0; // just to keep the optimizer from removing the loop
            const int count = 2000;
            for (int i = 0; i < count; i++)
            {
                ThreadPool.QueueUserWorkItem(delegate
                {
                    unchecked
                    {
                        var dumb = new Dumb(fragLOHbyIncrementing++);
                        var localSum = 0;
                        foreach (int x in dumb.Arr)
                        {
                            localSum += x;
                        }
                        sum += localSum;
                    }
                    if (Interlocked.Increment(ref completed) == count)
                        allDone.Set();
                    if (completed % (count / 100) == 0)
                        Console.WriteLine("Progress = {0:N2}%", 100.0 * completed / count);
                });
            }
            allDone.WaitOne();
            Console.WriteLine("Done. Result : {0}", sum);
            Console.ReadKey();
            GC.Collect();
            Console.WriteLine("GC Collected!");
            Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
            Console.ReadKey();
        }
    }

    class Dumb
    {
        public Dumb(int incr)
        {
            try
            {
                DumbAllocation(incr);
            }
            catch (OutOfMemoryException)
            {
                Console.WriteLine("Out of memory, trying to compact the LOH.");
                GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
                GC.Collect();
                try // try again
                {
                    DumbAllocation(incr);
                    Console.WriteLine("compacting the LOH worked to free up memory.");
                }
                catch (OutOfMemoryException)
                {
                    Console.WriteLine("compaction of LOH failed to free memory.");
                    throw;
                }
            }
        }

        private void DumbAllocation(int incr)
        {
            Arr = Enumerable.Range(1, (10 * 1024 * 1024) + incr).ToArray();
        }

        public int[] Arr;
    }
}
The .NET runtime will garbage collect without your call to the GC. However, the GC methods are exposed so that GC collections can be timed with the user experience (load screens, waiting for downloads, etc).
Using the GC methods isn't always a bad idea, but if you need to ask, then it likely is. :-)
I've read that manual calls of GC etc is bad practice, but i cannot avoid it in this case.
You can avoid it. Just don't call it. The next time you try to do an allocation, the GC will likely kick in and take care of this for you.
A few things I can think of that may be influencing this, but none for sure :(
One possible effect is that the GC doesn't kick in right away: the large objects are eligible for collection but haven't been cleaned up yet. Specifically calling GC.Collect forces a collection right there, and that's where you see the difference. Otherwise it would just have happened at some point later.
The second reason I can think of is that the GC may collect objects but not necessarily release the memory to the OS. Hence you'd continue to see high memory usage even though it is free internally and available for allocation.
The garbage collector is clever and decides when the time is right to collect your objects. This is done by heuristics, which you should read about. The garbage collector does its job very well. Is the 2 GB a problem for your system, or are you just wondering about the behaviour?
Whenever you call GC.Collect(), don't forget to call GC.WaitForPendingFinalizers(). This avoids unwanted aging of objects with finalizers.
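A minimal sketch of that pattern; the second Collect is a common addition to reclaim objects whose finalizers have just run:
GC.Collect();                  // push finalizable objects onto the finalization queue
GC.WaitForPendingFinalizers(); // let their finalizers run to completion
GC.Collect();                  // reclaim the now-finalized objects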