Memory Issues with ConcurrentQueue<> in .NET/C#

I've noticed in my test code that using ConcurrentQueue<> somehow does not release resources after dequeueing, and eventually I run out of memory. Or the garbage collection is not happening frequently enough.
Here is a snippet of the code. I know that ConcurrentQueue<> stores references, and yes, I do want to create a new object each time, so if enqueueing is faster than dequeueing, memory will continue to rise. Below is also a screenshot of the memory usage. For testing, I sent through 5000 byte arrays with 500,000 elements each.
There is a similar question asked:
ConcurrentQueue holds object's reference or value? "out of memory" exception
and everything mentioned in that post is what I experienced ... except that the memory won't release after dequeueing, even when the Queue is emptied.
I would appreciate any thoughts/insights to this.
ConcurrentQueue<byte[]> TestQueue = new ConcurrentQueue<byte[]>();

Task EnqTask = Task.Factory.StartNew(() =>
{
    for (int i = 0; i < ObjCount; i++)
    {
        byte[] InData = new byte[ObjSize];
        InData[0] = (byte)i; // used to show a different array object each time
        TestQueue.Enqueue(InData);
        System.Threading.Thread.Sleep(20);
    }
});

Task DeqTask = Task.Factory.StartNew(() =>
{
    int Count = 0;
    while (Count < ObjCount)
    {
        byte[] OutData;
        if (TestQueue.TryDequeue(out OutData))
        {
            OutData[1] = 0xFF; // just do something with the data
            Count++;
        }
        System.Threading.Thread.Sleep(40);
    }
});
[Picture of memory usage]
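A minimal diagnostic sketch (not part of the original test, just an addition) that separates "the GC simply hasn't run yet" from "the dequeued arrays are still rooted somewhere": once both tasks have finished, force a full collection and compare the managed heap size reported by GC.GetTotalMemory.

// Diagnostic sketch only: run after EnqTask and DeqTask have completed.
Task.WaitAll(EnqTask, DeqTask);

long before = GC.GetTotalMemory(forceFullCollection: false);
long after  = GC.GetTotalMemory(forceFullCollection: true); // forces a collection and waits for it

Console.WriteLine("Managed heap before forced GC: {0:N0} bytes", before);
Console.WriteLine("Managed heap after forced GC:  {0:N0} bytes", after);
// If 'after' stays high even though the queue is empty, the dequeued arrays are
// still reachable from somewhere; if it drops, the memory was simply waiting
// for a collection to happen.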

Related

I need to access global (unmanaged) memory in C#

I am pretty new to C# and could use some help on an audio project.
My audio input buffers call a method when they are filled up. In that method I marshal these buffers to a local float[] and pass it to a function where some audio processing is done. After processing, the function returns the manipulated float[], which I pass via Marshal.Copy to the audio output buffer. It works, but it is pretty hard to get the audio processing done fast enough to pass the result back without ending up with ugly glitches. If I enlarge the audio buffers it gets better, but I get intolerably high latency in the signal chain.
One problem is the GC. My DSP routine does some FFT, and the methods frequently need to allocate local variables. I think this is slowing down my process a lot.
So I need a way to allocate once (and re-access) a few pieces of unmanaged memory, keep this memory for the entire runtime, and just reference it from the methods.
I found e.g:
IntPtr hglobal = Marshal.AllocHGlobal(8192);
Marshal.FreeHGlobal(hglobal);
So what I tried is to define a global static class "Globals" with a static member and assign that IntPtr to it:
Globals.mem1 = hglobal;
From within any other method I can now access this, e.g.:
int[] f = new int[2];
f[0] = 111;
f[1] = 222;
Marshal.Copy(f, 0, Globals.mem1, 2);
Now comes my problem:
If I want to access this int[] from the example above in another method, how could I do this?
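For reference, Marshal.Copy has overloads in both directions; a minimal sketch of the whole pattern (using the Globals.mem1 field described above, with the read-back done from a different method) could look like this:

using System;
using System.Runtime.InteropServices;

static class Globals
{
    // Holder for the unmanaged block, as described above.
    public static IntPtr mem1;
}

static class Example
{
    static void Allocate()
    {
        Globals.mem1 = Marshal.AllocHGlobal(8192); // keep for the entire runtime
    }

    static void Write()
    {
        int[] f = { 111, 222 };
        Marshal.Copy(f, 0, Globals.mem1, f.Length); // managed -> unmanaged
    }

    static void ReadSomewhereElse()
    {
        int[] back = new int[2];
        Marshal.Copy(Globals.mem1, back, 0, back.Length); // unmanaged -> managed
        Console.WriteLine("{0}, {1}", back[0], back[1]);
    }

    static void Shutdown()
    {
        Marshal.FreeHGlobal(Globals.mem1); // free once, at the very end
    }
}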
thank you for your fast help.
It seems I was a little imprecise, sorry.
My audio device driver throws a buffer-filled event which I catch (in pseudocode, since I don't have access to my home desktop right now). It looks like:
void buffer (....)
{
byte[] buf = new byte[asiobuffersize];
marshall.copy(asioinbuffers, 0, buf, asiobufferlenth);
buf= manipulate(buf);
marshall.copy(buf, 0, asiooutbuffer, asiobufferlenth);
}
The manipulate function does some conversion from byte to float, then some math (FFT), and a back-transform to byte. It looks like, e.g.:
private byte[] manipulate(byte[] buf, Complex[] filter)
{
    float[] bu = convertTofloat(buf); // conversion from byte to audio float here
    Complex[] inbuf = new Complex[bu.Length];
    Complex[] midbuf = new Complex[bu.Length];
    Complex[] mid2buf = new Complex[bu.Length];
    Complex[] outbuf = new Complex[bu.Length];
    for (int n = 0; n < bu.Length; n++)
    {
        inbuf[n] = bu[n]; // copy to Complex
    }
    midbuf = FFT(inbuf); // forward FFT transform
    for (int n = 0; n < midbuf.Length; n++)
    {
        mid2buf[n] = midbuf[n] * filter[n]; // multiply with filter
    }
    outbuf = iFFT(mid2buf); // inverse FFT transform
    byte[] outBytes = convertTobyte(outbuf); // conversion from float back to audio byte
    return outBytes;
}
Here is where I expect my speed issue to be. So I thought the problem could be solved if the manipulate function could just "get" a fixed piece of unmanaged memory where (created once up front) all those variables (the Complex arrays, for example) and pre-allocated memory sit, so that I don't have to create new ones each time the function is called. I first suspected the reason for my glitches was wrong FFT or math, but they happen at fairly "sharp" intervals of a few seconds, so they are not connected to audio signal issues like clipping. I think the issue happens when the GC is doing some serious work and eats exactly the few milliseconds I am missing to get the output buffer filled in time.
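As an illustration of that idea, here is a minimal sketch (not the original code; it keeps the question's own placeholders FFT, iFFT, convertTofloat and convertTobyte, and simply reuses managed Complex work buffers across calls instead of allocating new ones, which achieves the same goal as unmanaged memory here):

// Sketch only: the Complex work buffers become fields that are allocated once
// and reused on every call, so manipulate() itself allocates almost nothing.
private Complex[] _inbuf;
private Complex[] _mid2buf;

private byte[] manipulate(byte[] buf, Complex[] filter)
{
    float[] bu = convertTofloat(buf); // conversion from byte to audio float, as before

    if (_inbuf == null || _inbuf.Length != bu.Length)
    {
        _inbuf = new Complex[bu.Length];   // allocated on the first call only
        _mid2buf = new Complex[bu.Length]; // (or if the buffer size ever changes)
    }

    for (int n = 0; n < bu.Length; n++)
    {
        _inbuf[n] = bu[n];
    }

    Complex[] midbuf = FFT(_inbuf);        // note: FFT/iFFT still return new arrays here

    for (int n = 0; n < midbuf.Length; n++)
    {
        _mid2buf[n] = midbuf[n] * filter[n];
    }

    Complex[] outbuf = iFFT(_mid2buf);
    return convertTobyte(outbuf);          // conversion from float back to audio byte
}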
I really doubt the issues you are experiencing are caused by your managed buffer creation/copying. Instead, I think your problem is that you have your data-capture logic coupled with your DSP logic. Usually, captured data resides in a circular buffer where the data is rewritten after some period, so you should be fetching this data as soon as possible.
The problem is that you don't fetch the next available data block until after your DSP is done, and you already know FFT operations are really CPU intensive! If you have a processing peak, you may not be able to retrieve the data before it is rewritten by the capture driver.
One possibility to address your issue is to increase, if possible, the size and/or number of capture buffers. This buys you more time before the captured data is rewritten. The other possibility, and the one that I favor, is decoupling your processing stage from your capture stage; this way, if new data arrives while you are busy performing your DSP computations, you can still grab it and buffer it almost as soon as it becomes available. You become much more resilient to garbage-collection-induced pauses or computation peaks inside your manipulate method.
This would involve creating two threads: the capture thread and the processing thread. You would also need an "event" that signals the processing thread that new data is available, and a queue that will serve as a dynamic, expandable buffer.
The capture thread would look something like this:
// the following are class members
AutoResetEvent _bufQueueEvent = new AutoResetEvent(false);
Queue<byte[]> _bufQueue = new Queue<byte[]>();
bool _keepCaptureThreadAlive;
Thread _captureThread;

void CaptureLoop() {
    while( _keepCaptureThreadAlive ) {
        byte[] asioinbuffers = WaitForBuffer(); // placeholder for the driver wait/callback
        byte[] buf = new byte[asioinbuffers.Length];
        Array.Copy(asioinbuffers, buf, asioinbuffers.Length); // copy the captured block
        lock( _bufQueue ) {
            _bufQueue.Enqueue(buf);
        }
        _bufQueueEvent.Set(); // notify the processing thread that new data is available
    }
}

void StartCaptureThread() {
    _keepCaptureThreadAlive = true;
    _captureThread = new Thread(CaptureLoop);
    _captureThread.Name = "CaptureThread";
    _captureThread.IsBackground = true;
    _captureThread.Start();
}

void StopCaptureThread() {
    _keepCaptureThreadAlive = false;
    _captureThread.Join(); // wait until the thread exits
}
The processing thread would look something like this
// the following are class members
bool _keepProcessingThreadAlive;
Thread _processingThread;
void ProcessingLoop() {
while( _keepProcessingThreadAlive ) {
_bufQueueEvent.WaitOne(); // thread will sleep until fresh data is available
if( !_keepProcessingThreadAlive ) {
break; // check whether the thread was woken only for termination
}
int queueCount;
lock( _bufQueue ) {
queueCount = _bufQueue.Count;
}
for( int i = 0; i < queueCount; i++ ) {
byte[] buffIn;
lock( _bufQueue ) {
// only lock during the dequeue operation; this way the capture thread will
// be able to enqueue fresh data even if we are still doing DSP processing
buffIn = _bufQueue.Dequeue();
}
byte[] buffOut = manipulate(buffIn); // you are safe if this stage takes more time than normal, you will still get the incoming data
// additional logic using manipulate() return value
...
}
}
}
void StartProcessingThread() {
_keepProcessingThreadAlive = true;
_processingThread = new Thread(ProcessingLoop);
_processingThread.Name = "ProcessingThread";
_processingThread.IsBackground = true;
_processingThread.Start();
}
void StopProcessingThread() {
_keepProcessingThreadAlive = false;
_bufQueueEvent.Set(); // wake up thread in case it is waiting for data
_processingThread.Join();
}
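For completeness, a possible wiring of the pieces above (a sketch, not part of the original answer):

// Start the consumer first so nothing enqueued by the capture thread is left waiting,
// then the producer; shut down in the reverse order.
StartProcessingThread();
StartCaptureThread();

// ... application runs, buffers flow from CaptureLoop into _bufQueue ...

StopCaptureThread();     // stop producing new buffers
StopProcessingThread();  // wake and stop the consumer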
At my job we also perform a lot of DSP and this pattern has really helped us with the kind of issues you are experiencing.

Enumerating ConcurrentDictionary after adding and removing entries causes high CPU

My program had a problem wherein it would continue to use more CPU than usual after handling a massive amount of data. I traced the problem back to a piece of code that enumerates over a ConcurrentDictionary.
I knew this would cause high CPU usage when enumerating the dictionary while it held millions of entries, but I raised an eyebrow when the CPU usage didn't drop back down to its original level after the dictionary had been emptied. It seems like the dictionary was doing a lot more work behind the scenes after the mass add-then-remove, but why? When enumerating the dictionary before anything is added to it, CPU usage stays firmly at 0%.
Why is this happening? Is there anything I can do to prevent it from happening?
Here's a code snippet that reproduces the problem:
var dict = new ConcurrentDictionary<string, object>();
const int count = 3000000; // High CPU.
//const int count = 100; // No issues.
for (int i = 0; i < count; i++)
{
dict.TryAdd(i.ToString(), null); // Add lots of entries.
}
for (int i = 0; i < count; i++)
{
object nullObj;
dict.TryRemove(i.ToString(), out nullObj); // Then remove them all.
}
GC.Collect();
Debug.Assert(dict.IsEmpty);
Console.WriteLine("Enumerating");
while (true)
{
// Enumerate the dictionary. Takes up lots of CPU despite being empty,
// but only if lots of entries were added then removed.
foreach (var kvp in dict) { /* Empty. */ }
Thread.Sleep(50);
}
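If the slowdown comes from internal storage that grew during the mass insert and never shrank (an assumption on my part, not something the snippet above proves), one blunt workaround is to swap in a fresh dictionary once the old one has been emptied, so enumeration no longer has to walk the oversized structure:

// Sketch: replace the emptied dictionary with a new instance.
// This assumes no other thread is adding to 'dict' at the moment of the swap.
if (dict.IsEmpty)
{
    dict = new ConcurrentDictionary<string, object>();
}

while (true)
{
    // Enumerating the fresh instance stays cheap.
    foreach (var kvp in dict) { /* Empty. */ }
    Thread.Sleep(50);
}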

Some cases when it is necessary to call GC.Collect manually

I've read many articles about the GC and about the "do not care about objects" paradigm, but I did a test to prove it.
So the idea is: I'm creating a lot of large objects stored in local functions, and I expected that after all the tasks are done the GC would clean up the memory itself. But it didn't. So, the test code:
class Program
{
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = int.MaxValue/10000000;
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb();
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed%(count/100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0*completed/count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1),GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public int[] Arr = Enumerable.Range(1, 10 * 1024 * 1024).ToArray(); // ~40 MB of ints
}
So in my case the app eats ~2 GB of RAM, but when I press a key and GC.Collect runs, it frees the occupied memory down to a normal size of about 20 MB.
I've read that manual calls to GC etc. are bad practice, but I cannot avoid it in this case.
In your example there is no need to explicitly call GC.Collect().
If you bring it up in Task Manager or Performance Monitor you will see the GC working as the program runs; the GC is invoked by the runtime when needed (when it is trying to allocate and doesn't have enough memory, it will run a collection to free some up).
That being said, since your objects (greater than 85,000 bytes) go onto the large object heap (LOH), you need to watch out for LOH fragmentation, which can give an out-of-memory exception even though memory is available, just not contiguous memory. As of .NET 4.5.1 you can set a flag to request that the LOH be compacted.
I modified your code to show an example of this here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace GCTesting
{
class Program
{
static int fragLOHbyIncrementing = 1000;
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = 2000;
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb( fragLOHbyIncrementing++ );
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed % (count / 100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0 * completed / count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public Dumb(int incr)
{
try
{
DumbAllocation(incr);
}
catch (OutOfMemoryException)
{
Console.WriteLine("Out of memory, trying to compact the LOH.");
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
try // try again
{
DumbAllocation(incr);
Console.WriteLine("compacting the LOH worked to free up memory.");
}
catch (OutOfMemoryException)
{
Console.WriteLine("compaction of LOH failed to free memory.");
throw;
}
}
}
private void DumbAllocation(int incr)
{
Arr = Enumerable.Range(1, (10 * 1024 * 1024) + incr).ToArray();
}
public int[] Arr;
}
}
The .NET runtime will garbage collect without your calling the GC. However, the GC methods are exposed so that collections can be timed around the user experience (load screens, waiting for downloads, etc.).
Using the GC methods isn't always a bad idea, but if you need to ask, then it likely is. :-)
I've read that manual calls to GC etc. are bad practice, but I cannot avoid it in this case.
You can avoid it. Just don't call it. The next time you try to do an allocation, the GC will likely kick in and take care of this for you.
A few things I can think of that may be influencing this, but none for sure:
One possible effect is that the GC doesn't kick in right away; the large objects are on the collection queue but haven't been cleaned up yet. Specifically calling GC.Collect forces collection right there, and that's where you see the difference. Otherwise it would just have happened at some point later.
The second reason I can think of is that the GC may collect objects but not necessarily release the memory back to the OS. Hence you would continue to see high memory usage even though it is free internally and available for allocation.
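A hedged way to observe that second effect (a sketch, not from the original answer) is to compare the managed heap size with the process working set: after a collection the first can drop sharply while the second stays high for a while.

// Sketch: distinguish "managed heap is still full" from "memory not yet returned to the OS".
long managedBytes = GC.GetTotalMemory(forceFullCollection: true); // size of the managed heap
long workingSet   = Environment.WorkingSet;                       // memory the OS currently attributes to the process

Console.WriteLine("Managed heap: {0:N0} bytes", managedBytes);
Console.WriteLine("Working set:  {0:N0} bytes", workingSet);
// A small managed heap with a large working set suggests the GC has collected
// the objects but has not (yet) handed the pages back to the operating system.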
The garbage collector is clever and decides when the time is right to collect your objects. This is done by heuristics, which are worth reading about. The garbage collector does its job very well. Are the 2 GB a problem for your system, or are you just wondering about the behaviour?
Whenever you call GC.Collect(), don't forget to call GC.WaitForPendingFinalizers(). This avoids unwanted aging of objects with finalizers.
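For reference, a common sketch of that sequence (the trailing GC.Collect() is often added so that objects whose finalizers have just run can actually be reclaimed):

GC.Collect();                   // queue finalizable garbage for finalization
GC.WaitForPendingFinalizers();  // let the finalizer thread drain its queue
GC.Collect();                   // reclaim the objects that were just finalized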

Why does the C# garbage collector not keep trying to free memory until a request can be satisfied?

Consider the code below:
using System;
namespace memoryEater
{
internal class Program
{
private static void Main(string[] args)
{
Console.WriteLine("alloc 1");
var big1 = new BigObject();
Console.WriteLine("alloc 2");
var big2 = new BigObject();
Console.WriteLine("null 1");
big1 = null;
//GC.Collect();
Console.WriteLine("alloc3");
big1 = new BigObject();
Console.WriteLine("done");
Console.Read();
}
}
public class BigObject
{
private const uint OneMeg = 1024 * 1024;
private static int _idCnt;
private readonly int _myId;
private byte[][] _bigArray;
public BigObject()
{
_myId = _idCnt++;
Console.WriteLine("BigObject {0} creating... ", _myId);
_bigArray = new byte[700][];
for (int i = 0; i < 700; i++)
{
_bigArray[i] = new byte[OneMeg];
}
for (int j = 0; j < 700; j++)
{
for (int i = 0; i < OneMeg; i++)
{
_bigArray[j][i] = (byte)i;
}
}
Console.WriteLine("done");
}
~BigObject()
{
Console.WriteLine("BigObject {0} finalised", _myId);
}
}
}
I have a class, BigObject, which creates a 700MiB array in its constructor, and has a finalise method which does nothing other than print to console. In Main, I create two of these objects, free one, and then create a third.
If this is compiled for 32-bit (so as to limit memory to 2 GB), an out-of-memory exception is thrown when creating the third BigObject. This is because, when memory is requested for the third time, the request cannot be satisfied and so the garbage collector runs. However, the first BigObject, which is ready to be collected, has a finaliser, so instead of being collected it is placed on the finalisation queue and finalised. The garbage collector then halts and the exception is thrown. However, if the call to GC.Collect is uncommented, or the finalise method is removed, the code will run fine.
My question is, why does the garbage collector not do everything it can to satisfy the request for memory? If it ran twice (once to finalise and again to free) the above code would work fine. Shouldn't the garbage collector continue to finalise and collect until no more memory can be free'd before throwing the exception, and is there any way to configure it to behave this way (either in code or through Visual Studio)?
It's non-deterministic when the GC will run and try to reclaim memory.
Add the lines below after big1 = null. However, you should be careful about forcing the GC to collect; it's not recommended unless you know what you are doing.
GC.Collect();
GC.WaitForPendingFinalizers();
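In the context of the Main method above, that would look something like this (a sketch; the extra GC.Collect() after the finalizers have run makes the reclamation of the finalized BigObject explicit rather than leaving it to the next allocation):

Console.WriteLine("null 1");
big1 = null;

GC.Collect();                   // BigObject 0 becomes finalizable and is queued
GC.WaitForPendingFinalizers();  // wait for its finalizer to run
GC.Collect();                   // explicitly reclaim the now-finalized object

Console.WriteLine("alloc3");
big1 = new BigObject();         // the 32-bit build should now have room for this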
Best Practice for Forcing Garbage Collection in C#
When should I use GC.SuppressFinalize()?
Garbage collection in .NET (generations)
I guess it's because the time at which the finalizer executes during garbage collection is undefined. Resources are not guaranteed to be released at any specific time (unless you call a Close or Dispose method). Also, the order in which finalizers run is not defined, so you could have a finalizer on another object waiting while your object waits for that one.

Why is concurrent modification of arrays so slow?

I was writing a program to illustrate the effects of cache contention in multithreaded programs. My first cut was to create an array of long and show how modifying adjacent items causes contention. Here's the program.
const long maxCount = 500000000;
const int numThreads = 4;
const int Multiplier = 1;
static void DoIt()
{
long[] c = new long[Multiplier * numThreads];
var threads = new Thread[numThreads];
// Create the threads
for (int i = 0; i < numThreads; ++i)
{
threads[i] = new Thread((s) =>
{
int x = (int)s;
while (c[x] > 0)
{
--c[x];
}
});
}
// start threads
var sw = Stopwatch.StartNew();
for (int i = 0; i < numThreads; ++i)
{
int z = Multiplier * i;
c[z] = maxCount;
threads[i].Start(z);
}
// Wait for 500 ms and then access the counters.
// This just proves that the threads are actually updating the counters.
Thread.Sleep(500);
for (int i = 0; i < numThreads; ++i)
{
Console.WriteLine(c[Multiplier * i]);
}
// Wait for threads to stop
for (int i = 0; i < numThreads; ++i)
{
threads[i].Join();
}
sw.Stop();
Console.WriteLine();
Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}
I'm running Visual Studio 2010, program compiled in Release mode, .NET 4.0 target, "Any CPU", and executed in the 64-bit runtime without the debugger attached (Ctrl+F5).
That program runs in about 1,700 ms on my system with a single thread. With two threads, it takes over 25 seconds. Figuring that the difference was cache contention, I set Multiplier = 8 and ran again. The result is 12 seconds, so contention was at least part of the problem.
Increasing Multiplier beyond 8 doesn't improve performance.
For comparison, a similar program that doesn't use an array takes only about 2,200 ms with two threads when the variables are adjacent. When I separate the variables, the two thread version runs in the same amount of time as the single-threaded version.
If the problem was array indexing overhead, you'd expect it to show up in the single-threaded version. It looks to me like there's some kind of mutual exclusion going on when modifying the array, but I don't know what it is.
Looking at the generated IL isn't very enlightening. Nor was viewing the disassembly. The disassembly does show a couple of calls to (I think) the runtime library, but I wasn't able to step into them.
I'm not proficient with windbg or other low-level debugging tools these days. It's been a really long time since I needed them. So I'm stumped.
My only hypothesis right now is that the runtime code is setting a "dirty" flag on every write. It seems like something like that would be required in order to support throwing an exception if the array is modified while it's being enumerated. But I readily admit that I have no direct evidence to back up that hypothesis.
Can anybody tell me what is causing this big slowdown?
You've got false sharing. I wrote an article about it here
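As a hedged illustration of the usual fix (not from the linked article): keep each thread's counter on its own cache line, either by spacing the elements out as the Multiplier = 8 experiment above does, or by padding each counter explicitly:

using System.Runtime.InteropServices;

// Each counter occupies its own 64-byte block, so two threads no longer write
// into the same cache line. (The array itself is only 8/16-byte aligned, so this
// is not a strict guarantee, but in practice it removes the contention.)
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

// In DoIt(), use PaddedCounter[] c = new PaddedCounter[numThreads];
// and decrement c[x].Value instead of c[x].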
