My program had a problem wherein it would continue to use more CPU than usual after handling a massive amount of data. I traced the problem back to a piece of code that enumerates over a ConcurrentDictionary.
I knew this would cause high CPU usage when enumerating the dictionary while it held millions of entries. However, I raised an eyebrow when the CPU usage didn't drop back down to its original level after the dictionary had emptied. It seems like the dictionary was doing a lot more work behind the scenes after the mass add-then-remove, but why? When enumerating the dictionary before anything is added to it, CPU usage stays firmly at 0%.
Why is this happening? Is there anything I can do to prevent it from happening?
Here's a code snippet that reproduces the problem:
var dict = new ConcurrentDictionary<string, object>();
const int count = 3000000; // High CPU.
//const int count = 100; // No issues.
for (int i = 0; i < count; i++)
{
dict.TryAdd(i.ToString(), null); // Add lots of entries.
}
for (int i = 0; i < count; i++)
{
object nullObj;
dict.TryRemove(i.ToString(), out nullObj); // Then remove them all.
}
GC.Collect();
Debug.Assert(dict.IsEmpty);
Console.WriteLine("Enumerating");
while (true)
{
// Enumerate the dictionary. Takes up lots of CPU despite being empty,
// but only if lots of entries were added then removed.
foreach (var kvp in dict) { /* Empty. */ }
Thread.Sleep(50);
}
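One workaround I'm considering, assuming the cause is the dictionary's internal bucket table (it grows with the peak entry count and, as far as I can tell, is never shrunk, so enumeration still walks every bucket even when the dictionary is empty): swap in a fresh instance once the dictionary has drained. A minimal sketch, under that assumption:
// Hedged workaround sketch: replace the drained dictionary with a fresh one
// so its bucket table returns to the default size. Only safe if no other
// thread is still holding the old reference.
if (dict.IsEmpty)
{
    dict = new ConcurrentDictionary<string, object>();
}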
Related
I've noticed in my test code that ConcurrentQueue<> somehow does not release resources after dequeuing, and eventually I run out of memory. Or perhaps the garbage collection is not happening frequently enough.
Here is a snippet of the code. I know that ConcurrentQueue<> stores references, and yes, I do want to create a new object each time, so if enqueuing is faster than dequeuing, memory will continue to rise. There is also a screenshot of the memory usage below. For testing, I sent through 5000 byte arrays with 500000 elements each.
There is a similar question asked:
ConcurrentQueue holds object's reference or value? "out of memory" exception
and everything mentioned in that post matches what I experienced ... except that the memory won't release after dequeuing, even when the queue is emptied.
I would appreciate any thoughts/insights into this.
const int ObjCount = 5000;   // 5000 arrays, per the test described above
const int ObjSize = 500000;  // 500000 bytes each
ConcurrentQueue<byte[]> TestQueue = new ConcurrentQueue<byte[]>();
Task EnqTask = Task.Factory.StartNew(() =>
{
for (int i = 0; i < ObjCount; i++)
{
byte[] InData = new byte[ObjSize];
InData[0] = (byte)i; //used to show different array object
TestQueue.Enqueue(InData);
System.Threading.Thread.Sleep(20);
}
});
Task DeqTask = Task.Factory.StartNew(() =>
{
int Count = 0;
while (Count < ObjCount)
{
byte[] OutData;
if (TestQueue.TryDequeue(out OutData))
{
OutData[1] = 0xFF; //just do something with the data
Count++;
}
System.Threading.Thread.Sleep(40);
}
});
Task.WaitAll(EnqTask, DeqTask); // wait for both tasks to complete
[Picture of memory usage]
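One workaround I've seen suggested, assuming the queue's internal segments keep references to dequeued items alive (older versions of ConcurrentQueue<T> reportedly behaved this way): enqueue a small mutable holder instead of the array itself, and clear it after use. A sketch of that idea:
// Hedged workaround sketch: the queue may retain a reference to the holder,
// but clearing Data drops the only strong reference to the big array.
class Holder { public byte[] Data; }

ConcurrentQueue<Holder> HolderQueue = new ConcurrentQueue<Holder>();

// producer side
HolderQueue.Enqueue(new Holder { Data = new byte[ObjSize] });

// consumer side
Holder h;
if (HolderQueue.TryDequeue(out h))
{
    byte[] OutData = h.Data;
    h.Data = null;     // make the big array collectible right away
    OutData[1] = 0xFF; // use the data as before
}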
I've read many articles about the GC and about the "don't care about objects" paradigm, but I did a test to prove it.
So the idea is: I create a lot of large objects stored in local variables, and I expected that after all the tasks were done the GC would clean the memory up itself. But it didn't. So, the test code:
using System;
using System.Linq;
using System.Threading;

class Program
{
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = int.MaxValue/10000000; // ≈ 214 work items
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb();
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed%(count/100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0*completed/count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1),GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public int[] Arr = Enumerable.Range(1,10*1024*1024).ToArray(); // 10M ints ≈ 40 MB
}
So in my case the app eats ~2 GB of RAM, but when I press a key and GC.Collect runs, it frees the occupied memory back down to a normal size of ~20 MB.
I've read that manually calling the GC is bad practice, but I cannot avoid it in this case.
In your example there is no need to explicitly call GC.Collect().
If you bring the app up in Task Manager or Performance Monitor you will see the GC working as it runs. The GC is invoked when the runtime needs it: when an allocation cannot be satisfied with the memory available, a collection is triggered to free some up.
That being said, since your objects (greater than 85,000 bytes) go onto the large object heap (LOH), you need to watch out for LOH fragmentation. I've modified your code to show how you can fragment the LOH, which can produce an OutOfMemoryException even though memory is available, just not as a contiguous block. As of .NET 4.5.1 you can set a flag to request that the LOH be compacted.
I modified your code to show an example of this here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace GCTesting
{
class Program
{
static int fragLOHbyIncrementing = 1000;
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = 2000;
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb( fragLOHbyIncrementing++ );
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed % (count / 100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0 * completed / count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public Dumb(int incr)
{
try
{
DumbAllocation(incr);
}
catch (OutOfMemoryException)
{
Console.WriteLine("Out of memory, trying to compact the LOH.");
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
try // try again
{
DumbAllocation(incr);
Console.WriteLine("compacting the LOH worked to free up memory.");
}
catch (OutOfMemoryException)
{
Console.WriteLine("compaction of LOH failed to free memory.");
throw;
}
}
}
private void DumbAllocation(int incr)
{
Arr = Enumerable.Range(1, (10 * 1024 * 1024) + incr).ToArray();
}
public int[] Arr;
}
}
The .NET runtime will garbage collect without your calling the GC. However, the GC methods are exposed so that collections can be timed around the user experience (load screens, waiting for downloads, etc.).
Using the GC methods isn't always a bad idea, but if you need to ask, then it likely is. :-)
I've read that manually calling the GC is bad practice, but I cannot avoid it in this case.
You can avoid it. Just don't call it. The next time you try to do an allocation, the GC will likely kick in and take care of this for you.
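To illustrate that point with a small sketch of my own (not from the original post): allocation pressure alone drives collections, which you can observe via GC.CollectionCount.
// Count gen-0 collections caused purely by allocation; GC.Collect is never called.
int gen0Before = GC.CollectionCount(0);
for (int i = 0; i < 100000; i++)
{
    var garbage = new byte[1024]; // short-lived garbage
}
Console.WriteLine("Gen0 collections during the loop: {0}",
    GC.CollectionCount(0) - gen0Before);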
A few things I can think of that may be influencing this, though none for sure :(
One possible effect is that the GC doesn't kick in right away... the large objects are eligible for collection but haven't been cleaned up yet. Explicitly calling GC.Collect forces a collection right there, and that's where you see the difference. Otherwise it would have just happened at some later point.
The second reason I can think of is that the GC may collect objects but not necessarily release the memory back to the OS. Hence you'd continue to see high memory usage even though the memory is free internally and available for allocation.
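To tell those two cases apart, you can compare the managed heap size to the process working set. A small sketch (my own, not from the post):
// GC.GetTotalMemory reports live managed bytes; Environment.WorkingSet reports
// what the OS currently has committed to the process. A large gap suggests
// memory that is free internally but not yet returned to the OS.
long managedBytes = GC.GetTotalMemory(false);
long workingSetBytes = Environment.WorkingSet;
Console.WriteLine("Managed: {0:N0} bytes, working set: {1:N0} bytes",
    managedBytes, workingSetBytes);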
The garbage collector is clever and decides when the time is right to collect your objects. It does this via heuristics, which are worth reading about; it does its job very well. Is the 2 GB an actual problem for your system, or are you just wondering about the behaviour?
Whenever you call GC.Collect(), don't forget to call GC.WaitForPendingFinalizers(). This avoids unwanted aging of objects with finalizers.
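The commonly recommended pattern, for the rare cases where you do collect manually:
GC.Collect();                   // first pass: finalizable objects get queued
GC.WaitForPendingFinalizers();  // let their finalizers run
GC.Collect();                   // second pass: reclaim the now-finalized objects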
I'm writing trading software. My system produces "sequential items" (1, 2, 3, ...) that need to be processed; in my application each item is an order for execution, and its id is the so-called internal order id.
I have many threads that can produce items (many strategies), but it is guaranteed that each item will be produced exactly once.
I have only one Processor (order executor) that should:
whenever a new item is available, submit it for execution (place the order)
in a callback, receive the result and update the item's result (get the response from the stock exchange)
Each producer should:
submit a "sequential pack" of items for execution (for example, submit "1234, 1235, 1236, 1237")
block until the results for all items are available; when they are, process them
Note that:
items can be submitted from different threads in parallel
I need minimal latency and a minimum of "locks"
it's nice to have code that is easy to port to C++
at any time I have a pretty "limited" number of "live" ids. For example, I cannot have ids 1 and 10000 live at the same time, because whenever a new id is created it must be processed and cleaned up. So I always have a set of ids close to each other (for example, ~9900-10000), so it likely makes sense to use a circular array for the implementation
If you can suggest something, please do. I'm adding my strange implementation below, but it is not necessary to read it.
This is my problematic implementation:
private Dictionary<uint, AutoResetEvent> transactionsEvents = new Dictionary<uint, AutoResetEvent>();
private Dictionary<uint, TransactionResult> transactionsResults = new Dictionary<uint, TransactionResult> ();
public void IssueOrders(List<OrderAction> actions)
{
int count = actions.Count;
if (count == 0)
{
return;
}
uint finishUserId = (uint) apiTransactions.counter.Next(count);
uint startUserId = finishUserId + 1 - (uint) count;
AutoResetEvent[] events = new AutoResetEvent[count];
for (int i = 0; i < count; i++)
{
var action = actions[i];
uint userId = startUserId + (uint) i;
action.UserId = userId;
var e = new AutoResetEvent(false);
events[i] = e;
transactionsEvents[userId] = e;
}
for (int i = 0; i < count; i++)
{
var action = actions[i];
apiTransactions.ScheduleOrderAction(action);
}
WaitHandle.WaitAll(events);
// now all answers are available, need to apply information
foreach (var action in actions)
{
UpdateActionWithResult(action, transactionsResults[action.UserId]);
transactionsResults.Remove(action.UserId);
}
}
ScheduleOrderAction adds the item to a BlockingCollection. The Processor executes items from the BlockingCollection, puts the results into transactionsResults, and raises the corresponding events.
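For reference, the Processor side looks roughly like this (a sketch; pendingActions and ExecuteOnExchange are stand-in names for the real BlockingCollection and exchange call):
foreach (OrderAction action in pendingActions.GetConsumingEnumerable())
{
    TransactionResult result = ExecuteOnExchange(action); // hypothetical exchange call
    transactionsResults[action.UserId] = result;          // publish the result
    transactionsEvents[action.UserId].Set();              // wake the waiting producer
}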
There are a lot of problems with my implementation:
I access (and modify) the dictionaries from different threads. For example, while one thread removes an item from transactionsResults, another thread (the Processor) may be adding an item to it.
I do not want to switch to ConcurrentDictionary because even a plain Dictionary is already too expensive for me (in terms of speed).
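What I have in mind for the circular array is something like this (a sketch of my own; it assumes the live-id window never exceeds the capacity, and fixed arrays with no dictionary lookups should also port to C++ easily):
// Results and per-slot events live in fixed arrays indexed by id % Capacity,
// replacing both dictionaries. TransactionResult is the type from the code above.
class ResultRing
{
    private const int Capacity = 1024; // must exceed the max number of live ids
    private readonly TransactionResult[] results = new TransactionResult[Capacity];
    private readonly ManualResetEventSlim[] ready = new ManualResetEventSlim[Capacity];

    public ResultRing()
    {
        for (int i = 0; i < Capacity; i++)
            ready[i] = new ManualResetEventSlim(false);
    }

    // Called by the Processor when a result arrives.
    public void Publish(uint id, TransactionResult result)
    {
        int slot = (int)(id % Capacity);
        results[slot] = result;
        ready[slot].Set();
    }

    // Called by a producer; blocks until the result for this id is published.
    public TransactionResult Take(uint id)
    {
        int slot = (int)(id % Capacity);
        ready[slot].Wait();
        TransactionResult r = results[slot];
        ready[slot].Reset(); // recycle the slot for a later id
        return r;
    }
}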
I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?
namespace ParallelData
{
public partial class ParallelData : Form
{
public ParallelData()
{
InitializeComponent();
}
private static readonly char[] Separators = { ',', ' ' };
private static void ProcessFile()
{
var lines = File.ReadLines("BigData.csv");
var numbers = ProcessRawNumbers(lines);
var rowTotal = new List<double>();
var totalElements = 0;
foreach (var values in numbers)
{
var sumOfRow = values.Sum();
rowTotal.Add(sumOfRow);
totalElements += values.Count;
}
MessageBox.Show(totalElements.ToString());
}
private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
{
var numbers = new List<List<double>>();
/*System.Threading.Tasks.*/
Parallel.ForEach(lines, line =>
{
lock (numbers)
{
numbers.Add(ProcessLine(line));
}
});
return numbers;
}
private static List<double> ProcessLine(string line)
{
var list = new List<double>();
foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
{
double i;
if (Double.TryParse(s, out i))
{
list.Add(i);
}
}
return list;
}
private void button2_Click(object sender, EventArgs e)
{
ProcessFile();
}
}
}
I'm not sure it's a good idea. Depending on your hardware, the CPU won't be the bottleneck; the disk read speed will.
Another point: if your storage hardware is a magnetic hard disk, then disk read speed is strongly related to how the file is physically stored on the disk; if the file is not fragmented (i.e. all file chunks are sequentially stored on the disk), you'll get better performance by reading line by line sequentially.
One solution would be to read the whole file at once (if you have enough memory; for 1 million rows it should be OK) using File.ReadAllLines, store all lines in a string array, and then process (i.e. parse using string.Split, etc.) in your Parallel.ForEach, if row order is not important.
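For example, a sketch built on the question's ProcessLine method: read sequentially, then parse in parallel, with each iteration writing to its own slot so no lock is needed.
var allLines = File.ReadAllLines("BigData.csv"); // one sequential disk read
var parsed = new List<double>[allLines.Length];
Parallel.For(0, allLines.Length, i =>
{
    // each index is owned by exactly one iteration, so no lock is required
    parsed[i] = ProcessLine(allLines[i]);
});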
In general you should try to avoid having disk access on multiple threads. The disk is a bottleneck and will block, so it might impact performance.
If the size of the lines in the file is not an issue, you should probably read the entire file in first, and then process in parallel.
If the file is too large to do that or it's not practical, then you could use BlockingCollection to load it. Use one thread to read the file and populate the BlockingCollection and then Parallel.ForEach to process the items in it. BlockingCollection allows you to specify the max size of the collection, so it will only read more lines from the file as what's already in the collection is processed and removed.
static void Main(string[] args)
{
string filename = @"c:\vs\temp\test.txt";
int maxEntries = 2;
var c = new BlockingCollection<String>(maxEntries);
var taskAdding = Task.Factory.StartNew(delegate
{
var lines = File.ReadLines(filename);
foreach (var line in lines)
{
c.Add(line); // when there are maxEntries items
// in the collection, this line
// and thread will block until
// the processing thread removes
// an item
}
c.CompleteAdding(); // this tells the collection there's
// nothing more to be added, so the
// enumerator in the other thread can
// end
});
while (c.Count < 1)
{
// this is here simply to give the adding thread time to
// spin up in this much simplified sample
}
Parallel.ForEach(c.GetConsumingEnumerable(), i =>
{
// NOTE: GetConsumingEnumerable() removes items from the
// collection as it enumerates over it, this frees up
// the space in the collection for the other thread
// to write more lines from the file
Console.WriteLine(i);
});
Console.ReadLine();
}
As with some of the others, though, I have to ask the question: Is this something you really need to try optimizing through parallelization, or would a single-threaded solution perform well enough? Multithreading adds a lot of complexity and it's sometimes not worth it.
What kind of performance are you seeing that you want to improve upon?
I checked this code on my computer, and it looks like using Parallel to read a CSV file without any CPU-expensive computation makes no sense: it takes more time to run in parallel than in a single thread. Here are my results:
For the code above:
2699 ms, 2712 ms (checked twice just to confirm the results)
Then with:
private static IEnumerable<List<double>> ProcessRawNumbers2(IEnumerable<string> lines)
{
var numbers = new List<List<double>>();
foreach(var line in lines)
{
lock (numbers)
{
numbers.Add(ProcessLine(line));
}
}
return numbers;
}
Gives me: 2075 ms, 2106 ms
So I think that if the numbers in the CSV don't need to be computed somehow (with some extensive calculation or so) and then stored in the program, it makes no sense to use parallelism in a case like this, as it only adds overhead.
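For completeness, if the per-line work were CPU-expensive, a lock-free variant along these lines (my own sketch, reusing ProcessLine from the question) would let the parallelism actually help:
private static List<List<double>> ProcessRawNumbers3(IEnumerable<string> lines)
{
    return lines
        .AsParallel()
        .AsOrdered()          // preserve input order in the output
        .Select(ProcessLine)  // no shared state, so no lock is required
        .ToList();
}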
I was writing a program to illustrate the effects of cache contention in multithreaded programs. My first cut was to create an array of long and show how modifying adjacent items causes contention. Here's the program.
const long maxCount = 500000000;
const int numThreads = 4;
const int Multiplier = 1;
static void DoIt()
{
long[] c = new long[Multiplier * numThreads];
var threads = new Thread[numThreads];
// Create the threads
for (int i = 0; i < numThreads; ++i)
{
threads[i] = new Thread((s) =>
{
int x = (int)s;
while (c[x] > 0)
{
--c[x];
}
});
}
// start threads
var sw = Stopwatch.StartNew();
for (int i = 0; i < numThreads; ++i)
{
int z = Multiplier * i;
c[z] = maxCount;
threads[i].Start(z);
}
// Wait for 500 ms and then access the counters.
// This just proves that the threads are actually updating the counters.
Thread.Sleep(500);
for (int i = 0; i < numThreads; ++i)
{
Console.WriteLine(c[Multiplier * i]);
}
// Wait for threads to stop
for (int i = 0; i < numThreads; ++i)
{
threads[i].Join();
}
sw.Stop();
Console.WriteLine();
Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}
I'm running Visual Studio 2010, program compiled in Release mode, .NET 4.0 target, "Any CPU", and executed in the 64-bit runtime without the debugger attached (Ctrl+F5).
That program runs in about 1,700 ms on my system, with a single thread. With two threads, it takes over 25 seconds. Figuring that the difference was cache contention, I set Multiplier = 8 and ran again. The result is 12 seconds, so contention was at least part of the problem.
Increasing Multiplier beyond 8 doesn't improve performance.
For comparison, a similar program that doesn't use an array takes only about 2,200 ms with two threads when the variables are adjacent. When I separate the variables, the two thread version runs in the same amount of time as the single-threaded version.
If the problem was array indexing overhead, you'd expect it to show up in the single-threaded version. It looks to me like there's some kind of mutual exclusion going on when modifying the array, but I don't know what it is.
Looking at the generated IL isn't very enlightening. Nor was viewing the disassembly. The disassembly does show a couple of calls to (I think) the runtime library, but I wasn't able to step into them.
I'm not proficient with windbg or other low-level debugging tools these days. It's been a really long time since I needed them. So I'm stumped.
My only hypothesis right now is that the runtime code is setting a "dirty" flag on every write. It seems like something like that would be required in order to support throwing an exception if the array is modified while it's being enumerated. But I readily admit that I have no direct evidence to back up that hypothesis.
Can anybody tell me what is causing this big slowdown?
You've got false sharing. I wrote an article about it here.
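In short: pad each counter onto its own cache line, which is what your Multiplier = 8 experiment approximates (8 longs × 8 bytes = 64 bytes). A minimal sketch, assuming 64-byte cache lines:
using System.Runtime.InteropServices;

// Each counter occupies a full 64-byte cache line, so writes by one thread
// no longer invalidate the line holding another thread's counter.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)]
    public long Value;
}

// usage: thread i touches only counters[i].Value
PaddedCounter[] counters = new PaddedCounter[numThreads];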