I'm using a DataTable to hold a running last 1000 log messages in FIFO fashion. I add items into the DataTable and remove the first row after the size grows to 1000 items. However, even though the DataTable never exceeds 1000 items, the memory grows over time.
Sample:
DataTable dtLog = new DataTable();
for (int nLoop = 0; nLoop < 10000; nLoop++)
{
    LogType oLog = new LogType();
    oLog.Name = "Message number " + nLoop;
    dtLog.Rows.Add(oLog);
    if (dtLog.Rows.Count > 1000)
        dtLog.Rows.RemoveAt(0);
}
So the messages are removed from the datatable, but the memory doesn't seem to get released. I would expect the memory to be released...?
Or perhaps there's a better way to do a running log using something other than datatables?
I can't speak to the memory-leak part of your question, as memory management and garbage collection in .NET make that a hard thing to investigate.
But, what I can do is suggest that unless you have to, you should never use DataTables in .Net.
Now, "never" is a pretty strong claim! That sort of thing needs backing up with good reasons.
So, what are those reasons? ... memory usage.
I created this .net fiddle: https://dotnetfiddle.net/wOtjw1
using System;
using System.Collections.Generic;
using System.Xml;
using System.Data;

public class DataObject
{
    public string Name { get; set; }
}

public class Program
{
    public static void Main()
    {
        Queue();
    }

    public static void DataTable()
    {
        var dataTable = new DataTable();
        dataTable.Columns.Add("Name", typeof(string));
        for (int nLoop = 0; nLoop < 10000; nLoop++)
        {
            var dataObject = new DataObject();
            dataObject.Name = "Message number " + nLoop;
            // store the string value; the Name column is typed string
            dataTable.Rows.Add(dataObject.Name);
            if (dataTable.Rows.Count > 1000)
                dataTable.Rows.RemoveAt(0);
        }
    }

    public static void Queue()
    {
        var queue = new Queue<DataObject>();
        for (int nLoop = 0; nLoop < 10000; nLoop++)
        {
            var dataObject = new DataObject();
            dataObject.Name = "Message number " + nLoop;
            queue.Enqueue(dataObject);
            if (queue.Count > 1000)
                queue.Dequeue();
        }
    }
}
Run it twice, once with the DataTable method, once with the Queue method.
Look at the memory usage .net fiddle reports each time:
DataTable Memory: 2.74Mb
Queue Memory: 1.46Mb
It's almost half the memory usage! And all we did was stop using DataTables.
.NET DataTables are notoriously memory hungry. There are fairly good reasons for that: they can store lots of complex schema information, they can track changes, etc.
That's great, but ... do you need those features?
No? Dump the DT, use something under System.Collections(.Generic).
Whenever you modify or delete a row in a DataTable, the old/deleted data is still kept by the DataTable until you call DataTable.AcceptChanges.
When AcceptChanges is called, any DataRow object still in edit mode successfully ends its edits. The DataRowState also changes: all Added and Modified rows become Unchanged, and Deleted rows are removed.
There is no memory leak because that is as designed.
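A minimal sketch of that behavior (my own illustration, not code from the question):

```csharp
using System;
using System.Data;

class AcceptChangesDemo
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        table.Rows.Add("first message");

        table.Rows[0].Delete();               // row is only marked Deleted
        Console.WriteLine(table.Rows.Count);  // still 1 - the old data is retained

        table.AcceptChanges();                // deleted rows are actually removed now
        Console.WriteLine(table.Rows.Count);  // 0
    }
}
```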
As an alternative you can use a circular buffer which would fit better than a queue.
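A minimal sketch of such a circular buffer (the names are my own, and a real implementation would need locking if the log is written from multiple threads):

```csharp
using System;
using System.Collections.Generic;

// Fixed-capacity log: after warm-up, each new message overwrites the oldest
// slot in place, so the buffer never grows and never shifts elements.
class CircularLog
{
    private readonly string[] _slots;
    private int _next;   // slot that will be overwritten next
    private int _count;  // how many slots are filled so far

    public CircularLog(int capacity) { _slots = new string[capacity]; }

    public void Add(string message)
    {
        _slots[_next] = message;
        _next = (_next + 1) % _slots.Length;
        if (_count < _slots.Length) _count++;
    }

    // Returns the messages oldest-first.
    public IEnumerable<string> Snapshot()
    {
        int start = (_count == _slots.Length) ? _next : 0;
        for (int i = 0; i < _count; i++)
            yield return _slots[(start + i) % _slots.Length];
    }
}
```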
Your memory is released, but it is not so easy to see. There is a lack of tools (except Windbg with SOS) to show the currently allocated memory minus dead objects. Windbg has the !DumpHeap -live option for this, to display only live objects.
I have tried the fiddle from AndyJ https://dotnetfiddle.net/wOtjw1
First I needed to create a memory dump of the DataTable version to have a stable baseline. MemAnalyzer (https://github.com/Alois-xx/MemAnalyzer) is the right tool for that.
MemAnalyzer.exe -procdump -ma DataTableMemoryLeak.exe DataTable.dmp
This expects procdump from SysInternals in your path.
Now you can run the program with the queue implementation and compare the allocation metrics on the managed heap:
C>MemAnalyzer.exe -f DataTable.dmp -pid2 20792 -dtn 3
Delta(Bytes) Delta(Instances) Instances Instances2 Allocated(Bytes) Allocated2(Bytes) AvgSize(Bytes) AvgSize2(Bytes) Type
-176,624 -10,008 10,014 6 194,232 17,608 19 2934 System.Object[]
-680,000 -10,000 10,000 0 680,000 0 68 System.Data.DataRow
-7,514 -88 20,273 20,185 749,040 741,526 36 36 System.String
-918,294 -20,392 60,734 40,342 1,932,650 1,014,356 Managed Heap(Allocated)!
-917,472 0 0 0 1,954,980 1,037,508 Managed Heap(TotalSize)
This shows that we have 917KB more memory allocated with the DataTable approach and that 10K DataRow instances are still floating around on the managed heap. But are these numbers correct?
No.
Because most objects are already dead, but no full GC happened before we took the memory dump, these objects are still reported as alive. The fix is to tell MemAnalyzer to consider only rooted (live) objects, as Windbg does with its -live option:
C>MemAnalyzer.exe -f DataTable.dmp -pid2 20792 -dts 5 -live
Delta(Bytes) Delta(Instances) Instances Instances2 Allocated(Bytes) Allocated2(Bytes) AvgSize(Bytes) AvgSize2(Bytes) Type
-68,000 -1,000 1,000 0 68,000 0 68 System.Data.DataRow
-36,960 -8 8 0 36,960 0 4620 System.Data.RBTree+Node<System.Data.DataRow>[]
-16,564 -5 10 5 34,140 17,576 3414 3515 System.Object[]
-4,120 -2 2 0 4,120 0 2060 System.Data.DataRow[]
-4,104 -1 19 18 4,716 612 248 34 System.String[]
-141,056 -1,285 1,576 291 169,898 28,842 Managed Heap(Allocated)!
-917,472 0 0 0 1,954,980 1,037,508 Managed Heap(TotalSize)
The DataTable approach still needs 141,056 bytes more memory because of the extra DataRow, object[] and System.Data.RBTree+Node[] instances. Measuring only the working set is not enough, because the managed heap is lazily deallocated. The GC can keep large amounts of memory if it thinks that the next memory spike is not far away. Measuring committed memory is therefore a nearly meaningless metric, except if your (very low-hanging) goal is to fix only memory leaks that are GBs in size.
The correct way to measure things is to measure the sum of
Unmanaged Heap
Allocated Managed Heap
Memory Mapped Files
Page file backed Memory Mapped Files (Shareable Memory)
Private Bytes
This is actually exactly what MemAnalyzer does with the -vmmap switch, which expects vmmap from Sysinternals in its path.
MemAnalyzer -pid ddd -vmmap
This way you can also track unmanaged memory leaks or file mapping leaks as well. The return value of MemAnalyzer is the total allocated memory in KB.
If -vmmap is used it will report the sum of the above points.
If vmmap is not present it will only report the allocated managed heap.
If -live is added then only rooted managed objects are reported.
I wrote the tool because, to my knowledge, there are no tools out there which make it easy to look at memory leaks in a holistic way. I always want to know if I am leaking memory, regardless of whether it is managed, unmanaged or something else.
By writing the diff output to a CSV file you can easily create pivot diff charts like the one above.
MemAnalyzer.exe -f DataTable.dmp -pid2 20792 -live -o ExcelDiff.csv
That should give you some ideas how to track allocation metrics in a much more accurate way.
Related
I want to know if Array.Resize deletes the old allocated Array, and if yes when?
I assumed that it deletes it as soon as the values are copied.
But my teacher says that it only does so at the end of the program, meaning that the memory could be full with the old allocated values.
Is that so?
The old Array is not used in my code after the resize, this should call the GC, shouldn't it?
When objects are garbage-collected is a nondeterministic process and you shouldn’t care for that too much.
However, what is deterministic is when the array becomes eligible for GC. This is when it gets out of scope or, more specifically, when there are no more references to it. This happens, for example, when you're outside the method that contains the array. Being eligible for GC, however, won't delete it; there needs to be some memory pressure on the GC which will make the GC clean up resources.
HimBromBeere and erikallen already explained what happens. We can also easily verify this experimentally.
Consider the following code:
static void Main(string[] args)
{
    byte[] a = new byte[] { };
    long total = 0;
    Console.WriteLine("Iteration | curent array size (KB) | total allocations (KB) | private memory size (KB)");
    for (int i = 1; i < Int32.MaxValue; i++)
    {
        Array.Resize(ref a, i);
        total += i;
        if (i % 10000 == 0)
        {
            Console.WriteLine(i.ToString().PadLeft(9) +
                (i / 1024).ToString().PadLeft(25) +
                (total / 1024).ToString().PadLeft(25) +
                (Process.GetCurrentProcess().PrivateMemorySize64 / 1024).ToString().PadLeft(27));
        }
    }
}
which yields the following result:
Iteration | curent array size (KB) | total allocations (KB) | private memory size (KB)
10000 9 48833 10560
20000 19 195322 10924
30000 29 439467 10976
40000 39 781269 11040
50000 48 1220727 11040
60000 58 1757841 11056
70000 68 2392612 11080
80000 78 3125039 11144
90000 87 3955122 14192
...
If all old arrays were kept in memory, we'd need around 4 GB (column "total allocations") after 90,000 iterations, but memory usage stays at a low 14 MB (column "private memory size").
The old array will be considered unreachable by the GC and will be freed at some unspecified point in time, just as all other objects that become unreachable.
An object is eligible for collection when the GC determines that the object is not reachable anymore. Therefore, the original array can be collected if there is no usable reference left to reach it.
When the GC decides to collect the object is an altogether different matter, and it is up to the GC to decide; it might very well not collect it at all during the whole lifetime of your app, simply because there is no memory pressure that requires it.
Example:
private Blah[] Frob()
{
    var someArray = new Blah[] { .... };
    // some work
    Array.Resize(ref someArray, size); // allocates a new array; someArray now points to it
    return someArray;
}
In this case, the old array that someArray originally referenced will be eligible for collection once Frob returns, because it is no longer reachable. It's a locally initialized object that cannot be reached in any way.
However, in this example:
private Blah[] Foo()
{
    var someArray = GetArrayOfFrobs();
    // some work
    Array.Resize(ref someArray, size);
    return someArray;
}
The object referenced by someArray will be eligible for collection depending on what GetArrayOfFrobs actually does. If GetArrayOfFrobs returns an array that is cached somewhere or is part of the state of some other reachable object, then the GC will not mark it as collectible.
In any case, in a managed environment like .NET it's not methods that decide whether a managed object is "freed", as you seem to believe based on your question; it's the GC, and it does a pretty good job of it, so don't fret about it.
I've created a simple test application which allocates 100 MB using byte arrays.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
namespace VirtualMemoryUsage
{
    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            sb.AppendLine($"IsServerGC = {System.Runtime.GCSettings.IsServerGC}");
            const int tenMegabyte = 1024 * 1024 * 10;
            long allocatedMemory = 0;
            List<byte[]> memory = new List<byte[]>();
            for (int i = 0; i < 10; i++)
            {
                // alloc 10 MB memory
                memory.Add(new byte[tenMegabyte]);
                allocatedMemory += tenMegabyte;
            }
            sb.AppendLine($"Allocated memory: {PrettifyByte(allocatedMemory)}");
            sb.AppendLine($"VirtualMemorySize64: {PrettifyByte(Process.GetCurrentProcess().VirtualMemorySize64)}");
            sb.AppendLine($"PrivateMemorySize64: {PrettifyByte(Process.GetCurrentProcess().PrivateMemorySize64)}");
            sb.AppendLine();
            Console.WriteLine(sb.ToString());
            Console.ReadLine();
        }

        private static string PrettifyByte(long allocatedMemory)
        {
            string[] sizes = { "B", "KB", "MB", "GB", "TB" };
            int order = 0;
            double size = allocatedMemory;
            while (size >= 1024 && order < sizes.Length - 1)
            {
                order++;
                size /= 1024; // divide as double so the fractional part survives
            }
            return $"{size:0.##} {sizes[order]}";
        }
    }
}
Note: For this test it is important to set gcServer to true in the app.config:
<runtime>
<gcServer enabled="true"/>
</runtime>
This then will show the amount of PrivateMemorySize64 and VirtualMemorySize64 allocated by the process.
While PrivateMemorySize64 remains similar on different computers, VirtualMemorySize64 varies quite a bit.
What is the reason for this differences in VirtualMemorySize64 when the same amount of memory is allocated? Is there any documentation about this?
Wow, you're lucky. On my machine, the last line says 17 GB!
Allocated memory: 100M
VirtualMemorySize64: 17679M
PrivateMemorySize64: 302M
While PrivateMemorySize64 remains similar on different computers [...]
Private bytes are the bytes that belong to your program only. It can hardly be influenced by something else. It contains what is on your heap and inaccessible by someone else.
Why is that 302 MB and not just 100 MB? SysInternals VMMap is a good tool to break down that value:
The colors and sizes of private bytes say:
violet (7.5 MB): image files, i.e. DLLs that are not shareable
orange (11.2 MB): heap (non-.NET)
green (103 MB): managed heap
orange (464 kB): stack
yellow (161 MB): private data, e.g. TEB and PEB
brown (36 MB): page table
As you can see, .NET has just 3 MB overhead in the managed heap. The rest is other stuff that needs to be done for any process.
A debugger or a profiler can help in breaking down the managed heap:
0:013> .loadby sos clr
0:013> !dumpheap -stat
[...]
000007fedac16878 258 11370 System.String
000007fed9bafb38 243 11664 System.Diagnostics.ThreadInfo
000007fedac16ef0 34 38928 System.Object[]
000007fed9bac9c0 510 138720 System.Diagnostics.NtProcessInfoHelper+SystemProcessInformation
000007fedabcfa28 1 191712 System.Int64[]
0000000000a46290 305 736732 Free
000007fedac1bb20 13 104858425 System.Byte[]
Total 1679 objects
So you can see there are some strings and other objects that .NET needs "by default".
What is the reason for this differences in VirtualMemorySize64 when the same amount of memory is allocated?
0:013> !address -summary
[...]
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 58 7fb`adfae000 ( 7.983 TB) 99.79%
MEM_RESERVE 61 4`3d0fc000 ( 16.954 GB) 98.11% 0.21%
MEM_COMMIT 322 0`14f46000 ( 335.273 MB) 1.89% 0.00%
Only 335 MB are committed. That's memory that can actually be used. The 16.954 GB are just reserved. They cannot be used at the moment. They are neither in RAM nor on disk in the page file. Allocating reserved memory is super fast. I've seen that 17 GB value very often, especially in ASP.NET crash dumps.
Looking at the details in VMMap again, we can see that the 17 GB are allocated in just one block. A comment on your question said: "When the system runs out of memory, the garbage collector fires and releases the busy one." However, to release a VirtualAlloc()'d block via VirtualFree(), that block must be logically empty, i.e. there should not be a single .NET object inside - and that's unlikely. So it will stay there forever.
What are the possible benefits? It's a single contiguous block of memory. If you needed a new byte[4G] now, it would just work.
Finally, the likely reason is: it's done because it doesn't hurt, neither RAM nor disk. And when needed, it can be committed at a later point in time.
Is there any documentation about this?
That's unlikely. The GC implementation in detail could change with the next version of .NET. I think Microsoft does not document that, otherwise people would complain if the behavior changed.
There are people who have written blog posts like this one that tells us that some values might depend on the number of processors for example.
0:013> !eeheap -gc
Number of GC Heaps: 4
[...]
What we see here is that .NET creates as many heaps as there are processors. That's good for garbage collection, since every processor can collect one heap independently.
The metrics you are using are not allocated memory, but memory used by the process: one private, the other shared with other processes on your machine. The real amount of memory used by the process varies depending on both the amount of available memory and the other processes running.
Edit: answer by Thomas Weller provides much more details on that subject than my Microsoft links
It does not necessarily represent the amount of allocations performed by your application. If you want an estimate of the allocated memory (not including .NET framework libraries, memory pagination overhead, etc.) you can use
long memory = GC.GetTotalMemory(true);
where the true parameter tells the GC to perform garbage collection first (it doesn't have to). Unused, but not yet collected, memory is accounted for in the values you asked about. If the system has enough memory, it might not be collected until it's needed. Here you can find additional information on how the GC works.
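For example, the difference between the two modes is roughly the garbage that was collectible but not yet collected (a quick sketch of my own):

```csharp
long before = GC.GetTotalMemory(false); // may still include dead objects
long after  = GC.GetTotalMemory(true);  // forces a collection first
Console.WriteLine($"collectible garbage was roughly {before - after} bytes");
```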
I wrote a naive Parallel.For() loop in C#, shown below. I also did the same work using a regular for() loop to compare single-thread vs. multi-thread. The single thread version took about five seconds every time I ran it. The parallel version took about three seconds at first, but if I ran it about four times, it would slow down dramatically. Most often it took about thirty seconds. One time it took eighty seconds. If I restarted the program, the parallel version would start out fast again, but slow down after three or four parallel runs. Sometimes the parallel runs would speed up again to the original three seconds then slow down.
I wrote another Parallel.For() loop for computing Mandelbrot set members (discarding the results) because I figured that the problem might be related to memory issues allocating and manipulating a large array. The Parallel.For() implementation of this second problem does indeed execute faster than the single-thread version every time, and the times are consistent too.
What data should I be looking at to understand why my first naive program slows down after a number of runs? Is there something in Perfmon I should be looking at? I still suspect it is memory related, but I allocate the array outside the timer. I also tried a GC.Collect() at the end of each run, but that didn't seem to help, not consistently anyway. Might it be an alignment issue with cache somewhere on the processor? How would I figure that out? Is there anything else that might be the cause?
JR
const int _meg = 1024 * 1024;
const int _len = 1024 * _meg;
private void ParallelArray() {
int[] stuff = new int[_meg];
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
lblStart.Content = DateTime.Now.ToString();
s.Start();
Parallel.For(0,
_len,
i => {
stuff[i % _meg] = i;
}
);
s.Stop();
lblResult.Content = DateTime.Now.ToString();
lblDiff.Content = s.ElapsedMilliseconds.ToString();
}
I have profiled your code and it indeed looks strange. There should be no deviations. It is not an allocation issue (GC is fine and you are allocating only one array per run).
The problem can be reproduced on my Haswell CPU where the parallel version suddenly takes much longer to execute. I have CLR version 4.0.30319.34209 FX452RTMGDR.
On x64 it works fine and has no issues. Only x86 builds seem to suffer from it.
I have profiled it with the Windows Performance Toolkit and found that it looks like a CLR issue where the TPL tries to find the next work item. Sometimes the call
System.Threading.Tasks.RangeWorker.FindNewWork(Int64 ByRef, Int64 ByRef)
System.Threading.Tasks.Parallel+<>c__DisplayClassf`1[[System.__Canon, mscorlib]].<ForWorker>b__c()
System.Threading.Tasks.Task.InnerInvoke()
System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)
System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)
System.Threading.Tasks.Task.InnerInvoke()
seems to "hang" in the CLR itself.
clr!COMInterlocked::ExchangeAdd64+0x4d
When I compare the sampled stacks with a slow and fast run I find:
ntdll.dll!__RtlUserThreadStart -52%
kernel32.dll!BaseThreadInitThunk -52%
ntdll.dll!_RtlUserThreadStart -52%
clr.dll!Thread::intermediateThreadProc -48%
clr.dll!ThreadpoolMgr::ExecuteWorkRequest -48%
clr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem -48%
clr.dll!ManagedThreadBase_FullTransitionWithAD -48%
clr.dll!ManagedThreadBase_DispatchOuter -48%
clr.dll!ManagedThreadBase_DispatchMiddle -48%
clr.dll!ManagedThreadBase_DispatchInner -48%
clr.dll!QueueUserWorkItemManagedCallback -48%
clr.dll!MethodDescCallSite::CallTargetWorker -48%
clr.dll!CallDescrWorkerWithHandler -48%
mscorlib.ni.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteEntry(Boolean) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.TaskByRef) -48%
mscorlib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext System.Threading.ContextCallback System.Object Boolean) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecutionContextCallback(System.Object) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.Execute() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke() -48%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0 -24%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0<itself> -24%
...
clr.dll!COMInterlocked::ExchangeAdd64 +50%
In the dysfunctional case most of the time (50%) is spent in clr.dll!COMInterlocked::ExchangeAdd64. This method was compiled with FPO (frame pointer omission) to gain more performance, which is why the stacks were broken in the middle. I had thought that such code was not allowed in the Windows code base because it makes profiling harder. Looks like the optimizations have gone too far.
When I single-step with the debugger to the actual exchange operation
eax=01c761bf ebx=01c761cf ecx=00000000 edx=00000000 esi=00000000 edi=0274047c
eip=747ca4bd esp=050bf6fc ebp=01c761bf iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
clr!COMInterlocked::ExchangeAdd64+0x49:
747ca4bd f00fc70f lock cmpxchg8b qword ptr [edi] ds:002b:0274047c=0000000001c761bf
cmpxchg8b compares EDX:EAX=1c761bf with the memory location and, if the values are equal, copies the new value ECX:EBX=1c761cf to the memory location. When you look at the registers, you find that at index 0x1c761bf = 29,843,903 the values are not equal. It looks like there is a race condition (or excessive contention) when incrementing the global loop counter, which surfaces only when your method body does so little work that it pops out.
Congrats, you have found a real bug in the .NET Framework! You should report it on the Connect website to make them aware of this issue.
To be absolutely sure that it is not another issue you can try the parallel loop with an empty delegate:
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
s.Start();
Parallel.For(0,_len, i => {});
s.Stop();
System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
This does also repro the issue. It is therefore definitely a CLR issue. Normally we at SO tell people not to try to write lock-free code, since it is very hard to get right. But even the smartest guys at MS seem to get it wrong sometimes...
Update:
I have opened a bug report here: https://connect.microsoft.com/VisualStudio/feedbackdetail/view/969699/parallel-for-causes-random-slowdowns-in-x86-processes
Based on your program, I wrote a program to reproduce the problem. I think it is related to the .NET large object heap and how Parallel.For is implemented.
class Program
{
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; i++)
            //ParallelArray();
            SingleFor();
    }

    const int _meg = 1024 * 1024;
    const int _len = 1024 * _meg;

    static void ParallelArray()
    {
        int[] stuff = new int[_meg];
        System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
        s.Start();
        Parallel.For(0,
            _len,
            i =>
            {
                stuff[i % _meg] = i;
            }
        );
        s.Stop();
        System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
    }

    static void SingleFor()
    {
        int[] stuff = new int[_meg];
        System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
        s.Start();
        for (int i = 0; i < _len; i++)
        {
            stuff[i % _meg] = i;
        }
        s.Stop();
        System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
    }
}
I compiled with VS2013, release version, and ran it without the debugger. If the function ParallelArray() is called in the main loop, the result I got is:
1631
1510
51302
1874
45243
2045
1587
1976
44257
1635
If the function SingleFor() is called, the result is:
898
901
897
897
897
898
897
897
899
898
I went through some documentation on MSDN about Parallel.For, and this caught my attention: "Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values." In our Parallel.For loop, we're using a shared variable, stuff.
The article Parallel Aggregation explains how .NET deals with this case: "The Parallel Aggregation pattern uses unshared, local variables that are merged at the end of the computation to give the final result. Using unshared, local variables for partial, locally calculated results is how the steps of a loop can become independent of each other. Parallel aggregation demonstrates the principle that it's usually better to make changes to your algorithm than to add synchronization primitives to an existing algorithm." This means it creates local copies of data instead of using locks to guard the shared variable, and at the end these partitions need to be combined together, which brings performance penalties.
When I ran the test program with Parallel.For, I used Process Explorer to count the threads: it had 11 threads, so Parallel.For created 10 partitions for the loop, which means it created 10 local copies with size 100K; these objects will be placed on the Large Object Heap.
There are two different types of heaps in .NET: the Small Object Heap (SOH) and the Large Object Heap (LOH). If an object is larger than 85,000 bytes, it goes on the LOH. When doing GC, .NET treats the two heaps differently.
As it is explained in this blog: No More Memory Fragmentation on the .NET Large Object Heap: One of the key differences between the heaps is that the SOH compacts memory and hence reduces the chance of memory fragmentation dramatically while the LOH does not employ compaction. As a result, excessive usage of the LOH may result in memory fragmentation that can become severe enough to cause problems in applications.
As you're continuously allocating big arrays larger than 85,000 bytes, when the LOH becomes fragmented, performance goes down.
If you're using .NET 4.5.1, you can set GCSettings.LargeObjectHeapCompactionMode to CompactOnce to make the LOH compact on the next GC.Collect().
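A minimal sketch of that setting (it applies to the next blocking full collection and then resets to Default):

```csharp
// GCSettings lives in the System.Runtime namespace.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // the LOH is compacted during this blocking collection
```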
Another good article to understand this problem is: Large Object Heap Uncovered
Further investigation is needed, but I don't have time now.
I was running some tests to see how my logging would perform if, instead of doing File.AppendAllText, I would first write to a memory stream and then copy it to the file. So, just to see how fast memory operation is, I did this:
private void button1_Click(object sender, EventArgs e)
{
    using (var memFile = new System.IO.MemoryStream())
    {
        using (var bw = new System.IO.BinaryWriter(memFile))
        {
            for (int i = 0; i < Int32.MaxValue; i++)
            {
                bw.Write(i.ToString() + Environment.NewLine);
            }
            bw.Flush();
        }
        memFile.CopyTo(new System.IO.FileStream(System.IO.Path.Combine("C", "memWriteWithBinaryTest.log"), System.IO.FileMode.OpenOrCreate));
    }
}
When I reached 25413324, an exception of type 'System.OutOfMemoryException' was thrown, even though Process Explorer says I have about 700 MB of free RAM???
EDIT: For the sake of more objects being created on the heap, I rewrote the bw.Write to this:
bw.Write(i);
First of all, you run out of memory because you accumulate data in the MemoryStream, instead of writing it directly to the FileStream. Use the FileStream directly and you won't need much RAM at all (but you will have to keep the file open).
The amount of physical memory unused is not directly relevant to this exception, as strange as that might sound.
What matters is:
that you have a contiguous chunk of memory available in the process' virtual address space
that the system commit does not exceed the total RAM size + page file size
When you ask the Windows memory manager to allocate you some RAM, it needs to check not how much is available, but how much it has promised to make available to every other process. Such promising is done through commits. To commit some memory means that the memory manager offered you a guarantee that it will be available when you finally make use of it.
So, it can be that the physical RAM is completely used up, but your allocation request still succeeds. Why? Because there is lots of space available in the page file. When you actually start using the RAM you got through such an allocation, the memory manager will simply page out something else. So zero free physical RAM does not mean that allocations will fail.
The opposite can happen too; an allocation can fail despite having some unused physical RAM. Your process sees memory through the so-called virtual address space. When your process reads memory at address 0x12340000, that's a virtual address. It might map to RAM at 0x78650000, or at 0x000000AB12340000 (running a 32-bit process on a 64-bit OS), it might point to something that only exists in the page file, or it might not even point at anything at all.
When you want to allocate a block of memory with contiguous addresses, it's in this virtual address space that the RAM needs to be contiguous. For a 32-bit process, you only get 2GB or 3GB of usable address space, so it's not too hard to use it up in such a way that no contiguous chunk of a sufficient size exists, despite there being both free physical RAM and enough total unused virtual address space.
This can be caused by memory fragmentation.
Large objects go onto the large object heap and they don't get moved around to make room for things. This can cause fragmentation where you have gaps in the available memory, which can cause out-of-memory when you try to allocate an object larger than any of the blocks of available memory.
See here for more details.
Any object larger than 85,000 bytes will be placed on the large object heap, except for arrays of doubles for which the threshold is just 1000 doubles (or 8000 bytes).
Also note that 32-bit .Net programs are limited to a maximum of 2GB per object and somewhat less than 4GB overall (perhaps as low as 3GB depending on the OS).
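As a side note on those LOH thresholds, you can check where an array lands yourself; on the full .NET Framework, a freshly allocated LOH object reports the highest generation (this sketch is my own illustration, not from the answer above):

```csharp
var small = new double[999];   // 7,992 bytes of data - small object heap
var large = new double[1000];  // 8,000 bytes of data - LOH on .NET Framework

Console.WriteLine(GC.GetGeneration(small)); // typically 0
Console.WriteLine(GC.GetGeneration(large)); // typically GC.MaxGeneration
```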
You should not be using a BinaryWriter to write text to a file. Use a TextWriter instead.
Now you are using:
for (int i = 0; i < Int32.MaxValue; i++)
This will write at least 3 bytes per write (the number's representation plus the newline). Multiply that by Int32.MaxValue and you need at least 6 GB of memory, seeing that you are writing it all to a MemoryStream.
Looking further at your code, you are going to write the MemoryStream out to a file anyway. So you can simply do the following:
for (int i = 0; i < int.MaxValue; i++)
{
    File.AppendAllText("filename.log", i.ToString() + Environment.NewLine);
}
or write to an open TextWriter:
using (TextWriter writer = File.AppendText("filename.log"))
{
    for (int i = 0; i < int.MaxValue; i++)
    {
        writer.WriteLine(i);
    }
}
If you want some memory buffering (which IMO is a bad idea for logging, as you will lose the last writes during a crash), you can use the following StreamWriter constructor:
StreamWriter(string path, bool append, Encoding encoding, int bufferSize)
and pass a 'biggish' number for bufferSize. The default is 1024.
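For example (the file name and buffer size here are just placeholders of mine):

```csharp
// A 64 KB buffer batches many small writes before they hit the disk.
using (var writer = new StreamWriter("app.log", true, Encoding.UTF8, 64 * 1024))
{
    for (int i = 0; i < 100000; i++)
        writer.WriteLine(i);
}   // Dispose flushes the buffer; anything unflushed is lost if the app crashes
```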
To answer the question: you get an out-of-memory exception because the MemoryStream keeps resizing, and at some point it gets too big to fit into memory (which was discussed in another answer).
I am going to store 350M pre-calculated double numbers in a binary file, and load them into memory as my dll starts up. Is there any built-in way to load them in parallel, or should I split the data into multiple files and manage the threads myself?
Answering the comments: I will be running this dll on powerful enough boxes, most likely only on 64 bit ones. Because all the access to my numbers will be via properties anyway, I can store my numbers in several arrays.
[update]
Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes.
Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.
Well I did a small test and I would definitely recommend using Memory Mapped Files.
I created a file containing 350M double values (2.6GB, as many mentioned before) and then tested the time it takes to map the file to memory and access any of the elements.
In all my tests in my laptop (Win7, .Net 4.0, Core2 Duo 2.0 GHz, 4GB RAM) it took less than a second to map the file and at that point accessing any of the elements took virtually 0ms (all time is in the validation of the index).
Then I decided to go through all 350M numbers, and the whole process took about 3 minutes (paging included), so if in your case you have to iterate over everything, that may call for another option.
Nevertheless, I wrapped the access in a class, just for example purposes (there are a lot of conditions you should check before using this code in production), and it looks like this:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

public class Storage<T> : IDisposable, IEnumerable<T> where T : struct
{
    MemoryMappedFile mappedFile;
    MemoryMappedViewAccessor accessor;
    long elementSize;
    long numberOfElements;

    public Storage(string filePath)
    {
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException("filePath");
        }
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException(filePath);
        }

        FileInfo info = new FileInfo(filePath);
        mappedFile = MemoryMappedFile.CreateFromFile(filePath);
        accessor = mappedFile.CreateViewAccessor(0, info.Length);
        elementSize = Marshal.SizeOf(typeof(T));
        numberOfElements = info.Length / elementSize;
    }

    public long Length
    {
        get { return numberOfElements; }
    }

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= numberOfElements)
            {
                throw new ArgumentOutOfRangeException("index");
            }

            // read the struct at the byte offset corresponding to this index
            T value;
            accessor.Read<T>(index * elementSize, out value);
            return value;
        }
    }

    public void Dispose()
    {
        if (accessor != null)
        {
            accessor.Dispose();
            accessor = null;
        }
        if (mappedFile != null)
        {
            mappedFile.Dispose();
            mappedFile = null;
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (long index = 0; index < numberOfElements; index++)
        {
            T value;
            accessor.Read<T>(index * elementSize, out value);
            yield return value;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    public static T[] GetArray(string filePath)
    {
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException("filePath");
        }
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException(filePath);
        }

        FileInfo info = new FileInfo(filePath);
        using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(filePath))
        using (MemoryMappedViewAccessor accessor = mappedFile.CreateViewAccessor(0, info.Length))
        {
            int elementSize = Marshal.SizeOf(typeof(T));
            long numberOfElements = info.Length / elementSize;
            if (numberOfElements > int.MaxValue)
            {
                // a single array cannot hold this many elements - you would need to split the data
                throw new NotSupportedException("The file holds more elements than a single array can contain.");
            }

            T[] elements = new T[numberOfElements];
            // bulk-copy the entire mapped view into the array in one call
            accessor.ReadArray<T>(0, elements, 0, (int)numberOfElements);
            return elements;
        }
    }
}
Here is an example of how you can use the class:
Stopwatch watch = Stopwatch.StartNew();
using (Storage<double> helper = new Storage<double>("Storage.bin"))
{
    Console.WriteLine("Initialization Time: {0}", watch.ElapsedMilliseconds);
    string item;
    long index;
    Console.Write("Item to show: ");
    while (!string.IsNullOrWhiteSpace((item = Console.ReadLine())))
    {
        if (long.TryParse(item, out index) && index >= 0 && index < helper.Length)
        {
            watch.Reset();
            watch.Start();
            double value = helper[index];
            Console.WriteLine("Access Time: {0}", watch.ElapsedMilliseconds);
            Console.WriteLine("Item: {0}", value);
        }
        else
        {
            Console.WriteLine("Invalid index");
        }
        Console.Write("Item to show: ");
    }
}
UPDATE: I added a static method to load all the data in a file into an array. Obviously this approach takes more time initially (on my laptop it takes between 1 and 2 minutes), but after that, access performance is what you expect from .Net. This method should be useful if you have to access the data frequently.
Usage is pretty simple
double[] helper = Storage<double>.GetArray("Storage.bin");
HTH
It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.
(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)
Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.
The first question you have presumably already answered is "does this have to be precalculated?". Is there some algorithm you can use that will make it possible to calculate the required values on demand to avoid this problem? Assuming not...
That is only 2.6GB of data - on a 64 bit processor you'll have no problem with a tiny amount of data like that. But if you're running on a 5 year old computer with a 10 year old OS then it's a non-starter, as that much data will immediately fill the available working set for a 32-bit application.
One approach that would be obvious in C++ would be to use a memory-mapped file. This makes the data appear to your application as if it is in RAM, but the OS actually pages bits of it in only as it is accessed, so very little real RAM is used. I'm not sure if you could do this directly from C#, but you could easily enough do it in C++/CLI and then access it from C#.
Alternatively, assuming the question "do you need all of it in RAM simultaneously" has been answered with "yes", then you can't go for any kind of virtualisation approach, so...
Loading in multiple threads won't help - you are going to be I/O bound, so you'll have n threads waiting for data (and asking the hard drive to seek between the chunks they are reading) rather than one thread waiting for data (which is being read sequentially, with no seeks). So threads will just cause more seeking and thus may well make it slower. (The only case where splitting the data up might help is if you split it across different physical disks so different chunks of data can be read in parallel - don't do this in software; buy a RAID array.)
The only place where multithreading may help is to make the load happen in the background while the rest of your application starts up, and allow the user to start using the portion of the data that is already loaded while the rest of the buffer fills, so the user (hopefully) doesn't have to wait much while the data is loading.
So, you're back to loading the data into one massive array in a single thread...
However, you may be able to speed this up considerably by compressing the data. There are a couple of general approaches worth considering:
If you know something about the data, you may be able to invent an encoding scheme that makes the data smaller (and therefore faster to load). e.g. if the values tend to be close to each other (e.g. imagine the data points that describe a sine wave - the values range from very small to very large, but each value is only ever a small increment from the last) you may be able to represent the 'deltas' in a float without losing the accuracy of the original double values, halving the data size. If there is any symmetry or repetition to the data you may be able to exploit it (e.g. imagine storing all the positions to describe a whole circle, versus storing one quadrant and using a bit of trivial and fast maths to reflect it 4 times - an easy way to quarter the amount of data I/O). Any reduction in data size would give a corresponding reduction in load time. In addition, many of these schemes would allow the data to remain "encoded" in RAM, so you'd use far less RAM but still be able to quickly fetch the data when it was needed.
Alternatively, you can very easily wrap your stream with a generic compression algorithm such as Deflate. This may not work, but usually the cost of decompressing the data on the CPU is less than the I/O time that you save by loading less source data, so the net result is that it loads significantly faster. And of course, save a load of disk space too.
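A minimal sketch of that idea (using GZipStream from System.IO.Compression, which wraps Deflate; the file name and the sine-wave data are stand-ins, and whether it actually shrinks your doubles depends entirely on the data):

```csharp
using System;
using System.IO;
using System.IO.Compression;

class CompressDemo
{
    static void Main()
    {
        // Write doubles through a compressing stream...
        using (var file = File.Create("data.bin.gz"))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new BinaryWriter(gzip))
        {
            for (int i = 0; i < 100000; i++)
                writer.Write(Math.Sin(i * 0.001)); // stand-in for the precalculated grid
        }

        // ...and read them back through a decompressing stream.
        using (var file = File.OpenRead("data.bin.gz"))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new BinaryReader(gzip))
        {
            double first = reader.ReadDouble(); // Math.Sin(0) == 0
            Console.WriteLine(first);
        }
    }
}
```

The reader never sees the compression: BinaryReader just pulls decompressed bytes from the stream, so the surrounding code is unchanged.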
In the typical case, loading speed will be limited by the speed of the storage you're loading the data from - i.e. the hard drive.
If you want it to be faster, you'll need to use faster storage, e.g. multiple hard drives joined in a RAID scheme.
If your data can be reasonably compressed, do that. Try to find an algorithm that uses about as much CPU power as you have: less than that and your external storage speed will be the limiting factor; more than that and your CPU speed will be. If your compression algorithm can use multiple cores, then multithreading can be useful.
If your data are somehow predictable, you might want to come up with a custom compression scheme. E.g. if consecutive numbers are close to each other, you might store the differences between them; this can improve compression efficiency.
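For instance, a delta-encoding sketch (the file name and sample data are made up; whether single-precision deltas preserve enough accuracy depends on how smooth your data really is):

```csharp
using System;
using System.IO;

class DeltaEncodeDemo
{
    static void Main()
    {
        // Hypothetical slowly-varying data, e.g. samples of a smooth function.
        double[] values = { 1.000, 1.001, 1.003, 1.006 };

        // Encode: the first value at full precision, then single-precision
        // deltas (4 bytes each instead of 8), halving the bulk of the data.
        using (var writer = new BinaryWriter(File.Create("deltas.bin")))
        {
            writer.Write(values[0]);
            for (int i = 1; i < values.Length; i++)
                writer.Write((float)(values[i] - values[i - 1]));
        }

        // Decode by accumulating the deltas. Float rounding errors accumulate
        // over long runs, so this only works if the data tolerates that.
        var decoded = new double[values.Length];
        using (var reader = new BinaryReader(File.OpenRead("deltas.bin")))
        {
            decoded[0] = reader.ReadDouble();
            for (int i = 1; i < decoded.Length; i++)
                decoded[i] = decoded[i - 1] + reader.ReadSingle();
        }

        for (int i = 0; i < values.Length; i++)
            Console.WriteLine("{0} -> {1}", values[i], decoded[i]);
    }
}
```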
Do you really need double precision? Maybe floats will do the job? Maybe you don't need the full range of doubles? For example, if you need the full 53 bits of mantissa precision but only need to store numbers between -1.0 and 1.0, you can try to shave a few bits per number by not storing the exponent's full range.
Making this parallel would be a bad idea unless you're running on a SSD. The limiting factor is going to be the disk IO--and if you run two threads the head is going to be jumping back and forth between the two areas being read. This will slow it down a lot more than any possible speedup from parallelization.
Remember that drives are MECHANICAL devices and insanely slow compared to the processor. If you can do a million instructions in order to avoid a single head seek you will still come out ahead.
Also, once the file is on disk make sure to defrag the disk to ensure it's in one contiguous block.
That does not sound like a good idea to me. 350,000,000 * 8 bytes = 2,800,000,000 bytes. Even if you manage to avoid the OutOfMemoryException, the process may be swapping in and out of the page file anyway. You might as well leave the data in the file and load smaller chunks as they are needed. The point is that just because you can allocate that much memory does not mean you should.
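Loading a value on demand is cheap because the file layout is fixed-width: seek to index * 8 and read 8 bytes. A sketch, assuming the file is just raw doubles in platform byte order (file name and data are made up for illustration):

```csharp
using System;
using System.IO;

public class OnDemandRead
{
    // Fetch one double by index without loading the rest of the file.
    public static double ReadDoubleAt(string path, long index)
    {
        using (var stream = File.OpenRead(path))
        using (var reader = new BinaryReader(stream))
        {
            stream.Seek(index * sizeof(double), SeekOrigin.Begin); // jump straight to the value
            return reader.ReadDouble();
        }
    }

    static void Main()
    {
        // Write a small sample file, then pull out one value on demand.
        using (var writer = new BinaryWriter(File.Create("grid.bin")))
            for (int i = 0; i < 100; i++)
                writer.Write((double)i * i);

        Console.WriteLine(OnDemandRead.ReadDoubleAt("grid.bin", 7)); // 49
    }
}
```

For frequent random access you would keep the stream open rather than reopening it per read, or use the memory-mapped approach from the other answer.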
With a suitable disk configuration, splitting into multiple files across disks would make sense - and reading each file in a separate thread would then work nicely (if you have some striping - RAID, whatever :) - then it could make sense to read from a single file with multiple threads).
I think you're on a hiding to nothing attempting this with a single physical disk, though.
Just saw this: .NET 4.0 has support for memory-mapped files. That would be a very fast way to do it, with no parallelization required.