I am trying to measure the DDR3 memory data transfer rate through a test. According to the CPU spec, the maximum theoretical bandwidth is 51.2 GB/s. This should be the combined bandwidth of four channels, meaning 12.8 GB/s per channel. However, this is a theoretical limit, and in this post I am curious how to further increase the practical rate. In the test scenario described below I achieve a ~14 GB/s data transfer rate, which I believe may be a close approximation when killing most of the throughput boost from the CPU L1, L2, and L3 caches.
Update 20/3 2014: This assumption about killing the L1-L3 caches is wrong. The hardware prefetching of the memory controller will analyze the data access pattern and, since it is sequential, will have an easy task of prefetching data into the CPU caches.
Specific questions follow at the bottom, but mainly I am interested in a) a verification of the assumptions leading up to this result, and b) whether there is a better way of measuring memory bandwidth in .NET.
I have constructed a test in C# on .NET as a starting point. Although .NET is not ideal from a memory-allocation perspective, I think it is doable for this test (please let me know if you disagree and why). The test is to allocate an int64 array and fill it with integers. This array should have its data aligned in memory. Then I simply loop over this array using as many threads as I have cores on the machine, read the int64 value from the array, and set it to a public field in the test class. Since the result field is public, I should avoid the compiler optimising away the work in the loop. Furthermore, and this may be a weak assumption, I think the result stays in a register and is not written to memory until it is overwritten again. Between each read of an element in the array I use a variable Step offset of 10, 100, and 1000 in the array, so that many references cannot be fetched from the same cache block (64 bytes).
Reading the Int64 from the array should mean a lookup read of 8 bytes and then a read of the actual value of another 8 bytes. Since data is fetched from memory in 64-byte cache lines, each read in the array should correspond to a 64-byte read from RAM on every loop iteration, given that the data read is not located in any CPU cache.
Here is how I initialize the data array:
_longArray = new long[Config.NbrOfCores][];
for (int threadId = 0; threadId < Config.NbrOfCores; threadId++)
{
    _longArray[threadId] = new long[Config.NmbrOfRequests];
    for (int i = 0; i < Config.NmbrOfRequests; i++)
        _longArray[threadId][i] = i;
}
And here is the actual test:
GC.Collect();
timer.Start();
Parallel.For(0, Config.NbrOfCores, threadId =>
{
    var intArrayPerThread = _longArray[threadId];
    for (int redo = 0; redo < Config.NbrOfRedos; redo++)
        for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step)
            _result = intArrayPerThread[i];
});
timer.Stop();
Since the way the results are summarized is quite important, I give this code too (it can be skipped if you trust me...):
var timetakenInSec = timer.ElapsedMilliseconds / (double)1000;
long totalNbrOfRequest = Config.NmbrOfRequests / Config.Step * Config.NbrOfCores*Config.NbrOfRedos;
var throughput_ReqPerSec = totalNbrOfRequest / timetakenInSec;
var throughput_BytesPerSec = throughput_ReqPerSec * byteSizePerRequest;
var timeTakenPerRequestInNanos = Math.Round(1e6 * timer.ElapsedMilliseconds / totalNbrOfRequest, 1);
var resultMReqPerSec = Math.Round(throughput_ReqPerSec/1e6, 1);
var resultGBPerSec = Math.Round(throughput_BytesPerSec/1073741824, 1);
var resultTimeTakenInSec = Math.Round(timetakenInSec, 1);
Omitting the actual output-rendering code, I get the following results:
Step 10: Throughput: 570,3 MReq/s and 34 GB/s (64B), Timetaken/request: 1,8 ns/req, Total TimeTaken: 12624 msec, Total Requests: 7 200 000 000
Step 100: Throughput: 462,0 MReq/s and 27,5 GB/s (64B), Timetaken/request: 2,2 ns/req, Total TimeTaken: 15586 msec, Total Requests: 7 200 000 000
Step 1000: Throughput: 236,6 MReq/s and 14,1 GB/s (64B), Timetaken/request: 4,2 ns/req, Total TimeTaken: 30430 msec, Total Requests: 7 200 000 000
Using 12 threads instead of 6 (since the CPU is hyper-threaded) I get pretty much the same throughput (as expected, I think): 32.9 / 30.2 / 15.5 GB/s.
As can be seen, throughput drops as the step increases, which I think is normal. Partly I think it is because the 12 MB L3 cache forces more cache misses, and partly it may be that the memory controller's prefetch mechanism does not work as well when the reads are so far apart. I further believe that the step-1000 result is the closest to the actual practical memory speed, since it should kill most of the CPU caches and "hopefully" kill the prefetch mechanism. Furthermore, I am assuming that most of the overhead in this loop is the memory fetch operation and not something else.
The hardware for this test is:
Intel Core i7-3930 (specs: CPU brief, more detailed, and really detailed spec) using 32 GB total of DDR3-1600 memory.
Open questions
Am I correct in the assumptions made above?
Is there a way to increase the use of the memory bandwidth? For instance, by doing it in C/C++ instead and spreading the memory allocation out more on the heap, enabling all four memory channels to be used.
Is there a better way to measure the memory data transfer?
Much obliged for input on this. I know it is a complex area under the hood...
All code here is available for download at https://github.com/Toby999/ThroughputTest. Feel free to contact me at a forwarding email: tobytemporary[at]gmail.com.
The decrease in throughput as you increase the step is likely caused by the memory prefetching no longer working well when you don't stride linearly through memory.
Things you can do to improve the speed:
The test speed will be artificially bound by the loop itself taking up CPU cycles. As Roy shows, more speed can be achieved by unrolling the loop.
You should get rid of array bounds checking (note that C#'s "unchecked" keyword only disables arithmetic overflow checks; to avoid bounds checks you need unsafe pointer access or a loop shape the JIT can elide them for)
Instead of using Parallel.For, use Thread.Start and pin each thread you start to a separate core (using the code from here: Set thread processor affinity in Microsoft .Net)
Make sure all threads start at the same time, so you don't measure any stragglers (you can do this by spinning on a memory address that you Interlocked.Exchange to a new value when all threads are running and spinning; a sketch appears at the end of this answer)
On a NUMA machine (for example a 2 Socket Modern Xeon), you may have to take extra steps to allocate memory on the NUMA node that a thread will live on. To do this, you need to PInvoke VirtualAllocExNuma
Speaking of memory allocations, using Large Pages should provide yet another boost
While .NET isn't the easiest framework to use for this type of testing, it IS possible to coax it into doing what you want.
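As a hedged illustration of the affinity and synchronized-start suggestions above (the SetThreadAffinityMask P/Invoke and the runMeasurement delegate are my own additions for the sketch, not part of the original test code):

using System;
using System.Runtime.InteropServices;
using System.Threading;

static class PinnedStart
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr mask);

    static int _go;                                       // 0 = wait, 1 = start measuring

    public static void Run(int cores, Action<int> runMeasurement)
    {
        var ready = new CountdownEvent(cores);
        var threads = new Thread[cores];
        for (int core = 0; core < cores; core++)
        {
            int id = core;
            threads[id] = new Thread(() =>
            {
                SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << id)); // pin to one core
                ready.Signal();
                while (Volatile.Read(ref _go) == 0) { }   // spin until all threads are released
                runMeasurement(id);
            });
            threads[id].Start();
        }
        ready.Wait();                                     // all threads pinned and spinning
        Interlocked.Exchange(ref _go, 1);                 // release them all at the same time
        foreach (var t in threads) t.Join();
    }
}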
Reported RAM results (128 MB) for my bus8thread64.exe benchmark on an i7-3820, with a maximum memory bandwidth of 51.2 GB/s, vary from 15.6 GB/s with 1 thread and 28.1 GB/s with 2 threads to 38.7 GB/s at 8 threads. The code is:
void inc1word(IDEF data1[], IDEF ands[], int n)
{
int i, j;
for(j=0; j<passes1; j++)
{
for (i=0; i<wordsToTest; i=i+64)
{
ands[n] = ands[n] & data1[i ] & data1[i+1 ] & data1[i+2 ] & data1[i+3 ]
& data1[i+4 ] & data1[i+5 ] & data1[i+6 ] & data1[i+7 ]
& data1[i+8 ] & data1[i+9 ] & data1[i+10] & data1[i+11]
& data1[i+12] & data1[i+13] & data1[i+14] & data1[i+15]
& data1[i+16] & data1[i+17] & data1[i+18] & data1[i+19]
& data1[i+20] & data1[i+21] & data1[i+22] & data1[i+23]
& data1[i+24] & data1[i+25] & data1[i+26] & data1[i+27]
& data1[i+28] & data1[i+29] & data1[i+30] & data1[i+31]
& data1[i+32] & data1[i+33] & data1[i+34] & data1[i+35]
& data1[i+36] & data1[i+37] & data1[i+38] & data1[i+39]
& data1[i+40] & data1[i+41] & data1[i+42] & data1[i+43]
& data1[i+44] & data1[i+45] & data1[i+46] & data1[i+47]
& data1[i+48] & data1[i+49] & data1[i+50] & data1[i+51]
& data1[i+52] & data1[i+53] & data1[i+54] & data1[i+55]
& data1[i+56] & data1[i+57] & data1[i+58] & data1[i+59]
& data1[i+60] & data1[i+61] & data1[i+62] & data1[i+63];
}
}
}
This also measures burst reading speeds, where max DTR, based on this, is 46.9 GB/s. Benchmark and source code are in:
http://www.roylongbottom.org.uk/quadcore.zip
Results with interesting speeds using the L3 caches are in:
http://www.roylongbottom.org.uk/busspd2k%20results.htm#anchor8Thread
C/C++ would give a more accurate metric of memory performance, as .NET can sometimes do weird things with memory handling and won't give you an accurate picture, since it doesn't use compiler intrinsics or SIMD instructions.
There's no guarantee that the CLR is going to give you anything capable of truly benchmarking your RAM. I'm sure there's probably software already written to do this. Ah, yes, PassMark makes something: http://www.bandwidthtest.net/memory_bandwidth.htm
That's probably your best bet as making benchmarking software is pretty much all they do.
Also, nice processor btw, I have the same one in one of my machines ;)
UPDATE (2/20/2014):
I remember seeing some code in the XNA Framework that did some heavy duty optimizations in C# that may give you exactly what you want. Have you tried using "unsafe" code and pointers?
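For example, a minimal hedged sketch of the unsafe/pointer idea applied to the question's loop (it reuses the _longArray, Config, and _result names from the question and assumes the project allows unsafe code):

// Sketch only: read each element through a fixed pointer so no array bounds checks are emitted.
unsafe
{
    long local = 0;
    fixed (long* p = _longArray[threadId])
    {
        for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step)
            local = p[i];          // raw 8-byte read
    }
    _result = local;               // publish the value so the loop isn't optimized away
}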
Related
I am trying to gather performance information in C#. For RAM, CPU and other PC components I am using PerformanceCounters, but for the GPU there is not really one that gives me information like
% current GPU utilization
and
% current VRam utilization
For NVIDIA there is (in some cases) a performance counter category "NVIDIA GPU" that has all the important counters, but in other cases (e.g. an MX250) or for AMD GPUs there is no such category.
I have tried using the "GPU-Engine" performance counter category, but I don't know how to interpret the data gathered with NextSample() (the NextValue() of the "Utilization Percentage" counter is just always 0, even if my GPU is at 15% or something).
The information given here helped me understand why I need NextSample(), but didn't give information about the correct way to calculate with the RawValue.
I have tried using this:
var sample = gpuCounter.NextSample();
Task.Delay(2000).Wait();
var sample2 = gpuCounter.NextSample();

// raw/raw2 were referenced below; define them from RawValue so the snippet compiles
var raw = sample.RawValue;
var raw2 = sample2.RawValue;
var ticks = ((sample2.TimeStamp - sample.TimeStamp) / sample.CounterFrequency) * 100000;
double calc = Math.Abs(raw2 - raw) / ticks;
Is there a way to gather live GPU performance information for AMD and Intel in C#?
Or does anybody know how to calculate the correct utilization value from the RawValue of NextSample() for the "Utilization Percentage" counter in the "GPU-Engine" category?
Any help welcome.
Thanks in advance.
Update:
This calculation returns the same value as NextValue() for the "Utilization Percentage" counter, but the value of NextValue() is way too low (if it is not 0).
My idea is that an instance of the "Utilization Percentage" counter only covers a single process; this would also be the reason for a lot of the 0s.
Can anybody confirm this?
Update2:
If I add up the NextValue() results of all counter instances (for example, all of the 3D-engine instances) I get a value pretty close to the correct one, but you need to call NextValue() twice, in an interval of more than 1 s, and take the second value, or the results are useless (from my experience). The problem: this leads to delayed results, so it is hard to say whether it works. (A sketch of this summing approach is shown below.)
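Here is a hedged sketch of that summing approach (it assumes the category is named "GPU Engine" and that 3D-engine instance names end with "engtype_3D"; both assumptions should be verified on the target machine):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Sum "Utilization Percentage" over all 3D-engine instances of the "GPU Engine" category.
var category = new PerformanceCounterCategory("GPU Engine");
var counters = category.GetInstanceNames()
    .Where(name => name.EndsWith("engtype_3D"))
    .SelectMany(name => category.GetCounters(name))
    .Where(c => c.CounterName == "Utilization Percentage")
    .ToList();

foreach (var c in counters) c.NextValue();   // the first call only primes each counter
Thread.Sleep(1000);                          // sample over at least ~1 s, as noted above
float totalUtilization = counters.Sum(c => c.NextValue());
Console.WriteLine($"GPU 3D utilization: {totalUtilization:0.0} %");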
I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:
void ProcessFrame(double[,] frame)
{
    int width = frame.GetLength(1);
    int height = frame.GetLength(0);
    byte[,] result = new byte[height, width];

    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm (ManipulatePixel()) is, the application can't keep up with the camera's frame rate any more. However, I have noticed that despite me using parallel for loops, the application simply won't use all of the CPU that is available; the task manager performance tab shows about 60-80% CPU usage.
I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the Parallel Patterns Library. The C++ code uses all of the CPU it can get, as I would expect. I also tried P/Invoking a C++ DLL from my C# code, doing the same algorithm that runs slowly in the C# library; it also uses all the CPU power available, CPU usage is right at 100% virtually the whole time, and there is no trouble at all keeping up with the camera.
Outsourcing the code into a C++ DLL and then marshalling it back into C# is an extra hassle I'd of course rather avoid. How do I make my C# code actually make use of all the CPU potential? I have tried increasing process priority like this:
using (Process process = Process.GetCurrentProcess())
process.PriorityClass = ProcessPriorityClass.RealTime;
Which has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
and then passing that to the Parallel.For() loop, this had no effect at all but I suppose that's not surprising since the default settings should already be optimized. I also tried setting this in the application configuration:
<runtime>
<Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
<GCCpuGroup enabled="true"></GCCpuGroup>
<gcServer enabled="true"></gcServer>
</runtime>
but this actually makes it run even slower.
EDIT:
The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:
void ProcessFrame(double[,] frame)
{
    byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];

    Parallel.For(0, frame.GetLength(0), row =>
    {
        for (var col = 0; col < frame.GetLength(1); ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Sorry for this, I was paraphrasing code at the time and didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally wrote (i.e. the width and height variables set at the beginning of the function, and the array's length properties only queried once each instead of in the for loops' conditional statements). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster; I didn't realize this because C++ doesn't know 2D arrays, so I had to pass the pixel dimensions as separate arguments anyway. Whether it is fast enough as it is, I cannot say yet, however.
Also thank you for the other answers, they contain a lot of things I don't know yet but it's great to know what I have to look for. I'll update once I reached a satisfactory result.
I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your posting is devoid of details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame), but here are some general tips that apply in your case.
2D arrays in .NET are significantly slower than 1D arrays and jagged arrays, even in .NET Core today; this is a longstanding bug.
See here:
https://github.com/dotnet/coreclr/issues/4059
Why are multi-dimensional arrays in .NET slower than normal arrays?
Multi-dimensional array vs. One-dimensional
So consider changing your code to use either a jagged array (which also helps with memory locality/proximity caching, as each thread would have its own private buffer) or a 1D array with your own code being responsible for bounds-checking.
Or better-yet: use stackalloc to manage the buffer's lifetime and pass that by-pointer (unsafe ahoy!) to your thread delegate.
Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.
Avoid allocating a new buffer for each frame encountered - if a frame has a limited lifespan then consider using reusable buffers using a buffer-pool.
Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot, but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need .NET Core 2.0 or later for the best accelerated functionality); a small sketch follows these tips.
Also, avoid copying individual bytes or scalar values inside a for loop in C#, instead consider using Buffer.BlockCopy for bulk copy operations (as these can use hardware memory copy features).
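As a small illustration of the SIMD-enabled types, here is a hedged sketch using System.Numerics.Vector<T> that clamps raw values to a [min, max] range in bulk; it is only an example of the API, not the poster's actual ManipulatePixel logic:

using System.Numerics;

// Clamp each element of 'data' to [min, max], processing Vector<ushort>.Count elements
// per iteration; the scalar tail at the end handles whatever doesn't fill a full vector.
static void ClampSimd(ushort[] data, ushort min, ushort max)
{
    var vMin = new Vector<ushort>(min);
    var vMax = new Vector<ushort>(max);
    int i = 0;
    int lastBlock = data.Length - data.Length % Vector<ushort>.Count;
    for (; i < lastBlock; i += Vector<ushort>.Count)
    {
        var v = new Vector<ushort>(data, i);
        Vector.Min(Vector.Max(v, vMin), vMax).CopyTo(data, i);
    }
    for (; i < data.Length; i++)          // scalar tail
    {
        if (data[i] < min) data[i] = min;
        else if (data[i] > max) data[i] = max;
    }
}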
Regarding your observation of "80% CPU usage": if you have a busy loop in a program, it will cause 100% CPU usage within the time slices provided by the operating system. If you don't see 100% usage, then either:
Your code is actually running faster than real-time (this is a good thing!), unless you're certain your program can't keep up with the input.
Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
Ensure you aren't using any lock (Monitor) calls or other thread or memory synchronization primitives.
Efficiency matters (this is not a true-[PARALLEL] problem, but it may, yet need not, benefit from "just"-[CONCURRENT] work).
The BEST, yet rather hard, way, if ultimate performance is a MUST:
In-line assembly, optimised per the cache-line sizes in the CPU hierarchy, and keep indexing that follows the actual memory layout of the 2D data { column-wise | row-wise }. Given that no 2D-kernel transformation is mentioned, your process does not need to "touch" any topological neighbours, so the indexing can step in whatever order "across" both ranges of the 2D domain, and ManipulatePixel() may get more efficient by transforming blocks of pixels instead of bearing all the overhead of a call for each isolated, atomised 1 px (ILP + cache efficiency are on your side).
Given your target production-platform CPU family, best use (block-SIMD) vectorised instructions available from AVX2, or better AVX-512, code. As you most probably know, you may use C/C++ with AVX intrinsics for performance optimisation, inspect the resulting assembly, and finally "copy" the best resulting assembly into your C# via assembly inlining. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet they may help in almost hard-real-time production settings (though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour).
A CHEAP, few-seconds step :
Test and benchmark the run-time per batch of frames of the reversed composition: move the more "expensive" part, the Parallel.For(...{...}), inside the for (var col = 0; col < width; ++col) {...} loop, to see the change in the costs of instantiating the Parallel.For() instrumentation.
Next, if going this cheap way, think about refactoring ManipulatePixel() to operate on at least a block of data, aligned with the data-storage layout and a multiple of the cache-line length (cache hits cost ~0.5-5 [ns] per memory access, versus ~100-380 [ns] otherwise). Here, a will to distribute the work (the worse per 1 px) across all NUMA CPU cores results in paying far more time, due to the extended access latencies of cross-NUMA (non-local) memory addresses. Besides never re-using an expensively cached block of fetched data, you knowingly pay excessive costs for cross-NUMA (non-local) memory fetches, from which you "use" just 1 px and "throw away" all the rest of the cached block (as those pixels will get re-fetched and manipulated on some other CPU core at some other time; a triple waste of time. Sorry to have mentioned that explicitly, but when shaving off every possible [ns] this cannot happen in a production pipeline).
Anyway, let me wish you perseverance and good luck on your steps forwards to gain the needed efficiency back onto your side.
Here's what I ended up doing, mostly based on Dai's answer:
made sure to query the image pixel dimensions once at the beginning of the processing functions, not within the for loop's conditional statement. With parallel loops, it would seem this creates contention on those properties from multiple threads, which noticeably slows things down.
removed allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates one buffer for each image processing step (filtering, scaling, colorizing) only, which doesn't change in size but gets overwritten with each frame.
removed an extra data processing step where raw image data in the format ushort (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
I also tried, without success, to use 1D arrays instead of 2D, but there is actually no difference in performance. I don't know if it's because the bug Dai mentioned was fixed in the meantime, but I couldn't confirm 2D arrays being any slower than 1D arrays. (A way to benchmark this directly is sketched below.)
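If anyone wants to check this on their own hardware, a hedged sketch using the BenchmarkDotNet package (my own suggestion, not something used in this thread) could look like this:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ArrayAccessBench
{
    private const int Height = 480, Width = 640;          // arbitrary frame size for the test
    private readonly ushort[,] _twoD = new ushort[Height, Width];
    private readonly ushort[] _oneD = new ushort[Height * Width];

    [Benchmark]
    public long Sum2D()
    {
        long sum = 0;
        for (int row = 0; row < Height; row++)
            for (int col = 0; col < Width; col++)
                sum += _twoD[row, col];                    // rectangular-array access
        return sum;
    }

    [Benchmark]
    public long Sum1D()
    {
        long sum = 0;
        for (int row = 0; row < Height; row++)
            for (int col = 0; col < Width; col++)
                sum += _oneD[row * Width + col];           // manual row-major indexing
        return sum;
    }
}

public static class BenchProgram
{
    public static void Main() => BenchmarkRunner.Run<ArrayAccessBench>();
}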
Probably also worth mentioning, the ManipulatePixel() function in my original post was actually more of a placeholder rather than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:
private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
    Debug.Assert(originalImg != null);
    Debug.Assert(originalImg.Length != 0);
    Debug.Assert(scaledImg != null);
    Debug.Assert(scaledImg.Length == originalImg.Length);

    ushort min = limits.Item1;
    ushort max = limits.Item2;
    int width = originalImg.GetLength(1);
    int height = originalImg.GetLength(0);

    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
        {
            ushort value = originalImg[row, col];
            if (value < min)
                scaledImg[row, col] = 0;
            else if (value > max)
                scaledImg[row, col] = 255;
            else
                scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
        }
    });
}
This is just one step and some others are much more complex but the approach would be similar.
Some of the things mentioned like SIMD/AVX or the answer of user3666197 unfortunately are well beyond my abilities right now so I couldn't test that out.
It's still relatively easy to put enough processing load into the stream to tank the frame rate but for my application the performance should be enough now. Thanks to everyone who provided input, I'll mark Dai's answer as accepted because I found it the most helpful.
I've created a simple test application which allocates 100 MB using byte arrays.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

namespace VirtualMemoryUsage
{
    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            sb.AppendLine($"IsServerGC = {System.Runtime.GCSettings.IsServerGC.ToString()}");

            const int tenMegabyte = 1024 * 1024 * 10;
            long allocatedMemory = 0;
            List<byte[]> memory = new List<byte[]>();
            for (int i = 0; i < 10; i++)
            {
                // alloc 10 mb memory
                memory.Add(new byte[tenMegabyte]);
                allocatedMemory += tenMegabyte;
            }

            sb.AppendLine($"Allocated memory: {PrettifyByte(allocatedMemory)}");
            sb.AppendLine($"VirtualMemorySize64: {PrettifyByte(Process.GetCurrentProcess().VirtualMemorySize64)}");
            sb.AppendLine($"PrivateMemorySize64: {PrettifyByte(Process.GetCurrentProcess().PrivateMemorySize64)}");
            sb.AppendLine();
            Console.WriteLine(sb.ToString());
            Console.ReadLine();
        }

        private static string PrettifyByte(long allocatedMemory)
        {
            string[] sizes = { "B", "KB", "MB", "GB", "TB" };
            double value = allocatedMemory;              // use double so the "0.##" format can show decimals
            int order = 0;
            while (value >= 1024 && order < sizes.Length - 1)
            {
                order++;
                value /= 1024;
            }
            return $"{value:0.##} {sizes[order]}";
        }
    }
}
Note: for this test it is important to set gcServer to true in the app.config:
<runtime>
<gcServer enabled="true"/>
</runtime>
This then will show the amount of PrivateMemorySize64 and VirtualMemorySize64 allocated by the process.
While PrivateMemorySize64 remains similar on different computers, VirtualMemorySize64 varies quite a bit.
What is the reason for these differences in VirtualMemorySize64 when the same amount of memory is allocated? Is there any documentation about this?
Wow, you're lucky. On my machine, the last line says 17 GB!
Allocated memory: 100M
VirtualMemorySize64: 17679M
PrivateMemorySize64: 302M
While PrivateMemorySize64 remains similar on different computers [...]
Private bytes are the bytes that belong to your program only. They can hardly be influenced by anything else. They contain what is on your heap and is inaccessible to other processes.
Why is that 302 MB and not just 100 MB? SysInternals VMMap is a good tool to break down that value:
The colors and sizes of private bytes say:
violet (7.5 MB): image files, i.e. DLLs that are not shareable
orange (11.2 MB): heap (non-.NET)
green (103 MB): managed heap
orange (464 kB): stack
yellow (161 MB): private data, e.g. TEB and PEB
brown (36 MB): page table
As you can see, .NET has just 3 MB overhead in the managed heap. The rest is other stuff that needs to be done for any process.
A debugger or a profiler can help in breaking down the managed heap:
0:013> .loadby sos clr
0:013> !dumpheap -stat
[...]
000007fedac16878 258 11370 System.String
000007fed9bafb38 243 11664 System.Diagnostics.ThreadInfo
000007fedac16ef0 34 38928 System.Object[]
000007fed9bac9c0 510 138720 System.Diagnostics.NtProcessInfoHelper+SystemProcessInformation
000007fedabcfa28 1 191712 System.Int64[]
0000000000a46290 305 736732 Free
000007fedac1bb20 13 104858425 System.Byte[]
Total 1679 objects
So you can see there are some strings and other objects that .NET needs "by default".
What is the reason for this differences in VirtualMemorySize64 when the same amount of memory is allocated?
0:013> !address -summary
[...]
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 58 7fb`adfae000 ( 7.983 TB) 99.79%
MEM_RESERVE 61 4`3d0fc000 ( 16.954 GB) 98.11% 0.21%
MEM_COMMIT 322 0`14f46000 ( 335.273 MB) 1.89% 0.00%
Only 335 MB are committed. That's memory that can actually be used. The 16.954 GB are just reserved. They cannot be used at the moment. They are neither in RAM nor on disk in the page file. Allocating reserved memory is super fast. I've seen that 17 GB value very often, especially in ASP.NET crash dumps.
Looking at details in VMMap again
we can see that the 17 GB are allocated as just one block. A comment on your question said: "When the system runs out of memory, the garbage collector fires and releases the busy one." However, to release a VirtualAlloc()'d block via VirtualFree(), that block must be logically empty, i.e. there must not be a single .NET object inside it - and that's unlikely. So it will stay there forever.
What are the possible benefits? It's a single contiguous block of memory. If you needed a new byte[4G] now, it would just work.
Finally, the likely reason is: it's done because it doesn't hurt, neither RAM nor disk. And when needed, it can be committed at a later point in time.
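To illustrate the reserve/commit distinction, here is a small hedged P/Invoke sketch of my own (using the documented Win32 VirtualAlloc flags; not part of the original answer):

using System;
using System.Runtime.InteropServices;

static class ReserveVsCommit
{
    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000;
    const uint PAGE_NOACCESS = 0x01, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, uint flAllocationType, uint flProtect);

    static void Main()
    {
        // Reserve 1 GB of address space: fast, costs neither RAM nor page file.
        IntPtr reserved = VirtualAlloc(IntPtr.Zero, (UIntPtr)(1UL << 30), MEM_RESERVE, PAGE_NOACCESS);

        // Commit only the first 64 KB when it is actually needed; only this part shows up as MEM_COMMIT.
        IntPtr committed = VirtualAlloc(reserved, (UIntPtr)(64u * 1024u), MEM_COMMIT, PAGE_READWRITE);

        Console.WriteLine($"Reserved at {reserved}, committed at {committed}");
    }
}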
Is there any documentation about this?
That's unlikely. The GC implementation in detail could change with the next version of .NET. I think Microsoft does not document that, otherwise people would complain if the behavior changed.
There are people who have written blog posts like this one telling us that some values might depend on the number of processors, for example.
0:013> !eeheap -gc
Number of GC Heaps: 4
[...]
What we see here is that .NET creates as many heaps as there are processors. That's good for garbage collection, since every processor can collect one heap independently.
The metrics you are using are not allocated memory, but memory used by the process: one is private, the other is shared with other processes on your machine. The real amount of memory used by the process varies depending on both the amount of available memory and the other processes running.
Edit: the answer by Thomas Weller provides much more detail on that subject than my Microsoft links.
It does not necessarily represent the amount of allocations performed by your application. If you want to get an estimate of the allocated memory (not including the .NET Framework libraries, memory pagination overhead, etc.) you can use
long memory = GC.GetTotalMemory(true);
where the true parameter tells the GC to perform a garbage collection first (it doesn't have to). Unused but not yet collected memory is accounted for in the values you asked about; if the system has enough memory, it might not be collected until it's needed. Here you can find additional information on how the GC works.
I am currently optimizing data processing logic for parallel execution.
I have noticed that, as the core count increases, data processing performance does not necessarily increase the way I suppose it should.
Here is the test code:
Console.WriteLine($"{DateTime.Now}: Data processing start");
double lastElapsedMs = 0;
for (int i = 1; i <= Environment.ProcessorCount; i++)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    ProccessData(i); // main processing method
    watch.Stop();
    double elapsedMs = watch.ElapsedMilliseconds;
    Console.WriteLine($"{DateTime.Now}: Core count: {i}, Elapsed: {elapsedMs}ms");
    lastElapsedMs = elapsedMs;
}
Console.WriteLine($"{DateTime.Now}: Data processing end");
and
public static void ProccessData(int coreCount)
{
    // First part is data preparation:
    // splitting 1 collection into smaller chunks, depending on core count.
    ////////////////
    // combinations = collection of data
    var length = combinations.Length;
    int chuncSize = length / coreCount;
    int[][][] chunked = new int[coreCount][][];
    for (int i = 0; i < coreCount; i++)
    {
        int skip = i * chuncSize;
        int take = chuncSize;
        int diff = (length - skip) - take;
        if (diff < chuncSize)
        {
            take = take + diff;
        }
        var sub = combinations.Skip(skip).Take(take).ToArray();
        chunked[i] = sub.ToArray();
    }

    // Second part is iteration: 1 chunk of data processed per core.
    ////////////////
    Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
    {
        var chunk = chunked[chunkIndex];
        int chunkLength = chunk.Length;

        // iterate data inside chunk
        for (int idx = 0; idx < chunkLength; idx++)
        {
            // additional processing logic here for single data
        }
    });
}
The results are as follows:
As you can see from the result set, by using 2 cores instead of 1 you get an almost ideal increase in performance (given that 1 core runs at 4700 MHz while 2 cores run at 4600 MHz each).
After that, when the data was supposed to be processed in parallel on 3 cores, I was expecting to see a 33% performance increase compared to the 2-core execution. The actual increase is 21.62%.
Next, as the core count increases, the degradation of "parallel" execution performance continues to grow.
In the end, with the 12-core results, the difference between the actual and ideal results is more than twice as big (96442 ms vs 39610 ms)!
I have certainly not expected the difference to be this big. I have an Intel 8700K processor: 6 physical cores and 6 logical, 12 threads total. 1 core runs at 4700 MHz in turbo mode, 2C at 4600, 3C at 4500, 4C at 4400, 5-6C at 4400, 6C at 4300.
If it matters - I have done additional observations in Core-temp:
when 1 core processing was running - 1 out of 6 cores was busy 50%
when 2 core processing was running - 2 out of 6 cores were busy 50%
when 3 core processing was running - 3 out of 6 cores were busy 50%
when 4 core processing was running - 4 out of 6 cores were busy 50%
when 5 core processing was running - 5 out of 6 cores were busy 50%
when 6 core processing was running - 6 out of 6 cores were busy 50%
when 7 core processing was running - 5 out of 6 cores were busy 50%, 1 core 100%
when 8 core processing was running - 4 out of 6 cores were busy 50%, 2 cores 100%
when 9 core processing was running - 3 out of 6 cores were busy 50%, 3 cores 100%
when 10 core processing was running - 2 out of 6 cores were busy 50%, 4 cores 100%
when 11 core processing was running - 1 out of 6 cores were busy 50%, 5 cores 100%
when 12 core processing was running - all 6 cores at 100%
I can certainly see that the end result should not be as performant as the ideal result, because the frequency per core decreases, but still: is there a good explanation why my code performs so badly at 12 cores? Is this a generalized situation on every machine, or perhaps a limitation of my PC?
.NET Core 2 was used for the tests.
Edit: Sorry, I forgot to mention that the data chunking can be optimized, since I did it as a draft solution. Nevertheless, the splitting is done within about 1 second, so it adds at most 1000-2000 ms to the resulting execution time.
Edit 2: I have just gotten rid of all the chunking logic and removed the MaxDegreeOfParallelism property. The data is processed as-is, in parallel. The execution time is now 94196 ms, which is basically the same as before, excluding the chunking time. It seems .NET is smart enough to chunk the data at runtime, so additional code is not necessary unless I want to limit the number of cores used. This did not notably increase the performance, however. I am leaning towards the "Amdahl's law" explanation, since nothing I have done has increased the performance beyond the error margin.
Yeah, Amdahl's law. Performance speedup is never linear in the number of cores thrown at the problem.
Also reciprocals...
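For reference, a minimal sketch of Amdahl's law; the 10% serial fraction below is just an illustrative assumption, not a measurement of the poster's workload:

// Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the serial fraction of the work.
static double AmdahlSpeedup(double serialFraction, int cores)
    => 1.0 / (serialFraction + (1.0 - serialFraction) / cores);

// Example: with 10% serial work, 12 cores give only ~5.7x, not 12x.
// AmdahlSpeedup(0.10, 12) ≈ 5.71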
Your chunking code is run once no matter how many processors you have, yet it still is dependent on the number of processors.
Especially the Skip/Take part followed by the double ToArray() call seems very much in need of optimization. See How to copy part of an array to another array in C#? on how to copy an array without traversing the whole thing multiple times.
That should do a lot to bring your performance closer to what you'd expect. That said, the work of branching out and combining the results will always degrade the performance of parallel execution. "Maximum parallelism" is not exactly something to strive for. There is always a sweet spot where the parallelization outweighs the hit from preparing for it. You need to find that. Or let .NET take care of it by leaving out the manual override for cores.
As nvoigt pointed out, the chunking code is run on a single core and is slow. Look at these two lines:
var sub = combinations.Skip(skip).Take(take).ToArray();
chunked[i] = sub.ToArray();
Skip/Take inside a loop is a Schlemiel the Painter performance issue. Use a different method.
sub is already a perfectly good array; why make another copy on the next line? Array allocation isn't zero-cost.
I think ArraySegment is a good fit for this problem instead of making array copies. At the least, you can ToArray an ArraySegment more efficiently than what you're currently doing. A sketch of both options follows.
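A hedged sketch of those suggestions, reusing the names from the posted code and assuming combinations is an int[][]:

// Chunk with a single Array.Copy per chunk instead of Skip/Take plus two ToArray calls.
int length = combinations.Length;
int chunkSize = length / coreCount;
var chunked = new int[coreCount][][];
for (int i = 0; i < coreCount; i++)
{
    int start = i * chunkSize;
    int count = (i == coreCount - 1) ? length - start : chunkSize;   // last chunk takes the remainder
    chunked[i] = new int[count][];
    Array.Copy(combinations, start, chunked[i], 0, count);           // one O(count) copy, no LINQ re-traversal

    // Or, to avoid copying at all, wrap the range instead:
    // var segment = new ArraySegment<int[]>(combinations, start, count);
}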
Yesterday I was solving an exam problem when I found something very interesting (at least for me). The program computes factorials (very big ones) and the result is how many zeroes there are at the end of the number (in some cases 2500 zeros...). So I did what I could, but found that when I enter a number like 100 000 it takes about 1:30-1:33 min to output the result. I thought it was because of my CPU (it is not very fast). I've sent the .exe to some of my friends to try, because they have very good PCs performance-wise: exactly the same result (1:33 min).
My question is: why is the time to solve the task the same? I know there are better ways to write my code so it wouldn't take so long, but this is very important for me to understand as a beginner programmer.
So here is my code:
static void Main()
{
    int num = int.Parse(Console.ReadLine()),
        zeroCounter = 0;
    BigInteger fact = 1;

    var startTime = DateTime.Now;
    Console.WriteLine();

    for (int i = 1; i <= num; i++)
    {
        fact *= i;
        Console.Write("\r{0}", DateTime.Now - startTime);
    }

    BigInteger factTarget = fact;

    while (factTarget % 10 == 0)
    {
        factTarget /= 10;
        zeroCounter++;
        Console.Write("\r{0}", DateTime.Now - startTime);
    }

    Console.WriteLine();
    Console.WriteLine("Result is number with {0} zeros.", zeroCounter);
    Console.WriteLine();
    Console.WriteLine("Finished for: {0}", DateTime.Now - startTime);
    Console.WriteLine();
    Console.WriteLine("\nPress any key to exit...");
    Console.ReadKey();
}
I am very sorry if this is the wrong place to ask; I did my best to find what I was looking for before posting this.
The thing that I notice immediately about your code is that you have included console output calls (Console.Write) in your computational loops.
The fact is, I/O is much slower for a computer to handle than computations, even under ideal conditions. And I wouldn't say that the Windows console window is a particularly efficient implementation of that particular kind of I/O. Furthermore, I/O tends to be less dependent on CPU and memory differences from machine to machine.
In other words, it seems very likely to me that you are primarily measuring I/O throughput and not computational throughput, and so it's not surprising to see consistent results between machines.
For what it's worth, when I run your example on my laptop, if I disable the output I can complete the computation in about a minute. I get something closer to your 1:30 time if I use the code as-is.
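For example, one simple way to see the effect on your own machine is to throttle the progress output inside the loop (the interval of 1000 is an arbitrary choice, reusing the variables from the question's code):

// Only update the console once every 1000 iterations instead of on every multiplication.
for (int i = 1; i <= num; i++)
{
    fact *= i;
    if (i % 1000 == 0)
        Console.Write("\r{0}", DateTime.Now - startTime);
}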
EDIT:
I recommend the answer from Hans Passant as well. Memory I/O is still I/O and is, as I describe above, much less variable from machine to machine than CPU speed. It's my hope that the above general-purpose description gives ideas for where the difference could be (without access to each of the machines in question, there's not really any way to know for sure what is the cause), but Hans's answer provides some very good detail about the memory I/O issue in particular and is very much worth reading.
now the time is 00:01:23.5856140
The speed of this program is determined by the bandwidth of the RAM in your machine. It is a design constant and unrelated to the speed of the processor. RAM plays a role here because of the very large number of digits in the factorial; they don't fit in the CPU caches anymore. And the memory access pattern for a BigInteger multiplication is very unfriendly: all digits are required to multiply a number.
Your program takes 57 seconds on my laptop, I know it has PC3-12800 RAM. Which has a peak transfer rate of 12800 MB/sec, give or take the CAS latency (I don't know mine). So we can calculate the RAM speed on your and your friend's machine:
1:23 = 83 sec, 57/83 x 12800 = 8790 MB/sec.
Which is a pretty close match for PC3-8500. A run-of-the-mill RAM speed very common in white-box machines, the kind you'd get from a vendor like Dell. Your friend's fast PC is a bit of a toaster, break it to him gently :)
Fwiw, why the highly upvoted post doesn't have much of an effect on the speed deserves an explanation as well. The console window that your program uses is owned by another process, Conhost.exe; you can see it in the Processes tab of Task Manager. It takes care of scrolling and painting the window; under the hood, your program uses process interop to tell it to update the window.
This happens while your program is running, on another thread, so your program is only bogged down when it firehoses Conhost.exe, sending updates faster than it can handle. So at the start of your program, while the multiplications are still fast, you will get bogged down, but not once the number of digits starts to grow large and your multiplications start to get slow. Overall, the slowdown is not that great.
What happens is that the processor core in the PC has fixed-size internal buses that carry the data. RAM is roughly 10-1000 times slower than the processor. There is also cache memory, but the cache is very small. So however much RAM you have in your PC, it will still be slow and take time, because when the numbers get large, they take time to read from and write to memory.
Plus, writing to the screen each time eats up some time.