I am rewriting a high-performance C++ application in C#. The C# app is noticeably slower than the C++ original. Profiling tells me that the C# app spends most of its time accessing array elements, so I created a simple array-access benchmark. I get completely different results than others doing a similar comparison.
The C++ code:
#include <climits>
#include <stdio.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int main(void)
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
int xRepLen = 100 * 1000;
int xRepCount = 1000;
unsigned short * xArray = new unsigned short[xRepLen];
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
xArray[xIdx] = xIdx % USHRT_MAX;
int * xResults = new int[xRepLen];
for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
{
// in each repetition, find the first value, that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
{
unsigned short xValToBreach = (xArray[xIdx] + 25) % USHRT_MAX;
xResults[xIdx] = 0;
for (int xIdx2 = xIdx + 1; xIdx2 < xRepLen; xIdx2++)
if (xArray[xIdx2] >= xValToBreach)
{
xResults[xIdx] = xIdx2; break;
}
if (xResults[xIdx] == 0)
xResults[xIdx] = INT_MAX;
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "Elasped miliseconds " << duration;
getchar();
}
The C# code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace arrayBenchmarkCs
{
class Program
{
public static void benchCs()
{
unsafe
{
int xRepLen = 100 * 1000;
int xRepCount = 1000;
ushort[] xArr = new ushort[xRepLen];
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
xArr[xIdx] = (ushort)(xIdx % 0xffff);
int[] xResults = new int[xRepLen];
Stopwatch xSw = new Stopwatch(); xSw.Start();
fixed (ushort * xArrayStart = & xArr [0])
{
for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
{
// in each repetition, go find the first value, that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
ushort * xArrayEnd = xArrayStart + xRepLen;
for (ushort* xPtr = xArrayStart; xPtr != xArrayEnd; xPtr++)
{
ushort xValToBreach = (ushort)((*xPtr + 25) % 0xffff);
int xResult = -1;
for (ushort * xPtr2 = xPtr + 1; xPtr2 != xArrayEnd; xPtr2++)
if ( *xPtr2 >= xValToBreach)
{
xResult = (int)(xPtr2 - xArrayStart);
break;
}
if (xResult == -1)
xResult = int.MaxValue;
// save result
xResults[xPtr - xArrayStart] = xResult;
}
}
} // fixed
xSw.Stop();
Console.WriteLine("Elapsed miliseconds: " + (xSw.ElapsedMilliseconds.ToString("0"));
}
}
static void Main(string[] args)
{
benchCs();
Console.ReadKey();
}
}
}
On my work computer (i7-3770), the C++ version is approx 2x faster than the C# version. On my home computer (i7-5820K) the C++ is 1.5x faster than the C# version. Both are measured in Release. I hoped that by using pointers in C# I would avoid the array boundary checking and the performance would be the same in both languages.
So my questions are the following:
how come others find C# to be the same speed as C++?
how can I get C# performance to the C++ level if not via pointers?
what could be the driver of different speedups on different computers?
Any hint is much appreciated,
Daniel
You won't get this kind of hardcore number crunching to C++ speed. Using pointer arithmetic and unsafe code gets you some of the way there (it's almost half as slow again if you remove the unsafe and fixed parts). C# is JIT-compiled rather than compiled ahead of time to native code, and the code it runs is full of extra checks.
If you're willing to go unsafe then really there's nothing stopping you coding your C++ performance-critical stuff into a mixed-mode assembly, and calling that from your C# glue code.
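For illustration, here is a minimal sketch of that interop boundary using plain P/Invoke (a mixed-mode C++/CLI assembly is the other route). The DLL name NativeSearch.dll and the export RunSearches are hypothetical stand-ins for whatever you compile the C++ hot loop into; only the DllImport mechanism itself is real:
using System.Runtime.InteropServices;

static class NativeSearch
{
    // Hypothetical export: the C++ side would expose a matching
    // extern "C" function compiled into NativeSearch.dll.
    [DllImport("NativeSearch.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RunSearches(ushort[] xArray, int xRepLen, int[] xResults);
}

// Called from the C# glue code like:
// NativeSearch.RunSearches(xArr, xArr.Length, xResults);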
The C++ code is not doing the same work as the C#. The inner loops differ: the C++ version performs up to 4 memory operations on xResults[xIdx] per element, while the C# version accumulates into the local xResult and writes to memory just once.
I was shocked that the performance of the C# code depends so much on the framework version.
What's even more interesting, C# on .NET Core 3.1 outperformed C++ by 5%. With the other frameworks I checked, C# was 30-50% slower than C++.
Related
In high school I studied basic C/C++ (it was essentially C with cin and cout - those were the only C++ features, so I'd rather say I studied C in my high school years).
Now that I'm in college, I have to transition to C#.
I'm currently trying to rewrite this C program in C#.
#include <iostream>
using namespace std;

int main()
{
int array[100];
for (int i = 0; i < 5; i++)
{
cin >> array[i];
}
for (int i = 0; i < 5; i++)
{
cout << array[i] << " ";
}
}
Here is how I tried to write it using C#:
using System;
using System.Linq;
namespace ConsoleApp2
{
class Program
{
static void Main(string[] args)
{
int[] array = new int[100];
for (int i = 0; i < 3; i++)
{
array[i] = Convert.ToInt32(Console.ReadLine());
}
for (int i = 0; i < 3; i++)
{
Console.WriteLine(array[i]);
}
}
}
}
and it's kind of similar but it has a problem that I don't know how to fix.
In C, building that program, I could enter all the data on one line, separated by spaces - e.g. 1 2 3.
In C#, I can't, because I get a FormatException.
I kind of understand what's happening - in C#, Console.ReadLine() reads the whole line as a string, and the conversion to an integer then fails because of the spaces between the digits - "1 2 3".
But in C this works, because cin works differently, I guess. (I don't know how though, and at this point I don't care anymore; I probably won't be using C/C++ anyway.)
How can I replicate this behavior of the C program in my C# program, so I can enter the data the same way?
The main problem here is that you're using Console.ReadLine instead of Console.ReadKey. cin >> array[i] will attempt to read the next int from stdin (thanks @UnholySheep), not the next line, so the equivalent of it would be Console.ReadKey().KeyChar with manual type conversion.
Now, there are multiple ways of converting a char to an int in C#, but a plain cast isn't one of them, since it yields the character code instead (in C#, (int)'1' == 49). My preferred way is int.TryParse(char.ToString(), out int number), but some people use different methods.
Edit: This will only handle single-digit input correctly. If you need multi-digit input, see the end of the answer.
This means, your code turns into this:
var array = new int[5];
for (var i = 0; i < 5; i++)
{
var read = Console.ReadKey().KeyChar.ToString();
if (int.TryParse(read, out var number))
array[i] = number;
Console.Write(" ");
}
foreach (var element in array)
{
Console.Write("${element} ");
}
Note that I emulated the space between your inputs by writing a space after every Console.ReadKey call. I also decided to foreach over the array and to use string interpolation ($"{...}") to print each element with a space next to it.
If you want to enter the spaces manually, you can leave out the Console.Write(" ");, but then be aware that only every other element of the array will contain a number; the rest will stay 0 (the default value of int).
To correctly handle multi digit input, you can either:
Use Console.ReadLine and live with the fact that multiple inputs need to come on different lines
or
Use some botched together method to read the next int from the Console. Something like this:
public static int GetNextInt()
{
var totalRead = new List<char>();
while(true)
{
var read = Console.ReadKey().KeyChar;
if (!char.IsDigit(read))
{
if (totalRead.Count > 0)
break;
}
else
totalRead.Add(read);
}
if (totalRead.Count == 0)
return -1;
return int.Parse(new string(totalRead.ToArray()));
}
This will try its hardest to get the next int. Some test cases are:
+---------+-------------------+
| Input | Parsed int |
+---------+-------------------+
| "123\n" | 123 |
| "123a" | 123 |
| "a123a" | 123 |
| "a\n" | Continues parsing |
| "\n" | Continues parsing |
+---------+-------------------+
There is currently some redundant code, but you can remove one line or the other depending on what you want.
Do you want the current functionality?
Then you can safely remove if (totalRead.Count == 0) return -1;
Do you want the method to return -1 if the first entered character isn't a number?
Then you can safely remove if (totalRead.Count > 0) but keep the break.
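For reference, here is a sketch of how GetNextInt (as defined above) slots into the original loop:
var array = new int[5];
for (var i = 0; i < 5; i++)
    array[i] = GetNextInt(); // reads digits until a non-digit ends the number

foreach (var element in array)
    Console.Write($"{element} ");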
My goal with this wasn't to 1:1 copy the C++ functionality of std::cin >> ..., but to come close to it.
This does exactly what you want; what you need is the Console.Read() function:
public static void Main(string[] args)
{
int[] array = new int[100];
int index = 0;
for (int i = 0; i < 3; i++)
{
string word = string.Empty;
bool isNew = true, assigned = false;
while (isNew)
{
char read = (char)Console.Read();
if (!char.IsDigit(read))
{
if (isNew && assigned)
{
array[index++] = int.Parse(word);
isNew = false;
}
}
else
{
word += read;
assigned = true;
}
}
}
for (int i = 0; i < 3; i++)
{
Console.Write(array[i] + " ");
}
Console.WriteLine();
}
You can also take a look at :
https://nakov.com/blog/2011/11/23/cin-class-for-csharp-read-from-console-nakov-io-cin/
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, since many calculations in essence only need to be performed once and can then be reused.
However, I'm at the point where the lookups alone take 30% of the processing time. I'm wondering if it might be slow RAM. Still, I would like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with an index p (which is equivalent to width * y + x) to fetch the result:
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++, p++) {
newResults[p] = largeArray[lookUp[p]];
}
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it, due to word size).
Would an IntPtr be faster? How would I go about that?
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
static class P
{
static int Gather(int[] source, int[] index, int[] results, bool avx)
{ // normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return
int y = 0;
if (Avx2.IsSupported && avx)
{
var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
unsafe
{
fixed (int* sPtr = source)
{
// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)
{
rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
}
}
}
// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;
}
// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)
{
results[y] = source[index[y]];
}
return result;
}
static void Main()
{
// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);
// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);
int[] results = new int[1024];
void GatherLocal(bool avx)
{
// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
for (int i = 0; i < 32; i++)
{
Console.Write(results[i].ToString("x2"));
}
Console.WriteLine();
const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
Console.WriteLine();
}
GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);
}
}
RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So the code will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and it takes a long time to iterate over. It does not matter how fast the memory is; this is simply a lot of data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is - a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but 7 comes up in 6 out of 36 cases, while 2 and 12 each come up in only 1 out of 36. So having the 7 stored is a lot more beneficial than the 2 and 12.
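As an illustration of lazy initialisation, here is a minimal memoizing sketch. Compute is a hypothetical stand-in for the real per-index calculation, and the -1 sentinel assumes real values are non-negative; a heavily multithreaded caller would want Lazy<T> or Interlocked instead of a plain array:
using System;
using System.Linq;

class LazyLookup
{
    private readonly int[] cache;

    public LazyLookup(int size)
    {
        // -1 marks "not yet computed"
        cache = Enumerable.Repeat(-1, size).ToArray();
    }

    public int Get(int p)
    {
        int v = cache[p];
        if (v == -1)
            cache[p] = v = Compute(p); // computed on first request only
        return v;
    }

    private static int Compute(int p) => p * 31 % 1000; // hypothetical placeholder
}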
I am converting my C# library to C++. I have been using a C# Dictionary across the app, and when I tried std::map instead, with a string key in both scenarios, I saw a drastic difference in performance.
The C# Dictionary took 0.022717 seconds with the code below. The C++ map takes around 3 seconds.
C# Dictionary:
Stopwatch stopWatch = new Stopwatch();
Dictionary<string, int> dict = new Dictionary<string, int>();
stopWatch.Start();
for (int i = 0; i < 100000; i++)
{
dict.Add(i.ToString(), i);
}
stopWatch.Stop();
var op = stopWatch.Elapsed.TotalSeconds.ToString();
C++ map:
#include <iostream>
#include <map>
#include <string>
#include <chrono>
#include <ctime>
using namespace std;
int main()
{
std::map<std::string, int> objMap;
tm* timetr = new tm();
time_t t1 = time(NULL);
localtime_s(timetr, &t1);
for (size_t i = 0; i < 100000; i++)
{
objMap.emplace(std::to_string(i), i);
}
tm* timetr2 = new tm();
time_t t2 = time(NULL);
localtime_s(timetr2, &t2);
time_t tt = t2 - t1;
cout << tt;
string sss = "";
cin >> sss;
}
Why is there such a difference? What should be an equivalent alternative to achieve the same results?
Let me add my two cents here.
The C# Dictionary is a hash map, while C++'s std::map is a red-black tree, and a hash map performs better than a tree here. If you want a hash map in C++, use std::unordered_map.
I'm not 100% sure this is the reason, but you can find out by switching to std::unordered_map.
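A minimal sketch of the same benchmark with std::unordered_map (the reserve call is an optional extra that pre-sizes the buckets, roughly the way Dictionary's capacity constructor would):
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> objMap;
    objMap.reserve(100000); // avoid rehashing during the inserts

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100000; i++)
        objMap.emplace(std::to_string(i), i);
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}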
So I am looking at this question, and the general consensus is that the uint cast version is more efficient than a signed range check against 0. Since the code is also in MS's implementation of List, I assume it is a real optimization. However, I have failed to produce a code sample where the uint version performs better. I have tried different tests, and either I am missing something or some other part of my code is dwarfing the time of the checks. My last attempt looks like this:
class TestType
{
public TestType(int size)
{
MaxSize = size;
Random rand = new Random(100);
for (int i = 0; i < MaxIterations; i++)
{
indexes[i] = rand.Next(0, MaxSize);
}
}
public const int MaxIterations = 10000000;
private int MaxSize;
private int[] indexes = new int[MaxIterations];
public void Test()
{
var timer = new Stopwatch();
int inRange = 0;
int outOfRange = 0;
timer.Start();
for (int i = 0; i < MaxIterations; i++)
{
int x = indexes[i];
if (x < 0 || x > MaxSize)
{
throw new Exception();
}
inRange += indexes[x];
}
timer.Stop();
Console.WriteLine("Comparision 1: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");
inRange = 0;
outOfRange = 0;
timer.Reset();
timer.Start();
for (int i = 0; i < MaxIterations; i++)
{
int x = indexes[i];
if ((uint)x > (uint)MaxSize)
{
throw new Exception();
}
inRange += indexes[x];
}
timer.Stop();
Console.WriteLine("Comparision 2: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");
}
}
class Program
{
static void Main()
{
TestType t = new TestType(TestType.MaxIterations);
t.Test();
TestType t2 = new TestType(TestType.MaxIterations);
t2.Test();
TestType t3 = new TestType(TestType.MaxIterations);
t3.Test();
}
}
The code is a bit of a mess because I tried many things to make the uint check faster, like moving the compared variable into a class field and generating random index accesses, but in every case the result seems to be the same for both versions. So is this change applicable on modern x86 processors, and can someone demonstrate it somehow?
Note that I am not asking anyone to fix my sample or explain what is wrong with it. I just want to see a case where the optimization does work.
if (x < 0 || x > MaxSize)
The comparison is performed by the CMP processor instruction (Compare). You'll want to take a look at Agner Fog's instruction tables document (PDF); it lists the cost of instructions. Find your processor in the list, then locate the CMP instruction.
For mine, Haswell, CMP takes 1 cycle of latency and 0.25 cycles of throughput.
A fractional cost like that needs an explanation: Haswell has 4 integer execution units that can execute instructions at the same time. When a program contains enough integer operations, like CMP, without interdependencies, they can all execute at the same time, in effect making the program 4 times faster. You don't always manage to keep all 4 of them busy at the same time with your code; it is actually pretty rare. But you do keep 2 of them busy in this case. In other words, two comparisons take just as long as a single one: 1 cycle.
There are other factors at play that make the execution times identical. One thing that helps is that the processor can predict the branch very well: it can speculatively execute x > MaxSize in spite of the short-circuit evaluation, and it will in fact end up using the result, since the branch is never taken.
And the true bottleneck in this code is the array indexing; accessing memory is one of the slowest things the processor can do. So the "fast" version of the code isn't faster, even though it provides more opportunity for the processor to execute instructions concurrently. That isn't much of an opportunity today anyway; a processor has too many execution units to keep busy. That spare capacity is, incidentally, the feature that makes HyperThreading work. In both cases the processor bogs down at the same rate.
On my machine, I have to write code that occupies more than 4 engines to make it slower. Silly code like this:
if (x < 0 || x > MaxSize || x > 10000000 || x > 20000000 || x > 3000000) {
outOfRange++;
}
else {
inRange++;
}
Using 5 compares, now I can see a difference: 61 vs 47 msec. In other words, this is a way to count the number of integer engines in the processor. Hehe :)
So this is a micro-optimization that probably used to pay off a decade ago. It doesn't anymore. Scratch it off your list of things to worry about :)
I would suggest writing code that does not throw an exception when the index is out of range. Exceptions are incredibly expensive and can completely throw off your benchmark results.
The code below does a timed-average bench for 1,000 iterations of 1,000,000 results.
using System;
using System.Diagnostics;
namespace BenchTest
{
class Program
{
const int LoopCount = 1000000;
const int AverageCount = 1000;
static void Main(string[] args)
{
Console.WriteLine("Starting Benchmark");
RunTest();
Console.WriteLine("Finished Benchmark");
Console.Write("Press any key to exit...");
Console.ReadKey();
}
static void RunTest()
{
int cursorRow = Console.CursorTop; int cursorCol = Console.CursorLeft;
long totalTime1 = 0; long totalTime2 = 0;
long invalidOperationCount1 = 0; long invalidOperationCount2 = 0;
for (int i = 0; i < AverageCount; i++)
{
Console.SetCursorPosition(cursorCol, cursorRow);
Console.WriteLine("Running iteration: {0}/{1}", i + 1, AverageCount);
int[] indexArgs = RandomFill(LoopCount, int.MinValue, int.MaxValue);
int[] sizeArgs = RandomFill(LoopCount, 0, int.MaxValue);
totalTime1 += RunLoop(TestMethod1, indexArgs, sizeArgs, ref invalidOperationCount1);
totalTime2 += RunLoop(TestMethod2, indexArgs, sizeArgs, ref invalidOperationCount2);
}
PrintResult("Test 1", TimeSpan.FromTicks(totalTime1 / AverageCount), invalidOperationCount1);
PrintResult("Test 2", TimeSpan.FromTicks(totalTime2 / AverageCount), invalidOperationCount2);
}
static void PrintResult(string testName, TimeSpan averageTime, long invalidOperationCount)
{
Console.WriteLine(testName);
Console.WriteLine(" Average Time: {0}", averageTime);
Console.WriteLine(" Invalid Operations: {0} ({1})", invalidOperationCount, (invalidOperationCount / (double)(AverageCount * LoopCount)).ToString("P3"));
}
static long RunLoop(Func<int, int, int> testMethod, int[] indexArgs, int[] sizeArgs, ref long invalidOperationCount)
{
Stopwatch sw = new Stopwatch();
Console.Write("Running {0} sub-iterations", LoopCount);
sw.Start();
long startTickCount = sw.ElapsedTicks;
for (int i = 0; i < LoopCount; i++)
{
invalidOperationCount += testMethod(indexArgs[i], sizeArgs[i]);
}
sw.Stop();
long stopTickCount = sw.ElapsedTicks;
long elapsedTickCount = stopTickCount - startTickCount;
Console.WriteLine(" - Time Taken: {0}", new TimeSpan(elapsedTickCount));
return elapsedTickCount;
}
static int[] RandomFill(int size, int minValue, int maxValue)
{
int[] randomArray = new int[size];
Random rng = new Random();
for (int i = 0; i < size; i++)
{
randomArray[i] = rng.Next(minValue, maxValue);
}
return randomArray;
}
static int TestMethod1(int index, int size)
{
return (index < 0 || index >= size) ? 1 : 0;
}
static int TestMethod2(int index, int size)
{
return ((uint)(index) >= (uint)(size)) ? 1 : 0;
}
}
}
You aren't comparing like with like.
The code you were talking about not only saved one branch by using the optimisation, but also 4 bytes of CIL in a small method.
In a small method 4 bytes can be the difference in being inlined and not being inlined.
And if the method calling that method is also written to be small, then that can mean two (or more) method calls are jitted as one piece of inline code.
And because that inlined code is then available for analysis by the jitter, it may be optimised further still.
The real difference is not between index < 0 || index >= _size and (uint)index >= (uint)_size, but between code that has repeated efforts to minimise the method body size and code that does not. Look for example at how another method is used to throw the exception if necessary, further shaving off a couple of bytes of CIL.
(And no, that's not to say that I think all methods should be written like that, but there certainly can be performance differences when one does).
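As a sketch of that pattern (illustrative only, loosely modeled on List<T>, not the actual BCL source):
using System;

class TinyList<T>
{
    private T[] _items = new T[16];
    private int _size;

    public T this[int index]
    {
        get
        {
            // One unsigned compare covers both "negative" and "too large";
            // the throw lives in a separate helper, keeping this getter's
            // CIL small enough to stay under the inlining threshold.
            if ((uint)index >= (uint)_size)
                ThrowOutOfRange();
            return _items[index];
        }
    }

    private static void ThrowOutOfRange() =>
        throw new ArgumentOutOfRangeException("index");
}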
How big is an instance of the following class after the constructor is called?
I guess this can be written generally as size = n*x + c, where x = 4 on x86 and x = 8 on x64. What are n and c?
Is there some method in .NET which can return this number?
class Node
{
byte[][] a;
int[] b;
List<Node> c;
public Node()
{
a = new byte[3][];
b = new int[3];
c = new List<Node>(0);
}
}
First of all, this depends on the environment where the program is compiled and run, but if you fix some variables you can make a pretty good guess.
The answer to 2) is NO: there is no function that will return the requested number for an arbitrary object passed as an argument.
In solving 1) you have two approaches:
Try to perform some tests to find out
Analyze the object and do the math
Test approach
First take a look at these:
what-is-the-memory-overhead-of-a-net-object
Overhead of a .NET array?
C# List size vs double[] size
The method you need is this:
const int Size = 100000;
private static void InstanceOverheadTest()
{
object[] array = new object[Size];
long initialMemory = GC.GetTotalMemory(true);
for (int i = 0; i < Size; i++)
{
array[i] = new Node();
}
long finalMemory = GC.GetTotalMemory(true);
GC.KeepAlive(array);
long total = finalMemory - initialMemory;
Console.WriteLine("Measured size of each element: {0:0.000} bytes",
((double)total) / Size);
}
On my Windows 7 machine, VS 2012, .NET 4.5, x86 (32 bit) result is 96.000. When changed to x64 result is 176.000.
Do the math approach
The do-the-math approach can be written as a function that will give you the result, but it is specific to your Node class and is only valid before other operations are performed on your object. Also note that this was done in a 32-bit program, and that these numbers can change with framework implementation and version. This is just an example of how you can make a pretty good guess about an object's size at some moment, if the object is simple enough. The array and List overhead constants are taken from Overhead of a .NET array? and C# List size vs double[] size
public const int PointerSize32 = 4;
public const int ValueArrayOverhead32 = 12;
public const int RefArrayOverhead32 = 16;
public const int ListOverhead32 = 32;
private static int instanceOverheadAssume32()
{
int sa = RefArrayOverhead32 + 3 * PointerSize32;
int sb = ValueArrayOverhead32 + 3 * sizeof(int);
int sc = ListOverhead32;
return 3 * PointerSize32 + sa + sb + sc;
}
This also returns 96 (3*4 + 28 + 24 + 32 = 96, matching the measured value), so I assume the method is correct.