How big is an instance of the following class after the constructor is called?
I guess this can be written generally as size = nx + c, where x = 4 on x86 and x = 8 on x64. What are n and c?
Is there a method in .NET that can return this number?
class Node
{
byte[][] a;
int[] b;
List<Node> c;
public Node()
{
a = new byte[3][];
b = new int[3];
c = new List<Node>(0);
}
}
First of all, this depends on the environment where the program is compiled and run, but if you fix some variables you can make a pretty good guess.
The answer to 2) is no: there is no built-in function that will return the requested number for an arbitrary object passed as an argument.
In solving 1) you have two approaches:
Try to perform some tests to find out
Analyze the object and do the math
Test approach
First take a look at these:
what-is-the-memory-overhead-of-a-net-object
Overhead of a .NET array?
C# List size vs double[] size
The method you need is this:
const int Size = 100000;
private static void InstanceOverheadTest()
{
object[] array = new object[Size];
long initialMemory = GC.GetTotalMemory(true);
for (int i = 0; i < Size; i++)
{
array[i] = new Node();
}
long finalMemory = GC.GetTotalMemory(true);
GC.KeepAlive(array);
long total = finalMemory - initialMemory;
Console.WriteLine("Measured size of each element: {0:0.000} bytes",
((double)total) / Size);
}
On my Windows 7 machine with VS 2012 and .NET 4.5, the x86 (32-bit) result is 96.000. When changed to x64, the result is 176.000.
Do the math approach
The do-the-math approach can be written as a function that returns the result, but it is specific to your Node class, and it is only valid before other operations are performed on your object. Also note that this calculation assumes a 32-bit program, and that the numbers can change with the framework implementation and version. It is just an example of how you can make a pretty good guess about an object's size at some moment, if the object is simple enough. The array and List overhead constants are taken from Overhead of a .NET array? and C# List size vs double[] size
public const int PointerSize32 = 4;
public const int ValueArrayOverhead32 = 12;
public const int RefArrayOverhead32 = 16;
public const int ListOverhead32 = 32;
private static int instanceOverheadAssume32()
{
int sa = RefArrayOverhead32 + 3 * PointerSize32;
int sb = ValueArrayOverhead32 + 3 * sizeof(int);
int sc = ListOverhead32;
return 3 * PointerSize32 + sa + sb + sc;
}
This also returns 96, so I assume the method is correct.
Background
Hi,
I am creating a game where the AI collects possible moves from positions, which happens millions of times per turn. I am trying to figure out the fastest way to store these moves, containing e.g. pieceFromSquare, pieceToSquare, and other move-related data.
Currently I have a Move class which contains public variables for the move as per below:
public class Move
{
public int FromSquare;
public int ToSquare;
public int MoveType;
public static Move GetMove(int fromSquare, int toSquare, int moveType)
{
Move move = new Move();
move.FromSquare = fromSquare;
move.ToSquare = toSquare;
move.MoveType = moveType;
return move;
}
}
When the AI finds a move it stores the move in a List. I wondered whether it would be faster to store the move as a list of integers instead, so I ran a test:
Test
public void RunTimerTest()
{
// Function A
startTime = DateTime.UtcNow;
moveListA = new List<List<int>>();
for (int i = 0; i < numberOfIterations; i++)
{
FunctionA();
}
PrintResult((float)Math.Round((DateTime.UtcNow - startTime).TotalSeconds, 2), "Function A");
// Function B
startTime = DateTime.UtcNow;
moveListB = new List<Move>();
for (int i = 0; i < numberOfIterations; i++)
{
FunctionB();
}
PrintResult((float)Math.Round((DateTime.UtcNow - startTime).TotalSeconds, 2), "Function B");
}
private void FunctionA()
{
moveListA.Add(new List<int>() { 1, 123, 10});
}
private void FunctionB()
{
moveListB.Add(Move.GetMove(1, 123, 10));
}
Running each function 10 million times gives the following results:
Function A ran for: 4,58 s.
Function B ran for: 1,47 s.
So it is more than 3 times faster to create the class and populate its fields than to create a list of integers.
Questions
Why is it so much faster to create a class than a list of integers?
Is there an even faster way to store this type of data?
As mentioned in the comments, the reasons are probably mostly due to the list example needing to allocate at least two objects, possibly more depending on how it is optimized.
For high-performance code, a common guideline is to avoid high-frequency allocations. While allocations in C# are fast, they still take some time to manage. This often means sticking with fixed-size arrays, or at least setting the capacity of any lists on creation.
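A small sketch of the set-the-capacity guideline (the names and the element count here are just for illustration):

```csharp
using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        const int count = 1_000_000;

        // Without a capacity, the list starts small and repeatedly
        // reallocates and copies its backing array as it grows.
        var grown = new List<int>();

        // With the capacity set up front, one backing array is
        // allocated once and never resized.
        var presized = new List<int>(count);

        for (int i = 0; i < count; i++)
        {
            grown.Add(i);
            presized.Add(i);
        }

        // The presized list never had to regrow.
        Console.WriteLine(presized.Capacity == count); // True
    }
}
```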
Another important point is using structs: they are stored directly in the list/array instead of via a reference to a separate object. This avoids some per-object overhead, removes the memory needed for a separate reference, and ensures all values are stored sequentially in memory, all of which helps use the caches efficiently. Using a smaller data type like short/ushort may also help if that is possible. Note that structs should preferably be immutable, and keywords like 'ref' and 'in' can help avoid the overhead of copying the data.
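A minimal sketch of that struct approach, with hypothetical names, assuming short is large enough for square indices:

```csharp
using System;

// A small immutable struct; stored inline in arrays/lists,
// so all values sit sequentially in memory.
readonly struct MoveValue
{
    public readonly short From;   // smaller types shrink each element
    public readonly short To;
    public readonly short Type;

    public MoveValue(short from, short to, short type)
    {
        From = from; To = to; Type = type;
    }

    // 'in' passes a readonly reference instead of copying the struct.
    public static bool SameSquares(in MoveValue a, in MoveValue b)
        => a.From == b.From && a.To == b.To;
}

class Demo
{
    static void Main()
    {
        var m1 = new MoveValue(1, 123, 10);
        var m2 = new MoveValue(1, 123, 3);
        Console.WriteLine(MoveValue.SameSquares(in m1, in m2)); // True
    }
}
```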
In some specific cases it can be a good idea to separate the values into different arrays, i.e. one for all the FromSquare values, one for all the ToSquare values, etc. This can be a benefit if an operation mostly uses only a single value, again benefiting from better cache usage. It might also make SIMD easier to apply, though that might not apply in this case.
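A rough sketch of that separate-arrays ("structure of arrays") idea, with hypothetical names:

```csharp
using System;

// Each field lives in its own array, so a pass that only reads
// FromSquare streams through contiguous memory.
class MoveTable
{
    public int[] FromSquare;
    public int[] ToSquare;
    public int[] MoveType;
    public int Count;

    public MoveTable(int capacity)
    {
        FromSquare = new int[capacity];
        ToSquare = new int[capacity];
        MoveType = new int[capacity];
    }

    public void Add(int from, int to, int type)
    {
        FromSquare[Count] = from;
        ToSquare[Count] = to;
        MoveType[Count] = type;
        Count++;
    }
}

class Demo
{
    static void Main()
    {
        var table = new MoveTable(4);
        table.Add(1, 123, 10);
        table.Add(2, 99, 10);

        // Scan only the FromSquare column, cache-friendly.
        int sum = 0;
        for (int i = 0; i < table.Count; i++)
            sum += table.FromSquare[i];
        Console.WriteLine(sum); // 3
    }
}
```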
Moreover, when measuring performance, at least use a Stopwatch. It is much more accurate than DateTime and no harder to use. Benchmark.Net would be even better, since it compensates for various sources of noise and can benchmark on multiple platforms. A good profiler can also be useful, since it can hint at what takes the most time, how much you are allocating, etc.
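A minimal sketch of timing with Stopwatch instead of DateTime (the workload is just a placeholder loop):

```csharp
using System;
using System.Diagnostics;

class TimingDemo
{
    static void Main()
    {
        // Stopwatch uses the high-resolution performance counter,
        // unlike DateTime.UtcNow, whose resolution is roughly 10-15 ms.
        var sw = Stopwatch.StartNew();

        long sum = 0;
        for (int i = 0; i < 10_000_000; i++)
            sum += i;

        sw.Stop();
        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds:0.000} ms (sum={sum})");
    }
}
```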
I examined four options, and allocating arrays for each record is by far the slowest. I did not check allocating class objects, as I was going for the faster options.
struct Move {} stores three integer values in readonly fields of a structure.
struct FMove {} stores a fixed int buffer of size 3
(int,int,int) stores a tuple of three int values
int[] allocates an array of three values.
With the following results, tracking time and allocated size in bytes:

| Storage | Iterations | Time | Size |
|---|---|---|---|
| Move struct | 10000000 allocations | 0.2633584 seconds | 120000064 bytes |
| FMove struct | 10000000 allocations | 0.3572664 seconds | 120000064 bytes |
| Tuple | 10000000 allocations | 0.702174 seconds | 160000064 bytes |
| Array | 10000000 allocations | 1.2226393 seconds | 480000064 bytes |
public readonly struct Move
{
public readonly int FromSquare;
public readonly int ToSquare;
public readonly int MoveType;
public Move(int fromSquare, int toSquare, int moveType) : this()
{
FromSquare = fromSquare;
ToSquare = toSquare;
MoveType = moveType;
}
}
public unsafe struct FMove
{
fixed int Data[3];
public FMove(int fromSquare, int toSquare, int moveType) : this()
{
Data[0] = fromSquare;
Data[1] = toSquare;
Data[2] = moveType;
}
public int FromSquare { get => Data[0]; }
public int ToSquare { get => Data[1]; }
public int MoveType { get => Data[2]; }
}
static class Program
{
static void Main(string[] args)
{
// Always compile with Release to time
const int count = 10000000;
Console.WriteLine("Burn-in start");
// Burn-in. Do some calc to spool up
// the CPU
AllocateArray(count/10);
Console.WriteLine("Burn-in end");
double[] timing = new double[4];
// store timing results for four different
// allocation types
var sw = new Stopwatch();
Console.WriteLine("Timming start");
long startMemory;
startMemory = GC.GetTotalMemory(true);
sw.Restart();
var r4 = AllocateArray(count);
sw.Stop();
var s4 = GC.GetTotalMemory(true) - startMemory;
timing[3] = sw.Elapsed.TotalSeconds;
startMemory = GC.GetTotalMemory(true);
sw.Restart();
var r1 = AllocateMove(count);
sw.Stop();
var s1 = GC.GetTotalMemory(true) - startMemory;
timing[0] = sw.Elapsed.TotalSeconds;
startMemory = GC.GetTotalMemory(true);
sw.Restart();
var r2 = AllocateFMove(count);
sw.Stop();
var s2 = GC.GetTotalMemory(true) - startMemory;
timing[1] = sw.Elapsed.TotalSeconds;
startMemory = GC.GetTotalMemory(true);
sw.Restart();
var r3 = AllocateTuple(count);
sw.Stop();
var s3 = GC.GetTotalMemory(true) - startMemory;
timing[2] = sw.Elapsed.TotalSeconds;
Console.WriteLine($"| Storage | Iterations | Time | Size |");
Console.WriteLine($"|---|---|---|---|");
Console.WriteLine($"| Move struct| {r1.Count} allocations | {timing[0]} seconds | {s1} bytes |");
Console.WriteLine($"| FMove struct| {r2.Count} allocations | {timing[1]} seconds | {s2} bytes |");
Console.WriteLine($"| Tuple| {r3.Count} allocations | {timing[2]} seconds | {s3} bytes |");
Console.WriteLine($"| Array| {r4.Count} allocations | {timing[3]} seconds | {s4} bytes |");
}
static List<Move> AllocateMove(int count)
{
var result = new List<Move>(count);
for (int i = 0; i < count; i++)
{
result.Add(new Move(1, 123, 10));
}
return result;
}
static List<FMove> AllocateFMove(int count)
{
var result = new List<FMove>(count);
for (int i = 0; i < count; i++)
{
result.Add(new FMove(1, 123, 10));
}
return result;
}
static List<(int from, int to, int type)> AllocateTuple(int count)
{
var result = new List<(int from, int to, int type)>(count);
for (int i = 0; i < count; i++)
{
result.Add((1, 123, 10));
}
return result;
}
static List<int[]> AllocateArray(int count)
{
var result = new List<int[]>(count);
for (int i = 0; i < count; i++)
{
result.Add(new int[] { 1, 123, 10});
}
return result;
}
}
Based on the comments, I decided to use BenchmarkDotNet for the above comparison also and the results are quite similar
| Method | Count | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| Move | 10000000 | 115.9 ms | 2.27 ms | 2.23 ms | 0.10 |
| FMove | 10000000 | 149.7 ms | 2.04 ms | 1.91 ms | 0.12 |
| Tuple | 10000000 | 154.8 ms | 2.99 ms | 2.80 ms | 0.13 |
| Array | 10000000 | 1,217.5 ms | 23.84 ms | 25.51 ms | 1.00 |
I decided to add a class allocation (called CMove) with the following definition
public class CMove
{
public readonly int FromSquare;
public readonly int ToSquare;
public readonly int MoveType;
public CMove(int fromSquare, int toSquare, int moveType)
{
FromSquare = fromSquare;
ToSquare = toSquare;
MoveType = moveType;
}
}
And used the above as a baseline for benchmarking. I also tried different allocation sizes. And here is a summary of the results.
Anything below 1.0 means it is faster than CMove. As you can see, array allocation is always bad. For a few allocations it does not matter much, but for many allocations there are clear winners.
I am writing some geometry-processing code, Delaunay triangulation to be more specific, and I need it to be fast, so I use simple arrays of primitives as the data structure to represent my triangulation information. Here is a sample of it:
private readonly float2[] points;
private readonly int[] pointsHalfEdgeStartCount;
private readonly int[] pointsIncomingHalfEdgeIndexes;
So let's say I want to iterate quickly through all the incoming half-edges of the point with index p; I just do this using the precomputed arrays:
int count = pointsHalfEdgeStartCount[p * 2 + 1];
for (int i = 0; i < count; i++)
{
var e = pointsIncomingHalfEdgeIndexes[pointsHalfEdgeStartCount[p * 2] + i];
}
// pointsHalfEdgeStartCount[p * 2] is the start index
And this is fast enough, but it does not feel safe or very clear. So I had the idea of wrapping my index in a struct to make it clearer while retaining the performance, something like this:
public readonly struct Point
{
public readonly int index;
public readonly DelaunayTriangulation delaunay;
public Point(int index, DelaunayTriangulation delaunay)
{
this.index = index;
this.delaunay = delaunay;
}
public int GetIncomingHalfEdgeCount() => delaunay.pointsEdgeStartCount[index * 2 + 1];
public HalfEdge GetIncomingHalfEdge(int i)
{
return new HalfEdge(
delaunay,
delaunay.pointsIncomingHalfEdgeIndexes[delaunay.pointsEdgeStartCount[index * 2] + i]
);
}
//... other methods
}
Then I can just write:
int count = p.GetIncomingHalfEdgeCount();
for (int i = 0; i < count; i++)
{
var e = p.GetIncomingHalfEdge(i);
}
However, it was kind of killing my performance, being a lot slower (around 10 times) on a benchmark I did, iterating over all the points and over all their incoming half-edges. I guess that storing a reference to the Delaunay triangulation in each point struct was an obvious waste and slowed down all the operations involving points, having twice the amount of data to move.
I could have made the DelaunayTriangulation a static class, but that was not practical for other reasons, so I did this:
public readonly struct Point
{
public readonly int index;
public Point(int index) => this.index = index;
public int GetIncomingHalfEdgeCount(DelaunayTriangulation delaunay) => delaunay.pointsEdgeStartCount[index * 2 + 1];
public HalfEdge GetIncomingHalfEdge(DelaunayTriangulation delaunay, int i)
{
return new HalfEdge(
delaunay.pointsIncomingHalfEdgeIndexes[delaunay.pointsEdgeStartCount[index * 2] + i]
);
}
//... other methods
}
Now I can write:
int count = p.GetIncomingHalfEdgeCount(delaunay);
for (int i = 0; i < count; i++)
{
var e = p.GetIncomingHalfEdge(delaunay, i);
}
It was quite a lot faster, but still 2.5 times slower than the first method using plain ints. I wondered if it could be because the first method returned an int while the other methods returned a HalfEdge struct (a struct similar to the Point struct, containing only an index as data and a couple of methods), and indeed the difference between the plain-int version and the faster struct version vanished when I used the e int to instantiate a new HalfEdge struct. Though I am not sure why that is so costly. Weirder still, for clarity's sake I explored the option of writing the methods inside the Delaunay class instead of the Point struct:
// In the DelaunayTriangulation class:
public int GetPointIncomingHalfEdgeCount(Point p) => pointsEdgeStartCount[p.index * 2 + 1];
public HalfEdge GetPointIncomingHalfEdge(Point p, int i)
{
return new HalfEdge(
pointsIncomingHalfEdgeIndexes[pointsEdgeStartCount[p.index * 2] + i]
);
}
And I used it like this:
int count = delaunay.GetPointIncomingHalfEdgeCount(p);
for (int i = 0; i < count; i++)
{
var e = delaunay.GetPointIncomingHalfEdge(p, i);
}
And it was 3 times slower than the previous method! I have no idea why.
I tried to use disassembly to see what machine code was generated, but I failed to do so (I am working with Unity3D). Am I condemned to rely on plain ints in arrays and sane variable naming, and to renounce any compile-time type checking (is this int really a point index?)?
I am not even bringing up other questions, such as why it is even slower when I try to use IEnumerable types with yield, like so:
public IEnumerable<int> GetPointIncomingHalfEdges(Point p)
{
int start = pointsEdgeStartCount[p.index * 2]; // this should be a slight optimization right ?
int count = pointsEdgeStartCount[p.index * 2 + 1];
for (int i = 0; i < count; i++)
{
yield return pointsIncomingHalfEdgeIndexes[start + i];
}
}
I have added a compiler directive for aggressive inlining and it seems to make up for the discrepancies in time! For some reason the compiler fails to inline correctly for:
var e = delaunay.GetPointIncomingHalfEdge(p, i);
While it managed to do so with
var e = p.GetIncomingHalfEdge(delaunay, i);
Why? I do not know. However, it would be far easier if I were able to see how the code is compiled, and I could not find out how to do that. I will look into it, maybe open another question, and if I find a better explanation I will come back!
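For reference, the aggressive-inlining directive mentioned above is expressed with MethodImplAttribute; here is a minimal sketch (the method and names are hypothetical, not the code from the question):

```csharp
using System;
using System.Runtime.CompilerServices;

class InlineDemo
{
    // Asks the JIT to inline this accessor if possible; without the hint,
    // small accessor methods are sometimes left as real calls.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int GetIncomingHalfEdge(int[] indexes, int start, int i)
        => indexes[start + i];

    static void Main()
    {
        int[] indexes = { 7, 8, 9 };
        Console.WriteLine(GetIncomingHalfEdge(indexes, 1, 1)); // 9
    }
}
```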
What I need:
a polygon with an arbitrary number of vertices (or at least up to some max number of vertices)
it should be a struct, so that it can be fast and can be assigned/passed by value
It seems like I can't use arrays or collections for storing vertices, because then my polygon struct would point to objects on the heap, and when one polygon is assigned to another by value only a shallow copy would be performed, leaving both polygons pointing to the same vertex array. For example:
Polygon a = new Polygon();
Polygon b = a;
// both polygons would be changed
b.vertices[0] = 5;
Then how do I create a struct that can have an arbitrary (or some fixed) number of vertices, but without using the heap at all?
I could just use lots of variables like v1, v2, v3 ... v10 etc., but I want to keep my code clean, more or less.
You have the option to define your array with the fixed keyword, which stores the buffer inline in the struct rather than on the heap.
But you cannot directly access the elements of the array, unless you are in an unsafe context and use pointers.
To get the following behavior:
static void Main(string[] args)
{
FixedArray vertices = new FixedArray(10);
vertices[0] = 4;
FixedArray copy = vertices;
copy[0] = 8;
Debug.WriteLine(vertices[0]);
// 4
Debug.WriteLine(copy[0]);
// 8
}
Then use the following class definition:
public unsafe struct FixedArray
{
public const int MaxSize = 100;
readonly int size;
fixed double data[MaxSize];
public FixedArray(int size) : this(new double[size])
{ }
public FixedArray(double[] values)
{
this.size = Math.Min(values.Length, MaxSize);
for (int i = 0; i < size; i++)
{
data[i] = values[i];
}
}
public double this[int index]
{
get
{
if (index>=0 && index<size)
{
return data[index];
}
return 0;
}
set
{
if (index>=0 && index<size)
{
data[index] = value;
}
}
}
public double[] ToArray()
{
var array = new double[size];
for (int i = 0; i < size; i++)
{
array[i] = data[i];
}
return array;
}
}
A couple of things to consider. The above needs to be compiled with the unsafe option. Also, MaxSize must be a constant, and the storage required cannot exceed this value. I am using an indexer this[int] to access the elements (instead of a field) and also have a method to convert to a native array with ToArray(). The constructor can also take a native array, or it will use an empty array to initialize the values. This ensures that new FixedArray(10), for example, will have initialized at least 10 values in the fixed buffer (instead of them being undefined, as is the default).
Read more about this usage of fixed from Microsoft or search for C# Fixed Size Buffers.
Heap array field
struct StdArray
{
int[] vertices;
public StdArray(int size)
{
vertices = new int[size];
}
}
Stack array field
unsafe struct FixedArray
{
fixed int vertices[100];
int size;
public FixedArray(int size)
{
this.size = size;
// no initialization needed for `vertices`
}
}
If it suits your logic, you could use a Span<T>, which can wrap stack-allocated memory. Read more here
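A minimal sketch of the Span<T> route, assuming the vertex data only needs to live for the duration of a single computation (Span is a ref struct, so it cannot be stored as a field of an ordinary struct):

```csharp
using System;

class SpanDemo
{
    static void Main()
    {
        // stackalloc memory lives on the stack; the Span is just a
        // view over it, so no heap allocation happens at all.
        Span<int> vertices = stackalloc int[4];
        vertices[0] = 5;

        // Slicing and writing work like an array; the slice shares
        // the same stack memory.
        Span<int> firstTwo = vertices.Slice(0, 2);
        firstTwo[1] = 7;

        Console.WriteLine(vertices[0] + vertices[1]); // 12
    }
}
```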
Another way is to just copy the array with a copy constructor:
public Polygon(Polygon other)
{
this.vertices = other.vertices.Clone() as int[];
}
then
var a = new Polygon();
a.vertices[0] = 5;
var b = new Polygon(a);
Debug.WriteLine(a.vertices[0]);
// 5
Debug.WriteLine(b.vertices[0]);
// 5
b.vertices[0] = 10;
Debug.WriteLine(a.vertices[0]);
// 5
Debug.WriteLine(b.vertices[0]);
// 10
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where the lookups now take 30% of the processing time. I'm wondering if it might be slow RAM, but I would still like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++, p++) {
newResults[p] = largeArray[lookUp[p]];
}
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it, due to word size).
Would an IntPtr be faster? How would I go about that?
Attached below is a screenshot of time distribution:
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3 and should work something like the code below, which covers the single-loop scenario (you can probably work from here to get the double-loop version).
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
static class P
{
static int Gather(int[] source, int[] index, int[] results, bool avx)
{ // normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return
int y = 0;
if (Avx2.IsSupported && avx)
{
var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
unsafe
{
fixed (int* sPtr = source)
{
// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)
{
rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
}
}
}
// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;
}
// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)
{
results[y] = source[index[y]];
}
return result;
}
static void Main()
{
// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);
// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);
int[] results = new int[1024];
void GatherLocal(bool avx)
{
// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
for (int i = 0; i < 32; i++)
{
Console.Write(results[i].ToString("x2"));
}
Console.WriteLine();
const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
Console.WriteLine();
}
GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);
}
}
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So your code will be memory bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and it will take forever to iterate over. It does not matter how fast the memory is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll 2 six-sided dice, every result from 2 to 12 is possible, but a 7 comes up in 6 out of 36 cases, while the 2 and the 12 come up in only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial than the 2 and the 12.
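A minimal sketch of the lazy-initialisation idea (the names and the squared-value computation are placeholders for the real expensive calculation):

```csharp
using System;
using System.Collections.Generic;

// A lazily-filled lookup: entries are computed on first access and
// cached, so rarely-used values never cost any memory or work.
class LazyLookup
{
    readonly Dictionary<int, int> cache = new Dictionary<int, int>();
    public int ComputeCount { get; private set; }

    public int Get(int key)
    {
        if (!cache.TryGetValue(key, out int value))
        {
            value = key * key;   // stand-in for the real calculation
            cache[key] = value;
            ComputeCount++;
        }
        return value;
    }
}

class Demo
{
    static void Main()
    {
        var lookup = new LazyLookup();
        lookup.Get(7);
        lookup.Get(7);           // second access hits the cache
        Console.WriteLine(lookup.ComputeCount); // 1
    }
}
```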
I am rewriting a high-performance C++ application in C#. The C# app is noticeably slower than the C++ original. Profiling tells me that the C# app spends most of its time accessing array elements, so I created a simple array-access benchmark. I get completely different results than others doing a similar comparison.
The C++ code:
#include <limits>
#include <stdio.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int main(void)
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
int xRepLen = 100 * 1000;
int xRepCount = 1000;
unsigned short * xArray = new unsigned short[xRepLen];
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
xArray[xIdx] = xIdx % USHRT_MAX;
int * xResults = new int[xRepLen];
for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
{
// in each repetition, find the first value, that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
{
unsigned short xValToBreach = (xArray[xIdx] + 25) % USHRT_MAX;
xResults[xIdx] = 0;
for (int xIdx2 = xIdx + 1; xIdx2 < xRepLen; xIdx2++)
if (xArray[xIdx2] >= xValToBreach)
{
xResults[xIdx] = xIdx2; break;
}
if (xResults[xIdx] == 0)
xResults[xIdx] = INT_MAX;
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "Elapsed milliseconds " << duration;
getchar();
}
The C# code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace arrayBenchmarkCs
{
class Program
{
public static void benchCs()
{
unsafe
{
int xRepLen = 100 * 1000;
int xRepCount = 1000;
ushort[] xArr = new ushort[xRepLen];
for (int xIdx = 0; xIdx < xRepLen; xIdx++)
xArr[xIdx] = (ushort)(xIdx % 0xffff);
int[] xResults = new int[xRepLen];
Stopwatch xSw = new Stopwatch(); xSw.Start();
fixed (ushort * xArrayStart = & xArr [0])
{
for (int xRepIdx = 0; xRepIdx < xRepCount; xRepIdx++)
{
// in each repetition, go find the first value, that surpasses xArray[xIdx] + 25 - i.e. we will perform 25 searches
ushort * xArrayEnd = xArrayStart + xRepLen;
for (ushort* xPtr = xArrayStart; xPtr != xArrayEnd; xPtr++)
{
ushort xValToBreach = (ushort)((*xPtr + 25) % 0xffff);
int xResult = -1;
for (ushort * xPtr2 = xPtr + 1; xPtr2 != xArrayEnd; xPtr2++)
if ( *xPtr2 >= xValToBreach)
{
xResult = (int)(xPtr2 - xArrayStart);
break;
}
if (xResult == -1)
xResult = int.MaxValue;
// save result
xResults[xPtr - xArrayStart] = xResult;
}
}
} // fixed
xSw.Stop();
Console.WriteLine("Elapsed milliseconds: " + xSw.ElapsedMilliseconds.ToString("0"));
}
}
static void Main(string[] args)
{
benchCs();
Console.ReadKey();
}
}
}
On my work computer (i7-3770), the C++ version is approx. 2x faster than the C# version. On my home computer (i7-5820K), the C++ is 1.5x faster. Both are measured in Release builds. I hoped that by using pointers in C# I would avoid the array bounds checking, and that the performance would be the same in both languages.
So my questions are the following:
how come others find C# to be the same speed as C++?
how can I get C# performance to the C++ level if not via pointers?
what could be the driver of different speedups on different computers?
Any hint is much appreciated,
Daniel
You won't get this kind of hardcore number crunching to C++ speed. Using pointer arithmetic and unsafe code gets you some of the way there (it's almost half as slow again if you remove the unsafe and fixed parts). C# isn't compiled directly to native code, and the code it runs is full of extra checks.
If you're willing to go unsafe then really there's nothing stopping you coding your C++ performance-critical stuff into a mixed-mode assembly, and calling that from your C# glue code.
The C++ code does not work the same way as the C#: the inner loops are different. The C++ version performs 4 memory operations on xResults[xIdx], while the C# version keeps the value in a local and performs just 1.
I was shocked that the performance of the C# code depends so much on the framework version.
What's even more interesting, C# on .NET Core 3.1 outperformed C++ by 5%. With the other frameworks I checked, C# was 30-50% slower than C++.