I built a test and got the following results:
allocating classes: 15.3260622, allocating structs: 14.7216018.
That looks like about a 4% advantage when allocating structs instead of classes. That's nice, but is it really enough to justify adding value types to the language? Where can I find an example that shows structs really beating classes?
void Main()
{
    var stopWatch = new System.Diagnostics.Stopwatch();
    stopWatch.Start();
    for (int i = 0; i < 100000000; i++)
    {
        var foo = new refFoo()
        {
            Str = "Alex" + i
        };
    }
    stopWatch.Stop();
    stopWatch.Dump();

    stopWatch.Restart();
    for (int i = 0; i < 100000000; i++)
    {
        var foo = new valFoo()
        {
            Str = "Alex" + i
        };
    }
    stopWatch.Stop();
    stopWatch.Dump();
}

public struct valFoo
{
    public string Str;
}

public class refFoo
{
    public string Str;
}
Your methodology is flawed: you are mostly measuring string allocation, integer-to-string conversion, and string concatenation. This benchmark is not worth the bits it is written on.
In order to see the benefit of structs, compare allocating an array of 1000 objects and an array of 1000 structs. In the case of the array of objects, you will need one allocation for the array itself, and then one allocation for each object in the array. In the case of the array of structs, you have one allocation for the array of structs.
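That comparison can be sketched directly with GC.GetTotalMemory. The PointClass/PointStruct types below are hypothetical, invented here for illustration, and exact byte counts depend on the runtime:

```csharp
using System;

// Hypothetical types for illustration only.
class PointClass { public int X, Y; }
struct PointStruct { public int X, Y; }

static class MemoryDemo
{
    public static (long classBytes, long structBytes) Measure()
    {
        long before = GC.GetTotalMemory(true);
        var classes = new PointClass[1000];
        for (int i = 0; i < classes.Length; i++)
            classes[i] = new PointClass();      // one allocation per element, plus the array itself
        long classBytes = GC.GetTotalMemory(true) - before;

        before = GC.GetTotalMemory(true);
        var structs = new PointStruct[1000];    // a single allocation holds all 1000 structs
        long structBytes = GC.GetTotalMemory(true) - before;

        // keep both arrays alive so the forced collections above don't reclaim them mid-measurement
        GC.KeepAlive(classes);
        GC.KeepAlive(structs);
        return (classBytes, structBytes);
    }

    static void Main()
    {
        var (c, s) = Measure();
        Console.WriteLine($"array of objects: {c} bytes, array of structs: {s} bytes");
    }
}
```

On a 64-bit runtime the object array typically costs several times more, since each element carries an object header and a reference on top of its two ints.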
Also, look at the implementation of the enumerator of the List&lt;T&gt; class in the .NET collections source code. It is declared as a struct. That's because it is tiny (a reference to the list, an index, and the current element), so returning one from GetEnumerator costs no heap allocation at all, which makes it very inexpensive.
Try a simpler test:
int size = 1000000;
var listA = new List<int>(size);
for (int i = 0; i < size; i++)
    listA.Add(i);

var listB = new List<object>(size);
for (int i = 0; i < size; i++)
    listB.Add(i);
To store 1,000,000 integers, the first case allocates about 4,000,000 bytes for the backing array. In the second case every Add boxes the int, so, if I'm not mistaken, you end up with roughly 12,000,000 bytes (a reference plus a boxed object per element; exact sizes depend on the platform). And I suspect the performance difference will be much greater.
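One way to verify those numbers rather than estimate them is to wrap the same two loops in GC.GetTotalMemory measurements (a sketch; exact byte counts depend on runtime and platform):

```csharp
using System;
using System.Collections.Generic;

static class BoxingDemo
{
    public static (long intBytes, long objBytes) Measure(int size)
    {
        long before = GC.GetTotalMemory(true);
        var listA = new List<int>(size);
        for (int i = 0; i < size; i++)
            listA.Add(i);                   // stored inline in the backing int[]
        long intBytes = GC.GetTotalMemory(true) - before;

        before = GC.GetTotalMemory(true);
        var listB = new List<object>(size);
        for (int i = 0; i < size; i++)
            listB.Add(i);                   // each Add boxes the int on the heap
        long objBytes = GC.GetTotalMemory(true) - before;

        // keep both lists alive so the forced collections don't reclaim them mid-measurement
        GC.KeepAlive(listA);
        GC.KeepAlive(listB);
        return (intBytes, objBytes);
    }

    static void Main()
    {
        var (a, b) = Measure(1_000_000);
        Console.WriteLine($"List<int>: {a} bytes, List<object>: {b} bytes");
    }
}
```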
Background
Hi,
I am creating a game where the AI collects possible moves from positions, which is done millions of times per turn. I am trying to figure out the fastest way to store these moves, containing e.g. pieceFromSquare, pieceToSquare and other move-related data.
Currently I have a Move class which contains public fields for the move, as per below:
public class Move
{
    public int FromSquare;
    public int ToSquare;
    public int MoveType;

    public static Move GetMove(int fromSquare, int toSquare, int moveType)
    {
        Move move = new Move();
        move.FromSquare = fromSquare;
        move.ToSquare = toSquare;
        move.MoveType = moveType;
        return move;
    }
}
When the AI finds a move, it stores the move in a List. I wondered whether it would be faster to store the move as a list of integers instead, so I ran a test:
Test
public void RunTimerTest()
{
    // Function A
    startTime = DateTime.UtcNow;
    moveListA = new List<List<int>>();
    for (int i = 0; i < numberOfIterations; i++)
    {
        FunctionA();
    }
    PrintResult((float)Math.Round((DateTime.UtcNow - startTime).TotalSeconds, 2), "Function A");

    // Function B
    startTime = DateTime.UtcNow;
    moveListB = new List<Move>();
    for (int i = 0; i < numberOfIterations; i++)
    {
        FunctionB();
    }
    PrintResult((float)Math.Round((DateTime.UtcNow - startTime).TotalSeconds, 2), "Function B");
}

private void FunctionA()
{
    moveListA.Add(new List<int>() { 1, 123, 10 });
}

private void FunctionB()
{
    moveListB.Add(Move.GetMove(1, 123, 10));
}
The test gives the following result when run 10 million times:
Function A ran for: 4,58 s.
Function B ran for: 1,47 s.
So it is more than 3 times faster to create the class and populate its variables than to create a list of integers.
Questions
Why is it so much faster to create a class than a list of integers?
Is there an even faster way to store this type of data?
As mentioned in the comments, the reasons are probably mostly due to the list example needing to allocate at least two objects, possibly more depending on how it is optimized.
For high-performance code, a common guideline is to avoid high-frequency allocations. While allocations in C# are fast, they still take some time to manage. This often means sticking with fixed-size arrays, or at least setting the capacity of any lists at creation.
Another important point is using structs: they are stored directly in the list/array, instead of storing a reference to a separate object. This avoids the per-object overhead, removes the memory needed for a separate reference, and ensures all values are stored sequentially in memory, all of which helps use the caches efficiently. Using a smaller datatype like short/ushort may also help if that is possible. Note that such structs should preferably be immutable, and the `ref` and `in` keywords can help avoid the overhead of copying them.
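A minimal sketch of those points (the Move struct here is a hypothetical stand-in, not the asker's class): a readonly struct is stored inline in arrays, and `in` passes it by readonly reference without a defensive copy:

```csharp
using System;

// Immutable value type: stored inline in arrays and lists.
public readonly struct Move
{
    public readonly int FromSquare;
    public readonly int ToSquare;
    public readonly int MoveType;

    public Move(int fromSquare, int toSquare, int moveType)
    {
        FromSquare = fromSquare;
        ToSquare = toSquare;
        MoveType = moveType;
    }
}

public static class Demo
{
    // 'in' passes the struct by readonly reference; because Move is a
    // readonly struct, the compiler needs no defensive copy either.
    public static int Distance(in Move m) => m.ToSquare - m.FromSquare;

    static void Main()
    {
        var moves = new Move[] { new Move(1, 123, 10) }; // structs live inside the array itself
        Console.WriteLine(Distance(in moves[0]));        // prints 122
    }
}
```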
In some specific cases it can be worthwhile to separate the values into different arrays, i.e. one for all the FromSquare values, one for all the ToSquare values, etc. This is a benefit if an operation mostly touches a single field, again because of better cache usage. It might also make SIMD easier to use, though that may not matter in this case.
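As a sketch of that idea, a hypothetical MoveBuffer (my invention, not from the question) keeps each field in its own array, so a scan over one field reads contiguous memory only:

```csharp
// Structure-of-arrays layout: each field lives in its own array, so an
// operation that only needs one field never drags the others into cache.
public sealed class MoveBuffer
{
    public readonly int[] FromSquare;
    public readonly int[] ToSquare;
    public readonly int[] MoveType;
    public int Count;

    public MoveBuffer(int capacity)
    {
        FromSquare = new int[capacity];
        ToSquare = new int[capacity];
        MoveType = new int[capacity];
    }

    public void Add(int from, int to, int type)
    {
        FromSquare[Count] = from;
        ToSquare[Count] = to;
        MoveType[Count] = type;
        Count++;
    }

    // Touches only one of the three arrays.
    public int MaxToSquare()
    {
        int max = int.MinValue;
        for (int i = 0; i < Count; i++)
            if (ToSquare[i] > max) max = ToSquare[i];
        return max;
    }
}
```

The trade-off is that operations needing all fields of one move now touch three cache lines instead of one, so this layout only pays off when single-field scans dominate.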
Moreover, when measuring performance, at least use a Stopwatch; it is much more accurate than DateTime and no harder to use. BenchmarkDotNet would be even better, since it compensates for various sources of noise and can benchmark multiple platforms. A good profiler is also useful, since it can hint at what takes the most time, how much you are allocating, and so on.
I examined four options, and allocating arrays for each record is by far the slowest. I did not check allocating class objects, as I was going for the faster options.

struct Move {} stores three integer values in readonly fields of a structure.
struct FMove {} stores a fixed int array of size 3.
(int,int,int) stores a tuple of three int values.
int[] allocates an array of three values.

The results are below; I am tracking time and size allocated in bytes.
| Storage | Iterations | Time | Size |
|---|---|---|---|
| Move struct | 10000000 allocations | 0.2633584 seconds | 120000064 bytes |
| FMove struct | 10000000 allocations | 0.3572664 seconds | 120000064 bytes |
| Tuple | 10000000 allocations | 0.702174 seconds | 160000064 bytes |
| Array | 10000000 allocations | 1.2226393 seconds | 480000064 bytes |
public readonly struct Move
{
    public readonly int FromSquare;
    public readonly int ToSquare;
    public readonly int MoveType;

    public Move(int fromSquare, int toSquare, int moveType) : this()
    {
        FromSquare = fromSquare;
        ToSquare = toSquare;
        MoveType = moveType;
    }
}

public unsafe struct FMove
{
    fixed int Data[3];

    public FMove(int fromSquare, int toSquare, int moveType) : this()
    {
        Data[0] = fromSquare;
        Data[1] = toSquare;
        Data[2] = moveType;
    }

    public int FromSquare { get => Data[0]; }
    public int ToSquare { get => Data[1]; }
    public int MoveType { get => Data[2]; }
}
static class Program
{
    static void Main(string[] args)
    {
        // Always compile with Release to time
        const int count = 10000000;

        Console.WriteLine("Burn-in start");
        // Burn-in. Do some calc to spool up the CPU
        AllocateArray(count / 10);
        Console.WriteLine("Burn-in end");

        // store timing results for four different allocation types
        double[] timing = new double[4];
        var sw = new Stopwatch();
        Console.WriteLine("Timing start");
        long startMemory;

        startMemory = GC.GetTotalMemory(true);
        sw.Restart();
        var r4 = AllocateArray(count);
        sw.Stop();
        var s4 = GC.GetTotalMemory(true) - startMemory;
        timing[3] = sw.Elapsed.TotalSeconds;

        startMemory = GC.GetTotalMemory(true);
        sw.Restart();
        var r1 = AllocateMove(count);
        sw.Stop();
        var s1 = GC.GetTotalMemory(true) - startMemory;
        timing[0] = sw.Elapsed.TotalSeconds;

        startMemory = GC.GetTotalMemory(true);
        sw.Restart();
        var r2 = AllocateFMove(count);
        sw.Stop();
        var s2 = GC.GetTotalMemory(true) - startMemory;
        timing[1] = sw.Elapsed.TotalSeconds;

        startMemory = GC.GetTotalMemory(true);
        sw.Restart();
        var r3 = AllocateTuple(count);
        sw.Stop();
        var s3 = GC.GetTotalMemory(true) - startMemory;
        timing[2] = sw.Elapsed.TotalSeconds;

        Console.WriteLine($"| Storage | Iterations | Time | Size |");
        Console.WriteLine($"|---|---|---|---|");
        Console.WriteLine($"| Move struct | {r1.Count} allocations | {timing[0]} seconds | {s1} bytes |");
        Console.WriteLine($"| FMove struct | {r2.Count} allocations | {timing[1]} seconds | {s2} bytes |");
        Console.WriteLine($"| Tuple | {r3.Count} allocations | {timing[2]} seconds | {s3} bytes |");
        Console.WriteLine($"| Array | {r4.Count} allocations | {timing[3]} seconds | {s4} bytes |");
    }

    static List<Move> AllocateMove(int count)
    {
        var result = new List<Move>(count);
        for (int i = 0; i < count; i++)
        {
            result.Add(new Move(1, 123, 10));
        }
        return result;
    }

    static List<FMove> AllocateFMove(int count)
    {
        var result = new List<FMove>(count);
        for (int i = 0; i < count; i++)
        {
            result.Add(new FMove(1, 123, 10));
        }
        return result;
    }

    static List<(int from, int to, int type)> AllocateTuple(int count)
    {
        var result = new List<(int from, int to, int type)>(count);
        for (int i = 0; i < count; i++)
        {
            result.Add((1, 123, 10));
        }
        return result;
    }

    static List<int[]> AllocateArray(int count)
    {
        var result = new List<int[]>(count);
        for (int i = 0; i < count; i++)
        {
            result.Add(new int[] { 1, 123, 10 });
        }
        return result;
    }
}
Based on the comments, I decided to use BenchmarkDotNet for the above comparison also and the results are quite similar
| Method | Count | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| Move | 10000000 | 115.9 ms | 2.27 ms | 2.23 ms | 0.10 |
| FMove | 10000000 | 149.7 ms | 2.04 ms | 1.91 ms | 0.12 |
| Tuple | 10000000 | 154.8 ms | 2.99 ms | 2.80 ms | 0.13 |
| Array | 10000000 | 1,217.5 ms | 23.84 ms | 25.51 ms | 1.00 |
I decided to add a class allocation (called CMove) with the following definition
public class CMove
{
    public readonly int FromSquare;
    public readonly int ToSquare;
    public readonly int MoveType;

    public CMove(int fromSquare, int toSquare, int moveType)
    {
        FromSquare = fromSquare;
        ToSquare = toSquare;
        MoveType = moveType;
    }
}
I used the above as a baseline for benchmarking, and I also tried different allocation sizes. Here is a summary of the results:
Anything below 1.0 means it is faster than CMove. As you can see, array allocation is always bad. For a few allocations it does not matter much, but for a lot of allocations there are clear winners.
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where all the lookups now take 30% of the processing time. I'm wondering if it might be slow RAM; still, I would like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with an index p (equivalent to width * y + x) to fetch the result:
int[] newResults = new int[width * height];
int p = 0;
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++, p++) {
        newResults[p] = largeArray[lookUp[p]];
    }
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress was made by shortening the call stack: no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).
Would an IntPtr be faster? How would I go about that?
Attached below is a screenshot of time distribution:
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3, and should work something like the code below, which covers the single-loop scenario (you can probably work from here to get the double-loop version):
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P
{
    static int Gather(int[] source, int[] index, int[] results, bool avx)
    {   // normally you wouldn't have avx as a parameter; that is just so
        // I can turn it off and on for the test; likewise the "int" return
        // here is so I can monitor (in the test) how much we did in the "old"
        // loop, vs AVX2; in real code this would be void return
        int y = 0;
        if (Avx2.IsSupported && avx)
        {
            var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
            var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
            unsafe
            {
                fixed (int* sPtr = source)
                {
                    // note: here I'm assuming we are trying to fill "results" in
                    // a single outer loop; for a double-loop, you'll probably need
                    // to slice the spans
                    for (int i = 0; i < rv.Length; i++)
                    {
                        rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
                    }
                }
            }
            // move past everything we've processed via SIMD
            y += rv.Length * Vector256<int>.Count;
        }
        // now do anything left, which includes anything not aligned to 256 bits,
        // plus the "no AVX2" scenario
        int result = y;
        int end = results.Length; // hoist, since this is not the JIT recognized pattern
        for (; y < end; y++)
        {
            results[y] = source[index[y]];
        }
        return result;
    }

    static void Main()
    {
        // invent some random data
        var rand = new Random(12345);
        int size = 1024 * 512;
        int[] data = new int[size];
        for (int i = 0; i < data.Length; i++)
            data[i] = rand.Next(255);

        // build a fake index
        int[] index = new int[1024];
        for (int i = 0; i < index.Length; i++)
            index[i] = rand.Next(size);

        int[] results = new int[1024];

        void GatherLocal(bool avx)
        {
            // prove that we're getting the same data
            Array.Clear(results, 0, results.Length);
            int from = Gather(data, index, results, avx);
            Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
            for (int i = 0; i < 32; i++)
            {
                Console.Write(results[i].ToString("x2"));
            }
            Console.WriteLine();

            const int TimeLoop = 1024 * 512;
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < TimeLoop; i++)
                Gather(data, index, results, avx);
            watch.Stop();
            Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
            Console.WriteLine();
        }

        GatherLocal(false);
        if (Avx2.IsSupported) GatherLocal(true);
    }
}
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So this code will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and it will take a long time to iterate over. It does not matter how fast the memory is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialization would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but a 7 comes up in 6 out of 36 cases, while 2 and 12 each come up in only 1 out of 36 cases. So having the 7 stored is a lot more beneficial than the 2 and 12.
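A minimal sketch of that lazy-initialization idea, with a caller-supplied delegate standing in for the real calculation (all names here are assumptions, not from the question):

```csharp
using System;
using System.Collections.Generic;

public class LazyLookup
{
    private readonly Dictionary<int, int> cache = new Dictionary<int, int>();
    private readonly Func<int, int> compute;

    public LazyLookup(Func<int, int> compute) => this.compute = compute;

    public int Get(int key)
    {
        if (!cache.TryGetValue(key, out int value))
        {
            value = compute(key);   // the expensive work runs once per distinct key
            cache[key] = value;
        }
        return value;
    }

    // Only the keys actually requested ever occupy memory.
    public int CachedCount => cache.Count;
}
```

In a multithreaded setting, `ConcurrentDictionary<TKey,TValue>.GetOrAdd` would be the natural replacement for the plain Dictionary here.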
In this code I am trying to simulate a task that populates an array of structs, using unsafe code to get as much throughput as can be achieved. The issue is that when calling the function and iterating over the result, it shows different (garbage) characters, yet within the scope of GetSomeTs() it's fine: just before the return I test one of the elements and it prints the correct value.
This is the testing struct:
public unsafe struct T1
{
    public char* block = stackalloc char[5]; // <-- will not compile, so the stackalloc is done in a local variable inside a method
}

public unsafe struct T1
{
    public char* block;
}
static unsafe T1[] GetSomeTs(int ArrSz)
{
    char[] SomeValChars = { 'a', 'b', 'c', 'd', 'e' };
    T1[] RtT1Arr = new T1[ArrSz];
    for (int i = 0; i < RtT1Arr.Length; i++)
    {
        char* tmpCap = stackalloc char[5];
        for (int l = 0; l < 5; l++)
        {
            SomeValChars[4] = i.ToString()[0];
            tmpCap[l] = SomeValChars[l];
        }
        RtT1Arr[i].block = tmpCap;       // try 1
        //RtT1Arr[i].block = &tmpCap[0]; // try 2
    }
    // here it's fine
    Console.WriteLine("{0}", new string(RtT1Arr[1].block));
    return RtT1Arr;
}
But using it anywhere else prints garbage:
void Main()
{
    T1[] tstT1 = GetSomeTs(10);
    for (int i = 0; i < 10; i++)
    {
        Console.WriteLine("{0}", new string(tstT1[i].block)); //,0,5, Encoding.Default));
    }
}
When you allocate memory with stackalloc, that memory only exists until the function in which you allocated it returns. You are returning a pointer to memory that is no longer allowed to be accessed.
It's hard to recommend a fix because it's unclear what you want to achieve. Probably you should just use a managed char[].
Encoding.Default.GetBytes is pretty slow, so that's likely to be your hotspot anyway and the rest is less important. i.ToString() is also quite slow and produces garbage. If you are after performance, stop creating unneeded objects all the time, such as SomeValChars: create it once and reuse it.
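A sketch of the managed-array fix (my rewrite of the asker's method, under the assumption that the goal is just a 5-character block per element): each struct holds a heap-allocated char[], which safely outlives the method that created it:

```csharp
using System;

public struct T1
{
    public char[] Block; // heap-allocated, so it survives after GetSomeTs returns
}

public static class Demo
{
    public static T1[] GetSomeTs(int count)
    {
        var result = new T1[count];
        for (int i = 0; i < count; i++)
        {
            // each element owns its own array; no pointer into a dead stack frame
            result[i].Block = new[] { 'a', 'b', 'c', 'd', (char)('0' + i % 10) };
        }
        return result;
    }

    static void Main()
    {
        foreach (var t in GetSomeTs(10))
            Console.WriteLine(new string(t.Block)); // abcd0 .. abcd9
    }
}
```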
I know there are differences between jagged and multidimensional arrays. I know it is often preferable to use a "List<>" instead of arrays of arrays.
Could someone just explain me why, in the following code, the first is allowed but the second is an error? I just want to better understand C#...
Legal:
public class Banana
{
    double[,] _banana;

    public Banana(int h, int w)
    {
        _banana = new double[h, w];
    }
}
Illegal (Error: a constant value is expected instead of h and w):
public class Banana
{
    double[][] _banana;

    public Banana(int h, int w)
    {
        _banana = new double[h][w]{};
    }
}
TL;DR:
Why is it possible to initialize the dimensions of a multidimensional array with variables, but not a jagged array's?
An int[4,5] is a single object which holds twenty integers. An int[4][] is an object which holds four references to integer arrays. Having int[][] foo = new int[4][5]; be equivalent to:

int[][] foo = new int[4][];
for (int temp = 0; temp < 4; temp++)
    foo[temp] = new int[5];

would make about as much sense as having StringBuilder[] bar = new StringBuilder[4](); be equivalent to:

StringBuilder[] bar = new StringBuilder[4];
for (int temp = 0; temp < 4; temp++)
    bar[temp] = new StringBuilder();

Such a feature might be helpful in many cases, and there wouldn't be anything particularly wrong with it conceptually, but the code required to explicitly initialize array elements isn't particularly onerous, and writing such code explicitly helps make clear that the array of references and the things to which those references refer are all separate entities.
With a jagged array you have to initialize each "leg" separately - there's no syntax to initialize the size of each leg in one pass:
public Banana(int h, int w)
{
    _banana = new double[h][];
    for (int i = 0; i < h; i++)
    {
        _banana[i] = new double[w];
    }
}
Why is there no syntax? Because the spec doesn't require it, and in a "typical" jagged array the legs have different lengths, otherwise a 2-D array may be more appropriate.
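For instance, a triangular table is a case where the legs naturally differ in length, something a hypothetical one-shot `new double[h][w]` syntax could not express:

```csharp
using System;

class Program
{
    static void Main()
    {
        var tri = new int[4][];           // four references, no legs yet
        for (int i = 0; i < tri.Length; i++)
            tri[i] = new int[i + 1];      // row i has i + 1 elements

        for (int i = 0; i < tri.Length; i++)
            Console.WriteLine(tri[i].Length); // prints 1, 2, 3, 4
    }
}
```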
Each nested array in a jagged array can have a different length. If you had to initialize it using your syntax example, they would all be required to have the same length. It simply doesn't make sense. You'd have to use something like:
_banana = new double[h][];
for (var i = 0; i < h; i++)
{
    _banana[i] = new double[w];
}
I have the following array:
byte[][] A = new byte[256][];
Each element of this array references another array.
A[n] = new byte[256];
However, most elements reference the same array. In fact, array A only references two or three unique arrays.
Is there an easy way to determine how much memory the entire thing uses?
If your question is to find out the number of unique 1D arrays, you could do:
A.Distinct().Count()
This should do because equality of arrays works on reference-equality by default.
But perhaps you're looking for:
A.Distinct().Sum(oneDimArray => oneDimArray.Length) * sizeof(byte)
Of course, "number of bytes used by variables" is a somewhat imprecise term. In particular, the above expression doesn't account for the storage of the variable A, references in the jagged array, overhead, alignment etc.
EDIT: As Rob points out, you may need to filter null references out if the jagged-array can contain them.
You can estimate the cost of storing the references in the jagged-array with (unsafe context):
A.Length * sizeof(IntPtr)
I don't believe there's any built in functionality.
Whipped this up very quickly; I haven't tested it thoroughly, however:
void Main()
{
    byte[][] a = new byte[256][];
    var someArr = new byte[256];
    a[0] = someArr;
    a[1] = someArr;
    a[2] = new byte[256];
    getSize(a).Dump();
}

private long getSize(byte[][] arr)
{
    var hashSet = new HashSet<byte[]>();
    var size = 0;
    foreach (var innerArray in arr)
    {
        if (innerArray != null)
            hashSet.Add(innerArray);
    }
    foreach (var array in hashSet)
    {
        size += array.Length * sizeof(byte);
    }
    return size;
}
I just modified Rob's getSize method to use the Buffer helper class:
private long getSize(byte[][] arr)
{
    Dictionary<byte[], bool> lookup = new Dictionary<byte[], bool>();
    long size = 0;
    foreach (byte[] innerArray in arr)
    {
        if (innerArray == null || lookup.ContainsKey(innerArray)) continue;
        lookup.Add(innerArray, true);
        size += Buffer.ByteLength(innerArray);
    }
    return size;
}