This is basically a restatement of this question: Java: Multi-dimensional array vs. One-dimensional but for C#.
I have a set amount of elements that make sense to store as a grid.
Should I use an array[x*y] or an array[x][y]?
EDIT: Oh, so there are one dimensional array[x*y], multidimensional array[x,y] and jagged array[x][y], and I probably want jagged?
There are many advantages in C# to using jagged arrays (array[][]). They actually will often outperform multidimensional arrays.
That being said, I would personally use a multidimensional or jagged array rather than a single-dimensional array, as this matches the problem space more closely. A one-dimensional array adds complexity to your implementation without providing a real benefit over a 2D array, since internally a 2D array is still a single block of memory.
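For illustration, here is a minimal sketch of the three layouts being discussed (the grid dimensions are made up):

```csharp
// The same 4x3 grid of doubles in each of the three layouts.
int width = 4, height = 3;

double[] single = new double[width * height]; // one block, manual indexing
double[,] multi = new double[height, width];  // one block, runtime computes the offset
double[][] jagged = new double[height][];     // an array of independent row arrays
for (int y = 0; y < height; y++)
    jagged[y] = new double[width];

// Writing the cell at column x=2, row y=1 in each layout:
single[1 * width + 2] = 42.0;
multi[1, 2] = 42.0;
jagged[1][2] = 42.0;
```

Note that only the jagged layout allocates one object per row; the other two are each a single allocation.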
I ran a test on unreasonably large arrays and was surprised to see that jagged arrays ([y][x]) appear to be faster than a single-dimensional array with manual index multiplication ([y * ySize + x]). Multidimensional arrays ([,]) are slower, but not by much.
Of course you would have to test on your particular arrays, but it seems the difference isn't large, so you should just use whichever approach best fits what you are doing.
0.280 (100.0% | 0.0%) 'Jagged array 5,059x5,059 - 25,593,481'
| 0.006 (2.1% | 2.1%) 'Allocate'
| 0.274 (97.9% | 97.9%) 'Access'
0.336 (100.0% | 0.0%) 'TwoDim array 5,059x5,059 - 25,593,481'
| 0.000 (0.0% | 0.0%) 'Allocate'
| 0.336 (99.9% | 99.9%) 'Access'
0.286 (100.0% | 0.0%) 'SingleDim array 5,059x5,059 - 25,593,481'
| 0.000 (0.1% | 0.1%) 'Allocate'
| 0.286 (99.9% | 99.9%) 'Access'
0.552 (100.0% | 0.0%) 'Jagged array 7,155x7,155 - 51,194,025'
| 0.009 (1.6% | 1.6%) 'Allocate'
| 0.543 (98.4% | 98.4%) 'Access'
0.676 (100.0% | 0.0%) 'TwoDim array 7,155x7,155 - 51,194,025'
| 0.000 (0.0% | 0.0%) 'Allocate'
| 0.676 (100.0% | 100.0%) 'Access'
0.571 (100.0% | 0.0%) 'SingleDim array 7,155x7,155 - 51,194,025'
| 0.000 (0.1% | 0.1%) 'Allocate'
| 0.571 (99.9% | 99.9%) 'Access'
for (int i = 6400000; i < 100000000; i *= 2)
{
    int size = (int)Math.Sqrt(i);
    int totalSize = size * size;

    GC.Collect();
    ProfileTimer.Push(string.Format("Jagged array {0:N0}x{0:N0} - {1:N0}", size, totalSize));
    ProfileTimer.Push("Allocate");
    double[][] Jagged = new double[size][];
    for (int x = 0; x < size; x++)
    {
        Jagged[x] = new double[size];
    }
    ProfileTimer.PopPush("Allocate", "Access");
    double total = 0;
    for (int trials = 0; trials < 10; trials++)
    {
        for (int y = 0; y < size; y++)
        {
            for (int x = 0; x < size; x++)
            {
                total += Jagged[y][x];
            }
        }
    }
    ProfileTimer.Pop("Access");
    ProfileTimer.Pop("Jagged array");

    GC.Collect();
    ProfileTimer.Push(string.Format("TwoDim array {0:N0}x{0:N0} - {1:N0}", size, totalSize));
    ProfileTimer.Push("Allocate");
    double[,] TwoDim = new double[size, size];
    ProfileTimer.PopPush("Allocate", "Access");
    total = 0;
    for (int trials = 0; trials < 10; trials++)
    {
        for (int y = 0; y < size; y++)
        {
            for (int x = 0; x < size; x++)
            {
                total += TwoDim[y, x];
            }
        }
    }
    ProfileTimer.Pop("Access");
    ProfileTimer.Pop("TwoDim array");

    GC.Collect();
    ProfileTimer.Push(string.Format("SingleDim array {0:N0}x{0:N0} - {1:N0}", size, totalSize));
    ProfileTimer.Push("Allocate");
    double[] Single = new double[size * size];
    ProfileTimer.PopPush("Allocate", "Access");
    total = 0;
    for (int trials = 0; trials < 10; trials++)
    {
        for (int y = 0; y < size; y++)
        {
            int yOffset = y * size;
            for (int x = 0; x < size; x++)
            {
                total += Single[yOffset + x];
            }
        }
    }
    ProfileTimer.Pop("Access");
    ProfileTimer.Pop("SingleDim array");
}
Pros of array[x,y]:
- The runtime performs more checks for you. Each index is checked against its allowed range. With the other approach you could easily write something like a[y*numOfColumns + x] where x is larger than the number of columns, and the code would silently read a wrong value instead of throwing an exception.
- Clearer index access. a[x,y] is cleaner than a[y*numOfColumns + x].
Pros of array[x*y]:
- Easier iteration over the entire array: you need only one loop instead of two.
And the winner is... I would prefer array[x,y].
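The bounds-checking difference described above can be demonstrated with a short sketch (the array sizes are illustrative):

```csharp
int numOfColumns = 4, numOfRows = 3;
double[] flat = new double[numOfColumns * numOfRows];
double[,] grid = new double[numOfRows, numOfColumns];

// x = 5 is past the last column, but 0 * numOfColumns + 5 = 5 is still a
// valid flat index, so this silently reads a cell from the next row:
double wrong = flat[0 * numOfColumns + 5];

// The 2D array validates each index separately and throws instead:
bool threw = false;
try
{
    double boom = grid[0, 5];
}
catch (IndexOutOfRangeException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```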
I've run into something strange: when using AsSpan().Fill, it's twice as fast on a byte[] array as on an int or float array, even though they are all the same size in bytes. But it depends on the size of the arrays: on small arrays the speed is the same, while on larger ones the difference shows.
Here is a sample console application to illustrate
internal unsafe class Program {
    static byte[]? ByteFrame;
    static Int32[]? Int32Frame;
    static float[]? FloatFrame;
    static int[]? ResetCacheArray;

    static void Main(string[] args) {
        // size vars
        int Width = 1500;
        int Height = 1500;

        // Init frames
        ByteFrame = new byte[Width * Height * 4];
        ByteFrame.AsSpan().Fill(0);
        Int32Frame = new Int32[Width * Height];
        Int32Frame.AsSpan().Fill(0);
        FloatFrame = new float[Width * Height];
        FloatFrame.AsSpan().Fill(1);
        ResetCacheArray = new int[10000 * 10000];
        ResetCacheArray.AsSpan().Fill(1);

        // warmup jitter
        for (int i = 0; i < 200; i++) {
            ClearByteFrameAsSpanFill(0);
            ClearInt32FrameAsSpanFill(0);
            ClearFloatFrameAsSpanFill(0f);
            ClearCache();
        }

        Console.WriteLine(Environment.Is64BitProcess);

        int TestIterations;
        double nanoseconds;
        double MsDuration;
        double MB = 0;
        double MBSec;
        double GBSec;

        TestIterations = 1;
        nanoseconds = 1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency;
        for (int i = 0; i < TestIterations; i++) {
            MB = ClearByteFrameAsSpanFill(0);
        }
        MsDuration = (((1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency) - nanoseconds) / TestIterations) / 1000000;
        MBSec = (MB / MsDuration) * 1000;
        GBSec = MBSec / 1000;
        Console.WriteLine("ClearByteFrameAsSpanFill: MS:" + MsDuration + " GB/s:" + (int)GBSec + " MB/s:" + (int)MBSec);
        ClearCache();

        TestIterations = 1;
        nanoseconds = 1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency;
        for (int i = 0; i < TestIterations; i++) {
            MB = ClearInt32FrameAsSpanFill(1);
        }
        MsDuration = (((1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency) - nanoseconds) / TestIterations) / 1000000;
        MBSec = (MB / MsDuration) * 1000;
        GBSec = MBSec / 1000;
        Console.WriteLine("ClearInt32FrameAsSpanFill: MS:" + MsDuration + " GB/s:" + (int)GBSec + " MB/s:" + (int)MBSec);
        ClearCache();

        TestIterations = 1;
        nanoseconds = 1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency;
        for (int i = 0; i < TestIterations; i++) {
            MB = ClearFloatFrameAsSpanFill(1f);
        }
        MsDuration = (((1_000_000_000.0 * Stopwatch.GetTimestamp() / Stopwatch.Frequency) - nanoseconds) / TestIterations) / 1000000;
        MBSec = (MB / MsDuration) * 1000;
        GBSec = MBSec / 1000;
        Console.WriteLine("ClearFloatFrameAsSpanFill: MS:" + MsDuration + " GB/s:" + (int)GBSec + " MB/s:" + (int)MBSec);
        ClearCache();

        Console.ReadLine();
    }

    static double ClearByteFrameAsSpanFill(byte clearValue) {
        ByteFrame.AsSpan().Fill(clearValue);
        return ByteFrame.Length / 1000000;
    }
    static double ClearInt32FrameAsSpanFill(Int32 clearValue) {
        Int32Frame.AsSpan().Fill(clearValue);
        return (Int32Frame.Length * 4) / 1000000;
    }
    static double ClearFloatFrameAsSpanFill(float clearValue) {
        FloatFrame.AsSpan().Fill(clearValue);
        return (FloatFrame.Length * 4) / 1000000;
    }
    static void ClearCache() {
        int sum = 0;
        for (int i = 0; i < ResetCacheArray.Length; i++) {
            sum += ResetCacheArray[i];
        }
    }
}
On my machine it outputs the following:
ClearByteFrameAsSpanFill: MS:0,4913 GB/s:18 MB/s:18318
ClearInt32FrameAsSpanFill: MS:0,4851 GB/s:18 MB/s:18552
ClearFloatFrameAsSpanFill: MS:0,458 GB/s:19 MB/s:19650
It varies a little from run to run, ± a few GB/s, but roughly each operation takes the same amount of time.
Now when I change the size variables to Width = 4500, Height = 4500, it outputs the following:
ClearByteFrameAsSpanFill: MS:3,4015 GB/s:23 MB/s:23813
ClearInt32FrameAsSpanFill: MS:7,635 GB/s:10 MB/s:10609
ClearFloatFrameAsSpanFill: MS:7,4429 GB/s:10 MB/s:10882
This will obviously vary with RAM speed from machine to machine, but on mine at least it is as described: on "small" arrays the speed is the same, while on large arrays filling a byte array is twice as fast as filling an int or float array of the same byte length.
Does anyone have an explanation of this?
You are testing filling the byte array with 0 and filling the int array with 1:
ClearByteFrameAsSpanFill(0);
ClearInt32FrameAsSpanFill(1);
These cases have different optimisations.
If you fill an array of bytes with any value it will be around the same speed, because there's a processor instruction to fill a block of bytes with a specific byte value.
Although there may be processor instructions to fill an array of int or float values with non-zero values, they are likely to be slower than filling the block of memory with zero values.
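For example, filling an int[] with -1 and a byte[] with 0xFF produce identical memory contents, but take the two different code paths discussed above (a small sketch, not from the question):

```csharp
// Both buffers end up all-0xFF, but the byte fill takes the memset-style
// path (single-byte element) while the int fill goes through the general
// vectorized fill (4-byte element).
byte[] bytes = new byte[1024];
int[] ints = new int[256]; // same 1024 bytes

bytes.AsSpan().Fill(0xFF);
ints.AsSpan().Fill(-1);    // -1 is 0xFFFFFFFF, i.e. four 0xFF bytes
```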
I tried this out with the following code using BenchmarkDotNet:
[SimpleJob(RuntimeMoniker.Net60)]
public class UnderTest
{
    [Benchmark]
    public void FillBytesWithZero()
    {
        _bytes.AsSpan().Fill(0);
    }

    [Benchmark]
    public void FillBytesWithOne()
    {
        _bytes.AsSpan().Fill(1);
    }

    [Benchmark]
    public void FillIntsWithZero()
    {
        _ints.AsSpan().Fill(0);
    }

    [Benchmark]
    public void FillIntsWithOne()
    {
        _ints.AsSpan().Fill(1);
    }

    const int COUNT = 1500 * 1500;

    static readonly byte[] _bytes = new byte[COUNT * sizeof(int)];
    static readonly int[] _ints = new int[COUNT];
}
With the following results:
For COUNT = 1500 * 1500:
| Method | Mean | Error | StdDev | Median |
|------------------ |---------:|---------:|---------:|---------:|
| FillBytesWithZero | 299.7 us | 7.82 us | 22.95 us | 299.3 us |
| FillBytesWithOne | 305.6 us | 11.46 us | 33.80 us | 293.3 us |
| FillIntsWithZero | 322.4 us | 2.37 us | 2.10 us | 321.6 us |
| FillIntsWithOne | 502.9 us | 27.68 us | 81.60 us | 534.4 us |
For COUNT = 4500 * 4500:
| Method | Mean | Error | StdDev |
|------------------ |---------:|----------:|----------:|
| FillBytesWithZero | 2.554 ms | 0.0307 ms | 0.0240 ms |
| FillBytesWithOne | 2.632 ms | 0.0522 ms | 0.1101 ms |
| FillIntsWithZero | 4.169 ms | 0.0258 ms | 0.0229 ms |
| FillIntsWithOne | 4.979 ms | 0.0488 ms | 0.0433 ms |
Note how filling a byte array with 0 or 1 is significantly faster.
If you inspect the source code for Span<T>.Fill() you'll see this:
public void Fill(T value)
{
    if (Unsafe.SizeOf<T>() == 1)
    {
        // Special-case single-byte types like byte / sbyte / bool.
        // The runtime eventually calls memset, which can efficiently support large buffers.
        // We don't need to check IsReferenceOrContainsReferences because no references
        // can ever be stored in types this small.
        Unsafe.InitBlockUnaligned(ref Unsafe.As<T, byte>(ref _reference), Unsafe.As<T, byte>(ref value), (uint)_length);
    }
    else
    {
        // Call our optimized workhorse method for all other types.
        SpanHelpers.Fill(ref _reference, (uint)_length, value);
    }
}
This explains why filling a byte array is faster than filling an int array: It uses Unsafe.InitBlockUnaligned() for a byte array and SpanHelpers.Fill(ref _reference, (uint)_length, value); for a non-byte array.
Unsafe.InitBlockUnaligned() happens to be more performant; it's implemented as an intrinsic which performs the following:
ldarg .0
ldarg .1
ldarg .2
unaligned. 0x1
initblk
ret
Whereas SpanHelpers.Fill() is much less optimised.
It tries its best, using vectorised instructions to fill the memory if possible, but it can't compete with initblk. (It's too long to post here, but you can follow that link to look at it.)
One thing this doesn't explain is why filling an int array with zeroes is slightly faster than filling it with ones. To explain this you'd have to look at the actual processor instructions that the JIT produces, but it's definitely faster to fill a block of bytes with all 0's than it is to fill a block of bytes with 1,0,0,0 (which it would have to do for an int value of 1).
It's probably down to the comparative speeds of instructions like rep stosb (for bytes) and rep stosw (for words).
The outlier in these results is that the unaligned.1 initblk opcode sequence is about 50% faster for the smaller block size. The other times all scale up by approximately the increase in size of the memory block, i.e. around 9 times slower for the blocks that are 9 times bigger.
So the remaining question is: Why is initblk 50% faster per-byte for smaller buffer sizes (2_250_000 versus 20_250_000 bytes)?
I struggle to understand why my usage of the intrinsics API is slower than a simple sum with a foreach loop.
public class ArraySum
{
    private double[] data;

    public ArraySum()
    {
        if (!Avx.IsSupported)
        {
            throw new Exception("Avx is not supported");
        }
        var rnd = new Random();
        var list = new List<double>();
        for (int i = 0; i < 100_000; i++)
        {
            list.Add(rnd.Next(500));
        }
        data = list.ToArray();
    }

    [Benchmark]
    public void Native()
    {
        int result = 0;
        foreach (int i in data)
        {
            result += i;
        }
        Console.WriteLine($"Native: {result}");
    }

    [Benchmark]
    public unsafe void Intrinsics()
    {
        int vectorSize = 256 / 8 / 4;
        var accVector = Vector256<double>.Zero;
        int i;
        var array = data;
        fixed (double* ptr = array)
        {
            for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
            {
                var v = Avx.LoadVector256(ptr + i);
                accVector = Avx.Add(accVector, v);
            }
        }
        double result = 0;
        var temp = stackalloc double[vectorSize];
        Avx.Store(temp, accVector);
        for (int j = 0; j < vectorSize; j++)
        {
            result += temp[j];
        }
        for (; i < array.Length; i++)
        {
            result += array[i];
        }
        Console.WriteLine($"Intrinsics: {result}");
    }
}
Result:
.NET SDK=6.0.100-rc.2.21505.57
| Method | Mean | Error | StdDev | Median |
|----------- |---------:|---------:|---------:|---------:|
| Native | 387.6 us | 12.15 us | 35.83 us | 405.8 us |
| Intrinsics | 393.2 us | 9.01 us | 25.70 us | 385.0 us |
What may be causing this?
It's running on Windows on an Intel Core i5-3340M CPU @ 2.70GHz (Ivy Bridge), if that matters.
BenchmarkDotNet warns that ArraySum.Native: Default -> It seems that the distribution is bimodal (mValue = 3.92)
I just realized that the native method should perform the sum on doubles, not ints (oops):
[Benchmark]
public void Native()
{
    double result = 0;
    foreach (double i in data)
    {
        result += i;
    }
    Console.WriteLine($"Native: {result}");
}
| Method | Mean | Error | StdDev | Median |
|----------- |---------:|---------:|---------:|---------:|
| Native | 415.1 us | 25.35 us | 73.95 us | 385.9 us |
| Intrinsics | 388.7 us | 7.58 us | 21.74 us | 384.7 us |
but also: Console.WriteLine probably adds too much overhead, far more than the time spent performing the sum itself, and skews the results. With it removed, the difference is more significant:
[Benchmark]
public double Native()
{
    double result = 0;
    foreach (double i in data)
    {
        result += i;
    }
    return result;
}

[Benchmark]
public unsafe double Intrinsics()
{
    int vectorSize = 256 / 8 / 4;
    var accVector = Vector256<double>.Zero;
    int i;
    var array = data;
    fixed (double* ptr = array)
    {
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
        {
            var v = Avx.LoadVector256(ptr + i);
            accVector = Avx.Add(accVector, v);
        }
    }
    double result = 0;
    var temp = stackalloc double[vectorSize];
    Avx.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++)
    {
        result += temp[j];
    }
    for (; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}
| Method | Mean | Error | StdDev |
|----------- |---------:|---------:|---------:|
| Native | 92.92 us | 1.547 us | 1.447 us |
| Intrinsics | 25.06 us | 0.459 us | 1.090 us |
I have implemented a very simple binarySearch implementation in C# for finding integers in an integer array:
Binary Search
static int binarySearch(int[] arr, int i)
{
    int low = 0, high = arr.Length - 1, mid;
    while (low <= high)
    {
        mid = (low + high) / 2;
        if (i < arr[mid])
            high = mid - 1;
        else if (i > arr[mid])
            low = mid + 1;
        else
            return mid;
    }
    return -1;
}
When comparing it to C#'s native Array.BinarySearch() I can see that Array.BinarySearch() is more than twice as fast as my function, every single time.
MSDN on Array.BinarySearch:
Searches an entire one-dimensional sorted array for a specific element, using the IComparable generic interface implemented by each element of the Array and by the specified object.
What makes this approach so fast?
Test code
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        Random rnd = new Random();
        Stopwatch sw = new Stopwatch();
        const int ELEMENTS = 10000000;
        int temp;

        int[] arr = new int[ELEMENTS];
        for (int i = 0; i < ELEMENTS; i++)
            arr[i] = rnd.Next(int.MinValue, int.MaxValue);
        Array.Sort(arr);

        // Custom binarySearch
        sw.Restart();
        for (int i = 0; i < ELEMENTS; i++)
            temp = binarySearch(arr, i);
        sw.Stop();
        Console.WriteLine($"Elapsed time for custom binarySearch: {sw.ElapsedMilliseconds}ms");

        // C# Array.BinarySearch
        sw.Restart();
        for (int i = 0; i < ELEMENTS; i++)
            temp = Array.BinarySearch(arr, i);
        sw.Stop();
        Console.WriteLine($"Elapsed time for C# BinarySearch: {sw.ElapsedMilliseconds}ms");
    }

    static int binarySearch(int[] arr, int i)
    {
        int low = 0, high = arr.Length - 1, mid;
        while (low <= high)
        {
            mid = (low + high) / 2;
            if (i < arr[mid])
                high = mid - 1;
            else if (i > arr[mid])
                low = mid + 1;
            else
                return mid;
        }
        return -1;
    }
}
Test results
+------------+--------------+--------------------+
| Attempt No | binarySearch | Array.BinarySearch |
+------------+--------------+--------------------+
| 1 | 2700ms | 1099ms |
| 2 | 2696ms | 1083ms |
| 3 | 2675ms | 1077ms |
| 4 | 2690ms | 1093ms |
| 5 | 2700ms | 1086ms |
+------------+--------------+--------------------+
Your code is faster when run outside Visual Studio:
Yours vs Array's:
From VS - Debug mode: 3248 vs 1113
From VS - Release mode: 2932 vs 1100
Running exe - Debug mode: 3152 vs 1104
Running exe - Release mode: 559 vs 1104
Array's code might already be optimized in the framework, but it also does a lot more checking than your version (for instance, your version may overflow if arr.Length is greater than int.MaxValue / 2) and, as already said, is designed for a wide range of types, not just int[].
So, basically, it's slower only when you are debugging your code, because Array's code always runs as release code, with less overhead behind the scenes.
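As an aside, the overflow mentioned above is easy to guard against; the usual midpoint trick (a generic sketch, not the framework's actual code) is:

```csharp
// (low + high) / 2 can wrap negative when low + high exceeds int.MaxValue;
// low + (high - low) / 2 computes the same midpoint without overflowing.
static int SafeMid(int low, int high) => low + (high - low) / 2;

int low = int.MaxValue - 2, high = int.MaxValue - 1;
Console.WriteLine((low + high) / 2 < 0); // True: the naive form overflowed
Console.WriteLine(SafeMid(low, high));   // 2147483645
```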
Is there a faster way of doing this using C#?
double[,] myArray = new double[length1, length2];
for(int i=0;i<length1;i++)
for(int j=0;j<length2;j++)
myArray[i,j] = double.PositiveInfinity;
I remember that in C++ there was something called memset() for doing this kind of thing...
A multi-dimensional array is just a large block of memory, so we can treat it like one, similar to how memset() works. This requires unsafe code. I wouldn't say it's worth doing unless it's really performance critical. This is a fun exercise, though, so here are some benchmarks using BenchmarkDotNet:
public class ArrayFillBenchmark
{
    const int length1 = 1000;
    const int length2 = 1000;
    readonly double[,] _myArray = new double[length1, length2];

    [Benchmark]
    public void MultidimensionalArrayLoop()
    {
        for (int i = 0; i < length1; i++)
            for (int j = 0; j < length2; j++)
                _myArray[i, j] = double.PositiveInfinity;
    }

    [Benchmark]
    public unsafe void MultidimensionalArrayNaiveUnsafeLoop()
    {
        fixed (double* a = &_myArray[0, 0])
        {
            double* b = a;
            for (int i = 0; i < length1; i++)
                for (int j = 0; j < length2; j++)
                    *b++ = double.PositiveInfinity;
        }
    }

    [Benchmark]
    public unsafe void MultidimensionalSpanFill()
    {
        fixed (double* a = &_myArray[0, 0])
        {
            double* b = a;
            var span = new Span<double>(b, length1 * length2);
            span.Fill(double.PositiveInfinity);
        }
    }

    [Benchmark]
    public unsafe void MultidimensionalSseFill()
    {
        var vectorPositiveInfinity = Vector128.Create(double.PositiveInfinity);
        fixed (double* a = &_myArray[0, 0])
        {
            double* b = a;
            ulong i = 0;
            int size = Vector128<double>.Count;
            ulong length = length1 * length2;
            for (; i < (length & ~(ulong)15); i += 16)
            {
                Sse2.Store(b + size * 0, vectorPositiveInfinity);
                Sse2.Store(b + size * 1, vectorPositiveInfinity);
                Sse2.Store(b + size * 2, vectorPositiveInfinity);
                Sse2.Store(b + size * 3, vectorPositiveInfinity);
                Sse2.Store(b + size * 4, vectorPositiveInfinity);
                Sse2.Store(b + size * 5, vectorPositiveInfinity);
                Sse2.Store(b + size * 6, vectorPositiveInfinity);
                Sse2.Store(b + size * 7, vectorPositiveInfinity);
                b += size * 8;
            }
            for (; i < (length & ~(ulong)7); i += 8)
            {
                Sse2.Store(b + size * 0, vectorPositiveInfinity);
                Sse2.Store(b + size * 1, vectorPositiveInfinity);
                Sse2.Store(b + size * 2, vectorPositiveInfinity);
                Sse2.Store(b + size * 3, vectorPositiveInfinity);
                b += size * 4;
            }
            for (; i < (length & ~(ulong)3); i += 4)
            {
                Sse2.Store(b + size * 0, vectorPositiveInfinity);
                Sse2.Store(b + size * 1, vectorPositiveInfinity);
                b += size * 2;
            }
            for (; i < length; i++)
            {
                *b++ = double.PositiveInfinity;
            }
        }
    }
}
Results:
| Method | Mean | Error | StdDev | Ratio |
|------------------------------------- |-----------:|----------:|----------:|------:|
| MultidimensionalArrayLoop | 1,083.1 us | 11.797 us | 11.035 us | 1.00 |
| MultidimensionalArrayNaiveUnsafeLoop | 436.2 us | 8.567 us | 8.414 us | 0.40 |
| MultidimensionalSpanFill | 321.2 us | 6.404 us | 10.875 us | 0.30 |
| MultidimensionalSseFill | 231.9 us | 4.616 us | 11.323 us | 0.22 |
MultidimensionalArrayLoop is slow because of bounds checking. The JIT emits code each loop that makes sure that [i, j] is inside the bounds of the array. The JIT can elide bounds checking sometimes, I know it does for single-dimensional arrays. I'm not sure if it does it for multi-dimensional.
MultidimensionalArrayNaiveUnsafeLoop is essentially the same code as MultidimensionalArrayLoop but without bounds checking. It's considerably faster, taking 40% of the time. It's considered 'naive', though, because the loop could still be improved by unrolling.
MultidimensionalSpanFill also has no bounds check and is more-or-less the same as MultidimensionalArrayNaiveUnsafeLoop; however, Span.Fill internally does loop unrolling, which is why it's a bit faster than our naive unsafe loop. It takes only 30% of the time of our original.
MultidimensionalSseFill improves on our first unsafe loop by doing two things: loop unrolling and vectorizing. This requires a CPU with Sse2 support, but it allows us to write 128-bits (16 bytes) in a single instruction. This gives us an additional speed boost, taking it down to 22% of the original. Interestingly, this same loop with Avx (256-bits) was consistently slower than the Sse2 version, so that benchmark is not included here.
But these numbers only apply to an array that is 1000x1000. As you change the size of the array, the results differ. For example, when we change the array size to 10000x10000, the results for all of the unsafe benchmarks are very close. Probably because there are more memory fetches for the larger array that it tends to equalize the smaller iterative improvements seen in the last three benchmarks.
There's a lesson in there somewhere, but I mostly just wanted to share these results, since it was a pretty fun experiment to do.
I wrote a method that is not faster, but it works with arrays of any rank, not only 2D.
public static class ArrayExtensions
{
    public static void Fill(this Array array, object value)
    {
        var indices = new int[array.Rank];
        Fill(array, 0, indices, value);
    }

    public static void Fill(Array array, int dimension, int[] indices, object value)
    {
        if (dimension < array.Rank)
        {
            for (int i = array.GetLowerBound(dimension); i <= array.GetUpperBound(dimension); i++)
            {
                indices[dimension] = i;
                Fill(array, dimension + 1, indices, value);
            }
        }
        else
        {
            array.SetValue(value, indices);
        }
    }
}
double[,] myArray = new double[x, y];

if (parallel == true)
{
    stopWatch.Start();
    System.Threading.Tasks.Parallel.For(0, x, i =>
    {
        for (int j = 0; j < y; ++j)
            myArray[i, j] = double.PositiveInfinity;
    });
    stopWatch.Stop();
    Print("Elapsed milliseconds: {0}", stopWatch.ElapsedMilliseconds);
}
else
{
    stopWatch.Start();
    for (int i = 0; i < x; ++i)
        for (int j = 0; j < y; ++j)
            myArray[i, j] = double.PositiveInfinity;
    stopWatch.Stop();
    Print("Elapsed milliseconds: {0}", stopWatch.ElapsedMilliseconds);
}
When setting x and y to 10000 I get 553 milliseconds for the single-threaded approach and 170 for the multi-threaded one.
There is a way to quickly fill a multidimensional array that does not use the unsafe keyword (see the answers to this question).
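For example, on .NET Core 2.1+ you can get a Span<double> over a 2D array's backing block without the unsafe keyword via MemoryMarshal.CreateSpan (a sketch under that assumption, not one of the benchmarked methods above):

```csharp
using System;
using System.Runtime.InteropServices;

double[,] grid = new double[1000, 1000];

// CreateSpan takes a ref to the first element and a total element count;
// a double[,] is one contiguous block, so this span covers the whole array.
Span<double> span = MemoryMarshal.CreateSpan(ref grid[0, 0], grid.Length);
span.Fill(double.PositiveInfinity);

Console.WriteLine(double.IsPositiveInfinity(grid[999, 999])); // True
```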
Following on from the initial thread "make efficient the copy of symmetric matrix in c-sharp" from cMinor: I would be quite interested in some input on how to build a symmetric square matrix multiplication with one line vector and one column vector, using an array implementation of the matrix instead of the classical
double s = 0;
List<double> columnVector = new List<double>(N);
List<double> lineVector = new List<double>(N);
//- init. vectors and symmetric square matrix m
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        s += lineVector[i] * columnVector[j] * m[i, j];
    }
}
Thanks for your input!
The line vector times symmetric matrix equals to the transpose of the matrix times the column vector. So only the column vector case needs to be considered.
Originally the i-th element of y=A*x is defined as
y[i] = SUM( A[i,j]*x[j], j=0..N-1 )
but since A is symmetric, the sum can be split into two sums, one below the diagonal and the other above:
y[i] = SUM( A[i,j]*x[j], j=0..i-1) + SUM( A[i,j]*x[j], j=i..N-1 )
From the other posting the matrix index is
A[i,j] = A[i*N-i*(i+1)/2+j] // j>=i
A[i,j] = A[j*N-j*(j+1)/2+i] // j< i
For an N×N symmetric matrix, A = new double[N*(N+1)/2];
In C# code the above is:
int k;
for (int i = 0; i < N; i++)
{
    // start sum with zero
    y[i] = 0;

    // below diagonal
    k = i;
    for (int j = 0; j <= i - 1; j++)
    {
        y[i] += A[k] * x[j];
        k += N - j - 1;
    }

    // above diagonal
    k = i * N - i * (i + 1) / 2 + i;
    for (int j = i; j <= N - 1; j++)
    {
        y[i] += A[k] * x[j];
        k++;
    }
}
Example for you to try:
| -7 -6 -5 -4 -3 | | -2 | | -5 |
| -6 -2 -1 0 1 | | -1 | | 21 |
| -5 -1 2 3 4 | | 0 | = | 42 |
| -4 0 3 5 6 | | 1 | | 55 |
| -3 1 4 6 7 | | 7 | | 60 |
To get the quadratic form, take a dot product with the multiplication result vector: x·A·y = Dot(x, A*y).
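The packed index mapping above can be sketched as a small self-contained check (the example values are made up):

```csharp
// Maps (i, j) into the packed upper-triangle array using the formulas from
// the answer; for j < i the symmetric entry (j, i) is used instead.
static int PackedIndex(int i, int j, int N)
{
    if (j < i) (i, j) = (j, i); // symmetry: A[i,j] == A[j,i]
    return i * N - i * (i + 1) / 2 + j;
}

int N = 3;
double[] A = new double[N * (N + 1) / 2]; // 6 slots for a 3x3 symmetric matrix

// Store m[i,j] = 10*i + j for the upper triangle; m[j,i] reads the same slot.
for (int i = 0; i < N; i++)
    for (int j = i; j < N; j++)
        A[PackedIndex(i, j, N)] = 10 * i + j;

// (2,1) and (1,2) share the same storage:
Console.WriteLine(A[PackedIndex(2, 1, N)]); // 12
```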
You could make matrix multiplication pretty fast with unsafe code. I have blogged about it.
Making matrix multiplication as fast as possible is easy: Use a well-known library. Insane amounts of performance work has gone into such libraries. You cannot compete with that.