Executive Summary: Reed's answer below is the fastest if you want to stay in C#. If you're willing to marshal to C++ (which I am), that's a faster solution.
I have two 55mb ushort arrays in C#. I am combining them using the following loop:
float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
    image.DataArray[i] =
        (ushort)(mUIHandler.image1.DataArray[i] +
        (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}
This code, according to DateTime.Now calls added before and after it, takes 3.5 seconds to run. How can I make it faster?
EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand new WPF application, I get these timing results:
Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods
So when the arrays are walked directly, the time is much faster than when the arrays live inside another object or container. This suggests that somewhere I'm going through an accessor method rather than touching the arrays directly. Even so, the fastest I seem to be able to get is half a second. When I run the second listing in C++ with icc, I get:
Run time for pointer walk: 0.0743338
In this case, then, C++ is 7x faster (using icc; I'm not sure whether the same performance can be obtained with MSVC, as I'm not as familiar with the optimizations there). Is there any way to get C# near that level of C++ performance, or should I just have C# call my C++ routine?
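One thing worth trying before leaving C#: hoist the accessor results into locals so the JIT sees plain array references and can elide the bounds checks. A minimal, untimed sketch, using the names from Listing 1 below:

ushort[] outArr = theArrayHolder.getOutput();
ushort[] in1 = theArrayHolder.getInput1();
ushort[] in2 = theArrayHolder.getInput2();
for (int i = 0; i < outArr.Length; i++)
{
    // Plain local array accesses; the JIT can now prove the bounds.
    outArr[i] = (ushort)(in1[i] + (ushort)(b * (float)in2[i]));
}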
Listing 1, C# code:
public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;

    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}
/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();

        Random random = new Random();
        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];
        ArrayHolder theArrayHolder = new ArrayHolder(length);
        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);

        stopwatch.Reset();
        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);

        stopwatch.Reset();
        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
    }
}
Listing 2, C++ equivalent:
// looptiming.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>
int _tmain(int argc, _TCHAR* argv[])
{
    int length = 55*1024*1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;
    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = rand()%max;
        *in1Ptr = rand()%max;
        *in2Ptr = rand()%max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;           // For converting tick into real time

    QueryPerformanceCounter(&tick1);

    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number/100.0f;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);
    time.QuadPart = tick2.QuadPart - tick1.QuadPart;

    std::cout << "Run time for pointer walk: "
              << (double)time.QuadPart/(double)ticksPerSecond.QuadPart << std::endl;
    return 0;
}
EDIT 2: Enabling /QxHost in the second example drops the time down to 0.0662714 seconds. Modifying the first loop as @Reed suggested gets me down to
Time elapsed: 00:00:00.3835017
So, still not fast enough for a slider. That time is via the code:
stopwatch.Start();
Parallel.ForEach(Partitioner.Create(0, length),
    (range) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        }
    });
stopwatch.Stop();
EDIT 3: As per @Eric Lippert's suggestion, I've rerun the C# code in release and, rather than running under an attached debugger, I print the results to a dialog. They are:
Simple arrays: ~0.273s
Contained arrays: ~0.330s
Accessor arrays: ~0.345s
Parallel arrays: ~0.190s
(these numbers come from a 5 run average)
So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but it is still well short of the 0.074 seconds achievable with the icc-compiled C++. It seems, therefore, that the fastest solution is to compile in release and then marshal to an icc-compiled C++ routine, which makes using a slider possible here.
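For reference, the marshalling boilerplate is small; the sketch below assumes a hypothetical native DLL name (ImageOps.dll) and export (BlendArrays) matching the C++ loop in Listing 2, and needs using System.Runtime.InteropServices; plus the /unsafe compiler switch:

// Hypothetical native export, e.g. compiled with icc:
// extern "C" __declspec(dllexport) void BlendArrays(
//     unsigned short* output, const unsigned short* input1,
//     const unsigned short* input2, int length, float b);
[DllImport("ImageOps.dll", CallingConvention = CallingConvention.Cdecl)]
static extern unsafe void BlendArrays(ushort* output, ushort* input1,
                                      ushort* input2, int length, float b);

// Pin the managed arrays and hand their addresses to the native routine.
unsafe
{
    fixed (ushort* pOut = output, pIn1 = input1, pIn2 = input2)
    {
        BlendArrays(pOut, pIn1, pIn2, length, b);
    }
}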
EDIT 4: Three more suggestions from @Eric Lippert: change the for-loop bound from length to output.Length (so the JIT can elide the bounds checks), use doubles instead of floats, and try unsafe code.
For those three, the timing is now:
length: ~0.274s
doubles, not floats: ~0.290s
unsafe: ~0.376s
So far, the parallel solution is the big winner. Although if I could do this addition in a shader, maybe I could see some kind of speedup there...
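Short of a shader, another route to icc-style vectorization is SIMD. This did not exist in the .NET 4 timeframe of this question, so the sketch below is hypothetical: it assumes a much later runtime (.NET 4.6+ with RyuJIT) and the System.Numerics.Vectors package, and is untested here:

using System.Numerics;

// Vectorized blend: widen ushort -> uint, convert to float, scale,
// convert back, narrow, then add. Matches the scalar casts for
// in-range values.
static void BlendSimd(ushort[] output, ushort[] input1, ushort[] input2, float b)
{
    int w = Vector<ushort>.Count;
    var scale = new Vector<float>(b);
    int i = 0;
    for (; i <= output.Length - w; i += w)
    {
        var v1 = new Vector<ushort>(input1, i);
        var v2 = new Vector<ushort>(input2, i);
        Vector.Widen(v2, out Vector<uint> lo, out Vector<uint> hi);
        var flo = Vector.ConvertToSingle(lo) * scale;
        var fhi = Vector.ConvertToSingle(hi) * scale;
        var scaled = Vector.Narrow(
            Vector.AsVectorUInt32(Vector.ConvertToInt32(flo)),
            Vector.AsVectorUInt32(Vector.ConvertToInt32(fhi)));
        (v1 + scaled).CopyTo(output, i);
    }
    for (; i < output.Length; i++)   // scalar tail
        output[i] = (ushort)(input1[i] + (ushort)(b * input2[i]));
}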
Here's the additional code for the three tests above:
stopwatch.Reset();
stopwatch.Start();
double b2 = ((double)number) / 100.0;
for (int i = 0; i < output.Length; ++i)
{
    output[i] =
        (ushort)(input1[i] +
        (ushort)(b2 * (double)input2[i]));
}
stopwatch.Stop();
DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;

stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < output.Length; ++i)
{
    output[i] =
        (ushort)(input1[i] +
        (ushort)(b * input2[i]));
}
stopwatch.Stop();
LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;
Console.WriteLine("Time elapsed: {0}",
    stopwatch.Elapsed);

stopwatch.Reset();
stopwatch.Start();
unsafe
{
    fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2)
    {
        ushort* outP = outPtr;
        ushort* in1P = in1Ptr;
        ushort* in2P = in2Ptr;
        for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
        {
            *outP = (ushort)(*in1P + b * (float)*in2P);
        }
    }
}
stopwatch.Stop();
UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;
Console.WriteLine("Time elapsed: {0}",
    stopwatch.Elapsed);
This should be perfectly parallelizable. However, given the small amount of work being done per element, you'll need to handle this with extra care.
The proper way to do this (in .NET 4) would be to use Parallel.ForEach in conjunction with a Partitioner:
float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length),
    (range) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            image.DataArray[i] =
                (ushort)(mUIHandler.image1.DataArray[i] +
                (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
        }
    });
This will efficiently partition the work across available processing cores in your system, and should provide a decent speedup if you have multiple cores.
That being said, this will, at best, only speed up this operation by the number of cores in your system. If you need to speed it up more, you'll likely need to revert to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.
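A rough, untested sketch of that mix, reusing the plain arrays from the listings above (requires compiling with /unsafe; each partition pins the arrays and walks them with raw pointers):

Parallel.ForEach(Partitioner.Create(0, length), range =>
{
    unsafe
    {
        fixed (ushort* pOut = output, pIn1 = input1, pIn2 = input2)
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // No bounds checks inside the hot loop.
                pOut[i] = (ushort)(pIn1[i] + (ushort)(b * pIn2[i]));
            }
        }
    }
});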
Assuming you have a lot of these elements (and you're using .NET 4), you can attempt to parallelize the operation:
Parallel.For(0, length, i =>
{
    image.DataArray[i] =
        (ushort)(mUIHandler.image1.DataArray[i] +
        (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
});
Of course, that is all going to depend on whether or not parallelizing this is worth it. The loop body looks computationally cheap, and indexed array access is already fast; but you might still see gains because the loop runs so many times over that much data.
Related
I'm trying to write code to find prime numbers within a given range. Unfortunately I'm running into problems with too many repetitions: they give me a StackOverflowException after prime number 30000. I have tried using a foreach, and also not using a list (handling each number as it comes), but nothing seems to handle the problem at hand.
How can I make this program run forever without causing a stack overflow?
class Program
{
    static void Main(string[] args)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        List<double> Primes = new List<double>();
        const double Start = 0;
        const double End = 100000;
        double counter = 0;
        int lastInt = 0;

        for (int i = 0; i < End; i++)
            Primes.Add(i);

        for (int i = 0; i < Primes.Count; i++)
        {
            lastInt = (int)Primes[i] - RoundOff((int)Primes[i]);
            Primes[i] = (int)CheckForPrime(Primes[i], Math.Round(Primes[i] / 2));
            if (Primes[i] != 0)
            {
                Console.Write(", {0}", Primes[i]);
                counter++;
            }
        }

        stopwatch.Stop();
        Console.WriteLine("\n\nNumber of prime-numbers between {0} and {1} is: {2}, time it took to calc this: {3} (milliseconds).\n\n" +
                          "              The End\n", Start, End, counter, stopwatch.ElapsedMilliseconds);
    }

    public static double CheckForPrime(double Prim, double Devider)
    {
        if (Prim / Devider == Math.Round(Prim / Devider))
            return 0;
        else if (Devider > 2)
            return CheckForPrime(Prim, Devider - 1);
        else
            return Prim;
    }

    public static int RoundOff(int i)
    {
        return ((int)Math.Floor(i / 10.0)) * 10;
    }
}
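The usual cure for this kind of StackOverflowException, sketched here for illustration, is to translate the recursion in CheckForPrime into a loop; the logic is a mechanical translation of the recursive version, so the stack depth no longer grows with the divider:

// Same divisibility test as the recursive CheckForPrime, as a loop.
public static double CheckForPrime(double Prim, double Devider)
{
    while (true)
    {
        if (Prim / Devider == Math.Round(Prim / Devider))
            return 0;
        if (Devider > 2)
            Devider -= 1;   // was: return CheckForPrime(Prim, Devider - 1);
        else
            return Prim;
    }
}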
const double pi = 3.1415926535897;

static double mysin(double x) {
    return ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
                 0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
                 1.000023121) * x;
}

static void Main(string[] args) {
    Stopwatch sw = new Stopwatch();
    double a = 0;
    double[] arg = new double[1000000];
    for (int i = 0; i < 1000000; i++) {
        arg[i] = (pi / 2000000);
    }

    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + Math.Sin(arg[i]);
    }
    sw.Stop();
    double t1 = (double)(sw.Elapsed.TotalMilliseconds);

    a = 0;
    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + mysin(arg[i]);
    }
    sw.Stop();
    double t2 = (double)(sw.Elapsed.TotalMilliseconds);

    Console.WriteLine("{0}\n{1}\n", t1, t2);
    Console.Read();
}
This power series is valid on [0, pi/2], and it is 10 times slower than the built-in sine function in release mode: 1 ms vs 10 ms.
But when I copy-paste the body of mysin directly into the loop, I get practically the same time in release, and my code is about 4 times faster in debug mode.
a = 0;
sw.Restart();
for (int i = 0; i < 1000000; i++) {
    double x = arg[i];
    a = a + ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
                 0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
                 1.000023121) * x;
    //a = a + mysin(arg[i]);
}
What is the deal here? How do I make this sort of calculation faster? I am guessing the runtime recognizes that the sin routine should not be called but rather inlined into the loop. How do I make the compiler do the same for me?
One more question: would C++ do the same optimization for its default sin/cos functions? If not, how would I make sure that it does? Edit: I tested it, and my sine function (with 4 extra if conditions added to expand the domain to all reals) runs about 25% faster (albeit less accurately) than the default sin function. And in fact, the copy-pasted version runs slower than when I write it as a separate function.
I assume that you tested this on x86, because I cannot repro the numbers on x64. On x64, your code actually appears to be faster.
I disassembled the code for x86/release. The reason for the difference is that your method is just that, a method, whereas Math.Sin is compiled to use the x86 fsin instruction directly, thus eliminating a function call per invocation.
FWIW, the x64 code is quite different. Math.Sin is translated into clr!COMDouble::Sin.
See FSIN.
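As for making the compiler inline the method for you: on .NET 4.5 and later you can hint the JIT with MethodImplOptions.AggressiveInlining (the JIT is still free to ignore the hint, and this attribute postdates the original question). A minimal sketch using the same polynomial:

using System.Runtime.CompilerServices;

static class FastMath
{
    // Hint to the JIT that this method should be inlined at call sites.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double MySin(double x)
    {
        return ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
                    0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
                    1.000023121) * x;
    }
}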
I am trying to compare performance between parallel streams in Java 8 and PLINQ (C#/.Net 4.5.1).
Here is the result I get on my machine (Dell Precision M4700; Intel Core i7-3740QM CPU @ 2.70 GHz, 4 cores / 8 logical processors; 16.0 GB RAM; Windows 7 Enterprise SP1, build 7601):
C# .Net 4.5.1 (X64-release)
Serial:
470.7784, 491.4226, 502.4643, 481.7507, 464.1156, 463.0088, 546.149, 481.2942, 502.414, 483.1166
Average: 490.6373
Parallel:
158.6935, 133.4113, 217.4304, 182.3404, 184.188, 128.5767, 160.352, 277.2829, 127.6818, 213.6832
Average: 180.5496
Java 8 (X64)
Serial:
471.911822, 333.843924, 324.914299, 325.215631, 325.208402, 324.872828, 324.888046, 325.53066, 325.765791, 325.935861
Average:326.241715
Parallel:
212.09323, 73.969783, 68.015431, 66.246628, 66.15912, 66.185373, 80.120837, 75.813539, 70.085948, 66.360769
Average:70.3286
It looks like PLINQ does not scale across the CPU cores. I am wondering if I miss something.
Here is the code for C#:
class Program
{
    static void Main(string[] args)
    {
        var NUMBER_OF_RUNS = 10;
        var size = 10000000;
        var vals = new double[size];
        var rnd = new Random();
        for (int i = 0; i < size; i++)
        {
            vals[i] = rnd.NextDouble();
        }

        var avg = 0.0;
        Console.WriteLine("Serial:");
        for (int i = 0; i < NUMBER_OF_RUNS; i++)
        {
            var watch = Stopwatch.StartNew();
            var res = vals.Select(v => Math.Sin(v)).ToArray();
            var elapsed = watch.Elapsed.TotalMilliseconds;
            Console.Write(elapsed + ", ");
            if (i > 0)
                avg += elapsed;
        }
        Console.Write("\nAverage: " + (avg / (NUMBER_OF_RUNS - 1)));

        avg = 0.0;
        Console.WriteLine("\n\nParallel:");
        for (int i = 0; i < NUMBER_OF_RUNS; i++)
        {
            var watch = Stopwatch.StartNew();
            var res = vals.AsParallel().Select(v => Math.Sin(v)).ToArray();
            var elapsed = watch.Elapsed.TotalMilliseconds;
            Console.Write(elapsed + ", ");
            if (i > 0)
                avg += elapsed;
        }
        Console.Write("\nAverage: " + (avg / (NUMBER_OF_RUNS - 1)));
    }
}
Here is the code for Java:
import java.util.Arrays;
import java.util.Random;
import java.util.stream.DoubleStream;

public class Main {
    private static final Random rand = new Random();
    private static final int MIN = 1;
    private static final int MAX = 140;
    private static final int POPULATION_SIZE = 10_000_000;
    public static final int NUMBER_OF_RUNS = 10;

    public static void main(String[] args) throws InterruptedException {
        Random rnd = new Random();
        double[] vals1 = DoubleStream.generate(rnd::nextDouble).limit(POPULATION_SIZE).toArray();

        double avg = 0.0;
        System.out.println("Serial:");
        for (int i = 0; i < NUMBER_OF_RUNS; i++)
        {
            long start = System.nanoTime();
            double[] res = Arrays.stream(vals1).map(Math::sin).toArray();
            double duration = (System.nanoTime() - start) / 1_000_000.0;
            System.out.print(duration + ", ");
            if (i > 0)
                avg += duration;
        }
        System.out.println("\nAverage:" + (avg / (NUMBER_OF_RUNS - 1)));

        avg = 0.0;
        System.out.println("\n\nParallel:");
        for (int i = 0; i < NUMBER_OF_RUNS; i++)
        {
            long start = System.nanoTime();
            double[] res = Arrays.stream(vals1).parallel().map(Math::sin).toArray();
            double duration = (System.nanoTime() - start) / 1_000_000.0;
            System.out.print(duration + ", ");
            if (i > 0)
                avg += duration;
        }
        System.out.println("\nAverage:" + (avg / (NUMBER_OF_RUNS - 1)));
    }
}
Both runtimes make a decision about how many threads to use in order to complete the parallel operation. That is a non-trivial task that can take many factors into account, including the degree to which the task is CPU bound, the estimated time to complete the task, etc.
Each runtime makes different decisions about how many threads to use to resolve the request. Neither decision is obviously right or wrong in terms of system-wide scheduling, but the Java strategy performs better on this benchmark (while leaving fewer CPU resources available for other tasks on the system).
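If you want to take that decision away from PLINQ, you can pin the degree of parallelism yourself; a minimal sketch against the benchmark above (forcing a worker per logical processor is not guaranteed to be faster, so measure):

// Ask PLINQ for one worker per logical processor instead of letting it pick.
var res = vals.AsParallel()
              .WithDegreeOfParallelism(Environment.ProcessorCount)
              .Select(v => Math.Sin(v))
              .ToArray();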
I have some large arrays of 2D data. The two dimensions, A and B, aren't equally sized:
A) is between 5 and 20
B) is between 1000 and 100000
Initialization time is no problem, as these are only going to be lookup tables for a realtime application, so the performance of indexing an element given values A and B is crucial. The data stored is currently a single byte value.
I was considering these solutions:
byte[A][B] datalist1a;
or
byte[B][A] datalist2a;
or
byte[A,B] datalist1b;
or
byte[B,A] datalist2b;
or perhaps losing the multidimensionality, since I know the fixed sizes, and just multiplying the two values before looking it up:
byte[A*Bmax + B] datalist3;
or
byte[B*Amax + A] datalist4;
What I need to know is which datatype/array structure to use for the most efficient lookup in C# given this setup.
Edit 1
the first two solutions were supposed to be multidimensional, not multiple separate arrays.
Edit 2
All data in the smaller dimension is read at each lookup, but the larger one is only used for indexing once at a time.
So it's something like: grab all A's from sample B.
I'd bet on the jagged arrays, unless Amax or Bmax is a power of 2.
I'd say so because a jagged array needs two indexed accesses, which are very fast. The other forms imply a multiplication, either implicit or explicit, and unless that multiplication reduces to a simple shift, I think it could be a bit heavier than a couple of indexed accesses.
EDIT: Here is the small program used for the test:
class Program
{
    private static int A = 10;
    private static int B = 100;

    private static byte[] _linear;
    private static byte[,] _square;
    private static byte[][] _jagged;

    unsafe static void Main(string[] args)
    {
        //init arrays
        _linear = new byte[A * B];
        _square = new byte[A, B];
        _jagged = new byte[A][];
        for (int i = 0; i < A; i++)
            _jagged[i] = new byte[B];

        //set-up the params
        var sw = new Stopwatch();
        byte b;
        const int N = 100000;

        //one-dim array (buffer)
        sw.Restart();
        for (int i = 0; i < N; i++)
        {
            for (int r = 0; r < A; r++)
            {
                for (int c = 0; c < B; c++)
                {
                    b = _linear[r * B + c];
                }
            }
        }
        sw.Stop();
        Console.WriteLine("linear={0}", sw.ElapsedMilliseconds);

        //two-dim array
        sw.Restart();
        for (int i = 0; i < N; i++)
        {
            for (int r = 0; r < A; r++)
            {
                for (int c = 0; c < B; c++)
                {
                    b = _square[r, c];
                }
            }
        }
        sw.Stop();
        Console.WriteLine("square={0}", sw.ElapsedMilliseconds);

        //jagged array
        sw.Restart();
        for (int i = 0; i < N; i++)
        {
            for (int r = 0; r < A; r++)
            {
                for (int c = 0; c < B; c++)
                {
                    b = _jagged[r][c];
                }
            }
        }
        sw.Stop();
        Console.WriteLine("jagged={0}", sw.ElapsedMilliseconds);

        //one-dim array within unsafe access (and context)
        sw.Restart();
        for (int i = 0; i < N; i++)
        {
            for (int r = 0; r < A; r++)
            {
                fixed (byte* offset = &_linear[r * B])
                {
                    for (int c = 0; c < B; c++)
                    {
                        b = *(byte*)(offset + c);
                    }
                }
            }
        }
        sw.Stop();
        Console.WriteLine("unsafe={0}", sw.ElapsedMilliseconds);

        Console.Write("Press any key...");
        Console.ReadKey();
        Console.WriteLine();
    }
}
Multidimensional ([,]) arrays are nearly always the slowest, unless under a heavy random-access scenario. In theory they shouldn't be, but it's one of the CLR's oddities.
Jagged arrays ([][]) are nearly always faster than multidimensional arrays, even under random-access scenarios, though they carry a memory overhead.
Single-dimensional ([]) and algebraic arrays ([y * stride + x]) are the fastest for random access in safe code.
Unsafe code is normally the fastest in all cases (provided you don't pin it repeatedly).
The only useful answer to "which X is faster" (for all X) is: you have to do performance tests that reflect your requirements.
And remember to consider, in general*:
Maintenance of the program. If this is not a quick one off, a slightly slower but maintainable program is a better option in most cases.
Micro benchmarks can be deceptive. For instance a tight loop just reading from a collection might be optimised away in ways not possible when real work is being done.
Additionally, consider that you need to look at the complete program to decide where to optimise. Speeding up a loop by 1% might be useful for that loop, but if it is only 1% of the complete runtime then it is not making much difference.
* But all rules have exceptions.
On most modern computers, arithmetic operations are far, far faster than memory lookups.
If you fetch a memory address that isn't in a cache, or out-of-order execution pulls from the wrong place, you are looking at 10-100 clocks; a pipelined multiply is 1 clock.
The other issue is cache locality.
byte[B*Amax + A] datalist4; seems like the best bet if you are accessing with the A's varying sequentially.
When datalist4[b*Amax + a] is accessed, the computer will usually start pulling in datalist4[b*Amax + a + 64/sizeof(dataListType)], ... +128 ..., etc., or, if it detects a reverse iteration, datalist4[b*Amax + a - 64/sizeof(dataListType)].
Hope that helps!
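To make that concrete, here is a minimal sketch of the flat layout with A varying fastest; Amax, Bmax, and the sample index are hypothetical values for illustration:

const int Amax = 20;        // hypothetical sizes
const int Bmax = 100000;
byte[] datalist4 = new byte[Bmax * Amax];

// Element (a, b) lives at b * Amax + a, so for a fixed b all the A
// values sit in one contiguous, prefetch-friendly run of memory.
int b = 12345;              // some sample index
for (int a = 0; a < Amax; a++)
{
    byte v = datalist4[b * Amax + a];
    // ... use v ...
}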
Maybe the best way for you would be to use a HashMap.
Dictionary?
I am looking for the fastest way to de/interleave a buffer.
To be more specific, I am dealing with audio data, so I am trying to optimize the time I spend on splitting/combining channels and FFT buffers.
Currently I am using a for loop with 2 index variables for each array, so only plus operations, but all the managed array bounds checks won't compare to a C pointer method.
I like the Buffer.BlockCopy and Array.Copy methods, which cut a lot of time when I process channels, but there is no way for an array to have a custom indexer.
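As an aside on Buffer.BlockCopy: its offsets and counts are in bytes, not elements, which is the classic pitfall when using it on float[] buffers. A minimal sketch with hypothetical buffers:

// Copy one channel's worth of contiguous floats;
// offsets and lengths are byte counts.
float[] src = new float[4096];
float[] dst = new float[4096];
Buffer.BlockCopy(src, 0, dst, 0, src.Length * sizeof(float));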
I was trying to find a way to make an array mask, where it would be a fake array with a custom indexer, but that proved to be two times slower when used in my FFT operation. I guess there are a lot of optimization tricks the compiler can pull when accessing an array directly, but accessing through a class indexer cannot be optimized.
I do not want an unsafe solution, although from the looks of it, that might be the only way to optimize this type of operation.
Thanks.
Here is the type of thing I'm doing right now:
private float[][] DeInterleave(float[] buffer, int channels)
{
    float[][] tempbuf = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
    {
        tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < tempbuf[c].Length; i++, offset += channels)
            tempbuf[c][i] = buffer[offset];
    }
    return tempbuf;
}
I ran some tests and here is the code I tested:
delegate(float[] inout)
{   // My original code
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
    {
        tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < tempbuf[c].Length; i++, offset += 2)
            tempbuf[c][i] = inout[offset];
    }
}

delegate(float[] inout)
{   // jerryjvl's recommendation: loop unrolling
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
        tempbuf[c] = new float[length];
    for (int ix = 0, i = 0; ix < length; ix++)
    {
        tempbuf[0][ix] = inout[i++];
        tempbuf[1][ix] = inout[i++];
    }
}

delegate(float[] inout)
{   // Unsafe code
    unsafe
    {
        float[][] tempbuf = new float[2][];
        int length = inout.Length / 2;
        fixed (float* buffer = inout)
            for (int c = 0; c < 2; c++)
            {
                tempbuf[c] = new float[length];
                float* offset = buffer + c;
                fixed (float* buffer2 = tempbuf[c])
                {
                    float* p = buffer2;
                    for (int i = 0; i < length; i++, offset += 2)
                        *p++ = *offset;
                }
            }
    }
}

delegate(float[] inout)
{   // Modifying my original code to see if the compiler is not as smart as I think it is
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
    {
        float[] buf = tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < buf.Length; i++, offset += 2)
            buf[i] = inout[offset];
    }
}
and the results (buffer size = 2^17, number of iterations timed per test = 200):
Average for test #1: 0.001286 seconds +/- 0.000026
Average for test #2: 0.001193 seconds +/- 0.000025
Average for test #3: 0.000686 seconds +/- 0.000009
Average for test #4: 0.000847 seconds +/- 0.000008
Average for test #1: 0.001210 seconds +/- 0.000012
Average for test #2: 0.001048 seconds +/- 0.000012
Average for test #3: 0.000690 seconds +/- 0.000009
Average for test #4: 0.000883 seconds +/- 0.000011
Average for test #1: 0.001209 seconds +/- 0.000015
Average for test #2: 0.001060 seconds +/- 0.000013
Average for test #3: 0.000695 seconds +/- 0.000010
Average for test #4: 0.000861 seconds +/- 0.000009
I got similar results on every test. Obviously the unsafe code is the fastest, but I was surprised to see that the CLR couldn't figure out that it can drop the index checks when dealing with a jagged array. Maybe someone can think of more ways to optimize my tests.
Edit:
I tried loop unrolling with the unsafe code and it didn't have an effect.
I also tried optimizing the loop unrolling method:
delegate(float[] inout)
{
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    float[] tempbuf0 = tempbuf[0] = new float[length];
    float[] tempbuf1 = tempbuf[1] = new float[length];
    for (int ix = 0, i = 0; ix < length; ix++)
    {
        tempbuf0[ix] = inout[i++];
        tempbuf1[ix] = inout[i++];
    }
}
The results are hit-or-miss compared to test #4, within about a 1% difference. Test #4 is my best way to go so far.
As I told jerryjvl, the problem is getting the CLR not to bounds-check the input buffer, since adding a second check (&& offset < inout.Length) will slow it down...
Edit 2:
The earlier tests were run inside the IDE, so here are the results running outside it:
2^17 items, repeated 200 times
******************************************
Average for test #1: 0.000533 seconds +/- 0.000017
Average for test #2: 0.000527 seconds +/- 0.000016
Average for test #3: 0.000407 seconds +/- 0.000008
Average for test #4: 0.000374 seconds +/- 0.000008
Average for test #5: 0.000424 seconds +/- 0.000009
2^17 items, repeated 200 times
******************************************
Average for test #1: 0.000547 seconds +/- 0.000016
Average for test #2: 0.000732 seconds +/- 0.000020
Average for test #3: 0.000423 seconds +/- 0.000009
Average for test #4: 0.000360 seconds +/- 0.000008
Average for test #5: 0.000406 seconds +/- 0.000008
2^18 items, repeated 200 times
******************************************
Average for test #1: 0.001295 seconds +/- 0.000036
Average for test #2: 0.001283 seconds +/- 0.000020
Average for test #3: 0.001085 seconds +/- 0.000027
Average for test #4: 0.001035 seconds +/- 0.000025
Average for test #5: 0.001130 seconds +/- 0.000025
2^18 items, repeated 200 times
******************************************
Average for test #1: 0.001234 seconds +/- 0.000026
Average for test #2: 0.001319 seconds +/- 0.000023
Average for test #3: 0.001309 seconds +/- 0.000025
Average for test #4: 0.001191 seconds +/- 0.000026
Average for test #5: 0.001196 seconds +/- 0.000022
Test#1 = My Original Code
Test#2 = Optimized safe loop unrolling
Test#3 = Unsafe code - loop unrolling
Test#4 = Unsafe code
Test#5 = My Optimized Code
Looks like loop unrolling is not favorable. My optimized code is still my best way to go, with only a 10% difference compared to the unsafe code. If only I could tell the compiler that (i < buf.Length) implies (offset < inout.Length); it would drop the check on inout[offset] and I would basically get the unsafe performance.
As there's no built-in function to do that, using array indexes is the fastest approach you could think of. Indexers and solutions like that only make things worse by introducing method calls and preventing the JIT optimizer from being able to optimize the bounds checks.
Anyway, I think your current method is the fastest non-unsafe solution you could be using. If performance really matters to you (which it usually does in signal-processing applications), you can do the whole thing in unsafe C# (which is fast enough, probably comparable with C) and wrap it in a method that you'd call from your safe methods.
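A minimal sketch of that shape, reusing the pointer walk from test #3 above behind a safe signature (the assembly needs /unsafe, but callers don't):

// Safe entry point; all pointer work is contained inside.
public static float[][] DeInterleave(float[] buffer, int channels)
{
    float[][] result = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
        result[c] = new float[length];

    unsafe
    {
        fixed (float* src = buffer)
        {
            for (int c = 0; c < channels; c++)
            {
                fixed (float* dst = result[c])
                {
                    // Strided read from the interleaved source,
                    // sequential write into the channel buffer.
                    float* s = src + c;
                    for (int i = 0; i < length; i++, s += channels)
                        dst[i] = *s;
                }
            }
        }
    }
    return result;
}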
It's not going to get you a major performance boost (I roughly measured 20% on my machine), but you could consider some loop unrolling for common cases. If most of the time you have a relatively limited number of channels:
static private float[][] Alternative(float[] buffer, int channels)
{
    float[][] result = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
        result[c] = new float[length];

    int i = 0;
    if (channels == 8)
    {
        for (int ix = 0; ix < length; ix++)
        {
            result[0][ix] = buffer[i++];
            result[1][ix] = buffer[i++];
            result[2][ix] = buffer[i++];
            result[3][ix] = buffer[i++];
            result[4][ix] = buffer[i++];
            result[5][ix] = buffer[i++];
            result[6][ix] = buffer[i++];
            result[7][ix] = buffer[i++];
        }
    }
    else
        for (int ix = 0; ix < length; ix++)
            for (int ch = 0; ch < channels; ch++)
                result[ch][ix] = buffer[i++];

    return result;
}
As long as you leave the general fallback variant there it'll handle any number of channels, but you'll get a speed boost if it is one of the unrolled variants.
Maybe some unrolling in your own best answer:
delegate(float[] inout)
{
    unsafe
    {
        float[][] tempbuf = new float[2][];
        int length = inout.Length / 2;
        fixed (float* buffer = inout)
        {
            float* pbuffer = buffer;
            tempbuf[0] = new float[length];
            tempbuf[1] = new float[length];
            fixed (float* buffer0 = tempbuf[0])
            fixed (float* buffer1 = tempbuf[1])
            {
                float* pbuffer0 = buffer0;
                float* pbuffer1 = buffer1;
                for (int i = 0; i < length; i++)
                {
                    *pbuffer0++ = *pbuffer++;
                    *pbuffer1++ = *pbuffer++;
                }
            }
        }
    }
}
This might get a little more performance still.
I think a lot of readers would question why you don't want an unsafe solution for something like audio processing. It's the type of thing that begs for hot-blooded optimization, and I would personally be unhappy knowing it's being forced through a VM.