I've created the following test method to understand how SSE and AVX work and what their benefits are. Now I'm actually very surprised to see that System.Runtime.Intrinsics.X86.Avx.Multiply is less than 5% faster compared to the traditional approach with the * operator.
I don't understand why this is. Would you please enlighten me?
I've put my benchmark results in the last line of the code examples.
(long TicksSse2, long TicksAlu) TestFloat()
{
Vector256<float> x = Vector256.Create((float)255, (float)128, (float)64, (float)32, (float)16, (float)8, (float)4, (float)2);
Vector256<float> y = Vector256.Create((float).5);
Stopwatch timerSse = new Stopwatch();
Stopwatch timerAlu = new Stopwatch();
for (int cnt = 0; cnt < 100_000_000; cnt++)
{
timerSse.Start();
var xx = Avx.Multiply(x, y);
timerSse.Stop();
timerAlu.Start();
float a = (float)255 * (float).5;
float b = (float)128 * (float).5;
float c = (float)64 * (float).5;
float d = (float)32 * (float).5;
float e = (float)16 * (float).5;
float f = (float)8 * (float).5;
float g = (float)4 * (float).5;
float h = (float)2 * (float).5;
timerAlu.Stop();
}
return (timerSse.ElapsedMilliseconds, timerAlu.ElapsedMilliseconds);
// timerSse = 1688ms; timerAlu = 1748ms.
}
Even more drastic: I created the following test method for mass byte multiplication. This one is even slower when using the SSE instructions:
Vector128<byte> MultiplyBytes(Vector128<byte> x, Vector128<byte> y)
{
// Mask that keeps only the low byte of each 16-bit lane (the even-indexed bytes).
Vector128<ushort> helper = Vector128.Create((ushort)0x00FF);
Vector128<ushort> xAsUShort = x.AsUInt16();
Vector128<ushort> yAsUShort = y.AsUInt16();
Vector128<ushort> dstEven = Sse2.MultiplyLow(xAsUShort, yAsUShort);
Vector128<ushort> dstOdd = Sse2.MultiplyLow(Sse2.ShiftRightLogical(xAsUShort, 8), Sse2.ShiftRightLogical(yAsUShort, 8));
return Sse2.Or(Sse2.ShiftLeftLogical(dstOdd, 8), Sse2.And(dstEven, helper)).AsByte();
}
(long TicksSse2, long TicksAlu) TestBytes()
{
Vector128<byte> x = Vector128.Create((byte)1, (byte)2, (byte)3, (byte)4, (byte)5, (byte)6, (byte)7, (byte)8, (byte)9, (byte)10, (byte)11, (byte)12, (byte)13, (byte)14, (byte)15, (byte)16);
Vector128<byte> y = Vector128.Create((byte)2);
Stopwatch timerSse = new Stopwatch();
Stopwatch timerAlu = new Stopwatch();
for (int cnt = 0; cnt < 100_000_000; cnt++)
{
timerSse.Start();
var xx = MultiplyBytes(x, y);
timerSse.Stop();
timerAlu.Start();
byte a = (byte)1 * (byte)2;
byte b = (byte)2 * (byte)2;
byte c = (byte)3 * (byte)2;
byte d = (byte)4 * (byte)2;
byte e = (byte)5 * (byte)2;
byte f = (byte)6 * (byte)2;
byte g = (byte)7 * (byte)2;
byte h = (byte)8 * (byte)2;
byte i = (byte)9 * (byte)2;
byte j = (byte)10 * (byte)2;
byte k = (byte)11 * (byte)2;
byte l = (byte)12 * (byte)2;
byte m = (byte)13 * (byte)2;
byte n = (byte)14 * (byte)2;
byte o = (byte)15 * (byte)2;
byte p = (byte)16 * (byte)2;
timerAlu.Stop();
}
return (timerSse.ElapsedMilliseconds, timerAlu.ElapsedMilliseconds);
// timerSse = 3439ms; timerAlu = 1800ms
}
The benchmark code isn't meaningful.
It tries to measure the duration of a single operation, 100M times, using a timer that simply doesn't have the resolution to measure single CPU operations. Any differences are due to rounding errors.
On my machine Stopwatch.Frequency returns 10_000_000. That's 10 MHz on a 2.7 GHz CPU, so a single tick is 100 ns, roughly 270 clock cycles, far longer than one multiplication.
A very crude test would be to repeat each operation 100M times in a loop and measure the entire loop:
timerSse.Start();
for (int cnt = 0; cnt < iterations; cnt++)
{
var xx = Avx.Multiply(x, y);
}
timerSse.Stop();
timerAlu.Start();
for (int cnt = 0; cnt < iterations; cnt++)
{
float a = (float)255 * (float).5;
float b = (float)128 * (float).5;
float c = (float)64 * (float).5;
float d = (float)32 * (float).5;
float e = (float)16 * (float).5;
float f = (float)8 * (float).5;
float g = (float)4 * (float).5;
float h = (float)2 * (float).5;
}
timerAlu.Stop();
In that case the results show a significant difference:
TicksSse2 = 357384, TicksAlu = 474061
The SSE2 code takes about 75% of the time of the floating-point code. That's still not meaningful, because the code isn't actually multiplying anything.
The compiler sees that the values are constant and that the results are never used, and eliminates them. Checking the IL generated in Release mode on Sharplab.io shows this:
// loop start (head: IL_005e)
IL_0050: ldloc.0
IL_0051: ldloc.1
IL_0052: call valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32> [System.Runtime.Intrinsics]System.Runtime.Intrinsics.X86.Avx::Multiply(valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32>, valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32>)
IL_0057: pop
IL_0058: ldloc.s 4
IL_005a: ldc.i4.1
IL_005b: add
IL_005c: stloc.s 4
IL_005e: ldloc.s 4
IL_0060: ldarg.1
IL_0061: blt.s IL_0050
// end loop
IL_0063: ldloc.2
IL_0064: callvirt instance void [System.Runtime]System.Diagnostics.Stopwatch::Stop()
IL_0069: ldloc.3
IL_006a: callvirt instance void [System.Runtime]System.Diagnostics.Stopwatch::Start()
IL_006f: ldc.i4.0
IL_0070: stloc.s 5
// sequence point: hidden
IL_0072: br.s IL_007a
// loop start (head: IL_007a)
IL_0074: ldloc.s 5
IL_0076: ldc.i4.1
IL_0077: add
IL_0078: stloc.s 5
IL_007a: ldloc.s 5
IL_007c: ldarg.1
IL_007d: blt.s IL_0074
// end loop
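One way to get a comparison that means something is to read the operands from variables and accumulate the results, so the JIT can neither constant-fold the scalar expressions nor discard the unused results. The following is only a rough sketch of that idea (my own, with arbitrary data; a tool such as BenchmarkDotNet would do this properly):
// Illustrative only: keep the results live so the JIT cannot remove the work.
float[] xs = { 255f, 128f, 64f, 32f, 16f, 8f, 4f, 2f };
Vector256<float> x = Vector256.Create(xs[0], xs[1], xs[2], xs[3], xs[4], xs[5], xs[6], xs[7]);
Vector256<float> y = Vector256.Create(0.5f);
Vector256<float> accSse = Vector256<float>.Zero;
var timerSse = Stopwatch.StartNew();
for (int cnt = 0; cnt < iterations; cnt++)
    accSse = Avx.Add(accSse, Avx.Multiply(x, y));
timerSse.Stop();
float accAlu = 0f;
var timerAlu = Stopwatch.StartNew();
for (int cnt = 0; cnt < iterations; cnt++)
    for (int k = 0; k < xs.Length; k++)
        accAlu += xs[k] * 0.5f;
timerAlu.Stop();
// Consuming the accumulators keeps them (and the loops above) alive.
Console.WriteLine(accSse.GetElement(0) + accAlu);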
I'm trying to improve the basic Sieve of Eratosthenes algorithm by avoiding crossing out duplicate multiples of primes, but it turned out to be worse than I expected.
I have implemented two methods that return the primes in the range [2..max).
Basic sieve
public static List<int> Sieve22Max_Basic(int n) {
var primes = new List<int>();
var sieve = new BitArray(n, true); // default all number are prime
//int crossTotal = 0;
int sqrt_n = (int)Math.Sqrt(n) + 1;
for (int p = 2; p < sqrt_n; ++p) {
if (sieve[p]) {
primes.Add(p);
//var cross = new List<int>();
int inc = p == 2 ? p : 2 * p;
for (int mul = p * p; mul < n; mul += inc) {
// cross out multiple of prime p
// cross.Add(mul);
//++crossTotal;
sieve[mul] = false;
}
//if (cross.Count > 0)
// Console.WriteLine($"Prime {p}, cross out: {string.Join(' ', cross)}");
}
}
//Console.WriteLine($"crossTotal: {crossTotal:n0}");
for (int p = sqrt_n; p < n; ++p)
if (sieve[p])
primes.Add(p);
return primes;
}
Running Sieve22Max_Basic(100), you can see that some multiples are crossed out more than once (e.g. 45, 75, 63):
Prime 2, cross out: 4 6 8 ... 96 98
Prime 3, cross out: 9 15 21 27 33 39 45 51 57 63 69 75 81 87 93 99
Prime 5, cross out: 25 35 45 55 65 75 85 95
Prime 7, cross out: 49 63 77 91
Enhanced sieve
Then I tried to improve it by using an array that stores the smallest prime divisor (spd) of each number.
45 = 3 x 5 // spd[45] = 3
75 = 3 x 5 x 5 // spd[75] = 3
63 = 3 x 3 x 7 // spd[63] = 3
When going through the multiples of a prime p, I do not cross out mul * p if spd[mul] < p, because that number was already crossed out by spd[mul] before.
public static List<int> Sieve22Max_Enh(int n) {
var sieve = new BitArray(n, true);
var spd = new int[n];
for (int i = 0; i < n; ++i) spd[i] = i;
var primes = new List<int>();
//int crossTotal = 0;
int sqrt_n = (int)Math.Sqrt(n) + 1;
for (int p = 2; p < sqrt_n; ++p) {
if (sieve[p]) {
primes.Add(p);
//var cross = new List<int>();
int inc = p == 2 ? 1 : 2;
for (long mul = p; mul * p < n; mul += inc) {
if (spd[mul] >= p) {
sieve[(int)(mul * p)] = false;
spd[mul * p] = p;
//++crossTotal;
//cross.Add((int)(mul * p));
}
}
//if (cross.Count > 0)
// Console.WriteLine($"Prime {p}, cross out: {string.Join(' ', cross)}");
}
}
//Console.WriteLine($"crossTotal: {crossTotal:n0}");
for (int p = sqrt_n; p < n; ++p)
if (sieve[p])
primes.Add(p);
return primes;
}
Test
I tested on my laptop (Core i7, 2.6 GHz) with n = 1 billion.
Sieve22Max_Basic takes only 6 s, while Sieve22Max_Enh takes more than 10 s to complete.
var timer = new Stopwatch();
int n = 1_000_000_000;
timer.Restart();
Console.WriteLine("==== Sieve22Max_Basic ===");
var list = Sieve22Max_Basic(n);
Console.WriteLine($"Count: {list.Count:n0}, Last: {list[list.Count - 1]:n0}, elapsed: {timer.Elapsed}");
Console.WriteLine();
timer.Restart();
Console.WriteLine("==== Sieve22Max_Enh ===");
list = Sieve22Max_Enh(n);
Console.WriteLine($"Count: {list.Count:n0}, Last: {list[list.Count - 1]:n0}, elapsed: {timer.Elapsed}");
You can try it at https://onlinegdb.com/tWfMuDDK0
What makes it slower?
Compare your two loops from the original and improved versions.
Original:
int inc = p == 2 ? p : 2 * p;
for (int mul = p * p; mul < n; mul += inc) {
sieve[mul] = false;
}
Improved:
int inc = p == 2 ? 1 : 2;
for (long mul = p; mul * p < n; mul += inc) {
if (spd[mul] >= p) {
sieve[(int)(mul * p)] = false;
spd[mul * p] = p;
}
}
Some observations:
Both loops run the same number of iterations.
For every iteration, the original loop executes three very fast operations: changing a value in the BitArray, mul += inc, and checking mul < n.
For every iteration of the improved loop, we execute more operations: checking spd[mul] >= p, mul += inc, mul * p (in the for-loop condition), and checking mul * p < n.
The increment += and the < comparison in the loop condition are the same in both loops; checking spd[mul] >= p and changing a value in the BitArray take comparable time; but the additional operation mul * p in the second loop's condition is a multiplication, and that is expensive!
On top of that, for every iteration of the second loop in which spd[mul] >= p is true, we also execute: mul * p (again!), a cast to int, changing a value in the BitArray, mul * p (a third time!), presumably another cast to int for indexing spd, and an assignment into the array spd.
To summarize, every iteration of your improved loop is computationally "heavier". That is why your improved version is slower.
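If you want to keep the spd idea, one mitigation is to carry the product along instead of recomputing mul * p several times per iteration. This is only a sketch of that change (mine, not benchmarked), and each iteration is still heavier than in the basic sieve:
int inc = p == 2 ? 1 : 2;
long step = (long)inc * p;                       // how much mul * p grows per iteration
for (long mul = p, prod = (long)p * p; prod < n; mul += inc, prod += step)
{
    if (spd[mul] >= p)
    {
        sieve[(int)prod] = false;                // cross out mul * p exactly once
        spd[prod] = p;
    }
}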
Came across a real head-scratcher this week. I am implementing an IIR filter in C#, so I copied their time-domain filter function (direct form II transposed) directly from the Matlab source for filter:
// direct form ii transposed
for (int i = 0; i < data.Length; i++)
{
Xi = data[i];
Yi = b[0] * Xi + Z[0];
for (int j = 1; j < order; j++)
Z[j - 1] = b[j] * Xi + Z[j] - a[j] * Yi;
Z[order - 1] = b[order] * Xi - a[order] * Yi;
output[i] = Yi;
}
return output;
What is odd is that when I test the filter with an impulse, I get slightly different values from those reported by Matlab. I am getting the filter coefficients from Matlab as well. Here is the code:
[b,a] = butter(3, [.0360, .1160], 'bandpass');
x = zeros(50,1);
x(1) = 1.0;
y = filter(b,a,x)
I use the values in b and a as the coefficients in my C# code.
The first few values for y as reported by Matlab:
>> y(1:13)
ans =
0.0016
0.0084
0.0216
0.0368
0.0487
0.0537
0.0501
0.0382
0.0194
-0.0038
-0.0286
-0.0519
-0.0713
Since this was different from my C# port, I directly copied the code from filter to a C file and ran it there using the same coefficients. The output was exactly the same, slightly off version of the impulse response that I got in my C# implementation:
[0] 0.0016 double
[1] 0.0086161600000000012 double
[2] 0.022182403216000009 double
[3] 0.038161063110481647 double
[4] 0.051323531488129848 double
[5] 0.05827273642334313 double
[6] 0.057456579295617483 double
[7] 0.048968543791003127 double
[8] 0.034196988694833064 double
[9] 0.015389667539999874 double
[10] -0.0048027826346631469 double
[11] -0.023749640330880527 double
[12] -0.039187648694732236 double
[13] -0.04946710058803272 double
I looked carefully at the source of filter and I don't see any evidence of massaging the coefficients prior to calculating the filter output. filter normalizes the feed-forward coefficients only in the case that a[0] does not equal 1 (which in this case it most certainly does). Other than that, I expect to see exactly the same filter output from Matlab's C code as I do from Matlab.
I would really like to know where this discrepancy is coming from, because I need to be confident that my filter is exactly correct (don't we all...). I have checked and rechecked my filter coefficients. They are identical between C/C# and Matlab.
The full C file I used to get these 'wrong' values follows. I tried both the filter implemented as a fixed number of states (6 in this case) and the general case of N states (commented out here). Both come from the Matlab source code and both produce identical, 'wrong' outputs:
# define order 6
# define len 50
int main(void){
// filter coeffs
float a[7] = { 1.0000, -5.3851, 12.1978, -14.8780, 10.3077, -3.8465, 0.6041 };
float b[7] = { 0.0016, 0.0, -0.0047, 0.0, 0.0047, 0, -0.0016 };
float a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4], a5 = a[5], a6 = a[6];
// input, output, and state arrays
float X[len];
float Y[len];
float Z[order];
float Xi;
float Yi;
float z0, z1, z2, z3,z4, z5;
// indices
int i,j;
// initialize arrays
for(i=0;i<len;i++) {
X[i] = 0.0;
Y[i] = 0.0;
}
X[0] = 1.0;
for(i=0;i<order;i++)
Z[i] = 0.0;
z0 = Z[0];
z1 = Z[1];
z2 = Z[2];
z3 = Z[3];
z4 = Z[4];
z5 = Z[5];
i = 0;
while (i < len) {
Xi = X[i];
Yi = b[0] * Xi + z0;
z0 = b[1] * Xi + z1 - a1 * Yi;
z1 = b[2] * Xi + z2 - a2 * Yi;
z2 = b[3] * Xi + z3 - a3 * Yi;
z3 = b[4] * Xi + z4 - a4 * Yi;
z4 = b[5] * Xi + z5 - a5 * Yi;
z5 = b[6] * Xi - a6 * Yi;
Y[i++] = Yi;
}
//// copied from matlab filter source code
//i=0;
//while (i < len) {
// Xi = X[i]; // Get signal
// Yi = b[0] * Xi + Z[0]; // Filtered value
// for (j = 1; j < order; j++) { // Update conditions
// Z[j - 1] = b[j] * Xi + Z[j] - a[j] * Yi;
// }
// Z[order - 1] = b[order] * Xi - a[order] * Yi;
//
// Y[i++] = Yi; // Write to output
// }
}
If I use all the digits of the filter coefficients, I get the Matlab answer.
Specifically,
float a[7] = {1,
-5.3850853610906082025167052052,
12.1978301571107792256043467205,
-14.8779557262737220924009307055,
10.3076512098041828124905805453,
-3.84649525959781790618308150442,
0.604109699507274999774608659209};
float b[7] = {0.00156701035058826889344307797813, 0,
-0.0047010310517648064634887994373, 0,
0.0047010310517648064634887994373, 0,
-0.00156701035058826889344307797813};
(Filter coefficients obtained using SciPy: b, a = scipy.signal.butter(3, [.0360, .1160], 'bandpass'), since I’m not made of 💸.)
With this in place, your C code can print out:
[0] = 0.0015670103020966053009033203125
[1] = 0.00843848474323749542236328125
[2] = 0.02162680588662624359130859375
[3] = 0.036844909191131591796875
[4] = 0.048709094524383544921875
[5] = 0.05368389189243316650390625
[6] = 0.05014741420745849609375
[7] = 0.0382179915904998779296875
[8] = 0.0194064676761627197265625
[9] = -0.003834001719951629638671875
which matches your Matlab output.
This has nothing to do with float vs. double. In situations like this, when porting implementations from one language to another, please strive to ensure bit-exact inputs and compare outputs using relative error, not by copy-pasting and comparing printouts.
(PS. Notice how, thanks to the symmetry of your passband, every other element of b is 0. This can be used to reduce the number of flops needed!)
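For instance, unrolling the direct form II transposed update for this sixth-order filter and dropping the zero taps could look like this (a sketch of the idea, assuming a[0] = 1 and double-precision coefficient and state arrays):
// b[1], b[3] and b[5] are exactly zero, so their products can be dropped.
for (int i = 0; i < data.Length; i++)
{
    double Xi = data[i];
    double Yi = b[0] * Xi + Z[0];
    Z[0] =             Z[1] - a[1] * Yi;   // b[1] == 0
    Z[1] = b[2] * Xi + Z[2] - a[2] * Yi;
    Z[2] =             Z[3] - a[3] * Yi;   // b[3] == 0
    Z[3] = b[4] * Xi + Z[4] - a[4] * Yi;
    Z[4] =             Z[5] - a[5] * Yi;   // b[5] == 0
    Z[5] = b[6] * Xi       - a[6] * Yi;
    output[i] = Yi;
}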
I found this code to swap two numbers without using a third variable, using the XOR ^ operator.
Code:
int i = 25;
int j = 36;
j ^= i;
i ^= j;
j ^= i;
Console.WriteLine("i:" + i + " j:" + j);
//numbers Swapped correctly
//Output: i:36 j:25
Now I changed the above code to this equivalent code.
My Code:
int i = 25;
int j = 36;
j ^= i ^= j ^= i; // I have changed to this equivalent (???).
Console.WriteLine("i:" + i + " j:" + j);
//Not Swapped correctly
//Output: i:36 j:0
Now I want to know: why does my code give incorrect output?
EDIT: Okay, got it.
The first point to make is that obviously you shouldn't use this code anyway. However, when you expand it, it becomes equivalent to:
j = j ^ (i = i ^ (j = j ^ i));
(If we were using a more complicated expression such as foo().bar ^= i, it would be important that foo() was evaluated only once, but here I believe it's simpler.)
Now, the order of evaluation of the operands is always left to right, so to start with we get:
j = 36 ^ (i = i ^ (j = j ^ i));
This (above) is the most important step. We've ended up with 36 as the LHS for the XOR operation which is executed last. The LHS is not "the value of j after the RHS has been evaluated".
The evaluation of the RHS of the ^ involves the "one level nested" expression, so it becomes:
j = 36 ^ (i = 25 ^ (j = j ^ i));
Then looking at the deepest level of nesting, we can substitute both i and j:
j = 36 ^ (i = 25 ^ (j = 25 ^ 36));
... which becomes
j = 36 ^ (i = 25 ^ (j = 61));
The assignment to j in the RHS occurs first, but the result is then overwritten at the end anyway, so we can ignore that - there are no further evaluations of j before the final assignment:
j = 36 ^ (i = 25 ^ 61);
This is now equivalent to:
i = 25 ^ 61;
j = 36 ^ (i = 25 ^ 61);
Or:
i = 36;
j = 36 ^ 36;
Which becomes:
i = 36;
j = 0;
I think that's all correct, and it gets to the right answer... apologies to Eric Lippert if some of the details about evaluation order are slightly off :(
I checked the generated IL, and the two versions produce different code.
The correct swap generates straightforward code:
IL_0001: ldc.i4.s 25
IL_0003: stloc.0 //store the constant 25 in local variable 0
IL_0004: ldc.i4.s 36
IL_0006: stloc.1 //store the constant 36 in local variable 1
IL_0007: ldloc.1 //push variable at position 1 [36]
IL_0008: ldloc.0 //push variable at position 0 [25]
IL_0009: xor
IL_000a: stloc.1 //store result in location 1 [61]
IL_000b: ldloc.0 //push 25
IL_000c: ldloc.1 //push 61
IL_000d: xor
IL_000e: stloc.0 //store result in location 0 [36]
IL_000f: ldloc.1 //push 61
IL_0010: ldloc.0 //push 36
IL_0011: xor
IL_0012: stloc.1 //store result in location 1 [25]
The incorrect swap generates this code:
IL_0001: ldc.i4.s 25
IL_0003: stloc.0 //store the constant 25 in local variable 0
IL_0004: ldc.i4.s 36
IL_0006: stloc.1 //store the constant 36 in local variable 1
IL_0007: ldloc.1 //push 36 on stack (stack is 36)
IL_0008: ldloc.0 //push 25 on stack (stack is 36-25)
IL_0009: ldloc.1 //push 36 on stack (stack is 36-25-36)
IL_000a: ldloc.0 //push 25 on stack (stack is 36-25-36-25)
IL_000b: xor //stack is 36-25-61
IL_000c: dup //stack is 36-25-61-61
IL_000d: stloc.1 //store 61 into position 1, stack is 36-25-61
IL_000e: xor //stack is 36-36
IL_000f: dup //stack is 36-36-36
IL_0010: stloc.0 //store 36 into positon 0, stack is 36-36
IL_0011: xor //stack is 0, as the original 36 (instead of the new 61) is xor-ed
IL_0012: stloc.1 //store 0 into position 1
It's evident that the code generated by the second method is incorrect, as the old value of j is used in a calculation where the new value is required.
C# loads j, i, j, i onto the stack up front; the subsequent stores to the locals don't change the values already pushed, so the leftmost XOR uses the initial value of j.
Rewriting:
j ^= i;
i ^= j;
j ^= i;
Expanding ^=:
j = j ^ i;
i = j ^ i;
j = j ^ i;
Substitute:
j = j ^ i;
j = j ^ (i = j ^ i);
Substitute again (this only works because the left-hand side of the ^ operator is evaluated first):
j = (j = j ^ i) ^ (i = i ^ j);
Collapse ^:
j = (j ^= i) ^ (i ^= j);
Symmetrically:
i = (i ^= j) ^ (j ^= i);
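A quick check of the derived one-liner (my own illustration, not part of the original answer):
int i = 25, j = 36;
j = (j ^= i) ^ (i ^= j);  // left operand first: j becomes 61, then i becomes 36
Console.WriteLine("i:" + i + " j:" + j);
//Output: i:36 j:25 (swapped correctly)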
I found that my application spends 25% of its time doing this in a loop:
private static int Diff (int c0, int c1)
{
unsafe {
byte* pc0 = (byte*) &c0;
byte* pc1 = (byte*) &c1;
int d0 = pc0[0] - pc1[0];
int d1 = pc0[1] - pc1[1];
int d2 = pc0[2] - pc1[2];
int d3 = pc0[3] - pc1[3];
d0 *= d0;
d1 *= d1;
d2 *= d2;
d3 *= d3;
return d0 + d1 + d2 + d3;
}
}
How can I improve the performance of this method? My ideas so far:
Most obviously, this would benefit from SIMD, but let us suppose I don't want to go there because it is a bit of a hassle.
Same goes for lower level stuff (calling a C library, executing on GPGPU)
Multithreading - I'll use that.
Edit: For your convenience, some test code which reflects the real environment and use case. (In reality even more data are involved, and data are not compared in single large blocks but in many chunks of several kb each.)
public static class ByteCompare
{
private static void Main ()
{
const int n = 1024 * 1024 * 20;
const int repeat = 20;
var rnd = new Random (0);
Console.Write ("Generating test data... ");
var t0 = Enumerable.Range (1, n)
.Select (x => rnd.Next (int.MinValue, int.MaxValue))
.ToArray ();
var t1 = Enumerable.Range (1, n)
.Select (x => rnd.Next (int.MinValue, int.MaxValue))
.ToArray ();
Console.WriteLine ("complete.");
GC.Collect (2, GCCollectionMode.Forced);
Console.WriteLine ("GCs: " + GC.CollectionCount (0));
{
var sw = Stopwatch.StartNew ();
long res = 0;
for (int reps = 0; reps < repeat; reps++) {
for (int i = 0; i < n; i++) {
int c0 = t0[i];
int c1 = t1[i];
res += ByteDiff_REGULAR (c0, c1);
}
}
sw.Stop ();
Console.WriteLine ("res=" + res + ", t=" + sw.Elapsed.TotalSeconds.ToString ("0.00") + "s - ByteDiff_REGULAR");
}
{
var sw = Stopwatch.StartNew ();
long res = 0;
for (int reps = 0; reps < repeat; reps++) {
for (int i = 0; i < n; i++) {
int c0 = t0[i];
int c1 = t1[i];
res += ByteDiff_UNSAFE (c0, c1);
}
}
sw.Stop ();
Console.WriteLine ("res=" + res + ", t=" + sw.Elapsed.TotalSeconds.ToString ("0.00") + "s - ByteDiff_UNSAFE_PTR");
}
Console.WriteLine ("GCs: " + GC.CollectionCount (0));
Console.WriteLine ("Test complete.");
Console.ReadKey (true);
}
public static int ByteDiff_REGULAR (int c0, int c1)
{
var c00 = (byte) (c0 >> (8 * 0));
var c01 = (byte) (c0 >> (8 * 1));
var c02 = (byte) (c0 >> (8 * 2));
var c03 = (byte) (c0 >> (8 * 3));
var c10 = (byte) (c1 >> (8 * 0));
var c11 = (byte) (c1 >> (8 * 1));
var c12 = (byte) (c1 >> (8 * 2));
var c13 = (byte) (c1 >> (8 * 3));
var d0 = (c00 - c10);
var d1 = (c01 - c11);
var d2 = (c02 - c12);
var d3 = (c03 - c13);
d0 *= d0;
d1 *= d1;
d2 *= d2;
d3 *= d3;
return d0 + d1 + d2 + d3;
}
private static int ByteDiff_UNSAFE (int c0, int c1)
{
unsafe {
byte* pc0 = (byte*) &c0;
byte* pc1 = (byte*) &c1;
int d0 = pc0[0] - pc1[0];
int d1 = pc0[1] - pc1[1];
int d2 = pc0[2] - pc1[2];
int d3 = pc0[3] - pc1[3];
d0 *= d0;
d1 *= d1;
d2 *= d2;
d3 *= d3;
return d0 + d1 + d2 + d3;
}
}
}
which yields for me (running as x64 Release on an i5):
Generating test data... complete.
GCs: 8
res=18324555528140, t=1.46s - ByteDiff_REGULAR
res=18324555528140, t=1.15s - ByteDiff_UNSAFE
res=18324555528140, t=1.73s - Diff_Alex1
res=18324555528140, t=1.63s - Diff_Alex2
res=18324555528140, t=3.59s - Diff_Alex3
res=18325828513740, t=3.90s - Diff_Alex4
GCs: 8
Test complete.
Most obviously, this would benefit from SIMD, but let us suppose I don't want to go there because it is a bit of a hassle.
Well avoid it if you want, but it's actually fairly well supported directly from C#. Short of offloading to the GPU, I would expect this to be by far the largest performance winner if the larger algorithm lends itself to SIMD processing.
http://www.drdobbs.com/architecture-and-design/simd-enabled-vector-types-with-c/240168888
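For example, on newer runtimes (.NET Core 3.0 and later) the hardware intrinsics used elsewhere on this page make this fairly direct. The method below is only a sketch of the idea, assuming SSE2 is available; it computes the same sum of squared byte differences for 16 bytes (four packed ints) at a time:
// Requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
static int Diff16(Vector128<byte> c0, Vector128<byte> c1)
{
    // Zero-extend the bytes to shorts by interleaving with zero, then subtract.
    Vector128<short> lo = Sse2.Subtract(
        Sse2.UnpackLow(c0, Vector128<byte>.Zero).AsInt16(),
        Sse2.UnpackLow(c1, Vector128<byte>.Zero).AsInt16());
    Vector128<short> hi = Sse2.Subtract(
        Sse2.UnpackHigh(c0, Vector128<byte>.Zero).AsInt16(),
        Sse2.UnpackHigh(c1, Vector128<byte>.Zero).AsInt16());
    // pmaddwd multiplies the shorts element-wise (squares here, since both operands
    // are the same) and adds adjacent pairs into four int lanes.
    Vector128<int> sq = Sse2.Add(
        Sse2.MultiplyAddAdjacent(lo, lo),
        Sse2.MultiplyAddAdjacent(hi, hi));
    // Horizontal sum of the four int lanes.
    sq = Sse2.Add(sq, Sse2.Shuffle(sq, 0x4E));   // lanes [2,3,0,1]
    sq = Sse2.Add(sq, Sse2.Shuffle(sq, 0xB1));   // lanes [1,0,3,2]
    return sq.ToScalar();
}
Loading 16 bytes at a time from t0/t1 (for example with Sse2.LoadVector128 over fixed pointers) would then replace four calls to Diff per loop iteration.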
Multithreading
Sure, use one thread per CPU core. You can also use constructs like Parallel.For and let .NET sort out how many threads to use. It's pretty good at that, but since you know this is certainly CPU bound you might (or might not) get a more optimal result by managing threads yourself.
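For example, the summation in the test harness could be written with the thread-local overload of Parallel.For (a sketch using the question's variable names, not tuned):
long total = 0;
Parallel.For(0, n,
    () => 0L,                                          // per-thread partial sum
    (i, state, local) => local + Diff(t0[i], t1[i]),   // loop body
    local => Interlocked.Add(ref total, local));       // merge partial sums once per thread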
As for speeding up the actual code block, it may be faster to use bit masking and bit shifting to get the individual values to work on, rather than using pointers. That has the additional benefit that you don't need an unsafe code block, e.g.
byte b0_leftmost = (byte)((c0 & 0xff000000) >> 24);
Besides the already mentioned SIMD options and running multiple operations in parallel, have you tried benchmarking some possible implementation variations on the theme? Like some of the options below.
I almost forgot to mention a very important optimization:
Add using System.Runtime.CompilerServices;
Add the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute to your method.
Like this:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int Diff(int c0, int c1)
{
unsafe
{
byte* pc0 = (byte*)&c0;
byte* pc1 = (byte*)&c1;
int sum = 0;
int dif = 0;
for (var i = 0; i < 4; i++, pc0++, pc1++)
{
dif = *pc0 - *pc1;
sum += (dif * dif);
}
return sum;
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int Diff(int c0, int c1)
{
unchecked
{
int sum = 0;
int dif = 0;
for (var i = 0; i < 4; i++)
{
dif = (c0 & 0xFF) - (c1 & 0xFF);
c0 >>= 8;
c1 >>= 8;
sum += (dif * dif);
}
return sum;
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int Diff(int c0, int c1)
{
unsafe
{
int* difs = stackalloc int[4];
byte* pc0 = (byte*)&c0;
byte* pc1 = (byte*)&c1;
difs[0] = pc0[0] - pc1[0];
difs[1] = pc0[1] - pc1[1];
difs[2] = pc0[2] - pc1[2];
difs[3] = pc0[3] - pc1[3];
return difs[0] * difs[0] + difs[1] * difs[1] + difs[2] * difs[2] + difs[3] * difs[3];
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int Diff(int c0, int c1)
{
unsafe
{
int* difs = stackalloc int[4];
difs[0] = (c0 >> 24) - (c1 >> 24);
difs[1] = ((c0 >> 16) & 0xFF) - ((c1 >> 16) & 0xFF);
difs[2] = ((c0 >> 8) & 0xFF) - ((c1 >> 8) & 0xFF);
difs[3] = (c0 & 0xFF) - (c1 & 0xFF);
return difs[0] * difs[0] + difs[1] * difs[1] + difs[2] * difs[2] + difs[3] * difs[3];
}
}
I tried to reduce the IL instruction count (it looks like that's the only option for single-threaded, non-SIMD code). This code runs 35% faster on my machine than the version in the question. You could also try generating the IL yourself via System.Reflection.Emit for finer control; a rough sketch follows the code below.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int ByteDiff_UNSAFE_2 (int c0, int c1)
{
unsafe {
byte* pc0 = (byte*) &c0;
byte* pc1 = (byte*) &c1;
int d0 = pc0[0] - pc1[0];
d0 *= d0;
int d1 = pc0[1] - pc1[1];
d0 += d1 * d1;
int d2 = pc0[2] - pc1[2];
d0 += d2 * d2;
int d3 = pc0[3] - pc1[3];
return d0 + d3 * d3;
}
}
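For completeness, here is a rough sketch (my own illustration, not the answer's code) of generating the same computation with System.Reflection.Emit. Note that a delegate built this way cannot be inlined by the JIT, so the call overhead may outweigh any gain; it is mainly useful for experimenting with different IL sequences:
// Requires: using System; using System.Reflection.Emit;
static Func<int, int, int> BuildDiff()
{
    var dm = new DynamicMethod("Diff", typeof(int), new[] { typeof(int), typeof(int) });
    var il = dm.GetILGenerator();
    var sum = il.DeclareLocal(typeof(int));
    il.Emit(OpCodes.Ldc_I4_0);
    il.Emit(OpCodes.Stloc, sum);
    for (int shift = 0; shift < 32; shift += 8)
    {
        EmitByte(il, OpCodes.Ldarg_0, shift);   // ((c0 >> shift) & 0xFF)
        EmitByte(il, OpCodes.Ldarg_1, shift);   // ((c1 >> shift) & 0xFF)
        il.Emit(OpCodes.Sub);                   // d
        il.Emit(OpCodes.Dup);
        il.Emit(OpCodes.Mul);                   // d * d
        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, sum);            // sum += d * d
    }
    il.Emit(OpCodes.Ldloc, sum);
    il.Emit(OpCodes.Ret);
    return (Func<int, int, int>)dm.CreateDelegate(typeof(Func<int, int, int>));
}
static void EmitByte(ILGenerator il, OpCode loadArg, int shift)
{
    il.Emit(loadArg);
    if (shift != 0)
    {
        il.Emit(OpCodes.Ldc_I4, shift);
        il.Emit(OpCodes.Shr);
    }
    il.Emit(OpCodes.Ldc_I4, 0xFF);
    il.Emit(OpCodes.And);
}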
I have an array of interleaved signed 24-bit ints (complex numbers) in little endian order that I would like to convert to a complex array of floats or doubles. By interleaved, I mean:
R1 R2 R3 I1 I2 I3 R4 R5 R6 I4 I5 I6 . . .
Where each item is an 8-bit byte, and each three together are a 24-bit int, with R = real and I = imaginary.
What's the most efficient way to do this in C#? The code has to run many times, so I'm trying to squeeze every last cycle out of it that I can. I'm hoping for something more efficient than brute-force shift/or/cast operations.
I wouldn't mind using unsafe code in this case, if it would help.
Here's the baseline, brute-force approach, with the second number of the pair commented out and with sign handling ignored for the moment, to simplify the IL:
class Program
{
const int Size = 10000000;
static void Main(string[] args)
{
//
// Array of little-endian 24-bit complex ints
// (least significant byte first)
//
byte[] buf = new byte[3 * 2 * Size];
float[] real = new float[Size];
//float[] imag = new float[Size];
//
// The brute-force way
//
int j = 0;
Stopwatch timer = new Stopwatch();
timer.Start();
for (int i = 0; i < Size; i++)
{
real[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
j += 3;
// imag[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
j += 3;
}
timer.Stop();
Console.WriteLine("result = " +
(float)(timer.ElapsedMilliseconds * 1000.0f / Size) +
" microseconds per complex number");
Console.ReadLine();
}
}
and the associated IL:
IL_0024: ldc.i4.0
IL_0025: stloc.s i
IL_0027: br.s IL_0050
IL_0029: ldloc.1
IL_002a: ldloc.s i
IL_002c: ldloc.0
IL_002d: ldloc.2
IL_002e: ldelem.u1
IL_002f: ldloc.0
IL_0030: ldloc.2
IL_0031: ldc.i4.1
IL_0032: add
IL_0033: ldelem.u1
IL_0034: ldc.i4.8
IL_0035: shl
IL_0036: or
IL_0037: ldloc.0
IL_0038: ldloc.2
IL_0039: ldc.i4.2
IL_003a: add
IL_003b: ldelem.u1
IL_003c: ldc.i4.s 16
IL_003e: shl
IL_003f: or
IL_0040: conv.r4
IL_0041: stelem.r4
IL_0042: ldloc.2
IL_0043: ldc.i4.3
IL_0044: add
IL_0045: stloc.2
IL_0046: ldloc.2
IL_0047: ldc.i4.3
IL_0048: add
IL_0049: stloc.2
IL_004a: ldloc.s i
IL_004c: ldc.i4.1
IL_004d: add
IL_004e: stloc.s i
IL_0050: ldloc.s i
IL_0052: ldc.i4 0x989680
IL_0057: blt.s IL_0029
Late to the party but this looked like fun ;-)
A couple of experiments (using unsafe) below. Method1() is yours. On my laptop, with an AnyCPU Release build, there's a consistent 20%-odd improvement for Method2() and no significant additional benefit for Method3(). (Timed over 100_000_000 iterations.)
I was going for pointers without (explicit) shifting (masking unavoidable).
Some typical results...
result = 0.0075 microseconds per complex number
result = 0.00542 microseconds per complex number
result = 0.00516 microseconds per complex number
result = 0.00753 microseconds per complex number
result = 0.0052 microseconds per complex number
result = 0.00528 microseconds per complex number
Code...
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
namespace SO_20210326
{
// Enable unsafe code
[StructLayout(LayoutKind.Explicit, Pack = 1, Size = 6)]
struct NumPair
{
[FieldOffset(0)] public int r;
[FieldOffset(3)] public int i;
}
class Program
{
const int Size = 100000000;
static void Method1()
{
//
// Array of little-endian 24-bit complex ints
// (least significant byte first)
//
byte[] buf = new byte[3 * 2 * Size];
float[] real = new float[Size];
float[] imag = new float[Size];
//
// The brute-force way
//
int j = 0;
Stopwatch timer = new Stopwatch();
timer.Start();
for (int i = 0; i < Size; i++)
{
real[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
j += 3;
imag[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
j += 3;
}
timer.Stop();
Console.WriteLine("result = " +
(float)(timer.ElapsedMilliseconds * 1000.0f / Size) +
" microseconds per complex number");
}
static void Method2()
{
NumPair[] buf = new NumPair[Size];
float[] real = new float[Size];
float[] imag = new float[Size];
Stopwatch timer = new Stopwatch();
timer.Start();
for (int i = 0; i < Size; i++)
{
real[i] = buf[i].r & 0xffffff00;
imag[i] = buf[i].i & 0xffffff00;
}
timer.Stop();
Console.WriteLine("result = " +
(float)(timer.ElapsedMilliseconds * 1000.0f / Size) +
" microseconds per complex number");
}
static void Method3()
{
unsafe
{
NumPair[] buf = new NumPair[Size];
float[] real = new float[Size];
float[] imag = new float[Size];
Stopwatch timer = new Stopwatch();
timer.Start();
fixed (void* pvalue = &buf[0])
{
var p = (byte*)pvalue;
for (int i = 0; i < Size; i++)
{
real[i] = *(int*)p & 0xffffff00;
p += 3;
imag[i] = *(int*)p & 0xffffff00;
p += 3;
}
}
timer.Stop();
Console.WriteLine("result = " +
(float)(timer.ElapsedMilliseconds * 1000.0f / Size) +
" microseconds per complex number");
}
}
static void Main(string[] args)
{
Method1();
Method2();
Method3();
Console.ReadLine();
}
}
}
Bit shifting should be very fast, although you'll need to operate on a bigger unit, i.e. an int or long instead of a byte. That avoids multiple shifts to combine the three bytes, but it also means you must do an unsafe cast from the byte* buffer to a uint* or ulong* one. Objects in .NET are aligned to 4 or 8 bytes by default, but in a few cases you might need to take care of the alignment yourself.
// Requires an unsafe context; the data is least-significant-byte-first, as in the
// question (sign extension ignored, like the baseline).
fixed (byte* pBuf = buf)
{
    uint* dataIn = (uint*)pBuf;
    uint* end = (uint*)(pBuf + buf.Length);
    float[] real = new float[buf.Length / 6];
    float[] imag = new float[buf.Length / 6];
    int i = 0;
    for (uint* data = dataIn; data < end; data += 3, i += 2)
    {
        // Extract (R123, I123) and (R456, I456) at once
        // D1--------- D2--------- D3---------
        // R1 R2 R3 I1 I2 I3 R4 R5 R6 I4 I5 I6
        real[i]     = data[0] & 0xffffff;
        imag[i]     = (data[0] >> 24) | ((data[1] & 0xffff) << 8);
        real[i + 1] = (data[1] >> 16) | ((data[2] & 0xff) << 16);
        imag[i + 1] = data[2] >> 8;
    }
}
Doing it with ulong is similar, but you'll extract 4 complex numbers per iteration instead of the 2 above. If the size is not a multiple of 12 bytes (the amount the loop above consumes per iteration), the remaining bytes have to be extracted separately after the loop.
You can also run the conversion in multiple threads easily, as sketched below. On newer .NET versions SIMD may be another option as well, since .NET Core has added many SIMD intrinsics.