Why is my sine algorithm much slower than the default? - C#

const double pi = 3.1415926535897;

static double mysin(double x) {
    return ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
               0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
               1.000023121) * x;
}

static void Main(string[] args) {
    Stopwatch sw = new Stopwatch();
    double a = 0;
    double[] arg = new double[1000000];
    for (int i = 0; i < 1000000; i++) {
        arg[i] = (pi / 2000000);
    }

    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + Math.Sin(arg[i]);
    }
    sw.Stop();
    double t1 = sw.Elapsed.TotalMilliseconds;

    a = 0;
    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + mysin(arg[i]);
    }
    sw.Stop();
    double t2 = sw.Elapsed.TotalMilliseconds;

    Console.WriteLine("{0}\n{1}\n", t1, t2);
    Console.Read();
}
This power series is valid on [0, pi/2], and it is 10 times slower than the built-in sine function in release mode: 1 ms vs. 10 ms.
But when I copy-paste the mysin code directly into the loop, I get practically the same time in release mode, and my code is about 4 times faster in debug mode.
a = 0;
sw.Restart();
for (int i = 0; i < 1000000; i++) {
    double x = arg[i];
    a = a + ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
               0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
               1.000023121) * x;
    //a = a + mysin(arg[i]);
}
What is the deal here? How do I make this sort of calculation faster? I am guessing the compiler recognizes that the sin method should not be called, but rather inlined into the loop. How do I make the compiler do the same for me?
One more question: would C++ do the same optimization for its default sin/cos functions? If not, how would I make sure that it does?
Edit: I tested it, and my sine function (with 4 extra if conditions added to expand the domain to all reals) runs about 25% faster (albeit inaccurately) than the default sin function. And in fact, the copy-pasted version runs slower than when I write it as a separate function.

I assume that you tested this on x86, because I cannot repro the numbers on x64. On x64, your code actually appears to be faster.
I disassembled the code for x86/release. The reason for the difference is that your method is just that, a method, whereas Math.Sin is compiled to use the x86 fsin instruction directly, thus eliminating a function call per invocation.
FWIW, the x64 code is quite different. Math.Sin is translated into clr!COMDouble::Sin.
See FSIN.

Related

Matrix3x2 Performance

In my graphics application, I can represent matrices using either SharpDX.Matrix3x2 or System.Numerics.Matrix3x2. However, upon running both matrices through a performance test, I found that SharpDX's matrices handily defeat System.Numerics.Matrix3x2 by a margin of up to 70% in terms of time. My test was a pretty simple repeated multiplication, here's the code:
var times1 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = SharpDX.Matrix3x2.Identity;
    for (var j = 0; j < 10000; j++)
        mat *= SharpDX.Matrix3x2.Rotation(13);
    sw.Stop();
    times1.Add(sw.ElapsedTicks);
}

var times2 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = System.Numerics.Matrix3x2.Identity;
    for (var j = 0; j < 10000; j++)
        mat *= System.Numerics.Matrix3x2.CreateRotation(13);
    sw.Stop();
    times2.Add(sw.ElapsedTicks);
}

TestContext.WriteLine($"SharpDX: {times1.Average()}\nSystem.Numerics: {times2.Average()}");
I ran these tests on an Intel i5-6200U processor.
Now, my question is, how can SharpDX's matrices possibly be faster? Isn't System.Numerics.Matrix3x2 supposed to utilise SIMD instructions to execute faster?
The implementation of SharpDX.Matrix3x2 is available here, and as you can see, it's written in plain C#.
It turns out that my testing logic was flawed: I was creating the rotation matrix inside the loop, which meant that I was testing the creation of rotation matrices as well as the multiplication. I revised my testing code to look like this:
var times1 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = SharpDX.Matrix3x2.Identity;
    var s = SharpDX.Matrix3x2.Scaling(13);
    var r = SharpDX.Matrix3x2.Rotation(13);
    var t = SharpDX.Matrix3x2.Translation(13, 13);
    for (var j = 0; j < 10000; j++)
    {
        mat *= s;
        mat *= r;
        mat *= t;
    }
    sw.Stop();
    times1.Add(sw.ElapsedTicks);
}

var times2 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = System.Numerics.Matrix3x2.Identity;
    var s = System.Numerics.Matrix3x2.CreateScale(13);
    var r = System.Numerics.Matrix3x2.CreateRotation(13);
    var t = System.Numerics.Matrix3x2.CreateTranslation(13, 13);
    for (var j = 0; j < 10000; j++)
    {
        mat *= s;
        mat *= r;
        mat *= t;
    }
    sw.Stop();
    times2.Add(sw.ElapsedTicks);
}
So that the only thing performed inside the loop was multiplication, and I began to receive results indicating better performance from System.Numerics.Matrix3x2.
Another point: I didn't pay attention to the fact that SIMD optimisations only take effect in 64-bit code. These are my test results before and after changing the platform to x64:
Platform Target | System.Numerics.Matrix3x2 | SharpDX.Matrix3x2
----------------|---------------------------|------------------
AnyCPU          | 168ms                     | 197ms
x64             | 1.40ms                    | 1.43ms
When I check Environment.Is64BitProcess under AnyCPU, it returns false - and the "Prefer 32-Bit" box in Visual Studio is greyed out, so I suspect that AnyCPU is just an alias for x86 in this case, which explains why the test is 2 orders of magnitude faster under x64.
There are a few other things you need to consider with this kind of testing. These are just side notes and won't affect your current results; I've done some testing like this too.
Some corresponding functions in SharpDX pass by value, not by reference; there are corresponding by-reference overloads you might want to play with. You've used the operators in your testing (all fine, it's a comparable test!). Just note that in some situations, use of the operators is slower than the by-reference functions.

Linear regression gradient descent using C#

I'm taking the Coursera machine learning course right now and I can't get my gradient descent linear regression function to minimize. I use one dependent variable, an intercept, and four values of x and y, so the equations are fairly simple. The final values of the gradient descent vary wildly depending on the initial values of alpha and beta, and I can't figure out why.
I've only been coding for about two weeks, so my knowledge is limited, to say the least; please keep this in mind if you take the time to help.
using System;

namespace LinearRegression
{
    class Program
    {
        static void Main(string[] args)
        {
            Random rnd = new Random();
            const int N = 4;

            //We randomize the initial values of alpha and beta
            double theta1 = rnd.Next(0, 100);
            double theta2 = rnd.Next(0, 100);

            //Values of x, i.e. the independent variable
            double[] x = new double[N] { 1, 2, 3, 4 };
            //Values of y, i.e. the dependent variable
            double[] y = new double[N] { 5, 7, 9, 12 };

            double sumOfSquares1;
            double sumOfSquares2;
            double temp1;
            double temp2;
            double sum;
            double learningRate = 0.001;
            int count = 0;

            do
            {
                //We reset the generalized cost function, called sum of squares
                //since I originally used SS to
                //determine if the function was minimized
                sumOfSquares1 = 0;
                sumOfSquares2 = 0;

                //Adding 1 to the counter for each iteration to keep track of how
                //many iterations are completed thus far
                count += 1;

                //First we calculate the generalized cost function, which is
                //to be minimized
                sum = 0;
                for (int i = 0; i < (N - 1); i++)
                {
                    sum += Math.Pow((theta1 + theta2 * x[i] - y[i]), 2);
                }
                //Since we have 4 values of x and y we have 1/(2*N) = 1/8 = 0.125
                sumOfSquares1 = 0.125 * sum;

                //Then we calculate the new alpha value, using the derivative of
                //the cost function.
                sum = 0;
                for (int i = 0; i < (N - 1); i++)
                {
                    sum += theta1 + theta2 * x[i] - y[i];
                }
                //Since we have 4 values of x and y we have 1/N = 1/4 = 0.25
                temp1 = theta1 - learningRate * 0.25 * sum;

                //Same for the beta value; it has a different derivative
                sum = 0;
                for (int i = 0; i < (N - 1); i++)
                {
                    sum += (theta1 + theta2 * x[i]) * x[i] - y[i];
                }
                temp2 = theta2 - learningRate * 0.25 * sum;

                //We change the values of alpha and beta at the same time, otherwise the
                //function won't work
                theta1 = temp1;
                theta2 = temp2;

                //We then calculate the cost function again, with the new alpha and beta values
                sum = 0;
                for (int i = 0; i < (N - 1); i++)
                {
                    sum += Math.Pow((theta1 + theta2 * x[i] - y[i]), 2);
                }
                sumOfSquares2 = 0.125 * sum;

                Console.WriteLine("Alpha: {0:N}", theta1);
                Console.WriteLine("Beta: {0:N}", theta2);
                Console.WriteLine("GCF Before: {0:N}", sumOfSquares1);
                Console.WriteLine("GCF After: {0:N}", sumOfSquares2);
                Console.WriteLine("Iterations: {0}", count);
                Console.WriteLine(" ");
            } while (sumOfSquares2 <= sumOfSquares1 && count < 5000);
            //We end the iteration cycle once the generalized cost function
            //cannot be reduced any further or after 5000 iterations

            Console.ReadLine();
        }
    }
}
There are two bugs in the code.
First, I assume that you would like to iterate through all the elements in the array, so rework the for loops like this: for (int i = 0; i < N; i++)
Second, when updating the theta2 value, the summation is not calculated correctly. According to the update function it should look like this: sum += (theta1 + theta2 * x[i] - y[i]) * x[i];
Why the final values depend on the initial values?
Because the gradient descent update steps are calculated from these values. If the initial values (the starting point) are too big or too small, then they will be too far away from the final values. You could address this by:
Increasing the number of iterations (e.g. 5000 to 50000): the gradient descent algorithm has more time to converge.
Increasing the learning rate (e.g. 0.001 to 0.01): the gradient descent update steps are bigger, therefore it converges faster. Note: if the learning rate is too large, then it is possible to step over the global minimum.
The slope (theta2) is around 2.3 and the intercept (theta1) is around 2.5 for the given data. I have created a GitHub project to fix your code, and I have also added a shorter solution using LINQ; it is five lines of code. If you are curious, check it out here.

boost's gmp_rational type is very slow when performing comparison operations

I'm comparing the performance of boost's gmp_rational datatype with the SolverFoundation.Rational type of C#. Performing arithmetic with gmp_rational is much faster than C# SolverFoundation.Rational, except for comparison operations.
I have implemented the following function in C++ and C# and made a comparison of its performance.
typedef mpq_rational NT;

void test()
{
    NT x = 3.05325557;
    NT y = 2.65684334;
    NT z, j, k;

    std::clock_t start;
    double duration;
    start = std::clock();

    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0)
            x = 3;
        x = x * y;
        z = x + y;
        j = x + y;
        k = x + y;
        bool r1 = j > k; // takes very long
    }

    duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "duration: " << duration << '\n';
}
Without the last comparison operation "j > k", the function needs 5.5 seconds. With it, the function needs 33 seconds.
I have implemented the same method in C# and did the same comparison. Without the last comparison operation "j > k", the method needs 19 seconds. With it, the method needs 19.6 seconds. So the C# code is even faster than the C++ code, but I don't understand why.

Math.Pow() vs Math.Exp() C# .Net

Can anyone provide an explanation of the difference between using Math.Pow() and Math.Exp() in C# and .NET?
Is Exp() just taking a number to a power using itself as the exponent?
Math.Pow computes x^y for some x and y.
Math.Exp computes e^x for some x, where e is Euler's number.
Note that while Math.Pow(Math.E, d) produces the same result as Math.Exp(d), a quick benchmark comparison shows that Math.Exp actually executes about twice as fast as Math.Pow:
Trial | Operations |       Pow |       Exp
------|------------|-----------|----------
    1 |       1000 | 0.0002037 | 0.0001344  (seconds)
    2 |     100000 | 0.0106623 | 0.0046347
    3 |   10000000 | 1.0892492 | 0.4677785
Math.Pow(Math.E, n) = Math.Exp(n) // of course this is not actual code, just the mathematical identity.
More info: Math.Pow and Math.Exp
Math.Exp(x) is e^x. (See http://en.wikipedia.org/wiki/E_(mathematical_constant).)
Math.Pow(a, b) is a^b.
Math.Pow(Math.E, x) and Math.Exp(x) are the same, though the second one is the idiomatic one to use if you are using e as the base.
Just a quick extension to the benchmark contribution from p.s.w.g:
I wanted to see one more comparison, for the equivalent of 10^x ==> e^(x * ln(10)), i.e. { double ln10 = Math.Log(10.0); y = Math.Exp(x * ln10); }
Here's what I've got:
Operation          | Time
-------------------|-------------------
Math.Exp(x)        | 180 ns (nanoseconds)
Math.Pow(y, x)     | 440 ns
Math.Exp(x * ln10) | 160 ns
Times are per 10x calls to Math functions.
What I don't understand is why including a multiply in the loop, before entry to Exp(), consistently produces shorter times, unless there's a bug in this code or the algorithm is value-dependent.
The program follows.
namespace _10X {
    public partial class Form1 : Form {
        int nLoops = 1000000;
        int ix;
        // Values - just to not always use the same number, and to confirm values.
        double[] x = { 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5 };

        public Form1() {
            InitializeComponent();
            Proc();
        }

        void Proc() {
            double y;
            long t0;
            double t1, t2, t3;

            t0 = DateTime.Now.Ticks;
            for (int i = 0; i < nLoops; i++) {
                for (ix = 0; ix < x.Length; ix++)
                    y = Math.Exp(x[ix]);
            }
            t1 = (double)(DateTime.Now.Ticks - t0) * 1e-7 / (double)nLoops;

            t0 = DateTime.Now.Ticks;
            for (int i = 0; i < nLoops; i++) {
                for (ix = 0; ix < x.Length; ix++)
                    y = Math.Pow(10.0, x[ix]);
            }
            t2 = (double)(DateTime.Now.Ticks - t0) * 1e-7 / (double)nLoops;

            double ln10 = Math.Log(10.0);
            t0 = DateTime.Now.Ticks;
            for (int i = 0; i < nLoops; i++) {
                for (ix = 0; ix < x.Length; ix++)
                    y = Math.Exp(x[ix] * ln10);
            }
            t3 = (double)(DateTime.Now.Ticks - t0) * 1e-7 / (double)nLoops;

            textBox1.Text = "t1 = " + t1.ToString("F8") + "\r\nt2 = " + t2.ToString("F8")
                + "\r\nt3 = " + t3.ToString("F8");
        }

        private void btnGo_Click(object sender, EventArgs e) {
            textBox1.Clear();
            Proc();
        }
    }
}
So I think I'm going with Math.Exp(x * ln10) until someone finds the bug...

How can I make this C# loop faster?

Executive Summary: Reed's answer below is the fastest if you want to stay in C#. If you're willing to marshal to C++ (which I am), that's a faster solution.
I have two 55mb ushort arrays in C#. I am combining them using the following loop:
float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
    image.DataArray[i] =
        (ushort)(mUIHandler.image1.DataArray[i] +
        (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}
According to DateTime.Now calls added before and after it, this code takes 3.5 seconds to run. How can I make it faster?
EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand new WPF application, I get these timing results:
Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods
So when arrays are walked directly, the time is much faster than if the arrays are inside of another object or container. This code shows that somehow, I'm using an accessor method, rather than accessing the arrays directly. Even so, the fastest I seem to be able to get is half a second. When I run the second listing of code in C++ with icc, I get:
Run time for pointer walk: 0.0743338
In this case, then, C++ is 7x faster (using icc, not sure if the same performance can be obtained with msvc-- I'm not as familiar with optimizations there). Is there any way to get C# near that level of C++ performance, or should I just have C# call my C++ routine?
Listing 1, C# code:
public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;

    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}

/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();
        Random random = new Random();
        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];
        ArrayHolder theArrayHolder = new ArrayHolder(length);
        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);

        stopwatch.Reset();
        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);

        stopwatch.Reset();
        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();
        Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);
    }
}
Listing 2, C++ equivalent:
// looptiming.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>

int _tmain(int argc, _TCHAR* argv[])
{
    int length = 55 * 1024 * 1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;

    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr) {
        *outPtr = rand() % max;
        *in1Ptr = rand() % max;
        *in2Ptr = rand() % max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;           // For converting tick into real time

    QueryPerformanceCounter(&tick1);
    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number / 100.0f;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr) {
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);

    time.QuadPart = tick2.QuadPart - tick1.QuadPart;
    std::cout << "Run time for pointer walk: " << (double)time.QuadPart / (double)ticksPerSecond.QuadPart << std::endl;
    return 0;
}
EDIT 2: Enabling /QxHost in the second example drops the time down to 0.0662714 seconds. Modifying the first loop as #Reed suggested gets me down to
Time elapsed: 00:00:00.3835017
So, still not fast enough for a slider. That time is via the code:
stopwatch.Start();
Parallel.ForEach(Partitioner.Create(0, length),
    (range) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        }
    });
stopwatch.Stop();
EDIT 3 As per #Eric Lippert's suggestion, I've rerun the code in C# in release, and, rather than use an attached debugger, just print the results to a dialog. They are:
Simple arrays: ~0.273s
Contained arrays: ~0.330s
Accessor arrays: ~0.345s
Parallel arrays: ~0.190s
(these numbers come from a 5 run average)
So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but is still well above the 0.074 seconds achievable with the icc-compiled C++. It seems, therefore, that the fastest solution is to compile in release and then marshal to an icc-compiled C++ executable, which makes using a slider possible here.
EDIT 4: Three more suggestions from #Eric Lippert: change the inside of the for loop from length to array.length, use doubles, and try unsafe code.
For those three, the timing is now:
length: ~0.274s
doubles, not floats: ~0.290s
unsafe: ~0.376s
So far, the parallel solution is the big winner. Although if I could add these via a shader, maybe I could see some kind of speedup there...
Here's the additional code:
stopwatch.Reset();
stopwatch.Start();
double b2 = ((double)number) / 100.0;
for (int i = 0; i < output.Length; ++i)
{
    output[i] =
        (ushort)(input1[i] +
        (ushort)(b2 * (double)input2[i]));
}
stopwatch.Stop();
DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;

stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < output.Length; ++i)
{
    output[i] =
        (ushort)(input1[i] +
        (ushort)(b * input2[i]));
}
stopwatch.Stop();
LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;
Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);

stopwatch.Reset();
stopwatch.Start();
unsafe
{
    fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2)
    {
        ushort* outP = outPtr;
        ushort* in1P = in1Ptr;
        ushort* in2P = in2Ptr;
        for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
        {
            *outP = (ushort)(*in1P + b * (float)*in2P);
        }
    }
}
stopwatch.Stop();
UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.Seconds + "." + stopwatch.Elapsed.Milliseconds;
Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);
This should be perfectly parallelizable. However, given the small amount of work being done per element, you'll need to handle this with extra care.
The proper way to do this (in .NET 4) would be to use Parallel.ForEach in conjunction with a Partitioner:
float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length),
    (range) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            image.DataArray[i] =
                (ushort)(mUIHandler.image1.DataArray[i] +
                (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
        }
    });
This will efficiently partition the work across available processing cores in your system, and should provide a decent speedup if you have multiple cores.
That being said, this will, at best, only speed up this operation by the number of cores in your system. If you need to speed it up more, you'll likely need to revert to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.
Assuming you have a lot of these guys, you can attempt to parallelize the operation (and you're using .NET 4):
Parallel.For(0, length, i =>
{
    image.DataArray[i] =
        (ushort)(mUIHandler.image1.DataArray[i] +
        (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
});
Of course that is all going to depend on whether or not parallelization of this would be worth it. That statement looks fairly computationally short; accessing indices by number is pretty fast as is. You might get gains because this loop is being run so many times with that much data.
