I compute the Pearson correlation (against average user/item ratings) many times, and with my current code the performance is very bad:
public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;

    for (int a = 0; a < x.Length; a++)
    {
        sumNum += (x[a] - meanX[a]) * (y[a] - meanY[a]);
        denomX += Math.Pow(x[a] - meanX[a], 2);
        denomY += Math.Pow(y[a] - meanY[a], 2);
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = Math.Sqrt(denomX) * Math.Sqrt(denomY);
    var correlation = sumNum / sumDenom;
    return correlation;
}
I use MathNet.Numerics elsewhere, but this is a modification of the standard Pearson correlation, so I can't use its built-in version. Is there a way to speed this up? How can it be optimized with respect to time complexity?
Adding to the MSE answer: changing Math.Pow(x, 2) to diff * diff is definitely something you want to do. You may also want to avoid unnecessary bounds checking in your innermost loop, which can be done using pointers in C#.

It could be done this way:
public unsafe double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;
    double diffX;
    double diffY;
    int len = x.Length;

    fixed (double* xptr = &x[0], yptr = &y[0], meanXptr = &meanX[0], meanYptr = &meanY[0])
    {
        for (int a = 0; a < len; a++)
        {
            diffX = xptr[a] - meanXptr[a];
            diffY = yptr[a] - meanYptr[a];
            sumNum += diffX * diffY;
            denomX += diffX * diffX;
            denomY += diffY * diffY;
        }
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = sqrtDenomX * sqrtDenomY;
    var correlation = sumNum / sumDenom;
    return correlation;
}
The best way to solve your performance problems is probably to avoid computing as many correlations, if possible. If you are using the correlations as part of another computation, it may be possible to use math to remove the need for some of them.
You should also consider whether you will be able to use the square of the Pearson correlation instead of the Pearson correlation itself. That way, you can save your calls to Math.Sqrt(), which are usually quite expensive.
If you do need to take the square root, you should use sqrtDenomX and sqrtDenomY again, rather than recompute the square roots.
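To illustrate the squared-correlation idea, here is a minimal sketch (my code, not from the original answer); note that r^2 discards the sign of the correlation, so this only applies if you never need to distinguish positive from negative correlation:

public double ComputeSquaredCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0, denomX = 0, denomY = 0;

    for (int a = 0; a < x.Length; a++)
    {
        double diffX = x[a] - meanX[a];
        double diffY = y[a] - meanY[a];
        sumNum += diffX * diffY;
        denomX += diffX * diffX;
        denomY += diffY * diffY;
    }

    double denom = denomX * denomY;
    if (denom == 0)
        return 0;

    // r^2 = (cross-product sum)^2 / (denomX * denomY) -- no Math.Sqrt needed
    return (sumNum * sumNum) / denom;
}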
The only further optimizations I see in your code are shown in the following version. If you are still looking for better performance after that, you may want to use SIMD vectorization, which lets you use the full computational power of the CPU:
public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;
    double diffX;
    double diffY;

    for (int a = 0; a < x.Length; a++)
    {
        diffX = x[a] - meanX[a];
        diffY = y[a] - meanY[a];
        sumNum += diffX * diffY;
        denomX += diffX * diffX;
        denomY += diffY * diffY;
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = sqrtDenomX * sqrtDenomY;
    var correlation = sumNum / sumDenom;
    return correlation;
}
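Since SIMD is mentioned above without code, here is a minimal sketch using System.Numerics.Vector&lt;double&gt; (my choice of API, not the original author's; hardware acceleration requires a RyuJIT-era runtime, e.g. .NET 4.6+ or .NET Core, or the System.Numerics.Vectors package):

using System;
using System.Numerics;

public static double ComputeCorrelationSimd(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    int width = Vector<double>.Count;
    var vSumNum = Vector<double>.Zero;
    var vDenomX = Vector<double>.Zero;
    var vDenomY = Vector<double>.Zero;

    int a = 0;
    for (; a <= x.Length - width; a += width)
    {
        var diffX = new Vector<double>(x, a) - new Vector<double>(meanX, a);
        var diffY = new Vector<double>(y, a) - new Vector<double>(meanY, a);
        vSumNum += diffX * diffY;
        vDenomX += diffX * diffX;
        vDenomY += diffY * diffY;
    }

    // horizontal sums of the vector accumulators
    double sumNum = Vector.Dot(vSumNum, Vector<double>.One);
    double denomX = Vector.Dot(vDenomX, Vector<double>.One);
    double denomY = Vector.Dot(vDenomY, Vector<double>.One);

    // scalar tail for lengths that are not a multiple of the vector width
    for (; a < x.Length; a++)
    {
        double dX = x[a] - meanX[a];
        double dY = y[a] - meanY[a];
        sumNum += dX * dY;
        denomX += dX * dX;
        denomY += dY * dY;
    }

    if (denomX == 0 || denomY == 0)
        return 0;

    return sumNum / (Math.Sqrt(denomX) * Math.Sqrt(denomY));
}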
Related
When using Math.Exp() in C# I have some questions

This code is about kernel density estimation. I don't have any knowledge of kernel density estimation, so I looked at the Wikipedia article and some papers.

I tried to write it in C#. The problem is that when "distance" gets larger, the result becomes 0. This confuses me and I cannot find any other way to get the right result.
disExp = Math.Pow(Math.E, -(distance / 2 * Math.Pow(h, 2)));
So, can anyone help me find a solution, or give me some pointers on kernel density estimation in C#?
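One thing to note about that line (an observation on operator precedence, not from the original thread): distance / 2 * Math.Pow(h, 2) evaluates left to right as (distance / 2) * h^2, whereas the Gaussian kernel exponent is -distance / (2 * h^2), assuming distance already holds the squared distance. A one-line correction would be:

disExp = Math.Exp(-distance / (2 * Math.Pow(h, 2)));

Math.Exp(z) is also the idiomatic form of Math.Pow(Math.E, z).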
Try this
public static double[,] KernelDensityEstimation(double[] data, double sigma, int nsteps)
{
    // Probability density function (PDF) estimation.
    // Works like ksdensity in MATLAB: performs kernel density estimation
    // (KDE) on one-dimensional data.
    // http://en.wikipedia.org/wiki/Kernel_density_estimation
    // Input:  - data:   input data, one-dimensional
    //         - sigma:  bandwidth (sometimes called "h")
    //         - nsteps: number of abscissa points (default 100 in the
    //           original MATLAB, where it may also be an array of points)
    // Output: - x: equispaced abscissa points
    //         - y: estimates of p(x)
    // This function is part of the Kernel Methods Toolbox (KMBOX) for MATLAB.
    // http://sourceforge.net/p/kmbox
    // Converted to C# code by ksandric.

    double[,] result = new double[nsteps, 2];
    double[] x = new double[nsteps], y = new double[nsteps];
    double MAX = Double.MinValue, MIN = Double.MaxValue;
    int N = data.Length; // number of data points

    // Find MIN and MAX values in data
    for (int i = 0; i < N; i++)
    {
        if (MAX < data[i])
        {
            MAX = data[i];
        }
        if (MIN > data[i])
        {
            MIN = data[i];
        }
    }

    // Like MATLAB linspace(MIN, MAX, nsteps); note the step must be
    // (MAX - MIN) / (nsteps - 1) so that x[nsteps - 1] reaches MAX
    x[0] = MIN;
    for (int i = 1; i < nsteps; i++)
    {
        x[i] = x[i - 1] + ((MAX - MIN) / (nsteps - 1));
    }

    // Kernel density estimation with a Gaussian kernel
    double c = 1.0 / Math.Sqrt(2 * Math.PI * sigma * sigma);
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < nsteps; j++)
        {
            y[j] = y[j] + 1.0 / N * c * Math.Exp(-(data[i] - x[j]) * (data[i] - x[j]) / (2 * sigma * sigma));
        }
    }

    // Compilation of X, Y into the result. Good for creating plot(x, y)
    for (int i = 0; i < nsteps; i++)
    {
        result[i, 0] = x[i];
        result[i, 1] = y[i];
    }

    return result;
}
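A hypothetical usage example (the sample values and bandwidth are mine, just to show the call shape):

double[] samples = { 1.2, 1.9, 2.1, 2.8, 3.0, 3.1, 4.5 };
double[,] kde = KernelDensityEstimation(samples, 0.5, 100);
// kde[i, 0] is the abscissa x, kde[i, 1] the estimate of p(x); plot as (x, y)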
How to find the min and max for a quadratic equation using C#?

f(x, y) = x^2 + y^2 + 25 * (sin(x)^2 + sin(y)^2), where x, y range over (-2*Pi, 2*Pi).

Solving by hand I get min = 0 and max = 8*Pi^2 = 78.957.

I tried to write the code based on a linear/quadratic example, but something goes totally wrong: this code gives min = -4.?? and max = 96. Could you help me see where my mistake is?

I uploaded the code to Dropbox if anyone wants to have a look: https://www.dropbox.com/s/p7y6krk2gk29i9e/Program.cs
double[] X, Y, Result; // range arrays and result array

private void BtnRun_Click(object sender, EventArgs e)
{
    // Set the range for the function
    X = setRange(-2 * Math.PI, 2 * Math.PI, 10000);
    Y = setRange(-2 * Math.PI, 2 * Math.PI, 10000);
    Result = getOutput_twoVariablesFunction(X, Y);
    int MaxIndex = getMaxIndex(Result);
    int MinIndex = getMinIndex(Result);
    TxtMin.Text = Result[MinIndex].ToString();
    TxtMax.Text = Result[MaxIndex].ToString();
}

private double twoVariablesFunction(double x, double y)
{
    double f;
    // The two-variable function under test
    f = Math.Pow(x, 2) + Math.Pow(y, 2) + 25 * (Math.Pow(Math.Sin(x), 2) + Math.Pow(Math.Sin(y), 2));
    return f;
}

private double[] setRange(double Start, double End, int Sample)
{
    double Step = (End - Start) / Sample;
    double CurrentValue = Start;
    double[] Array = new double[Sample];
    for (int Index = 0; Index < Sample; Index++)
    {
        Array[Index] = CurrentValue;
        CurrentValue += Step;
    }
    return Array;
}

private double[] getOutput_twoVariablesFunction(double[] X, double[] Y)
{
    int Step = X.Length;
    double[] Array = new double[Step];
    for (int Index = 0; Index < X.Length; Index++)
    {
        Array[Index] = twoVariablesFunction(X[Index], Y[Index]);
    }
    return Array;
}

private int getMaxIndex(double[] ValuesArray)
{
    double M = ValuesArray.Max();
    int Index = ValuesArray.ToList().IndexOf(M);
    return Index;
}

private int getMinIndex(double[] ValuesArray)
{
    double M = ValuesArray.Min();
    int Index = ValuesArray.ToList().IndexOf(M);
    return Index;
}
Do you want to compute (sin(x))^2 or sin(x^2)? In your f(x,y) formula it looks like (sin(x))^2, but in your method twoVariablesFunction it looks like sin(x^2).
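Another thing worth checking (my own observation, not part of the original comment): getOutput_twoVariablesFunction pairs X[Index] with Y[Index], so the code only samples f along the diagonal y = x rather than over the whole square. A sketch of a full grid evaluation (use a coarser grid, since 10000 x 10000 would be 10^8 evaluations):

private double[] getOutput_twoVariablesFunctionGrid(double[] X, double[] Y)
{
    // Evaluate f at every (X[i], Y[j]) pair, not just the diagonal
    double[] results = new double[X.Length * Y.Length];
    int k = 0;
    for (int i = 0; i < X.Length; i++)
    {
        for (int j = 0; j < Y.Length; j++)
        {
            results[k++] = twoVariablesFunction(X[i], Y[j]);
        }
    }
    return results;
}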
I'm looking to produce a data histogram from a given dataset. I've read about different options for constructing the histogram and I'm most interested in a method based on the work of
Shimazaki, H.; Shinomoto, S. (2007). "A method for selecting the bin
size of a time histogram"
The above method uses estimation to determine the optimal bin width and distribution, which is needed in my case because the sample data will vary in distribution, making it hard to determine the bin count and width in advance.

Can someone recommend a good source or a starting point for writing such a function in C#, or point me to a reasonably close C# histogram implementation?
Many thanks.
The following is a port I wrote of the Python version of this algorithm from here. I know the API could do with some work, but this should be enough to get you started. The results of this code are identical to those produced by the Python code for the same input data.
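For reference (my summary of the paper, not part of the original answer): the method picks the bin width D that minimizes the Shimazaki-Shinomoto cost C(D) = (2 * mean - variance) / D^2, where mean and variance are computed over the bin counts at that width; that is exactly what C[i] holds in the loop below.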
public class HistSample
{
    public static void CalculateOptimalBinWidth(double[] x)
    {
        double xMax = x.Max(), xMin = x.Min();
        int minBins = 4, maxBins = 50;
        double[] N = Enumerable.Range(minBins, maxBins - minBins)
                               .Select(v => (double)v).ToArray();
        double[] D = N.Select(v => (xMax - xMin) / v).ToArray();
        double[] C = new double[D.Length];

        for (int i = 0; i < N.Length; i++)
        {
            double[] binIntervals = LinearSpace(xMin, xMax, (int)N[i] + 1);
            double[] ki = Histogram(x, binIntervals);
            ki = ki.Skip(1).Take(ki.Length - 2).ToArray();

            double mean = ki.Average();
            double variance = ki.Select(v => Math.Pow(v - mean, 2)).Sum() / N[i];

            C[i] = (2 * mean - variance) / (Math.Pow(D[i], 2));
        }

        double minC = C.Min();
        int index = C.Select((c, ix) => new { Value = c, Index = ix })
                     .Where(c => c.Value == minC).First().Index;
        double optimalBinWidth = D[index];
    }

    public static double[] Histogram(double[] data, double[] binEdges)
    {
        double[] counts = new double[binEdges.Length - 1];

        for (int i = 0; i < binEdges.Length - 1; i++)
        {
            double lower = binEdges[i], upper = binEdges[i + 1];

            for (int j = 0; j < data.Length; j++)
            {
                if (data[j] >= lower && data[j] <= upper)
                {
                    counts[i]++;
                }
            }
        }

        return counts;
    }

    public static double[] LinearSpace(double a, double b, int count)
    {
        double[] output = new double[count];

        for (int i = 0; i < count; i++)
        {
            output[i] = a + ((i * (b - a)) / (count - 1));
        }

        return output;
    }
}
Run it like this:
double[] x =
{
4.37, 3.87, 4.00, 4.03, 3.50, 4.08, 2.25, 4.70, 1.73,
4.93, 1.73, 4.62, 3.43, 4.25, 1.68, 3.92, 3.68, 3.10,
4.03, 1.77, 4.08, 1.75, 3.20, 1.85, 4.62, 1.97, 4.50,
3.92, 4.35, 2.33, 3.83, 1.88, 4.60, 1.80, 4.73, 1.77,
4.57, 1.85, 3.52, 4.00, 3.70, 3.72, 4.25, 3.58, 3.80,
3.77, 3.75, 2.50, 4.50, 4.10, 3.70, 3.80, 3.43, 4.00,
2.27, 4.40, 4.05, 4.25, 3.33, 2.00, 4.33, 2.93, 4.58,
1.90, 3.58, 3.73, 3.73, 1.82, 4.63, 3.50, 4.00, 3.67,
1.67, 4.60, 1.67, 4.00, 1.80, 4.42, 1.90, 4.63, 2.93,
3.50, 1.97, 4.28, 1.83, 4.13, 1.83, 4.65, 4.20, 3.93,
4.33, 1.83, 4.53, 2.03, 4.18, 4.43, 4.07, 4.13, 3.95,
4.10, 2.27, 4.58, 1.90, 4.50, 1.95, 4.83, 4.12
};
HistSample.CalculateOptimalBinWidth(x);
Check the Histogram function: if any data element is unlucky enough to be exactly equal to a bin boundary (other than the first or last), it will be counted in both adjacent bins.

The code needs to check (lower <= data[j] && data[j] < upper) and handle the special case that elements equal to xMax must go into the last bin.
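A sketch of the fix described above, using half-open intervals [lower, upper) with the last bin closed:

public static double[] Histogram(double[] data, double[] binEdges)
{
    double[] counts = new double[binEdges.Length - 1];

    for (int i = 0; i < binEdges.Length - 1; i++)
    {
        double lower = binEdges[i], upper = binEdges[i + 1];
        bool lastBin = (i == binEdges.Length - 2);

        for (int j = 0; j < data.Length; j++)
        {
            // half-open interval, except the last bin which is closed
            // so that values equal to xMax are still counted exactly once
            bool inBin = lastBin
                ? (data[j] >= lower && data[j] <= upper)
                : (data[j] >= lower && data[j] < upper);
            if (inBin)
            {
                counts[i]++;
            }
        }
    }

    return counts;
}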
A small update to nick_w's answer, in case you actually need the bins afterwards. It also optimizes away the double loop in the histogram function and gets rid of the linspace function.
/// <summary>
/// Calculate the optimal bins for the given data
/// </summary>
/// <param name="x">The data you have</param>
/// <param name="xMin">The minimum element</param>
/// <param name="optimalBinWidth">The width between each bin</param>
/// <returns>The bins</returns>
public static int[] CalculateOptimalBinWidth(List<double> x, out double xMin, out double optimalBinWidth)
{
    var xMax = x.Max();
    xMin = x.Min();
    optimalBinWidth = 0;

    const int MIN_BINS = 1;
    const int MAX_BINS = 20;

    int[] minKi = null;
    var minOffset = double.MaxValue;

    foreach (var n in Enumerable.Range(MIN_BINS, MAX_BINS - MIN_BINS).Select(v => v * 5))
    {
        var d = (xMax - xMin) / n;
        var ki = Histogram(x, n, xMin, d);
        var ki2 = ki.Skip(1).Take(ki.Length - 2).ToArray();

        var mean = ki2.Average();
        var variance = ki2.Select(v => Math.Pow(v - mean, 2)).Sum() / n;
        var offset = (2 * mean - variance) / Math.Pow(d, 2);

        if (offset < minOffset)
        {
            minKi = ki;
            minOffset = offset;
            optimalBinWidth = d;
        }
    }

    return minKi;
}

private static int[] Histogram(List<double> data, int count, double xMin, double d)
{
    var histogram = new int[count];

    foreach (var t in data)
    {
        var bucket = (int)Math.Truncate((t - xMin) / d);
        if (count == bucket) // put values equal to xMax into the last bin
            bucket--;
        histogram[bucket]++;
    }

    return histogram;
}
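Hypothetical usage (variable names are mine; the sample values are the first few from the dataset above):

var data = new List<double> { 4.37, 3.87, 4.00, 4.03, 3.50, 4.08, 2.25, 4.70, 1.73 };
double xMin, binWidth;
int[] bins = CalculateOptimalBinWidth(data, out xMin, out binWidth);
// bins[i] counts the elements in [xMin + i * binWidth, xMin + (i + 1) * binWidth)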
I would recommend binary search to speed up the assignment to the class intervals.
public void Add(double element)
{
    if (element < Bins.First().LeftBound || element > Bins.Last().RightBound)
        return;

    // Values exactly equal to the last right bound would otherwise match
    // no half-open [LeftBound, RightBound) interval, so clamp them here.
    if (element == Bins.Last().RightBound)
    {
        Bins[Bins.Length - 1].Count++;
        return;
    }

    var min = 0;
    var max = Bins.Length - 1;
    var index = 0;

    while (min <= max)
    {
        index = min + ((max - min) / 2);
        if (element >= Bins[index].LeftBound && element < Bins[index].RightBound)
            break;
        if (element < Bins[index].LeftBound)
            max = index - 1;
        else
            min = index + 1;
    }

    Bins[index].Count++;
}
"Bins" is a list of items of type "HistogramItem" which defines properties like "Leftbound", "RightBound" and "Count".
I am trying to convert a C++ class to C# and in the process learn something about C++. I had never run into vector<> before, and my understanding is that it is like a List<> in C#. During the conversion I rewrote that part as List<double> futures_prices = new List<double>(Convert.ToInt32(no_steps) + 1);. As soon as I run the code, I get an "Index was out of range" error.

Having looked around on SO, I believe the issue is an index being out of range, but I do not see a simple way to fix it in the code below.

In particular, this is the line that triggers the error: futures_prices[0] = spot_price * Math.Pow(d, no_steps);

Below is the full code:
public double futures_option_price_call_american_binomial(double spot_price, double option_strike, double r, double sigma, double time, double no_steps)
{
    //double spot_price,    // price of futures contract
    //double option_strike, // exercise price
    //double r,             // interest rate
    //double sigma,         // volatility
    //double time,          // time to maturity
    //int no_steps

    List<double> futures_prices = new List<double>(Convert.ToInt32(no_steps) + 1);
    List<double> call_values = new List<double>(Convert.ToInt32(no_steps) + 1);

    double t_delta = time / no_steps;
    double Rinv = Math.Exp(-r * (t_delta));
    double u = Math.Exp(sigma * Math.Sqrt(t_delta));
    double d = 1.0 / u;
    double uu = u * u;
    double pUp = (1 - d) / (u - d); // note how probability is calculated
    double pDown = 1.0 - pUp;

    futures_prices[0] = spot_price * Math.Pow(d, no_steps);

    int i;
    for (i = 1; i <= no_steps; ++i) futures_prices[i] = uu * futures_prices[i - 1]; // terminal tree nodes
    for (i = 0; i <= no_steps; ++i) call_values[i] = Math.Max(0.0, (futures_prices[i] - option_strike));

    for (int step = Convert.ToInt32(no_steps) - 1; step >= 0; --step)
    {
        for (i = 0; i <= step; ++i)
        {
            futures_prices[i] = d * futures_prices[i + 1];
            call_values[i] = (pDown * call_values[i] + pUp * call_values[i + 1]) * Rinv;
            call_values[i] = Math.Max(call_values[i], futures_prices[i] - option_strike); // check for exercise
        }
    }

    return call_values[0];
}
Here is the original source in C++:
double futures_option_price_call_american_binomial(const double& F,     // price futures contract
                                                   const double& K,     // exercise price
                                                   const double& r,     // interest rate
                                                   const double& sigma, // volatility
                                                   const double& time,  // time to maturity
                                                   const int& no_steps) // number of steps
{
    vector<double> futures_prices(no_steps + 1);
    vector<double> call_values(no_steps + 1);
    double t_delta = time / no_steps;
    double Rinv = exp(-r * (t_delta));
    double u = exp(sigma * sqrt(t_delta));
    double d = 1.0 / u;
    double uu = u * u;
    double pUp = (1 - d) / (u - d); // note how probability is calculated
    double pDown = 1.0 - pUp;
    futures_prices[0] = F * pow(d, no_steps);
    int i;
    for (i = 1; i <= no_steps; ++i) futures_prices[i] = uu * futures_prices[i - 1]; // terminal tree nodes
    for (i = 0; i <= no_steps; ++i) call_values[i] = max(0.0, (futures_prices[i] - K));
    for (int step = no_steps - 1; step >= 0; --step) {
        for (i = 0; i <= step; ++i) {
            futures_prices[i] = d * futures_prices[i + 1];
            call_values[i] = (pDown * call_values[i] + pUp * call_values[i + 1]) * Rinv;
            call_values[i] = max(call_values[i], futures_prices[i] - K); // check for exercise
        }
    }
    return call_values[0];
}
A List<double> starts out empty until you add items to it. (passing the constructor argument just sets the capacity, preventing costly resizes)
You can't access [0] until you Add() it.
To use it the way you are, use an array instead.
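A quick illustration of the capacity/count distinction (my example):

var list = new List<double>(10);
Console.WriteLine(list.Capacity); // 10 -- space reserved, no resizes needed yet
Console.WriteLine(list.Count);    // 0  -- no elements exist yet
// list[0] = 1.0;                 // throws ArgumentOutOfRangeException

var array = new double[10];       // all 10 slots exist, initialized to 0.0
array[0] = 1.0;                   // fine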
As SLaks says, it's better to use an array in this situation. C# lists are filled through the Add method and shrunk through the Remove method; emulating indexed assignment with them would be more complicated and more expensive in memory and performance, especially since you are also replacing values.
public Double FuturesOptionPriceCallAmericanBinomial(Double spotPrice, Double optionStrike, Double r, Double sigma, Double time, Double steps)
{
    // Avoid calling Convert multiple times, as it can be quite expensive.
    Int32 stepsInteger = Convert.ToInt32(steps);

    Double[] futurePrices = new Double[(stepsInteger + 1)];
    Double[] callValues = new Double[(stepsInteger + 1)];

    Double tDelta = time / steps;
    Double rInv = Math.Exp(-r * (tDelta));
    Double u = Math.Exp(sigma * Math.Sqrt(tDelta));
    Double d = 1.0 / u;
    Double uu = u * u;
    Double pUp = (1 - d) / (u - d);
    Double pDown = 1.0 - pUp;

    futurePrices[0] = spotPrice * Math.Pow(d, steps);

    for (Int32 i = 1; i <= steps; ++i)
        futurePrices[i] = uu * futurePrices[(i - 1)];

    for (Int32 i = 0; i <= steps; ++i)
        callValues[i] = Math.Max(0.0, (futurePrices[i] - optionStrike));

    for (Int32 step = stepsInteger - 1; step >= 0; --step)
    {
        for (Int32 i = 0; i <= step; ++i)
        {
            futurePrices[i] = d * futurePrices[(i + 1)];
            callValues[i] = ((pDown * callValues[i]) + (pUp * callValues[i + 1])) * rInv;
            callValues[i] = Math.Max(callValues[i], (futurePrices[i] - optionStrike)); // fixed: was option_strike, which does not compile here
        }
    }

    return callValues[0];
}
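Hypothetical usage (the parameter values are made up for illustration):

Double price = FuturesOptionPriceCallAmericanBinomial(
    spotPrice: 50.0, optionStrike: 45.0, r: 0.08,
    sigma: 0.2, time: 0.5, steps: 100.0);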
The source data:
static double[] felix = new double[] { 0.003027523, 0.002012256, -0.001369238, -0.001737660, -0.001647287,
0.000275154, 0.002017238, 0.001372621, 0.000274148, -0.000913576, 0.001920263, 0.001186456, -0.000364631,
0.000638337, 0.000182266, -0.001275626, -0.000821093, 0.001186998, -0.000455996, -0.000547445, -0.000182582,
-0.000547845, 0.001279006, 0.000456204, 0.000000000, -0.001550388, 0.001552795, 0.000729594, -0.000455664,
-0.002188184, 0.000639620, 0.000091316, 0.001552228, -0.001002826, 0.000182515, -0.000091241, -0.000821243,
-0.002009132, 0.000000000, 0.000823572, 0.001920088, -0.001368863, 0.000000000, 0.002101800, 0.001094291,
0.001639643, 0.002637323, 0.000000000, -0.000172336, -0.000462665, -0.000136141 };
The variance function:
public static double Variance(double[] x)
{
    if (x.Length == 0)
        return 0;

    double sumX = 0;
    double sumXsquared = 0;
    double varianceX = 0;
    int dataLength = x.Length;

    for (int i = 0; i < dataLength; i++)
    {
        sumX += x[i];
        sumXsquared += x[i] * x[i];
    }

    varianceX = (sumXsquared / dataLength) - ((sumX / dataLength) * (sumX / dataLength));
    return varianceX;
}
Excel and some online calculators say the variance is 1.56562E-06, while my function gives me 1.53492394804015E-06. I am beginning to wonder whether C# has an accuracy problem. Has anyone run into this kind of problem before?
What you are seeing is the difference between sample variance and population variance and nothing to do with floating point precision or the accuracy of C#'s floating point implementation.
You are calculating population variance. Excel and that web site are calculating sample variance.
Var and VarP are distinct calculations, and you do need to be careful about which one you are using (unfortunately, people often refer to them as if they were interchangeable when they are not; the same is true for standard deviation).
Sample variance for your data is 1.56562E-06, population variance is 1.53492394804015E-06.
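For reference, the two definitions, with n data points and mean their average:

population variance: sigma^2 = sum((x_i - mean)^2) / n
sample variance:     s^2     = sum((x_i - mean)^2) / (n - 1)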
From some code posted on codeproject awhile back:
Variance in a sample
public static double Variance(this IEnumerable<double> source)
{
    double avg = source.Average();
    double d = source.Aggregate(0.0, (total, next) => total += Math.Pow(next - avg, 2));
    return d / (source.Count() - 1);
}
Variance in a population
public static double VarianceP(this IEnumerable<double> source)
{
    double avg = source.Average();
    double d = source.Aggregate(0.0, (total, next) => total += Math.Pow(next - avg, 2));
    return d / source.Count();
}
Here's an alternate implementation that is sometimes better-behaved numerically; the sumc term compensates for rounding error in the computed mean:

public static double VariancePCompensated(IReadOnlyList<double> data)
{
    double mean = data.Average();
    double sum2 = 0.0, sumc = 0.0;

    for (int i = 0; i < data.Count; i++)
    {
        double dev = data[i] - mean;
        sum2 += dev * dev; // sum of squared deviations
        sumc += dev;       // would be exactly 0 with infinite precision
    }

    // population variance with the compensation term
    return (sum2 - sumc * sumc / data.Count) / data.Count;
}