Normality test for C#?

We've got a PRNG in our code base which we think has a bug in its method for producing numbers that conform to a given normal distribution.
Is there a C# implementation of a normality test which I can leverage in my unit test suite to assert that this module behaves as desired/expected?
My preference would be something that has a signature like:
bool NormalityTest.IsNormal(IEnumerable<int> samples)

Math.NET has distribution functions and random number sampling. It is probably the most widely used .NET math library, and it is very solid.
http://numerics.mathdotnet.com/
http://mathnetnumerics.codeplex.com/wikipage?title=Probability%20Distributions&referringTitle=Documentation
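For example, drawing samples from a normal distribution looks like this (a minimal sketch; Normal lives in MathNet.Numerics.Distributions):

using MathNet.Numerics.Distributions;

// Standard normal: mean 0, standard deviation 1.
var normal = new Normal(0.0, 1.0);

// Draw a single sample.
double one = normal.Sample();

// Fill an array with draws, e.g. as input for a normality test.
var samples = new double[1000];
for (int i = 0; i < samples.Length; i++)
    samples[i] = normal.Sample();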

You can try this: http://accord-framework.net/docs/html/T_Accord_Statistics_Testing_ShapiroWilkTest.htm
Scroll down to the example:
// Let's say we would like to determine whether a set
// of observations come from a normal distribution:
double[] samples =
{
    0.11, 7.87, 4.61, 10.14, 7.95, 3.14, 0.46, 4.43,
    0.21, 4.75, 0.71, 1.52, 3.24, 0.93, 0.42, 4.97,
    9.53, 4.55, 0.47, 6.66
};

// For this, we can use the Shapiro-Wilk test. It tests the null hypothesis
// that the samples come from a Normal distribution, against the alternative
// hypothesis that they do not. In other words, should this test come out
// significant, our samples do not come from a Normal distribution.

// Create a new Shapiro-Wilk test:
var sw = new ShapiroWilkTest(samples);

double W = sw.Statistic;           // should be 0.90050
double p = sw.PValue;              // should be 0.04209
bool significant = sw.Significant; // should be true

// The test is significant, therefore we should reject the null
// hypothesis that the samples come from a Normal distribution.
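If you want exactly the signature from the question, a thin wrapper over this test could look like the following (a minimal sketch assuming Accord.NET's ShapiroWilkTest as shown above; the NormalityTest class name comes from the question):

using System.Collections.Generic;
using System.Linq;
using Accord.Statistics.Testing;

public static class NormalityTest
{
    // Returns true when the Shapiro-Wilk test does NOT reject
    // normality at its default significance level (0.05).
    public static bool IsNormal(IEnumerable<int> samples)
    {
        double[] data = samples.Select(s => (double)s).ToArray();
        var sw = new ShapiroWilkTest(data);
        return !sw.Significant; // significant => reject normality
    }
}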

I do not know of any C# implementation of a normality test. I had the same issue and built my own routine. This web page helped me a lot: it is written for running the test in Excel, but it gives you all the necessary coefficients (Royston's) and the logic.

Related

How to solve OutOfMemoryException that is thrown using principal component analysis

I'm working on a project in C# that uses Principal Component Analysis to apply feature reduction/dimensionality reduction to a double[,] matrix. The matrix columns are features (words and bigrams) that have been extracted from a set of emails. In the beginning we had around 156 emails, which resulted in approximately 23000 terms, and everything worked as it was supposed to using the following code:
public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center)
{
    // Create a Principal Component Analysis for the given source
    PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(sourceMatrix, method);

    // Compute the Principal Component Analysis
    pca.Compute();

    // Create a projection of the information
    double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

    // Return the PCA components
    return pcaComponents;
}
The components we received were classified later on using Linear Discriminant Analysis' Classify method from the Accord.NET framework. Everything was working as it should.
Now that we have increased the size of our dataset (1519 emails and 68375 terms), we were at first getting some OutOfMemoryExceptions. We were able to solve this by adjusting parts of our code until we reached the part where we calculate the PCA components. Right now this takes about 45 minutes, which is way too long. After checking the Accord.NET website on PCA, we decided to try the last example, which uses a covariance matrix, since it says: "Some users would like to analyze huge amounts of data. In this case, computing the SVD directly on the data could result in memory exceptions or excessive computing times". We therefore changed our code to the following:
public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center)
{
    // Compute the mean vector
    double[] mean = Accord.Statistics.Tools.Mean(sourceMatrix);

    // Compute the covariance matrix
    double[,] covariance = Accord.Statistics.Tools.Covariance(sourceMatrix, mean);

    // Create the analysis using the covariance matrix
    var pca = PrincipalComponentAnalysis.FromCovarianceMatrix(mean, covariance);

    // Compute the Principal Component Analysis
    pca.Compute();

    // Create a projection of the information
    double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

    // Return the PCA components
    return pcaComponents;
}
This, however, raises a System.OutOfMemoryException. Does anyone know how to solve this problem?
I think parallelizing your solver is the best bet.
Perhaps something like FEAST would help.
http://www.ecs.umass.edu/~polizzi/feast/
It provides parallel linear algebra for multicore systems.
The problem is that the code is using multidimensional matrices instead of jagged matrices. The point is that double[,] needs a contiguous block of memory, which may be quite hard to find depending on how much space you need. If you use jagged matrices, memory allocations are spread out and space is easier to find.
You can avoid this issue by upgrading to the latest version of the framework and using the new API for statistical analysis. Instead of passing your source matrix in the constructor and calling .Compute(), simply call .Learn():
public static double[][] GetPCAComponents(double[][] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center)
{
    // Create a Principal Component Analysis for the given source
    PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(method)
    {
        NumberOfOutputs = dimensions // limit the number of dimensions
    };

    // Learn the Principal Component Analysis from the source matrix
    pca.Learn(sourceMatrix);

    // Create a projection of the information
    double[][] pcaComponents = pca.Transform(sourceMatrix);

    // Return the PCA components
    return pcaComponents;
}
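If your data currently lives in a double[,], converting it to the jagged double[][] that the new API expects is a simple loop. A minimal sketch (Accord.NET also ships conversion helpers, but the manual loop below avoids any version-specific API assumptions):

public static double[][] ToJagged(double[,] matrix)
{
    int rows = matrix.GetLength(0);
    int cols = matrix.GetLength(1);

    // Each row gets its own small allocation, so no single
    // contiguous block of memory is required.
    var jagged = new double[rows][];
    for (int i = 0; i < rows; i++)
    {
        jagged[i] = new double[cols];
        for (int j = 0; j < cols; j++)
            jagged[i][j] = matrix[i, j];
    }
    return jagged;
}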

Is there any function in C# that returns the p-value associated with a chi-square goodness-of-fit test?

I'm new here and my English isn't very good, so I'll try to explain as well as possible.
I'm doing a web application in ASP.NET and C# about steganalysis.
I was looking on the internet for a function that calculates the observed significance level, or p-value, in a chi-square test for my algorithm, and I found one in Java. This is the result of my search:
chi[block]= chiSquareTest(expectedValues, pod);
chiSquareTest(double[] expected, long[] observed)
Returns the observed significance level, or p-value,
associated with a Chi-square goodness of fit test comparing
the observed frequency counts to those in the expected array.
My question is: is there an equivalent function in C# that returns the same value?
Thank you in advance,
Ana.
The MathNet.Numerics NuGet package contains the ChiSquared distribution; you can get the cumulative or inverse cumulative distribution.
ChiSquared c = new ChiSquared(degreesOfFreedom);
return c.CumulativeDistribution(testValue); // P(X <= testValue)
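If you only have the distribution, you can compute Pearson's goodness-of-fit statistic yourself and convert it to an upper-tail p-value. A minimal sketch (assuming no distribution parameters were estimated from the data, so the degrees of freedom are categories - 1):

using MathNet.Numerics.Distributions;

static double ChiSquarePValue(double[] expected, long[] observed)
{
    // Pearson's statistic: sum of (O - E)^2 / E over all categories.
    double statistic = 0.0;
    for (int i = 0; i < expected.Length; i++)
    {
        double diff = observed[i] - expected[i];
        statistic += diff * diff / expected[i];
    }

    // Goodness of fit with no fitted parameters: dof = k - 1.
    var chi = new ChiSquared(expected.Length - 1);

    // Upper-tail p-value: probability of a statistic at least this large.
    return 1.0 - chi.CumulativeDistribution(statistic);
}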
I doubt there is an inbuilt function.
You should try looking for a library that contains the function or implement it yourself.
A quick search turned up these:
http://www.alglib.net/specialfunctions/distributions/chisquare.php
and
http://www.codeproject.com/Articles/432194/How-to-Calculate-the-Chi-Squared-P-Value
I know this is an old question but I figured I'd post anyway.
The Accord.NET framework has a library and NuGet package, Accord.Statistics, which has a ChiSquareTest class that can be used in a similar manner to the one mentioned in your question:
ChiSquareTest chiSquareTest = new ChiSquareTest(observedArray, expectedArray, dof);
double pValue = chiSquareTest.PValue;         // the p-value
bool significant = chiSquareTest.Significant; // true if statistically significant
The only thing is that you'll have to calculate the degrees of freedom (dof) yourself; for a goodness-of-fit test over k categories with no parameters estimated from the data, that is k - 1.

Why is Math.Pow(x, 2) not optimized to x * x by either the compiler or the JIT?

I've encountered non-optimal code in several open source projects where programmers did not think about what they were using.
There is up to a 10x performance difference between the two cases, because Math.Pow internally uses the Exp and Ln functions, as explained in this answer.
Plain multiplication is better than Math.Pow in most cases (with small powers), but the best, of course, is the exponentiation-by-squaring algorithm (sketched below).
Thus, I think the compiler or JIT should perform such an optimization for powers and other functions. Why has it still not been introduced? Am I right?
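For reference, exponentiation by squaring for integer exponents can be written like this (a standard textbook sketch, not .NET's implementation):

static double PowBySquaring(double x, int n)
{
    // Negative exponents: invert the base and negate the exponent.
    if (n < 0)
        return 1.0 / PowBySquaring(x, -n);

    double result = 1.0;
    while (n > 0)
    {
        if ((n & 1) == 1) // fold in the current power of x
            result *= x;  // when the corresponding bit is set
        x *= x;           // square on every step
        n >>= 1;
    }
    return result;        // only O(log n) multiplications
}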
Read the answer you've referenced again; it clearly states that the CRT uses a pow() function which Microsoft bought from Intel. The example using Math.Log and Math.Exp is one the writer of the article found in a programming book.
The "problem" with general exponentiation methods is that that they are build to produce the most accurate results for all cases. This often results in sub-optimal performance for certain cases. To increase the preformance of these certain cases, conditional logic must be added which results in performance loss for all cases. Because squaring or cubing a value is that simple to write without the Math.Pow method, there is no need to optimize these cases and taking the extra loss for all other cases.
I would say that would be a bad idea, because the two methods do NOT return the same results every time.
Here is a small test script:
var r = new Random();

// Random is not thread-safe, so the range is checked sequentially.
var any = Enumerable.Range(0, 1000).All(p =>
{
    var d = r.NextDouble();
    var pow = Math.Pow(d, 2.0);
    var sqr = d * d;
    var identical = pow == sqr;
    if (!identical)
        MessageBox.Show(d.ToString());
    return identical;
});
The two implementations have different accuracies. If a calculation is to be reliable, it should be reproducible. If, for example, the square optimization were applied only in the release build, then the debug and release versions would return different results. That can be quite a mess when debugging...

Best practice with Math.Pow

I'm working on an image processing library which extends OpenCV, HALCON, .... The library must work with .NET Framework 3.5, and since my experience with .NET is limited I would like to ask some questions regarding performance.
I have encountered a few specific things which I cannot properly explain to myself and would like to ask you a) why, and b) what the best practice is for dealing with such cases.
My first question is about Math.Pow. I already found some answers here on Stack Overflow which explain (a) quite well, but not (b), what to do about it. My benchmark program looks like this:
Stopwatch watch = new Stopwatch(); // from System.Diagnostics

watch.Start();
for (int i = 0; i < 1000000; i++)
{
    double result = Math.Pow(4, 7); // the function call
}
watch.Stop();
The result was not very nice (~300 ms on my computer; I ran the test 10 times and calculated the average).
My first idea was to check whether this is because it is a static function, so I implemented my own class:
class MyMath
{
    public static double Pow(double x, double y) // using expensive functions to calculate the power
    {
        return Math.Exp(Math.Log(x) * y);
    }

    public static double PowLoop(double x, int y) // using a loop
    {
        double res = x;
        for (int i = 1; i < y; i++)
            res *= x;
        return res;
    }

    public static double Pow7(double x) // using inline multiplications
    {
        return x * x * x * x * x * x * x;
    }
}
The third thing I checked was replacing Math.Pow(4,7) directly with 4*4*4*4*4*4*4.
The results (averaged over 10 test runs):
300 ms Math.Pow(4,7)
356 ms MyMath.Pow(4,7) //gives wrong rounded results
264 ms MyMath.PowLoop(4,7)
92 ms MyMath.Pow7(4)
16 ms 4*4*4*4*4*4*4
So my situation now is basically: don't use Math.Pow. My only problem is... do I really have to implement my own math class now? It seems somehow inefficient to implement a whole class just for the power function. (Btw, PowLoop and Pow7 are even faster in the Release build by ~25%, while Math.Pow is not.)
So my final questions are
a) Am I wrong to avoid Math.Pow entirely (except perhaps for fractional exponents)? (Which makes me somewhat sad.)
b) If you have code to optimize, do you really write all such mathematical operations out directly?
c) Is there perhaps already a faster (open-source^^) library for mathematical operations?
d) The source of my question is basically this: I had assumed that the .NET Framework itself already provides very optimized code / compile results for such basic operations, be it the Math class or array handling, and I was a little surprised how much I gained by writing my own code. Are there other general areas to look out for in C# where I cannot trust the framework directly?
Two things to bear in mind:
You probably don't need to optimise this bit of code. You've just done a million calls to the function in less than a second. Is this really going to cause big problems in your program?
Math.Pow is probably fairly optimal anyway. At a guess, it will be calling a proper numerics library written in a lower level language, which means you shouldn't expect orders of magnitude increases.
Numerical programming is harder than you think. Even the algorithms that you think you know how to calculate, aren't calculated that way. For example, when you calculate the mean, you shouldn't just add up the numbers and divide by how many numbers you have. (Modern numerics libraries use a two pass routine to correct for floating point errors.)
That said, if you decide that you definitely do need to optimise, then consider using integers rather than floating point values, or outsourcing this to another numerics library.
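For illustration, the corrected two-pass mean mentioned above looks like this (a sketch of the standard algorithm, not any particular library's code):

static double TwoPassMean(double[] values)
{
    // Pass 1: the naive mean.
    double sum = 0.0;
    foreach (double v in values)
        sum += v;
    double mean = sum / values.Length;

    // Pass 2: the residuals should sum to zero exactly; whatever
    // remains is accumulated rounding error, which we fold back in.
    double correction = 0.0;
    foreach (double v in values)
        correction += v - mean;

    return mean + correction / values.Length;
}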
Firstly, integer operations are much faster than floating point. If you don't need floating-point values, don't use the floating-point data type. This is generally true for any programming language.
Secondly, as you have stated yourself, Math.Pow can handle real exponents. It uses a much more intricate algorithm than a simple loop, so no wonder it is slower than looping. If you get rid of the loop and just write out n multiplications, you also cut the overhead of setting up the loop, making it faster still. But without a loop, you have to know the value of the exponent beforehand; it can't be supplied at runtime.
I am not really sure why Math.Exp and Math.Log are faster. But if you use Math.Log, you can't compute powers of negative bases.
Basically, ints are faster, and avoiding loops avoids extra overhead. But you trade away some flexibility when you go for those. It is generally a good idea to avoid reals when all you need are integers, but in this case coding up a custom function when one already exists seems a little too much.
The question you have to ask yourself is whether this is worth it. Is Math.Pow actually slowing your program down? In any case, the Math.Pow bundled with your language is often the fastest implementation, or very close to it. If you really wanted to write an alternative implementation that is truly general purpose (i.e. not limited to integers, positive values, etc.), you would probably end up using the same algorithm as the default implementation anyway.
When you are talking about making a million iterations of a line of code then obviously every little detail will make a difference.
Math.Pow() is a function call which will be substantially slower than your manual 4*4...*4 example.
Don't write your own class, as it's doubtful you'll be able to write anything more optimized than the standard Math class.

Taking Logarithms of relatively small numbers in different languages/architectures/operating systems

In Java I run:
System.out.println(Math.log(249.0/251.0));
Output: -0.008000042667076265
In C# I run:
Math.Log(x/y); // where x and y are almost assuredly 249.0 and 251.0 respectively
Output: -0.175281838 (printed out later in the program)
Google claims:
Log(249.0/251.0)
Output: -0.00347437439
And macOS claims about the same thing (the first difference between Google and Snow Leopard is at about 10^-8, which is negligible).
Is there any reason these results should all vary so widely, or am I missing something very obvious? (I did check that Java and C# both use base e.) Even mildly different values of e wouldn't account for such a big difference. Any suggestions?
EDIT:
Verifying on Wolfram Alpha seems to suggest that Java is right (or that Wolfram Alpha uses Java's Math for logarithms...) and that my C# program doesn't have the right input. But I am disinclined to believe this, because taking e^(Google's result) - 249/251 gives me an error of 0.0044, which is pretty big in my opinion, suggesting that there is a different problem at hand...
You're looking at logarithms with different bases:
Java's System.out.println(Math.log(249.0/251.0)) is a natural log (base e).
C#'s Math.Log(x, y) gives the log of x in the base specified by y.
Google's Log(249.0/251.0) gives the log base 10.
Though I don't get the result you do from C# (Math.Log(249.0, 251.0) == 0.998552147171426).
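A quick way to see these bases side by side with the standard .NET Math overloads (the expected outputs in the comments come from the values quoted above):

using System;

class LogBases
{
    static void Main()
    {
        double x = 249.0 / 251.0;

        Console.WriteLine(Math.Log(x));            // natural log (base e): -0.00800004266707626
        Console.WriteLine(Math.Log10(x));          // base 10: about -0.00347437439
        Console.WriteLine(Math.Log(249.0, 251.0)); // log of 249 in base 251: about 0.998552
    }
}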
You have a mistake somewhere in your C# program between where the log is calculated and where it is printed out. Math.Log gives the correct answer:
class P
{
    static void Main()
    {
        System.Console.WriteLine(System.Math.Log(249.0/251.0));
    }
}
prints out -0.00800004266707626
